Getting error while plotting the dendrogram for the spearmanr correlation
Introduction
When a dendrogram fails after a Spearman correlation step, the problem is usually not the ranking calculation itself. The usual issue is that hierarchical clustering expects distances, while the code is handing it raw correlations or a malformed square matrix. Once the data is converted into a clean condensed distance vector, SciPy's clustering functions behave much more predictably.
Correlation Output Is Not Linkage Input
scipy.stats.spearmanr returns correlation coefficients. scipy.cluster.hierarchy.linkage expects either:
- raw observations, or
- a condensed distance vector
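Passing a square correlation matrix directly into linkage is a common mistake. linkage treats any two-dimensional input as raw observations, so it would compute Euclidean distances between the rows of the correlation matrix instead of using the correlations themselves. Even a symmetric matrix is not the condensed distance vector that linkage expects when you have already computed pairwise relationships.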
For clustering variables by similarity, the usual flow is:
- compute Spearman correlation
- convert correlation into distance
- convert the square distance matrix into condensed form
- call linkage
- call dendrogram
A Working Example
The example below clusters dataframe columns using Spearman rank correlation and average linkage.
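The original listing is not available here, so the following is a minimal sketch of that flow rather than the author's exact code. The synthetic dataframe, its column names (a, b, c, d), and the random seed are assumptions made for illustration. It passes no_plot=True so the layout can be inspected without a Matplotlib figure.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Synthetic data (assumed for illustration): a/b and c/d are related pairs.
rng = np.random.default_rng(0)
base1 = rng.normal(size=200)
base2 = rng.normal(size=200)
df = pd.DataFrame({
    "a": base1,
    "b": base1 + 0.1 * rng.normal(size=200),
    "c": base2,
    "d": base2 + 0.1 * rng.normal(size=200),
})

# With a 2-D input of more than two columns, spearmanr returns a square matrix.
corr, _ = spearmanr(df.to_numpy())

# Convert similarity to distance, then force exact symmetry and a zero diagonal.
distance = 1.0 - corr
distance = (distance + distance.T) / 2.0
np.fill_diagonal(distance, 0.0)

# Condense the square matrix before handing it to linkage.
condensed = squareform(distance, checks=True)
Z = linkage(condensed, method="average")

# no_plot=True returns the dendrogram layout as a dict without drawing it.
tree = dendrogram(Z, labels=df.columns.tolist(), no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order
```

To draw the figure, drop no_plot=True and call matplotlib.pyplot.show() afterwards. Note that spearmanr returns a single coefficient rather than a matrix when the input has exactly two columns, so this sketch assumes three or more.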
That is the pattern to copy. If you skip squareform, or if the matrix contains invalid values, the dendrogram step is where the failure usually shows up.
Choosing the Distance Formula
For correlations, a common first choice is distance = 1 - corr.
This works when you want perfectly correlated variables close together and negatively correlated variables farther apart. If you want positive and negative correlations to count as similar in magnitude, use distance = 1 - abs(corr) instead.
That choice is domain-specific. The key is to define whether strong negative correlation should mean "far apart" or "highly related in opposite directions."
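To make the difference concrete, here is a small sketch comparing the two definitions on a hand-written correlation matrix. The matrix values are invented for illustration and do not come from real data.

```python
import numpy as np

# Hypothetical 3x3 correlation matrix; variable 2 is negatively correlated with 0 and 1.
corr = np.array([
    [1.0, 0.9, -0.8],
    [0.9, 1.0, -0.7],
    [-0.8, -0.7, 1.0],
])

# Signed distance: values fall in [0, 2]; negative correlation means far apart.
signed = 1.0 - corr

# Unsigned distance: values fall in [0, 1]; strong correlation of either sign means close.
unsigned = 1.0 - np.abs(corr)

print(signed[0, 2], unsigned[0, 2])
```

For the pair with correlation -0.8, the signed distance is about 1.8 and the unsigned distance is about 0.2, so the two definitions can put the same pair at opposite ends of the tree.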
Handle Constant Columns and Missing Values Early
Spearman correlation can produce nan values when a column is constant or when missing values are not handled. Those nan values later cause squareform, linkage, or dendrogram to fail.
A defensive preprocessing pass helps a lot:
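One possible version of such a pass is sketched below. The helper name clean_for_spearman and the sample data are hypothetical, and dropping incomplete rows is only one of several reasonable policies. The correlation is computed with pandas DataFrame.corr(method="spearman") for brevity.

```python
import numpy as np
import pandas as pd

def clean_for_spearman(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep numeric columns, drop incomplete rows, drop constant columns."""
    numeric = df.select_dtypes(include="number")
    complete = numeric.dropna(axis=0, how="any")
    # A column with a single distinct value has no rank variation, so its correlation is nan.
    return complete.loc[:, complete.nunique() > 1]

# Invented sample: one missing value in x and one constant column.
raw = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
    "flat": [7.0, 7.0, 7.0, 7.0, 7.0],
})

clean = clean_for_spearman(raw)
corr = clean.corr(method="spearman")

# Fail early if anything non-finite survived the cleaning step.
assert np.isfinite(corr.to_numpy()).all()
```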
If missing values are expected, decide whether to drop rows, impute values, or use a different similarity strategy before clustering. Do not leave that decision implicit.
Why squareform Matters
linkage does not want a full square distance matrix unless you are deliberately passing observations instead. When you already have pairwise distances, use squareform to compress the upper triangle into the condensed vector format SciPy expects.
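That means passing the square matrix directly, as in linkage(distance, method="average"), is wrong, and condensing it first, as in linkage(squareform(distance), method="average"), is right.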
If the matrix is not symmetric or the diagonal is not zero, squareform will usually tell you before linkage fails in a more confusing way.
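The sketch below illustrates both points on a small hand-written distance matrix (the values are invented): squareform condenses a valid matrix into the vector linkage expects, and its default validity checks raise ValueError when the diagonal is not zero.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix with a zero diagonal.
distance = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
])

# Intended use: condense first, then cluster.
condensed = squareform(distance)  # upper triangle as a vector: d01, d02, d12
Z = linkage(condensed, method="average")

# A nonzero diagonal is rejected by squareform's default checks.
bad = distance.copy()
bad[0, 0] = 0.5
try:
    squareform(bad)
    rejected = False
except ValueError as exc:
    rejected = True
    print("squareform rejected the matrix:", exc)
```

Calling linkage(distance, method="average") on the square matrix would not raise an error. It would treat each row as an observation and cluster on distances between rows, which is why the mistake is easy to miss.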
Common Pitfalls
- Passing the raw correlation matrix into linkage. Fix: convert correlation into distance, then use squareform.
- Forgetting to zero the diagonal. Fix: normalize with np.fill_diagonal(distance, 0).
- Leaving nan values in the matrix. Fix: remove constant columns and handle missing data before correlation.
- Using labels that do not match the clustered dimension. Fix: pass column labels when clustering columns and row labels when clustering rows.
- Choosing a distance definition without thinking about negative correlation. Fix: decide explicitly between 1 - corr and 1 - abs(corr) based on the domain.
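The label pitfall can be guarded against with an explicit length check before plotting. This is a sketch under assumed inputs: the distance values and the label names are invented. It relies on the fact that a linkage matrix with n - 1 rows describes n leaves.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Hypothetical distances between four variables.
distance = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])
labels = ["a", "b", "c", "d"]  # hypothetical names, one per clustered item

Z = linkage(squareform(distance), method="average")

# A linkage matrix with n - 1 rows describes n leaves.
n_leaves = Z.shape[0] + 1
if len(labels) != n_leaves:
    raise ValueError(f"expected {n_leaves} labels, got {len(labels)}")

# no_plot=True returns the layout without drawing a figure.
tree = dendrogram(Z, labels=labels, no_plot=True)
```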
Summary
- Spearman correlation output must be converted into a distance matrix before hierarchical clustering.
- Use squareform to turn the square distance matrix into the condensed format expected by linkage.
- Clean constant columns and missing values before computing correlations.
- Set the diagonal to zero and keep the matrix symmetric.
- If the dendrogram fails, inspect data shape, nan values, and label length before debugging Matplotlib.

