Getting error while plotting the dendrogram for the spearmanr correlation

Introduction

When a dendrogram fails after a Spearman correlation step, the problem is usually not the ranking calculation itself. The usual issue is that hierarchical clustering expects distances, while the code is handing it raw correlations or a malformed square matrix. Once the data is converted into a clean condensed distance vector, SciPy's clustering functions behave much more predictably.

Correlation Output Is Not Linkage Input

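scipy.stats.spearmanr returns correlation coefficients: a square matrix when the input has more than two columns, and a single number when it has exactly two. scipy.cluster.hierarchy.linkage expects either: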

  • raw observations, or
  • a condensed distance vector

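Passing a square correlation matrix directly into linkage is a common mistake. linkage treats any 2-D array as raw observations, one per row, and by default computes Euclidean distances between those rows, so the call often runs without an error but clusters something other than what you intended. Even a symmetric matrix is not the condensed distance vector that linkage expects when the pairwise relationships are already computed.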
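
To make the difference concrete, here is a minimal sketch using a made-up 3x3 correlation matrix (the values are illustrative, not taken from real data). It compares what linkage does with a 2-D array against what it does with a condensed distance vector.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical Spearman correlation matrix for three variables.
corr = np.array([
    [1.0, 0.9, -0.8],
    [0.9, 1.0, -0.7],
    [-0.8, -0.7, 1.0],
])

# 2-D input: each row is treated as an observation, and linkage computes
# Euclidean distances between the rows itself.
Z_rows = linkage(corr, method="average")
Z_rows_explicit = linkage(pdist(corr), method="average")

# 1-D input: the values are used as precomputed pairwise distances.
distance = 1 - corr
np.fill_diagonal(distance, 0)
Z_dist = linkage(squareform(distance), method="average")

same_as_rows = np.allclose(Z_rows, Z_rows_explicit)  # identical trees
same_as_dist = np.allclose(Z_rows, Z_dist)           # different merge heights
```

Both calls succeed, which is why this mistake is easy to miss: only the merge heights reveal that the first tree was built from row-to-row Euclidean distances rather than from 1 - corr.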

For clustering variables by similarity, the usual flow is:

  1. compute Spearman correlation
  2. convert correlation into distance
  3. convert the square distance matrix into condensed form
  4. call linkage
  5. call dendrogram

A Working Example

The example below clusters dataframe columns using Spearman rank correlation and average linkage.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import spearmanr
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],
        "c": [5, 4, 3, 2, 1],
        "d": [3, 3, 4, 4, 5],
    }
)

# Spearman correlation between columns
corr, _ = spearmanr(df, axis=0)

corr = np.asarray(corr)

# Convert similarity into distance
distance = 1 - corr

# Numerical cleanup for symmetry and diagonal
distance = (distance + distance.T) / 2
np.fill_diagonal(distance, 0)

# Convert square matrix to condensed distance vector
condensed = squareform(distance, checks=True)

Z = linkage(condensed, method="average")

plt.figure(figsize=(8, 4))
dendrogram(Z, labels=df.columns.tolist())
plt.tight_layout()
plt.show()

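That is the pattern to copy. If you skip squareform, linkage treats the rows of the matrix as observations and the resulting tree can be silently wrong. If the matrix contains nan values, squareform or linkage typically raises an error before dendrogram draws anything.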

Choosing the Distance Formula

For correlations, a common first choice is:

python
distance = 1 - corr

This works when you want perfectly correlated variables close together and negatively correlated variables farther apart. If you want positive and negative correlations to count as similar in magnitude, use:

python
distance = 1 - np.abs(corr)

That choice is domain-specific. The key is to define whether strong negative correlation should mean "far apart" or "highly related in opposite directions."
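
As a small illustration of how much this choice matters, the sketch below uses a made-up correlation matrix and a hypothetical helper, first_merge, to show which pair of variables average linkage joins first under each formula.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical correlations: "a" and "b" move together, "c" mirrors them.
labels = ["a", "b", "c"]
corr = np.array([
    [1.0, 0.9, -0.95],
    [0.9, 1.0, -0.85],
    [-0.95, -0.85, 1.0],
])


def first_merge(distance):
    """Return the labels of the first pair merged by average linkage."""
    distance = distance.copy()
    np.fill_diagonal(distance, 0)
    Z = linkage(squareform(distance), method="average")
    i, j = int(Z[0, 0]), int(Z[0, 1])
    return {labels[i], labels[j]}


signed_pair = first_merge(1 - corr)            # sign matters: joins "a" and "b"
unsigned_pair = first_merge(1 - np.abs(corr))  # magnitude only: joins "a" and "c"
```

With the signed formula, "c" ends up farthest from everything. With the absolute-value formula, "c" becomes the closest neighbour of "a" because their correlation is the strongest in magnitude.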

Handle Constant Columns and Missing Values Early

Spearman correlation can produce nan values when a column is constant or when missing values are not handled. Those nan values later cause squareform, linkage, or dendrogram to fail.

A defensive preprocessing pass helps a lot:

python
clean_df = df.dropna(axis=0)
clean_df = clean_df.loc[:, clean_df.nunique() > 1]

corr, _ = spearmanr(clean_df, axis=0)
corr = np.asarray(corr)

if np.isnan(corr).any():
    raise ValueError("Correlation matrix contains NaN values")

If missing values are expected, decide whether to drop rows, impute values, or use a different similarity strategy before clustering. Do not leave that decision implicit.
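
One possible alternative to dropping whole rows, sketched here on made-up data, is pandas' DataFrame.corr with method="spearman", which skips missing values pair by pair. Whether pairwise deletion is appropriate depends on the data, so treat this as an option to evaluate, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one constant column.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "b": [2.0, 1.0, 4.0, np.nan, 6.0, 5.0],
        "c": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
        "k": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],  # constant: carries no rank information
    }
)

# Drop constant columns first; their correlation is undefined.
varying = df.loc[:, df.nunique() > 1]

# Each pair of columns is correlated over the rows where both values are present.
corr = varying.corr(method="spearman")

has_nan = bool(corr.isna().to_numpy().any())
```

Note that with pairwise deletion each coefficient may be based on a different subset of rows, which is worth keeping in mind when comparing distances across pairs.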

Why squareform Matters

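linkage interprets any 2-D array as observations, one per row, so a full square distance matrix is not read as distances. When you already have pairwise distances, use squareform to compress the upper triangle into the condensed vector format SciPy expects.

That means this is wrong, even though it may run without an error (SciPy may only emit a ClusterWarning), because linkage clusters the rows of distance as if they were data points: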

python
Z = linkage(distance, method="average")

and this is right:

python
condensed = squareform(distance, checks=True)
Z = linkage(condensed, method="average")

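If the matrix is not symmetric or the diagonal is not zero, squareform with checks=True raises a ValueError, which is much easier to diagnose than a dendrogram that is silently wrong. The check is strict, so small floating-point differences can trigger it; that is why the example symmetrizes the matrix and zeroes the diagonal first.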
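
The sketch below, using a made-up distance matrix with one slightly mismatched entry, shows the check rejecting the input and the same matrix passing after cleanup.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical distance matrix with a slightly asymmetric entry.
distance = np.array([
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.5],
    [0.7, 0.5000001, 0.0],
])

try:
    squareform(distance, checks=True)
    rejected = False
except ValueError:
    rejected = True

# Symmetrize and zero the diagonal; the conversion then succeeds.
cleaned = (distance + distance.T) / 2
np.fill_diagonal(cleaned, 0)
condensed = squareform(cleaned, checks=True)

# An n x n matrix holds n * (n - 1) / 2 pairwise distances.
n = cleaned.shape[0]
expected_length = n * (n - 1) // 2
```

Calling squareform on the condensed vector converts it back to the square form, which is a convenient way to confirm nothing was lost.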

Common Pitfalls

  • Passing the raw correlation matrix into linkage. Fix: convert correlation into distance, then use squareform.
  • Forgetting to zero the diagonal. Fix: reset it with np.fill_diagonal(distance, 0) after computing the distance.
  • Leaving nan values in the matrix. Fix: remove constant columns and handle missing data before correlation.
  • Using labels that do not match the clustered dimension. Fix: pass column labels when clustering columns and row labels when clustering rows.
  • Choosing a distance definition without thinking about negative correlation. Fix: decide explicitly between 1 - corr and 1 - abs(corr) based on the domain.

Summary

  • Spearman correlation output must be converted into a distance matrix before hierarchical clustering.
  • Use squareform to turn the square distance matrix into the condensed format expected by linkage.
  • Clean constant columns and missing values before computing correlations.
  • Set the diagonal to zero and keep the matrix symmetric.
  • If the dendrogram fails, inspect data shape, nan values, and label length before debugging Matplotlib.
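
The checks in the last point can be bundled into a small pre-flight function. The helper below, check_dendrogram_inputs, is hypothetical (it is not part of SciPy) and is only a sketch of one way to do it; the matrices are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform


def check_dendrogram_inputs(distance, labels):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    distance = np.asarray(distance, dtype=float)
    if distance.ndim != 2 or distance.shape[0] != distance.shape[1]:
        return [f"distance must be square, got shape {distance.shape}"]
    if not np.isfinite(distance).all():
        problems.append("distance contains NaN or infinite values")
    elif not np.allclose(distance, distance.T):
        problems.append("distance is not symmetric")
    if not np.allclose(np.diag(distance), 0):
        problems.append("diagonal is not zero")
    if len(labels) != distance.shape[0]:
        problems.append(
            f"{len(labels)} labels for {distance.shape[0]} clustered items"
        )
    return problems


good = np.array([[0.0, 0.1, 1.8], [0.1, 0.0, 1.7], [1.8, 1.7, 0.0]])
bad = good.copy()
bad[0, 1] = bad[1, 0] = np.nan

ok_report = check_dendrogram_inputs(good, ["a", "b", "c"])
bad_report = check_dendrogram_inputs(bad, ["a", "b"])  # NaN and too few labels

if not ok_report:
    Z = linkage(squareform(good, checks=True), method="average")
    # no_plot=True computes the leaf layout without drawing anything,
    # which separates clustering problems from Matplotlib problems.
    layout = dendrogram(Z, labels=["a", "b", "c"], no_plot=True)
```

If the layout computes cleanly with no_plot=True but the figure still fails, the remaining problem is more likely on the plotting side.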