Inputs for nDCG in sklearn

Introduction

ndcg_score in scikit-learn is used to evaluate ranking quality, not plain classification accuracy. The function answers a specific question: given a set of relevance labels and a model's predicted ranking scores, how close is the model's ordering to the ideal ordering?

Most confusion comes from the inputs. Developers often pass class labels or a single ranked list when ndcg_score actually expects two aligned score matrices: true relevance values and predicted scores.

What ndcg_score Expects

The core call looks like this:

python
from sklearn.metrics import ndcg_score

score = ndcg_score(y_true, y_score)

Both y_true and y_score should be array-like objects of shape (n_samples, n_items). Each row represents one query, user, or ranking task. Each column represents one candidate item for that row.

The meaning of each input is different:

  • y_true contains the ground-truth relevance values.
  • y_score contains the model's predicted ranking scores for the same items.

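For each row, the function ranks items by descending y_score, sums the y_true values in that order with a logarithmic discount for lower positions, divides by the same sum for the ideal ordering, and averages the result across rows.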
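To make the computation concrete, here is a minimal sketch of the same calculation in NumPy. The helper name manual_ndcg is hypothetical, and the sketch assumes no tied scores within a row and at least one positive relevance value per row; it is meant as an illustration, not a copy of scikit-learn's implementation.

```python
import numpy as np
from sklearn.metrics import ndcg_score


def manual_ndcg(y_true, y_score):
    """Hypothetical helper: mean nDCG over rows, assuming no tied scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    n_items = y_true.shape[1]
    # Position discounts 1 / log2(rank + 1) for ranks 1..n_items.
    discounts = 1.0 / np.log2(np.arange(n_items) + 2)
    # Reorder each row's true relevance by descending predicted score.
    predicted_order = np.argsort(-y_score, axis=1)
    ranked_gain = np.take_along_axis(y_true, predicted_order, axis=1)
    # The ideal ordering sorts the true relevance itself, highest first.
    ideal_gain = -np.sort(-y_true, axis=1)
    dcg = ranked_gain @ discounts
    ideal_dcg = ideal_gain @ discounts
    # Assumes ideal_dcg > 0 for every row.
    return float(np.mean(dcg / ideal_dcg))


y_true = np.array([[3, 2, 0, 1], [1, 0, 3, 2]])
y_score = np.array([[0.2, 0.9, 0.5, 0.1], [0.3, 0.8, 0.6, 0.1]])

manual = manual_ndcg(y_true, y_score)
library = ndcg_score(y_true, y_score)
print(manual, library)
```

The two printed values should agree up to floating-point error for inputs that satisfy the stated assumptions.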

A Minimal Example

Suppose each row is a search query and each column is a document candidate:

python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([
    [3, 2, 0, 1],
    [0, 1, 2, 3],
])

y_score = np.array([
    [0.9, 0.8, 0.1, 0.3],
    [0.2, 0.4, 0.7, 0.95],
])

print(ndcg_score(y_true, y_score))

Here, higher numbers in y_true mean more relevant items. Higher numbers in y_score mean the model believes those items should rank earlier.

You do not pass the predicted order directly. You pass scores, and scikit-learn computes the order from those scores.
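If your system only produces a ranked list of items rather than scores, one workaround is to convert positions into descending scores. The sketch below uses a hypothetical helper, order_to_scores, and assumes the ranked list is a full permutation of the column indices, best item first.

```python
import numpy as np
from sklearn.metrics import ndcg_score


def order_to_scores(ranked_items, n_items):
    """Hypothetical helper: turn a best-first list of column indices into scores."""
    scores = np.zeros(n_items)
    # The first-ranked item gets the highest score, the last-ranked the lowest.
    scores[np.asarray(ranked_items)] = np.arange(n_items, 0, -1)
    return scores


y_true = np.array([[3, 2, 0, 1]])
ranked_items = [0, 1, 3, 2]  # column indices, best first (assumed full permutation)

y_score = order_to_scores(ranked_items, n_items=4).reshape(1, -1)
order_score = ndcg_score(y_true, y_score)
print(y_score, order_score)
```

Any strictly decreasing assignment would work equally well here, since only the order induced by the scores is used.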

Binary And Graded Relevance

y_true does not need to be binary. In fact, nDCG is especially useful when relevance is graded. For example, values such as 0, 1, 2, and 3 let you distinguish irrelevant, somewhat relevant, and highly relevant items.

Binary relevance also works:

python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[1, 0, 1, 0, 1]])
y_score = np.array([[0.7, 0.2, 0.9, 0.1, 0.4]])

print(ndcg_score(y_true, y_score, k=3))

Using k=3 means only the top three predicted positions contribute to the final score. That is useful when your application only shows the first few results.
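One way to see the effect of k is to compare two score vectors that agree on the top three items but disagree further down. The data below is made up for illustration; the expectation is that the k=3 scores match while the untruncated scores do not.

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[3, 2, 0, 1, 2]])

# Same top three items; only the last two positions are swapped.
scores_a = np.array([[0.9, 0.8, 0.7, 0.2, 0.1]])
scores_b = np.array([[0.9, 0.8, 0.7, 0.1, 0.2]])

top3_a = ndcg_score(y_true, scores_a, k=3)
top3_b = ndcg_score(y_true, scores_b, k=3)
full_a = ndcg_score(y_true, scores_a)
full_b = ndcg_score(y_true, scores_b)

print("k=3: ", top3_a, top3_b)
print("full:", full_a, full_b)
```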

Shape Rules That Matter

A frequent source of errors is passing one-dimensional arrays. Even for one query, use shape (1, n_items) rather than (n_items,). In practice, that means wrapping the row in an extra list or using reshape(1, -1).

For one ranking task:

python
import numpy as np
from sklearn.metrics import ndcg_score

relevance = np.array([3, 1, 0, 2]).reshape(1, -1)
predicted = np.array([0.8, 0.2, 0.1, 0.7]).reshape(1, -1)

print(ndcg_score(relevance, predicted))

Both arrays must have the same shape, because each predicted score must correspond to the item at the same row and column position in y_true.
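A small guard can catch shape problems before the metric is computed. The wrapper below, checked_ndcg, is a hypothetical convenience and not part of scikit-learn; it promotes one-dimensional input to a single row and raises early if the two shapes differ.

```python
import numpy as np
from sklearn.metrics import ndcg_score


def checked_ndcg(y_true, y_score, k=None):
    """Hypothetical wrapper: promote 1-D input to one row and verify shapes match."""
    y_true = np.atleast_2d(np.asarray(y_true))
    y_score = np.atleast_2d(np.asarray(y_score))
    if y_true.shape != y_score.shape:
        raise ValueError(
            f"shape mismatch: y_true {y_true.shape} vs y_score {y_score.shape}"
        )
    return ndcg_score(y_true, y_score, k=k)


# One query passed as flat lists; the wrapper reshapes both to (1, 4).
single_query = checked_ndcg([3, 1, 0, 2], [0.8, 0.2, 0.1, 0.7])
print(single_query)

# A missing score column is reported before the metric is computed.
try:
    checked_ndcg([3, 1, 0, 2], [0.8, 0.2, 0.1])
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print(err)
```

Whether to reshape silently or reject one-dimensional input outright is a design choice; rejecting is stricter and may be preferable in a shared evaluation pipeline.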

Interpreting The Inputs Correctly

Think of y_true as item utility and y_score as ranking confidence. Only the relative order of the values within each y_score row affects the result, not their absolute magnitude. A score vector of 0.9, 0.8, 0.1 produces the same ranking as 90, 80, 10.

By contrast, the values in y_true do matter because they define how much reward the metric assigns when relevant items appear near the top.
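The asymmetry between the two inputs can be checked directly. In this sketch, with made-up data, rescaling y_score leaves the result unchanged, while changing the magnitude of one relevance value in y_true changes it even though the predicted ranking is identical.

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[1, 0, 2]])
y_score = np.array([[0.9, 0.8, 0.1]])

base = ndcg_score(y_true, y_score)

# Same ordering on a different scale: the metric should not change.
rescaled = ndcg_score(y_true, np.array([[90, 80, 10]]))

# Same predicted ranking, but the badly placed item is now worth far more.
heavier = ndcg_score(np.array([[1, 0, 10]]), y_score)

print(base, rescaled, heavier)
```

Here the most relevant item is ranked last, so raising its relevance from 2 to 10 makes the same mistake more costly.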

Common Pitfalls

The most common mistake is passing predicted class labels instead of predicted ranking scores. nDCG is about ranking order, so hard labels usually throw away the information the metric needs.
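The following sketch illustrates the loss of information. It thresholds a made-up score vector at 0.5 (an arbitrary cutoff chosen for this example) and compares the metric on the raw scores with the metric on the resulting hard labels. The labels contain ties, which scikit-learn handles by averaging over the tied positions under its default ignore_ties=False setting.

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[3, 2, 0, 1]])
y_score = np.array([[0.9, 0.6, 0.4, 0.55]])

# Hard labels from an arbitrary 0.5 threshold: three items become tied at 1.
hard_labels = (y_score >= 0.5).astype(int)

from_scores = ndcg_score(y_true, y_score)
from_labels = ndcg_score(y_true, hard_labels)

print(hard_labels)
print(from_scores, from_labels)
```

The raw scores order the items perfectly, but the hard labels can no longer distinguish the three items above the threshold, so the metric drops.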

Another pitfall is using mismatched shapes, especially one-dimensional arrays. If you are evaluating one query at a time, reshape both arrays to two dimensions.

A third issue is mixing up candidate order between y_true and y_score. If column two refers to different items in the two arrays, the metric output is meaningless even though the code runs.
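A quick experiment shows how silent this failure is. The sketch below, using made-up data, reorders the candidate columns with the same permutation for both arrays (safe) and then for y_score only (misaligned). Neither call raises an error.

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[3, 2, 0, 1]])
y_score = np.array([[0.9, 0.8, 0.1, 0.3]])
perm = np.array([2, 0, 3, 1])  # an arbitrary reordering of the candidates

original = ndcg_score(y_true, y_score)

# Reordering both arrays with the same permutation keeps items aligned.
aligned = ndcg_score(y_true[:, perm], y_score[:, perm])

# Reordering only the scores pairs each score with the wrong item.
misaligned = ndcg_score(y_true, y_score[:, perm])

print(original, aligned, misaligned)
```

If scores and labels come from different sources, joining them on an explicit item identifier before building the arrays is one way to avoid this.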

Summary

  • ndcg_score expects y_true and y_score with shape (n_samples, n_items).
  • y_true contains relevance labels, while y_score contains predicted ranking scores.
  • Use graded relevance when item quality has more than two levels.
  • Use k when only the top part of the ranking matters.
  • Keep item order aligned across both inputs or the result will be invalid.
