Inputs for nDCG in sklearn
Introduction
ndcg_score in scikit-learn is used to evaluate ranking quality, not plain classification accuracy. The function answers a specific question: given a set of relevance labels and a model's predicted ranking scores, how close is the model's ordering to the ideal ordering?
Most confusion comes from the inputs. Developers often pass class labels or a single ranked list when ndcg_score actually expects two aligned score matrices: true relevance values and predicted scores.
What ndcg_score Expects
The core call looks like this:
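A minimal sketch of that call, assuming NumPy arrays and illustrative values for a single query with four candidate items. The keyword arguments named in the comment are the function's optional parameters, left at their defaults here:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One row per query, one column per candidate item (values are illustrative).
y_true = np.array([[3, 2, 0, 1]])            # ground-truth relevance
y_score = np.array([[0.9, 0.7, 0.2, 0.4]])   # model's predicted ranking scores

# Optional keyword arguments: k=None, sample_weight=None, ignore_ties=False
score = ndcg_score(y_true, y_score)
print(score)
```

In this sketch the predicted scores order the items exactly as their relevance does, so the metric reaches its maximum.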
Both y_true and y_score should be array-like objects of shape (n_samples, n_items). Each row represents one query, user, or ranking task. Each column represents one candidate item for that row.
The meaning of each input is different:
- y_true contains the ground-truth relevance values.
- y_score contains the model's predicted ranking scores for the same items.
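The function ranks the items in each row by descending y_score, computes the discounted cumulative gain of the y_true values in that order, and divides it by the gain of the ideal ordering. With non-negative relevance values, each row's score lies between 0 and 1, and the returned value is the average over rows.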
A Minimal Example
Suppose each row is a search query and each column is a document candidate:
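As a small sketch of that setup, the arrays below use made-up relevance labels and scores for two queries with four documents each:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Rows are search queries, columns are candidate documents (illustrative data).
y_true = np.array([
    [3, 2, 1, 0],   # query 1: graded relevance of documents 0..3
    [0, 2, 1, 3],   # query 2
])
y_score = np.array([
    [0.9, 0.8, 0.3, 0.1],   # query 1: scores follow relevance exactly
    [0.8, 0.6, 0.3, 0.1],   # query 2: the most relevant document is scored lowest
])

# ndcg_score averages the per-query nDCG values.
mean_ndcg = ndcg_score(y_true, y_score)
print(mean_ndcg)
```

The first query is ranked perfectly and the second is not, so the averaged result falls below 1.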
Here, higher numbers in y_true mean more relevant items. Higher numbers in y_score mean the model believes those items should rank earlier.
You do not pass the predicted order directly. You pass scores, and scikit-learn computes the order from those scores.
Binary And Graded Relevance
y_true does not need to be binary. In fact, nDCG is especially useful when relevance is graded. For example, values such as 0, 1, 2, and 3 let you distinguish irrelevant, somewhat relevant, and highly relevant items.
Binary relevance also works:
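A sketch with illustrative 0/1 labels for one query and a cutoff of k=3:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# 1 = relevant, 0 = not relevant (illustrative data for a single query).
y_true = np.array([[1, 0, 0, 1, 0]])
y_score = np.array([[0.9, 0.8, 0.1, 0.7, 0.2]])

# Restrict the evaluation to the top three predicted positions.
ndcg_at_3 = ndcg_score(y_true, y_score, k=3)
print(ndcg_at_3)
```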
Using k=3 means only the top three predicted positions contribute to the final score. That is useful when your application only shows the first few results.
Shape Rules That Matter
A frequent source of errors is passing one-dimensional arrays. Even for one query, use shape (1, n_items) rather than (n_items,). In practice, that means wrapping the row in an extra list or using reshape(1, -1).
For one ranking task:
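A sketch of the reshape step, assuming the relevance labels and scores start out as one-dimensional NumPy arrays (the variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query, three candidate items, stored as flat 1-D arrays.
relevance = np.array([2, 0, 1])
scores = np.array([0.3, 0.9, 0.5])

# Add the leading "query" axis so both inputs have shape (1, n_items).
y_true = relevance.reshape(1, -1)
y_score = scores.reshape(1, -1)

single_query_ndcg = ndcg_score(y_true, y_score)
print(y_true.shape, single_query_ndcg)
```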
The number of columns must match exactly because each predicted score must correspond to the same item position in y_true.
Interpreting The Inputs Correctly
Think of y_true as item utility and y_score as ranking confidence. The absolute values in y_score do not matter, only their relative order (including any ties). A score vector of 0.9, 0.8, 0.1 produces the same ranking as 90, 80, 10.
By contrast, the values in y_true do matter because they define how much reward the metric assigns when relevant items appear near the top.
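The difference between the two inputs can be checked with a short sketch. The data is illustrative: the scores are rescaled without changing their order, and the relevance grades are then changed while the scores stay fixed:

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[3, 0, 1]])
y_score = np.array([[0.8, 0.9, 0.1]])

# Rescaling y_score keeps the item order, so the metric does not change.
original = ndcg_score(y_true, y_score)
rescaled = ndcg_score(y_true, y_score * 100)

# Changing the grades in y_true changes the reward for the same ranking.
y_true_binary = np.array([[1, 0, 1]])
regraded = ndcg_score(y_true_binary, y_score)

print(original, rescaled, regraded)
```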
Common Pitfalls
The most common mistake is passing predicted class labels instead of predicted ranking scores. nDCG is about ranking order, so hard labels usually throw away the information the metric needs.
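To illustrate, the sketch below compares continuous scores with hard labels derived from them. The 0.5 threshold and the data are assumptions made for the example:

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[3, 1, 0, 2]])
y_score = np.array([[0.9, 0.6, 0.2, 0.8]])

# Hard labels from an assumed 0.5 threshold: three items become tied at 1.
y_label = (y_score >= 0.5).astype(int)

from_scores = ndcg_score(y_true, y_score)
from_labels = ndcg_score(y_true, y_label)

print(from_scores, from_labels)
```

The scores rank the items in the ideal order, while the labels can no longer separate the three items above the threshold, so the second value is lower.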
Another pitfall is using mismatched shapes, especially one-dimensional arrays. If you are evaluating one query at a time, reshape both arrays to two dimensions.
A third issue is mixing up candidate order between y_true and y_score. If column two refers to different items in the two arrays, the metric output is meaningless even though the code runs.
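One way to guard against misalignment is to build both rows from a single, fixed item order. The sketch below assumes relevance labels and scores are stored in dictionaries keyed by a hypothetical document ID:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical per-item data for one query, keyed by document ID.
relevance_by_doc = {"doc_a": 3, "doc_b": 0, "doc_c": 1}
score_by_doc = {"doc_c": 0.2, "doc_a": 0.9, "doc_b": 0.4}  # different key order

# Fix one column order and build both rows from it.
doc_ids = sorted(relevance_by_doc)
y_true = np.array([[relevance_by_doc[d] for d in doc_ids]])
y_score = np.array([[score_by_doc[d] for d in doc_ids]])

aligned_ndcg = ndcg_score(y_true, y_score)
print(doc_ids, aligned_ndcg)
```

Because both arrays are indexed through the same doc_ids list, each column refers to the same document in y_true and y_score.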
Summary
- ndcg_score expects y_true and y_score with shape (n_samples, n_items).
- y_true contains relevance labels, while y_score contains predicted ranking scores.
- Use graded relevance when item quality has more than two levels.
- Use k when only the top part of the ranking matters.
- Keep item order aligned across both inputs or the result will be invalid.

