Accuracy, Precision, and Recall for Multi-Class Models
Introduction
Accuracy, precision, and recall still apply in multi-class classification, but you have to be explicit about how they are aggregated. A single multi-class model predicts one label from several classes, so precision and recall are usually computed one class at a time using a one-vs-rest view, then combined with macro, micro, or weighted averaging.
Start with the Confusion Matrix
A confusion matrix is the foundation for all of these metrics. For a three-class problem with labels A, B, and C, the matrix counts how often each true label was predicted as each class.
For one specific class, say A, define:
- true positives: actual A, predicted A
- false positives: predicted A, but actual class was something else
- false negatives: actual A, but predicted something else
That one-vs-rest framing lets you reuse the standard precision and recall formulas.
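As a minimal sketch of the one-vs-rest framing above, the following pure-Python snippet builds a confusion matrix and derives the three counts for each class. The `y_true` and `y_pred` lists are made-up illustration data, and `one_vs_rest_counts` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

# Made-up illustration data: true and predicted labels for 10 examples.
labels = ["A", "B", "C"]
y_true = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "C", "B", "B", "A", "C", "C", "B"]

# confusion[(t, p)] counts how often true label t was predicted as p.
confusion = Counter(zip(y_true, y_pred))


def one_vs_rest_counts(confusion, labels, k):
    """Return (TP, FP, FN) for class k, treating all other classes as 'rest'."""
    tp = confusion[(k, k)]
    # Predicted k, but the true label was another class.
    fp = sum(confusion[(t, k)] for t in labels if t != k)
    # True label was k, but another class was predicted.
    fn = sum(confusion[(k, p)] for p in labels if p != k)
    return tp, fp, fn


# Rows are true labels, columns are predicted labels.
for t in labels:
    print(t, [confusion[(t, p)] for p in labels])

for k in labels:
    print(k, one_vs_rest_counts(confusion, labels, k))
```

Note that true negatives are not needed here: precision and recall depend only on true positives, false positives, and false negatives.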
Accuracy in Multi-Class Problems
Accuracy is the simplest metric:
- correct predictions / total predictions
If the confusion matrix has large values on the diagonal, accuracy is high.
Accuracy is useful when:
- classes are reasonably balanced
- all mistakes cost about the same
It becomes less informative when some classes are rare. A model can achieve high accuracy by doing well on common classes while performing badly on important minority classes.
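To make the imbalance problem concrete, here is a small sketch with made-up data: a three-class set where class C is rare and a hypothetical model never predicts it. Accuracy still looks strong even though every C example is missed.

```python
# Made-up imbalanced data: 90 examples of A, 5 of B, 5 of C.
y_true = ["A"] * 90 + ["B"] * 5 + ["C"] * 5

# Hypothetical model: gets A and B right, but labels every C example as A.
y_pred = ["A"] * 90 + ["B"] * 5 + ["A"] * 5

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for C: of the examples that were really C, how many were found.
c_found = sum(t == "C" and p == "C" for t, p in zip(y_true, y_pred))
c_recall = c_found / y_true.count("C")

print(f"accuracy = {accuracy:.2f}")  # 0.95
print(f"recall C = {c_recall:.2f}")  # 0.00
```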
Precision and Recall per Class
For each class k:
- precision for k = of the items predicted as k, how many were really k
- recall for k = of the items that were really k, how many did the model find
Suppose class cat has:
- true positives = 40
- false positives = 10
- false negatives = 20
Then:
- precision = 40 / (40 + 10) = 0.80
- recall = 40 / (40 + 20) = 0.67
That tells you the model is fairly precise when it predicts cat, but it still misses many actual cats.
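The same arithmetic can be checked in a few lines of Python. This is only a sketch using the counts from the cat example; the F1 line is an addition that uses the standard harmonic-mean definition.

```python
# Counts for class "cat", taken from the example above.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)  # of the predicted cats, the fraction that were cats
recall = tp / (tp + fn)     # of the actual cats, the fraction that were found

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1        = {f1:.2f}")
```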
Macro, Micro, and Weighted Averaging
Because multi-class models have multiple per-class precision and recall values, people often report an averaged summary.
Macro Average
Macro averaging computes the metric independently for each class, then takes the plain average.
This treats every class equally, regardless of how often it appears.
Use macro averages when minority classes matter and you do not want frequent classes to dominate the report.
Micro Average
Micro averaging pools all true positives, false positives, and false negatives across classes first, then computes the metric once.
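This gives more influence to common classes because they contribute more total examples. In single-label multi-class problems, when every class is included in the average, micro-averaged precision and micro-averaged recall both equal accuracy, because each misclassification counts as one false positive for the predicted class and one false negative for the true class.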
Weighted Average
Weighted averaging computes each class metric separately, then averages them using class support as the weight.
Like macro averaging, it starts from per-class scores; like micro averaging, it lets frequent classes influence the result more than rare ones.
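The three averaging schemes can be compared side by side with a short sketch. The per-class counts below are hypothetical (the cat row reuses the earlier example, while the dog and bird rows are invented for illustration), and only precision is shown; recall follows the same pattern with false negatives in place of false positives.

```python
# Hypothetical one-vs-rest counts per class: (TP, FP, FN).
counts = {
    "cat": (40, 10, 20),
    "dog": (80, 25, 10),
    "bird": (5, 5, 10),
}

# Per-class precision and support (number of true examples of each class).
per_class = {k: tp / (tp + fp) for k, (tp, fp, fn) in counts.items()}
support = {k: tp + fn for k, (tp, fp, fn) in counts.items()}

# Macro: plain mean of the per-class values, so every class counts equally.
macro = sum(per_class.values()) / len(per_class)

# Micro: pool the counts across classes first, then compute the metric once.
total_tp = sum(tp for tp, fp, fn in counts.values())
total_fp = sum(fp for tp, fp, fn in counts.values())
micro = total_tp / (total_tp + total_fp)

# Weighted: mean of the per-class values, weighted by class support.
total_support = sum(support.values())
weighted = sum(per_class[k] * support[k] for k in counts) / total_support

print(f"macro    = {macro:.3f}")
print(f"micro    = {micro:.3f}")
print(f"weighted = {weighted:.3f}")
```

With these numbers the rare bird class has the lowest precision, so the macro average comes out lowest: it is the only one of the three that gives bird the same say as cat and dog.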
Runnable Example with scikit-learn
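The following is a minimal sketch, assuming scikit-learn is installed. The label lists are made-up illustration data rather than output from a real model.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up illustration data for a three-class problem.
labels = ["bird", "cat", "dog"]
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "bird", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "dog", "dog", "cat", "dog", "bird", "cat", "cat"]

# Overall accuracy: fraction of predictions that match the true label.
accuracy = accuracy_score(y_true, y_pred)
print("accuracy:", accuracy)

# Text report with one row per class plus averaged summary rows.
# zero_division=0 reports 0 instead of warning when a class is never predicted.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)
print(report)
```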
This prints per-class precision, recall, and F1, plus macro and weighted averages.
That is often the most practical way to inspect a multi-class model: do not stop at one global accuracy number.
When Each Metric Matters
Use accuracy when all classes matter similarly and the dataset is not badly imbalanced.
Use precision when false positives are costly. For example, if predicting the wrong category triggers an expensive human review, precision matters.
Use recall when false negatives are costly. If missing a rare but critical class is unacceptable, recall matters more.
In many real systems, you report all three plus F1, the harmonic mean of precision and recall.
Multi-Class Does Not Mean Multi-Label
A common confusion is mixing multi-class and multi-label classification.
- multi-class: one label per example
- multi-label: multiple labels can be correct at once
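The evaluation setup changes. The metrics have similar names, but the confusion matrix interpretation is different: in multi-label evaluation each label is typically scored as its own yes/no decision, so there is one 2x2 table per label rather than a single matrix of true class against predicted class.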
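A small sketch of the difference in data shape, using made-up examples: a multi-class target is a single label per example, while a multi-label target is a set of labels per example, and each label is counted as its own binary decision. The `label_counts` helper is hypothetical and written only for this illustration.

```python
# Multi-class: exactly one label per example.
multiclass_true = ["cat", "dog", "bird"]

# Multi-label: each example carries a set of labels (possibly empty).
multilabel_true = [{"cat", "outdoor"}, {"dog"}, set()]
multilabel_pred = [{"cat"}, {"dog", "outdoor"}, set()]


def label_counts(true_sets, pred_sets, label):
    """Return binary (TP, FP, FN, TN) for one label across all examples."""
    tp = fp = fn = tn = 0
    for t, p in zip(true_sets, pred_sets):
        in_true, in_pred = label in t, label in p
        if in_true and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# One 2x2 table per label, instead of one class-by-class matrix.
for label in ["cat", "dog", "outdoor"]:
    print(label, label_counts(multilabel_true, multilabel_pred, label))
```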
Common Pitfalls
The biggest mistake is reporting only accuracy for an imbalanced multi-class problem.
Another mistake is quoting a precision or recall value without saying whether it is per-class, macro, micro, or weighted.
A third issue is mixing up multi-class and multi-label evaluation and applying the wrong metric interpretation.
Summary
- Accuracy is overall correctness, but it can hide poor minority-class behavior
- Precision and recall are computed per class using a one-vs-rest view
- Macro average treats classes equally, micro average pools all decisions, and weighted average uses class frequency
- Multi-class evaluation is usually better with a full classification report than with a single number
- Always match the reported metric to the real cost of false positives and false negatives in the application

