Accuracy, precision, and recall for a multi-class model

Introduction

Accuracy, precision, and recall still apply in multi-class classification, but you have to be explicit about how they are aggregated. A single multi-class model predicts one label from several classes, so precision and recall are usually computed one class at a time using a one-vs-rest view, then combined with macro, micro, or weighted averaging.

Start with the Confusion Matrix

A confusion matrix is the foundation for all of these metrics. For a three-class problem with labels A, B, and C, the matrix counts how often each true label was predicted as each class.

For one specific class, say A, define:

  • true positives: actual A, predicted A
  • false positives: predicted A, but actual class was something else
  • false negatives: actual A, but predicted something else

That one-vs-rest framing lets you reuse the standard precision and recall formulas.
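
The one-vs-rest counts can be read straight off a confusion matrix. The sketch below assumes rows are true labels and columns are predicted labels (the convention scikit-learn's confusion_matrix uses); the matrix values and the helper name one_vs_rest_counts are made up for illustration.

```python
# Hypothetical 3-class confusion matrix.
# Assumed convention: rows = true label, columns = predicted label.
LABELS = ["A", "B", "C"]
CM = [
    [76, 4, 0],  # true A
    [5, 9, 1],   # true B
    [4, 0, 1],   # true C
]


def one_vs_rest_counts(cm, k):
    """Return (tp, fp, fn) for class index k, treating every other class as 'rest'."""
    n = len(cm)
    tp = cm[k][k]                                    # diagonal cell
    fp = sum(cm[i][k] for i in range(n) if i != k)   # rest of column k
    fn = sum(cm[k][j] for j in range(n) if j != k)   # rest of row k
    return tp, fp, fn


for k, label in enumerate(LABELS):
    tp, fp, fn = one_vs_rest_counts(CM, k)
    print(f"{label}: TP={tp} FP={fp} FN={fn}")
# A: TP=76 FP=9 FN=4
# B: TP=9 FP=4 FN=6
# C: TP=1 FP=1 FN=4
```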

Accuracy in Multi-Class Problems

Accuracy is the simplest metric:

  • accuracy = correct predictions / total predictions

Correct predictions lie on the diagonal of the confusion matrix, so accuracy is the sum of the diagonal divided by the sum of all cells.

Accuracy is useful when:

  • classes are reasonably balanced
  • all mistakes cost about the same

It becomes less informative when some classes are rare. A model can achieve high accuracy by doing well on common classes while performing badly on important minority classes.
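
A small sketch makes the effect concrete. The label names and class sizes are invented for illustration, and the "model" is a deliberately degenerate one that always predicts the majority class.

```python
from collections import Counter

# Hypothetical imbalanced test set: 90 "common", 8 "rare", 2 "critical".
y_true = ["common"] * 90 + ["rare"] * 8 + ["critical"] * 2
# Degenerate model: always predicts the majority class.
y_pred = ["common"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_per_class(y_true, y_pred):
    """Recall for each class: correct predictions of that class / its true count."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


print("accuracy:", accuracy(y_true, y_pred))
print("recall per class:", recall_per_class(y_true, y_pred))
```

Accuracy comes out at 0.9 even though recall is 0.0 for both minority classes, which is exactly the failure a single accuracy number hides.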

Precision and Recall per Class

For each class k:

  • precision for k = of the items predicted as k, how many were really k
  • recall for k = of the items that were really k, how many did the model find

Suppose class cat has:

  • true positives = 40
  • false positives = 10
  • false negatives = 20

Then:

  • precision = 40 / (40 + 10) = 0.80
  • recall = 40 / (40 + 20) ≈ 0.67

That tells you the model is fairly precise when it predicts cat, but it still misses many actual cats.
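
The same arithmetic as a minimal sketch. Returning 0.0 when the denominator is zero is a convention chosen here for illustration, not something the formulas themselves dictate.

```python
def precision(tp, fp):
    """Of the items predicted as the class, the fraction that really were that class."""
    predicted = tp + fp
    return tp / predicted if predicted else 0.0  # assumed fallback when nothing was predicted


def recall(tp, fn):
    """Of the items that really were the class, the fraction the model found."""
    actual = tp + fn
    return tp / actual if actual else 0.0  # assumed fallback when the class never occurs


# Counts for the "cat" class from the example above.
tp, fp, fn = 40, 10, 20
print(f"precision = {precision(tp, fp):.2f}")  # precision = 0.80
print(f"recall    = {recall(tp, fn):.2f}")     # recall    = 0.67
```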

Macro, Micro, and Weighted Averaging

Because multi-class models have multiple per-class precision and recall values, people often report an averaged summary.

Macro Average

Macro averaging computes the metric independently for each class, then takes the plain average.

This treats every class equally, regardless of how often it appears.

Use macro averages when minority classes matter and you do not want frequent classes to dominate the report.
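
As a rough sketch, macro averaging is just an unweighted mean over per-class values. The per-class counts below are hypothetical (an imbalanced problem with 80, 15, and 5 true examples), and the helper names are illustrative.

```python
# Hypothetical per-class counts as (tp, fp, fn).
counts = {"A": (76, 9, 4), "B": (9, 4, 6), "C": (1, 1, 4)}


def precision(tp, fp, fn):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def macro_average(counts, metric):
    """Compute the metric per class, then take the plain mean (every class counts once)."""
    values = [metric(tp, fp, fn) for tp, fp, fn in counts.values()]
    return sum(values) / len(values)


macro_p = macro_average(counts, precision)
macro_r = macro_average(counts, recall)
print(f"macro precision = {macro_p:.3f}")  # macro precision = 0.695
print(f"macro recall    = {macro_r:.3f}")  # macro recall    = 0.583
```

Overall accuracy on these counts is 86 / 100 = 0.86, yet macro recall is only about 0.58 because class C, with recall 0.2, counts as much as class A.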

Micro Average

Micro averaging pools all true positives, false positives, and false negatives across classes first, then computes the metric once.

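This gives more influence to common classes because they contribute more of the pooled counts. In single-label multi-class classification, every misclassified example is one false positive (for the predicted class) and one false negative (for the true class), so micro-averaged precision and recall are both equal to accuracy.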
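
A minimal sketch of the pooling step, using hypothetical per-class counts. The final comparison against accuracy assumes a single-label multi-class setting, where each example has exactly one true and one predicted label.

```python
# Hypothetical per-class counts as (tp, fp, fn).
counts = {"A": (76, 9, 4), "B": (9, 4, 6), "C": (1, 1, 4)}

# Pool the counts across classes first...
tp_total = sum(tp for tp, _, _ in counts.values())
fp_total = sum(fp for _, fp, _ in counts.values())
fn_total = sum(fn for _, _, fn in counts.values())

# ...then compute each metric once on the pooled totals.
micro_precision = tp_total / (tp_total + fp_total)
micro_recall = tp_total / (tp_total + fn_total)

# Every example is either a true positive or a false negative for its true class,
# so tp_total + fn_total is the number of examples.
accuracy = tp_total / (tp_total + fn_total)

print(f"pooled TP={tp_total} FP={fp_total} FN={fn_total}")  # pooled TP=86 FP=14 FN=14
print(f"micro precision = {micro_precision:.2f}")            # micro precision = 0.86
print(f"micro recall    = {micro_recall:.2f}")               # micro recall    = 0.86
print(f"accuracy        = {accuracy:.2f}")                   # accuracy        = 0.86
```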

Weighted Average

Weighted averaging computes each class metric separately, then averages them using class support as the weight.

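Like macro averaging, it starts from per-class values; like micro averaging, it lets frequent classes dominate the result. Weighted recall always equals overall accuracy, because weighting each class's recall by its support adds the true positives back up to the total number of correct predictions.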
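
The weighting can be sketched as follows. The counts are hypothetical, and support is taken to be the number of true examples of a class (tp + fn).

```python
# Hypothetical per-class counts as (tp, fp, fn).
counts = {"A": (76, 9, 4), "B": (9, 4, 6), "C": (1, 1, 4)}


def precision(tp, fp, fn):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def weighted_average(counts, metric):
    """Average the per-class metric, weighting each class by its support (tp + fn)."""
    supports = {label: tp + fn for label, (tp, fp, fn) in counts.items()}
    total = sum(supports.values())
    return sum(supports[label] * metric(*counts[label]) for label in counts) / total


weighted_p = weighted_average(counts, precision)
weighted_r = weighted_average(counts, recall)
print(f"weighted precision = {weighted_p:.3f}")  # weighted precision = 0.844
print(f"weighted recall    = {weighted_r:.3f}")  # weighted recall    = 0.860
```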

Runnable Example with scikit-learn

python
from sklearn.metrics import classification_report, accuracy_score

# Toy labels for a three-class problem (0 -> A, 1 -> B, 2 -> C)
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 2, 0, 1, 1, 0, 1, 2]

# Overall accuracy, then the per-class precision/recall/F1 table
print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["A", "B", "C"]))

This prints per-class precision, recall, F1, and support, followed by overall accuracy and the macro and weighted averages.

That is often the most practical way to inspect a multi-class model: do not stop at one global accuracy number.
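
When a single summary number is needed, scikit-learn's precision_score and recall_score take an average argument that makes the aggregation explicit. A short sketch on the same toy labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 2, 0, 1, 1, 0, 1, 2]

# Name the averaging mode explicitly rather than relying on a default.
for average in ("macro", "micro", "weighted"):
    p = precision_score(y_true, y_pred, average=average)
    r = recall_score(y_true, y_pred, average=average)
    print(f"{average:>8}: precision={p:.3f} recall={r:.3f}")

# average=None returns one value per class instead of a summary.
per_class_recall = recall_score(y_true, y_pred, average=None)
print("per-class recall:", per_class_recall)
print("accuracy:", accuracy_score(y_true, y_pred))
```

Because this toy data has exactly three examples per class, the macro and weighted averages coincide here; on imbalanced data they generally differ.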

When Each Metric Matters

Use accuracy when all classes matter similarly and the dataset is not badly imbalanced.

Use precision when false positives are costly. For example, if predicting the wrong category triggers an expensive human review, precision matters.

Use recall when false negatives are costly. If missing a rare but critical class is unacceptable, recall matters more.

In many real systems, you report all three plus F1.
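
F1 is the harmonic mean of precision and recall. A minimal sketch computing it from raw counts; the function name f1_from_counts is illustrative, and returning 0.0 for an empty denominator is an assumed convention.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed fallback for an empty class


# Same counts as the "cat" example: precision 0.80, recall about 0.67.
tp, fp, fn = 40, 10, 20
p = tp / (tp + fp)
r = tp / (tp + fn)

f1 = f1_from_counts(tp, fp, fn)
harmonic_mean = 2 * p * r / (p + r)

print(f"F1 = {f1:.3f}")  # F1 = 0.727
```

F1 sits between the two values but closer to the lower one, so a model cannot score well on it by maximizing precision or recall alone.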

Multi-Class Does Not Mean Multi-Label

A common confusion is mixing multi-class and multi-label classification.

  • multi-class: one label per example
  • multi-label: multiple labels can be correct at once

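The evaluation setup changes. In multi-class evaluation, every example lands in exactly one cell of a single K-by-K confusion matrix. In multi-label evaluation, each label is a separate yes/no decision, so you get one 2-by-2 confusion matrix per label, and the same example can be a hit for one label and a miss for another. The metrics share names, but the counts behind them are built differently.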
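
The difference shows up in how labels are represented. In the sketch below, the multi-label data is a hypothetical tagging task where each example carries a set of labels; all names and values are invented for illustration.

```python
# Multi-class: exactly one label per example.
mc_true = ["cat", "dog", "bird"]
mc_pred = ["cat", "bird", "bird"]
mc_accuracy = sum(t == p for t, p in zip(mc_true, mc_pred)) / len(mc_true)

# Multi-label: a set of labels per example (hypothetical tagging task).
ml_true = [{"cat", "indoor"}, {"dog"}, {"bird", "outdoor"}]
ml_pred = [{"cat"}, {"dog", "outdoor"}, {"bird", "outdoor"}]


def label_counts(y_true, y_pred, label):
    """Binary (tp, fp, fn) for one label, treating each example as a yes/no decision."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(label in t and label in p for t, p in pairs)
    fp = sum(label not in t and label in p for t, p in pairs)
    fn = sum(label in t and label not in p for t, p in pairs)
    return tp, fp, fn


print("outdoor:", label_counts(ml_true, ml_pred, "outdoor"))  # outdoor: (1, 1, 0)
print("indoor: ", label_counts(ml_true, ml_pred, "indoor"))   # indoor:  (0, 0, 1)

# Exact-match accuracy: the whole predicted set must equal the true set.
exact_match = sum(t == p for t, p in zip(ml_true, ml_pred)) / len(ml_true)
print(f"exact match = {exact_match:.2f}")  # exact match = 0.33
```

The first example gets "cat" right but misses "indoor", so it is partly correct at the label level while still counting as wrong under exact match. A multi-class prediction has no such middle ground.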

Common Pitfalls

The biggest mistake is reporting only accuracy for an imbalanced multi-class problem.

Another mistake is quoting a precision or recall value without saying whether it is per-class, macro, micro, or weighted.

A third issue is mixing up multi-class and multi-label evaluation and applying the wrong metric interpretation.

Summary

  • Accuracy is overall correctness, but it can hide poor minority-class behavior
  • Precision and recall are computed per class using a one-vs-rest view
  • Macro average treats classes equally, micro average pools all decisions, and weighted average uses class frequency
  • Multi-class evaluation is usually better with a full classification report than with a single number
  • Always match the reported metric to the real cost of false positives and false negatives in the application
