How to output per-class accuracy in Keras?

Introduction

Overall accuracy can hide important failure modes, especially on imbalanced datasets. Per-class accuracy shows how well the model performs for each label individually, which makes it much easier to see whether one class is consistently being ignored or confused.

Start with Predictions and a Confusion Matrix

Keras does not usually report per-class accuracy in its standard training logs, so the common approach is to run predictions on a validation or test set and compute class-wise metrics afterward.

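python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted class = index of the highest probability in each row
probs = model.predict(x_test)
y_pred = np.argmax(probs, axis=1)
# Assumes y_test is one-hot encoded
y_true = np.argmax(y_test, axis=1)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)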

The confusion matrix gives you the raw counts needed to compute per-class accuracy.
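To make the layout of those raw counts concrete, here is a minimal NumPy-only sketch that builds the same kind of matrix by hand. The helper name `confusion_counts` and the label arrays are hypothetical and exist only for illustration; the layout (rows are true classes, columns are predicted classes) matches what `confusion_matrix` from scikit-learn returns.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Count matrix with rows = true class, columns = predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when (true, pred) pairs repeat
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Hypothetical integer labels for a 3-class problem
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

cm = confusion_counts(y_true, y_pred, num_classes=3)
print(cm)
```

Each row sums to the number of true samples of that class, and the diagonal holds the correct predictions.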

Compute Per-Class Accuracy Explicitly

For class i, per-class accuracy is the number of correctly predicted samples of that class divided by the number of true samples of that class.

python
import numpy as np

# Correct predictions per class (diagonal) divided by true samples per class (row sums)
per_class_acc = cm.diagonal() / cm.sum(axis=1)
print(per_class_acc)

If class 0 has 40 correct predictions out of 45 true examples, its per-class accuracy is 40 / 45.

This number is often more informative than overall accuracy because it isolates class-specific behavior.
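The 40 / 45 example can be reproduced on a small, made-up confusion matrix. The sketch below also assumes one class has no true samples in the evaluation set, which would otherwise trigger a divide-by-zero warning; reporting that class as NaN is one possible convention, not the only one.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [40,  3,  2],   # 45 true samples of class 0, 40 predicted correctly
    [ 5, 50,  5],   # 60 true samples of class 1, 50 predicted correctly
    [ 0,  0,  0],   # class 2 has no true samples in this evaluation set
])

correct = cm.diagonal().astype(float)
support = cm.sum(axis=1)

# Divide only where support > 0; classes without true samples stay NaN
per_class_acc = np.full(len(cm), np.nan)
np.divide(correct, support, out=per_class_acc, where=support > 0)
print(per_class_acc)
```

Class 0 comes out as 40 / 45, class 1 as 50 / 60, and class 2 is left undefined rather than silently reported as zero.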

A simple loop makes the output easier to interpret.

python
# Names must be listed in the same order as the integer class ids
class_names = ["cat", "dog", "rabbit"]

for idx, acc in enumerate(per_class_acc):
    print(f"{class_names[idx]}: {acc:.4f}")

This is usually enough for offline evaluation scripts and experiment notebooks.

Use classification_report for a Broader View

Per-class accuracy is useful, but it is often best interpreted alongside precision, recall, and F1 score.

python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=class_names))

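In multi-class classification, the per-class accuracy defined above (correct predictions of a class divided by the true samples of that class) is the same quantity as the recall of that class, so the recall column of the report already contains it. Precision and F1 add the complementary view: how often a prediction of that class is actually right.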
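As a quick check of the relationship between recall and per-class accuracy, the sketch below compares the diagonal-over-row-sum calculation with `recall_score(..., average=None)` from scikit-learn. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical integer labels for a 3-class problem
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred)
per_class_acc = cm.diagonal() / cm.sum(axis=1)

# average=None returns one recall value per class instead of a single average
per_class_recall = recall_score(y_true, y_pred, average=None)

print(per_class_acc)
print(per_class_recall)
```

Both arrays hold the same values, which is why the recall column of `classification_report` can stand in for a hand-computed per-class accuracy.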

Add It at the End of Each Epoch with a Callback

If you want per-class results while training, use a custom callback that runs on validation data at the end of each epoch.

python
import numpy as np
from sklearn.metrics import confusion_matrix
from keras.callbacks import Callback

class PerClassAccuracyCallback(Callback):
    def __init__(self, x_val, y_val, class_names):
        super().__init__()
        self.x_val = x_val
        self.y_val = y_val  # assumed one-hot encoded
        self.class_names = class_names

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)
        y_pred = np.argmax(probs, axis=1)
        y_true = np.argmax(self.y_val, axis=1)
        # Fix the label set so rows always line up with class_names,
        # even if a class is missing from this epoch's labels and predictions
        labels = np.arange(len(self.class_names))
        cm = confusion_matrix(y_true, y_pred, labels=labels)
        accs = cm.diagonal() / cm.sum(axis=1)
        print(f"Epoch {epoch + 1}")
        for idx, acc in enumerate(accs):
            print(f"  {self.class_names[idx]}: {acc:.4f}")

This is helpful during experimentation, though it can slow training if the validation set is large.

Be Careful with Label Encoding

The metric logic depends on how your labels are represented.

  • if labels are one-hot encoded, use argmax
  • if labels are integer-encoded already, use them directly

For example:

python
y_true = y_test  # if y_test already contains class ids such as 0, 1, 2

Mixing one-hot and integer assumptions is one of the easiest ways to compute the wrong metric silently.
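One way to guard against that mix-up is to normalize labels through a single helper before computing any metric. The function name `to_class_ids` is hypothetical, and the sketch assumes that a 2-D array with more than one column is one-hot (or a probability matrix) while anything else already holds integer class ids.

```python
import numpy as np

def to_class_ids(labels):
    """Return a 1-D array of integer class ids from one-hot or integer labels."""
    labels = np.asarray(labels)
    # Assumption: 2-D with several columns means one-hot rows or probabilities
    if labels.ndim == 2 and labels.shape[1] > 1:
        return np.argmax(labels, axis=1)
    # Already integer ids; flatten shapes such as (n, 1) to (n,)
    return labels.reshape(-1).astype(int)

one_hot = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
ids_from_one_hot = to_class_ids(one_hot)
ids_from_ints = to_class_ids([0, 2, 1])
ids_from_column = to_class_ids([[0], [2], [1]])
print(ids_from_one_hot, ids_from_ints, ids_from_column)
```

Passing both `y_test` and the model output through the same helper keeps the two sides of the comparison in the same representation.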

Common Pitfalls

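A common mistake is reporting "per-class accuracy" without stating the definition. The version used here (correct predictions divided by true samples of the class) equals per-class recall; it is not precision, and it is not the one-vs-rest accuracy that also counts true negatives. Be explicit about which one you are reporting.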

Another is using overall accuracy on an imbalanced dataset and assuming the model is healthy when one minority class has almost zero recovery.
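A small synthetic case shows how large the gap can be. The data below is made up: 95 samples of a majority class, 5 of a minority class, and a degenerate model that always predicts the majority class.

```python
import numpy as np

# Hypothetical imbalanced test set: 95 samples of class 0, 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)
# A degenerate model that always predicts the majority class
y_pred = np.zeros_like(y_true)

overall_acc = np.mean(y_true == y_pred)

# Accuracy restricted to the true samples of each class
per_class_acc = np.array([np.mean(y_pred[y_true == c] == c) for c in (0, 1)])

print(overall_acc)
print(per_class_acc)
```

Overall accuracy is 95 percent, yet the minority class is never recovered, which only the per-class view reveals.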

Developers also sometimes compute the confusion matrix from shuffled labels and predictions that no longer align, which makes every downstream metric meaningless.
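If the evaluation data has to be reordered, a simple safeguard is to apply one permutation to labels and predictions together so the pairs stay intact. The arrays below are hypothetical; the point of the sketch is only the shared index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels and predictions that are aligned sample by sample
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2])

# One permutation, applied to both arrays, keeps every (true, pred) pair together
perm = rng.permutation(len(y_true))
y_true_shuffled = y_true[perm]
y_pred_shuffled = y_pred[perm]

acc_before = np.mean(y_true == y_pred)
acc_after = np.mean(y_true_shuffled == y_pred_shuffled)
print(acc_before, acc_after)
```

Shuffling only one of the two arrays, or reading labels in a different order than the inputs passed to `model.predict`, would break this pairing without raising any error.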

Summary

  • Keras usually does not print per-class accuracy automatically in standard logs.
  • The common solution is to compute it from predictions and a confusion matrix.
  • Per-class accuracy is the diagonal count divided by the number of true samples for each class.
  • classification_report is a useful companion because it adds precision, recall, and F1 score.
  • A custom callback can print per-class results after each epoch when needed.
