Huggingface transformers trainer output not giving any predictions?

Introduction

When the Hugging Face Trainer.predict() method seems to return empty or unexpected predictions, the usual cause is that it returns raw model outputs (logits) rather than class labels, so you must post-process them yourself. Other common causes are an empty or misformatted dataset, tokenization that does not match the model, a missing compute_metrics function (which leaves the metrics without accuracy or F1), or calling evaluate() when you meant predict().

Basic Prediction Setup

python

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare test data
dataset = load_dataset("glue", "sst2", split="validation[:100]")

def tokenize(examples):
    return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)

test_dataset = dataset.map(tokenize, batched=True)

# Get predictions
trainer = Trainer(model=model)
predictions = trainer.predict(test_dataset)

print(predictions.predictions.shape)  # (100, 2): logits, not labels
print(predictions.label_ids.shape)    # (100,): true labels

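trainer.predict() returns a PredictionOutput with .predictions (raw logits), .label_ids (ground truth, or None if the dataset has no labels), and .metrics (the loss when labels are present, runtime statistics, and whatever compute_metrics returns if you provided it).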

Fix 1: Convert Logits to Predictions

python

import numpy as np

output = trainer.predict(test_dataset)

# Raw logits, NOT class labels
print(output.predictions[:3])
# Illustrative values: [[ 3.2, -2.8], [-1.5,  2.1], [ 0.3, -0.1]]

# Convert to class predictions: index of the largest logit in each row
predictions = np.argmax(output.predictions, axis=-1)
print(predictions[:3])  # [0, 1, 0] for the logits above

# For regression (num_labels=1) the output has shape (n, 1); flatten it instead
# predictions = output.predictions.squeeze()

The most common "no predictions" issue is that developers expect class labels but get logit arrays. np.argmax() converts logits to class indices.
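If you also want a confidence score for each prediction, the logits can be turned into probabilities with a softmax. The following is a minimal NumPy sketch; the softmax helper is written here for illustration and is not part of the Trainer API, and the logits are the illustrative values from the example above rather than real model output.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max avoids overflow in exp()."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Illustrative logits standing in for output.predictions
logits = np.array([[3.2, -2.8], [-1.5, 2.1], [0.3, -0.1]])

probs = softmax(logits)
pred_ids = probs.argmax(axis=-1)   # same classes as argmax on the raw logits
confidence = probs.max(axis=-1)    # probability assigned to the predicted class

print(pred_ids)  # [0 1 0]
print(confidence.round(3))
```

Softmax preserves the ordering within each row, so the predicted class does not change; only the scale becomes interpretable.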

Fix 2: Add compute_metrics

python

from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer, TrainingArguments
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred unpacks into (logits, labels)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
    }

training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
)

output = trainer.predict(test_dataset)
print(output.metrics)
# Illustrative values (runtime statistics are also included):
# {'test_loss': 0.32, 'test_accuracy': 0.91, 'test_f1': 0.90, ...}

Without compute_metrics, output.metrics contains only the loss (when the dataset has labels) and runtime statistics. Adding this function populates it with accuracy, F1, and any other metrics you compute, each prefixed with test_.
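Before wiring a metrics function into the Trainer, it can help to call it directly on a small hand-made (logits, labels) pair and confirm it returns a plain dict of floats. The sketch below uses a NumPy-only accuracy so that it runs without scikit-learn; the fake_logits and fake_labels arrays are made up for the check, and a plain tuple stands in for the object the Trainer passes in.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Made-up logits and labels, used only to exercise the function
fake_logits = np.array([[2.0, -1.0], [0.1, 0.4], [1.5, 0.2], [-0.3, 0.9]])
fake_labels = np.array([0, 1, 1, 1])

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)  # {'accuracy': 0.75}
```

If this dry run raises or returns something other than a dict, the same problem would surface inside predict() with a much longer traceback.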

Fix 3: Ensure Dataset Has Correct Columns

python

# Check what columns the dataset has
print(test_dataset.column_names)
# ['sentence', 'label', 'input_ids', 'attention_mask']

# Trainer needs: input_ids, attention_mask, (optionally) labels
# Remove unnecessary columns (by default the Trainer also drops columns
# the model does not use, since remove_unused_columns=True)
test_dataset = test_dataset.remove_columns(["sentence"])

# Set format to PyTorch tensors
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Verify
print(test_dataset[0])

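The model consumes only the columns that match its forward() signature. If input_ids is missing, the Trainer typically raises an error instead of returning predictions; if attention_mask is missing, padded positions are treated as real tokens, which can distort the output. Always verify the dataset columns after tokenization.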
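The column check can be automated with a small helper that compares dataset.column_names against the inputs you expect. This is a sketch in plain Python; the function name missing_model_inputs and the default list of required columns are assumptions for illustration, so adjust them to your model (some models also need token_type_ids, for example).

```python
def missing_model_inputs(column_names, required=("input_ids", "attention_mask")):
    """Return the required columns that are absent, in the order given."""
    present = set(column_names)
    return [name for name in required if name not in present]

# Column lists as they would be printed by dataset.column_names
before_tokenizing = ["sentence", "label"]
after_tokenizing = ["sentence", "label", "input_ids", "attention_mask"]

print(missing_model_inputs(before_tokenizing))  # ['input_ids', 'attention_mask']
print(missing_model_inputs(after_tokenizing))   # []
```

Calling it right before trainer.predict() turns a confusing downstream error into an explicit list of what is missing.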

Fix 4: Rename Label Column

python

# The model's forward() takes its targets as 'labels' (plural).
# The default data collators also accept 'label' (singular) and rename it.

# Option A: the column has another name (here 'target', as an example)
test_dataset = test_dataset.rename_column("target", "labels")

# Option B: copy it from another column with map (use A or B, not both)
def add_labels(example):
    example["labels"] = example["sentiment"]
    return example

test_dataset = test_dataset.map(add_labels)

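The model's forward() receives its targets through an argument named labels. The default data collators also accept a column named label and rename it, but by default any other name (target, sentiment, and so on) is dropped as an unused column. Without labels, predict() still returns logits, but label_ids is None, no loss is reported, and compute_metrics is not called.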
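For data that is not a datasets.Dataset, such as a plain list of feature dicts fed to a custom torch Dataset, the same renaming can be sketched in plain Python. The helper name normalize_label_key and the candidate key names are hypothetical; substitute whatever your data actually uses.

```python
def normalize_label_key(features, candidates=("label", "target", "sentiment")):
    """Return copies of the feature dicts with the label stored under 'labels'."""
    normalized = []
    for feature in features:
        feature = dict(feature)  # shallow copy so the input is left untouched
        if "labels" not in feature:
            for key in candidates:
                if key in feature:
                    feature["labels"] = feature.pop(key)
                    break
        normalized.append(feature)
    return normalized

# Made-up rows: one uses 'sentiment', the other already uses 'labels'
rows = [
    {"input_ids": [101, 2023, 102], "sentiment": 1},
    {"input_ids": [101, 2919, 102], "labels": 0},
]

clean_rows = normalize_label_key(rows)
print([sorted(row) for row in clean_rows])
# [['input_ids', 'labels'], ['input_ids', 'labels']]
```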

Fix 5: Handle Empty Predictions

python

import numpy as np

output = trainer.predict(test_dataset)

# Check if predictions are empty
if output.predictions is None or len(output.predictions) == 0:
    print("No predictions: check dataset")
    print(f"Dataset size: {len(test_dataset)}")
    print(f"Columns: {test_dataset.column_names}")
else:
    print(f"Got {len(output.predictions)} predictions")
    predictions = np.argmax(output.predictions, axis=-1)

If predictions are empty, the dataset is likely empty or incorrectly formatted. Always check the dataset size and column names first.
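A related surprise is that .predictions is not always a single array: when a model returns several outputs, it can be a tuple of arrays, and len() then counts the outputs rather than the examples. The helper below is a defensive sketch; the name extract_logits is hypothetical, and treating the first tuple element as the logits is an assumption you should confirm for your model.

```python
import numpy as np

def extract_logits(predictions):
    """Return the logits array, or None when there is nothing to decode."""
    if predictions is None:
        return None
    if isinstance(predictions, tuple):
        # Assumption: the logits are the first element of the tuple
        predictions = predictions[0]
    predictions = np.asarray(predictions)
    return predictions if predictions.size > 0 else None

# Stand-ins for the different shapes output.predictions can take
single = np.array([[0.2, 0.8], [1.1, -0.4]])
as_tuple = (single, np.zeros((2, 4, 8)))   # logits plus an extra output
empty = np.empty((0, 2))

print(extract_logits(single).shape)     # (2, 2)
print(extract_logits(as_tuple).shape)   # (2, 2)
print(extract_logits(empty))            # None
```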

Fix 6: evaluate() vs predict()

python

# evaluate() returns only metrics, not predictions
eval_output = trainer.evaluate(test_dataset)
print(eval_output)  # e.g. {'eval_loss': 0.32, 'eval_accuracy': 0.91, ...}
# eval_accuracy appears only if compute_metrics was provided
# No predictions!

# predict() returns predictions AND metrics
pred_output = trainer.predict(test_dataset)
print(pred_output.predictions.shape)  # (100, 2)
print(pred_output.metrics)  # {'test_loss': 0.32, ...}

evaluate() only computes metrics; it does not return per-example predictions. Use predict() when you need the actual model outputs for each input.
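One practical reason to prefer predict() is error analysis: with per-example outputs you can list exactly which inputs the model got wrong. The sketch below uses small made-up arrays in place of pred_output.predictions and pred_output.label_ids.

```python
import numpy as np

# Stand-ins for pred_output.predictions and pred_output.label_ids
logits = np.array([[2.1, -0.5], [0.2, 1.3], [1.0, 0.4], [-0.7, 0.6]])
label_ids = np.array([0, 1, 1, 1])

pred_ids = logits.argmax(axis=-1)
wrong = np.flatnonzero(pred_ids != label_ids)  # indices of misclassified rows

for idx in wrong:
    print(f"example {idx}: predicted {pred_ids[idx]}, expected {label_ids[idx]}")
# example 2: predicted 0, expected 1
```

The indices in wrong line up with the rows of the dataset passed to predict(), so they can be used to look up the original text.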

Token Classification (NER) Predictions

python

import numpy as np

output = trainer.predict(test_dataset)

# predictions shape: (num_examples, seq_length, num_labels)
# Need argmax on last dimension
predictions = np.argmax(output.predictions, axis=-1)

# Map label IDs to label names (this list must match the model's label order)
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]
for i, pred in enumerate(predictions[:3]):
    # Note: pred still includes special-token and padding positions
    labels = [label_names[p] for p in pred]
    print(f"Example {i}: {labels[:10]}")

For token classification, predictions have three dimensions. Apply argmax along the last axis, then map indices to label names.
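The per-token output also covers special tokens and padding, which usually should not be reported. A common convention is to mark those positions with -100 in the labels and skip them when decoding. The sketch below assumes your test set follows that convention; the arrays are tiny made-up stand-ins for output.predictions and output.label_ids.

```python
import numpy as np

label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

# Stand-in for output.predictions: 1 example, 5 tokens, 5 labels
logits = np.zeros((1, 5, 5))
logits[0, 1, 1] = 4.0   # token 1 -> B-PER
logits[0, 2, 2] = 4.0   # token 2 -> I-PER
logits[0, 3, 0] = 4.0   # token 3 -> O

# Stand-in for output.label_ids: -100 marks positions to ignore
label_ids = np.array([[-100, 1, 2, 0, -100]])

pred_ids = logits.argmax(axis=-1)

# Keep only the positions whose gold label is not -100
decoded = [
    [label_names[p] for p, gold in zip(pred_row, gold_row) if gold != -100]
    for pred_row, gold_row in zip(pred_ids, label_ids)
]
print(decoded)  # [['B-PER', 'I-PER', 'O']]
```

If the test set has no labels, the attention mask and the tokenizer's word IDs would have to serve as the filter instead.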

Common Pitfalls

Expecting labels instead of logits: trainer.predict() returns raw logits (or, for models with several outputs, a tuple of arrays). Apply np.argmax() for classification or .squeeze() for regression to get usable predictions.
Confusing evaluate() with predict(): evaluate() returns only metrics, not per-example predictions. Always use predict() when you need the model's output for each input.
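Missing tokenization columns: If the dataset lacks input_ids, the model has nothing to run on and the Trainer typically raises an error. Always verify columns with dataset.column_names after tokenization.
Wrong label column name: The model receives its targets through an argument named labels; the default collators also accept label. If your dataset uses target, sentiment, or another name, rename it. Without labels, predict() still returns logits but reports no loss and skips compute_metrics, which is easy to overlook.
Unpadded or unconverted inputs: Calling dataset.set_format("torch") is usually optional, because the Trainer's data collator converts lists to tensors. What does break batching is variable-length sequences without padding, so either pad during tokenization or use a padding collator such as DataCollatorWithPadding.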

Summary

trainer.predict() returns raw logits; use np.argmax() to convert them to class predictions
Add compute_metrics to the Trainer to get accuracy, F1, and other metrics in the output
Use predict() for per-example predictions, not evaluate() (which only returns metrics)
Ensure the dataset has input_ids and attention_mask columns, plus labels (or label) if you want loss and metrics
Rename non-standard label columns to labels for Trainer compatibility
Check dataset.column_names and len(dataset) when debugging empty predictions