How to use Transformers for text classification?

Introduction

Transformers are a strong default choice for text classification because they start from language representations learned on large corpora and can be fine-tuned with relatively little task-specific data. In practice, using them well is less about memorizing the architecture and more about choosing a pretrained model, tokenizing correctly, and training with the right loss and evaluation setup. A small, clean pipeline often beats a large, poorly controlled one.

Pick the Right Model Family

For classification, you usually want an encoder-style model such as BERT, RoBERTa, or DistilBERT. These models produce contextual token representations and are commonly fine-tuned by adding a classification head on top of the pooled sequence representation.

A few practical choices:

distilbert-base-uncased for smaller and faster experiments
bert-base-uncased for a standard baseline on English text
roberta-base when you want a strong general-purpose encoder
a domain-specific or multilingual model if the text is legal, biomedical, financial, or not in English

The best model is not always the biggest one. If latency and memory matter, a smaller checkpoint can be the better engineering choice.

Build a Minimal Fine-Tuning Pipeline

The Hugging Face transformers library makes the mechanics straightforward. The core steps are:

load a tokenizer
tokenize text with truncation and padding
load a sequence classification model with the correct number of labels
train and evaluate on your labeled data

Here is a runnable example using PyTorch and the Trainer API on a tiny in-memory dataset:

```python
import numpy as np
import evaluate
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Tiny in-memory dataset: 1 = positive, 0 = negative.
texts = [
    "I loved the product and would buy it again.",
    "This was a terrible purchase.",
    "The service was excellent and fast.",
    "I want a refund.",
]
labels = [1, 0, 1, 0]

# Load the tokenizer from the same checkpoint as the model.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

raw_dataset = Dataset.from_dict({"text": texts, "label": labels})


def tokenize(batch):
    # Truncate long inputs and pad every example to 64 tokens.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)


dataset = raw_dataset.map(tokenize, batched=True)
dataset = dataset.train_test_split(test_size=0.5, seed=42)

# Adds a freshly initialized classification head with two output classes.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    # eval_pred unpacks into the raw logits and the reference labels.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


training_args = TrainingArguments(
    output_dir="./tmp-transformer-clf",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    # Older transformers releases call this argument `evaluation_strategy`.
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="no",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    # Passing the collator explicitly avoids the `tokenizer=` argument,
    # which newer transformers releases deprecate.
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())
```

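This is deliberately small. With four examples the reported accuracy means nothing, but the code exercises the full classification flow and scales to a real dataset by swapping in your own texts and labels.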

Prepare Labels and Metrics Correctly

For single-label classification, each example belongs to one class, and the loss is usually cross-entropy. For multi-label classification, one example can belong to several classes, which changes both the output activation and the loss.

That distinction matters more than beginners expect:

single-label: one class per example, use softmax-style outputs
multi-label: multiple independent labels, use sigmoid-style outputs
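
The difference between the two setups can be sketched in plain Python. This is a minimal illustration, not library code: the logits are made-up numbers for a hypothetical three-label problem, and the helper names are chosen only for this example.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map one logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cross_entropy(logits, target_index):
    """Single-label loss: negative log-probability of the one correct class."""
    return -math.log(softmax(logits)[target_index])


def binary_cross_entropy(logits, targets):
    """Multi-label loss: mean of one binary cross-entropy term per label."""
    losses = []
    for logit, target in zip(logits, targets):
        p = sigmoid(logit)
        losses.append(-(target * math.log(p) + (1 - target) * math.log(1 - p)))
    return sum(losses) / len(losses)


logits = [2.0, -1.0, 0.5]  # hypothetical model outputs for three labels

# Single-label: classes compete, so the probabilities sum to 1.
single_label_probs = softmax(logits)
single_label_loss = cross_entropy(logits, target_index=0)

# Multi-label: each label is scored on its own, so the sum is unconstrained.
multi_label_probs = [sigmoid(x) for x in logits]
multi_label_loss = binary_cross_entropy(logits, targets=[1, 0, 1])

print(sum(single_label_probs), sum(multi_label_probs))
```

In Hugging Face transformers, the sequence classification models switch between these behaviors through the `problem_type` setting in the model config (for example `"multi_label_classification"`), in which case the labels are expected as multi-hot float vectors rather than single class indices.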

Your evaluation metric should match the business problem. Accuracy is fine for balanced binary tasks, but F1, precision, recall, or ROC AUC may be better for imbalanced datasets.
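
To see why accuracy can mislead on imbalanced data, here is a small sketch that computes the metrics by hand. The 95/5 class split and the "always predict negative" classifier are invented for illustration; in a real project a metrics library would normally do this work.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A useless classifier that never predicts the positive class.
always_negative = [0] * 100

metrics = binary_metrics(y_true, always_negative)
print(metrics)  # accuracy is 0.95, yet recall and F1 are both 0.0
```

The classifier never finds a single positive example, but accuracy alone would report it as 95% correct.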

Control Sequence Length and Batch Size

The cost of self-attention grows roughly quadratically with sequence length, so padding everything to a large maximum length is wasteful. A few practical rules help:

inspect your actual text length distribution
truncate aggressively if the label signal is near the beginning
use dynamic padding when your framework supports it
when memory runs out, lower the batch size first, before making changes that cost model quality
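
The effect of dynamic padding is easy to estimate before training. The sketch below counts how many token positions a model would process under each strategy; the sequence lengths, batch size, and maximum length are hypothetical values chosen for illustration.

```python
def fixed_padding_tokens(lengths, max_length):
    """Token positions processed when every example is padded to max_length."""
    return len(lengths) * max_length


def dynamic_padding_tokens(lengths, batch_size, max_length):
    """Token positions processed when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Truncate to max_length first, then pad to the longest item in the batch.
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += len(batch) * max(batch)
    return total


# Hypothetical tokenized lengths for eight short reviews.
lengths = [12, 7, 30, 9, 15, 11, 8, 22]

fixed = fixed_padding_tokens(lengths, max_length=128)
dynamic = dynamic_padding_tokens(lengths, batch_size=4, max_length=128)

print(fixed, dynamic)  # 1024 208
```

In the Hugging Face ecosystem, `DataCollatorWithPadding` applies this per-batch padding when the tokenizer is called without a fixed `padding="max_length"` setting.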

If memory is tight, gradient accumulation can simulate a larger batch size without requiring the GPU to hold it all at once.
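
A toy example shows why this works. The sketch below uses a one-parameter linear model with mean squared error, which is an assumption made purely to keep the arithmetic visible, and checks that gradients accumulated over two micro-batches match the gradient of the full batch.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    num_steps = len(xs) // micro_batch_size  # assumes the data splits evenly
    total = 0.0
    for step in range(num_steps):
        start = step * micro_batch_size
        chunk_x = xs[start:start + micro_batch_size]
        chunk_y = ys[start:start + micro_batch_size]
        # Only one micro-batch needs to be held in memory at a time.
        total += mse_gradient(w, chunk_x, chunk_y) / num_steps
    return total


# Hypothetical data and starting weight.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

full_batch = mse_gradient(w, xs, ys)
accumulated = accumulated_gradient(w, xs, ys, micro_batch_size=2)

print(full_batch, accumulated)  # the two values agree up to rounding
```

With the Trainer API, the same idea is exposed through the `gradient_accumulation_steps` argument of `TrainingArguments`: the effective batch size is the per-device batch size multiplied by the number of accumulation steps.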

Fine-Tuning Versus Feature Extraction

You do not always need full fine-tuning. Another useful pattern is to use the transformer as a frozen feature extractor and train a shallow classifier on top of its embeddings.

That approach can be helpful when:

the dataset is small
overfitting is a concern
training resources are limited
you need a fast baseline before investing in deeper tuning

Fine-tuning usually gives better task performance, but frozen embeddings can be simpler to debug and deploy.
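
The feature-extraction pattern can be sketched without any model download. In the example below, tiny 2-D vectors stand in for the hidden states a frozen encoder would return (real encoders such as BERT-base produce 768-dimensional vectors), masked mean pooling turns them into one feature vector per text, and a small logistic regression plays the role of the shallow classifier. All data and helper names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def train_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Fit weights and a bias with plain gradient descent on log loss."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-score))
            for d in range(dim):
                grad_w[d] += (p - y) * x[d]
            grad_b += p - y
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 when the linear score is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Stand-in "hidden states" plus attention masks for four padded sequences.
sequences = [
    ([[1.0, 0.2], [0.8, 0.0], [0.0, 0.0]], [1, 1, 0]),
    ([[0.9, 0.1], [1.1, 0.3], [0.7, 0.2]], [1, 1, 1]),
    ([[-1.0, 0.1], [-0.8, 0.3], [0.0, 0.0]], [1, 1, 0]),
    ([[-0.9, 0.0], [0.0, 0.0], [0.0, 0.0]], [1, 0, 0]),
]
labels = [1, 1, 0, 0]

# The encoder stays frozen: features are computed once and never updated.
pooled = [mean_pool(vectors, mask) for vectors, mask in sequences]
weights, bias = train_logistic_regression(pooled, labels)
predictions = [predict(weights, bias, x) for x in pooled]
print(predictions)
```

Because the features are computed once, they can be cached to disk, and only the small classifier needs retraining when labels change.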

Common Pitfalls

A common mistake is treating tokenization as a minor preprocessing step. The tokenizer must match the pretrained model exactly.
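
A toy illustration of what goes wrong: the two vocabularies below are invented stand-ins for tokenizers that belong to different checkpoints. The same text maps to different IDs, so a model trained with one vocabulary would look up the wrong embeddings if fed IDs from the other.

```python
# Hypothetical vocabularies standing in for two different checkpoints.
vocab_a = {"[UNK]": 0, "great": 1, "movie": 2, "bad": 3}
vocab_b = {"[UNK]": 0, "bad": 1, "great": 2, "film": 3}


def encode(text, vocab):
    """Whitespace-split the text and map each token to its ID, or to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]


ids_a = encode("great movie", vocab_a)
ids_b = encode("great movie", vocab_b)

print(ids_a)  # [1, 2]
print(ids_b)  # [2, 0]
```

A model trained with `vocab_a` reads ID 2 as "movie", while `vocab_b` emits ID 2 for "great" and loses "movie" entirely. Loading the tokenizer and the model from the same checkpoint name avoids this class of error.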

Another mistake is using the wrong classification setup for the label structure. Single-label and multi-label tasks are not interchangeable.

Padding every sequence to the model's maximum length is another frequent problem. Most of the padded positions carry no information, so the extra length wastes memory and slows training.

Finally, do not judge the model by training loss alone. Keep a validation set and track a metric that reflects the real task.

Summary

For text classification, encoder-style transformer models such as BERT and DistilBERT are the usual starting point
The core workflow is tokenize, load a pretrained classifier, fine-tune, and evaluate
Label format matters because single-label and multi-label classification use different output assumptions
Sequence length and padding strategy have a large effect on speed and memory use
Fine-tuning is powerful, but frozen embeddings can be a good baseline
Match the tokenizer, model, loss, and metric carefully instead of treating them as independent choices