Data Labeling and Annotation

Why Labels Matter

A model trained on noisy labels learns the noise. This is one of the most expensive lessons in production ML, and teams often learn it the hard way: they spend months tuning architectures and hyperparameters, only to discover that the model's quality ceiling was set by the labels, not the algorithm. As a rule of thumb, a label set that is 95% accurate caps the accuracy you can reliably reach and verify at roughly 95%, no matter how sophisticated the architecture.

Why? Supervised learning works by minimizing the gap between predictions and labels. If the labels themselves are wrong 10% of the time, the model learns to reproduce those errors. Worse, label noise is not random — it clusters around ambiguous examples (is this email spam or a newsletter? is this image a hotdog or a sandwich?). The model learns the noise patterns as if they were real signal, creating systematic blind spots that no amount of training data can fix.

The Labeling Cost Problem

Labeling is the most labor-intensive step in the ML pipeline. Consider the scale: a text classification model might need 50,000 labeled examples. A named entity recognition model needs every token in every sentence tagged. An image segmentation model needs pixel-level masks — a single image can take 30 minutes to label. At $0.10 per label for simple tasks and $2-5 per label for complex annotation, a production dataset easily costs $50,000 to $500,000 to build.

This cost creates a strategic trade-off: you can label more data (brute force), label smarter (active learning), label programmatically (weak supervision), or generate synthetic labeled data. Each strategy has different cost, quality, and speed characteristics. The rest of this lesson covers when to use each one.

Key Insight

Label quality beats label quantity more often than not. A smaller set of clean labels (say 5,000) frequently outperforms a much larger noisy one (say 50,000), especially when the noise is systematic. Before scaling your labeling effort, run a small experiment: have at least two annotators label the same 500 examples, measure inter-annotator agreement, and resolve the disagreements. If agreement is below 80%, your guidelines need work before you spend money on more labels.

Label Noise Cascades

Label noise does not just reduce accuracy; it compounds through the ML pipeline. Noisy labels produce a noisy model. That noisy model generates noisy predictions that feed downstream features (for example, a fraud score used as input to a credit model). The downstream model inherits the upstream noise and adds its own. In a system with three chained models, 5% label noise in the first model can plausibly grow to 15-20% effective noise in the final model once each stage contributes its own errors.
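The compounding can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that errors at each stage are independent and that an upstream error is never accidentally corrected downstream; both are simplifications, and the per-stage rates are illustrative numbers, not measurements.

```python
def cascaded_error(stage_error_rates):
    """Chance that at least one stage in a chain of models is wrong.

    Assumes stage errors are independent and that an upstream error is
    never accidentally corrected downstream (both are simplifications).
    """
    p_all_correct = 1.0
    for rate in stage_error_rates:
        p_all_correct *= 1.0 - rate
    return 1.0 - p_all_correct


# Illustrative rates: 5% label noise in model 1, then 6% added by each
# of two downstream models.
print(f"{cascaded_error([0.05, 0.06, 0.06]):.1%}")  # 16.1%
```

Under these assumptions, three modest error rates already land inside the 15-20% range described above.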

This cascade is why labeling is not just a data team problem — it is an infrastructure problem. The labeling pipeline needs the same rigor as the serving pipeline: version control, quality gates, monitoring, and rollback capability.

Labeling Strategies

Not every label needs a human. The art of production labeling is matching the right strategy to each slice of your data. Some examples are easy — a rule can label them. Some are hard — only a domain expert can decide. Spending expert time on easy examples wastes money. Letting rules decide hard examples creates noise. The optimal pipeline uses multiple strategies in layers.

Human Annotation Workflows

Human annotation remains the gold standard for quality. The workflow starts with task design: define exactly what the annotator must decide, what the options are, and what counts as each option. Ambiguous instructions produce inconsistent labels.

Labeling tools (Label Studio, Scale AI, Labelbox) manage the workflow at scale. They handle task routing (send medical images to radiologists, legal documents to lawyers), quality monitoring (flag annotators whose agreement rate drops), and data management (track which examples have been labeled, by whom, and when). Enterprise platforms like Scale AI add a managed workforce — you upload data and receive labels without hiring annotators.

Task design is the most underrated factor in label quality. A well-designed task presents the annotator with exactly the information needed for the decision, structured to minimize cognitive load. For text classification: show the text, show the categories with definitions and examples, let the annotator pick one. For sequence labeling: highlight the entity span and ask the annotator to confirm or correct the type. For image annotation: provide reference images of each class.

Edge case guidelines are where labeling quality is won or lost. Every classification task has a gray zone — examples that could reasonably go either way. Without explicit guidelines for these cases, each annotator invents their own rule, and label noise spikes. The solution: maintain a living edge case document with examples and decisions. Review it weekly as new edge cases surface.

Synthetic Data Generation

When real labeled data is scarce or expensive, you can generate synthetic examples. Three approaches dominate:

Rule-based generation: For structured data, write rules that generate valid examples. A fraud detection system can generate synthetic fraudulent transactions by combining known fraud patterns (unusual amounts, foreign IPs, rapid succession) with random variation. The labels are free because you define them by construction.

Augmentation: Transform existing labeled examples to create new ones. Image augmentation (rotation, cropping, color jitter) is standard in computer vision. Text augmentation (synonym replacement, back-translation, random insertion) works for NLP. The label carries over from the original example only as long as the transformation preserves the semantic content, so check that it does: rotating a handwritten 6 by 180 degrees turns it into a 9, and swapping a sentiment word for a loose synonym can flip the label.

LLM-generated data: Use large language models to generate labeled examples. Prompt GPT-4 with "Generate 100 customer support tickets labeled as billing, technical, or account issues." The quality depends heavily on prompt engineering and requires human validation on a sample. Best used for bootstrapping an initial dataset, not as a replacement for human labels.
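As an illustration of the rule-based approach, here is a minimal sketch of a synthetic fraud generator. The field names, value ranges, and probabilities are all hypothetical choices made for this example; a real generator would encode patterns taken from your own fraud cases.

```python
import random


def synth_transaction(rng, is_fraud):
    """Build one transaction whose label is known by construction."""
    if is_fraud:
        # Known fraud patterns from the text, with random variation.
        amount = round(rng.uniform(2_000, 10_000), 2)   # unusual amount
        foreign_ip = rng.random() < 0.8                 # mostly foreign IPs
        seconds_since_last = rng.randint(1, 30)         # rapid succession
    else:
        amount = round(rng.uniform(5, 300), 2)
        foreign_ip = rng.random() < 0.05
        seconds_since_last = rng.randint(600, 86_400)
    return {
        "amount": amount,
        "foreign_ip": foreign_ip,
        "seconds_since_last": seconds_since_last,
        "label": int(is_fraud),
    }


def generate_dataset(n, fraud_rate=0.1, seed=0):
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [synth_transaction(rng, rng.random() < fraud_rate) for _ in range(n)]


data = generate_dataset(1_000)
print(sum(row["label"] for row in data), "synthetic fraud rows out of", len(data))
```

Note that the two classes here are separable by a single threshold on any field, so a model trained only on this data would simply relearn the generator's rules.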

Synthetic data is not a free lunch. Models trained only on synthetic data learn the distribution of the generator, not the real world. The standard approach: use synthetic data for pre-training or augmentation, then fine-tune on real labeled data. The real data grounds the model in actual data distributions.

Active Learning and Weak Supervision

The brute-force approach to labeling — randomly sample data, label all of it — wastes annotator time on examples the model already handles well. Active learning flips this: let the model tell you which examples to label next. Weak supervision goes further: replace some human labels entirely with programmatic rules. Together, these techniques can cut labeling costs by 50-80% while maintaining or even improving model quality.

Active Learning

Active learning is a cycle. Train an initial model on a small labeled seed set (100-500 examples). Use that model to predict on the unlabeled pool. Rank unlabeled examples by the model's uncertainty — examples where the model's confidence is near 50-50 are the most informative. Send the top-K most uncertain examples to human annotators. Add the new labels to the training set. Retrain. Repeat.
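The cycle can be sketched end to end with a deliberately tiny setup. Everything here is an assumption made for illustration: the "model" is a one-dimensional threshold, the annotator is a simulated `oracle` function, and uncertainty is the distance to the current threshold. A real system would swap in an actual classifier and a human labeling queue.

```python
def fit_threshold(labeled):
    """Toy 1-D model: boundary halfway between the largest labeled negative
    and the smallest labeled positive. Needs one example of each class."""
    max_neg = max(x for x, y in labeled if y == 0)
    min_pos = min(x for x, y in labeled if y == 1)
    return (max_neg + min_pos) / 2


def active_learning(pool, oracle, seed_set, rounds=5, k=5):
    labeled = list(seed_set)
    seen = {x for x, _ in seed_set}
    unlabeled = [x for x in pool if x not in seen]
    for _ in range(rounds):
        threshold = fit_threshold(labeled)                 # (re)train
        unlabeled.sort(key=lambda x: abs(x - threshold))   # most uncertain first
        batch, unlabeled = unlabeled[:k], unlabeled[k:]    # top-K go to annotators
        labeled += [(x, oracle(x)) for x in batch]         # simulated human labels
    return fit_threshold(labeled), labeled


TRUE_BOUNDARY = 0.62  # hidden from the learner; only the oracle knows it
pool = [i / 1000 for i in range(1001)]
threshold, labeled = active_learning(
    pool,
    oracle=lambda x: int(x >= TRUE_BOUNDARY),
    seed_set=[(0.1, 0), (0.9, 1)],
)
print(f"learned boundary ~{threshold:.3f} from {len(labeled)} labels")
```

In this toy case the loop homes in on the boundary with 27 labels out of a pool of about 1,000, because every query lands where the model is least sure.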

Why does this work? Most datasets are redundant: as a rough rule of thumb, around 80% of examples fall in regions the model already classifies confidently after seeing a few hundred examples. The remaining 20% sit near decision boundaries where the model is uncertain. These boundary examples are worth far more for model improvement than easy examples (the often-quoted 10x is a heuristic, not a measured constant). Active learning concentrates the labeling budget on these high-value examples.

Uncertainty sampling is the simplest strategy: pick examples where the model's predicted probability is closest to uniform across classes. For binary classification, this means examples near 50% confidence. For multi-class problems, a common variant (margin sampling) picks examples where the top two predicted classes have similar probabilities.
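Two common ways to score uncertainty are entropy (closeness to uniform) and margin (gap between the top two classes). The sketch below implements both in plain Python; the function names and the sample probabilities are made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy: highest when the prediction is closest to uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin(probs):
    """Gap between the two most likely classes: smallest when they nearly tie."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def most_uncertain(pool_probs, k, method="entropy"):
    """Return indices of the k most uncertain predictions."""
    if method == "entropy":
        key = lambda i: -entropy(pool_probs[i])   # high entropy first
    elif method == "margin":
        key = lambda i: margin(pool_probs[i])     # small margin first
    else:
        raise ValueError(f"unknown method: {method}")
    return sorted(range(len(pool_probs)), key=key)[:k]


pool_probs = [
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.35, 0.25],   # spread across all three classes
    [0.50, 0.49, 0.01],   # near tie between two classes
]
print(most_uncertain(pool_probs, 1, "entropy"))  # [1]
print(most_uncertain(pool_probs, 1, "margin"))   # [2]
```

The two scores can disagree, as they do here: entropy favors the example spread over all classes, while margin favors the two-way near tie.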

Diversity sampling adds a constraint: among uncertain examples, pick ones that are diverse in feature space. This prevents active learning from over-sampling a single cluster of similar ambiguous examples. One way to combine uncertainty and diversity: first filter to the top 20% most uncertain examples, then cluster them with k-means and send one example per cluster (for instance, the one nearest each centroid) to annotators.
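The filter-then-diversify idea can be sketched without any ML library. Note the substitution: instead of k-means, this example uses greedy farthest-point selection, a simpler stand-in that serves the same goal of spreading picks across feature space. The function name and the `top_fraction` parameter are hypothetical.

```python
import math


def select_diverse_uncertain(features, uncertainty, k, top_fraction=0.2):
    """Pick k examples that are both uncertain and spread out in feature space."""
    # Step 1: keep only the most uncertain slice of the pool.
    n_keep = max(k, int(len(features) * top_fraction))
    candidates = sorted(
        range(len(features)), key=lambda i: uncertainty[i], reverse=True
    )[:n_keep]

    # Step 2: greedy farthest-point selection (stand-in for k-means).
    chosen = [candidates[0]]  # start from the single most uncertain example
    while len(chosen) < min(k, len(candidates)):
        remaining = [i for i in candidates if i not in chosen]
        # Take the candidate whose nearest already-chosen point is farthest away.
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(features[i], features[j]) for j in chosen),
        )
        chosen.append(nxt)
    return chosen


features = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (10, 10),
            (5, 5), (5, 5.1), (5.2, 5), (5, 5.3), (5.1, 5.1)]
uncertainty = [0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(select_diverse_uncertain(features, uncertainty, k=2, top_fraction=0.5))  # [0, 4]
```

Pure uncertainty sampling would have returned indices 0 and 1, two nearly identical points from the same cluster.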

[Figure: Active learning vs random sampling cost curve]

Cold start problem: Active learning needs an initial model, which needs initial labels. The seed set should be small (100-500 examples) but representative — use stratified random sampling to cover all classes. A biased seed set creates a biased initial model that asks biased questions, and the active learning loop amplifies the bias.
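A seed set can be stratified even before any labels exist, provided some cheap proxy for the class is available. The sketch below assumes such a proxy (here a hypothetical `source` metadata field) and draws an equal number of examples from each stratum; proportional allocation would be an equally reasonable choice.

```python
import random
from collections import defaultdict


def stratified_seed(examples, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` random examples from every stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[stratum_of(ex)].append(ex)
    sample = []
    for key in sorted(groups):  # sorted for a reproducible order
        members = groups[key]
        sample += rng.sample(members, min(per_stratum, len(members)))
    return sample


# Hypothetical pool: 10% of tickets come from the billing form.
examples = [
    {"id": i, "source": "billing" if i % 10 == 0 else "general"}
    for i in range(100)
]
seed_set = stratified_seed(examples, lambda ex: ex["source"], per_stratum=5)
print(len(seed_set), "seed examples to send for labeling")
```

A plain random sample of 10 from this pool would contain one billing example on average, and often none, which is exactly how a biased seed set arises.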

Interview Tip

In an interview, if asked how to build an ML system with limited labeled data, lead with active learning. It shows you understand that labeling is the bottleneck, not model architecture. The specific pitch: start with 500 random labels, train a baseline, then use uncertainty sampling to get 80% of the performance with 20% of the labels. Interviewers care about cost-aware ML thinking.

Weak Supervision

Weak supervision replaces human annotators with labeling functions: programmatic rules that vote on each example. Instead of paying a human to read 100,000 emails and label them as spam or not-spam, you write a set of labeling functions (say 20), such as:

If the email contains "Nigerian prince," label SPAM.
If the sender's domain is in the company directory, label NOT-SPAM.
If the email has more than 5 links, label SPAM.
If the subject matches a known newsletter pattern, label NOT-SPAM.

Each function is noisy, incomplete, and potentially conflicting. Some examples get multiple votes that disagree. Some get no votes at all (abstain). The key insight of weak supervision (pioneered by Snorkel) is that you do not need any single labeling function to be accurate — you need the ensemble to be accurate.
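The four rules above translate almost directly into code. This is a plain-Python sketch, not Snorkel's API: the email dictionary layout, the `@example.com` domain, and the `[newsletter]` subject prefix are hypothetical stand-ins, and the votes are combined with a simple majority vote as a baseline.

```python
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

COMPANY_DOMAIN = "@example.com"  # hypothetical stand-in for the company directory


def lf_nigerian_prince(email):
    return SPAM if "nigerian prince" in email["body"].lower() else ABSTAIN


def lf_internal_sender(email):
    return NOT_SPAM if email["sender"].endswith(COMPANY_DOMAIN) else ABSTAIN


def lf_many_links(email):
    return SPAM if email["body"].count("http") > 5 else ABSTAIN


def lf_newsletter_subject(email):
    # Hypothetical newsletter pattern: subject starts with "[newsletter]".
    return NOT_SPAM if email["subject"].lower().startswith("[newsletter]") else ABSTAIN


LABELING_FUNCTIONS = [
    lf_nigerian_prince, lf_internal_sender, lf_many_links, lf_newsletter_subject,
]


def majority_vote(email):
    """Combine votes, ignoring abstentions; ties and all-abstain stay unlabeled."""
    votes = [lf(email) for lf in LABELING_FUNCTIONS]
    spam, ham = votes.count(SPAM), votes.count(NOT_SPAM)
    if spam == ham:
        return ABSTAIN
    return SPAM if spam > ham else NOT_SPAM


email = {
    "sender": "prince@scam.test",
    "subject": "Urgent business proposal",
    "body": "I am a Nigerian prince and I need your help.",
}
print(majority_vote(email))  # 1
```

Majority vote treats every function as equally reliable, which is the limitation a learned label model is meant to address.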

A label model learns the accuracy and correlation structure of the labeling functions without any ground truth. It uses the agreement and disagreement patterns between functions to estimate each function's reliability, then produces probabilistic labels by weighting votes accordingly. Functions that agree with many other functions get more weight. Functions that frequently contradict the majority get less.
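To make the weighting idea tangible, here is a heavily simplified stand-in for a label model. It is not Snorkel's algorithm: it estimates each function's accuracy from how often it agrees with the majority vote (with smoothing), then combines votes using log-odds weights. It ignores correlations between functions, which a real label model accounts for.

```python
import math

ABSTAIN = -1


def majority(votes):
    """Majority of the non-abstaining binary votes, or ABSTAIN on a tie."""
    ones, zeros = votes.count(1), votes.count(0)
    if ones == zeros:
        return ABSTAIN
    return 1 if ones > zeros else 0


def estimate_accuracies(vote_matrix):
    """Agreement rate of each function with the majority, Laplace-smoothed."""
    n_lfs = len(vote_matrix[0])
    agree = [1] * n_lfs   # smoothing keeps every estimate strictly in (0, 1)
    total = [2] * n_lfs
    for votes in vote_matrix:
        consensus = majority(votes)
        if consensus == ABSTAIN:
            continue
        for j, vote in enumerate(votes):
            if vote != ABSTAIN:
                total[j] += 1
                agree[j] += int(vote == consensus)
    return [a / t for a, t in zip(agree, total)]


def prob_positive(votes, accuracies):
    """Probabilistic label: weight each vote by the log-odds of its accuracy."""
    log_odds = 0.0
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue
        weight = math.log(acc / (1 - acc))
        log_odds += weight if vote == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))


# Rows are examples, columns are three labeling functions.
vote_matrix = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 1, 1]]
accuracies = estimate_accuracies(vote_matrix)
print([round(a, 2) for a in accuracies])
print(round(prob_positive([1, ABSTAIN, 0], accuracies), 2))
```

The third function disagrees with the consensus on most rows, so its estimated accuracy falls below 0.5 and its votes end up counting slightly against the label it proposes.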

When weak supervision wins: tasks where domain experts can articulate rules ("if X then probably Y") even if no single rule is perfect. Spam detection, content moderation, fraud detection, and medical coding all have rich heuristic knowledge that translates into labeling functions. When weak supervision struggles: tasks requiring holistic judgment (sentiment, aesthetics, humor) where rules cannot capture the decision logic.

Quality Control and Inter-Annotator Agreement

How do you know your labels are good? You cannot check every label manually — that defeats the purpose of scaling annotation. Instead, you measure agreement: have multiple annotators label the same examples and quantify how often they agree. High agreement means your task is well-defined and your guidelines are clear. Low agreement means either the task is genuinely ambiguous or your instructions are unclear — and you need to figure out which.

Cohen's Kappa

Raw agreement percentage is misleading. If your task has two classes and the data is 90% positive, two annotators who each label 90% of examples positive at random, without reading them, still agree about 82% of the time (0.9 × 0.9 + 0.1 × 0.1 = 0.82) by chance alone. Cohen's kappa corrects for this by measuring agreement above what chance alone would produce:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where $p_o$ is the observed agreement and $p_e$ is the expected agreement by chance. A kappa of 1.0 means perfect agreement. A kappa of 0 means agreement no better than random. A kappa below 0 means annotators actively disagree (worse than chance).
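The formula is short enough to compute by hand or in a few lines of code. The sketch below is a plain-Python implementation for two annotators; the example labels are invented to show how a task with 90% raw agreement can still have a low kappa when one class dominates.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same examples."""
    n = len(labels_a)
    # Observed agreement: fraction of examples with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance both pick class c, summed over classes.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:
        return float("nan")  # undefined: both used one identical class throughout
    return (p_o - p_e) / (1 - p_e)


# 100 examples; each annotator labels 90 of them "pos".
annotator_a = ["pos"] * 90 + ["neg"] * 10
annotator_b = ["pos"] * 85 + ["neg"] * 5 + ["pos"] * 5 + ["neg"] * 5
print(round(cohens_kappa(annotator_a, annotator_b), 2))  # 0.44
```

Here $p_o = 0.90$ and $p_e = 0.82$, so kappa is $0.08 / 0.18 \approx 0.44$ despite the annotators agreeing on 90 of 100 examples.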

Interpretation guidelines (working thresholds for this lesson; published scales such as Landis and Koch name the bands differently): kappa above 0.80 is excellent (ready for training). From 0.60 to 0.80 is moderate (investigate disagreements, update guidelines). Below 0.60 is poor (halt labeling and redesign the task or guidelines before proceeding).

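For tasks with more than two annotators, use Fleiss' kappa, which extends chance-corrected agreement to any fixed number of raters per example. Strictly speaking it generalizes Scott's pi rather than Cohen's kappa, so the two statistics can differ slightly on the same two-rater data. The same interpretation thresholds are commonly applied.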
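Fleiss' kappa has the same shape as Cohen's formula, with $p_o$ replaced by the mean per-example agreement across rater pairs and $p_e$ computed from the pooled class proportions. A minimal sketch, assuming every example was labeled by the same number of raters and that the input is a table of per-class vote counts:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] = number of raters who assigned example i to class j.
    Every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every example needs the same number of ratings")

    # Per-example agreement: share of rater pairs that agree on that example.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of votes going to each class.
    n_classes = len(counts[0])
    p_class = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_classes)
    ]
    p_e = sum(p * p for p in p_class)
    if p_e == 1:
        return float("nan")  # undefined: all votes went to a single class
    return (p_bar - p_e) / (1 - p_e)


# Four examples, three raters, two classes.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))            # 1.0
print(round(fleiss_kappa([[2, 1], [1, 2], [2, 1], [1, 2]]), 2))  # -0.33
```

The second table shows a negative value: raters split 2-1 on every example, which is less agreement than the balanced class proportions would produce by chance.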

Quality Control Pipeline

Production labeling pipelines need systematic quality control, not spot checks. A robust pipeline includes:

Redundant labeling: Assign each example to 2-3 annotators (three avoids ties on binary tasks). Use majority vote for the final label. Flag examples with no majority for expert review. This costs 2-3x more per label but catches individual annotator errors.

Gold questions: Inject examples with known correct labels into the annotation stream. Track each annotator's accuracy on gold questions. If an annotator's gold accuracy drops below 80%, flag their work for review. This detects annotator fatigue, misunderstanding, and gaming (clicking randomly to hit volume targets).

Annotator calibration: Before production labeling begins, all annotators label the same 100-200 examples. Compute pairwise kappa. Hold a disagreement review session where annotators discuss their reasoning for disagreed examples. Update guidelines based on the discussion. Repeat until kappa exceeds 0.80.
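The first two pipeline steps above, redundant labeling and gold questions, can be sketched in a few lines. The data layouts (a dict of label lists, and tuples of annotator, example, and label) are assumptions for this example; the 80% threshold comes from the text.

```python
from collections import Counter, defaultdict


def aggregate_votes(labels_by_example):
    """Majority vote per example; examples without a strict majority go to review."""
    final, needs_review = {}, []
    for example_id, labels in labels_by_example.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count > len(labels) / 2:
            final[example_id] = top_label
        else:
            needs_review.append(example_id)
    return final, needs_review


def gold_accuracy(submissions, gold):
    """Per-annotator accuracy on gold questions.

    submissions: iterable of (annotator_id, example_id, label)
    gold: {example_id: known_correct_label}
    """
    correct, seen = defaultdict(int), defaultdict(int)
    for annotator, example_id, label in submissions:
        if example_id in gold:  # ordinary examples are ignored here
            seen[annotator] += 1
            correct[annotator] += int(label == gold[example_id])
    return {annotator: correct[annotator] / seen[annotator] for annotator in seen}


def flag_annotators(accuracy, threshold=0.80):
    return sorted(a for a, acc in accuracy.items() if acc < threshold)


final, review = aggregate_votes({
    "ex1": ["cat", "cat", "dog"],
    "ex2": ["cat", "dog"],
    "ex3": ["cat", "dog", "bird"],
})
print(final, review)  # {'ex1': 'cat'} ['ex2', 'ex3']
```

Requiring a strict majority means a two-annotator disagreement is always escalated rather than resolved arbitrarily.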

Common Pitfall

Low inter-annotator agreement is not always a problem to solve — sometimes it reveals that the task itself is genuinely ambiguous. Sentiment analysis of sarcasm, toxicity of political speech, and aesthetic quality of images have inherent subjectivity. If kappa remains below 0.60 after multiple rounds of guideline improvement, consider whether the task definition needs to change (e.g., from binary toxic/not-toxic to a 5-point toxicity scale) rather than forcing annotators into artificial agreement.

Monitoring Annotator Quality Over Time

Annotator quality drifts. An annotator who starts at 95% accuracy on gold questions may drift to 80% after labeling 5,000 examples due to fatigue, boredom, or gradual reinterpretation of guidelines. Track quality metrics per annotator over time:

Gold question accuracy (rolling 100-question window)
Agreement rate with other annotators
Labeling speed (sudden speedups correlate with quality drops)
Per-class accuracy (an annotator may be accurate overall but consistently wrong on one specific class)

Set automated alerts when any metric crosses a threshold. The response is not punishment — it is recalibration. Pull the annotator offline, review their recent labels together, clarify guidelines, and run a calibration exercise before they resume.
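A rolling-window metric with an automated alert needs very little machinery. The sketch below tracks gold-question accuracy for one annotator over the last 100 answers; the class name and the `min_answers` guard (which avoids alerting on a handful of answers) are assumptions added for this example.

```python
from collections import deque


class RollingGoldMonitor:
    """Tracks one annotator's accuracy on the most recent gold questions."""

    def __init__(self, window=100, threshold=0.80, min_answers=20):
        self.results = deque(maxlen=window)  # oldest answers drop off automatically
        self.threshold = threshold
        self.min_answers = min_answers       # assumption: skip alerts on tiny samples

    def record(self, is_correct):
        """Store one gold-question result; return True if an alert should fire."""
        self.results.append(bool(is_correct))
        return self.needs_recalibration()

    def accuracy(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_recalibration(self):
        if len(self.results) < self.min_answers:
            return False
        return self.accuracy() < self.threshold


monitor = RollingGoldMonitor()
for answered_correctly in [True] * 90 + [False] * 25:
    if monitor.record(answered_correctly):
        print("alert: schedule a recalibration session")
        break
```

The same pattern extends to the other metrics in the list by keeping one window per metric, or one per class for per-class accuracy.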
