Content Moderation and Safety

Topics Covered

Content Classification Systems

Severity Levels and Policy Responses

Confidence-Based Routing

Multi-Modal Moderation

Human Review Workflows

Reviewer Queue Prioritization

Consensus Mechanisms for Ambiguous Cases

Handling Ambiguity

Reviewer Well-Being

Quality Assurance

Automated Moderation at Scale

Common Evasion Tactics

Building Adversarial Robustness

Real-Time Latency Requirements

Batch Moderation for Policy Changes

Language and Cultural Challenges

Feedback Loops and Policy Evolution

The Continuous Improvement Cycle

Policy Changes as ML Infrastructure Events

Legal and Regulatory Requirements

User Appeals Workflow

Measuring Moderation System Health

Content Classification Systems

Facebook processes roughly 3 billion content items per day. Even if each human reviewer handled one item per minute, you would need over 2 million reviewers working nonstop, 24 hours a day, to cover everything manually, and roughly three times that many on 8-hour shifts. That math alone tells you why content moderation is fundamentally an ML problem: the volume of user-generated content on any major platform is orders of magnitude beyond what human review can handle. ML models act as the first line of defense, classifying content at the speed it arrives so that harmful material is caught before it reaches other users.
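The staffing estimate above can be reproduced with a short calculation. This is only a back-of-the-envelope sketch: the item volume and the one-item-per-minute review rate come from the text, while the 8-hour shift length and the function name `reviewers_needed` are assumptions added for illustration.

```python
ITEMS_PER_DAY = 3_000_000_000  # volume quoted in the text


def reviewers_needed(items_per_day, items_per_minute=1, hours_worked_per_day=24):
    """Reviewers required if each one reviews for `hours_worked_per_day` hours a day."""
    items_per_reviewer_per_day = items_per_minute * 60 * hours_worked_per_day
    return items_per_day / items_per_reviewer_per_day


# Reviewers who never stop working (24 hours a day).
nonstop = reviewers_needed(ITEMS_PER_DAY)

# Assumption for illustration: a more realistic 8-hour shift.
eight_hour_shift = reviewers_needed(ITEMS_PER_DAY, hours_worked_per_day=8)

print(f"nonstop reviewers:      {nonstop:,.0f}")
print(f"8-hour-shift reviewers: {eight_hour_shift:,.0f}")
```

Either way the headcount lands in the millions, which is the point the paragraph makes.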

The core task is multi-label classification. A single post can simultaneously be hate speech AND spam AND contain graphic violence. Treating moderation as a single-label problem (pick the worst category) loses information that downstream systems need. A spam post that also contains hate speech requires a different policy response than spam alone. Each label typically has its own binary classifier or shares a multi-task model with label-specific heads, because the features that predict spam (repetitive URLs, engagement bait) are very different from the features that predict hate speech (slurs, dehumanizing language patterns).
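To make the difference between multi-label and single-label output concrete, here is a minimal sketch. It assumes a model that emits one raw logit per label; the label names, the 0.5 threshold, and the example logits are hypothetical. Each label gets its own sigmoid, so several labels can fire at once, whereas a softmax over the same logits forces a single winner.

```python
import math

LABELS = ["hate_speech", "spam", "graphic_violence"]  # hypothetical label set


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_predict(logits, threshold=0.5):
    """Score every label independently; any number of labels may be active."""
    probs = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    active = [label for label, p in probs.items() if p >= threshold]
    return probs, active


def single_label_predict(logits):
    """Pick exactly one label, discarding everything else the model saw."""
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best]


logits = [2.0, 1.5, -3.0]  # made-up scores for one post
probs, active = multi_label_predict(logits)
print(active)                        # ['hate_speech', 'spam']
print(single_label_predict(logits))  # hate_speech
```

The single-label path reports only hate speech and drops the spam signal, which is exactly the information loss described above.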

Tiered content moderation pipeline showing ML scoring, confidence-based routing, and human review

Severity Levels and Policy Responses

Not all violations are equal. A mildly offensive comment might warrant a warning label, while child exploitation material demands immediate removal and law enforcement notification. Most platforms define a severity taxonomy with 3-5 levels, each mapped to a specific policy response:

Severity 1 (low): content is borderline or mildly offensive. Action is a warning label or reduced distribution.
Severity 2 (medium): clear policy violation such as harassment or misinformation. Action is content removal and a notice to the poster.
Severity 3 (high): serious violation like graphic violence or dangerous instructions. Action is immediate removal plus account warning.
Severity 4 (critical): illegal content such as child exploitation or terrorism recruitment. Action is immediate removal, account suspension, and law enforcement referral.

The model outputs both a category label and a severity score, and the policy engine combines these to select the right action. This separation of classification from policy enforcement is critical because policies change frequently (new regulations, new platform rules) while the underlying ML models change slowly.
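One way to picture the separation between classification and policy enforcement is a policy table that lives outside the model. The sketch below is an assumption-laden illustration, not a description of any platform's real engine: the action names paraphrase the severity taxonomy above, and `ModelOutput`, `select_actions`, and the override mechanism are hypothetical.

```python
from dataclasses import dataclass

# Default actions per severity level, paraphrasing the taxonomy above.
POLICY_TABLE = {
    1: ["warning_label_or_reduced_distribution"],
    2: ["remove_content", "notify_poster"],
    3: ["remove_content", "warn_account"],
    4: ["remove_content", "suspend_account", "refer_to_law_enforcement"],
}


@dataclass(frozen=True)
class ModelOutput:
    category: str  # e.g. "harassment"
    severity: int  # 1 (low) to 4 (critical)


def select_actions(output, policy_table=POLICY_TABLE, overrides=None):
    """Map a model output to actions. `overrides` is keyed by (category, severity)."""
    overrides = overrides or {}
    key = (output.category, output.severity)
    if key in overrides:
        return list(overrides[key])
    if output.severity not in policy_table:
        raise ValueError(f"unknown severity level: {output.severity}")
    return list(policy_table[output.severity])


post = ModelOutput(category="misinformation", severity=2)
print(select_actions(post))

# Hypothetical policy change: label misinformation instead of removing it.
# Only the policy data changes; the classifier is untouched.
new_rules = {("misinformation", 2): ["add_context_label"]}
print(select_actions(post, overrides=new_rules))
```

Because the rules are plain data, a policy update becomes a configuration change rather than a model retraining job.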

Confidence-Based Routing

A moderation model that is 99.5 percent confident a post is child exploitation material should trigger immediate automated removal. A model that is 55 percent confident a post is hate speech should route it to a human reviewer, because the cost of a wrong automated decision is too high. This is confidence-based routing: using the model's prediction probability to decide whether to auto-act or escalate.

Distribution of model confidence scores showing auto-action, human review, and auto-allow zones

The thresholds define three zones. Above the upper threshold (say 0.95), the system auto-acts on the content. Below the lower threshold (say 0.30), the system auto-allows the content. Between the two thresholds, content enters a human review queue. Tuning these thresholds is one of the most consequential decisions in a moderation system: lowering the auto-act threshold catches more harmful content but increases false positives; raising the auto-allow threshold reduces reviewer workload but lets more borderline content through.
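The three zones can be sketched as a small routing function. The default thresholds are the example values from the text (0.95 and 0.30); the names `Route` and `route`, the sample scores, and the choice to send scores that sit exactly on a threshold to human review are assumptions made for this illustration.

```python
from collections import Counter
from enum import Enum


class Route(Enum):
    AUTO_ACT = "auto_act"
    HUMAN_REVIEW = "human_review"
    AUTO_ALLOW = "auto_allow"


def route(score, lower=0.30, upper=0.95):
    """Route one item by model confidence that it violates policy."""
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")
    if score > upper:
        return Route.AUTO_ACT
    if score < lower:
        return Route.AUTO_ALLOW
    # Everything in between, including exact boundary values, goes to a person.
    return Route.HUMAN_REVIEW


scores = [0.995, 0.55, 0.10, 0.97, 0.28, 0.80, 0.05, 0.40]  # made-up scores

default_zones = Counter(route(s) for s in scores)
# Raising the auto-allow threshold shrinks the review queue,
# at the price of letting more borderline items through unreviewed.
relaxed_zones = Counter(route(s, lower=0.45) for s in scores)

print(default_zones[Route.HUMAN_REVIEW], relaxed_zones[Route.HUMAN_REVIEW])
```

Counting items per zone on a sample of real traffic is a cheap way to estimate reviewer workload before changing a threshold in production.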

Key Insight

The precision-recall trade-off in content moderation has asymmetric costs. A false negative (missing harmful content) can cause real-world harm to users and regulatory liability. A false positive (removing benign content) silences legitimate speech and erodes user trust. Most platforms bias toward higher recall for severe categories (catch everything, even at the cost of some false positives) and higher precision for low-severity categories (only act when confident, to avoid over-censoring).
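The effect of asymmetric costs on threshold choice can be illustrated with a tiny validation set. Everything numeric here is invented for the example: the scores, the labels, and the cost weights are hypothetical, and `best_threshold` is a simple grid search rather than a production tuning procedure. The idea is only that a high false-negative cost pulls the threshold down (favoring recall), while a high false-positive cost pushes it up (favoring precision).

```python
def total_cost(scores, labels, threshold, fn_cost, fp_cost):
    """Sum the cost of every mistake made when flagging scores >= threshold."""
    cost = 0.0
    for score, is_harmful in zip(scores, labels):
        flagged = score >= threshold
        if is_harmful and not flagged:
            cost += fn_cost  # harmful content slipped through
        elif not is_harmful and flagged:
            cost += fp_cost  # benign content was actioned
    return cost


def best_threshold(scores, labels, fn_cost, fp_cost):
    """Grid-search thresholds 0.05 .. 0.95; ties resolve to the lowest threshold."""
    candidates = [i / 20 for i in range(1, 20)]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, fn_cost, fp_cost))


# Toy validation data: model scores and ground-truth labels (1 = harmful).
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Severe category: a miss is assumed to be 100x worse than a wrongful removal.
severe_t = best_threshold(scores, labels, fn_cost=100, fp_cost=1)
# Low-severity category: a wrongful removal is assumed to be 5x worse than a miss.
mild_t = best_threshold(scores, labels, fn_cost=1, fp_cost=5)

print(severe_t, mild_t)  # the severe category ends up with the lower threshold
```

In practice the cost weights are policy decisions, so they belong with the policy configuration rather than inside the model.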

Multi-Modal Moderation

Text is only one content type. Images require vision models (often fine-tuned ResNets or Vision Transformers) trained on labeled datasets of harmful imagery. Video requires frame sampling strategies because processing every frame at 30 fps is computationally prohibitive, so systems typically sample key frames and use temporal models to detect transitions from benign to harmful content. Audio moderation applies speech-to-text first, then runs text classifiers on the transcript, plus separate audio classifiers for non-speech signals like gunshots or screaming.
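As a rough illustration of why sampling matters for video, the sketch below picks frame indices at a fixed time interval. Note the substitution: this is plain uniform sampling, a simpler stand-in for real key-frame detection, and the function name and the one-second interval are assumptions.

```python
def sample_frame_indices(total_frames, fps=30.0, seconds_between_samples=1.0):
    """Return the indices of frames to classify, one every N seconds."""
    if total_frames < 0 or fps <= 0 or seconds_between_samples <= 0:
        raise ValueError("total_frames must be >= 0; fps and interval must be > 0")
    step = max(1, round(fps * seconds_between_samples))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps has 300 frames.
indices = sample_frame_indices(total_frames=300, fps=30.0)
print(len(indices), "of 300 frames would be sent to the image classifier")
```

Sampling one frame per second cuts classifier calls by a factor of 30 for this clip, but it can miss harmful content that appears for less than a second, which is one reason the text pairs sampling with temporal models.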

The hardest cases are multi-modal: a meme where the image is innocuous and the text is innocuous, but the combination is harmful. A photo of a historical figure with the caption "we need to finish what he started" looks benign in each modality on its own but can be clearly harmful when the two are read together. These cases require fusion models that jointly process text and image embeddings, which are significantly more expensive to train and deploy than unimodal classifiers.
