Alternatively train multi task learning model in pytorch - weight updating

Multi-task Learning

PyTorch

Machine Learning

Weight Updating

Model Training

Alternatively train multi task learning model in pytorch - weight updating

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

In multi-task learning (MTL), alternating training means updating different task heads (and possibly shared backbone) on different steps instead of summing all losses every batch. This is useful when tasks have imbalanced data volume, conflicting gradients, or different convergence speeds.

A correct alternating strategy requires explicit control over which losses contribute to gradients at each step, and which parameters are frozen or updated. Without this, one task can dominate and destabilize shared representations.

Core Sections

1. Build shared backbone and task-specific heads

python

1import torch
2import torch.nn as nn
3
4class MTLModel(nn.Module):
5    def __init__(self, in_dim, hidden, n_cls):
6        super().__init__()
7        self.backbone = nn.Sequential(
8            nn.Linear(in_dim, hidden),
9            nn.ReLU()
10        )
11        self.cls_head = nn.Linear(hidden, n_cls)
12        self.reg_head = nn.Linear(hidden, 1)
13
14    def forward(self, x):
15        z = self.backbone(x)
16        return self.cls_head(z), self.reg_head(z)

2. Alternate optimization steps by schedule

python

1model = MTLModel(32, 128, 5)
2opt = torch.optim.Adam(model.parameters(), lr=1e-3)
3ce = nn.CrossEntropyLoss()
4mse = nn.MSELoss()
5
6for step, batch in enumerate(loader):
7    x, y_cls, y_reg = batch
8    opt.zero_grad()
9
10    pred_cls, pred_reg = model(x)
11
12    if step % 2 == 0:
13        loss = ce(pred_cls, y_cls)      # classification step
14    else:
15        loss = mse(pred_reg.squeeze(-1), y_reg)  # regression step
16
17    loss.backward()
18    opt.step()

This updates shared layers with alternating task signals.

3. Freeze selective heads per step

If you want to avoid irrelevant head updates each step:

python

def set_requires_grad(module, value):
    for p in module.parameters():
        p.requires_grad = value

Toggle head gradients around each step and keep backbone trainable.

4. Balance update frequencies

Alternating 1:1 may not be optimal. If task A is harder, use 2:1 or dynamic schedules based on validation metrics.

5. Track per-task metrics separately

Monitor each task loss, not only aggregate curves. Alternating schedules can hide regressions if only one combined metric is observed.

Common Pitfalls

Summing losses unintentionally when attempting to alternate tasks.
Forgetting zero_grad() between steps and accumulating stale gradients.
Letting one task update far more often without explicit scheduling rationale.
Ignoring per-task validation and optimizing only global training loss.
Freezing/unfreezing parameters incorrectly and silently blocking intended updates.

Summary

Alternating multi-task training in PyTorch is a practical strategy when tasks compete or have different learning dynamics. Implement it by explicitly choosing which loss backpropagates per step, controlling parameter update scope, and balancing task schedules thoughtfully. With separate metric tracking and clear optimization rules, alternating updates can improve stability and task parity over naive joint-loss training.

A practical way to make this guidance durable is to convert it into a small runbook that includes prerequisites, expected environment versions, and a short verification sequence. Even strong teams lose time when troubleshooting steps live only in memory or chat history. A runbook should explicitly answer three questions: what to check first, what output confirms healthy behavior, and what output indicates a known failure mode. This level of clarity helps both experienced maintainers and newer contributors, and it reduces repeated investigation during incidents.

It is also valuable to create a tiny reproducible fixture for this topic. The fixture can be a minimal script, test case, sample request, or small dataset that demonstrates the correct behavior in isolation. When regressions appear after dependency upgrades, infrastructure changes, or framework migrations, that fixture becomes the fastest way to isolate whether the issue is environmental or logic-related. Keeping a focused fixture in source control gives you a stable benchmark across branches and release cycles.

For long-term reliability, pair documentation with one automated guardrail in CI. The guardrail should be narrow and fast: an import check, schema validation, endpoint contract test, deterministic unit test, or lightweight performance threshold. Avoid broad flaky checks that hide real signals. The goal is early, actionable feedback before code reaches production. If the same category of issue appears repeatedly, promote the manual troubleshooting step into automation so the system catches it first. Over time, this shifts effort from reactive debugging to preventive quality control and keeps the knowledge article relevant in real engineering workflows.