Getting aligned val_loss and train_loss plots for each epoch in WandB instead of separate plots

Introduction

If train_loss and val_loss appear on separate WandB charts, the problem is usually not the chart itself. The issue is that the two metrics were logged on different steps, so WandB treats them as separate time series instead of two values on the same epoch axis.

Log both metrics against the same epoch

The most reliable fix is to compute one training loss value per epoch, compute one validation loss value per epoch, and log both in the same wandb.log() call.

python
import torch
import wandb

# model, criterion, optimizer, train_loader, val_loader, and num_epochs
# are assumed to be defined earlier in your training script.
run = wandb.init(project="loss-alignment-demo")

for epoch in range(1, num_epochs + 1):
    train_total = 0.0

    model.train()
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        train_total += loss.item()

    # Mean of the per-batch training losses for this epoch.
    train_loss = train_total / len(train_loader)

    model.eval()
    val_total = 0.0
    with torch.no_grad():
        for inputs, targets in val_loader:
            outputs = model(inputs)
            val_total += criterion(outputs, targets).item()

    # Mean of the per-batch validation losses for this epoch.
    val_loss = val_total / len(val_loader)

    # One log call per epoch keeps both losses on the same step.
    run.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
    })

# Close the run explicitly because no context manager is used here.
run.finish()

Because the values are logged together, WandB stores them under the same history step. In the UI, you can add both metrics to one line chart and compare them directly.
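The loop above averages the per-batch losses, which gives a short final batch the same weight as a full one. If that matters for your data, a sample-weighted epoch mean is one option. The sketch below assumes the criterion returns a mean loss per batch; the helper name epoch_mean_loss and the sample numbers are made up for illustration.

```python
def epoch_mean_loss(batch_losses, batch_sizes):
    """Return the sample-weighted mean of per-batch mean losses."""
    if len(batch_losses) != len(batch_sizes):
        raise ValueError("batch_losses and batch_sizes must have equal length")
    total_samples = sum(batch_sizes)
    if total_samples == 0:
        raise ValueError("cannot average over zero samples")
    weighted_sum = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return weighted_sum / total_samples


# Illustrative values: two full batches of 32 samples and a final batch of 8.
losses = [0.50, 0.40, 0.90]
sizes = [32, 32, 8]

weighted = epoch_mean_loss(losses, sizes)
unweighted = sum(losses) / len(losses)
print(f"weighted={weighted:.4f} unweighted={unweighted:.4f}")
```

Either value can be logged as train_loss; what matters for alignment is that exactly one value per epoch is logged for each metric.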

Use a custom x-axis when epoch is what matters

WandB uses its internal step counter by default, and each log() call advances that counter. If your code logs training metrics per batch and validation metrics per epoch, the series drift apart. In that case, define epoch as the step metric for both losses.

python
import wandb

# train_one_epoch(...) and validate(...) stand in for your own functions.
with wandb.init(project="loss-alignment-demo") as run:
    # Declare "epoch" as the x-axis for both losses.
    run.define_metric("epoch")
    run.define_metric("train_loss", step_metric="epoch")
    run.define_metric("val_loss", step_metric="epoch")

    for epoch in range(1, num_epochs + 1):
        train_loss = train_one_epoch(...)
        val_loss = validate(...)

        run.log({
            "epoch": epoch,
            "train_loss": train_loss,
            "val_loss": val_loss,
        })

This tells WandB to plot both metrics against the same epoch axis even if your run logs other batch-level metrics elsewhere.
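As a rough sketch of how batch-level and epoch-level metrics might coexist, the example below gives each group its own x-axis key. The metric names batch_step and batch/train_loss are assumptions chosen for illustration, and RecordingRun is a hypothetical stand-in that only records calls so the example runs without a WandB account. In a real script you would pass the object returned by wandb.init() instead.

```python
def configure_axes(run):
    """Declare which logged key acts as the x-axis for each metric group."""
    run.define_metric("epoch")
    run.define_metric("batch_step")
    # Epoch-level losses are plotted against "epoch".
    run.define_metric("train_loss", step_metric="epoch")
    run.define_metric("val_loss", step_metric="epoch")
    # Everything under "batch/" is plotted against "batch_step".
    run.define_metric("batch/*", step_metric="batch_step")


def log_batch(run, batch_step, loss):
    run.log({"batch_step": batch_step, "batch/train_loss": loss})


def log_epoch(run, epoch, train_loss, val_loss):
    run.log({"epoch": epoch, "train_loss": train_loss, "val_loss": val_loss})


class RecordingRun:
    """Hypothetical stand-in for a WandB run; it only records the calls."""

    def __init__(self):
        self.defined = {}
        self.rows = []

    def define_metric(self, name, step_metric=None):
        self.defined[name] = step_metric

    def log(self, data, commit=True):
        self.rows.append(dict(data))


run = RecordingRun()  # swap in wandb.init(project=...) in a real script
configure_axes(run)

batch_step = 0
for epoch in range(1, 3):
    for loss in (0.9, 0.7):  # made-up per-batch losses
        batch_step += 1
        log_batch(run, batch_step, loss)
    log_epoch(run, epoch, train_loss=0.8, val_loss=0.85)

epoch_rows = [row for row in run.rows if "epoch" in row]
print(epoch_rows)
```

Keeping the batch metrics under a separate name prefix means the epoch-level train_loss series only ever receives one point per epoch.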

If you must log in two separate calls

Sometimes the training loop and validation loop live in different functions. You can still align the plots, but you must keep the step value identical and prevent WandB from incrementing the step between calls.

python
import wandb

# train_one_epoch(...) and validate(...) stand in for your own functions.
with wandb.init(project="loss-alignment-demo") as run:
    run.define_metric("epoch")
    run.define_metric("train_loss", step_metric="epoch")
    run.define_metric("val_loss", step_metric="epoch")

    for epoch in range(1, num_epochs + 1):
        train_loss = train_one_epoch(...)
        # commit=False holds this value until the next committing log call.
        run.log({"epoch": epoch, "train_loss": train_loss}, commit=False)

        val_loss = validate(...)
        # This call commits both losses together.
        run.log({"epoch": epoch, "val_loss": val_loss})

Passing commit=False on the first call holds its values without advancing the step, so the second call writes both losses to the same history entry.
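Another way to handle separate training and validation functions is to collect their results in a small buffer and emit a single log call per epoch. The sketch below is one possible shape for that. EpochBuffer, ListRun, fake_train, and fake_validate are hypothetical names, and ListRun only records rows so the example runs without WandB installed.

```python
class EpochBuffer:
    """Collects metrics from separate functions and emits them in one log call."""

    def __init__(self, run):
        self._run = run
        self._pending = {}

    def add(self, **metrics):
        # Later values overwrite earlier ones that share a name.
        self._pending.update(metrics)

    def flush(self, epoch):
        row = {"epoch": epoch, **self._pending}
        self._run.log(row)
        self._pending = {}
        return row


class ListRun:
    """Hypothetical stand-in for a WandB run; replace with wandb.init(...)."""

    def __init__(self):
        self.rows = []

    def log(self, data):
        self.rows.append(dict(data))


def fake_train(buffer):
    buffer.add(train_loss=0.42)  # placeholder value


def fake_validate(buffer):
    buffer.add(val_loss=0.47)  # placeholder value


run = ListRun()
buffer = EpochBuffer(run)
for epoch in range(1, 3):
    fake_train(buffer)
    fake_validate(buffer)
    buffer.flush(epoch)

print(run.rows)
```

With this pattern there is only one log call per epoch, so there is no ordering between partial calls to get wrong.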

Common Pitfalls

The most common mistake is mixing batch-level train_loss with epoch-level val_loss. If one metric is logged hundreds of times per epoch and the other only once, the chart will not line up the way you expect.
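To see why the two series fail to line up, here is a toy model of the default step counter. It is not WandB's implementation; it simply assumes that every committed log call receives the next integer step, starting from 0.

```python
def simulate_steps(num_epochs, batches_per_epoch):
    """Toy model: every log call gets the next integer step."""
    step = 0
    train_steps, val_steps = [], []
    for _ in range(num_epochs):
        for _ in range(batches_per_epoch):
            train_steps.append(step)  # one train_loss log call per batch
            step += 1
        val_steps.append(step)  # one val_loss log call per epoch
        step += 1
    return train_steps, val_steps


train_steps, val_steps = simulate_steps(num_epochs=2, batches_per_epoch=3)
print("train_loss steps:", train_steps)
print("val_loss steps:  ", val_steps)
print("shared steps:    ", sorted(set(train_steps) & set(val_steps)))
```

Under this model the two metrics never share a step, which is why they cannot be compared point by point on the default axis.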

Another issue is calling wandb.log() multiple times without a shared step or custom step metric. Each call advances WandB's internal step counter unless you pass commit=False or an explicit step value.

Be careful with metric names too. Small naming differences such as val/loss versus val_loss create separate series and can make the dashboard look inconsistent.
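A quick check over the metric names used in a codebase can catch these near-duplicates early. The helper below is a hypothetical utility, not part of WandB. It treats names as colliding when they differ only by letter case or by separator characters, which is an assumption you may want to tighten or relax.

```python
import re
from collections import defaultdict


def find_name_collisions(metric_names):
    """Group names that differ only by case or by separator (/, _, -, ., space)."""
    groups = defaultdict(list)
    for name in metric_names:
        key = re.sub(r"[/_\-.\s]+", "_", name.strip().lower())
        groups[key].append(name)
    # Keep only groups that contain more than one distinct spelling.
    return {key: names for key, names in groups.items() if len(set(names)) > 1}


logged = ["train_loss", "val_loss", "val/loss", "Val_Loss", "epoch"]
collisions = find_name_collisions(logged)
print(collisions)
```

Each returned group lists spellings that probably refer to the same metric and would otherwise appear as separate series.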

Finally, if you are using a framework integration such as Keras, PyTorch Lightning, or TensorBoard sync, check whether that integration is already choosing a step metric for you. Manual logging and auto-logging can conflict if they track the same concept differently.

It is also worth verifying the chart settings in the WandB workspace. If the panel's x-axis is still set to the default step instead of epoch, the logged data may be correct while the chart remains misleading.

Summary

  • Log train_loss and val_loss with the same epoch value if you want them on one plot.
  • The simplest approach is one wandb.log() call per epoch containing both metrics.
  • Use run.define_metric(..., step_metric="epoch") when epoch should be the chart axis.
  • If separate log calls are unavoidable, keep the step identical and use commit=False on the first call.
  • Most alignment problems come from inconsistent logging frequency, not from the WandB dashboard itself.
