Evaluation

Validation already runs during training, but some plotting callbacks are slow to execute, so it can be useful to run a full validation pass on a saved checkpoint separately, without resuming training. Typical uses are computing metrics on a held-out period, regenerating diagnostic plots, or benchmarking a model before deployment.

anemoi-training evaluate --config-name <config>

This runs one complete validation epoch using the same Hydra configuration as training, so all data loading, normalisation, and diagnostics callbacks behave identically. No optimizer state is created and no gradients are computed.

Warning

A checkpoint must be specified via training.run_id, training.fork_run_id, or system.input.warm_start. Omitting all three raises a RuntimeError immediately — evaluation on a randomly-initialised model is almost certainly a user error.

Differences from train

The evaluator reuses AnemoiTrainer for all setup steps (datamodule, graph, model, callbacks, loggers, strategy), but replaces the final trainer.fit() call with trainer.validate(). Key behavioural differences:

  • limit_val_batches, set from config.dataloader.limit_batches.validation, controls how many validation batches are run.

  • Arguments that only apply to training — max_epochs, max_steps, gradient_clip_val, accumulate_grad_batches, etc. — are not passed to the evaluator trainer.

  • DDP model wrapping is skipped: Lightning’s DDPStrategy only wraps the model in DistributedDataParallel during fit(), not validate(), because there are no gradients to reduce. The strategy handles this transparently — hardware and communication groups are set up as normal.

  • Checkpointing and weight-averaging callbacks are automatically disabled (see below).
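
The argument filtering described in the list above can be pictured with a small sketch. The helper name evaluation_trainer_kwargs and the TRAINING_ONLY_ARGS set are hypothetical, chosen for illustration only; the set contains just the four arguments named above, and the actual evaluator may drop others as well.

```python
# Hypothetical sketch, not the anemoi-training implementation.
# Only the four training-only arguments named in the text are listed.
TRAINING_ONLY_ARGS = frozenset(
    {"max_epochs", "max_steps", "gradient_clip_val", "accumulate_grad_batches"}
)


def evaluation_trainer_kwargs(train_kwargs: dict) -> dict:
    """Return a copy of the trainer arguments without training-only keys."""
    return {k: v for k, v in train_kwargs.items() if k not in TRAINING_ONLY_ARGS}


# Example trainer arguments (values are made up for illustration).
train_kwargs = {
    "max_epochs": 100,
    "gradient_clip_val": 32.0,
    "limit_val_batches": 10,
    "num_nodes": 2,
}
eval_kwargs = evaluation_trainer_kwargs(train_kwargs)
print(eval_kwargs)
```

Arguments that still matter for validation, such as limit_val_batches and the hardware settings, pass through unchanged.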

Checkpoint loading

A checkpoint source must be configured before evaluation starts. Three cases are recognised, in priority order:

  1. system.input.warm_start: load from an explicit file path. Raises FileNotFoundError if the file does not exist. Takes precedence over run_id / fork_run_id when both are set.

  2. training.run_id or training.fork_run_id: resolve the last checkpoint automatically as <checkpoints.root>/<run_id>/last.ckpt. Raises RuntimeError if the file is not found.

  3. Neither set: raises RuntimeError immediately with a descriptive message (unlike training, where a fresh start is valid).
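
As a rough illustration of the priority order above, the resolution logic might look like the following sketch. The function resolve_checkpoint and its parameter names are hypothetical and do not correspond to a documented anemoi-training API. The sketch also assumes that fork_run_id is preferred over run_id when both are given, which the text above does not specify.

```python
from pathlib import Path
from typing import Optional


def resolve_checkpoint(
    checkpoints_root: Path,
    warm_start: Optional[str] = None,
    run_id: Optional[str] = None,
    fork_run_id: Optional[str] = None,
) -> Path:
    """Hypothetical sketch of the three cases; not the library's own code."""
    # Case 1: an explicit file path takes precedence over any run id.
    if warm_start is not None:
        path = Path(warm_start)
        if not path.is_file():
            raise FileNotFoundError(f"Warm-start checkpoint not found: {path}")
        return path

    # Case 2: resolve <checkpoints.root>/<run_id>/last.ckpt.
    # Assumption: fork_run_id wins over run_id when both are set.
    resolved_id = fork_run_id or run_id
    if resolved_id is not None:
        path = checkpoints_root / resolved_id / "last.ckpt"
        if not path.is_file():
            raise RuntimeError(f"No last.ckpt found for run {resolved_id}: {path}")
        return path

    # Case 3: nothing configured, which is an error for evaluation.
    raise RuntimeError(
        "Evaluation requires a checkpoint: set training.run_id, "
        "training.fork_run_id, or system.input.warm_start."
    )
```

Failing early in case 3 mirrors the behaviour described above: unlike training, evaluation has no valid fresh-start path.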

Once a checkpoint path is resolved, two loading modes are available:

  • load_weights_only: True (recommended for evaluation). Model weights are loaded once during model initialisation; ckpt_path=None is passed to trainer.validate() to avoid a redundant second load and to skip restoring optimizer/scheduler state.

  • load_weights_only: False. PyTorch Lightning restores the full training state (weights, optimizer, epoch counter) before validation.
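
The two loading modes differ only in what is handed to trainer.validate() as ckpt_path. A minimal sketch of that decision, using a hypothetical helper name that is not part of the library:

```python
from typing import Optional


def validate_ckpt_path(checkpoint_path: str, load_weights_only: bool) -> Optional[str]:
    """Hypothetical helper: choose the ckpt_path value for trainer.validate().

    With load_weights_only=True the weights were already loaded when the model
    was built, so None is returned and Lightning restores nothing further.
    Otherwise the path is returned so Lightning restores the full state.
    """
    return None if load_weights_only else checkpoint_path


# The checkpoint path below is a placeholder for illustration.
weights_only = validate_ckpt_path("/path/to/last.ckpt", load_weights_only=True)
full_state = validate_ckpt_path("/path/to/last.ckpt", load_weights_only=False)
print(weights_only, full_state)
```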

Checkpointing and weight averaging

Checkpointing callbacks (AnemoiCheckpoint) and weight-averaging callbacks (SWA / EMA) are automatically disabled during evaluation regardless of what the diagnostics config says. Evaluation is a read-only operation on a trained model and should never write new checkpoint files or update model weights.
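
One way to picture this is as a filter over the configured callback list. The sketch below uses stand-in classes so that it runs without any dependencies; the WeightAveraging and PlotSample classes and the evaluation_callbacks helper are hypothetical, and the real evaluator may achieve the same effect differently, for example by never instantiating these callbacks.

```python
class Callback:
    """Stand-in for a Lightning callback base class (illustration only)."""


class AnemoiCheckpoint(Callback):
    """Stand-in for the checkpointing callback named in the text."""


class WeightAveraging(Callback):
    """Hypothetical stand-in for an SWA / EMA callback."""


class PlotSample(Callback):
    """Hypothetical stand-in for a diagnostics plotting callback."""


# Callback types that would write checkpoints or modify weights.
DISABLED_DURING_EVALUATION = (AnemoiCheckpoint, WeightAveraging)


def evaluation_callbacks(callbacks: list) -> list:
    """Keep only callbacks that are safe for a read-only validation pass."""
    return [cb for cb in callbacks if not isinstance(cb, DISABLED_DURING_EVALUATION)]


configured = [AnemoiCheckpoint(), PlotSample(), WeightAveraging()]
kept = evaluation_callbacks(configured)
print([type(cb).__name__ for cb in kept])
```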

Config and CLI overrides

anemoi-training evaluate works exactly like anemoi-training train for config selection and Hydra overrides. Pass --config-name to select a config file and any Hydra overrides as positional arguments:

anemoi-training evaluate \
    --config-name evaluate_ana_short \
    training.run_id=<run_id>

You can also reuse an existing config and override individual keys on the command line, without writing a dedicated evaluation config file:

anemoi-training evaluate \
    --config-name debug_ana_short \
    training.run_id=<run_id> \
    training.load_weights_only=true \
    dataloader.limit_batches.validation=10 \
    diagnostics.plot.enabled=true
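
When evaluations are launched from a script, for example to sweep over several runs, the same command line can be assembled programmatically. The sketch below is only an illustration: build_evaluate_command is a hypothetical helper, and the run id abc123 is a placeholder. It formats booleans and None the way Hydra overrides expect (true, false, null).

```python
import shlex


def build_evaluate_command(config_name: str, overrides: dict) -> list:
    """Hypothetical helper: build the argument list for `anemoi-training evaluate`."""
    cmd = ["anemoi-training", "evaluate", "--config-name", config_name]
    for key, value in overrides.items():
        if isinstance(value, bool):
            text = str(value).lower()  # Hydra booleans: true / false
        elif value is None:
            text = "null"  # Hydra null
        else:
            text = str(value)
        cmd.append(f"{key}={text}")
    return cmd


cmd = build_evaluate_command(
    "debug_ana_short",
    {
        "training.run_id": "abc123",  # placeholder run id
        "training.load_weights_only": True,
        "dataloader.limit_batches.validation": 10,
        "diagnostics.plot.enabled": True,
    },
)
# Print a shell-ready command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```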

A minimal evaluation config that pairs with a training config is shown below. It inherits the same defaults and overrides only the evaluation-specific keys:

# Minimal Hydra override file for `anemoi-training evaluate`.
#
# A checkpoint MUST be specified (Pattern A or B below) or the command will
# raise a RuntimeError before evaluation starts.
#
# Usage:
#   anemoi-training evaluate --config-dir /path/to/overrides --config-name config_evaluate
#
# Or pass individual keys as CLI overrides:
#   anemoi-training evaluate training.run_id=<run_id>

# ── Checkpoint (REQUIRED — choose one pattern) ───────────────────────────────

# Pattern A: last.ckpt of a known run_id (most common)
training:
  run_id: ???               # run_id of the completed training run
  load_weights_only: True   # only restore model weights; skips a redundant second
                            # load by Lightning and avoids restoring optimiser state

# Pattern B: point directly at a checkpoint file
# warm_start takes precedence over run_id when both are set.
# system:
#   input:
#     warm_start: /path/to/epoch=10-step=5000.ckpt
# training:
#   load_weights_only: True

# ── Diagnostics ───────────────────────────────────────────────────────────────
diagnostics:
  # Checkpointing and weight averaging are automatically disabled by the
  # evaluator — evaluation is a read-only operation on a trained model.
  enable_checkpointing: False  # listed here for clarity; always overridden
  enable_progress_bar: True
  print_memory_summary: False

  plot:
    enabled: True            # generate diagnostic plots during validation
    asynchronous: True       # plot in a background thread

  log:
    mlflow:
      enabled: True          # log metrics and plots to an existing MLflow experiment

# ── Data ──────────────────────────────────────────────────────────────────────
dataloader:
  limit_batches:
    validation: null         # set to an integer to cap the number of batches

Distributed evaluation

The evaluator supports multi-GPU and multi-node evaluation via the same DDPGroupStrategy / DDPEnsGroupStrategy strategies used during training. Set the hardware configuration as usual:

system:
  hardware:
    num_gpus_per_node: 4
    num_nodes: 2

The hidden script entry point .anemoi-training-evaluate is registered alongside .anemoi-training-train so that Lightning’s interactive DDP can spawn rank > 0 processes correctly. Note that DDP wrapping is not applied during validation (Lightning only wraps the model for fit()), but communication groups and sharding are set up identically to training.