Evaluation
While validation can run during training, some plotting callbacks are slow to execute, so it can be desirable to run a full validation pass on a saved checkpoint, decoupled from training. Typical uses are computing metrics on a held-out period, regenerating diagnostic plots, or benchmarking a model before deployment, all without resuming training.
anemoi-training evaluate --config-name <config>
This runs one complete validation epoch using the same Hydra configuration as training, so all data loading, normalisation, and diagnostics callbacks behave identically. No optimizer state is created and no gradients are computed.
Warning
A checkpoint must be specified via training.run_id,
training.fork_run_id, or system.input.warm_start. Omitting
all three raises a RuntimeError immediately — evaluation on a
randomly-initialised model is almost certainly a user error.
Differences from train
The evaluator reuses AnemoiTrainer
for all setup steps (datamodule, graph, model, callbacks, loggers,
strategy), but replaces the final trainer.fit() call with
trainer.validate(). Key behavioural differences:
- limit_val_batches controls how many batches to run (config.dataloader.limit_batches.validation).
- Arguments that only apply to training (max_epochs, max_steps, gradient_clip_val, accumulate_grad_batches, etc.) are not passed to the evaluator trainer.
- DDP model wrapping is skipped: Lightning’s DDPStrategy only wraps the model in DistributedDataParallel during fit(), not validate(), because there are no gradients to reduce. The strategy handles this transparently; hardware and communication groups are set up as normal.
- Checkpointing and weight-averaging callbacks are automatically disabled (see below).
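The argument filtering described in the list above could look roughly like the following. This is a minimal sketch, not the anemoi-training implementation: the helper name `evaluation_trainer_kwargs` and the dict layout are hypothetical, and the set of training-only keys contains only the four arguments named in the text (the original list ends in "etc.").

```python
# Hypothetical sketch: the key names mirror the arguments listed above, but the
# helper and the dict layout are illustrative, not the anemoi-training API.
TRAINING_ONLY_ARGS = frozenset(
    {"max_epochs", "max_steps", "gradient_clip_val", "accumulate_grad_batches"}
)


def evaluation_trainer_kwargs(train_kwargs, limit_val_batches=None):
    """Return trainer keyword arguments suitable for a validation-only run."""
    # Drop every argument that only makes sense while fitting.
    kwargs = {k: v for k, v in train_kwargs.items() if k not in TRAINING_ONLY_ARGS}
    # Cap the number of validation batches only when a limit was requested.
    if limit_val_batches is not None:
        kwargs["limit_val_batches"] = limit_val_batches
    return kwargs


train_kwargs = {
    "accelerator": "gpu",
    "devices": 4,
    "max_epochs": 100,
    "gradient_clip_val": 32.0,
    "accumulate_grad_batches": 2,
}
eval_kwargs = evaluation_trainer_kwargs(train_kwargs, limit_val_batches=10)
```

Here `eval_kwargs` keeps the hardware settings and gains `limit_val_batches`, while the optimisation-related keys are removed.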
Checkpoint loading
A checkpoint source must be configured before evaluation starts. Three cases are recognised, in priority order:
1. ``system.input.warm_start``: load from an explicit file path. Raises FileNotFoundError if the file does not exist. Takes precedence over run_id / fork_run_id when both are set.
2. ``training.run_id`` or ``training.fork_run_id``: resolve the last checkpoint automatically as <checkpoints.root>/<run_id>/last.ckpt. Raises RuntimeError if the file is not found.
3. Neither set: raises RuntimeError immediately with a descriptive message (unlike training, where a fresh start is valid).
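The priority order above can be sketched as a small resolver. This is an illustration under stated assumptions rather than the library's code: the function name `resolve_checkpoint` and its signature are hypothetical, and the choice to prefer `fork_run_id` over `run_id` when both are set is an assumption, since the text does not specify it.

```python
from pathlib import Path


def resolve_checkpoint(checkpoints_root, warm_start=None, run_id=None, fork_run_id=None):
    """Resolve the checkpoint to evaluate, following the priority order above."""
    # Case 1: an explicit file path wins over any run ID.
    if warm_start:
        path = Path(warm_start)
        if not path.is_file():
            raise FileNotFoundError(f"warm_start checkpoint not found: {path}")
        return path

    # Case 2: fall back to <checkpoints_root>/<id>/last.ckpt.
    # Assumption: fork_run_id is preferred when both IDs are set.
    resolved_id = fork_run_id or run_id
    if resolved_id:
        path = Path(checkpoints_root) / resolved_id / "last.ckpt"
        if not path.is_file():
            raise RuntimeError(f"No last.ckpt found at {path}")
        return path

    # Case 3: nothing configured, so fail early instead of evaluating random weights.
    raise RuntimeError(
        "Evaluation needs a checkpoint: set training.run_id, "
        "training.fork_run_id, or system.input.warm_start."
    )
```

Failing in the third case before any model is built keeps the error cheap and close to the configuration mistake that caused it.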
Once a checkpoint path is resolved, two loading modes are available:
- ``load_weights_only: True`` (recommended for evaluation): model weights are loaded once during model initialisation; ckpt_path=None is passed to trainer.validate() to avoid a redundant second load and to skip restoring optimizer/scheduler state.
- ``load_weights_only: False``: PyTorch Lightning restores the full training state (weights, optimizer, epoch counter) before validation.
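The two modes differ only in what is handed to `trainer.validate()` as `ckpt_path`. A minimal sketch of that decision, using a hypothetical helper name `validate_ckpt_path`, might be:

```python
def validate_ckpt_path(checkpoint_path, load_weights_only):
    """Return the value to pass as ckpt_path to trainer.validate()."""
    # Weights-only: the model already holds the weights from initialisation,
    # so Lightning is told not to load anything.
    if load_weights_only:
        return None
    # Full restore: Lightning loads weights, optimizer state and epoch counter.
    return str(checkpoint_path)
```

With `load_weights_only=True` the helper returns `None`, which matches the behaviour described for the recommended mode.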
Checkpointing and weight averaging
Checkpointing callbacks (AnemoiCheckpoint)
and weight-averaging callbacks (SWA / EMA) are automatically disabled
during evaluation regardless of what the diagnostics config says.
Evaluation is a read-only operation on a trained model and should never
write new checkpoint files or update model weights.
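One way to picture this filtering is a pass over the configured callbacks that drops the write-capable ones. The sketch below uses stand-in classes so that it runs without anemoi-training or Lightning installed; the class bodies, `PlotCallback`, and the helper `evaluation_callbacks` are hypothetical, and the real evaluator may disable these callbacks differently.

```python
# Stand-in classes for illustration only; the real callbacks live in
# anemoi-training and PyTorch Lightning.
class AnemoiCheckpoint:
    """Placeholder for the checkpointing callback."""


class StochasticWeightAveraging:
    """Placeholder for a weight-averaging (SWA) callback."""


class PlotCallback:
    """Placeholder for any diagnostic plotting callback."""


# Callback types that write checkpoints or modify weights.
DISABLED_FOR_EVALUATION = (AnemoiCheckpoint, StochasticWeightAveraging)


def evaluation_callbacks(callbacks):
    """Keep only callbacks that are safe in a read-only evaluation run."""
    return [cb for cb in callbacks if not isinstance(cb, DISABLED_FOR_EVALUATION)]


kept = evaluation_callbacks(
    [AnemoiCheckpoint(), PlotCallback(), StochasticWeightAveraging()]
)
```

After filtering, `kept` contains only the plotting callback.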
Config and CLI overrides
anemoi-training evaluate works exactly like anemoi-training train
for config selection and Hydra overrides. Pass --config-name to select
a config file and any Hydra overrides as positional arguments:
anemoi-training evaluate \
    --config-name evaluate_ana_short \
    training.run_id=<run_id>
You can also override individual keys without a dedicated config file:
anemoi-training evaluate \
    --config-name debug_ana_short \
    training.run_id=<run_id> \
    training.load_weights_only=true \
    dataloader.limit_batches.validation=10 \
    diagnostics.plot.enabled=true
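When evaluation is scripted, for instance to loop over several runs, the same command line can be assembled programmatically. The helper below is a hypothetical convenience and not part of anemoi-training; it only builds the argument list from the `--config-name` flag and the `key=value` override syntax shown above.

```python
import shlex


def evaluate_command(config_name, overrides):
    """Build the argv list for `anemoi-training evaluate` with Hydra overrides."""
    argv = ["anemoi-training", "evaluate", "--config-name", config_name]
    # Each override becomes one positional key=value argument.
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


argv = evaluate_command(
    "debug_ana_short",
    {
        "training.run_id": "<run_id>",  # placeholder, replace with a real run ID
        "training.load_weights_only": "true",
        "dataloader.limit_batches.validation": 10,
    },
)
# Print a shell-quoted version for inspection or logging.
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues with the override values.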
A minimal evaluation config that pairs with a training config is shown below. It inherits the same defaults and overrides only the evaluation-specific keys:
# Minimal Hydra override file for `anemoi-training evaluate`.
#
# A checkpoint MUST be specified (Pattern A or B below) or the command will
# raise a RuntimeError before evaluation starts.
#
# Usage:
#   anemoi-training evaluate --config-dir /path/to/overrides --config-name config_evaluate
#
# Or pass individual keys as CLI overrides:
#   anemoi-training evaluate training.run_id=<run_id>

# ── Checkpoint (REQUIRED — choose one pattern) ───────────────────────────────
# Pattern A: last.ckpt of a known run_id (most common)
training:
  run_id: ???              # run_id of the completed training run
  load_weights_only: True  # only restore model weights; skips a redundant second
                           # load by Lightning and avoids restoring optimiser state
# Pattern B: point directly at a checkpoint file
# warm_start takes precedence over run_id when both are set.
# system:
#   input:
#     warm_start: /path/to/epoch=10-step=5000.ckpt
# training:
#   load_weights_only: True
# ── Diagnostics ───────────────────────────────────────────────────────────────
diagnostics:
  # Checkpointing and weight averaging are automatically disabled by the
  # evaluator: evaluation is a read-only operation on a trained model.
  enable_checkpointing: False  # listed here for clarity; always overridden
  enable_progress_bar: True
  print_memory_summary: False
  plot:
    enabled: True       # generate diagnostic plots during validation
    asynchronous: True  # plot in a background thread
  log:
    mlflow:
      enabled: True  # log metrics and plots to an existing MLflow experiment
# ── Data ──────────────────────────────────────────────────────────────────────
dataloader:
  limit_batches:
    validation: null  # set to an integer to cap the number of batches
Distributed evaluation
The evaluator supports multi-GPU and multi-node evaluation via the same
DDPGroupStrategy / DDPEnsGroupStrategy strategies used during
training. Set the hardware configuration as usual:
system:
  hardware:
    num_gpus_per_node: 4
    num_nodes: 2
The hidden script entry point .anemoi-training-evaluate is
registered alongside .anemoi-training-train so that Lightning’s
interactive DDP can spawn rank > 0 processes correctly. Note that DDP
wrapping is not applied during validation (Lightning only wraps the model
for fit()), but communication groups and sharding are set up
identically to training.