Train

The GraphForecaster and AnemoiTrainer classes define the training process for the neural network model. The GraphForecaster is the LightningModule that implements the model task, while AnemoiTrainer controls the training run and calls the training function.

Forecaster

The different model tasks are reflected in different forecasters:

  1. Deterministic Forecasting (GraphForecaster)

  2. Ensemble Forecasting (GraphEnsForecaster)

  3. Time Interpolation (GraphInterpolator)

The GraphForecaster object in forecaster.py is responsible for the forward pass of the model itself. The key functions in the forecaster that users may want to adapt to their own applications are (see the sketch after this list):

  • advance_input, which defines how the model iterates forward in forecast time

  • _step, where the forward pass of the model happens both during training and validation
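
As an illustration, a custom task can be built by subclassing GraphForecaster and overriding these hooks. This is a minimal sketch only: the signature assumed for advance_input and the way it is overridden here are assumptions that may differ between anemoi-training versions.

from anemoi.training.train.forecaster.forecaster import GraphForecaster


class MyForecaster(GraphForecaster):
    """Hypothetical subclass customising how the input window is advanced."""

    def advance_input(self, x, y_pred, batch, rollout_step):
        # Assumed signature: delegate to the parent implementation, which
        # rolls the input window forward in forecast time using the latest
        # prediction, then apply any custom logic on top.
        x = super().advance_input(x, y_pred, batch, rollout_step)
        return x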

AnemoiTrainer in train.py is the object that controls the training of the model. It also provides functions that let the user profile the training run (profiler.py).

class anemoi.training.train.forecaster.forecaster.GraphForecaster(*, config: BaseSchema, graph_data: HeteroData, truncation_data: dict, statistics: dict, data_indices: IndexCollection, metadata: dict, supporting_arrays: dict)

Bases: LightningModule

Graph neural network forecaster for PyTorch Lightning.

forward(x: Tensor) → Tensor

Run the model forward pass (same role as torch.nn.Module.forward()).

Parameters:
  • x (torch.Tensor) – Input tensor.

Returns:

The model's output tensor

static get_loss_function(config: DictConfig, scalars: dict[str, tuple[int | tuple[int, ...] | Tensor]] | None = None, **kwargs) → BaseWeightedLoss | ModuleList

Get loss functions from config.

Can be ModuleList if multiple losses are specified.

Parameters:
  • config (DictConfig) – Loss function configuration, should include scalars if scalars are to be added to the loss function.

  • scalars (dict[str, tuple[Union[int, tuple[int, ...], torch.Tensor]]] | None) – Scalars which can be added to the loss function. Defaults to None. If a scalar is to be applied to the loss, it must be listed under scalars in the loss config. For instance, if scalars: ['variable'] is set in the config and 'variable' is a key of this dict, 'variable' will be added to the scalars of the loss function (see the example below).

  • kwargs (Any) – Additional arguments to pass to the loss function

Returns:

The loss function to use for training

Return type:

BaseWeightedLoss | torch.nn.ModuleList

Raises:
  • TypeError – If the configured loss is not a subclass of BaseWeightedLoss

  • ValueError – If a requested scalar is not found among the valid scalars
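
For illustration, the sketch below builds a loss from a config. The _target_ path, the option names, and the scalar values here are assumptions for this example and may differ between anemoi-training versions and configs.

import torch
from omegaconf import OmegaConf

from anemoi.training.train.forecaster.forecaster import GraphForecaster

# Hypothetical loss config (the _target_ path and keys are assumed).
loss_config = OmegaConf.create(
    {
        "_target_": "anemoi.training.losses.mse.WeightedMSELoss",
        "scalars": ["variable"],
    }
)

# Per the type hint above, each scalar is a (dimension(s), tensor) pair.
variable_weights = torch.ones(98)  # assumed: one weight per output variable
scalars = {"variable": (-1, variable_weights)}

loss = GraphForecaster.get_loss_function(loss_config, scalars=scalars)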

training_weights_for_imputed_variables(batch: Tensor) None

Update the loss weights mask for imputed variables.

rollout_step(batch: torch.Tensor, rollout: int | None = None, training_mode: bool = True, validation_mode: bool = False) → Generator[tuple[torch.Tensor | None, dict, list], None, None]

Rollout step for the forecaster.

Will run pre_processors on batch, but not post_processors on predictions.

Parameters:
  • batch (torch.Tensor) – Batch to use for rollout

  • rollout (Optional[int], optional) – Number of rollout steps; by default None, in which case self.rollout is used.

  • training_mode (bool, optional) – Whether to run in training mode and calculate the loss, by default True. If False, the yielded loss is None.

  • validation_mode (bool, optional) – Whether to run in validation mode and calculate validation metrics, by default False. If False, the yielded metrics dict is empty.

Yields:

Generator[tuple[Union[torch.Tensor, None], dict, list], None, None] – Loss value, metrics, and predictions (per step)
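
A minimal sketch of consuming the generator, e.g. from a custom training_step; self is assumed to be a GraphForecaster, and the override shown here is illustrative, not the library's own implementation.

def training_step(self, batch, batch_idx):
    # Sum the loss over all rollout steps (illustrative only).
    total_loss = None
    for loss, metrics, _preds in self.rollout_step(batch, training_mode=True):
        total_loss = loss if total_loss is None else total_loss + loss
    return total_loss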

allgather_batch(batch: Tensor) → Tensor

Allgather the batch-shards across the reader group.

Parameters:

batch (torch.Tensor) – Batch-shard of current reader rank

Returns:

Allgathered (full) batch

Return type:

torch.Tensor

calculate_val_metrics(y_pred: Tensor, y: Tensor, rollout_step: int) → tuple[dict, list[Tensor]]

Calculate metrics on the validation output.

Parameters:
  • y_pred (torch.Tensor) – Predicted ensemble

  • y (torch.Tensor) – Ground truth (target).

  • rollout_step (int) – Rollout step

Returns:

val_metrics, preds – Validation metrics and predictions

Return type:

tuple[dict, list[torch.Tensor]]

training_step(batch: Tensor, batch_idx: int) → Tensor

Here you compute and return the training loss and some additional metrics, e.g. for the progress bar or logger.

Parameters:
  • batch – The output of your data iterable, normally a DataLoader.

  • batch_idx – The index of this batch.

  • dataloader_idx – The index of the dataloader that produced this batch. (only if multiple dataloaders used)

Returns:

  • Tensor - The loss tensor

  • dict - A dictionary which can include any keys, but must include the key 'loss' in the case of automatic optimization.

  • None - In automatic optimization, this will skip to the next batch (but is not supported for multi-GPU, TPU, or DeepSpeed). For manual optimization, this has no special meaning, as returning the loss is not required.

In this step you’d normally do the forward pass and calculate the loss for a batch. You can also do fancier things like multiple forward passes or something model specific.

Example:

def training_step(self, batch, batch_idx):
    x, y, z = batch
    out = self.encoder(x)
    loss = self.loss(out, x)
    return loss

To use multiple optimizers, you can switch to ‘manual optimization’ and control their stepping:

def __init__(self):
    super().__init__()
    self.automatic_optimization = False


# Multiple optimizers (e.g.: GANs)
def training_step(self, batch, batch_idx):
    opt1, opt2 = self.optimizers()

    # do training_step with encoder
    ...
    opt1.step()
    # do training_step with decoder
    ...
    opt2.step()

Note

When accumulate_grad_batches > 1, the loss returned here will be automatically normalized by accumulate_grad_batches internally.

lr_scheduler_step(scheduler: CosineLRScheduler, metric: None = None) → None

Step the learning rate scheduler; called by PyTorch Lightning.

Parameters:
  • scheduler (CosineLRScheduler) – Learning rate scheduler object.

  • metric (Optional[Any]) – Metric object, e.g. for ReduceLROnPlateau. Default is None.

on_train_epoch_end() → None

Called in the training loop at the very end of the epoch.

To access all batch outputs at the end of the epoch, you can cache step outputs as an attribute of the LightningModule and access them in this hook:

class MyLightningModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.training_step_outputs = []

    def training_step(self):
        loss = ...
        self.training_step_outputs.append(loss)
        return loss

    def on_train_epoch_end(self):
        # do something with all training_step outputs, for example:
        epoch_mean = torch.stack(self.training_step_outputs).mean()
        self.log("training_epoch_mean", epoch_mean)
        # free up the memory
        self.training_step_outputs.clear()

validation_step(batch: Tensor, batch_idx: int) → None

Calculate the loss over a validation batch using the training loss function.

Parameters:
  • batch (torch.Tensor) – Validation batch

  • batch_idx (int) – Batch index

configure_optimizers() → tuple[list[Optimizer], list[dict]]

Configure the optimizers and learning rate scheduler.

Returns:

List of optimizers and list of dictionaries containing the learning rate scheduler

Return type:

tuple[list[torch.optim.Optimizer], list[dict]]

class anemoi.training.train.forecaster.ensforecaster.GraphEnsForecaster(*, config: DictConfig, graph_data: HeteroData, truncation_data: dict, statistics: dict, data_indices: dict, metadata: dict, supporting_arrays: dict)

Bases: GraphForecaster

Graph neural network forecaster for ensembles for PyTorch Lightning.

forward(x: Tensor, fcstep: int) → Tensor

Run the ensemble model forward pass (same role as torch.nn.Module.forward()).

Parameters:
  • x (torch.Tensor) – Input tensor.

  • fcstep (int) – Forecast step.

Returns:

The model's output tensor

gather_and_compute_loss(y_pred: torch.Tensor, y: torch.Tensor, loss: torch.nn.Module, nens_per_device: int, ens_comm_group_size: int, ens_comm_group: ProcessGroup, model_comm_group: ProcessGroup, return_pred_ens: bool = False) → tuple[torch.Tensor, torch.Tensor, torch.Tensor | None]

Gather the ensemble members from all devices in my group.

Eliminate duplicates (if any) and compute the loss.

Parameters:
  • y_pred (torch.Tensor) – Predicted state tensor, calculated on self.device

  • y (torch.Tensor) – Ground truth

  • loss (torch.nn.Module) – Loss function

  • nens_per_device (int) – Number of ensemble members per device

  • ens_comm_group_size (int) – Size of the ensemble communication group

  • ens_comm_group (ProcessGroup) – Ensemble communication process group

  • model_comm_group (ProcessGroup) – Model communication process group

  • return_pred_ens (bool) – Validation flag: if True, return the predicted ensemble (post-gather)

Returns:

  • loss_inc – Loss

  • y_pred_ens – Gathered ensemble predictions, returned only in validation mode

rollout_step(batch: torch.Tensor, rollout: int | None = None, training_mode: bool = True, validation_mode: bool = False) → Generator[tuple[torch.Tensor | None, dict, list], None, None]

Rollout step for the forecaster.

Will run pre_processors on batch, but not post_processors on predictions.

Parameters:
  • batch (torch.Tensor) – Batch to use for rollout

  • rollout (Optional[int], optional) – Number of rollout steps; by default None, in which case self.rollout is used.

  • training_mode (bool, optional) – Whether to run in training mode and calculate the loss, by default True. If False, the yielded loss is None.

  • validation_mode (bool, optional) – Whether to run in validation mode and calculate validation metrics, by default False. If False, the yielded metrics dict is empty.

Yields:

Generator[tuple[Union[torch.Tensor, None], dict, list], None, None] – Loss value, metrics, and predictions (per step)

Returns:

None

Return type:

None

training_step(batch: tuple[Tensor, ...], batch_idx: int) → Tensor | dict

Run one training step.

Parameters:
  • batch (tuple) – Batch data; a tuple of length 1 or 2. batch[0]: analysis, shape (bs, multi_step + rollout, nvar, latlon). batch[1] (optional, with ensemble): EDA perturbations, shape (multi_step, nens_per_device, nvar, latlon). See the sketch after this entry.

  • batch_idx (int) – Training batch index

Returns:

train_loss – Training loss

Return type:

torch.Tensor
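
For orientation, the batch layout above can be sketched with dummy tensors; the dimension sizes below are arbitrary placeholders, not values from the library.

import torch

# Arbitrary placeholder sizes, named after the shape description above.
bs, multi_step, rollout = 2, 2, 1
nens_per_device, nvar, latlon = 4, 80, 1024

analysis = torch.randn(bs, multi_step + rollout, nvar, latlon)
eda_perturbations = torch.randn(multi_step, nens_per_device, nvar, latlon)

batch = (analysis, eda_perturbations)  # or (analysis,) without perturbations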

validation_step(batch: tuple[Tensor, ...], batch_idx: int) → tuple[Tensor, Tensor]

Perform a validation step.

Parameters:
  • batch (tuple) – Batch data; a tuple of length 1 or 2. batch[0]: analysis, shape (bs, multi_step + rollout, nvar, latlon). batch[1] (optional): EDA perturbations, shape (nens_per_device, multi_step, nvar, latlon).

  • batch_idx (int) – Validation batch index

Returns:

Tuple containing the validation loss, the predictions, and the ensemble initial conditions

Return type:

tuple[torch.Tensor, torch.Tensor]

class anemoi.training.train.forecaster.interpolator.GraphInterpolator(*, config: DictConfig, graph_data: HeteroData, statistics: dict, data_indices: IndexCollection, metadata: dict, supporting_arrays: dict)

Bases: GraphForecaster

Graph neural network interpolator for PyTorch Lightning.

forward(x: Tensor, target_forcing: Tensor) → Tensor

Run the interpolator forward pass (same role as torch.nn.Module.forward()).

Parameters:
  • x (torch.Tensor) – Input tensor.

  • target_forcing (torch.Tensor) – Forcing for the target interpolation time(s).

Returns:

The model's output tensor

Trainer

The AnemoiTrainer object in train.py is responsible for calling the training function.

class anemoi.training.train.train.AnemoiTrainer(config: DictConfig)

Bases: object

Utility class for training the model.

property datamodule: Any

DataModule instance and DataSets.

property data_indices: dict

Returns a dictionary of data indices.

This is used to slice the data.

property initial_seed: int

Initial seed for the RNG.

This sets the same initial seed for all ranks. Ranks are re-seeded in the strategy to account for model communication groups.

property graph_data: HeteroData

Graph data.

Creates the graph in all workers.

property truncation_data: dict

Truncation data.

Loads truncation data.

property model: LightningModule

Provide the model instance.

property run_id: str

Unique identifier for the current run.

property wandb_logger: WandbLogger

WandB logger.

property mlflow_logger: MLFlowLogger

MLflow logger.

property tensorboard_logger: TensorBoardLogger

TensorBoard logger.

property last_checkpoint: str | None

Path to the last checkpoint.

property metadata: dict

Metadata and provenance information.

property profiler: PyTorchProfiler | None

Returns a PyTorch profiler object, if profiling is enabled.

train() → None

Training entry point.
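
A minimal sketch of driving a training run programmatically; it assumes a fully composed Hydra/OmegaConf configuration, whose exact schema depends on your anemoi-training version (runs are typically launched through the package's command-line entry point instead).

from omegaconf import DictConfig

from anemoi.training.train.train import AnemoiTrainer


def run_training(config: DictConfig) -> None:
    # Instantiate the trainer with a composed config and start training.
    trainer = AnemoiTrainer(config)
    trainer.train()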