Training

The Anemoi Training module is the heart of the framework where machine learning models for weather forecasting are trained. This section will guide you through the entire training process, from setting up your data to configuring your model and executing the training pipeline.

Setup Steps

Anemoi Training requires two primary components to get started:

Steps 1 and 2:

  1. Graph Definition from Anemoi Graphs: This defines the graph structure on which your machine learning model operates, i.e. the nodes and connections (edges) used during training.

  2. Dataset from Anemoi Datasets: This provides the training data that will be fed into the model. The dataset should be pre-processed and formatted according to the specifications of the Anemoi Datasets module.

These two steps are outlined in Preparing training components.
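Once both components exist, the training configuration needs to point at them. A minimal sketch of the relevant hardware section is shown below; the graph-related keys are assumptions inferred from the interpolation patterns used elsewhere on this page (e.g. ${hardware.paths.data}/${hardware.files.dataset}), so check them against your own config:

hardware:
   paths:
      data: /path/to/datasets/     # directory containing the dataset from anemoi-datasets
      graph: /path/to/graphs/      # assumed key: directory containing the graph file
   files:
      dataset: my-dataset.zarr     # dataset built with anemoi-datasets
      graph: my-graph.pt           # assumed key: graph built with anemoi-graphs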

Step 3: Configure the Training Process

Once your graph definition and dataset are ready, you can configure the training process. Anemoi Training allows you to adjust various parameters such as learning rate, batch size, number of epochs, and other hyperparameters that control the training behavior.

To configure the training:

  • Specify the training parameters in your configuration file or through the command line interface.

  • Replace all “missing” (???) values in the config with the appropriate values for your training setup.

  • Choose the model task and model type from Models.

  • Optionally, customize additional components like the normaliser or optimization strategies to enhance model performance.

Parallelization

Anemoi Training supports different parallelization strategies based on the training task (see Strategy):

  • DDPGroupStrategy: Used for deterministic training tasks

  • DDPEnsGroupStrategy: Used for ensemble training tasks

The chosen strategy must be consistent with the model task specified in the configuration.
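As an illustration, the strategy is selected via its target class in the configuration. The module path and keys below are assumptions based on the class names above and may differ in your version:

strategy:
   _target_: anemoi.training.distributed.strategy.DDPGroupStrategy  # or DDPEnsGroupStrategy for ensemble tasks
   num_gpus_per_model: 1  # assumed key: how many GPUs each model instance is sharded across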

Step 4: Set Up Experiment Tracking (Optional)

Experiment tracking is an essential aspect of machine learning development, allowing you to keep track of various runs, compare model performances, and reproduce results. Anemoi Training can be easily integrated with popular experiment tracking tools like TensorBoard, MLflow or Weights & Biases (W&B).

These different tools provide various features such as visualizing training metrics, logging hyperparameters, and storing model checkpoints. You can choose the tool that best fits your workflow and set it up to track your training experiments.

To set up experiment tracking:

  1. Install the desired experiment tracking tool (e.g., TensorBoard, MLflow, or W&B).

  2. Configure the tool in your training configuration file or through the command line interface.

  3. Start the experiment tracking server and monitor your training runs in real-time.
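As a hedged sketch, the loggers are typically switched on under the diagnostics section of the configuration; the exact keys shown here (diagnostics.log.*) are assumptions and may differ between versions:

diagnostics:
   log:
      tensorboard:
         enabled: False
      wandb:
         enabled: False
      mlflow:
         enabled: True
         tracking_uri: https://my-mlflow-server.example   # hypothetical server address
         experiment_name: my-experiment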

Step 5: Execute Training

With everything set up, you can now execute the training process. Anemoi Training will use the graph definition and dataset to train your model according to the specified configuration.

To execute training:

  • Run the training command given below, ensuring that all paths to the graph definition and dataset are correctly specified.

  • Monitor the training process, adjusting parameters as needed to optimize model performance.

  • Upon completion, the trained model will be registered and stored for further use.

Make sure you have a GPU available and simply call:

anemoi-training train
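Since the configuration uses Hydra-style placeholders (??? and ${...} interpolation), individual values can usually be overridden directly on the command line rather than editing the config file. The keys below come from examples on this page; treat the override syntax as an assumption for your installation:

anemoi-training train dataloader.batch_size.training=4 training.lr.rate=5e-4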

Data Routing

Anemoi Training uses the Anemoi Datasets module to load the data.

Anemoi training implements data routing, in which you can specify which variables are used as forcings (input only) and which as diagnostics (output only, i.e. predicted by the model but never fed back in as input). All remaining variables are treated as prognostic, i.e. they appear as both inputs and outputs.

Forcings are variables such as solar insolation or the land-sea mask, which would make little sense to predict as they are external to the model. They can be static (like the land-sea mask) or dynamic (like solar insolation). Note that within anemoi, forcing does not have the classical NWP meaning of external variables which impact the model, such as wind forcing applied to an ocean model. Instead, forcing here refers to any variable which is an input only; in some cases this includes ‘traditional’ forcings alongside other variables.

Diagnostics include variables like precipitation that we want to predict, but which may not be available at forecast step zero due to technical limitations. They can also include derived quantities which we wish the model to predict directly but do not want to use as inputs.

Prognostic variables are the variables like temperature or humidity that we want to predict and appear as both inputs and outputs.

The user can specify the routing of the data by setting config.data.forcings and config.data.diagnostics. These are lists of variable names, as Anemoi Datasets enables us to address variables by name. Any variable in the dataset which is not listed as either a forcing or a diagnostic (or dropped, see Dataloader below) will be classed as a prognostic variable.

data:
   forcings:
      - solar_insolation
      - land_sea_mask
   diagnostics:
      - total_precipitation

Data Modules

Anemoi Training provides different data modules to handle various model tasks:

  • AnemoiDatasetDataModule: Standard data module for deterministic training

  • AnemoiEnsDatasetsDataModule: Specialized data module for ensemble training. It also allows for training with perturbed initial conditions.

The choice of data module depends on your training task and input data requirements.

Dataloader

The dataloader file specifies how many worker processes are used and the batch size for training, validation and testing. num_workers sets the number of parallel workers each dataloader uses to prepare batches.

# number of dataloader worker processes per stage
num_workers:
   training: 8
   validation: 8
   test: 8
# batch size used for each stage
batch_size:
   training: 2
   validation: 4
   test: 4

# limit the number of batches used per epoch (null = no limit)
limit_batches:
   training: null
   validation: null
   test: 20

The grid points being modelled are also defined. In many cases this will be the full grid. For limited area modelling, you may want to define a set of target indices which mask/remove some grid points, leaving only the area being modelled.

# set a custom mask for grid points.
# Useful for LAM (dropping unconnected nodes from forcing dataset)
grid_indices:
   _target_: anemoi.training.data.grid_indices.FullGrid
   nodes_name: ${graph.data}

The dataloader file also describes the files used for training, validation and testing, and the data split. For machine learning, we separate our data into: training data, used to train the model; validation data, used to assess various versions of the model throughout the development process; and test data, used to assess the final version of the model. Best practice is to separate the data in time, ensuring the validation and test data are suitably independent from the training data.

We define the start and end time of each section of the data. This can be given as a full date, as just the year, or as year and month; in the latter cases the first day of the month/year is used.

The dataset used, and the frequency, can be set separately for the different parts of the split, for example if the test data is stored in a different file.

By default, every variable within the dataset is used. If this is not desired, variables can be listed under drop and they won’t be used. Conversely, if only a few variables from the file are needed, select can be used in place of drop, and only the listed variables are used. The same overall set of variables must be used throughout training, validation and test; if different files containing different variables are used, the items listed in drop/select may vary. An example using select is given after the config below.

dataset: ${hardware.paths.data}/${hardware.files.dataset}

training:
  dataset: ${dataloader.dataset}
  start: null
  end: 2020
  frequency: ${data.frequency}
  drop:  []

validation_rollout: 1 # number of rollout steps to use for validation; must be equal to or greater than the rollout expected by callbacks

validation:
  dataset: ${dataloader.dataset}
  start: 2021-01-01
  end: 2021
  frequency: ${data.frequency}
  drop:  []

test:
  dataset: ${dataloader.dataset}
  start: 2022-01
  end: null
  frequency: ${data.frequency}
  drop:  []
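If only a handful of variables from a file are needed, select can be used in place of drop, as described above. A brief illustration with hypothetical variable names:

training:
  dataset: ${dataloader.dataset}
  start: null
  end: 2020
  frequency: ${data.frequency}
  # use only the listed variables from this file
  select:
     - 2t
     - 10u
     - 10v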

Normalisation

Machine learning models are sensitive to the scale of the input data. To ensure that the model can learn effectively, it is important to normalise the input data so that all variables exhibit a similar range; this gives each variable a comparable contribution to the loss function.

The normaliser is one of several ‘preprocessors’ within anemoi. It implements multiple strategies that can be applied to the data via the config. Currently, the normaliser supports the following strategies:

  • none: No normalisation is applied.

  • mean-std: Data is normalised by subtracting the mean and dividing by the standard deviation

  • std: Data is normalised by dividing by the standard deviation.

  • min-max: Data is normalised by subtracting the min value and dividing by the range.

  • max: Data is normalised by dividing by the max value.

Values like the land-sea mask do not require additional normalisation as they already span the range 0 to 1. Variables like temperature or humidity are usually normalised using mean-std. Some variables, like the geopotential height, should be max-normalised so that the ‘zero’ point and the proportional distance from it are retained.

The user can specify the normalisation strategy by choosing a default method, and additionally specifying specific cases for certain variables within config.data.normaliser:

normaliser:
   default: mean-std
   none:
      - land_sea_mask
   max:
      - geopotential_height

An additional option in the normaliser overwrites the statistics of one variable with those of another. This is primarily used for convective precipitation (cp), which is a fraction of total precipitation (tp): by overwriting the cp statistics with the tp statistics, we ensure the fractional relationship remains intact in the normalised space. Note that this is a design choice.

normaliser:
   remap:
     cp: tp

Imputer

It is important to have no missing values (e.g. NaNs) in the data when training a model, as they will break the backpropagation of gradients and cause the model to predict only NaNs. For fields which contain missing values, we provide options to replace these values via an “imputer”. During training, NaN values are replaced with the specified value for the field. The default imputer is “none”, which means no imputation is performed. The user can specify the imputer by setting processors.imputer in the data/zarr.yaml file. It is common to impute with the mean value, ensuring that the variable value over NaNs becomes zero after mean-std normalisation. Another option is to impute with a given constant.

The DynamicInputImputer can be used for fields where the NaN locations change in time.

imputer:
   default: "none"
   mean:
      - 2t

processors:
   imputer:
      _target_: anemoi.models.preprocessing.imputer.InputImputer
      _convert_: all
      config: ${data.imputer}
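For fields whose NaN locations change in time, the DynamicInputImputer mentioned above can be used as the processor target instead. The module path below simply mirrors that of InputImputer and is an assumption:

processors:
   imputer:
      _target_: anemoi.models.preprocessing.imputer.DynamicInputImputer  # assumed path, analogous to InputImputer
      _convert_: all
      config: ${data.imputer}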

Loss Functions

Anemoi Training supports various loss functions for different training tasks and easily allows for custom loss functions to be added.

training_loss:
   _target_: anemoi.training.losses.mse.WeightedMSELoss
   # class kwargs

The choice of loss function depends on the model task and the desired properties of the forecast.

For ensemble training, the following loss functions are available:

  • Kernel CRPS: Continuous Ranked Probability Score using kernel density estimation

  • AlmostFairKernelCRPS: A variant of Kernel CRPS which accounts for the number of ensemble members used.
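By analogy with the WeightedMSELoss example above, an ensemble loss is selected by pointing training_loss at the corresponding class. The module path below is an assumption and should be checked against your installation:

training_loss:
   _target_: anemoi.training.losses.kcrps.AlmostFairKernelCRPS  # assumed path; KernelCRPS is configured the same way
   # class kwargs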

Loss function scaling

It is possible to change the weighting given to each of the variables in the loss function by changing config.training.variable_loss_scaling.pl.<pressure level variable> and config.training.variable_loss_scaling.sfc.<surface variable>.

It is also possible to change the scaling given to the pressure levels using config.training.pressure_level_scaler. For almost all applications, upper atmosphere pressure levels should be given lower weighting than the lower atmosphere pressure levels (i.e. pressure levels nearer to the surface). By default anemoi-training uses a ReLU Pressure Level scaler with a minimum weighting of 0.2 (i.e. no pressure level has a weighting less than 0.2).
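A hedged illustration of how these two options might look in the training config; the variable names and numeric values are examples only, and the scaler class path is an assumption:

variable_loss_scaling:
   default: 1
   pl:
      q: 0.6      # example: down-weight specific humidity
      t: 6        # example: up-weight temperature
   sfc:
      2t: 3.5     # example surface-variable weighting
pressure_level_scaler:
   _target_: anemoi.training.data.scaling.ReluPressureLevelScaler  # assumed path
   minimum: 0.2   # no pressure level weighted below 0.2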

The loss is also scaled by assigning a weight to each node on the output grid. These weights are calculated during graph-creation and stored as an attribute in the graph object. The configuration option config.training.node_loss_weights is used to specify the node attribute used as weights in the loss function. By default anemoi-training uses area weighting, where each node is weighted according to the size of the geographical area it represents.

It is also possible to rescale the weight of a subset of nodes after they are loaded from the graph. For instance, for a stretched grid setup we can rescale the weight of nodes in the limited area such that their sum equals 0.25 of the sum of all node weights, with the following config setup:

node_loss_weights:
   _target_: anemoi.training.losses.nodeweights.ReweightedGraphNodeAttribute
   target_nodes: data
   scaled_attribute: cutout
   weight_frac_of_total: 0.25

Learning rate

Anemoi training uses the CosineLRScheduler from timm (pytorch-image-models) as its learning rate scheduler. Documentation for this scheduler can be found at https://github.com/huggingface/pytorch-image-models/blob/main/timm/scheduler/cosine_lr.py. The user can configure the maximum learning rate by setting config.training.lr.rate. Note that this learning rate is scaled by the number of GPUs as follows:

global_learning_rate = config.training.lr.rate * num_gpus_per_node * num_nodes / gpus_per_model

The user can also control the rate at which the learning rate decreases by setting the total number of iterations (config.training.lr.iterations) and the minimum learning rate reached (config.training.lr.min). Note that the minimum learning rate is not scaled by the number of GPUs. The user can also control the warmup period by setting config.training.lr.warmup_t. If the warmup period is set to 0, the learning rate starts at the maximum learning rate. If no warmup period is defined, a default warmup period of 1000 iterations is used.
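Putting these options together, the learning-rate section of the training config might look as follows (the values are illustrative only):

lr:
   rate: 6.25e-5        # maximum learning rate, scaled by the number of GPUs
   iterations: 300000   # total number of scheduler iterations
   min: 3e-7            # minimum learning rate, not scaled by the number of GPUs
   warmup_t: 1000       # warmup iterations; 0 starts directly at the maximum rate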

Rollout

Rollout training is when the model is iterated within the training process, producing forecasts for many future time steps. The loss is calculated on every step in the rollout period and averaged, and gradients are backpropagated through the iteration process.

For example, if using rollout=3 and a model with a 6 hour prediction step-size, during training the model predicts time t+1, this is used as input to predict time t+2, and that in turn is used to predict time t+3. The loss is calculated as 1/3 * ( (loss at t+1) + (loss at t+2) + (loss at t+3) ). Rollout training has been shown to improve stability for long auto-regressive inference runs, by making the training objective closer to the use case of forecasting arbitrary lead times through autoregressive iteration of the model.

In most cases, in the first stage of training, the model is trained for many epochs to predict only one step (i.e. rollout.max = 1). Once this is completed, there is a second stage of training, which uses rollout to fine-tune the model error at longer lead times. The model begins with a rollout defined by rollout.start, usually 1, and then every n epochs (defined by rollout.epoch_increment) the rollout value increases, up to rollout.max.

rollout:
   start: 1
   # increase rollout every n epochs
   epoch_increment: 1
   # maximum rollout to use
   max: 12

This two-stage approach requires the model training to be restarted after stage one; see the instructions below. The user should make sure to set config.training.run_id equal to the run-id of the first stage of training.

Note that, for many purposes, it may make sense for the rollout stage (stage two) to be performed at the minimum learning rate throughout, and for the number of batches to be reduced (using config.dataloader.limit_batches.training) to prevent overfitting to specific timesteps, as sketched below.
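A hedged sketch of stage-two overrides reflecting this advice; the numeric values are examples only:

training:
   run_id: <run-id of stage one>   # continue from the stage-one run (see below)
   lr:
      rate: 3e-7                   # example: hold the learning rate near the minimum
dataloader:
   limit_batches:
      training: 100                # example: reduce the number of training batches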

Restarting a training run

It may be necessary at certain points to restart the model training, e.g. because the training has exceeded the time limit on an HPC system, or because the user wants to fine-tune the model from a specific point in the training.

This can be done by setting config.training.run_id in the config file to the run_id of the run that is being restarted. In this case the new checkpoints will go in the same folder as the old checkpoints. If the user does not want this, they can instead set config.training.fork_run_id in the config file to the run_id of the run that is being restarted. In this case the old run will be unaffected and the new checkpoints will go into a new folder with a new run_id. The user might want to do this to start multiple new runs from one old run.

The above restarts the model training from where the old run finished. However, if the user wants to restart the model from a specific point, they can do so by setting config.hardware.files.warm_start to the checkpoint they want to restart from.
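In config terms, the three options described above look roughly like this; only one of run_id / fork_run_id would normally be set, and the values shown are hypothetical:

training:
   run_id: abc123           # continue this run in its existing checkpoint folder
   # fork_run_id: abc123    # alternative: fork the old run into a new run_id and folder
hardware:
   files:
      warm_start: checkpoint_epoch_10.ckpt   # optional: restart from a specific checkpoint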

Transfer Learning

Transfer learning allows the model to reuse knowledge from a previously trained checkpoint. This is particularly useful when the new task is related to the old one, enabling faster convergence and often improving model performance.

To enable transfer learning, set the config.training.transfer_learning flag to True in the configuration file.

training:
   # start the training from a checkpoint of a previous run
   fork_run_id: ...
   load_weights_only: True
   transfer_learning: True

When this flag is active and a checkpoint path is specified in config.hardware.files.warm_start or self.last_checkpoint, the system loads the pre-trained weights using the transfer_learning_loading function. This approach ensures only compatible weights are loaded and mismatched layers are handled appropriately.

For example, transfer learning might be used to adapt a weather forecasting model trained on one geographic region to another region with similar characteristics.

Model Freezing

Model freezing is a technique where specific parts (submodules) of a model are excluded from training. This is useful when certain parts of the model have been sufficiently trained or should remain unchanged for the current task.

To specify which submodules to freeze, use the config.training.submodules_to_freeze field in the configuration. List the names of submodules to be frozen. During model initialization, these submodules will have their parameters frozen, ensuring they are not updated during training.

For example, with the following configuration, the processor will be frozen and only the encoder and decoder will be trained:

training:
   # start the training from a checkpoint of a previous run
   fork_run_id: ...
   load_weights_only: True

   submodules_to_freeze:
      - processor

Freezing can be particularly beneficial in scenarios such as fine-tuning when only specific components (e.g., the encoder, the decoder) need to adapt to a new task while keeping others (e.g., the processor) fixed.