Training

The Anemoi Training module is the heart of the framework where machine learning models for weather forecasting are trained. This section will guide you through the entire training process, from setting up your data to configuring your model and executing the training pipeline.

Setup Steps

Anemoi Training requires two primary components to get started:

Steps 1 and 2:

  1. Graph Definition from Anemoi Graphs: This defines the graph structure on which your machine learning model operates, i.e. the nodes and connections (edges) used during training.

  2. Dataset from Anemoi Datasets: This provides the training data that will be fed into the model. The dataset should be pre-processed and formatted according to the specifications of the Anemoi Datasets module.

These two steps are outlined in Preparing training components.
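Once both components exist, the training configuration needs to point at them. A minimal sketch of the relevant hardware section is shown below; the graph-related keys are assumptions inferred from the interpolation patterns used elsewhere on this page (e.g. ${hardware.paths.data}/${hardware.files.dataset}), so check them against your own config:

hardware:
   paths:
      data: /path/to/datasets/     # directory containing the dataset from anemoi-datasets
      graph: /path/to/graphs/      # assumed key: directory containing the graph file
   files:
      dataset: my-dataset.zarr     # dataset built with anemoi-datasets
      graph: my-graph.pt           # assumed key: graph built with anemoi-graphs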

Step 3: Configure the Training Process

Once your graph definition and dataset are ready, you can configure the training process. Anemoi Training allows you to adjust various parameters such as learning rate, batch size, number of epochs, and other hyperparameters that control the training behavior.

To configure the training:

  • Specify the training parameters in your configuration file or through the command line interface.

  • Replace all “missing” (???) values in the config with the appropriate values for your training setup.

  • Choose the model task and model type from Models.

  • Optionally, customize additional components like the normaliser or optimization strategies to enhance model performance.

Parallelization

Anemoi Training supports different parallelization strategies based on the training task (see Strategy):

  • DDPGroupStrategy: Used for deterministic training tasks

  • DDPEnsGroupStrategy: Used for ensemble training tasks

The chosen strategy must be consistent with the model task specified in the configuration.
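As an illustration, the strategy is selected via its target class in the configuration. The module path and keys below are assumptions based on the class names above and may differ in your version:

strategy:
   _target_: anemoi.training.distributed.strategy.DDPGroupStrategy  # or DDPEnsGroupStrategy for ensemble tasks
   num_gpus_per_model: 1  # assumed key: how many GPUs each model instance is sharded across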

Step 4: Set Up Experiment Tracking (Optional)

Experiment tracking is an essential aspect of machine learning development, allowing you to keep track of various runs, compare model performances, and reproduce results. Anemoi Training can be easily integrated with popular experiment tracking tools like TensorBoard, MLflow or Weights & Biases (W&B).

These different tools provide various features such as visualizing training metrics, logging hyperparameters, and storing model checkpoints. You can choose the tool that best fits your workflow and set it up to track your training experiments.

To set up experiment tracking:

  1. Install the desired experiment tracking tool (e.g., TensorBoard, MLflow, or W&B).

  2. Configure the tool in your training configuration file or through the command line interface.

  3. Start the experiment tracking server and monitor your training runs in real-time.
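As a hedged sketch, the loggers are typically switched on under the diagnostics section of the configuration; the exact keys shown here (diagnostics.log.*) are assumptions and may differ between versions:

diagnostics:
   log:
      tensorboard:
         enabled: False
      wandb:
         enabled: False
      mlflow:
         enabled: True
         tracking_uri: https://my-mlflow-server.example   # hypothetical server address
         experiment_name: my-experiment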

Step 5: Execute Training

With everything set up, you can now execute the training process. Anemoi Training will use the graph definition and dataset to train your model according to the specified configuration.

To execute training:

  • Run the training command given below, ensuring that all paths to the graph definition and dataset are correctly specified.

  • Monitor the training process, adjusting parameters as needed to optimize model performance.

  • Upon completion, the trained model will be registered and stored for further use.

Make sure you have a GPU available and simply call:

anemoi-training train
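Since the configuration uses Hydra-style placeholders (??? and ${...} interpolation), individual values can usually be overridden directly on the command line rather than editing the config file. The keys below come from examples on this page; treat the override syntax as an assumption for your installation:

anemoi-training train dataloader.batch_size.training=4 training.lr.rate=5e-4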

Data Routing

Anemoi Training uses the Anemoi Datasets module to load the data.

Anemoi training implements data routing, in which you can specify which variables are used as forcings (input only) and which as diagnostics (output only, i.e. predicted by the model but never fed back in as input). All remaining variables are treated as prognostic, i.e. they appear as both inputs and outputs.

Forcings are variables such as solar insolation or the land-sea mask, which would make little sense to predict as they are external to the model. They can be static (like the land-sea mask) or dynamic (like solar insolation). Note that within anemoi, forcing does not have the classical NWP meaning of external variables which impact the model, such as wind forcing applied to an ocean model. Instead, forcing here refers to any variable which is an input only; in some cases this includes ‘traditional’ forcings alongside other variables.

Diagnostics include variables like precipitation that we want to predict, but which may not be available at forecast step zero due to technical limitations. They can also include derived quantities which we wish the model to predict directly but do not want to use as inputs.

Prognostic variables are the variables like temperature or humidity that we want to predict and appear as both inputs and outputs.

The user can specify the routing of the data by setting config.data.forcings and config.data.diagnostics. These are lists of variable names, as Anemoi Datasets enables us to address variables by name. Any variable in the dataset which is not listed as either a forcing or a diagnostic (or dropped, see Dataloader below) will be classed as a prognostic variable.

data:
   forcings:
      - solar_insolation
      - land_sea_mask
   diagnostics:
      - total_precipitation

Data Modules

Anemoi Training provides different data modules to handle various model tasks:

  • AnemoiDatasetDataModule: Standard data module for deterministic training

  • AnemoiEnsDatasetsDataModule: Specialized data module for ensemble training. It also allows for training with perturbed initial conditions.

The choice of data module depends on your training task and input data requirements.

Dataloader

The dataloader file specifies how many worker processes are used and the batch size for training, validation and testing. num_workers sets the number of parallel workers each dataloader uses to prepare batches.

# number of dataloader worker processes per stage
num_workers:
   training: 8
   validation: 8
   test: 8
# batch size used for each stage
batch_size:
   training: 2
   validation: 4
   test: 4

# limit the number of batches used per epoch (null = no limit)
limit_batches:
   training: null
   validation: null
   test: 20

The grid points being modelled are also defined. In many cases this will be the full grid. For limited area modelling, you may want to define a set of target indices which mask/remove some grid points, leaving only the area being modelled.

# set a custom mask for grid points.
# Useful for LAM (dropping unconnected nodes from forcing dataset)
grid_indices:
   _target_: anemoi.training.data.grid_indices.FullGrid
   nodes_name: ${graph.data}

The dataloader file also describes the files used for training, validation and testing, and the data split. For machine learning, we separate our data into: training data, used to train the model; validation data, used to assess various versions of the model throughout the development process; and test data, used to assess the final version of the model. Best practice is to separate the data in time, ensuring the validation and test data are suitably independent from the training data.

We define the start and end time of each section of the data. This can be given as a full date, as just the year, or as year and month; in the latter cases the first day of the month/year is used.

The dataset used, and the frequency, can be set separately for the different parts of the split, for example if the test data is stored in a different file.

By default, every variable within the dataset is used. If this is not desired, variables can be listed under drop and they won’t be used. Conversely, if only a few variables from the file are needed, select can be used in place of drop, and only the listed variables are used. The same overall set of variables must be used throughout training, validation and test; if different files containing different variables are used, the items listed in drop/select may vary. An example using select is given after the config below.

dataset: ${hardware.paths.data}/${hardware.files.dataset}

training:
  dataset: ${dataloader.dataset}
  start: null
  end: 2020
  frequency: ${data.frequency}
  drop:  []

validation_rollout: 1 # number of rollout steps to use for validation; must be equal to or greater than the rollout expected by callbacks

validation:
  dataset: ${dataloader.dataset}
  start: 2021-01-01
  end: 2021
  frequency: ${data.frequency}
  drop:  []

test:
  dataset: ${dataloader.dataset}
  start: 2022-01
  end: null
  frequency: ${data.frequency}
  drop:  []
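If only a handful of variables from a file are needed, select can be used in place of drop, as described above. A brief illustration with hypothetical variable names:

training:
  dataset: ${dataloader.dataset}
  start: null
  end: 2020
  frequency: ${data.frequency}
  # use only the listed variables from this file
  select:
     - 2t
     - 10u
     - 10v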

Normalisation

Machine learning models are sensitive to the scale of the input data. To ensure that the model can learn effectively, it is important to normalise the input data so that all variables exhibit a similar range; this gives each variable a comparable contribution to the loss function.

The normaliser is one of several ‘preprocessors’ within anemoi. It implements multiple strategies that can be applied to the data via the config. Currently, the normaliser supports the following strategies:

  • none: No normalisation is applied.

  • mean-std: Data is normalised by subtracting the mean and dividing by the standard deviation

  • std: Data is normalised by dividing by the standard deviation.

  • min-max: Data is normalised by subtracting the min value and dividing by the range.

  • max: Data is normalised by dividing by the max value.

Values like the land-sea mask do not require additional normalisation as they already span the range 0 to 1. Variables like temperature or humidity are usually normalised using mean-std. Some variables, like the geopotential height, should be max-normalised so that the ‘zero’ point and the proportional distance from it are retained.

The user can specify the normalisation strategy by choosing a default method, and additionally specifying specific cases for certain variables within config.data.normaliser:

normaliser:
   default: mean-std
   none:
      - land_sea_mask
   max:
      - geopotential_height

An additional option in the normaliser overwrites the statistics of one variable with those of another. This is primarily used for convective precipitation (cp), which is a fraction of total precipitation (tp): by overwriting the cp statistics with the tp statistics, we ensure the fractional relationship remains intact in the normalised space. Note that this is a design choice.

normaliser:
   remap:
     cp: tp

Imputer

It is important to have no missing values (e.g. NaNs) in the data when training a model, as they will break the backpropagation of gradients and cause the model to predict only NaNs. For fields which contain missing values, we provide options to replace these values via an “imputer”. During training, NaN values are replaced with the specified value for the field. The default imputer is “none”, which means no imputation is performed. The user can specify the imputer by setting processors.imputer in the data/zarr.yaml file. It is common to impute with the mean value, ensuring that the variable value over NaNs becomes zero after mean-std normalisation. Another option is to impute with a given constant.

The DynamicInputImputer can be used for fields where the NaN locations change in time.

imputer:
   default: "none"
   mean:
      - 2t

processors:
   imputer:
      _target_: anemoi.models.preprocessing.imputer.InputImputer
      _convert_: all
      config: ${data.imputer}
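For fields whose NaN locations change in time, the DynamicInputImputer mentioned above can be used as the processor target instead. The module path below simply mirrors that of InputImputer and is an assumption:

processors:
   imputer:
      _target_: anemoi.models.preprocessing.imputer.DynamicInputImputer  # assumed path, analogous to InputImputer
      _convert_: all
      config: ${data.imputer}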

Loss Functions

Anemoi Training supports various loss functions for different training tasks and easily allows for custom loss functions to be added.

training_loss:
   _target_: anemoi.training.losses.mse.WeightedMSELoss
   # class kwargs

The choice of loss function depends on the model task and the desired properties of the forecast.

For ensemble training, the following loss functions are available:

  • Kernel CRPS: Continuous Ranked Probability Score using kernel density estimation

  • AlmostFairKernelCRPS: A variant of Kernel CRPS which accounts for the number of ensemble members used.
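By analogy with the WeightedMSELoss example above, an ensemble loss is selected by pointing training_loss at the corresponding class. The module path below is an assumption and should be checked against your installation:

training_loss:
   _target_: anemoi.training.losses.kcrps.AlmostFairKernelCRPS  # assumed path; KernelCRPS is configured the same way
   # class kwargs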

Loss function scaling

It is possible to change the weighting given to each of the variables in the loss function by changing config.training.variable_loss_scaling.pl.<pressure level variable> and config.training.variable_loss_scaling.sfc.<surface variable>.

It is also possible to change the scaling given to the pressure levels using config.training.pressure_level_scaler. For almost all applications, upper atmosphere pressure levels should be given lower weighting than the lower atmosphere pressure levels (i.e. pressure levels nearer to the surface). By default anemoi-training uses a ReLU Pressure Level scaler with a minimum weighting of 0.2 (i.e. no pressure level has a weighting less than 0.2).
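A hedged illustration of how these two options might look in the training config; the variable names and numeric values are examples only, and the scaler class path is an assumption:

variable_loss_scaling:
   default: 1
   pl:
      q: 0.6      # example: down-weight specific humidity
      t: 6        # example: up-weight temperature
   sfc:
      2t: 3.5     # example surface-variable weighting
pressure_level_scaler:
   _target_: anemoi.training.data.scaling.ReluPressureLevelScaler  # assumed path
   minimum: 0.2   # no pressure level weighted below 0.2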

The loss is also scaled by assigning a weight to each node on the output grid. These weights are calculated during graph-creation and stored as an attribute in the graph object. The configuration option config.training.node_loss_weights is used to specify the node attribute used as weights in the loss function. By default anemoi-training uses area weighting, where each node is weighted according to the size of the geographical area it represents.

It is also possible to rescale the weight of a subset of nodes after they are loaded from the graph. For instance, for a stretched grid setup we can rescale the weight of nodes in the limited area such that their sum equals 0.25 of the sum of all node weights, with the following config setup:

node_loss_weights:
   _target_: anemoi.training.losses.nodeweights.ReweightedGraphNodeAttribute
   target_nodes: data
   scaled_attribute: cutout
   weight_frac_of_total: 0.25

Learning rate

Anemoi training uses the CosineLRScheduler from timm (pytorch-image-models) as its learning rate scheduler. Documentation for this scheduler can be found at https://github.com/huggingface/pytorch-image-models/blob/main/timm/scheduler/cosine_lr.py. The user can configure the maximum learning rate by setting config.training.lr.rate. Note that this learning rate is scaled by the number of GPUs as follows:

global_learning_rate = config.training.lr.rate * num_gpus_per_node * num_nodes / gpus_per_model

The user can also control the rate at which the learning rate decreases by setting the total number of iterations (config.training.lr.iterations) and the minimum learning rate reached (config.training.lr.min). Note that the minimum learning rate is not scaled by the number of GPUs. The user can also control the warmup period by setting config.training.lr.warmup_t. If the warmup period is set to 0, the learning rate starts at the maximum learning rate. If no warmup period is defined, a default warmup period of 1000 iterations is used.
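Putting these options together, the learning-rate section of the training config might look as follows (the values are illustrative only):

lr:
   rate: 6.25e-5        # maximum learning rate, scaled by the number of GPUs
   iterations: 300000   # total number of scheduler iterations
   min: 3e-7            # minimum learning rate, not scaled by the number of GPUs
   warmup_t: 1000       # warmup iterations; 0 starts directly at the maximum rate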

Rollout

Rollout training is when the model is iterated within the training process, producing forecasts for many future time steps. The loss is calculated on every step in the rollout period and averaged, and gradients are backpropagated through the iteration process.

For example, if using rollout=3 and a model with a 6 hour prediction step-size, during training the model predicts time t+1, this is used as input to predict time t+2, and that in turn is used to predict time t+3. The loss is calculated as 1/3 * ( (loss at t+1) + (loss at t+2) + (loss at t+3) ). Rollout training has been shown to improve stability for long auto-regressive inference runs, by making the training objective closer to the use case of forecasting arbitrary lead times through autoregressive iteration of the model.

In most cases, in the first stage of training, the model is trained for many epochs to predict only one step (i.e. rollout.max = 1). Once this is completed, there is a second stage of training, which uses rollout to fine-tune the model error at longer lead times. The model begins with a rollout defined by rollout.start, usually 1, and then every n epochs (defined by rollout.epoch_increment) the rollout value increases, up to rollout.max.

rollout:
   start: 1
   # increase rollout every n epochs
   epoch_increment: 1
   # maximum rollout to use
   max: 12

This two-stage approach requires the model training to be restarted after stage one; see the instructions below. The user should make sure to set config.training.run_id equal to the run-id of the first stage of training.

Note that, for many purposes, it may make sense for the rollout stage (stage two) to be performed at the minimum learning rate throughout, and for the number of batches to be reduced (using config.dataloader.limit_batches.training) to prevent overfitting to specific timesteps, as sketched below.
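A hedged sketch of stage-two overrides reflecting this advice; the numeric values are examples only:

training:
   run_id: <run-id of stage one>   # continue from the stage-one run (see below)
   lr:
      rate: 3e-7                   # example: hold the learning rate near the minimum
dataloader:
   limit_batches:
      training: 100                # example: reduce the number of training batches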

Restarting a training run

It may be necessary at certain points to restart the model training, e.g. because the training has exceeded the time limit on an HPC system, or because the user wants to fine-tune the model from a specific point in the training.

This can be done by setting config.training.run_id in the config file to the run_id of the run that is being restarted. In this case the new checkpoints will go in the same folder as the old checkpoints. If the user does not want this, they can instead set config.training.fork_run_id in the config file to the run_id of the run that is being restarted. In this case the old run will be unaffected and the new checkpoints will go into a new folder with a new run_id. The user might want to do this to start multiple new runs from one old run.

The above restarts the model training from where the old run finished. However, if the user wants to restart the model from a specific point, they can do so by setting config.hardware.files.warm_start to the checkpoint they want to restart from.
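In config terms, the three options described above look roughly like this; only one of run_id / fork_run_id would normally be set, and the values shown are hypothetical:

training:
   run_id: abc123           # continue this run in its existing checkpoint folder
   # fork_run_id: abc123    # alternative: fork the old run into a new run_id and folder
hardware:
   files:
      warm_start: checkpoint_epoch_10.ckpt   # optional: restart from a specific checkpoint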

Transfer Learning

Transfer learning allows the model to reuse knowledge from a previously trained checkpoint. This is particularly useful when the new task is related to the old one, enabling faster convergence and often improving model performance.

To enable transfer learning, set the config.training.transfer_learning flag to True in the configuration file.

training:
   # start the training from a checkpoint of a previous run
   fork_run_id: ...
   load_weights_only: True
   transfer_learning: True

When this flag is active and a checkpoint path is specified in config.hardware.files.warm_start or self.last_checkpoint, the system loads the pre-trained weights using the transfer_learning_loading function. This approach ensures only compatible weights are loaded and mismatched layers are handled appropriately.

For example, transfer learning might be used to adapt a weather forecasting model trained on one geographic region to another region with similar characteristics.

Model Freezing

Model freezing is a technique where specific parts (submodules) of a model are excluded from training. This is useful when certain parts of the model have been sufficiently trained or should remain unchanged for the current task.

To specify which submodules to freeze, use the config.training.submodules_to_freeze field in the configuration. List the names of submodules to be frozen. During model initialization, these submodules will have their parameters frozen, ensuring they are not updated during training.

For example, with the following configuration, the processor will be frozen and only the encoder and decoder will be trained:

training:
   # start the training from a checkpoint of a previous run
   fork_run_id: ...
   load_weights_only: True

   submodules_to_freeze:
      - processor

Freezing can be particularly beneficial in scenarios such as fine-tuning when only specific components (e.g., the encoder, the decoder) need to adapt to a new task while keeping others (e.g., the processor) fixed.