Training

The Anemoi Training module is the heart of the framework where machine learning models for weather forecasting are trained. This section will guide you through the entire training process, from setting up your data to configuring your model and executing the training pipeline.

Setup Steps

Anemoi Training requires two primary components to get started:

Graph Definition from Anemoi Graphs: This defines the structure of your machine learning model, including the layers, connections, and operations that will be used during training.
Dataset from Anemoi Datasets: This provides the training data that will be fed into the model. The dataset should be pre-processed and formatted according to the specifications of the Anemoi Datasets module.

Preparing these two components makes up the first two steps, which are outlined in the start guide.

Step 3: Configure the Training Process

Once your graph definition and dataset are ready, you can configure the training process. Anemoi Training allows you to adjust various parameters such as learning rate, batch size, number of epochs, and other hyperparameters that control the training behavior.

To configure the training:

Specify the training parameters in your configuration file or through the command line interface.
Replace every value marked as missing in the config (shown as ???) with the appropriate value for your training setup.
Optionally, customize additional components like the normaliser or optimization strategies to enhance model performance.
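
Before launching a run, it can help to check that no missing values are left in the config. The following is a minimal sketch in plain Python, assuming the config has already been loaded into nested dictionaries; the function name `find_missing` and the keys and values in `example_config` are hypothetical and used only for illustration.

```python
from typing import Any, Iterator

MISSING = "???"  # marker for a mandatory value that has not been filled in yet


def find_missing(config: dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every entry still set to the missing marker."""
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested config groups.
            yield from find_missing(value, path)
        elif value == MISSING:
            yield path


# Hypothetical config fragment, not the real Anemoi Training schema.
example_config = {
    "hardware": {"paths": {"data": "???", "output": "/scratch/output"}},
    "training": {"lr": {"rate": 1.0e-4}},
}

missing_paths = list(find_missing(example_config))
print(missing_paths)  # ['hardware.paths.data']
```

Any path printed here still needs a value before training can start.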

Step 4: Set Up Experiment Tracking (Optional)

Experiment tracking is an essential aspect of machine learning development, allowing you to keep track of various runs, compare model performances, and reproduce results. Anemoi Training can be easily integrated with popular experiment tracking tools like TensorBoard, MLflow or Weights & Biases (W&B).

These different tools provide various features such as visualizing training metrics, logging hyperparameters, and storing model checkpoints. You can choose the tool that best fits your workflow and set it up to track your training experiments.

To set up experiment tracking:

Install the desired experiment tracking tool (e.g., TensorBoard, MLflow, or W&B).
Configure the tool in your training configuration file or through the command line interface.
Start the experiment tracking server and monitor your training runs in real-time.

Step 5: Execute Training

With everything set up, you can now execute the training process. Anemoi Training will use the graph definition and dataset to train your model according to the specified configuration.

To execute training:

Run the training command, ensuring that all paths to the graph definition and dataset are correctly specified.
Monitor the training process, adjusting parameters as needed to optimize model performance.
Upon completion, the trained model will be registered and stored for further use.

In practice, make sure a GPU is available and then run:

anemoi-training train

Data Routing

Anemoi Training uses the Anemoi Datasets module to load the data. The dataset contains the entirety of variables we can use for training. Initial experiments in data-driven weather forecasting have used the same input variables as output variables.

Anemoi Training implements data routing, in which you specify which variables are used as forcings (input only, to inform the model) and which are used as diagnostics (output only, to be predicted by the model). All remaining variables are treated as prognostic and appear in both the initial and forecast states.

Intuitively, forcings are the variables like solar insolation or land-sea-mask. These would make little sense to predict as they are external to the model. Diagnostics are the variables like precipitation that we want to predict, but which may not be available in forecast step zero due to technical limitations. Prognostic variables are the variables like temperature or humidity that we want to predict and are available after data assimilation operationally.

The user can specify the routing of the data by setting config.data.forcings and config.data.diagnostics. Both are lists of variable names, as Anemoi Datasets lets us address variables by name.

This can look like the following:

data:
   forcings:
      - solar_insolation
      - land_sea_mask
   diagnostics:
      - total_precipitation
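
The effect of this routing on the model inputs and outputs can be sketched in a few lines of Python. This is an illustration of the idea only, not the Anemoi Training implementation; the function `route_variables` and the list of dataset variables are assumptions made for the example.

```python
def route_variables(all_variables, forcings, diagnostics):
    """Split dataset variables into prognostic, input and output lists."""
    forcing_set, diagnostic_set = set(forcings), set(diagnostics)

    unknown = (forcing_set | diagnostic_set) - set(all_variables)
    if unknown:
        raise ValueError(f"Variables not in dataset: {sorted(unknown)}")
    overlap = forcing_set & diagnostic_set
    if overlap:
        raise ValueError(f"Both forcing and diagnostic: {sorted(overlap)}")

    # Everything that is neither forcing nor diagnostic is prognostic.
    prognostic = [
        v for v in all_variables
        if v not in forcing_set and v not in diagnostic_set
    ]
    # Inputs: prognostic + forcings. Outputs: prognostic + diagnostics.
    inputs = [v for v in all_variables if v not in diagnostic_set]
    outputs = [v for v in all_variables if v not in forcing_set]
    return {"prognostic": prognostic, "inputs": inputs, "outputs": outputs}


# Hypothetical dataset content, matching the config shown above.
dataset_variables = [
    "temperature",
    "humidity",
    "solar_insolation",
    "land_sea_mask",
    "total_precipitation",
]
routing = route_variables(
    dataset_variables,
    forcings=["solar_insolation", "land_sea_mask"],
    diagnostics=["total_precipitation"],
)
print(routing["prognostic"])  # ['temperature', 'humidity']
```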

Normalisation

Machine learning models are sensitive to the scale of the input data. To ensure that the model can learn effectively, it is important to normalise the input data.

Anemoi training provides preprocessors for different aspects of the training, with the normaliser being one of them. The normaliser implements multiple strategies that can be applied to the data using the config.

Currently, the normaliser supports the following strategies:

none: No normalisation is applied.
mean-std: Standard normalisation is applied to the data.
min-max: Min-max normalisation is applied to the data.
max: Max normalisation is applied to the data.

Values like the land-sea-mask do not require additional normalisation. However, variables like temperature or humidity should be normalised to ensure the model can learn effectively. Additionally, variables like the geopotential height should be max normalised to ensure the model can learn the vertical structure of the atmosphere.

The user can specify the normalisation strategy for each variable, as well as the default strategy, by setting config.data.normaliser, such that:

normaliser:
   default: mean-std
   none:
      - land_sea_mask
   max:
      - geopotential_height
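
To make the strategies concrete, here is a minimal sketch of how such a config could be resolved and applied. It assumes the textbook definitions of each strategy (subtract the mean and divide by the standard deviation, rescale to the range 0 to 1, divide by the maximum) and computes statistics from the sample itself, whereas a real normaliser would typically use statistics stored with the dataset. The function names are hypothetical.

```python
import statistics


def resolve_strategy(variable, normaliser_config):
    """Return the strategy listed for a variable, else the default."""
    for strategy, variables in normaliser_config.items():
        if strategy != "default" and variable in variables:
            return strategy
    return normaliser_config["default"]


def normalise(values, strategy):
    """Apply one normalisation strategy to a list of floats."""
    if strategy == "none":
        return list(values)
    if strategy == "mean-std":
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        return [(v - mean) / std for v in values]
    if strategy == "min-max":
        low, high = min(values), max(values)
        return [(v - low) / (high - low) for v in values]
    if strategy == "max":
        high = max(values)
        return [v / high for v in values]
    raise ValueError(f"Unknown normalisation strategy: {strategy}")


# Same structure as the config shown above.
normaliser_config = {
    "default": "mean-std",
    "none": ["land_sea_mask"],
    "max": ["geopotential_height"],
}

strategy = resolve_strategy("geopotential_height", normaliser_config)
print(strategy)  # max
print(normalise([1000.0, 2000.0, 4000.0], strategy))  # [0.25, 0.5, 1.0]
```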

Loss function scaling

It is possible to change the weighting given to each of the variables in the loss function by changing config.training.variable_loss_scaling.pl.<pressure level variable> and config.training.variable_loss_scaling.sfc.<surface variable>.

It is also possible to change the scaling given to the pressure levels using config.training.pressure_level_scaler. For almost all applications, upper atmosphere pressure levels should be given lower weighting than the lower atmosphere pressure levels (i.e. pressure levels nearer to the surface). By default anemoi-training uses a ReLU Pressure Level scaler with a minimum weighting of 0.2 (i.e. no pressure level has a weighting less than 0.2).
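
A ReLU-style pressure level scaler can be sketched as a linear weight that is clipped from below. The minimum of 0.2 comes from the text above; the slope of 0.001 per hPa and the way the level weight is multiplied with the per-variable weight are assumptions made for this illustration, not confirmed library behaviour.

```python
def relu_pressure_level_weight(level_hpa, minimum=0.2, slope=0.001):
    """Weight grows linearly with pressure but never drops below `minimum`.

    `slope` is an assumed value for illustration.
    """
    return max(minimum, slope * level_hpa)


def combined_weight(variable_scaling, level_hpa):
    """Assumed combination: per-variable weight times pressure level weight."""
    return variable_scaling * relu_pressure_level_weight(level_hpa)


for level in (1000, 500, 50):
    print(level, relu_pressure_level_weight(level))
# Levels near the surface get the largest weight; at 50 hPa the
# linear term (0.05) is below the minimum, so the weight is 0.2.
```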

The loss is also scaled by assigning a weight to each node on the output grid. These weights are calculated during graph-creation and stored as an attribute in the graph object. The configuration option config.training.node_loss_weights is used to specify the node attribute used as weights in the loss function. By default anemoi-training uses area weighting, where each node is weighted according to the size of the geographical area it represents.

It is also possible to rescale the weight of a subset of nodes after they are loaded from the graph. For instance, for a stretched grid setup we can rescale the weight of nodes in the limited area such that their sum equals 0.25 of the sum of all node weights with the following config setup:

node_loss_weights:
   _target_: anemoi.training.losses.nodeweights.ReweightedGraphNodeAttribute
   target_nodes: data
   scaled_attribute: cutout
   weight_frac_of_total: 0.25
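
The arithmetic behind such a rescaling can be sketched as follows. This example assumes that the target fraction refers to the total after rescaling and that the weights inside the subset keep their relative sizes; the actual ReweightedGraphNodeAttribute class may differ in both respects. The function `rescale_subset` and the example weights are hypothetical.

```python
def rescale_subset(weights, in_subset, weight_frac_of_total):
    """Scale the subset so it holds the given fraction of the new total."""
    if not 0.0 < weight_frac_of_total < 1.0:
        raise ValueError("weight_frac_of_total must lie strictly between 0 and 1")

    subset_sum = sum(w for w, m in zip(weights, in_subset) if m)
    rest_sum = sum(w for w, m in zip(weights, in_subset) if not m)

    # Solve factor * subset_sum = frac * (factor * subset_sum + rest_sum).
    factor = weight_frac_of_total * rest_sum / (
        (1.0 - weight_frac_of_total) * subset_sum
    )
    return [w * factor if m else w for w, m in zip(weights, in_subset)]


# Hypothetical node weights; the first two nodes lie in the limited area.
node_weights = [1.0, 1.0, 2.0, 2.0, 4.0]
cutout_mask = [True, True, False, False, False]

rescaled = rescale_subset(node_weights, cutout_mask, 0.25)
fraction = sum(w for w, m in zip(rescaled, cutout_mask) if m) / sum(rescaled)
print(round(fraction, 6))  # 0.25
```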

Learning rate

Anemoi Training uses the CosineLRScheduler from timm (PyTorch Image Models) as its learning rate scheduler. The source for this scheduler can be found at https://github.com/huggingface/pytorch-image-models/blob/main/timm/scheduler/cosine_lr.py. The user can configure the maximum learning rate by setting config.training.lr.rate. Note that this learning rate is scaled by the number of GPUs used for data parallelism:

global_learning_rate = config.training.lr.rate * num_gpus_per_node * num_nodes / gpus_per_model
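
As a worked example of this formula, the sketch below computes the scaled learning rate for a hypothetical setup. The function name and the numbers are illustrative assumptions; only the formula itself is taken from the text.

```python
def global_learning_rate(rate, num_gpus_per_node, num_nodes, gpus_per_model):
    """Scale the configured rate by the number of data-parallel model replicas."""
    return rate * num_gpus_per_node * num_nodes / gpus_per_model


# Hypothetical setup: 2 nodes with 4 GPUs each, every model instance
# sharded over 2 GPUs, which gives 4 data-parallel replicas.
scaled_rate = global_learning_rate(
    rate=1.0e-4, num_gpus_per_node=4, num_nodes=2, gpus_per_model=2
)
print(scaled_rate)  # roughly 4 times the configured rate
```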

The user can also control the rate at which the learning rate decreases by setting the total number of iterations through config.training.lr.iterations and the minimum learning rate reached through config.training.lr.min. Note that the minimum learning rate is not scaled by the number of GPUs. The user can also control the warmup period by setting config.training.lr.warmup_t. If the warmup period is set to 0, the learning rate will start at the maximum learning rate. If no warmup period is defined, a default warmup period of 1000 iterations is used.
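
The interplay of these settings can be illustrated with a simplified cosine schedule with linear warmup. This is a sketch of the general technique, not a copy of the timm scheduler: in particular, the warmup is assumed to ramp up linearly from `warmup_start` (an assumed parameter, 0 by default here), and `lr_max` stands for the already scaled maximum learning rate.

```python
import math


def learning_rate_at(step, lr_max, lr_min, iterations,
                     warmup_t=1000, warmup_start=0.0):
    """Simplified cosine decay from lr_max to lr_min with linear warmup."""
    if step < warmup_t:
        # Linear ramp from warmup_start towards lr_max.
        return warmup_start + (lr_max - warmup_start) * step / warmup_t
    # Cosine decay; the rate stays at lr_min once `iterations` is reached.
    progress = min(step, iterations) / iterations
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values for illustration.
for step in (0, 500, 1000, 150_000, 300_000):
    lr = learning_rate_at(step, lr_max=4.0e-4, lr_min=3.0e-7, iterations=300_000)
    print(f"step {step:>7}: lr = {lr:.3e}")
```

With `warmup_t=0` the first branch is never taken, so the schedule starts directly at the maximum learning rate, matching the behaviour described above.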

Rollout

In the first stage of training, standard practice is to train the model on a single 6 hour forecast step. Once this is completed, a second stage of training is advisable, in which the model is rolled out over several steps and fine-tuned on the error at longer lead times too. Generally for medium range forecasts, rollout is performed on up to 12 forecast steps (equivalent to 72 hours), added incrementally. In other words, at each epoch another forecast step is added to the error term.

Rollout requires the model training to be restarted so the user should make sure to set config.training.run_id equal to the run-id of the first stage of training.

Note that in the standard set-up, rollout is performed at the minimum learning rate and the number of batches used is reduced (using config.dataloader.training.limit_batches) to prevent overfitting to specific timesteps.

To start rollout set config.training.rollout.epoch_increment equal to 1 (thus increasing the rollout step by 1 at every epoch) and set a maximum rollout by setting config.training.rollout.max (usually set to 12).
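
The resulting rollout schedule can be sketched as below. The example assumes that rollout starts at 1 step, that `epoch_increment` is the number of epochs between increments, and that each step covers 6 hours; the `start` parameter and the function name are assumptions for illustration.

```python
STEP_HOURS = 6  # length of one forecast step, as in the first training stage


def rollout_steps(epoch, start=1, epoch_increment=1, maximum=12):
    """Number of forecast steps contributing to the loss in a given epoch."""
    if epoch_increment <= 0:
        return start  # rollout length is never increased
    return min(start + epoch // epoch_increment, maximum)


schedule = [rollout_steps(epoch) for epoch in range(14)]
print(schedule)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 12]
print(schedule[-1] * STEP_HOURS)  # 72 (hours of lead time at maximum rollout)
```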

Restarting a training run

Whether it’s because the training has exceeded the time limit on an HPC system or because the user wants to fine-tune the model from a specific point in the training, it may be necessary at certain points to restart the model training.

This can be done by setting config.training.run_id in the config file to the run_id of the run that is being restarted. In this case the new checkpoints will go in the same folder as the old checkpoints. If the user does not want this, they can instead set config.training.fork_run_id in the config file to the run_id of the run that is being restarted. In this case the old run will be unaffected and the new checkpoints will go into a new folder with a new run_id. This is useful, for example, when starting multiple new runs from one old run.

The above will restart the model training from where the old run finished training. However, if the user wants to restart the model from a specific point, they can do this by setting config.hardware.files.warm_start to the checkpoint they want to restart from.