Strategy

This module defines the strategy for parallelising model training across GPUs; it also seeds the random number generators for each rank. The strategy used is Distributed Data Parallel (DDP) with group communication: it implements data parallelism at the module level, can run across multiple GPUs and nodes, and extends the standard PyTorch Lightning DDPStrategy.
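
The per-rank seeding can be pictured with a small sketch. This is illustrative only: the seed_rank helper and the offset-by-rank formula are assumptions made for the example, not the scheme used by the module.

    import random

    import numpy as np
    import torch

    def seed_rank(base_seed: int, global_rank: int) -> None:
        """Seed all random number generators for one rank (illustrative sketch)."""
        seed = base_seed + global_rank  # assumed offset-by-rank scheme
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)  # also seeds the CUDA generators on current PyTorch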

Note

Generally you should not need to change this module, as it is independent of the system being used for training.

Anemoi Training provides different sharding strategies for deterministic and ensemble-based model tasks.

For deterministic models, the DDPGroupStrategy is used, while for ensemble models the DDPEnsGroupStrategy is used, which, in addition to sharding the model, also distributes the ensemble members across GPUs.
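
As a rough sketch, and assuming the strategies can be handed to a PyTorch Lightning Trainer like any other DDPStrategy subclass (in practice Anemoi Training constructs them internally from the training configuration), the two variants are instantiated with the constructor arguments documented below:

    from pytorch_lightning import Trainer

    from anemoi.training.distributed.strategy import (
        DDPEnsGroupStrategy,
        DDPGroupStrategy,
    )

    # Deterministic task: shard each model instance across 2 GPUs.
    det_strategy = DDPGroupStrategy(num_gpus_per_model=2, read_group_size=1)

    # Ensemble task: additionally distribute ensemble members across 4 GPUs per ensemble group.
    ens_strategy = DDPEnsGroupStrategy(
        num_gpus_per_model=2,
        num_gpus_per_ensemble=4,
        read_group_size=1,
    )

    trainer = Trainer(accelerator="gpu", devices=4, strategy=det_strategy)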

DDPGroupStrategy

class anemoi.training.distributed.strategy.DDPGroupStrategy(num_gpus_per_model: int, read_group_size: int, **kwargs: dict)

Bases: DDPStrategy

Distributed Data Parallel strategy with group communication.

setup(trainer: Trainer) → None

Sets up the accelerator, plugins and initializes the optimizers (if needed).

Parameters:

trainer – the trainer instance

process_dataloader(dataloader: DataLoader) → DataLoader

Pass communication group information to the dataloader for distributed training.

Parameters:

dataloader (torch.utils.data.DataLoader) – Dataloader to process.

Returns:

Processed dataloader.

Return type:

torch.utils.data.DataLoader

register_parameter_hooks() → None

Register parameter hooks for gradient reduction.

Here, we rescale the gradients of parameters that only see a subset of the input on each rank: DDP still divides these gradients by the total number of GPUs, as if each rank had seen the full set of inputs, so the hooks scale them back up. Note that the trainable parameters are added before the split across GPUs and are therefore not rescaled.
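
The rescaling idea can be sketched with a plain PyTorch gradient hook. This is not the actual implementation: the register_rescaling_hooks helper and the name-based filter for trainable tensors are hypothetical, and only illustrate why gradients of sharded parameters are multiplied back by the model-group size.

    import torch

    def register_rescaling_hooks(model: torch.nn.Module, num_gpus_per_model: int) -> None:
        """Illustrative sketch of rescaling gradients of parameters that only
        see a subset of the input on each rank."""
        for name, param in model.named_parameters():
            if param.requires_grad and "trainable" not in name:  # hypothetical filter
                # DDP divides the reduced gradient by the total number of GPUs;
                # multiplying back by the model-group size scales the gradient
                # as if each rank had seen the full set of inputs.
                param.register_hook(lambda grad, scale=num_gpus_per_model: grad * scale)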

DDPEnsGroupStrategy

class anemoi.training.distributed.strategy.DDPEnsGroupStrategy(num_gpus_per_model: int, num_gpus_per_ensemble: int, read_group_size: int, **kwargs)

Bases: DDPStrategy

Distributed Data Parallel strategy with group communication for ensembles.
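
To picture how the ensemble dimension interacts with model sharding, the following arithmetic sketch assumes that model communication groups are nested inside ensemble communication groups; the exact layout is decided by the strategy itself, so the numbers are illustrative only.

    world_size = 8              # total number of GPUs
    num_gpus_per_model = 2      # GPUs sharding one model instance
    num_gpus_per_ensemble = 4   # GPUs forming one ensemble group

    members_per_group = num_gpus_per_ensemble // num_gpus_per_model  # 2 ensemble members per group
    num_ensemble_groups = world_size // num_gpus_per_ensemble        # 2 data-parallel ensemble groups
    print(members_per_group, num_ensemble_groups)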

setup(trainer: Trainer) → None

Sets up the accelerator, plugins and initializes the optimizers (if needed).

Parameters:

trainer – the trainer instance

process_dataloader(dataloader: DataLoader) → DataLoader

Pass communication group information to the dataloader for distributed training.

Parameters:

dataloader (torch.utils.data.DataLoader) – Dataloader to process.

Returns:

Processed dataloader.

Return type:

torch.utils.data.DataLoader

register_parameter_hooks() → None

Register parameter hooks for gradient reduction.

Here, we rescale the gradients of parameters that only see a subset of the input on each rank: DDP still divides these gradients by the total number of GPUs, as if each rank had seen the full set of inputs, so the hooks scale them back up. Note that the trainable parameters are added before the split across GPUs and are therefore not rescaled.