Strategy
This module defines the strategy for parallelising model training across GPUs, and it seeds the random number generators for each rank. The strategy used is Distributed Data Parallel with group communication: it implements data parallelism at the module level, can run across multiple GPUs, and builds on the standard PyTorch Lightning DDPStrategy.
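As a minimal sketch of what per-rank seeding can look like (the function name and the base_seed + rank offset scheme below are assumptions for illustration, not necessarily what this module implements):

```python
import random

import numpy as np
import torch


def seed_rng_for_rank(base_seed: int, global_rank: int) -> None:
    """Seed Python, NumPy and PyTorch RNGs with a rank-dependent seed.

    Illustrative only: the offset scheme (base_seed + global_rank) is an
    assumption, not necessarily what the strategy implements.
    """
    seed = base_seed + global_rank
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU generator
    torch.cuda.manual_seed_all(seed)  # explicit seeding of all CUDA devices
```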
Note
Generally you should not need to change this module, as it is independent of the system being used for training.
Anemoi Training provides different sharding strategies for deterministic and ensemble-based model tasks. For deterministic models, the DDPGroupStrategy is used, while for ensemble models the DDPEnsGroupStrategy is used, which in addition to sharding the model also distributes the ensemble members across GPUs.
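For illustration, the documented constructors could be passed directly to a PyTorch Lightning Trainer as sketched below. In an Anemoi Training run the strategy is typically wired up via the training configuration rather than built by hand, and the numeric values here are placeholders.

```python
import pytorch_lightning as pl

from anemoi.training.distributed.strategy import (
    DDPEnsGroupStrategy,
    DDPGroupStrategy,
)

# Deterministic task: each model instance is sharded across 2 GPUs.
deterministic_strategy = DDPGroupStrategy(num_gpus_per_model=2, read_group_size=1)

# Ensemble task: the model is sharded across 2 GPUs and each ensemble
# group spans 4 GPUs, so ensemble members are spread over GPUs as well.
ensemble_strategy = DDPEnsGroupStrategy(
    num_gpus_per_model=2,
    num_gpus_per_ensemble=4,
    read_group_size=1,
)

trainer = pl.Trainer(accelerator="gpu", devices=4, strategy=deterministic_strategy)
```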
DDPGroupStrategy
- class anemoi.training.distributed.strategy.DDPGroupStrategy(num_gpus_per_model: int, read_group_size: int, **kwargs: dict)
Bases:
DDPStrategy
Distributed Data Parallel strategy with group communication.
- setup(trainer: Trainer) → None
Sets up the accelerator, plugins and initializes the optimizers (if needed).
- Parameters:
trainer – the trainer instance
- process_dataloader(dataloader: DataLoader) → DataLoader
Pass communication group information to the dataloader for distributed training.
- Parameters:
dataloader (torch.utils.data.DataLoader) – Dataloader to process.
- Returns:
Processed dataloader.
- Return type:
torch.utils.data.DataLoader
- register_parameter_hooks() → None
Register parameter hooks for gradient reduction.
Here, we rescale parameters that only see a subset of the input on each rank: DDP still divides their gradients by the total number of GPUs, as if each rank had seen the full set of inputs. Note that the trainable parameters are added before the split across GPUs and are therefore not rescaled.
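To make the rescaling above concrete, here is a conceptual sketch only, not the actual hook implementation: the is_sharded marker and the way sharded parameters are identified are assumptions for illustration.

```python
import torch


def register_rescaling_hooks(model: torch.nn.Module, num_gpus_per_model: int) -> None:
    """Conceptual sketch: scale gradients of input-sharded parameters back up.

    DDP averages gradients over all ranks as if every rank saw the full
    input; parameters that only see 1/num_gpus_per_model of the input would
    otherwise end up scaled down by that factor.
    """
    for name, param in model.named_parameters():
        # Hypothetical marker: which parameters are input-sharded is
        # model-specific; "is_sharded" is an assumed attribute for illustration.
        if param.requires_grad and getattr(param, "is_sharded", False):
            param.register_hook(lambda grad: grad * num_gpus_per_model)
```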
DDPEnsGroupStrategy
- class anemoi.training.distributed.strategy.DDPEnsGroupStrategy(num_gpus_per_model: int, num_gpus_per_ensemble: int, read_group_size: int, **kwargs)
Bases:
DDPStrategy
Distributed Data Parallel strategy with group communication for ensembles.
- setup(trainer: Trainer) → None
Sets up the accelerator, plugins and initializes the optimizers (if needed).
- Parameters:
trainer – the trainer instance
- process_dataloader(dataloader: DataLoader) → DataLoader
Pass communication group information to the dataloader for distributed training.
- Parameters:
dataloader (torch.utils.data.DataLoader) – Dataloader to process.
- Returns:
Processed dataloader.
- Return type:
torch.utils.data.DataLoader
- register_parameter_hooks() → None
Register parameter hooks for gradient reduction.
Here, we rescale parameters that only see a subset of the input on each rank: DDP still divides their gradients by the total number of GPUs, as if each rank had seen the full set of inputs. Note that the trainable parameters are added before the split across GPUs and are therefore not rescaled.
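As a pure-Python illustration of how global ranks could map onto model-parallel and ensemble groups (the contiguous layout below is an assumption, not necessarily the ordering DDPEnsGroupStrategy uses):

```python
def group_assignment(
    global_rank: int, num_gpus_per_model: int, num_gpus_per_ensemble: int
) -> tuple[int, int, int]:
    """Return (model_group_id, rank_within_model_group, ensemble_group_id).

    Assumes ranks are laid out contiguously: consecutive ranks first fill a
    model-parallel group, and consecutive model groups fill an ensemble group.
    """
    model_group_id = global_rank // num_gpus_per_model
    rank_in_model_group = global_rank % num_gpus_per_model
    ensemble_group_id = global_rank // num_gpus_per_ensemble
    return model_group_id, rank_in_model_group, ensemble_group_id


# With 2 GPUs per model and 4 GPUs per ensemble, ranks 0-3 host the members
# of one ensemble group, split into two model-parallel groups of two GPUs each.
for rank in range(8):
    print(rank, group_assignment(rank, num_gpus_per_model=2, num_gpus_per_ensemble=4))
```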