Data

This module initialises datasets built with anemoi-datasets and loads their data into the model. It also runs validation checks, for example that the training dataset ends before the validation dataset starts.
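
The date-ordering check described above can be sketched as follows. This is a minimal illustration, not the module's own code: the function name check_split_order is hypothetical, and whether the real check raises an error or only logs a warning is not stated here.

```python
from datetime import datetime


def check_split_order(train_end: datetime, val_start: datetime) -> None:
    """Raise if the training period does not end before validation begins.

    Hypothetical helper for illustration; not part of anemoi-training.
    """
    if train_end >= val_start:
        msg = (
            f"Training end date {train_end:%Y-%m-%d} is not before "
            f"validation start date {val_start:%Y-%m-%d}."
        )
        raise ValueError(msg)


# A non-overlapping split passes silently.
check_split_order(datetime(2019, 12, 31), datetime(2020, 1, 1))
```

A check of this kind guards against validation samples leaking into the training period.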

The dataset files define how each dataset is split between workers (worker_init_func) and how it is iterated over to produce the batches fed to the model (__iter__).
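
One common way to split samples between workers is to give each worker a contiguous chunk of the index range. The sketch below assumes that scheme; worker_chunk is a hypothetical name, and the actual partitioning used by the dataset classes may differ.

```python
def worker_chunk(n_samples: int, n_workers: int, worker_id: int) -> range:
    """Return the contiguous range of sample indices handled by one worker.

    Hypothetical helper for illustration; not part of anemoi-training.
    """
    base, extra = divmod(n_samples, n_workers)
    # The first `extra` workers take one additional sample each, so the
    # chunks cover every index exactly once.
    start = worker_id * base + min(worker_id, extra)
    stop = start + base + (1 if worker_id < extra else 0)
    return range(start, stop)


# Ten samples over three workers.
chunks = [worker_chunk(10, 3, w) for w in range(3)]
```

Each worker then iterates only over its own chunk, so no sample is read twice within an epoch.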

Dataset Architecture

The data module provides two types of dataset reader that wrap anemoi-datasets data, plus a wrapper (MultiDataset) that combines them:

Native Grid Dataset

The NativeGridDataset class is used for standard atmospheric data on a native grid. It provides a simple interface for reading data samples at specified time indices.

Trajectory Dataset

The TrajectoryDataset class extends NativeGridDataset to support trajectory-based sampling, where data is organized into temporal trajectories. This is useful for tracking atmospheric features over time or for specialized training strategies that require trajectory awareness.

Trajectories are defined by:

Trajectory start: The reference datetime from which trajectories begin
Trajectory length: The number of time steps in each trajectory

Each sample in the dataset is associated with a trajectory ID, ensuring that samples are correctly grouped and that trajectory boundaries are respected during training.
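
As an illustration of how a trajectory ID could follow from the trajectory start and length, the sketch below counts time steps since the start and groups them into blocks. The function name and the frequency argument are assumptions made for this example; the real TrajectoryDataset may compute IDs differently.

```python
from datetime import datetime, timedelta


def trajectory_id(
    date: datetime,
    trajectory_start: datetime,
    trajectory_length: int,
    frequency: timedelta,
) -> int:
    """Map a sample date to the index of the trajectory that contains it.

    Hypothetical helper for illustration; not part of anemoi-training.
    """
    # Whole time steps elapsed since the reference datetime.
    steps = (date - trajectory_start) // frequency
    # Consecutive blocks of `trajectory_length` steps share one ID.
    return steps // trajectory_length


start = datetime(2020, 1, 1)
six_hours = timedelta(hours=6)
# Four 6-hourly steps per trajectory, i.e. one trajectory per day.
first = trajectory_id(datetime(2020, 1, 1, 18), start, 4, six_hours)
second = trajectory_id(datetime(2020, 1, 2, 0), start, 4, six_hours)
```

Two samples belong to the same trajectory only when their IDs match, which is one way to keep a training window from crossing a trajectory boundary.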

Multi-Dataset

The MultiDataset class provides a higher-level wrapper that can synchronize and combine multiple datasets (either NativeGridDataset or TrajectoryDataset instances). This is the primary interface used for training and supports:

Synchronizing samples across multiple datasets with different grids
Managing distributed data loading across workers and communication groups
Shuffling and batching data for training
Handling grid sharding for distributed training

Note

To change the format of the batches passed to the model, subclass MultiDataset and override its __iter__ or get_sample method.
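
The override pattern can be sketched with a stand-in base class. StandInMultiDataset below is hypothetical and much simpler than the real MultiDataset; in particular, the get_sample signature shown here is an assumption, so check the actual class before copying this.

```python
class StandInMultiDataset:
    """Hypothetical stand-in; NOT the real anemoi.training MultiDataset."""

    def __init__(self, data: dict[str, list]) -> None:
        self.data = data  # dataset name -> list of samples

    def get_sample(self, index: int) -> dict:
        # Assumed behaviour: one synchronized sample per dataset name.
        return {name: samples[index] for name, samples in self.data.items()}

    def __iter__(self):
        n_samples = min(len(samples) for samples in self.data.values())
        for index in range(n_samples):
            yield self.get_sample(index)


class TupleBatchDataset(StandInMultiDataset):
    """Changes the batch format by overriding get_sample only."""

    def get_sample(self, index: int) -> tuple:
        sample = super().get_sample(index)
        # Emit (name, value) pairs in sorted name order instead of a dict.
        return tuple(sorted(sample.items()))


batches = list(TupleBatchDataset({"era5": [1, 2], "obs": [10, 20]}))
```

Overriding get_sample leaves the iteration logic untouched, which is usually the smaller change when only the per-sample layout needs to differ.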

API Reference

Dataset Readers

Multi-Dataset

class anemoi.training.data.multidataset.MultiDataset(data_readers: dict[str, BaseAnemoiReader], relative_date_indices: dict[str, slice | int | list[int] | ndarray], shuffle: bool = True, label: str = 'multi', epoch: int = 0, rollout: int = 1)

Bases: IterableDataset

Multi-dataset wrapper that returns synchronized samples from multiple data readers.

set_epoch(epoch: int, *, rollout: int | None = None, relative_date_indices: dict[str, slice | int | list[int] | ndarray] | None = None) → None: Set epoch-dependent sampling state before DataLoader workers are launched.
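
Setting the epoch before workers start is typically what lets every worker derive the same shuffle order for that epoch. The sketch below shows one way such epoch-dependent shuffling can work, using a seed offset by the epoch number; epoch_permutation and base_seed are hypothetical names, and MultiDataset's own seeding scheme may differ.

```python
import random


def epoch_permutation(n_samples: int, base_seed: int, epoch: int) -> list[int]:
    """Return a shuffled index order that depends only on seed and epoch.

    Hypothetical helper for illustration; not part of anemoi-training.
    """
    # A private generator avoids touching the global random state.
    rng = random.Random(base_seed + epoch)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices


# Any worker calling this with the same arguments gets the same order.
order_a = epoch_permutation(100, base_seed=42, epoch=0)
order_b = epoch_permutation(100, base_seed=42, epoch=0)
```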

property statistics: dict[str, dict]: Return combined statistics from all data readers.

property metadata: dict[str, dict]: Return combined metadata from all data readers.

property supporting_arrays: dict[str, dict]: Return combined supporting arrays from all data readers.

property variables: dict[str, list[str]]: Return combined variables from all data readers.

property data: dict: Return data from all data readers as dictionary.

property name_to_index: dict[str, dict]: Return combined name_to_index mapping from all data readers.

property resolution: dict[str, str]: Return combined resolution from all data readers.

property frequency: timedelta: Return combined frequency from all data readers.

set_comm_group_info(global_rank: int, model_comm_group_id: int, model_comm_group_rank: int, model_comm_num_groups: int, reader_group_rank: int, reader_group_size: int, shard_sizes: dict[str, list[int] | None]) → None

Set model and reader communication group information (called by DDPGroupStrategy).

Parameters:

global_rank (int) – Global rank
model_comm_group_id (int) – Model communication group ID
model_comm_group_rank (int) – Model communication group rank
model_comm_num_groups (int) – Number of model communication groups
reader_group_rank (int) – Reader group rank
reader_group_size (int) – Reader group size
shard_sizes (dict[str, list[int] | None]) – Shard sizes for all datasets, keyed by dataset name
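
To show how the group parameters relate to one another, the sketch below derives them from a global rank under the assumption that ranks are assigned to model communication groups in contiguous blocks of equal size. The function, world_size and group_size are hypothetical; the values actually passed in are computed by DDPGroupStrategy and may follow a different layout.

```python
def model_comm_group_info(
    global_rank: int, world_size: int, group_size: int
) -> dict[str, int]:
    """Derive group ID, in-group rank and group count for one process.

    Hypothetical helper assuming contiguous, equally sized groups.
    """
    if world_size % group_size != 0:
        raise ValueError("world_size must be divisible by group_size")
    return {
        "model_comm_group_id": global_rank // group_size,
        "model_comm_group_rank": global_rank % group_size,
        "model_comm_num_groups": world_size // group_size,
    }


# Eight processes in groups of four: rank 5 is the second member of group 1.
info = model_comm_group_info(global_rank=5, world_size=8, group_size=4)
```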

set_ens_comm_group_info(ens_comm_group_id: int, ens_comm_group_rank: int, ens_comm_num_groups: int) → None

Set ensemble communication group information (called by DDPGroupStrategy).

Parameters:

ens_comm_group_id (int) – Ensemble communication group ID
ens_comm_group_rank (int) – Ensemble communication group rank
ens_comm_num_groups (int) – Number of ensemble communication groups

per_worker_init(n_workers: int, worker_id: int) → None: Initialize all data readers for this worker.

property shard_shapes: dict[str, list]: Return shard shapes for all data readers.

get_shard_slice(dataset_name: str, reader_group_rank: int) → slice: Get the grid shard slice according to the reader rank.
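
A grid shard slice can be derived from a list of shard sizes by summing the sizes of the preceding shards. The sketch below assumes that layout; shard_slice is a hypothetical standalone function, not the method itself, and it takes the size list directly instead of looking it up by dataset name.

```python
def shard_slice(shard_sizes: list[int], reader_group_rank: int) -> slice:
    """Return the slice of grid points owned by one reader rank.

    Hypothetical helper for illustration; not part of anemoi-training.
    """
    # The shard starts where all lower-ranked shards end.
    start = sum(shard_sizes[:reader_group_rank])
    return slice(start, start + shard_sizes[reader_group_rank])


# Ten grid points split into shards of 4, 3 and 3 points.
grid = list(range(10))
middle = grid[shard_slice([4, 3, 3], reader_group_rank=1)]
```

Applied across all ranks in a reader group, the slices tile the grid without gaps or overlap.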