Data

This module is used to initialise the dataset (constructed using anemoi-datasets) and load in the data in to the model. It also performs a series of checks, for example, that the training dataset end date is before the start date of the validation dataset.

dataset.py contains functions which define how the dataset gets split between the workers (worker_init_func) and how the dataset is iterated across to produce the data batches that get fed as input into the model (__iter__).

Users wishing to change the format of the batch input into the model should sub-class NativeGridDataset and change the __iter__ function.

class anemoi.training.data.dataset.NativeGridDataset(data_reader: Callable, grid_indices: type[BaseGridIndices], rollout: int = 1, multistep: int = 1, timeincrement: int = 1, shuffle: bool = True, label: str = 'generic', effective_bs: int = 1)

Bases: IterableDataset

Iterable dataset for AnemoI data on the arbitrary grids.

property statistics: dict: Return dataset statistics.

property metadata: dict: Return dataset metadata.

property supporting_arrays: dict: Return dataset supporting_arrays.

property name_to_index: dict: Return dataset statistics.

property resolution: dict: Return dataset resolution.

property valid_date_indices: ndarray

Return valid date indices.

A date t is valid if we can sample the sequence: (t - multistep + 1, …, t + rollout)

without missing data (if time_increment is 1).

If there are no missing dates, total number of valid ICs is dataset length minus rollout minus additional multistep inputs (if time_increment is 1).

set_comm_group_info(global_rank: int, model_comm_group_id: int, model_comm_group_rank: int, model_comm_num_groups: int, reader_group_rank: int, reader_group_size: int) → None

Set model and reader communication group information (called by DDPGroupStrategy).

Parameters:

global_rank (int) – Global rank
model_comm_group_id (int) – Model communication group ID
model_comm_group_rank (int) – Model communication group rank
model_comm_num_groups (int) – Number of model communication groups
reader_group_rank (int) – Reader group rank
reader_group_size (int) – Reader group size

per_worker_init(n_workers: int, worker_id: int) → None

Called by worker_init_func on each copy of dataset.

This initialises after the worker process has been spawned.

Parameters:

n_workers (int) – Number of workers
worker_id (int) – Worker ID

anemoi.training.data.dataset.worker_init_func(worker_id: int) → None

Configures each dataset worker process.

Calls WeatherBenchDataset.per_worker_init() on each dataset object.

Parameters:: worker_id (int) – Worker ID
Raises:: RuntimeError – If worker_info is None