Introduction
The anemoi-datasets package allows you to create datasets for training data-driven weather models. The datasets are built using a recipe file, which is a YAML file that describes sources of meteorological fields as well as the operations to perform on them before they are written to a zarr file. The input of the process is a range of dates and some options to control the layout of the output. Statistics are computed as the dataset is built and stored in the metadata, together with other information such as the locations of the grid points, the list of variables, etc.

Concepts
- date
Throughout this document, the term date refers to a date and time, not just a date. A training dataset covers a continuous range of dates with a given frequency. Missing dates are still part of the dataset, but missing data are marked as such using NaNs. Dates are always in UTC, and refer to the date at which the data are valid. For accumulations and fluxes, that is the end of the accumulation period.
- variable
A variable is a meteorological parameter, such as temperature, wind, etc. Multilevel parameters are treated as separate variables, one for each level. For example, temperature at 850 hPa and temperature at 500 hPa will be treated as two separate variables (t_850 and t_500).
- field
A field is a variable at a given date. It is represented by an array of values at each grid point.
- source
The source is a software component that, given a list of dates and variables, returns the corresponding fields. Examples of sources are ECMWF’s MARS archive, a collection of GRIB or NetCDF files, a database, etc. See Sources for more information.
- filter
A filter is a software component that takes as input the output of a source or another filter and can modify the fields and/or their metadata. For example, typical filters are interpolations, renaming of variables, etc. See Filters for more information.
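The date concept can be illustrated with a minimal sketch. It uses only the Python standard library and is not the package's implementation: the helper date_range and the toy values are hypothetical. It builds the continuous range of dates implied by a start, an end and a frequency, and shows a missing date keeping its slot and being marked with NaN.

```python
import math
from datetime import datetime, timedelta, timezone


def date_range(start, end, frequency):
    """Return every date from start to end (inclusive), spaced by frequency."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += frequency
    return dates


# Dates are date-times in UTC.
start = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 18, tzinfo=timezone.utc)
dates = date_range(start, end, timedelta(hours=6))

# Hypothetical toy data: one value per date for a single grid point.
# The 12:00 date has no data.
available = {dates[0]: 280.1, dates[1]: 281.4, dates[3]: 279.8}

# A missing date stays in the series and is filled with NaN.
series = [available.get(d, math.nan) for d in dates]

for d, value in zip(dates, series):
    print(d.isoformat(), value)
```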
Operations
In order to build a training dataset, sources and filters are combined using the following operations:
- join
The join is the process of combining several sources of data. Each source is expected to provide different variables for the same list of dates.
- pipe
The pipe is the process of transforming fields using filters. The first step of a pipe is typically a source, a join, or another pipe. This can subsequently be followed by more filters.
- concat
The concatenation is the process of combining different sets of operations that handle different dates. This is typically used to build a dataset that spans several years, when several sources are involved, each providing data for a different period.
Each operation is considered as a source, therefore operations can be combined to build complex datasets.
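The way these operations compose can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the package's code: a source is modelled as a function that maps a list of dates to a dictionary of fields, dates are plain integers (years), and make_source, rename and all values are hypothetical. Because join, pipe and concat each return something that behaves like a source, they can be nested freely.

```python
def make_source(variables, value):
    """Toy source: maps a list of dates to {(date, variable): value}."""
    def source(dates):
        return {(d, v): value for d in dates for v in variables}
    return source


def join(*sources):
    """Combine sources that provide different variables for the same dates."""
    def joined(dates):
        fields = {}
        for source in sources:
            fields.update(source(dates))
        return fields
    return joined


def pipe(source, *filters):
    """Pass the output of a source through a chain of filters."""
    def piped(dates):
        fields = source(dates)
        for apply_filter in filters:
            fields = apply_filter(fields)
        return fields
    return piped


def concat(*parts):
    """Combine (accepts, source) pairs, each handling a different set of dates."""
    def concatenated(dates):
        fields = {}
        for accepts, source in parts:
            fields.update(source([d for d in dates if accepts(d)]))
        return fields
    return concatenated


def rename(old, new):
    """Toy filter: rename one variable and leave the others untouched."""
    def apply_filter(fields):
        return {(d, new if v == old else v): x for (d, v), x in fields.items()}
    return apply_filter


surface = make_source(["2t", "msl"], 1.0)
upper = make_source(["t_850"], 2.0)

# Before 2020 the variable 2t is renamed; from 2020 onwards it is kept as is.
early = pipe(join(surface, upper), rename("2t", "t2m"))
late = join(surface, upper)
dataset = concat((lambda d: d < 2020, early), (lambda d: d >= 2020, late))

fields = dataset([2019, 2020])
for key in sorted(fields):
    print(key, fields[key])
```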
Getting started
First recipe
The simplest recipe file must contain a dates section and an input section. The latter must contain a source. In this case, the source is mars:
dates:
  start: 2024-01-01T00:00:00Z
  end: 2024-01-01T18:00:00Z
  frequency: 6h
input:
  mars:
    param: [2t, msl, 10u, 10v, lsm]
    levtype: sfc
    grid: [1, 1]
To create the dataset, run the following command:
$ anemoi-datasets create recipe.yaml dataset.zarr
Once the build is complete, you can inspect the dataset using the following command:
$ anemoi-datasets inspect dataset.zarr
┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈
📦 Path : dataset.zarr
🔢 Format version: 0.20.0
📅 Start : 2024-01-01 00:00
📅 End : 2024-01-01 18:00
⏰ Frequency : 6h
🚫 Missing : 0
🌎 Resolution : 1.0
🌎 Field shape: [181, 360]
📐 Shape : 4 × 5 × 1 × 65,160 (5 MiB)
💽 Size : 2.7 MiB (2,858,121)
📁 Files : 34
Index │ Variable │ Min │ Max │ Mean │ Stdev
──────┼──────────┼──────────┼─────────┼───────────┼─────────
0 │ 10u │ -24.3116 │ 25.79 │ 0.0595319 │ 5.5856
1 │ 10v │ -21.2397 │ 21.851 │ -0.270924 │ 4.23947
2 │ 2t │ 214.979 │ 319.111 │ 277.775 │ 19.9318
3 │ lsm │ 0 │ 1 │ 0.335152 │ 0.464236
4 │ msl │ 95708.5 │ 104284 │ 100867 │ 1452.67
──────┴──────────┴──────────┴─────────┴───────────┴─────────
🔋 Dataset ready, last update 2 hours ago.
📊 Statistics ready.
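The shape reported above can be cross-checked against the recipe with a little arithmetic. The sketch below uses only the standard library and rests on assumptions not stated in the output itself: values are taken to be 32-bit floats, the grid is assumed to include both poles, and the third dimension (of size 1 here) is assumed to be the ensemble dimension.

```python
from datetime import datetime, timedelta

# Values copied from the recipe.
start = datetime(2024, 1, 1, 0)
end = datetime(2024, 1, 1, 18)
frequency = timedelta(hours=6)
params = ["2t", "msl", "10u", "10v", "lsm"]
grid_step = 1  # degrees, from grid: [1, 1]

# Both the start and the end dates are part of the dataset.
n_dates = (end - start) // frequency + 1

# Regular latitude-longitude grid: poles included, no repeated meridian.
n_lat = 180 // grid_step + 1
n_lon = 360 // grid_step
n_points = n_lat * n_lon

shape = (n_dates, len(params), 1, n_points)

# Assumption: 4 bytes per value (float32).
n_bytes = n_dates * len(params) * 1 * n_points * 4
mib = n_bytes / 2**20

print(shape, f"{mib:.1f} MiB")
```

Under these assumptions the result agrees with the Field shape and Shape lines of the inspect output.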
Adding a second source
To add a second source, you need to use the join operation. In this example, we add pressure level variables to the previous example:
dates:
  start: 2024-01-01T00:00:00Z
  end: 2024-01-01T18:00:00Z
  frequency: 6h
input:
  join:
    - mars:
        param: [2t, msl, 10u, 10v, lsm]
        levtype: sfc
        grid: [1, 1]
    - mars:
        param: [q, t, z]
        levtype: pl
        level: [50, 100]
        grid: [1, 1]
This will build the following dataset:
┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈
📦 Path : dataset.zarr
🔢 Format version: 0.20.0
📅 Start : 2024-01-01 00:00
📅 End : 2024-01-01 18:00
⏰ Frequency : 6h
🚫 Missing : 0
🌎 Resolution : 1.0
🌎 Field shape: [181, 360]
📐 Shape : 4 × 11 × 1 × 65,160 (10.9 MiB)
💽 Size : 5.7 MiB (5,995,688)
📁 Files : 34
Index │ Variable │ Min │ Max │ Mean │ Stdev
──────┼──────────┼─────────────┼─────────────┼─────────────┼────────────
0 │ 10u │ -24.3116 │ 25.79 │ 0.0595319 │ 5.5856
1 │ 10v │ -21.2397 │ 21.851 │ -0.270924 │ 4.23947
2 │ 2t │ 214.979 │ 319.111 │ 277.775 │ 19.9318
3 │ lsm │ 0 │ 1 │ 0.335152 │ 0.464236
4 │ msl │ 95708.5 │ 104284 │ 100867 │ 1452.67
5 │ q_100 │ 8.95676e-07 │ 5.19827e-06 │ 2.78594e-06 │ 5.39734e-07
6 │ q_50 │ 1.89449e-06 │ 3.41429e-06 │ 3.00331e-06 │ 1.11219e-07
7 │ t_100 │ 186.33 │ 233.74 │ 209.958 │ 12.4899
8 │ t_50 │ 191.921 │ 241.239 │ 213.774 │ 12.3492
9 │ z_100 │ 146865 │ 163937 │ 157791 │ 4962.71
10 │ z_50 │ 186876 │ 204383 │ 199752 │ 4158.18
──────┴──────────┴─────────────┴─────────────┴─────────────┴────────────
🔋 Dataset ready, last update 19 seconds ago.
📊 Statistics ready.
Note
Please note that the pressure-level parameters are named param_level. This is the default behaviour. See the remapping option for more information.
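The naming rule can be reproduced with a short sketch. The alphabetical ordering used here is an assumption inferred from the table above, not documented behaviour, and the code is illustrative rather than the package's own.

```python
# Values copied from the recipe.
surface_params = ["2t", "msl", "10u", "10v", "lsm"]
pressure_params = ["q", "t", "z"]
levels = [50, 100]

# Each pressure-level parameter becomes one variable per level: param_level.
pressure_variables = [f"{p}_{lev}" for p in pressure_params for lev in levels]

# Assumption: variables are ordered alphabetically, as the table suggests.
variables = sorted(surface_params + pressure_variables)

for index, name in enumerate(variables):
    print(index, name)
```

Plain string sorting places q_100 before q_50, which is the order seen in the inspect output.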
Adding some forcing variables
When training a data-driven model, some forcing variables may be required, such as the solar radiation, the time of day, the day of the year, etc. These are provided by the forcings source. Let us add a few of them to the above example. The template option is used to point to another source, in this case the first instance of mars. This source is used to get information about the grid points, as some of the forcing variables are grid dependent.
dates:
  start: 2024-01-01T00:00:00Z
  end: 2024-01-01T18:00:00Z
  frequency: 6h
input:
  join:
    - mars:
        param: [2t, msl, 10u, 10v, lsm]
        levtype: sfc
        grid: [1, 1]
    - mars:
        param: [q, t, z]
        levtype: pl
        level: [50, 100]
        grid: [1, 1]
    - forcings:
        template: ${input.join.0.mars}
        param:
          - cos_latitude
          - sin_latitude
          - insolation
This will build the following dataset:
┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈
📦 Path : dataset.zarr
🔢 Format version: 0.20.0
📅 Start : 2024-01-01 00:00
📅 End : 2024-01-01 18:00
⏰ Frequency : 6h
🚫 Missing : 0
🌎 Resolution : 1.0
🌎 Field shape: [181, 360]
📐 Shape : 4 × 8 × 1 × 65,160 (8 MiB)
💽 Size : 3.1 MiB (3,283,650)
📁 Files : 34
Index │ Variable │ Min │ Max │ Mean │ Stdev
──────┼──────────────┼─────────────┼──────────┼───────────┼─────────
0 │ 10u │ -24.3116 │ 25.79 │ 0.0595319 │ 5.5856
1 │ 10v │ -21.2397 │ 21.851 │ -0.270924 │ 4.23947
2 │ 2t │ 214.979 │ 319.111 │ 277.775 │ 19.9318
3 │ cos_latitude │ 6.12323e-17 │ 1 │ 0.633086 │ 0.310546
4 │ insolation │ 0 │ 0.999995 │ 0.231949 │ 0.299927
5 │ lsm │ 0 │ 1 │ 0.335152 │ 0.464236
6 │ msl │ 95708.5 │ 104284 │ 100867 │ 1452.67
7 │ sin_latitude │ -1 │ 1 │ 0 │ 0.709057
──────┴──────────────┴─────────────┴──────────┴───────────┴─────────
🔋 Dataset ready, last update 17 seconds ago.
📊 Statistics ready.
See forcings for more information about forcing variables.
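To illustrate why some forcing variables depend on the grid, the sketch below recomputes the statistics of cos_latitude and sin_latitude. It is a consistency check using only the standard library, not the package's implementation. It assumes a regular 1-degree grid with latitudes running from 90 to -90, which is consistent with the field shape [181, 360] reported above. Insolation is left out because its formula is not given in this document.

```python
import math

# Assumption: 181 latitude rows from 90 to -90 degrees in 1-degree steps.
latitudes = [90 - i for i in range(181)]

cos_lat = [math.cos(math.radians(lat)) for lat in latitudes]
sin_lat = [math.sin(math.radians(lat)) for lat in latitudes]


def mean(values):
    return sum(values) / len(values)


def stdev(values):
    """Population standard deviation."""
    m = mean(values)
    return math.sqrt(mean([(v - m) ** 2 for v in values]))


# Every latitude row is repeated for each of the 360 longitudes, so the
# statistics over the rows equal the statistics over the whole grid.
print("cos_latitude", min(cos_lat), max(cos_lat), mean(cos_lat), stdev(cos_lat))
print("sin_latitude", min(sin_lat), max(sin_lat), mean(sin_lat), stdev(sin_lat))
```

Under this assumption the means and standard deviations land close to the values in the table, and the minimum of cos_latitude is a tiny positive number rather than exactly zero, because 90 degrees converted to radians is not exactly representable in floating point.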