Introduction

The anemoi-datasets package allows you to create datasets for training data-driven weather models. The datasets are built using a recipe file, which is a YAML file that describes sources of meteorological fields as well as the operations to perform on them before they are written to a zarr file. The input of the process is a range of dates and some options to control the layout of the output. Statistics are computed as the dataset is built and stored in the metadata, along with other information such as the locations of the grid points, the list of variables, etc.

Building datasets

Concepts

date

Throughout this document, the term date refers to a date and time, not just a date. A training dataset covers a continuous range of dates with a given frequency. Missing dates are still part of the dataset, but the corresponding data are marked as missing using NaNs. Dates are always in UTC and refer to the date at which the data is valid. For accumulations and fluxes, that is the end of the accumulation period.

variable

A variable is a meteorological parameter, such as temperature, wind, etc. Multilevel parameters are treated as separate variables, one for each level. For example, temperature at 850 hPa and temperature at 500 hPa will be treated as two separate variables (t_850 and t_500).

field

A field is a variable at a given date. It is represented by an array of values at each grid point.

source

A source is a software component that, given a list of dates and variables, returns the corresponding fields. Examples of sources are ECMWF’s MARS archive, a collection of GRIB or NetCDF files, a database, etc. See Sources for more information.

filter

A filter is a software component that takes as input the output of a source or another filter and can modify the fields and/or their metadata. For example, typical filters are interpolations, renaming of variables, etc. See Filters for more information.
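
To make the concepts above concrete, the following is a minimal sketch in plain Python, not the package's own data model. It enumerates the dates of a one-day range at a 6-hour frequency, treats each level of a parameter as its own variable, and represents a field as the values of one variable at one date. The tiny three-point grid, the `date_range` helper and the choice of which date is missing are all hypothetical and only serve the illustration.

```python
import math
from datetime import datetime, timedelta, timezone


def date_range(start, end, frequency):
    """Return every date from start to end (inclusive) at the given frequency."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += frequency
    return dates


# Dates are date-times in UTC.
start = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 18, tzinfo=timezone.utc)
dates = date_range(start, end, timedelta(hours=6))

# Each level of a multilevel parameter is a separate variable.
variables = ["t_850", "t_500"]

# Hypothetical: pretend the source has no data valid at 12:00.
available = [d for d in dates if d.hour != 12]
n_grid_points = 3  # hypothetical, tiny grid

# A field is one variable at one date: an array of values at each grid point.
# A missing date stays in the dataset, with its values set to NaN.
fields = {}
for d in dates:
    for v in variables:
        fill = 0.0 if d in available else math.nan
        fields[(d, v)] = [fill] * n_grid_points

print(len(dates), "dates,", len(fields), "fields")
```

The range is inclusive at both ends, which is why a 00:00 to 18:00 range at a 6-hour frequency contains four dates.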

Operations

In order to build a training dataset, sources and filters are combined using the following operations:

join

The join is the process of combining several sources of data. Each source is expected to provide different variables for the same set of dates.

pipe

The pipe is the process of transforming fields using filters. The first step of a pipe is typically a source, a join, or another pipe. This can subsequently be followed by more filters.

concat

The concatenation is the process of combining different sets of operations that handle different dates. This is typically used to build a dataset that spans several years, when several sources are involved, each providing data for a different period.

Each operation is itself considered a source; operations can therefore be nested and combined to build complex datasets.
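
The way the three operations compose can be sketched with plain Python functions. This is a conceptual illustration only, not how anemoi-datasets implements them: here a source is modelled as a function from a list of dates to a dictionary of fields, and every name (`make_source`, `rename`, the variable names, the year-based split) is hypothetical.

```python
from datetime import datetime


def make_source(variables, value):
    """Toy source: maps a list of dates to fields keyed by (date, variable)."""
    def source(dates):
        return {(d, v): [value] for d in dates for v in variables}
    return source


def join(*sources):
    """Combine sources that provide different variables for the same dates."""
    def joined(dates):
        fields = {}
        for s in sources:
            fields.update(s(dates))
        return fields
    return joined


def pipe(source, *filters):
    """Pass the output of a source through a chain of filters."""
    def piped(dates):
        fields = source(dates)
        for f in filters:
            fields = f(fields)
        return fields
    return piped


def concat(*parts):
    """Combine (accepts, source) pairs, each one handling different dates."""
    def concatenated(dates):
        fields = {}
        for accepts, s in parts:
            fields.update(s([d for d in dates if accepts(d)]))
        return fields
    return concatenated


def rename(mapping):
    """Toy filter: rename variables, leaving the values untouched."""
    def apply(fields):
        return {(d, mapping.get(v, v)): x for (d, v), x in fields.items()}
    return apply


dates = [datetime(2023, 12, 31, 18), datetime(2024, 1, 1, 0)]

# Hypothetical: the older source names 2 m temperature "t2m", so a filter
# renames it to match the newer source.
old = pipe(join(make_source(["t2m", "msl"], 1.0), make_source(["t_850"], 2.0)),
           rename({"t2m": "2t"}))
new = join(make_source(["2t", "msl"], 1.0), make_source(["t_850"], 2.0))

dataset = concat((lambda d: d.year < 2024, old),
                 (lambda d: d.year >= 2024, new))
fields = dataset(dates)
print(sorted({v for _, v in fields}))
```

Because `join`, `pipe` and `concat` each return something with the same interface as a source, they can be nested freely, which is the property the paragraph above describes.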

Getting started

First recipe

The simplest recipe file must contain a dates section and an input section. The latter must contain a source. In this case, the source is mars:

dates:
  start: 2024-01-01T00:00:00Z
  end: 2024-01-01T18:00:00Z
  frequency: 6h

input:
  mars:
    param: [2t, msl, 10u, 10v, lsm]
    levtype: sfc
    grid: [1, 1]

To create the dataset, run the following command:

$ anemoi-datasets create recipe.yaml dataset.zarr

Once the build is complete, you can inspect the dataset using the following command:

$ anemoi-datasets inspect dataset.zarr
┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈
📦 Path          : dataset.zarr
🔢 Format version: 0.20.0

📅 Start      : 2024-01-01 00:00
📅 End        : 2024-01-01 18:00
⏰ Frequency  : 6h
🚫 Missing    : 0
🌎 Resolution : 1.0
🌎 Field shape: [181, 360]

📐 Shape      : 4 × 5 × 1 × 65,160 (5 MiB)
💽 Size       : 2.7 MiB (2,858,121)
📁 Files      : 34

   Index │ Variable │      Min │     Max │      Mean │    Stdev
   ──────┼──────────┼──────────┼─────────┼───────────┼─────────
       0 │ 10u      │ -24.3116 │   25.79 │ 0.0595319 │   5.5856
       1 │ 10v      │ -21.2397 │  21.851 │ -0.270924 │  4.23947
       2 │ 2t       │  214.979 │ 319.111 │   277.775 │  19.9318
       3 │ lsm      │        0 │       1 │  0.335152 │ 0.464236
       4 │ msl      │  95708.5 │  104284 │    100867 │  1452.67
   ──────┴──────────┴──────────┴─────────┴───────────┴─────────
🔋 Dataset ready, last update 2 hours ago.
📊 Statistics ready.
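
The numbers in the Shape line can be checked by hand. The sketch below reproduces them under two assumptions that are not stated in the output itself: that the third dimension (of size 1) is an ensemble axis, and that values are stored as 32-bit floats. The reported uncompressed size of about 5 MiB is consistent with the second assumption.

```python
import math

n_dates = 4                # 00:00, 06:00, 12:00 and 18:00 on 2024-01-01
n_variables = 5            # 10u, 10v, 2t, lsm, msl
n_ensemble = 1             # assumption: third dimension is an ensemble axis
field_shape = (181, 360)   # 1° global grid: latitudes 90 to -90, 360 longitudes

# The two grid dimensions are flattened into a single axis of grid points.
n_grid_points = math.prod(field_shape)
shape = (n_dates, n_variables, n_ensemble, n_grid_points)

bytes_per_value = 4        # assumption: 32-bit floating point values
uncompressed_mib = math.prod(shape) * bytes_per_value / 2**20

print(shape)
print(f"{uncompressed_mib:.2f} MiB uncompressed")
```

The Size line reports the space actually used on disk, which is smaller than the uncompressed figure.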

Adding a second source

To add a second source, you need to use the join operation. In this example, we add pressure-level variables to the previous recipe:

dates:
  start: 2024-01-01T00:00:00Z
  end: 2024-01-01T18:00:00Z
  frequency: 6h

input:
  join:
  - mars:
      param: [2t, msl, 10u, 10v, lsm]
      levtype: sfc
      grid: [1, 1]
  - mars:
      param: [q, t, z]
      levtype: pl
      level: [50, 100]
      grid: [1, 1]

This will build the following dataset:

┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈
📦 Path          : dataset.zarr
🔢 Format version: 0.20.0

📅 Start      : 2024-01-01 00:00
📅 End        : 2024-01-01 18:00
⏰ Frequency  : 6h
🚫 Missing    : 0
🌎 Resolution : 1.0
🌎 Field shape: [181, 360]

📐 Shape      : 4 × 11 × 1 × 65,160 (10.9 MiB)
💽 Size       : 5.7 MiB (5,995,688)
📁 Files      : 34

   Index │ Variable │         Min │         Max │        Mean │       Stdev
   ──────┼──────────┼─────────────┼─────────────┼─────────────┼────────────
       0 │ 10u      │    -24.3116 │       25.79 │   0.0595319 │      5.5856
       1 │ 10v      │    -21.2397 │      21.851 │   -0.270924 │     4.23947
       2 │ 2t       │     214.979 │     319.111 │     277.775 │     19.9318
       3 │ lsm      │           0 │           1 │    0.335152 │    0.464236
       4 │ msl      │     95708.5 │      104284 │      100867 │     1452.67
       5 │ q_100    │ 8.95676e-07 │ 5.19827e-06 │ 2.78594e-06 │ 5.39734e-07
       6 │ q_50     │ 1.89449e-06 │ 3.41429e-06 │ 3.00331e-06 │ 1.11219e-07
       7 │ t_100    │      186.33 │      233.74 │     209.958 │     12.4899
       8 │ t_50     │     191.921 │     241.239 │     213.774 │     12.3492
       9 │ z_100    │      146865 │      163937 │      157791 │     4962.71
      10 │ z_50     │      186876 │      204383 │      199752 │     4158.18
   ──────┴──────────┴─────────────┴─────────────┴─────────────┴────────────
🔋 Dataset ready, last update 19 seconds ago.
📊 Statistics ready.

Note

Note that pressure-level parameters are named param_level (for example, t_50 for temperature at 50 hPa). This is the default behaviour. See the remapping option for more information.
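
As an illustration of the default naming, the following sketch builds the variable names for the recipe above. The `expand_variables` helper is hypothetical and is not part of anemoi-datasets. The final sort is only there to reproduce the order seen in the inspect output, which appears to be alphabetical.

```python
def expand_variables(params, levels=None):
    """Return one variable name per parameter, or per (parameter, level) pair."""
    if levels is None:
        return list(params)
    return [f"{param}_{level}" for param in params for level in levels]


surface = expand_variables(["2t", "msl", "10u", "10v", "lsm"])
pressure = expand_variables(["q", "t", "z"], levels=[50, 100])

# Plain string sorting puts q_100 before q_50, as in the table above.
variables = sorted(surface + pressure)
for index, name in enumerate(variables):
    print(index, name)
```

Five surface parameters plus three parameters on two levels give the 11 variables reported in the Shape line.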

Adding some forcing variables

When training a data-driven model, some forcing variables may be required, such as the solar radiation, the time of day, the day of the year, etc.

These are provided by the forcings source. Let us add a few of them to the above example. The template option is used to point to another source, in this case the first instance of mars. This source is used to get information about the grid points, as some of the forcing variables are grid dependent.

dates:
  start: 2024-01-01T00:00:00Z
  end: 2024-01-01T18:00:00Z
  frequency: 6h
input:
  join:
  - mars:
      param: [2t, msl, 10u, 10v, lsm]
      levtype: sfc
      grid: [1, 1]
  - mars:
      param: [q, t, z]
      levtype: pl
      level: [50, 100]
      grid: [1, 1]
  - forcings:
      template: ${input.join.0.mars}
      param:
      - cos_latitude
      - sin_latitude
      - insolation

This will build the following dataset:

┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈
📦 Path          : dataset.zarr
🔢 Format version: 0.20.0

📅 Start      : 2024-01-01 00:00
📅 End        : 2024-01-01 18:00
⏰ Frequency  : 6h
🚫 Missing    : 0
🌎 Resolution : 1.0
🌎 Field shape: [181, 360]

📐 Shape      : 4 × 8 × 1 × 65,160 (8 MiB)
💽 Size       : 3.1 MiB (3,283,650)
📁 Files      : 34

   Index │ Variable     │         Min │      Max │      Mean │    Stdev
   ──────┼──────────────┼─────────────┼──────────┼───────────┼─────────
       0 │ 10u          │    -24.3116 │    25.79 │ 0.0595319 │   5.5856
       1 │ 10v          │    -21.2397 │   21.851 │ -0.270924 │  4.23947
       2 │ 2t           │     214.979 │  319.111 │   277.775 │  19.9318
       3 │ cos_latitude │ 6.12323e-17 │        1 │  0.633086 │ 0.310546
       4 │ insolation   │           0 │ 0.999995 │  0.231949 │ 0.299927
       5 │ lsm          │           0 │        1 │  0.335152 │ 0.464236
       6 │ msl          │     95708.5 │   104284 │    100867 │  1452.67
       7 │ sin_latitude │          -1 │        1 │         0 │ 0.709057
   ──────┴──────────────┴─────────────┴──────────┴───────────┴─────────
🔋 Dataset ready, last update 17 seconds ago.
📊 Statistics ready.

See forcings for more information about forcing variables.
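
To illustrate why some forcings depend on the grid, the sketch below recomputes cos_latitude and sin_latitude on a 1° global grid using only the Python standard library. It assumes that these forcings are simply the cosine and sine of the latitude in degrees, and that the statistics weight every grid point equally (population standard deviation). Neither assumption is confirmed by the text above, and this is not the package's implementation. The results come out close to the corresponding rows of the table above.

```python
import math
import statistics

# Assumed grid: latitudes from 90 down to -90 in 1° steps, 360 longitudes each.
latitudes = [90 - i for i in range(181)]
n_longitudes = 360

# One value per grid point; the value depends on the latitude only.
cos_latitude = [math.cos(math.radians(lat)) for lat in latitudes for _ in range(n_longitudes)]
sin_latitude = [math.sin(math.radians(lat)) for lat in latitudes for _ in range(n_longitudes)]

for name, values in [("cos_latitude", cos_latitude), ("sin_latitude", sin_latitude)]:
    mean = math.fsum(values) / len(values)
    stdev = statistics.pstdev(values)
    print(f"{name}: min={min(values):.6g} max={max(values):.6g} "
          f"mean={mean:.6f} stdev={stdev:.6f}")
```

The minimum of cos_latitude is a tiny positive number rather than exactly zero because π/2 cannot be represented exactly in floating point. Forcings such as insolation also depend on the date, so they cannot be reproduced from the grid alone.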