Managing missing dates

Managing missing dates with anemoi-training

Anemoi-training has internal handling of missing dates, and will calculate the valid date indices used during training using the missing property. Consequenctly, when training a model with anemoi-training, you should not specify a method to deal with missing dates in the dataloader configuration file.

Filling the missing dates with artificial values

When you have missing dates in a dataset, you can fill them with artificial values. You can either fill them with values that are the result of a linear interpolation between the two closest dates:

ds = open_dataset(dataset, fill_missing_dates="interpolate")

Or you can select the copy the value of the closest date:

ds = open_dataset(dataset, fill_missing_dates="closest")

if the missing date is exactly in the middle of two dates, the library will choose that value of the largest date. You can change this behavior by setting the closest parameter to 'down' or 'up' explicitly.

Skipping missing when iterating over a dataset

If you iterate over a dataset that has missing dates, the library will raise a MissingDatesError exception if you attempt to access a missing date.

This code below will throw an exception if ds[i] or ds[i+1] are missing dates. Because we iterate over the whole dataset, we are guaranteed to fail if there are any missing dates.

ds = open_dataset(dataset)

for i in range(len(ds) - 1):
    ds = ds[i + 1] - ds[i]

You can skip missing dates by setting the skip_missing_dates option to True. You will have to also provide a hint about how you intent to iterate over the dataset. The hint is given using the parameter expected_access which takes a slice as argument.

The library will check the slice against the missing dates and insure that, when iterating over the dataset with that slice, no missing dates are accessed.

The algorithm is illustrated in the picture below. The cells represents the dates in the dataset, and the red cells are the missing dates. Given expected_access=slice(0, 2), the library will consider each group of matching dates that are not missing (in blue). The interval between each dates of a group is guaranteed to be constant across all groups.

../_images/skip-missing.png
ds = open_dataset(
    dataset,
    skip_missing_dates=True,
    expected_access=slice(0, 2),
)

for i in range(len(ds)):
    xi, xi_1 = ds[i]
    dx = xi_1 - xi

The code above will not raise an exception, even if there are missing dates. The slice(0, 2) represents the i and i+1 indices in the loop (the Python slice is similar to Python’s range, as the first bound in included while the last bound is excluded).

You can also provide a single integer to the `expected_access parameter. The two forms below are identical:

expected_access = slice(0, 2)
expected_access = 2

Concatenating datasets with gaps between them

When you concatenate two or more datasets, the library will check that the dates are contiguous, i.e. that the last date of a dataset is one frequency before the first date of the next dataset.

If the dates are not contiguous, the library will raise an error. You can force the concatenation by setting the fill_missing_gaps option:

ds = open_dataset(concat=[dataset1, dataset2, ...], fill_missing_gaps=True)

If there is a gap between the datasets, the library will fill the gap by creating a virtual dataset with only missing dates, and add it between the datasets to make the dates contiguous.

Debugging

You can set missing dates using the set_missing_dates option. This option is for debugging purposes only.

ds = open_dataset(dataset, set_missing_dates=["2010-01-01T12:00:00", "2010-02-01T12:00:00"])