Preprocessing

The preprocessing module is used to pre- and post-process the data. Preprocessors are applied to the input data before it is passed to the model, and postprocessors are applied to the output data after it has been produced by the model and (in training) after the training loss has been calculated. The module contains the following classes:

class anemoi.models.preprocessing.BasePreprocessor(config=None, data_indices: IndexCollection | None = None, statistics: dict | None = None)

Bases: Module

Base class for data pre- and post-processors.

forward(x, in_place: bool = True, inverse: bool = False, **kwargs) Tensor

Process the input tensor.

Parameters:
  • x (torch.Tensor) – Input tensor

  • in_place (bool) – Whether to process the tensor in place

  • inverse (bool) – Whether to inverse transform the input

  • **kwargs – Additional keyword arguments to pass to transform/inverse_transform

Returns:

Processed tensor

Return type:

torch.Tensor

transform(x, in_place: bool = True, **kwargs) Tensor

Process the input tensor.

inverse_transform(x, in_place: bool = True, **kwargs) Tensor

Inverse process the input tensor.
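The relationship between forward, transform and inverse_transform described above can be sketched as follows. This is a minimal illustration on plain Python lists rather than torch tensors; the class name SketchPreprocessor and its scale parameter are hypothetical and not part of anemoi-models. It assumes forward simply dispatches on the inverse flag and that in_place=False means working on a copy.

```python
from __future__ import annotations


class SketchPreprocessor:
    """Hypothetical stand-in for a BasePreprocessor subclass (lists, not tensors)."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def transform(self, x: list[float], in_place: bool = True, **kwargs) -> list[float]:
        # Work on the caller's list when in_place is True, otherwise on a copy.
        out = x if in_place else list(x)
        for i, value in enumerate(out):
            out[i] = value * self.scale
        return out

    def inverse_transform(self, x: list[float], in_place: bool = True, **kwargs) -> list[float]:
        out = x if in_place else list(x)
        for i, value in enumerate(out):
            out[i] = value / self.scale
        return out

    def forward(self, x: list[float], in_place: bool = True, inverse: bool = False, **kwargs) -> list[float]:
        # Assumed behaviour: dispatch on `inverse` and pass extra kwargs through.
        if inverse:
            return self.inverse_transform(x, in_place=in_place, **kwargs)
        return self.transform(x, in_place=in_place, **kwargs)


data = [1.0, 2.0, 3.0]
proc = SketchPreprocessor(scale=2.0)
scaled = proc.forward(data, in_place=False)
restored = proc.forward(scaled, in_place=False, inverse=True)
```

With in_place=False the original list is left untouched, which is the safer choice whenever the unprocessed data is still needed afterwards.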

class anemoi.models.preprocessing.Processors(processors: list, inverse: bool = False)

Bases: Module

A collection of processors.

forward(x, in_place: bool = True, **kwargs) Tensor

Process the input tensor.

Parameters:
  • x (torch.Tensor) – Input tensor

  • in_place (bool) – Whether to process the tensor in place

  • **kwargs – Additional keyword arguments to pass to processors

Returns:

Processed tensor

Return type:

torch.Tensor

class anemoi.models.preprocessing.StepwiseProcessors(lead_times: list[str])

Bases: Module

Ordered container for per-step processors that can include missing steps.

Normalizer

The normalizer module is used to normalize the data. The module contains the following classes:

class anemoi.models.preprocessing.normalizer.InputNormalizer(config=None, data_indices: IndexCollection | None = None, statistics: dict | None = None)

Bases: BasePreprocessor

Normalizes input data with a configurable method.

transform(x: Tensor, in_place: bool = True, data_index: Tensor | None = None) Tensor

Normalizes an input tensor x of shape […, nvars].

Normalization done in-place unless specified otherwise.

The default use case assumes either the full batch tensor or the full input tensor. A data_index based on the full data can be supplied to choose which variables to normalise.

Parameters:
  • x (torch.Tensor) – Data to normalize

  • in_place (bool, optional) – Normalize in-place, by default True

  • data_index (Optional[torch.Tensor], optional) – Normalize only the specified indices, by default None

Returns:

Normalized data

Return type:

torch.Tensor

inverse_transform(x: Tensor, in_place: bool = True, data_index: Tensor | None = None) Tensor

Denormalizes an input tensor x of shape […, nvars | nvars_pred].

Denormalization done in-place unless specified otherwise.

The default use case assumes either the full batch tensor or the full output tensor. A data_index based on the full data can be supplied to choose which variables to denormalise.

Parameters:
  • x (torch.Tensor) – Data to denormalize

  • in_place (bool, optional) – Denormalize in-place, by default True

  • data_index (Optional[torch.Tensor], optional) – Denormalize only the specified indices, by default None

Returns:

Denormalized data

Return type:

torch.Tensor
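As an illustration of the transform and inverse_transform pair, here is a minimal sketch of mean-std normalization on a single row of variables. It uses plain Python lists instead of tensors, the statistics are made-up numbers, and it assumes that data_index lists, for each column of x, the position of that variable in the full statistics. The function names are hypothetical.

```python
from __future__ import annotations

MEAN = [280.0, 1000.0, 5.0]  # hypothetical per-variable means
STDEV = [10.0, 50.0, 2.0]    # hypothetical per-variable standard deviations


def normalize(row: list[float], data_index: list[int] | None = None) -> list[float]:
    # Without data_index the row is assumed to hold every variable, in order.
    idx = range(len(MEAN)) if data_index is None else data_index
    return [(value - MEAN[i]) / STDEV[i] for value, i in zip(row, idx)]


def denormalize(row: list[float], data_index: list[int] | None = None) -> list[float]:
    idx = range(len(MEAN)) if data_index is None else data_index
    return [value * STDEV[i] + MEAN[i] for value, i in zip(row, idx)]


full = normalize([290.0, 950.0, 5.0])
subset = normalize([290.0, 5.0], data_index=[0, 2])  # only variables 0 and 2
round_trip = denormalize(full)
```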

Remapper

The remapper module is used to do in-place transformations of the data using a set of predefined transforms and their inverses. This process is crucial for variables with pathological distributions, such as variables with sharp peaks, long tails or other non-Gaussian shapes. It is especially important for diffusion models where the data distribution interacts with the noise distribution.

Note

The remapper module enables only single-variable transformations. Multi-variable transformations (such as (ws, wdir) -> (u, v)) are not supported for memory reasons and must be performed at the level of the datasets.

The remapper module supports the following transformations:

  • none (no transformation)

  • affine (x -> scale * x + shift)

  • log1p (log(1+x))

  • sqrt

  • boxcox ((x^lambd - 1) / lambd, or log(x) if lambd == 0)

  • power (x^lambd)

  • atanh (atanh(tanh(rho) * (2x - 1)) / rho)

  • asinh (asinh(x))

  • displace_boundary_atoms (shifts precise boundary peaks away from other values to give the model a non-zero width bucket to model them)

Several remappers can be applied one after the other in a chain. The order of the remappers is important, as the output of one remapper is the input to the next remapper. Remappers must be applied after the normalizer, as the normalizer relies on the computed statistics of the dataset.
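Why the order matters can be seen in a small sketch of a chain acting on a single float. The helper names apply_chain and invert_chain are hypothetical; the point is only that undoing a chain requires applying the inverses in reverse order.

```python
import math

# Each entry is a (forward, inverse) pair acting on one float.
chain = [
    (math.log1p, math.expm1),                              # log1p and its inverse
    (lambda v: 2.0 * v + 1.0, lambda v: (v - 1.0) / 2.0),  # affine with scale=2, shift=1
]


def apply_chain(x: float) -> float:
    # The output of one step is the input of the next.
    for forward, _ in chain:
        x = forward(x)
    return x


def invert_chain(y: float) -> float:
    # Inverses are applied last-to-first.
    for _, inverse in reversed(chain):
        y = inverse(y)
    return y


encoded = apply_chain(3.0)
decoded = invert_chain(encoded)
```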

Example transform functions:

../_images/preprocessing_remapper_boxcox.png

Box-Cox remapper transform function examples with λ = [-2, -1.8, … , 2]. Negative λ is blue, λ = 0 red, 0 < λ < 1 purple, λ = 1 (linear) dashed black, and λ > 1 green. Input values must be positive.

../_images/preprocessing_remapper_power.png

Power remapper transform function examples.

../_images/preprocessing_remapper_atanh.png

Atanh remapper transform function examples.

Example configuration:

data:
   processors:
      normalizer:
        _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
        config:
          default: "mean-std"
          max: ["tp","tcc"]
      remapper1:
        _target_: anemoi.models.preprocessing.remapper.Remapper
        config:
          power: ["tp"]
          atanh: ["tcc"]
          method_kwargs:
            power:
              lambd: 0.1
              tangent_linear_above_one: true
            atanh:
              rho: 3.0
      remapper2:
        _target_: anemoi.models.preprocessing.remapper.Remapper
        config:
          affine: ["tp"]
          displace_boundary_atoms: ["tcc"]
          method_kwargs:
            affine:
              scale: 2.0
            displace_boundary_atoms:
              lower_atom: -1.0
              lower_target: -1.5
              upper_atom: 1.0
              upper_target: 1.5
              eps: 1e-4
      remapper3:
        _target_: anemoi.models.preprocessing.remapper.Remapper
        config:
          displace_boundary_atoms: ["tp"]
          method_kwargs:
            displace_boundary_atoms:
              lower_atom: 0
              lower_target: -1
              eps: 1e-7

The module contains the following classes and functions:

class anemoi.models.preprocessing.remapper.Remapper(config=None, data_indices: IndexCollection | None = None, statistics: dict | None = None)

Bases: BasePreprocessor

Remap and convert single variables.

transform(x, in_place: bool = True) Tensor

Process the input tensor.

inverse_transform(x, in_place: bool = True) Tensor

Inverse process the input tensor.

anemoi.models.preprocessing.mappings.noop(x)

No operation.

anemoi.models.preprocessing.mappings.affine_transform(x, scale=1.0, shift=0.0)

Applies a scale and shift to the input tensor.

anemoi.models.preprocessing.mappings.displace_boundary_atoms(x, lower_atom=None, upper_atom=None, lower_target=None, upper_target=None, eps=0.0)

Displaces exact boundary values to target values (outside of the original range) to give the model flexibility to model them as imprecise peaks instead of delta functions. The reverse transform clamps imprecise predicted values that fall outside the original range back to the original boundary values. Can be used on the lower bound, the upper bound, or both.

Parameters:
  • x (torch.Tensor) – Input tensor

  • lower_atom (float, optional) – Lower boundary atom

  • upper_atom (float, optional) – Upper boundary atom

  • lower_target (float, optional) – Target value for lower boundary atom

  • upper_target (float, optional) – Target value for upper boundary atom

  • eps (float, optional) – Epsilon value around the atoms for numerical stability. Default is 0.0.

anemoi.models.preprocessing.mappings.inverse_displace_boundary_atoms(x, lower_atom=None, upper_atom=None, lower_target=None, upper_target=None, eps=None)

Clamps the values back to the original range, to the original boundary values. Can be used on lower bound, upper bound, or both.
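The displace and clamp pair can be sketched for a single float as follows. This is an assumption-laden illustration, not the library code: the function names are hypothetical, and the exact comparison used with eps (here, anything within eps of an atom counts as the atom) is a guess.

```python
from __future__ import annotations


def displace(x: float, lower_atom: float | None = None, upper_atom: float | None = None,
             lower_target: float | None = None, upper_target: float | None = None,
             eps: float = 0.0) -> float:
    # Assumed rule: values at (or within eps of) an atom jump to its target.
    if lower_atom is not None and lower_target is not None and x <= lower_atom + eps:
        return lower_target
    if upper_atom is not None and upper_target is not None and x >= upper_atom - eps:
        return upper_target
    return x


def restore(x: float, lower_atom: float | None = None, upper_atom: float | None = None) -> float:
    # Anything predicted outside the original range is clamped to the boundary.
    if lower_atom is not None and x < lower_atom:
        return lower_atom
    if upper_atom is not None and x > upper_atom:
        return upper_atom
    return x


kwargs = dict(lower_atom=-1.0, upper_atom=1.0, lower_target=-1.5, upper_target=1.5, eps=1e-4)
moved = displace(-1.0, **kwargs)       # exact lower atom is displaced
untouched = displace(0.3, **kwargs)    # interior values pass through
clamped = restore(-1.2, lower_atom=-1.0, upper_atom=1.0)
```

The gap between atom and target gives a prediction such as -1.2 somewhere to land that still decodes to the exact boundary value.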

anemoi.models.preprocessing.mappings.boxcox_converter(x, lambd=0.5, clip_negative=False)

Convert positive var into boxcox(var) = (x^lambd - 1) / lambd

Special cases:

  • lambd == 0 -> log(x)

  • lambd == 1 -> x - 1

Notes

  • Choose lambd < 1 to create a real gap/endpoint basin.

  • If lambd == 1, this reduces to a bounded smooth transform with no gap.

Parameters:
  • x (torch.Tensor) – Input tensor

  • lambd (float, optional) – Lambda parameter for the boxcox transform. Default is 0.5.

  • clip_negative (bool, optional) – Whether to clip negative values to 0. Default is False.

anemoi.models.preprocessing.mappings.inverse_boxcox_converter(x, lambd=0.5, clip_negative=None)

Convert back boxcox(var) to var.
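For reference, the Box-Cox formula and its algebraic inverse can be written for a single positive float as below. This is a scalar sketch with hypothetical function names; it leaves out clip_negative and any tensor handling.

```python
import math


def boxcox(x: float, lambd: float = 0.5) -> float:
    # lambd == 0 is the limiting case log(x).
    if lambd == 0:
        return math.log(x)
    return (x ** lambd - 1.0) / lambd


def inverse_boxcox(y: float, lambd: float = 0.5) -> float:
    # Solve y = (x^lambd - 1) / lambd for x.
    if lambd == 0:
        return math.exp(y)
    return (lambd * y + 1.0) ** (1.0 / lambd)


half = boxcox(4.0, lambd=0.5)     # (sqrt(4) - 1) / 0.5
linear = boxcox(3.0, lambd=1.0)   # x - 1
round_trip = inverse_boxcox(boxcox(7.0, lambd=0.3), lambd=0.3)
```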

anemoi.models.preprocessing.mappings.power_transform(x, lambd=0.33, clip_negative=False, tangent_linear_above_one=False)

Apply a power transform.

Parameters:
  • x (torch.Tensor) – Input tensor

  • lambd (float) – Exponent for the power transform. Default is 0.33.

  • clip_negative (bool, optional) – Whether to clip negative values to 0. Default is False.

  • tangent_linear_above_one (bool, optional) – Whether to use a tangent-linear extension above 1 instead of the power transform. Useful for max-scaled variables where we still might want to predict values above max without clamping them to max and without blowing them up with the power transform. Default is False.

anemoi.models.preprocessing.mappings.inverse_power_transform(x, lambd=0.33, clip_negative=False, tangent_linear_above_one=False)

Inverse power transform with optional inverse tangent-linear branch above 1.

Parameters:
  • x (torch.Tensor) – Input tensor

  • lambd (float) – Exponent for the power transform. Default is 0.33.

  • clip_negative (bool, optional) – Accepted for symmetry with power_transform but not used in the inverse since the output is already clamped to non-negative values. Default is False.

  • tangent_linear_above_one (bool, optional) – Whether to use the inverse tangent-linear extension above 1. Default is False.
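A scalar sketch of the power transform with the tangent-linear branch is shown below. It assumes the extension is the tangent to x^lambd at x = 1 (value 1, slope lambd); that reading of "tangent-linear" is an assumption, the function names are hypothetical, and clip_negative is omitted.

```python
def power(x: float, lambd: float = 0.33, tangent_linear_above_one: bool = False) -> float:
    if tangent_linear_above_one and x > 1.0:
        # Assumed extension: tangent line of x**lambd at x = 1.
        return 1.0 + lambd * (x - 1.0)
    return x ** lambd


def inverse_power(y: float, lambd: float = 0.33, tangent_linear_above_one: bool = False) -> float:
    if tangent_linear_above_one and y > 1.0:
        # Invert the straight line instead of raising to 1 / lambd.
        return 1.0 + (y - 1.0) / lambd
    return y ** (1.0 / lambd)


plain = power(4.0, lambd=0.5)
extended = power(3.0, lambd=0.5, tangent_linear_above_one=True)
round_trip = inverse_power(power(3.0, 0.1, True), 0.1, True)
```

Under this assumption, a small overshoot in transformed space maps back to a proportionally small overshoot in physical space, whereas raising to the power 1 / lambd with lambd = 0.1 would amplify it strongly.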

anemoi.models.preprocessing.mappings.atanh_converter(x, rho=2.0)

Encode x in [0, 1] to a single scalar value in [-1, 1]

Mapping:

  • x == 0 -> -1

  • 0 < x < 1 -> atanh(tanh(rho) * (2x - 1)) / rho

  • x == 1 -> +1

(x == 0.5 -> 0)

Parameters:
  • x (torch.Tensor) – Input tensor

  • rho (float, optional) – Rho parameter for the atanh transform. Default is 2.0. Controls the steepness of the transform at the boundaries.
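The mapping can be checked with a scalar sketch. The encoder follows the formula given above; the decoder is not taken from the source but derived algebraically from it, and both function names are hypothetical.

```python
import math


def atanh_encode(x: float, rho: float = 2.0) -> float:
    # The endpoints are handled explicitly so they map exactly to -1 and +1.
    if x <= 0.0:
        return -1.0
    if x >= 1.0:
        return 1.0
    return math.atanh(math.tanh(rho) * (2.0 * x - 1.0)) / rho


def atanh_decode(y: float, rho: float = 2.0) -> float:
    # Derived by solving y = atanh(tanh(rho) * (2x - 1)) / rho for x.
    return 0.5 * (math.tanh(rho * y) / math.tanh(rho) + 1.0)


centre = atanh_encode(0.5)
round_trip = atanh_decode(atanh_encode(0.9))
```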

anemoi.models.preprocessing.mappings.asinh_converter(x, c=1.0)

Apply an asinh transform

anemoi.models.preprocessing.mappings.inverse_asinh_converter(x, c=1.0)

Inverse asinh transform

anemoi.models.preprocessing.mappings.log1p_converter(x)

Convert positive var into log(1+var).

anemoi.models.preprocessing.mappings.expm1_converter(x)

Convert back log(1+var) to var.

anemoi.models.preprocessing.mappings.sqrt_converter(x)

Apply a sqrt transform

anemoi.models.preprocessing.mappings.inverse_sqrt_converter(x)

Inverse sqrt transform
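The log1p and sqrt pairs are simple enough to verify with Python's math module. The sketch below works on single floats and only checks that each inverse undoes its forward transform; the helper name round_trip is hypothetical.

```python
import math


def round_trip(forward, inverse, x: float) -> float:
    # Apply a transform and then its inverse.
    return inverse(forward(x))


# log(1 + x) and exp(y) - 1, computed in a way that stays accurate for tiny x.
small = round_trip(math.log1p, math.expm1, 1e-10)

# sqrt(x) and y squared.
squared = round_trip(math.sqrt, lambda y: y * y, 9.0)
```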

Imputer

Machine learning models cannot process missing values (NaNs) directly, so missing values in the input data and the target must be handled before they reach the model. The Imputer module in anemoi-models handles missing values (NaNs) before the data is passed to the model and after the model’s output has been processed by the training loss.

For each input batch, the module identifies NaN locations and replaces the NaNs with a configured imputation value, as specified in the configuration file. If a variable is present in the output data, the imputed values are restored to NaN at the original NaN locations from the first timestep of the input.

The imputer provides the NaN mask as a loss scaler anemoi.training.losses.scalers.loss_weights_mask.NaNMaskScaler to the loss function, if the scaler is included in config.training.training_loss. The training loss function then uses the NaN mask to ignore the imputed values in the loss calculation. This mask is updated for every batch during training.

During training, diagnostic variables are included in each batch, and therefore at the input timesteps. Any NaNs in the target data are weighted by zero to enable proper loss computation. During inference, however, NaN locations for diagnostic variables are not available (those fields are not part of the model input), so the imputer cannot reintroduce NaNs into the diagnostic output. To insert NaNs into diagnostic variables, the postprocessor anemoi.models.preprocessing.postprocessor.ConditionalNaNPostprocessor has to be used. This masks diagnostic variable entries by setting them to NaN wherever the chosen (prognostic) masking variable is NaN.

The dynamic imputers are used to impute NaNs in the input data and do not replace the imputed values with NaNs in the output data. Therefore, the NaN mask is not provided as a scaler to the loss function either.
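The impute, mask and restore cycle described in this section can be sketched on a single field stored as a plain list. The function names are hypothetical and the zero-weighted loss is a simplified stand-in for what NaNMaskScaler achieves; how the real loss is normalized is not covered here.

```python
import math

NAN = float("nan")


def impute(values: list[float], fill: float) -> tuple[list[float], list[bool]]:
    # Record where the NaNs are, then replace them with the fill value.
    mask = [math.isnan(v) for v in values]
    filled = [fill if m else v for v, m in zip(values, mask)]
    return filled, mask


def restore_nans(values: list[float], mask: list[bool]) -> list[float]:
    # Put NaNs back at the recorded locations of the output.
    return [NAN if m else v for v, m in zip(values, mask)]


def masked_mse(pred: list[float], target: list[float], mask: list[bool]) -> float:
    # Imputed locations get zero weight, so they do not contribute to the loss.
    terms = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if not m]
    return sum(terms) / len(terms)


field = [1.0, NAN, 3.0]
filled, mask = impute(field, fill=0.0)
prediction = [1.5, 0.2, 2.5]
loss = masked_mse(prediction, filled, mask)
output = restore_nans(prediction, mask)
```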

The module contains the following classes:

class anemoi.models.preprocessing.imputer.BaseImputer(config=None, data_indices: IndexCollection | None = None, statistics: dict | None = None)

Bases: BasePreprocessor, ABC

Base class for Imputers.

get_nans(x: Tensor) Tensor

Get NaN mask from data

The mask is only saved for the first two dimensions (batch, timestep) and the last two dimensions (grid, variable). For the rest of the dimensions we select the first element, since we assume the NaN locations do not change along these dimensions. For the ensemble dimension, this means we assume that the NaN locations are the same for all ensemble members.

Parameters:

x (torch.Tensor) – Input tensor with shape (batch, time, …, grid, variable)

Returns:

Tensor with NaN locations of shape (batch, time, …, grid)

Return type:

torch.Tensor

fill_with_value(x: Tensor, index_x: list[int], nan_locations: Tensor, index_nl: list[int]) Tensor

Fill NaN locations in the input tensor with the specified values.

Parameters:
  • x (torch.Tensor) – Input tensor

  • index (list) – List of indices for the variables to be imputed

  • nan_locations (torch.Tensor) – Tensor with NaN locations

Returns:

Tensor where NaN locations are filled with the specified values

Return type:

torch.Tensor

transform(x: Tensor, in_place: bool = True, skip_imputation: bool = False, **_kwargs) Tensor

Impute missing values in the input tensor.

inverse_transform(x: Tensor, in_place: bool = True, skip_imputation: bool = False, **_kwargs) Tensor

Impute missing values in the input tensor.

class anemoi.models.preprocessing.imputer.InputImputer(config=None, data_indices: IndexCollection | None = None, statistics: dict | None = None)

Bases: BaseImputer

Imputes missing values using the statistics supplied.

Expects the config to have keys corresponding to available statistics and values as lists of variables to impute:

   default: "none"
   mean:
     - y
   maximum:
     - x
   minimum:
     - q
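One way to read such a configuration is as a lookup from variable name to fill value. The sketch below uses a hypothetical nested-dict layout for the statistics (the real statistics object is not shown in this section) and assumes that default "none" leaves unlisted variables untouched.

```python
config = {"default": "none", "mean": ["y"], "maximum": ["x"], "minimum": ["q"]}

# Hypothetical statistics layout: statistic name -> variable name -> value.
statistics = {
    "mean": {"x": 4.0, "y": 2.0, "q": 0.5},
    "maximum": {"x": 9.0, "y": 7.0, "q": 1.0},
    "minimum": {"x": 0.0, "y": -3.0, "q": -1.0},
}


def fill_values(config: dict, statistics: dict) -> dict:
    values = {}
    for statistic, variables in config.items():
        if statistic == "default":
            continue  # assumed: "none" means unlisted variables are not imputed
        for name in variables:
            values[name] = statistics[statistic][name]
    return values


fills = fill_values(config, statistics)
```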

class anemoi.models.preprocessing.imputer.ConstantImputer(config=None, data_indices: IndexCollection | None = None, statistics: dict | None = None)

Bases: BaseImputer

Imputes missing values using a constant value.

Expects the config to have keys corresponding to the constant values and values as lists of variables to impute:

   default: "none"
   1:
     - y
   5.0:
     - x
   3.14:
     - q

class anemoi.models.preprocessing.imputer.CopyImputer(config=None, data_indices: IndexCollection | None = None, statistics: dict | None = None)

Bases: BaseImputer

Imputes missing values by copying them from another variable:

   default: "none"
   variable_to_copy:
     - variable_missing_1
     - variable_missing_2
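The copy strategy can be illustrated on fields stored as plain lists keyed by variable name. The function name copy_impute and the variable names sst and t2m are hypothetical examples; the sketch assumes the source variable has valid values wherever the target is NaN.

```python
import math

NAN = float("nan")


def copy_impute(fields: dict, source: str, targets: list) -> dict:
    # Work on copies so the caller's data is left unchanged.
    out = {name: list(values) for name, values in fields.items()}
    for name in targets:
        out[name] = [
            src if math.isnan(value) else value
            for value, src in zip(fields[name], fields[source])
        ]
    return out


fields = {"sst": [290.0, NAN, 285.0], "t2m": [291.0, 288.0, 284.0]}
imputed = copy_impute(fields, source="t2m", targets=["sst"])
```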

fill_with_value(x: Tensor, index_x: list[int], nan_locations: Tensor, index_nl: list[int]) Tensor

Fill NaN locations in the input tensor with the specified values.

Parameters:
  • x (torch.Tensor) – Input tensor

  • index (list) – List of indices for the variables to be imputed

  • nan_locations (torch.Tensor) – Tensor with NaN locations

Returns:

Tensor where NaN locations are filled with the specified values

Return type:

torch.Tensor