Optimisation and Performance

This guide covers strategies for optimising inference performance and managing memory usage when running anemoi-inference.

Memory Optimisation 

Large models can consume significant memory during inference. Several strategies can help manage memory usage effectively.

Chunking 

The most important memory optimisation is controlling how many chunks each inference step is split into. Chunking splits the computation into smaller batches, each of which fits in available memory.

Environment Variable

Set the ANEMOI_INFERENCE_NUM_CHUNKS environment variable to control how many chunks to split each timestep into:

# Split each timestep into 4 chunks
export ANEMOI_INFERENCE_NUM_CHUNKS=4
anemoi-inference run config.yaml

# Or inline
ANEMOI_INFERENCE_NUM_CHUNKS=8 anemoi-inference run config.yaml

# Or in the inference config (YAML, not a shell command)
env:
  ANEMOI_INFERENCE_NUM_CHUNKS: 8

Warning

Using too many chunks will slow down inference due to overhead. Start with fewer chunks and increase only if you encounter out-of-memory errors.
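
To make the trade-off concrete, the sketch below shows one way the grid points of a timestep could be divided into the number of chunks given by ANEMOI_INFERENCE_NUM_CHUNKS. The helper name chunk_bounds and the grid size are hypothetical, and this is not the splitting logic anemoi-inference uses internally; it only illustrates that more chunks means smaller pieces per pass, and more passes.

```python
import os


def chunk_bounds(num_points: int, num_chunks: int) -> list[tuple[int, int]]:
    """Split range(num_points) into num_chunks contiguous (start, stop) slices.

    Hypothetical helper for illustration only; chunk sizes differ by at most one.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(num_points, num_chunks)
    bounds = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional point each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


# Read the chunk count the way a shell export would provide it (default 1).
num_chunks = int(os.environ.get("ANEMOI_INFERENCE_NUM_CHUNKS", "1"))

# Illustrative grid size, not taken from any particular model.
for start, stop in chunk_bounds(40_320, num_chunks):
    print(f"chunk covers points [{start}, {stop}) -> {stop - start} points")
```

Each extra chunk lowers the peak size of a single pass but adds one more pass per timestep, which is where the overhead mentioned above comes from.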

Monitoring Memory Usage

Monitor GPU memory during inference:

# In another terminal
watch -n 1 nvidia-smi

Look for:

Memory usage: Should stay below 90% to avoid out-of-memory (OOM) errors
GPU utilization: Should be high (> 80%) during computation
Fluctuations: Large spikes may indicate inefficient chunking
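
The same checks can be scripted. The following is a minimal sketch that parses CSV output from nvidia-smi and applies the 90% memory and 80% utilization thresholds listed above. The function names are hypothetical, and the parser assumes every field is numeric (nvidia-smi can report values such as [N/A] on some devices, which this sketch does not handle).

```python
import subprocess

# Standard nvidia-smi query fields: memory in MiB, utilization in percent.
QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse rows produced with --format=csv,noheader,nounits into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        used_mib, total_mib, util_pct = (float(v) for v in line.split(","))
        stats.append({
            "memory_fraction": used_mib / total_mib,
            "utilization_pct": util_pct,
        })
    return stats


def check(stats: list[dict], mem_limit: float = 0.90, min_util: float = 80.0) -> list[str]:
    """Return one warning per GPU and threshold that is not met."""
    warnings = []
    for index, gpu in enumerate(stats):
        if gpu["memory_fraction"] >= mem_limit:
            warnings.append(f"GPU {index}: memory at {gpu['memory_fraction']:.0%}, OOM risk")
        if gpu["utilization_pct"] <= min_util:
            warnings.append(f"GPU {index}: utilization {gpu['utilization_pct']:.0f}%, possibly I/O bound")
    return warnings


def query_gpus() -> str:
    """Run nvidia-smi once and return its CSV output (requires an NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
```

On a machine with an NVIDIA driver, `check(parse_gpu_stats(query_gpus()))` returns the current warnings; calling it in a loop while inference runs gives a simple log of threshold violations.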

Precision Reduction 

Using lower precision can significantly reduce memory usage with minimal impact on forecast quality.

Half Precision (FP16)

Most models work well with half precision:

checkpoint: /path/to/model.ckpt
precision: 16  # Use FP16 instead of FP32
lead_time: 240

input:
  grib: /path/to/input.grib
output:
  grib: /path/to/output.grib

FP16 stores each value in 2 bytes instead of the 4 bytes used by FP32, so this can reduce memory usage by approximately 50%.
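
The arithmetic behind the saving can be sketched as follows. The parameter count is a hypothetical value chosen for illustration, the dictionary keys are labels rather than config values, and the estimate covers model weights only; activations and temporary buffers add to the total.

```python
# Bytes needed to store one value in each floating-point format.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2}


def weights_gib(num_parameters: int, precision: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_parameters * BYTES_PER_VALUE[precision] / 2**30


# Hypothetical model size used only for illustration.
num_parameters = 250_000_000

for precision in ("fp32", "fp16"):
    print(f"{precision}: {weights_gib(num_parameters, precision):.2f} GiB")
```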

BFloat16

For models trained with bfloat16:

precision: bf16

Note

BFloat16 is supported on newer NVIDIA GPUs (Ampere architecture and later). Check your hardware compatibility before using this option.

Mixed Precision

The model automatically handles mixed precision computation when precision is set to 16 or bf16. Critical operations remain in higher precision while most computations use lower precision.
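
One reason some operations are kept in higher precision is that long accumulations lose information in FP16. The sketch below illustrates this with the standard library only, using the IEEE 754 half-precision format provided by the struct module. It is an illustration of the numerical effect, not of how anemoi-inference or PyTorch implement mixed precision; the function names are hypothetical.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def accumulate(values, low_precision: bool) -> float:
    """Sum values, optionally rounding the running total to FP16 after each add."""
    total = 0.0
    for value in values:
        total = total + value
        if low_precision:
            total = to_fp16(total)
    return total


ones = [1.0] * 4096
print("FP16 accumulator:", accumulate(ones, low_precision=True))
print("higher-precision accumulator:", accumulate(ones, low_precision=False))
```

The FP16 accumulator stops growing at 2048, because adjacent FP16 values are 2 apart from that point on and adding 1 rounds back to the same number. The higher-precision accumulator reaches the correct total of 4096.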

Device Selection 

CPU Inference

For systems without GPU or when GPU memory is insufficient:

checkpoint: /path/to/model.ckpt
device: cpu
lead_time: 240

Warning

CPU inference is significantly slower than GPU inference (typically 10-100x). Use it only when no GPU is available, or for small models and short forecasts.

Specific GPU Selection

On multi-GPU systems, select a specific device:

# Use GPU 1
CUDA_VISIBLE_DEVICES=1 anemoi-inference run config.yaml

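# Or in config (YAML). Use this instead of CUDA_VISIBLE_DEVICES=1, not with it:
# when only one GPU is visible, it is always cuda:0 inside the process.
device: cuda:1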
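
The two approaches number devices differently, which is easy to get wrong when combining them. The sketch below shows how the device names seen inside the process relate to the IDs listed in CUDA_VISIBLE_DEVICES. The helper name is hypothetical, and the mapping is a simplification that assumes every listed ID is a valid integer index.

```python
def visible_device_map(cuda_visible_devices: str) -> dict[str, str]:
    """Map in-process device names (cuda:0, cuda:1, ...) to the listed GPU IDs."""
    ids = [item.strip() for item in cuda_visible_devices.split(",") if item.strip()]
    # Visible devices are renumbered from zero in the order they are listed.
    return {f"cuda:{logical}": physical for logical, physical in enumerate(ids)}


print(visible_device_map("1"))
print(visible_device_map("2,0"))
```

With CUDA_VISIBLE_DEVICES=1, the selected GPU appears as cuda:0, so a config that also requests cuda:1 would refer to a device the process cannot see.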

Profiling and Troubleshooting 

Measuring Performance 

Time Individual Components

Increase verbosity to see timing information:

anemoi-inference run config.yaml --verbosity 2

Output will include timing for:

Checkpoint loading
Input data loading
Each forecast step
Output writing

Example output:

INFO Loading checkpoint (3.2s)
INFO Loading input data (1.8s)
INFO Step 1/10 (0.42s)
INFO Step 2/10 (0.41s)
...
INFO Writing output (2.1s)
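
To see where the time goes, the timings can be totalled from the log. The sketch below assumes the log lines look exactly like the example output above, with the duration in a trailing "(<seconds>s)"; the actual format may differ between versions, so treat the pattern as an assumption to adjust.

```python
import re

# Matches lines such as "INFO Step 1/10 (0.42s)" from the example output.
TIMING = re.compile(r"^INFO (?P<label>.+?) \((?P<seconds>\d+(?:\.\d+)?)s\)$")


def summarise(log_text: str) -> dict[str, float]:
    """Total the time spent in forecast steps versus everything else."""
    totals = {"steps": 0.0, "other": 0.0}
    for line in log_text.splitlines():
        match = TIMING.match(line.strip())
        if match is None:
            continue  # skip lines without a timing, such as "..."
        key = "steps" if match["label"].startswith("Step ") else "other"
        totals[key] += float(match["seconds"])
    return totals


example = """\
INFO Loading checkpoint (3.2s)
INFO Loading input data (1.8s)
INFO Step 1/10 (0.42s)
INFO Step 2/10 (0.41s)
...
INFO Writing output (2.1s)
"""
print(summarise(example))
```

If the "other" total dominates, loading and writing are the bottleneck rather than the model itself, which points to the I/O remedies rather than to chunking or precision changes.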

Debugging Out-of-Memory Errors 

If you encounter CUDA out-of-memory errors:

Increase chunking:

ANEMOI_INFERENCE_NUM_CHUNKS=8 anemoi-inference run config.yaml

Reduce precision:
```
precision: 16
```
Use parallel inference:
```
runner: parallel
```

Check for memory leaks:

# Monitor memory over time
while true; do
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    sleep 1
done

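Clear cache (this only affects the Python process that calls it, so it must run inside the process performing inference):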
```
import torch
torch.cuda.empty_cache()
```
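
The values printed by the monitoring loop above can be checked for a leak pattern with a simple heuristic. The sketch below is one possible rule, not a feature of anemoi-inference: it flags a series of memory readings (in MiB) that never decreases and grows by at least a chosen margin. The function name and the 100 MiB default are assumptions, and the samples should be taken after the model has finished loading, since memory naturally rises during start-up.

```python
def looks_like_leak(samples_mib: list[float], min_growth_mib: float = 100.0) -> bool:
    """Heuristic: memory never drops between samples and grows by a clear margin."""
    if len(samples_mib) < 3:
        return False  # too few samples to say anything
    never_drops = all(b >= a for a, b in zip(samples_mib, samples_mib[1:]))
    return never_drops and samples_mib[-1] - samples_mib[0] >= min_growth_mib


# Steady growth across forecast steps: suspicious.
print(looks_like_leak([21000, 21400, 21800, 22200]))
# Usage rises and falls with each step: normal allocation and release.
print(looks_like_leak([21000, 23000, 21000, 23000]))
```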

Common Issues and Solutions 

| Issue | Solution |
| --- | --- |
| Slow first run | Expected for model compilation. Subsequent runs are faster. |
| High memory usage even with chunking | Reduce precision to 16 or use parallel inference |
| Low GPU utilization | May indicate I/O bottleneck. Use local data sources. |
| Inference slower than expected | Too many chunks adds overhead. Reduce chunk count. |
| Inconsistent timing | Check for background processes or thermal throttling |
| GRIB writing slow | Use faster storage or write to local disk then copy |

Environment Variables 

Complete list of environment variables affecting performance:

| Variable | Default | Description |
| --- | --- | --- |
| `ANEMOI_INFERENCE_NUM_CHUNKS` | 1 | Number of chunks per timestep for memory management |
| `ANEMOI_BASE_SEED` | Random | Base seed for reproducibility (parallel inference) |
| `CUDA_VISIBLE_DEVICES` | All GPUs | Which GPUs are visible to the process |
| `PYTORCH_CUDA_ALLOC_CONF` | Default | PyTorch CUDA memory allocator configuration |

Example Usage 

export ANEMOI_INFERENCE_NUM_CHUNKS=4
export CUDA_VISIBLE_DEVICES=0,1
anemoi-inference run config.yaml
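
When inference is launched from a Python script rather than a shell, the same variables can be passed through the environment of the child process. This is a minimal sketch using only the standard library and the command shown above; the function names are hypothetical.

```python
import os
import subprocess


def build_env(num_chunks: int, visible_devices: str) -> dict[str, str]:
    """Copy the current environment and add the two performance variables."""
    env = dict(os.environ)
    env["ANEMOI_INFERENCE_NUM_CHUNKS"] = str(num_chunks)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    return env


def run_inference(config_path: str, env: dict[str, str]) -> int:
    """Run the CLI with the given environment and return its exit code."""
    return subprocess.run(["anemoi-inference", "run", config_path], env=env).returncode
```

Calling `run_inference("config.yaml", build_env(4, "0,1"))` is equivalent to the shell example. Because the settings live in a copy of the environment, they apply to that run only and leave the calling process unchanged.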