Optimisation and Performance
This guide covers strategies for optimising inference performance and
managing memory usage when running anemoi-inference.
Memory Optimisation
Large models can consume significant memory during inference. Several strategies can help manage memory usage effectively.
Chunking
The most important optimisation for memory management is controlling the chunk size for model inference. This splits the computation into smaller batches that fit in available memory.
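The idea behind chunking can be sketched in a few lines of plain Python. This is an illustration only, not how anemoi-inference implements it: `split_into_chunks` and `process_in_chunks` are hypothetical names, and a list of integers stands in for the data processed in one timestep.

```python
def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous pieces of near-equal size."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def process_in_chunks(items, num_chunks, fn):
    """Apply fn to one chunk at a time and stitch the results together."""
    results = []
    for chunk in split_into_chunks(items, num_chunks):
        results.extend(fn(chunk))
    return results


grid_points = list(range(10))  # stand-in for real model inputs
doubled = process_in_chunks(grid_points, 4, lambda chunk: [2 * x for x in chunk])
largest_chunk = max(len(c) for c in split_into_chunks(grid_points, 4))
```

Only one chunk is being worked on at a time, so the peak working set is driven by `largest_chunk` rather than by the full input. The result is the same as processing everything at once.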
Environment Variable
Set the ANEMOI_INFERENCE_NUM_CHUNKS environment variable to control
how many chunks to split each timestep into:
# Split each timestep into 4 chunks
export ANEMOI_INFERENCE_NUM_CHUNKS=4
anemoi-inference run config.yaml
# Or inline
ANEMOI_INFERENCE_NUM_CHUNKS=8 anemoi-inference run config.yaml
# Or in the inference config (YAML)
env:
  ANEMOI_INFERENCE_NUM_CHUNKS: 8
Warning
Using too many chunks will slow down inference due to overhead. Start with fewer chunks and increase only if you encounter out-of-memory errors.
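The "start low, increase only on failure" advice could be automated along these lines. This is a sketch under assumptions: `run_inference` and `first_working_chunk_count` are hypothetical helpers, and any non-zero exit code is treated as a reason to retry with more chunks. A real script should check that the failure was actually an out-of-memory error.

```python
import os
import subprocess


def run_inference(config_path, num_chunks):
    """Run the CLI once with the given chunk count; True if it exits cleanly."""
    env = dict(os.environ, ANEMOI_INFERENCE_NUM_CHUNKS=str(num_chunks))
    result = subprocess.run(["anemoi-inference", "run", config_path], env=env)
    return result.returncode == 0


def first_working_chunk_count(config_path, candidates=(1, 2, 4, 8), run=run_inference):
    """Try chunk counts from smallest to largest; return the first that succeeds."""
    for num_chunks in candidates:
        if run(config_path, num_chunks):
            return num_chunks
    return None  # every candidate failed
```

Calling `first_working_chunk_count("config.yaml")` would then return the smallest chunk count from the candidate list that completes, which is also the one with the least chunking overhead.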
Monitoring Memory Usage
Monitor GPU memory during inference:
# In another terminal
watch -n 1 nvidia-smi
Look for:
Memory usage: Should stay below 90% to avoid OOM
GPU utilization: Should be high (> 80%) during computation
Fluctuations: Large spikes may indicate inefficient chunking
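The same checks can be scripted. The sketch below parses the CSV output of `nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits` and applies the rule-of-thumb thresholds above. The helper names are hypothetical, and the parser assumes every field is numeric.

```python
import subprocess

QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse 'used, total, util' rows (MiB, MiB, %) into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (float(field) for field in line.split(","))
        stats.append({"memory_pct": 100.0 * used / total, "util_pct": util})
    return stats


def check_gpu(stat, max_memory_pct=90.0, min_util_pct=80.0):
    """Return a list of warnings for one GPU, empty if everything looks fine."""
    warnings = []
    if stat["memory_pct"] >= max_memory_pct:
        warnings.append("memory usage is close to the limit")
    if stat["util_pct"] < min_util_pct:
        warnings.append("GPU utilization is low")
    return warnings


def query_gpus():
    """Call nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    return parse_gpu_stats(subprocess.check_output(cmd, text=True))
```

On a machine with an NVIDIA driver, `[check_gpu(s) for s in query_gpus()]` gives one list of warnings per GPU.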
Precision Reduction
Using lower precision can significantly reduce memory usage with minimal impact on forecast quality.
Half Precision (FP16)
Most models work well with half precision:
checkpoint: /path/to/model.ckpt
precision: 16 # Use FP16 instead of FP32
lead_time: 240
input:
  grib: /path/to/input.grib
output:
  grib: /path/to/output.grib
This can reduce memory usage by approximately 50%.
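The figure follows from storage size: an FP32 value takes 4 bytes and an FP16 value takes 2. A back-of-the-envelope estimate for the model weights alone, with a hypothetical parameter count, might look like this:

```python
# Bytes needed to store one value at each precision.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2}


def weights_memory_gib(num_parameters, precision):
    """Memory taken by the weights alone, in GiB."""
    return num_parameters * BYTES_PER_VALUE[precision] / 2**30


num_parameters = 500_000_000  # hypothetical model size, for illustration only
fp32 = weights_memory_gib(num_parameters, "fp32")
fp16 = weights_memory_gib(num_parameters, "fp16")
print(f"FP32: {fp32:.2f} GiB, FP16: {fp16:.2f} GiB, ratio: {fp16 / fp32:.0%}")
```

Activations and framework overhead are not included here, so total savings in practice may differ from an exact halving.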
BFloat16
For models trained with bfloat16:
precision: bf16
Note
BFloat16 is supported on newer GPUs (Ampere and later). Check your hardware compatibility before using this option.
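One way to check compatibility is through the GPU's compute capability: Ampere corresponds to 8.0. The sketch below uses a hypothetical helper, `supports_bfloat16`, and only queries the real device if PyTorch and a CUDA GPU are available.

```python
def supports_bfloat16(compute_capability):
    """True for NVIDIA compute capability 8.0 (Ampere) or newer."""
    major, _minor = compute_capability
    return major >= 8


try:
    import torch  # optional: only needed to query the real device
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    # Returns a (major, minor) tuple for the given device index.
    capability = torch.cuda.get_device_capability(0)
    print(f"GPU 0 capability {capability}: bf16 supported = {supports_bfloat16(capability)}")
```

PyTorch also provides `torch.cuda.is_bf16_supported()`, which answers the same question directly for the current device.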
Mixed Precision
The model automatically handles mixed precision computation when precision is set to 16 or bf16. Critical operations remain in higher precision while most computations use lower precision.
Device Selection
CPU Inference
For systems without GPU or when GPU memory is insufficient:
checkpoint: /path/to/model.ckpt
device: cpu
lead_time: 240
Warning
CPU inference is significantly slower than GPU inference (typically 10-100x). Use only when GPU is unavailable or for small models/short forecasts.
Specific GPU Selection
On multi-GPU systems, select a specific device:
# Use GPU 1
CUDA_VISIBLE_DEVICES=1 anemoi-inference run config.yaml
# Or in config
device: cuda:1
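The two approaches interact. CUDA renumbers the GPUs listed in CUDA_VISIBLE_DEVICES from zero inside the process, so with `CUDA_VISIBLE_DEVICES=1` the selected GPU appears as `cuda:0`, and `cuda:1` does not exist. The sketch below illustrates the mapping. `logical_to_physical` is a hypothetical helper and assumes the variable holds a comma-separated list of integer indices.

```python
def logical_to_physical(cuda_visible_devices):
    """Map in-process device names to the physical GPU ids listed in the variable."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return {f"cuda:{logical}": physical for logical, physical in enumerate(ids)}


single = logical_to_physical("1")   # {'cuda:0': '1'}
pair = logical_to_physical("2,3")   # {'cuda:0': '2', 'cuda:1': '3'}
```

In practice this means picking one mechanism per run: either restrict visibility with the environment variable, or leave all GPUs visible and name the device in the config.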
Profiling and Troubleshooting
Measuring Performance
Time Individual Components
Add verbosity to see timing information:
anemoi-inference run config.yaml --verbosity 2
Output will include timing for:
Checkpoint loading
Input data loading
Each forecast step
Output writing
Example output:
INFO Loading checkpoint (3.2s)
INFO Loading input data (1.8s)
INFO Step 1/10 (0.42s)
INFO Step 2/10 (0.41s)
...
INFO Writing output (2.1s)
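Timing lines like these are easy to summarise with a short script. The sketch below assumes the log format matches the example output above exactly. The real format may differ between versions, so the regular expression is an assumption to adapt.

```python
import re

# Matches lines such as "INFO Step 1/10 (0.42s)".
STEP_PATTERN = re.compile(r"Step (\d+)/(\d+) \(([\d.]+)s\)")


def step_times(log_text):
    """Return the per-step durations in seconds, in the order they appear."""
    return [float(m.group(3)) for m in STEP_PATTERN.finditer(log_text)]


example_log = """\
INFO Loading checkpoint (3.2s)
INFO Loading input data (1.8s)
INFO Step 1/10 (0.42s)
INFO Step 2/10 (0.41s)
INFO Writing output (2.1s)
"""

times = step_times(example_log)
mean_step = sum(times) / len(times)
print(f"{len(times)} steps, mean {mean_step:.3f}s, slowest {max(times):.2f}s")
```

Comparing the mean step time across runs with different chunk counts or precision settings gives a quick view of what each change costs or saves.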
Debugging Out-of-Memory Errors
If you encounter CUDA out-of-memory errors:
Increase chunking:
ANEMOI_INFERENCE_NUM_CHUNKS=8 anemoi-inference run config.yaml
Reduce precision:
precision: 16
Use parallel inference:
runner: parallel
Check for memory leaks:
# Monitor memory over time
while true; do
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    sleep 1
done
Clear cache:
import torch
torch.cuda.empty_cache()
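For the memory-leak check, readings collected with the monitoring loop can be screened with a crude heuristic. The sketch below is an assumption-laden illustration, not a diagnostic tool: `looks_like_leak` is a hypothetical helper, the samples are taken to be `memory.used` values in MiB, and the first few readings are skipped because memory normally rises while the model loads and then plateaus.

```python
def looks_like_leak(samples_mib, warmup=1, min_growth_mib=100):
    """Flag usage that keeps growing after the warm-up readings.

    Returns True when, after skipping `warmup` samples, usage never drops
    and grows by at least `min_growth_mib` from first to last reading.
    """
    steady = samples_mib[warmup:]
    if len(steady) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(steady, steady[1:]))
    return never_drops and steady[-1] - steady[0] >= min_growth_mib


growing = looks_like_leak([1000, 1200, 1400, 1600])  # keeps climbing
plateau = looks_like_leak([1000, 3000, 3000, 3000])  # rises once, then flat
```

A plateau after the initial rise is the expected pattern. Steady growth across many forecast steps is the case worth investigating.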
Common Issues and Solutions
| Issue | Solution |
|---|---|
| Slow first run | Expected for model compilation. Subsequent runs are faster. |
| High memory usage even with chunking | Reduce precision to 16 or use parallel inference |
| Low GPU utilization | May indicate I/O bottleneck. Use local data sources. |
| Inference slower than expected | Too many chunks adds overhead. Reduce chunk count. |
| Inconsistent timing | Check for background processes or thermal throttling |
| GRIB writing slow | Use faster storage or write to local disk then copy |
Environment Variables
Complete list of environment variables affecting performance:
| Variable | Default | Description |
|---|---|---|
| ANEMOI_INFERENCE_NUM_CHUNKS | 1 | Number of chunks per timestep for memory management |
| | Random | Base seed for reproducibility (parallel inference) |
| CUDA_VISIBLE_DEVICES | All GPUs | Which GPUs are visible to the process |
| | Default | PyTorch CUDA memory allocator configuration |
Example Usage
export ANEMOI_INFERENCE_NUM_CHUNKS=4
export CUDA_VISIBLE_DEVICES=0,1
anemoi-inference run config.yaml
See also
Parallel Inference - Distribute models across multiple GPUs
Environment Setup - Environment setup and dependencies
Run Command - CLI options for the run command
Retrieve Command - Pre-fetch data for faster inference