Parallel Inference
If the memory requirements of your model are too large to fit on a single GPU, you can run Anemoi-Inference in parallel across multiple GPUs.
You have two options to launch parallel inference:
- Launch without Slurm. This allows you to run inference across multiple GPUs on a single node.
- Launch via Slurm. Slurm is needed to run inference across multiple nodes.
Prerequisites
Parallel inference requires Anemoi-Models >= v0.4.2. If upgrading breaks your existing checkpoints, you can instead cherry-pick the relevant PR into your older version of Anemoi-Models.
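If needed, you can upgrade with pip (a sketch, assuming a pip-managed environment; adjust for your own package manager):
pip install "anemoi-models>=0.4.2"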
Configuration
To run in parallel, you must add ‘runner: parallel’ to your inference config file. If you are running in parallel without Slurm, you must also add a ‘world_size: num_gpus’ field, which tells Anemoi-Inference how many GPUs you want to run across. It cannot be greater than the number of GPUs on a single node.
Note
If you are launching parallel inference via Slurm, ‘world_size’ will be ignored in favour of the ‘SLURM_NTASKS’ environment variable.
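For example: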
checkpoint: /path/to/inference-last.ckpt
lead_time: 60
runner: parallel
world_size: 4 # Only required if running parallel inference without Slurm
input:
  grib: /path/to/input.grib
output:
  grib: /path/to/output.grib
Running inference in parallel without Slurm
Once you have added ‘runner: parallel’ and ‘world_size: num_gpus’ to your config file, you can launch parallel inference by calling ‘anemoi-inference run config.yaml’ as normal.
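For example, from an environment with Anemoi-Inference installed:
source /path/to/venv/bin/activate
anemoi-inference run config.yaml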
Running inference in parallel with Slurm
Below is an example Slurm batch script that launches a parallel inference job across 4 GPUs on a single node.
#!/bin/bash
# Request 1 node with 4 tasks, one per GPU
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=0:05:00
#SBATCH --output=outputs/parallel_inf.%j.out

source /path/to/venv/bin/activate
srun anemoi-inference run parallel.yaml
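Assuming the script above is saved as parallel_inf.sh (the filename is arbitrary), submit it with:
sbatch parallel_inf.sh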
Warning
If you specify ‘runner: parallel’ but you don’t launch with ‘srun’, your anemoi-inference job may hang, as only 1 process will be launched.
Note
By default, anemoi-inference will determine your system's master address and port itself. If this fails (e.g. when running Anemoi-Inference inside a container), you can instead set these values yourself via environment variables in your Slurm batch script:
# Resolve the IP address of the first node in the allocation
MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
export MASTER_ADDR=$(nslookup $MASTER_ADDR | grep -oP '(?<=Address: ).*')
# Pick a random port in the range 10000-19999
export MASTER_PORT=$((10000 + RANDOM % 10000))
srun anemoi-inference run parallel.yaml