Troubleshooting
When encountering issues while training models with Anemoi Training, this guide will help you identify and resolve common problems. We’ll cover various debugging techniques, including those specific to PyTorch Lightning, which Anemoi Training uses under the hood.
Using Debug Configurations
To troubleshoot errors when trying to train a model for the first time,
it is advisable to use the debug configuration
anemoi/training/config/debug.yaml. This configuration:
Runs a small model
Trains on a limited number of batches per epoch
Helps identify errors more quickly
If you’re using a custom configuration, consider making these temporary adjustments:
dataloader:
limit_batches:
training: 100
validation: 100
hardware:
num_gpus_per_node: 1
These settings limit the data processed and use a single GPU, helping isolate issues related to data or parallelization.
PyTorch Lightning Debugging Tools
Anemoi Training leverages PyTorch Lightning, which provides several useful debugging tools.
Currently these aren’t implemented as config settings yet, but could easily be added, if needed.
1. Overfit on a Single Batch
To identify issues in your model’s ability to learn, try overfitting on a single batch:
# use only 1% of the train & val set
trainer = Trainer(overfit_batches=0.01)
# overfit on 10 of the same batches
trainer = Trainer(overfit_batches=10)
This setting will repeatedly train on the same batch, helping you verify if the model can learn at all.
2. Fast Dev Run
For a quick test of your entire training pipeline:
trainer = Trainer(fast_dev_run=True)
This runs a single batch for training, validation, and testing, checking if all code paths work without errors.
3. Detect Anomalies
Enable PyTorch’s anomaly detection in the diagnostics configuration:
debug:
anomaly_detection: true
This helps identify issues like NaN or infinity values in your model’s computations.
Debug Flags for Better Error Handling
Anemoi Training can make use of several debug flags to provide more detailed error information:
1. Verbose Mode
Enable verbose logging:
hydra.verbose=true
You can set the log level of the logger NAME to DEBUG. Equivalent to
import logging; logging.getLogger(NAME).setLevel(logging.DEBUG).
hydra.verbose=NAME
And even provide multiple targets.
hydra.verbose=[NAME1,NAME2]
This increases the verbosity of log outputs, providing more detailed information about the training process.
2. Asynchronous Callbacks
Disable asynchronous callbacks for clearer error messages:
diagnostics:
plot:
asynchronous: false
This makes error messages generally easier to understand by ensuring callbacks are executed synchronously.
3. Disable Plotting
Turn off plotting callbacks to isolate non-visualization related issues:
diagnostics:
plot:
callbacks: []
Or set the plot config to none, (in diagnostics.evaluation)
defaults:
plot: none
Debugging C10 Distributed Errors
The C10 distributed error can often mask underlying issues. To debug the true model error:
1. Set CUDA to Blocking Mode
Before running your training script, set the following environment variable:
export CUDA_LAUNCH_BLOCKING=1
This forces CUDA operations to run synchronously, which can reveal the true source of errors that might be hidden by asynchronous execution.
2. Run on a Single GPU
Temporarily run your model on a single GPU to eliminate some distributed training complexities:
hardware:
num_gpus_per_node: 1
The code is still distributed, but at least it removes the multi-GPU aspect and you can use debug statements.
3. Gradually Increase Complexity
Once you’ve identified and fixed the underlying issue, gradually reintroduce distributed training and multiple GPUs to ensure the problem doesn’t reoccur in a multi-GPU setting.
Additional Troubleshooting Tips
1. Check Input Data
Verify that your input data is correctly formatted and addressed in the normalizer. Use small subsets of your data to test the pipeline.
2. Inspect Model Outputs
Regularly print or log model outputs, especially in the early stages of training, to catch any anomalies.
3. Monitor Resource Usage
Keep an eye on CPU, GPU, and memory usage. Unexpected spikes or constant high usage might indicate inefficiencies or leaks.
This can be enabled in the diagnostics configuration:
log:
mlflow:
system: true
4. Use PyTorch Profiler
Leverage PyTorch’s built-in profiler to identify performance bottlenecks:
We are currently updating the Anemoi profiler to use modern Pytorch profiling tools.
5. Gradient Checking
If you suspect issues with backpropagation, consider implementing gradient checking to verify correct gradient computations.
Seeking Further Assistance
If you’ve tried these troubleshooting steps and still encounter issues, consider:
Reviewing the Anemoi Training documentation for any recent updates or known issues
Checking the project’s issue tracker for similar problems and solutions
Reaching out to the Anemoi community or support channels for additional help
Remember to provide as much relevant information as possible when seeking assistance, including your configuration, error messages, and steps to reproduce the issue.