################# Troubleshooting ################# When encountering issues while training models with Anemoi Training, this guide will help you identify and resolve common problems. We'll cover various debugging techniques, including those specific to PyTorch Lightning, which Anemoi Training uses under the hood. **************************** Using Debug Configurations **************************** To troubleshoot errors when trying to train a model for the first time, it is advisable to use the debug configuration ``anemoi/training/config/debug.yaml``. This configuration: - Runs a small model - Trains on a limited number of batches per epoch - Helps identify errors more quickly If you're using a custom configuration, consider making these temporary adjustments: .. code:: yaml dataloader: limit_batches: training: 100 validation: 100 hardware: num_gpus_per_node: 1 These settings limit the data processed and use a single GPU, helping isolate issues related to data or parallelization. *********************************** PyTorch Lightning Debugging Tools *********************************** Anemoi Training leverages PyTorch Lightning, which provides several useful debugging tools. Currently these aren't implemented as config settings yet, but could easily be added, if needed. 1. Overfit on a Single Batch ============================ To identify issues in your model's ability to learn, try overfitting on a single batch: .. code:: python # use only 1% of the train & val set trainer = Trainer(overfit_batches=0.01) # overfit on 10 of the same batches trainer = Trainer(overfit_batches=10) This setting will repeatedly train on the same batch, helping you verify if the model can learn at all. 2. Fast Dev Run =============== For a quick test of your entire training pipeline: .. code:: python trainer = Trainer(fast_dev_run=True) This runs a single batch for training, validation, and testing, checking if all code paths work without errors. 3. Detect Anomalies =================== Enable PyTorch's anomaly detection in the diagnostics configuration: .. code:: yaml debug: anomaly_detection: true This helps identify issues like NaN or infinity values in your model's computations. *************************************** Debug Flags for Better Error Handling *************************************** Anemoi Training can make use of several debug flags to provide more detailed error information: 1. Verbose Mode =============== Enable verbose logging: .. code:: yaml hydra.verbose=true You can set the log level of the logger NAME to DEBUG. Equivalent to ``import logging; logging.getLogger(NAME).setLevel(logging.DEBUG)``. .. code:: yaml hydra.verbose=NAME And even provide multiple targets. .. code:: yaml hydra.verbose=[NAME1,NAME2] This increases the verbosity of log outputs, providing more detailed information about the training process. 2. Asynchronous Callbacks ========================= Disable asynchronous callbacks for clearer error messages: .. code:: yaml diagnostics: plot: asynchronous: false This makes error messages generally easier to understand by ensuring callbacks are executed synchronously. 3. Disable Plotting =================== Turn off plotting callbacks to isolate non-visualization related issues: .. code:: yaml diagnostics: plot: callbacks: [] Or set the plot config to none, (in diagnostics.evaluation) .. code:: yaml defaults: plot: none ********************************** Debugging C10 Distributed Errors ********************************** The C10 distributed error can often mask underlying issues. To debug the true model error: 1. Set CUDA to Blocking Mode ============================ Before running your training script, set the following environment variable: .. code:: bash export CUDA_LAUNCH_BLOCKING=1 This forces CUDA operations to run synchronously, which can reveal the true source of errors that might be hidden by asynchronous execution. 2. Run on a Single GPU ====================== Temporarily run your model on a single GPU to eliminate some distributed training complexities: .. code:: yaml hardware: num_gpus_per_node: 1 The code is still distributed, but at least it removes the multi-GPU aspect and you can use debug statements. 3. Gradually Increase Complexity ================================ Once you've identified and fixed the underlying issue, gradually reintroduce distributed training and multiple GPUs to ensure the problem doesn't reoccur in a multi-GPU setting. ********************************* Additional Troubleshooting Tips ********************************* 1. Check Input Data =================== Verify that your input data is correctly formatted and addressed in the normalizer. Use small subsets of your data to test the pipeline. 2. Inspect Model Outputs ======================== Regularly print or log model outputs, especially in the early stages of training, to catch any anomalies. 3. Monitor Resource Usage ========================= Keep an eye on CPU, GPU, and memory usage. Unexpected spikes or constant high usage might indicate inefficiencies or leaks. This can be enabled in the diagnostics configuration: .. code:: yaml log: mlflow: system: true 4. Use PyTorch Profiler ======================= Leverage PyTorch's built-in profiler to identify performance bottlenecks: We are currently updating the Anemoi profiler to use modern Pytorch profiling tools. 5. Gradient Checking ==================== If you suspect issues with backpropagation, consider implementing gradient checking to verify correct gradient computations. **************************** Seeking Further Assistance **************************** If you've tried these troubleshooting steps and still encounter issues, consider: - Reviewing the Anemoi Training documentation for any recent updates or known issues - Checking the project's issue tracker for similar problems and solutions - Reaching out to the Anemoi community or support channels for additional help Remember to provide as much relevant information as possible when seeking assistance, including your configuration, error messages, and steps to reproduce the issue.