GPU Debugging Guide

How to debug GPU simulation issues

This is not an exhaustive guide. Many things can go wrong, but this guide covers the most common issues.

How do I see what is happening?

Several tools let you see what is happening on the GPUs themselves and on the host server.

Server

Run the command htop to see what is running on the server, along with CPU and memory usage.

GPUs

Run the command nvidia-smi to see what is running on the GPUs, including the memory in use on each one.

Run the command nvtop for a more detailed, continuously updating view of what is running on the GPUs.
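For output that is easier to log or compare over time, nvidia-smi can also print selected fields as CSV. The following is a minimal sketch rather than a recommended setup: the function name gpu_memory_summary is made up for illustration, and the fallback message exists only so that the snippet still runs on a machine without an NVIDIA driver.

```shell
# gpu_memory_summary is a hypothetical helper name, not part of any tool.
gpu_memory_summary() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # One CSV row per GPU: index, model, memory in use, total memory, utilisation.
        nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv
        # One CSV row per compute process that currently holds GPU memory.
        nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
    else
        echo "nvidia-smi not found on this machine"
    fi
}

gpu_memory_summary
```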

Why does my code crash with memory errors?

The GPU servers are set up so that a job will not run if there is not enough free memory available. This is a safety measure that stops multiple users from starting GPU processes and exhausting the resources on the server. It means that your new job will crash, but jobs that are already running will continue to run.

In this situation you have a few choices:

Try to run your processes later, when the server is less busy.
Adjust your code to reduce the amount of memory it uses. The easiest way to do this is to reduce the batch size, but libraries such as PyTorch and TensorFlow often provide functions that release memory that is no longer needed.
Kill processes on the server that are no longer being used. Leftover processes are the most common cause of this issue: people are often surprised to find dozens, or sometimes hundreds, of processes running under their account that they were not aware of. To check which processes are currently running for your account, use the command ps -ef | grep userid (where userid is replaced with your userid). This prints a list of processes. You can kill one of them by running kill PID (where PID is replaced with the PID number from the output of the ps command). If you use tmux or screen on the server, make sure that you close those sessions when they are not running any code.
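The last option in the list above can be sketched as follows. This is an illustration under assumptions, not a site-approved procedure: the function names list_my_processes and stop_process are hypothetical, and the two-second grace period is an arbitrary choice. The first function lists one user's processes with their PIDs. The second stops a single PID by sending the default TERM signal first, and KILL only if the process is still there.

```shell
# List one user's processes: PID, elapsed time, resident memory (KiB), command line.
# Defaults to the current user when no userid is given.
list_my_processes() {
    ps -u "${1:-$(id -u)}" -o pid,etime,rss,args
}

# Stop one process by PID: polite TERM first, KILL only if it is still alive.
stop_process() {
    pid="$1"
    kill "$pid" 2>/dev/null || return 0   # nothing to do: no such process, or not yours
    sleep 2                                # arbitrary grace period
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid" 2>/dev/null || true
    fi
    return 0
}
```

For example, list_my_processes userid prints the table for that account, and stop_process 12345 stops the process whose PID is 12345 (a made-up number).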

Do you know how much RAM your code requires to run?

In htop, press "o" and then "u" to bring up a list of users on the server. Use the arrow keys to move down and press Return when you reach your userid. This will show the processes that you have started. Run your code from another window and you should see its processes appear. This shows how much RAM they are requesting and how many of them there are.

Your code will only run if (number of processes × RAM required per process) < free RAM.
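As a rough worked check of the rule above, the comparison can be scripted. This is a sketch under stated assumptions: fits_in_ram and available_mib are hypothetical helper names, all sizes are in MiB, the example numbers are invented, and MemAvailable from /proc/meminfo (Linux only) stands in for "free RAM", which may not be exactly the figure the servers enforce.

```shell
# Succeeds when (number of processes x RAM per process) is strictly below free RAM.
# All three arguments are whole numbers; sizes are in MiB.
fits_in_ram() {
    count="$1"; per_process_mib="$2"; free_mib="$3"
    [ $(( count * per_process_mib )) -lt "$free_mib" ]
}

# Estimate of the memory available to new processes, in MiB (Linux only).
available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

# Invented example: 8 processes needing 4096 MiB each against 65536 MiB free.
if fits_in_ram 8 4096 65536; then
    echo "fits"
else
    echo "does not fit"
fi
```

To check against the live machine instead of a fixed number, pass the output of available_mib as the third argument, for example fits_in_ram 8 4096 "$(available_mib)".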

How many processes am I running?

A simulation that generates a large number of processes has usually not been coded properly. If your simulation is generating hundreds of processes, it will eventually crash the server.

To see how many processes you are running, use the command:

ps -ef | grep userid | wc -l
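The pipeline above gives a quick estimate: it also counts the grep process itself and any other line that happens to contain the userid string. A slightly stricter count can be sketched as below, assuming the procps version of ps found on most Linux servers; count_processes is a hypothetical helper name.

```shell
# Count processes owned by one user (defaults to the current user).
# "-o pid=" prints only PIDs with no header line, so each output line is one process.
# Short-lived processes started by this command itself (such as ps) are included.
count_processes() {
    ps -u "${1:-$(id -u)}" -o pid= | wc -l
}

count_processes
```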

If you are running a large number of processes (100+), you can kill them all at once with something like:

pkill -u userid
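Note that pkill -u userid signals every process owned by that account, including the login shell and any tmux or screen sessions. Where only the runaway jobs should go, a narrower variant can be sketched as follows, assuming the procps versions of pgrep and pkill. The names preview_matches and kill_matches are hypothetical, and the pattern is whatever distinguishes the unwanted jobs on their command line.

```shell
# Show what would be signalled: PID and full command line of each match.
# $1 = userid, $2 = regular expression matched against the full command line.
preview_matches() {
    pgrep -u "$1" -a -f "$2"
}

# Send the default TERM signal to the same set of processes.
kill_matches() {
    pkill -u "$1" -f "$2"
}
```

For example, preview_matches userid 'python train' lists the matching processes, and kill_matches userid 'python train' then stops them. Here 'python train' is only a placeholder pattern. Check the preview first, because the pattern is matched against the whole command line and can catch more than intended.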