RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
Whenever I run PyTorch with CUDA enabled and the laptop decides to power down or suspend the GPU, the running script fails with a CUDA allocation error. Not cool, laptop, super not cool. Restarting the script, or starting any other program that uses CUDA, might not be possible since the memory wasn't properly released. When trying to run PyTorch you might then see the same error as in the title, i.e.
RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
Now what? Well, the turn-it-off-and-on™ approach will work, but you don't have to go that far. Making sure that the process is properly killed (ha!) and Nvidia's Unified Virtual Memory (UVM) state is clean is enough. The quickest way is to try removing and re-adding the nvidia-uvm module with
$ sudo modprobe -r nvidia-uvm  # same as sudo rmmod nvidia-uvm
$ sudo modprobe nvidia-uvm
Chances are that was enough and you're good to go. (Note that the hyphen "-" is treated the same as the underscore "_", so you might see either version on the internet.)
However, if the remove command above fails with something like "Module nvidia-uvm is in use", it means that the dead process is alive (double ha!). You will find lsmod and lsof helpful in finding out which modules and processes use nvidia-uvm.
Firstly, try lsmod by running something like
$ lsmod | grep nvidia.uvm
Note that the dot "." is a regex wildcard, so it matches both "-" and "_". An example output:
kretyn@junk:~$ lsmod | grep nvidia.uvm
nvidia_uvm            966656  2
nvidia              20680704  1206 nvidia_uvm,nvidia_modeset
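If you want to grab that "Used by" count in a script, the third column of the lsmod line is enough. A minimal sketch, using a sample line copied from the output above (on a real machine you would pipe in `lsmod | grep nvidia_uvm` instead):

```shell
# Read the "Used by" count (third column) of the nvidia_uvm line.
# "sample" is a stand-in for real lsmod output.
sample="nvidia_uvm            966656  2"
used_by=$(echo "$sample" | awk '{print $3}')
echo "$used_by"   # how many users currently hold the module
```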
Columns are "Module", "Size", and "Used by". The number 2 in the "Used by" column means that two users hold nvidia_uvm, although their names aren't tracked; my observation is that each PyTorch process that uses CUDA adds one. To check which processes exactly use nvidia_uvm, use lsof with something like
$ lsof | grep nvidia.uvm
Note that this might take a while. For me it starts with some docker warnings and then, after ~10 seconds, returns lines with /dev/nvidia-uvm. The second column is what you're interested in, as it's the PID. Take a note of it and check with ps aux whether that's the process you want to kill, e.g. if the PID is 115034 then that's
$ ps aux | grep 115034 # "115034" is PID and replace with what you want
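Pulling the PID out of the lsof output can also be scripted, since it's just the second whitespace-separated field. A sketch below, where the sample line is a hypothetical example of what lsof might print for /dev/nvidia-uvm (on a real machine, pipe in `lsof | grep nvidia.uvm`):

```shell
# Extract the PID (second column) from a /dev/nvidia-uvm line of lsof.
# "line" is a made-up example; real lsof output has the same column layout.
line="python3   115034  kretyn  mem  CHR  236,0  0t0  567  /dev/nvidia-uvm"
pid=$(echo "$line" | awk '{print $2}')
echo "$pid"
```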
If that proves to be the old, undead process, then just kill it with kill -9 <PID> and try again with removing the module via modprobe -r nvidia-uvm.
That's a bit of work when written out, but in practice it's quite quick.