RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
Whenever I run PyTorch with CUDA enabled and the laptop decides to power down or suspend the GPU, the running script fails with a CUDA allocation error. Not cool, laptop, super not cool. Restarting the script, or starting any other program that uses CUDA, might not be possible since the memory wasn't properly released. When trying to run PyTorch you might then see the same error as in the title, i.e.
RuntimeError: cuda runtime error (999) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:47
Now what? Well, the turn-it-off-and-on™ approach will work, but you don't have to go that far. Making sure that the process is properly killed (ha!) and Nvidia's Unified Virtual Memory (UVM) state is clean is enough. The quickest way is to try removing and re-adding the nvidia-uvm module with
$ sudo modprobe -r nvidia-uvm  # same as sudo rmmod nvidia-uvm
$ sudo modprobe nvidia-uvm
Chances are that was enough and you're good to go. (Note that the hyphen "-" is treated the same as the underscore "_", so you might see either version on the internet.)
However, if the remove command above fails with something like "Module nvidia-uvm is in use", it means that the dead process is alive (double ha!). You will find lsmod and lsof helpful in finding out which modules and processes use nvidia-uvm.
Firstly, try lsmod by running something like
$ lsmod | grep nvidia.uvm
Note that the dot "." is a regex wildcard, so it matches both "-" and "_". An example output:
kretyn@junk:~$ lsmod | grep nvidia.uvm
nvidia_uvm            966656  2
nvidia              20680704  1206 nvidia_uvm,nvidia_modeset
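If you want to grab that "Used by" count in a script, the third column of the lsmod line is enough. A minimal sketch, using a sample line copied from the output above (on a real machine you would pipe in `lsmod | grep nvidia_uvm` instead):

```shell
# Read the "Used by" count (third column) of the nvidia_uvm line.
# "sample" is a stand-in for real lsmod output.
sample="nvidia_uvm            966656  2"
used_by=$(echo "$sample" | awk '{print $3}')
echo "$used_by"   # how many users currently hold the module
```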
Columns are "Module", "Size", and "Used by". The number 2 in the "Used by" column means that two users hold nvidia_uvm, although their names aren't tracked; my observation is that each PyTorch process that uses CUDA adds one. To check which processes exactly use nvidia_uvm, use lsof with something like
$ lsof | grep nvidia.uvm
Note that this might take a while. For me it starts with some docker warnings and then, after ~10 seconds, returns lines with /dev/nvidia-uvm. The second column is what you're interested in, as it's the PID. Take a note of it and check with ps aux whether that's the process you want to kill, e.g. if the PID is 115034 then that's
$ ps aux | grep 115034 # "115034" is PID and replace with what you want
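Pulling the PID out of the lsof output can also be scripted, since it's just the second whitespace-separated field. A sketch below, where the sample line is a hypothetical example of what lsof might print for /dev/nvidia-uvm (on a real machine, pipe in `lsof | grep nvidia.uvm`):

```shell
# Extract the PID (second column) from a /dev/nvidia-uvm line of lsof.
# "line" is a made-up example; real lsof output has the same column layout.
line="python3   115034  kretyn  mem  CHR  236,0  0t0  567  /dev/nvidia-uvm"
pid=$(echo "$line" | awk '{print $2}')
echo "$pid"
```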
If that proves to be the old, undead process, then just kill it with kill -9 <PID> and try again with removing the module via modprobe -r nvidia-uvm.
That's a bit of work when written out, but in practice it's quite quick.