Fixing the nvidia-smi "No devices were found" Error on NVIDIA Linux

The Problem

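Recently I ran into a problem on one machine: after a graphics card was installed, the system could not bring it up properly. I tried ubuntu-drivers autoinstall as well as manual installs with apt install nvidia-driver-535 and apt install nvidia-driver-545, but none of them fixed it.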

After every install, running nvidia-smi printed No devices were found, and the Linux system log showed the following errors:

Feb 1 00:00:00 opadm kernel: [ 4251.620273] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x56:1469)
Feb 1 00:00:00 opadm kernel: [ 4251.620379] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Feb 1 00:00:00 opadm kernel: [ 4253.351677] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x56:1469)
Feb 1 00:00:00 opadm kernel: [ 4253.351805] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
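
On a busy machine these lines are easy to miss among other kernel messages. The following is a minimal sketch of a filter that reads kernel log lines on stdin and prints each PCI address that reported RmInitAdapter failed, once per address. The function name nvrm_failed_gpus is made up for illustration and is not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: print the PCI address of every GPU that logged
# "RmInitAdapter failed", one address per line, without duplicates.
nvrm_failed_gpus() {
    grep 'NVRM: GPU .*RmInitAdapter failed' |
        sed 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed.*/\1/' |
        sort -u
}

# Typical use (reading the kernel log may require sudo):
#   sudo dmesg | nvrm_failed_gpus
# To confirm the card is at least visible on the PCI bus:
#   lspci | grep -i nvidia
```

If lspci lists the card but this filter also prints its address, the hardware is detected and the failure is happening while the driver initializes it, which matches the symptom described here.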

The Solution

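In the end, the following worked: running NVIDIA's .run installer with the open kernel module option (-m=kernel-open):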

sudo ./NVIDIA-Linux-x86_64-535.154.05.run -m=kernel-open

If a driver is already installed on the system, be sure to remove it before running the installer:

sudo apt-get purge 'nvidia*'
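
Before purging, it can be useful to see which NVIDIA packages are actually installed, and to run the same check afterwards to confirm nothing is left. This is a small sketch, assuming a Debian/Ubuntu system where dpkg -l marks installed packages with "ii"; the function name installed_nvidia_packages is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: from `dpkg -l` output on stdin, print the names of
# installed ("ii") packages whose name contains "nvidia".
installed_nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $2 }'
}

# Typical use:
#   dpkg -l | installed_nvidia_packages
```

An empty result after the purge suggests the apt-managed driver is gone and the .run installer will not collide with it.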

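However, when I ran the installer I hit an error caused by a GCC version mismatch. The cause is recorded in /var/log/nvidia-installer.log: the log shows that the running kernel was compiled with GCC 12, while the compiler currently in use is GCC 11.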

   x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38

   does not match the compiler used here:

   cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
   Copyright (C) 2021 Free Software Foundation, Inc.
   This is free software; see the source for copying conditions.  There is NO
   warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


   It is recommended to set the CC environment variable
   to the compiler that was used to compile the kernel.

   To skip the test and silence this warning message, set
   the IGNORE_CC_MISMATCH environment variable to "1".
   However, mixing compiler versions between the kernel
   and kernel modules can result in subtle bugs that are
   difficult to diagnose.

   *** Failed CC version check. ***
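
The same mismatch can be checked by hand before launching the installer. The sketch below assumes that /proc/version records the kernel's compiler in the same "gcc-12 (Ubuntu ...) 12.3.0" layout that appears in the log excerpt above; the function name kernel_gcc_major is hypothetical, and other distributions may format this string differently.

```shell
#!/bin/sh
# Hypothetical helper: read a /proc/version-style line on stdin and print the
# major version of the GCC that built the kernel. Assumes the
# "gcc-12 (Ubuntu ...) 12.3.0" or "gcc (Ubuntu ...) 11.4.0" layout.
kernel_gcc_major() {
    grep -oE 'gcc(-[0-9]+)? \([^)]*\) [0-9]+' | grep -oE '[0-9]+$'
}

# Typical use:
#   kernel_major=$(kernel_gcc_major < /proc/version)
#   cc_major=$(cc -dumpversion | cut -d. -f1)
#   [ "$kernel_major" = "$cc_major" ] ||
#       echo "mismatch: kernel built with GCC $kernel_major, cc is GCC $cc_major"
```

On the machine described here, this comparison would be expected to report 12 for the kernel and 11 for cc.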

So the fix is as follows:

Find the currently installed GCC version and its path:

$ which gcc-11
/usr/bin/gcc-11
$ which gcc
/usr/bin/gcc
$ ls -all /usr/bin/gcc
lrwxrwxrwx 1 root root 6 Aug  5  2021 /usr/bin/gcc -> gcc-11

List all installed GCC versions (type the prefix, then press Tab twice to show the completions):

$ ls -all /usr/bin/gc<Tab><Tab>
gcc            gcc-ar         gcc-nm         gcc-ranlib     gcov           gcov-dump      gcov-tool
gcc-11         gcc-ar-11      gcc-nm-11      gcc-ranlib-11  gcov-11        gcov-dump-11   gcov-tool-11
gcc-12         gcc-ar-12      gcc-nm-12      gcc-ranlib-12  gcov-12        gcov-dump-12   gcov-tool-12

Switch the GCC version:

$ sudo rm /usr/bin/gcc
$ sudo ln -s /usr/bin/gcc-12 /usr/bin/gcc
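
Removing and recreating the symlink by hand leaves a brief window with no gcc at all, and a typo in the version number would leave a dangling link. A slightly more defensive sketch is shown below. The function name switch_gcc and its two parameters are assumptions for illustration; it uses a relative link target, like the original gcc -> gcc-11 link.

```shell
#!/bin/sh
# Hypothetical helper: repoint <bindir>/gcc at gcc-<version>, refusing to
# touch anything if that versioned compiler is not present.
switch_gcc() {
    bindir=$1
    version=$2
    if [ ! -x "$bindir/gcc-$version" ]; then
        echo "gcc-$version not found in $bindir" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace the existing link,
    # -n: treat the existing link itself as the destination
    ln -sfn "gcc-$version" "$bindir/gcc"
}

# Typical use, as root:
#   switch_gcc /usr/bin 12    # before building the driver
#   switch_gcc /usr/bin 11    # to restore the previous default afterwards
```

The installer log's own recommendation is to set the CC environment variable instead, which would leave /usr/bin/gcc untouched, for example sudo env CC=/usr/bin/gcc-12 ./NVIDIA-Linux-x86_64-535.154.05.run -m=kernel-open. That variant is an assumption based on the log message, not something the steps in this post exercised.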

Check the GCC version again:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/12/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 12.3.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-12/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-12 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-12-ALHxjy/gcc-12-12.3.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-12-ALHxjy/gcc-12-12.3.0/debian/tmp-gcn/usr --enable-offload-defaulted --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)

After installing with sudo ./NVIDIA-Linux-x86_64-535.154.05.run -m=kernel-open, the graphics card was finally detected:

$ nvidia-smi
Mon Feb 1 00:00:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:01:00.0 Off |                  N/A |
| 30%   36C    P0              38W / 170W |      0MiB / 12288MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
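
For scripts or a post-reboot check, the full table is more than is needed. The sketch below uses nvidia-smi's --query-gpu and --format options to get one GPU name per line, and wraps the check in a small filter. The function name gpu_names is hypothetical, and filtering on the literal text "No devices were found" is an assumption based on the message quoted earlier in this post.

```shell
#!/bin/sh
# Hypothetical helper: read `nvidia-smi --query-gpu=name --format=csv,noheader`
# output on stdin; print the GPU names and succeed only if at least one GPU
# was listed.
gpu_names() {
    names=$(grep -v -e '^$' -e 'No devices were found' || true)
    [ -n "$names" ] && printf '%s\n' "$names"
}

# Typical use:
#   nvidia-smi --query-gpu=name --format=csv,noheader | gpu_names ||
#       echo "no GPU visible to the driver" >&2
```

On the machine above, this would be expected to print the single RTX 3060 entry shown in the table.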

References

Download page for NVIDIA-Linux-x86_64-535.154.05.run