The Problem
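I recently ran into a problem on one machine: after installing a graphics card, the system could not initialize it. I tried ubuntu-drivers autoinstall as well as installing the driver manually with apt install nvidia-driver-535 and apt install nvidia-driver-545, but none of these fixed it.
After every installation, running nvidia-smi printed No devices were found, and the Linux system log showed the following errors: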
Feb 1 00:00:00 opadm kernel: [ 4251.620273] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x56:1469)
Feb 1 00:00:00 opadm kernel: [ 4251.620379] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Feb 1 00:00:00 opadm kernel: [ 4253.351677] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x56:1469)
Feb 1 00:00:00 opadm kernel: [ 4253.351805] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
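To pull only lines like these out of a noisy kernel log, a small filter can help. The sketch below is not from the original troubleshooting session. The function name nvrm_failures is made up for illustration, and it simply reads log text from standard input. The dmesg call in the comment is one possible source, and on some systems journalctl -k is the equivalent.

```shell
# Hypothetical helper (name invented for this example):
# print only NVRM kernel messages that report a failure.
nvrm_failures() {
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -E 'NVRM: .*failed' || true
}

# Example use against the live kernel log (usually needs root):
#   sudo dmesg | nvrm_failures
```

If this prints nothing, the driver has not logged an initialization failure since boot.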
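The Solution
In the end, I solved the problem by installing the driver from NVIDIA's .run installer with the open kernel modules: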
sudo ./NVIDIA-Linux-x86_64-535.154.05.run -m=kernel-open
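If a driver is already installed on the system, be sure to remove it first with the following command (the pattern is quoted so the shell does not expand it against files in the current directory):
sudo apt-get purge 'nvidia*'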
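However, the installer failed the first time I ran it, because of a GCC version mismatch. The cause is recorded in /var/log/nvidia-installer.log: the log shows that the running kernel was compiled with GCC 12, while the GCC currently in use on the system is version 11.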
x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38
does not match the compiler used here:
cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
It is recommended to set the CC environment variable to the compiler that was used to compile the kernel.
To skip the test and silence this warning message, set the IGNORE_CC_MISMATCH environment variable to "1". However, mixing compiler versions between the kernel and kernel modules can result in subtle bugs that are difficult to diagnose.
*** Failed CC version check. ***
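The same mismatch can be checked before running the installer, since /proc/version records the compiler that built the running kernel. The following is a rough sketch rather than part of the original fix. The helper name kernel_gcc_major is invented, and the sed pattern assumes the compiler string looks like the one in the log above (x86_64-linux-gnu-gcc-12 (Ubuntu ...) 12.3.0). Other distributions format this string differently, in which case the helper prints nothing.

```shell
# Hypothetical helper: read a /proc/version-style line on stdin and print
# the major version of the GCC that built the kernel.
# Assumed input shape: "... (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld ..."
kernel_gcc_major() {
    sed -n 's/.*gcc[^ ]* ([^)]*) \([0-9][0-9]*\)\..*/\1/p'
}

# Example use on the affected machine:
#   kernel_gcc_major < /proc/version     # compiler that built the kernel
#   gcc -dumpversion | cut -d. -f1       # compiler the installer would pick up
```

If the two numbers differ, the installer's CC version check is likely to fail in the way shown above.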
The fix is as follows.
Find the currently installed GCC version and its path:
$ which gcc-11
/usr/bin/gcc-11
$ which gcc
/usr/bin/gcc
$ ls -all /usr/bin/gcc
lrwxrwxrwx 1 root root 6 Aug 5 2021 /usr/bin/gcc -> gcc-11
List all GCC versions currently installed (type the path prefix and press Tab to show the completions):
$ ls -all /usr/bin/gc
gcc     gcc-ar     gcc-nm     gcc-ranlib     gcov     gcov-dump     gcov-tool
gcc-11  gcc-ar-11  gcc-nm-11  gcc-ranlib-11  gcov-11  gcov-dump-11  gcov-tool-11
gcc-12  gcc-ar-12  gcc-nm-12  gcc-ranlib-12  gcov-12  gcov-dump-12  gcov-tool-12
Switch the GCC version by pointing the /usr/bin/gcc symlink at gcc-12:
$ sudo rm /usr/bin/gcc
$ sudo ln -s /usr/bin/gcc-12 /usr/bin/gcc
Check the GCC version again:
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/12/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 12.3.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-12/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-12 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-12-ALHxjy/gcc-12-12.3.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-12-ALHxjy/gcc-12-12.3.0/debian/tmp-gcn/usr --enable-offload-defaulted --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
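The installer log also recommends setting the CC environment variable to the compiler that built the kernel, which would avoid changing the system-wide /usr/bin/gcc symlink. I did not take that route, so treat the following as an untested sketch. The helper installer_cmd is hypothetical and only prints the command instead of running it, and the /usr/bin/gcc-N path layout is assumed from the listing above.

```shell
# Hypothetical helper: print (do not run) an installer command that passes
# the kernel's compiler through CC instead of relinking /usr/bin/gcc.
# $1 = GCC major version, $2 = path to the .run installer
installer_cmd() {
    printf 'sudo env CC=/usr/bin/gcc-%s %s -m=kernel-open\n' "$1" "$2"
}

installer_cmd 12 ./NVIDIA-Linux-x86_64-535.154.05.run
```

Using sudo env passes CC explicitly to the installer, since sudo may otherwise reset the caller's environment.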
After installing successfully with sudo ./NVIDIA-Linux-x86_64-535.154.05.run -m=kernel-open, the system finally detected the graphics card:
$ nvidia-smi
Mon Feb 1 00:00:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:01:00.0 Off |                  N/A |
| 30%   36C    P0              38W / 170W |      0MiB / 12288MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
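For a scripted check, for example after a reboot or a kernel update, the GPU list can be counted instead of reading the table by eye. This is only a sketch. The helper name count_gpus is invented, and it assumes the nvidia-smi -L listing format, in which each device line starts with "GPU" followed by its index.

```shell
# Hypothetical helper: count GPUs in `nvidia-smi -L`-style output on stdin.
count_gpus() {
    # grep -c still prints 0 when nothing matches; "|| true" hides its exit status
    grep -c '^GPU [0-9]' || true
}

# Example use on a machine with the driver installed:
#   nvidia-smi -L | count_gpus
```

A result of 0 corresponds to the earlier No devices were found situation.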
References
Download location for NVIDIA-Linux-x86_64-535.154.05.run