Running into an XFS /sysroot Mount Failure on EC2

Today someone told me that a service on EC2 had stopped working. I assumed it was the same as before, where one of the machines had just hung for a moment, so I triggered a reboot from the AWS console. The system surprisingly never rebooted, so I went ahead and pressed Stop. After the machine started up again it showed Status check 1/2, and a few more reboots gave exactly the same result, so I asked Support to help look into it. But since we hadn't purchased an AWS Support plan, there was no way to dig any deeper into the problem.

With no other way forward, I opened Get instance screenshot to see what was going on.

The system cannot mount /sysroot

The screenshot didn't really show anything useful, so I clicked the Connect button below it to take a closer look. That's when it became clear things were bad: the system was failing while mounting /sysroot, no wonder it couldn't boot.

[  OK  ] Started File System Check on /dev/d…e390f-835b-4223-a9bb-9b45984ddf8d.
         Mounting /sysroot...
[    6.659555] SGI XFS with ACLs, security attributes, quota, no debug enabled
[    6.668196] XFS (nvme0n1p1): Mounting V5 Filesystem
[   11.020272] XFS (nvme0n1p1): Starting recovery (logdev: internal)
[FAILED] Failed to mount /sysroot.
See 'systemctl status sysroot.mount' for details.
[DEPEND] Dependency failed for Initrd Root File System.
[DEPEND] Dependency failed for Reload Configuration from the Real Root.

Solution

The way out of this problem is to launch a new EC2 instance, attach the broken disk to it as a data volume, and repair it from there.
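
If you would rather do the volume shuffle from the AWS CLI instead of the console, it looks roughly like the sketch below. The volume and instance IDs are placeholders, and the device name passed to attach-volume is only a hint; inside the guest the volume will usually show up under an NVMe name such as /dev/nvme1n1.

# Stop the broken instance so its root volume can be detached
aws ec2 stop-instances --instance-ids i-0broken00000000000

# Detach the root volume (placeholder IDs throughout)
aws ec2 detach-volume --volume-id vol-0123456789abcdef0

# Attach it to the rescue instance as a plain data volume
aws ec2 attach-volume \
  --volume-id vol-0123456789abcdef0 \
  --instance-id i-0rescue00000000000 \
  --device /dev/sdf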

Repair steps

Use lsblk to find the device name of the broken disk

[ec2-user@ip-172-31-10-10 ~]$ lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1       259:0    0    8G  0 disk
├─nvme0n1p1   259:3    0    8G  0 part /
└─nvme0n1p128 259:4    0    1M  0 part
nvme1n1       259:1    0  300G  0 disk
└─nvme1n1p1   259:2    0  280G  0 part
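
The broken root volume shows up here as /dev/nvme1n1p1. Before repairing anything it doesn't hurt to double check that the partition really is XFS and to do a read-only pass first; this is an extra step, not something the original walkthrough ran:

# blkid should report TYPE="xfs" for the partition
sudo blkid /dev/nvme1n1p1

# -n is xfs_repair's no-modify mode: it reports problems without changing anything
sudo xfs_repair -n /dev/nvme1n1p1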

Checking the disk with xfs_repair confirms that it really is broken

[ec2-user@ip-172-31-10-10 ~]$ sudo xfs_repair -v /dev/nvme1n1p1
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
        - block cache size set to 172200 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 5405 tail block 7092
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
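
As the error message itself says, -L throws the log away, so it is worth one attempt at mounting the partition on the rescue instance first so the log can be replayed and unmounted cleanly; the mount point below is an arbitrary choice. In this case /sysroot had already refused to mount on the original instance, which is why the -L repair in the next step ends up being the fallback.

# Try to replay the XFS log by mounting and cleanly unmounting the partition
sudo mkdir -p /mnt/broken
sudo mount /dev/nvme1n1p1 /mnt/broken && sudo umount /mnt/broken

# Only fall back to xfs_repair -L if the mount itself fails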

Run the repair command to fix the disk

[ec2-user@ip-172-31-10-10 ~]$ sudo xfs_repair -v -L  /dev/nvme1n1p1
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
        - block cache size set to 172200 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 5405 tail block 7092
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
agi unlinked bucket 20 is 414292 in ag 30 (inode=126243412)
agi unlinked bucket 55 is 1652663 in ag 59 (inode=249116599)
sb_icount 2286464, counted 2291648
sb_ifree 27750, counted 23431
sb_fdblocks 11511530, counted 11353049
        - 10:51:18: scanning filesystem freespace - 141 of 141 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 10:51:18: scanning agi unlinked lists - 141 of 141 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 105
        - agno = 90
        - agno = 60

...

        - agno = 140
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 126243412, moving to lost+found
disconnected inode 249116599, moving to lost+found
Phase 7 - verify and correct link counts...
        - 10:51:51: verify and correct link counts - 141 of 141 allocation groups done
Note - quota info will be regenerated on next quota mount.
Maximum metadata LSN (87058:5395) is ahead of log (1:2).
Format log to cycle 87061.

        XFS_REPAIR Summary    Tue Dec 27 10:51:52 2022

Phase		Start		End		Duration
Phase 1:	12/27 10:51:17	12/27 10:51:17
Phase 2:	12/27 10:51:17	12/27 10:51:18	1 second
Phase 3:	12/27 10:51:18	12/27 10:51:35	17 seconds
Phase 4:	12/27 10:51:35	12/27 10:51:36	1 second
Phase 5:	12/27 10:51:36	12/27 10:51:37	1 second
Phase 6:	12/27 10:51:37	12/27 10:51:51	14 seconds
Phase 7:	12/27 10:51:51	12/27 10:51:51

Total run time: 34 seconds
done
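
Before handing the volume back it can be worth mounting it on the rescue instance once, both to confirm it now mounts cleanly and to look at what landed in lost+found (the run above moved inodes 126243412 and 249116599 there); /mnt/broken is again just an arbitrary mount point:

sudo mkdir -p /mnt/broken
sudo mount /dev/nvme1n1p1 /mnt/broken

# Recovered files are named after their inode numbers
ls /mnt/broken/lost+found

# Unmount cleanly before detaching the volume
sudo umount /mnt/broken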

Once the repair succeeds, attach the disk back to the original EC2 instance and boot it, and the system starts up normally again! I never thought I'd be doing this kind of repair in the cloud.
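
For completeness, the reattach can also be scripted with the same placeholder IDs as before; the device name has to match the root device name the original instance expects (often /dev/xvda or /dev/sda1 depending on the AMI, so check the instance's root device setting first):

aws ec2 detach-volume --volume-id vol-0123456789abcdef0

aws ec2 attach-volume \
  --volume-id vol-0123456789abcdef0 \
  --instance-id i-0broken00000000000 \
  --device /dev/xvda

aws ec2 start-instances --instance-ids i-0broken00000000000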