Re: first it froze, now the (btrfs) root fs won't mount ...

On 2019/10/20 6:34 AM, Christian Pernegger wrote:
> [Please CC me, I'm not on the list.]
> 
> Hello,
> 
> I'm afraid I could use some help.
> 
> The affected machine froze during a game, was entirely unresponsive
> locally, though ssh still worked. For completeness' sake, dmesg had:
> [110592.128512] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> timeout, signaled seq=3404070, emitted seq=3404071
> [110592.128545] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> information: process Xorg pid 1191 thread Xorg:cs0 pid 1204
> [110592.128549] amdgpu 0000:0c:00.0: GPU reset begin!
> [110592.138530] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
> timeout, signaled seq=13149116, emitted seq=13149118
> [110592.138577] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> information: process Overcooked.exe pid 4830 thread dxvk-submit pid
> 4856
> [110592.138579] amdgpu 0000:0c:00.0: GPU reset begin!

It looks like you're using an eGPU and the Thunderbolt 3 connection
dropped? That could cause a kernel panic or hang.

> 
> Oh well, I thought, and "shutdown -h now" it. That quit my ssh session
> and locked me out, but otherwise didn't take, no reboot, still frozen.
> Alt-SysRq-REISUB it was. That did it.
> 
> Only now all I get is a rescue shell, the pertinent messages look to
> be [everything is copied off the screen by hand]:
> [...]
> BTRFS info [...]: disk space caching is enabled
> BTRFS info [...]: has skinny extents
> BTRFS error [...]: bad tree block start, want [big number] have 0
> BTRFS error [...]: failed to read block groups: -5
> BTRFS error [...]: open_ctree failed

This means some tree blocks didn't reach disk or just got wiped out.

Are you using the discard mount option?
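
For reference, a minimal sketch of how one might check an option string
for discard. The helper name has_discard is made up for illustration; it
only inspects a comma-separated option list such as the fourth field of
an fstab entry, and it also treats the "discard=<mode>" form found on
newer kernels as a match.

```shell
# has_discard OPTS: succeed if the comma-separated mount option string
# OPTS contains "discard" or "discard=<mode>" as a whole option.
# (Hypothetical helper, not part of btrfs-progs.)
has_discard() {
    case ",$1," in
        *,discard,*|*,discard=*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with a hard-coded option string:
if has_discard "rw,noatime,discard,subvol=@"; then
    echo "discard is enabled"
else
    echo "discard is not enabled"
fi
```

On a system where the filesystem is mounted, the option string could come
from something like `findmnt -no OPTIONS /`.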

> 
> Mounting with -o ro,usebackuproot doesn't change anything.
> 
> running btrfs check gives:
> checksum verify failed on [same big number] found [8 digits hex] wanted 00000000
> checksum verify failed on [same big number] found [8 digits hex] wanted 00000000

Again, some old tree blocks got wiped out.

BTW, you don't need to redact the numbers; sometimes they help
developers track down corner cases.

> bytenr mismatch, want=[same big number], have=0
> ERROR: cannot open filesystem.
> 
> That's all I've got, I'd really appreciate some help. There's hourly
> snapshots courtesy of Timeshift, though I have a feeling those won't
> help ...

If it's the only problem, you can try this kernel branch to at least do
a RO mount:
https://github.com/adam900710/linux/tree/rescue_options

Then mount the fs with the "rescue=skipbg,ro" option.
If the bad tree block is the only problem, the mount should succeed.
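
As a rough sketch, assuming you have booted a kernel built from that
branch: the device and mount point below are placeholders, and the
helper rescue_mount_cmd is made up for illustration. It only prints the
command so you can review it before running it as root.

```shell
# Placeholders, not taken from the report; substitute your own.
DEV=/dev/nvme0n1p2
MNT=/mnt/rescue

# rescue_mount_cmd DEV MNT: print the read-only rescue mount command.
# The rescue=skipbg option is only understood by the patched kernel.
rescue_mount_cmd() {
    printf 'mount -t btrfs -o %s %s %s\n' "rescue=skipbg,ro" "$1" "$2"
}

rescue_mount_cmd "$DEV" "$MNT"
```

If the mount works, copy anything important off the filesystem before
attempting any repair.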

If that mount succeeds and you can access all files, then only the
extent tree is corrupted, and you can try btrfs check
--init-extent-tree; there are some reports of --init-extent-tree fixing
this problem.
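
A cautious way to approach that step might look like the sketch below.
btrfs check is meant to run on an unmounted filesystem, so the sketch
first looks the device up in the mounts table. The device name is a
placeholder and is_mounted is a made-up helper; the script only prints
the repair command instead of running it, since --init-extent-tree
rewrites metadata and should only be tried after the data has been
copied elsewhere.

```shell
# is_mounted DEV [MOUNTS_FILE]: succeed if DEV is listed as a mount
# source in MOUNTS_FILE (defaults to /proc/mounts).
# (Hypothetical helper for illustration.)
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

DEV=/dev/nvme0n1p2   # placeholder, substitute your own device

if is_mounted "$DEV"; then
    echo "unmount $DEV before running btrfs check" >&2
else
    # Dry run only: review, then run the printed command as root.
    echo "would run: btrfs check --init-extent-tree $DEV"
fi
```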

> 
> Oh, it's a recent Linux Mint 19.2 install, default layout (@, @home),
> Timeshift enabled; on a single device (NVMe). HWE kernel (Kernel
> 5.0.0-31-generic), btrfs-progs 4.15.1.

As for the cause: either btrfs didn't write some tree blocks correctly,
or the NVMe doesn't implement FUA/FLUSH correctly (which I don't believe
is the case).

So I recommend updating to a 5.3 kernel.

Thanks,
Qu

> 
> TIA,
> Christian
> 
