On 2020/2/26 4:39 AM, Jonathan H wrote:
> Hello everyone,
>
> Previously, I was running an array with six disks all connected via
> USB. I am running raid1c3 for metadata and raid6 for data, kernel
> 5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
> deduplication. Four of the six drives are stored in a single four-bay
> enclosure. Due to my oversight, TLER was not enabled for any of the
> drives, so when one of them started failing, the enclosure was reset
> and all four drives were disconnected.
>
> After rebooting, the file system was still mountable. I saw some
> transid errors in dmesg,

This means the fs is already corrupted.

If btrfs check is run before mounting, it may provide some pretty good
debugging info (an example invocation is sketched below).

Also, the exact message for the transid error and some context would
help us determine how serious the corruption is.

> but I didn't really pay attention to them
> because I was trying to get rid of the now failed drive. I tried to
> "btrfs replace" the drive with a different one, but the replace
> stopped making progress because all reads to the dead drive in a
> certain location were failing (even with the "-r" flag). So I tried
> mounting degraded without the dead drive and doing "btrfs dev delete
> missing" instead. The deletion failed with the following kernel
> message:
>
> [ +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> [ +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> [ +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
> [ +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> [ +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> [ +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4

This is a different error. Root -9 is the data reloc tree, so this
means the data reloc tree is corrupted. It somewhat looks like an
existing bug, especially since every rebuild results in the same csum.

>
> I noticed that almost all of the files give an I/O error when read,
> and similar kernel messages are generated, but with positive roots.

Please give the exact dmesg output, including all the messages for the
same bytenr.
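To be concrete, something along these lines should be enough; the
device name and output file are only examples, so adjust them to your
setup:

  # with the filesystem unmounted, run a read-only check against any
  # one member device of the array
  btrfs check --readonly /dev/sdb

  # collect every btrfs line from the kernel log, not just the ones
  # that caught your eye
  dmesg | grep -i btrfs > btrfs-dmesg.txt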
> I also see "read error corrected" messages, but
> if I try to read the files again, the exact same messages are
> printed again, which seems to suggest that the errors haven't really
> been corrected? (But maybe this is intended behavior.)
>
> I also attempted to use "btrfs restore" to recover the files, but
> almost all of the files produce "ERROR: zstd decompress failed Unknown
> frame descriptor" and the recovery does not succeed.
>
> Since then, I have been scrubbing the file system. The first scrub
> produced lots of uncorrectable read errors and several hundred csum
> errors. I'm assuming the read errors are due to the missing drive. The
> puzzling thing is, the scrub can "complete" (actually, it is aborted
> after it completes on all drives but the missing one) and I can delete
> all of the files with unrecoverable csum errors, but all of the issues
> above persist. I can then turn around and scrub again, and the scrub
> will find new csum errors, which seems bizarre to me, since I would
> have expected them all to be fixed. However, all transid related
> errors have disappeared after the first scrub.
>
> I have also tried deleting the file referenced in the device deletion
> error and restarting the deletion. This seems to be working, but
> progress has been very slow and I fear I'll have to delete all of the
> I/O error-producing files above, which I would like to avoid if
> possible.
>
> What should I do in this situation and how can I avoid this in the future?

Although I don't believe the hardware is to blame, you can still try
disabling the write cache on all related devices, as an experiment to
rule out bad disk flush/FUA behavior (a rough sketch is below).
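For example (the device range below is only a placeholder; whether
this works at all through a USB enclosure depends on the bridge
passing ATA commands through):

  # disable the volatile write cache on each member drive
  for dev in /dev/sd[b-g]; do
      hdparm -W 0 "$dev"
  done

The setting may not survive a power cycle, so re-check it after a
reboot or after the enclosure resets.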
Thanks,
Qu

>
> Thanks,
> Jonathan
>