I appreciate the replies. As a general update, I ended up cleaning out
a large amount of unneeded files, hoping the corruption was in one of
those, and retried the device deletion - it completed successfully.
I'm not really sure why the files were ever unrecoverably corrupted -
the system has never crashed or lost power since this filesystem was
created. It's a Fedora server, somewhat regularly updated, and this
btrfs FS was created maybe 2 years ago - I'm not sure which kernel
version it was created under, but it was most recently running kernel
5.3.16 when I noticed the hard drive failing. I'm not sure when it
first started having problems.

Thanks,
Martin

On Thu, Dec 26, 2019 at 1:50 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> On 2019/12/26 下午1:40, Zygo Blaxell wrote:
> > On Thu, Dec 26, 2019 at 01:03:47PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2019/12/26 上午3:25, Martin wrote:
> >>> Hi,
> >>>
> >>> I have a drive that started failing (uncorrectable errors & lots of
> >>> relocated sectors) in a RAID6 (12 devices / 70TB total with 30TB of
> >>> data), and btrfs scrub started showing corrected errors as well
> >>> (seemingly no big deal since it's RAID6). I decided to remove the
> >>> drive from the array with:
> >>> btrfs device delete /dev/sdg /mount_point
> >>>
> >>> After about 20 hours, and having rebalanced 90% of the data off the
> >>> drive, the operation failed with an I/O error. dmesg was showing csum
> >>> errors:
> >>> BTRFS warning (device sdf): csum failed root -9 ino 2526 off
> >>> 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
> >>> BTRFS warning (device sdf): csum failed root -9 ino 2526 off
> >>> 10673852416 csum 0x8941f998 expected csum 0x8a9a53fe mirror 2
> >>> . . .
> >>
> >> This means a data reloc tree had a csum mismatch.
> >> The strange part is that we shouldn't hit a csum error here: if some
> >> data were corrupted, the csum error should have been reported at read
> >> time, not at this point.
> >>
> >> This looks like something reported before.
> >>
> >>>
> >>> I pulled the drive out of the system and attempted the device deletion
> >>> again, but I'm getting the same error.
> >>>
> >>> Looking back through the logs of the previous scrubs, they showed the
> >>> file paths where errors were detected, so I deleted those files and
> >>> tried removing the failing drive again. It moved along some more. Now
> >>> it's down to only 13GiB of data remaining on the missing drive. Is
> >>> there any way to track the above errors to specific files so I can
> >>> delete them and finish the removal? Is there a better way to finish
> >>> the device deletion?
> >>
> >> As the message shows, it's the data reloc tree, which stores the newly
> >> relocated data.
> >> So it doesn't contain the file path.
> >>
> >>>
> >>> Scrubbing with the device missing just racks up uncorrectable errors
> >>> right off the bat, so it seemingly doesn't like missing a device - I
> >>> assume it's not actually doing anything useful, right?
> >>
> >> Which kernel are you using?
> >>
> >> IIRC older kernels don't retry all possible device combinations, so
> >> they can report uncorrectable errors even when the data should be
> >> correctable.
> >
> >> Another possible cause is the write hole, which reduces the tolerance
> >> of RAID6 stripe by stripe.
> >
> > Did you find a fix for
> >
> > https://www.spinics.net/lists/linux-btrfs/msg94634.html
> >
> > If that bug is happening in this case, it can abort a device delete
> > on raid5/6 due to corrupted data every few block groups.
>
> My bad, I always lose track of my to-do items.
>
> It looks like one possible cause indeed.
>
> Thanks for reminding me of that bug,
> Qu
>
> >
> >> You can also try replacing the missing device.
> >> In that case it doesn't go through the regular relocation path but the
> >> dev replace path (more like scrub), though you need physical access for
> >> that.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> I'm currently traveling and away from the system physically. Is there
> >>> any way to complete the device removal without reconnecting the
> >>> failing drive? Otherwise, I'll have a replacement drive in a couple of
> >>> weeks when I'm back and can try anything involving reconnecting the
> >>> drive.
> >>>
> >>> Thanks,
> >>> Martin
> >>>
> >>
> >
> >
>
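
For reference, a minimal sketch of the dev replace route Qu suggests
above, once a replacement disk is physically installed. The devid (7)
and the new device node (/dev/sdm) are placeholders, not values from
this thread; the real devid of the missing drive comes from
btrfs filesystem show.

# find the devid of the missing/failed drive
btrfs filesystem show /mount_point

# replace it by devid; -r avoids reading from the failing source when
# another good copy exists (only relevant if the old drive is still
# attached - a missing device is rebuilt from the remaining members)
btrfs replace start -r 7 /dev/sdm /mount_point

# the replace runs in the background; check progress with
btrfs replace status /mount_point

Unlike btrfs device delete, replace rebuilds the device contents
scrub-style instead of relocating extents, so it should avoid the data
reloc tree path where the csum failures above were reported.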
