Re: Deleting a failing drive from RAID6 fails

On Thu, Dec 26, 2019 at 01:03:47PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/12/26 3:25 AM, Martin wrote:
> > Hi,
> > 
> > I have a drive that started failing (uncorrectable errors and lots of
> > reallocated sectors) in a RAID6 array (12 devices, 70TB total, 30TB of
> > data). btrfs scrub started showing corrected errors as well (seemingly
> > no big deal since it's RAID6). I decided to remove the drive from the
> > array with:
> >     btrfs device delete /dev/sdg /mount_point
> > 
> > After about 20 hours and having rebalanced 90% of the data off the
> > drive, the operation failed with an I/O error. dmesg was showing csum
> > errors:
> >     BTRFS warning (device sdf): csum failed root -9 ino 2526 off 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
> >     BTRFS warning (device sdf): csum failed root -9 ino 2526 off 10673852416 csum 0x8941f998 expected csum 0x8a9a53fe mirror 2
> >     . . .
> 
> This means some data in the data reloc tree had a csum mismatch.
> The strange part is that we shouldn't hit a csum error here: if some
> data were corrupted, the csum error should have been reported at read
> time, rather than at this point.
> 
> This looks like something reported before.
> 
> > 
> > I pulled the drive out of the system and attempted the device deletion
> > again, but got the same error.
> > 
> > Looking back through the logs to the previous scrubs, it showed the
> > file paths where errors were detected, so I deleted those files, and
> > tried removing the failing drive again. It moved along some more. Now
> > it's down to only 13GiB of data remaining on the missing drive. Is
> > there any way to track the above errors to specific files so I can
> > delete them and finish the removal? Is there a better way to finish
> > the device deletion?
> 
> As the message shows, it's the data reloc tree, which stores the newly
> relocated data.
> So it doesn't contain the file path.
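
For what it's worth, the fields in those warnings can be pulled apart
mechanically. Below is a minimal Python sketch, assuming the message
format shown in the dmesg lines quoted above. The function and variable
names are made up for illustration and are not part of btrfs-progs. It
flags warnings whose root is -9, which is BTRFS_DATA_RELOC_TREE_OBJECTID
in the kernel headers, so the ino in those lines is the relocation inode
and not a user-visible file.

```python
import re

# BTRFS_DATA_RELOC_TREE_OBJECTID is -9 in the kernel headers.
DATA_RELOC_ROOT = -9

# Pattern for the warning format quoted in this thread (an assumption:
# other kernel versions may word the message differently).
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>\w+)\): csum failed "
    r"root (?P<root>-?\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>0x[0-9a-f]+) expected csum (?P<expected>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)


def parse_csum_warnings(text):
    """Return one dict per csum warning found in text."""
    # Log lines pasted into mail are often wrapped, so collapse all
    # whitespace (including newlines) before matching.
    flat = " ".join(text.split())
    warnings = []
    for m in CSUM_RE.finditer(flat):
        warnings.append({
            "dev": m["dev"],
            "root": int(m["root"]),
            "ino": int(m["ino"]),
            "off": int(m["off"]),
            "csum": m["csum"],
            "expected": m["expected"],
            "mirror": int(m["mirror"]),
        })
    return warnings


def in_data_reloc_tree(warning):
    """True if the warning refers to the data reloc tree (root -9)."""
    return warning["root"] == DATA_RELOC_ROOT


# The two warnings from the original report, wrapped as they were mailed.
SAMPLE = """\
BTRFS warning (device sdf): csum failed root -9 ino 2526 off
10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
BTRFS warning (device sdf): csum failed root -9 ino 2526 off
10673852416 csum 0x8941f998 expected csum 0x8a9a53fe mirror 2
"""

if __name__ == "__main__":
    for w in parse_csum_warnings(SAMPLE):
        where = "data reloc tree" if in_data_reloc_tree(w) else "fs tree"
        print(w["root"], w["ino"], w["off"], w["mirror"], where)
```

On the two lines from the report, both warnings land in root -9 and
their offsets are 4096 bytes apart, so they are adjacent blocks of the
same relocation inode.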
> 
> > 
> > Scrubbing with the device missing just racks up uncorrectable errors
> > right off the bat, so it seemingly doesn't like missing a device - I
> > assume it's not actually doing anything useful, right?
> 
> Which kernel are you using?
> 
> IIRC older kernels don't retry all possible device combinations, so
> they can report uncorrectable errors even when the data should be
> correctable.

> Another possible cause is the write hole, which reduces the fault
> tolerance of RAID6, stripe by stripe.

Did you find a fix for this bug?

	https://www.spinics.net/lists/linux-btrfs/msg94634.html

If that bug is happening in this case, it can abort a device delete
on raid5/6 due to corrupted data every few block groups.

> You can also try replacing the missing device.
> In that case, it doesn't go through the regular relocation path but the
> dev-replace path (which is more like scrub). You would need physical
> access for that, though.
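
A rough sketch of what the replace route could look like, written as a
small Python wrapper that only prints the commands unless told
otherwise. The devid 7 and the target /dev/sdX are placeholders, not
values from this thread; /mount_point is the path used in the original
report. The devid of the missing device should be identifiable from
`btrfs filesystem show`. The helper names are hypothetical.

```python
import shlex
import subprocess


def replace_start_argv(devid, target, mount_point):
    """Build argv for: btrfs replace start <devid> <targetdev> <path>.

    A missing device has no /dev node, so it is named by its devid.
    """
    if devid < 1:
        raise ValueError("btrfs devids start at 1")
    return ["btrfs", "replace", "start", str(devid), target, mount_point]


def replace_status_argv(mount_point):
    """Build argv for: btrfs replace status <path>."""
    return ["btrfs", "replace", "status", mount_point]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder values: adjust devid and target to the real system.
    run(replace_start_argv(7, "/dev/sdX", "/mount_point"))
    run(replace_status_argv("/mount_point"))
```

Keeping the default at dry-run means nothing touches the filesystem
until the printed commands have been checked against the real devid and
target device.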
> 
> Thanks,
> Qu
> 
> > 
> > I'm currently traveling and away from the system physically. Is there
> > any way to complete the device removal without reconnecting the
> > failing drive? Otherwise, I'll have a replacement drive in a couple of
> > weeks when I'm back, and can try anything involving reconnecting the
> > drive.
> > 
> > Thanks,
> > Martin
> > 
> 




