Re: USB reset + raid6 = majority of files unreadable

On Wed, Feb 26, 2020 at 3:37 PM Steven Fosdick <stevenfosdick@xxxxxxxxx> wrote:

> It looks like the disc started to fail here:
>
> Jan 30 13:41:04 meije kernel: scsi_io_completion_action: 806 callbacks
> suppressed
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 CDB: Read(16)
> 88 00 00 00 00 00 a2 d3 3b 00 00 00 00 40 00 00
> Jan 30 13:41:04 meije kernel: print_req_error: 806 callbacks suppressed
> Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
> sector 2731752192 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 CDB: Write(16)
> 8a 00 00 00 00 00 a2 d3 3b 00 00 00 00 08 00 00
> Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
> sector 2731752192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
> Jan 30 13:41:04 meije kernel: btrfs_dev_stat_print_on_error: 732
> callbacks suppressed

Both read and write errors are being reported by the hardware. These
aren't the typical UNC errors, though, and I'm not sure what
DID_BAD_TARGET means. Some errors may also have been suppressed (note
the "callbacks suppressed" lines).

Write errors are generally fatal. Read errors that include the sector
LBA, Btrfs can fix if there's an extra copy (dup, raid1, raid56, etc.);
otherwise it may or may not be fatal, depending on what's missing and
what's affected by it being missing.

Btrfs might survive the write errors with metadata raid1c3, though. But
later you get more concerning messages...


> This goes on for pages and quite a few days, I can extract more if it
> is of interest.

Ahhh yeah. So for what it's worth, in an md driver backed world, this
drive would have been ejected (marked faulty) upon the first write
error. md does retries for reads, but on a write error it pretty much
considers the drive written off, which means the array is degraded.

As Btrfs doesn't yet have such a concept of faulty or ejected drives,
you kinda have to keep an eye on this yourself, and set up monitoring
so you know when the array goes degraded like this.
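
Something as simple as a cron job around 'btrfs device stats' will do;
a minimal sketch, assuming a btrfs-progs new enough to have --check
(the mountpoint and mail recipient are placeholders):

    #!/bin/sh
    # --check exits non-zero if any per-device error counter is non-zero
    if ! btrfs device stats --check /mnt/pool >/dev/null; then
        btrfs device stats /mnt/pool | mail -s "btrfs errors on /mnt/pool" root
    fi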

It's vaguely possible to get the array into a kind of split-brain
situation if two drives experience transient write errors. And in that
case, right now, there's no recovery; Btrfs just gets too confused.

You need to replace the bad drive, and do a scrub to fix things up.
And double check with 'btrfs fi us /mountpoint/' that all block groups
have one profile set, and that it's the correct one.
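
Roughly like this (untested sketch; device names and mountpoint are
placeholders, and if the bad drive has already dropped out you'd give
'btrfs replace' the missing device's devid from 'btrfs fi show' instead
of a device node):

    btrfs replace start -r /dev/sde /dev/sdf /mnt/pool   # -r: avoid reading the failing drive unless necessary
    btrfs replace status /mnt/pool
    btrfs scrub start -Bd /mnt/pool                       # -B foreground, -d per-device stats
    btrfs fi us /mnt/pool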


> then after mounting degraded, add a new device and attempt to remove
> the missing one:

That's not a good idea in my opinion... you really need to replace the
drive. Otherwise you're effectively doing a really expensive full
rebalance while degraded. That means nothing else can go wrong or
you're in much bigger trouble. In particular, it's really common for
there to be a mismatch between a physical drive's SCT ERC timeout and
the kernel's command timer. Mismatches can cause a lot of confusion
because when the kernel's command timer is reached, it resets the block
device that contains the "late" command, which then blows away that
drive's entire command queue.

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
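
To check both sides of that (sdX is a placeholder; not every drive
supports SCT ERC):

    smartctl -l scterc /dev/sdX                 # show the drive's current SCT ERC setting
    smartctl -l scterc,70,70 /dev/sdX           # 7.0s read/write error recovery, if supported
    cat /sys/block/sdX/device/timeout           # kernel command timer in seconds, default 30
    echo 180 > /sys/block/sdX/device/timeout    # raise it if the drive can't do SCT ERC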


> Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
> Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
> block group 10045992468480 flags data|raid5
> Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
> Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
> Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
> incompat feature flag for RAID56 (0x80)
> Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
> block group 10043844984832 flags data|raid5

I'm not sure what's going on here. This is a raid6 volume and the
raid56 flag is being cleared? That's unexpected, and I don't know why
you have raid5 block groups on a raid6 array.
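
Once the array is healthy again it's worth checking what profiles
actually exist and converting any stragglers back; a hedged sketch
(mountpoint is a placeholder):

    btrfs fi us /mnt/pool                                # shows how much data/metadata is in each profile
    btrfs balance start -dconvert=raid6,soft /mnt/pool   # 'soft' skips block groups that are already raid6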


> and at that point the device remove aborted with an I/O error.

OK, well, you didn't include that, so we have no idea if this I/O error
is about the same failed device or another device. If it's another
device, what can happen to the array gets more complicated. Hence why
timeout mismatches are important, and why it's important to have
monitoring so you aren't running a degraded array for three days.

>
> I did discover I could use balance with a filter to balance much of
> the onto the three working discs, away from the missing one but I also
> discovered that whenever the checksum error appears the space cache
> seems to get corrupted.  Any further balance attempt results in
> getting stuck in a loop.  Mounting with clear_cache resolves that.

This sounds like a bug. The default space cache is stored in data block
groups, which for you should be raid6 (with a missing device,
effectively raid5). But there's some kind of conversion happening
during the balance/missing-device removal, hence the clearing of the
raid56 flag per block group, and maybe this corruption is related to
that removal.
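
If the v1 cache keeps getting corrupted, a one-time mount with
clear_cache rebuilds it, and switching to the v2 free space tree keeps
that information in metadata rather than data; a sketch (device and
mountpoint are placeholders):

    mount -o clear_cache /dev/sda /mnt/pool      # one-time mount, rebuilds the v1 space cache
    mount -o space_cache=v2 /dev/sda /mnt/pool   # or switch to the free space tree (v2)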

-- 
Chris Murphy


