Re: USB reset + raid6 = majority of files unreadable

On Wed, Feb 26, 2020 at 08:45:17AM -0800, Jonathan H wrote:
> On Tue, Feb 25, 2020 at 8:37 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
> > It's great that your metadata is safe.
> >
> > The biggest concern is no longer a concern now.
> 
> Glad to hear.
> 
> > More context would be welcomed.
> 
> Here's a string of uncorrectable errors detected by the scrub: http://ix.io/2cJM
> 
> Here is another attempt to read a file giving an I/O error: http://ix.io/2cJS
> The last two lines are produced when trying to read the file a second time.
> 
> Here's the state of the currently running scrub: http://ix.io/2cJU
> I had to cancel and resume the scrub to run `btrfs check` earlier, but
> otherwise it has been uninterrupted.
> 
> > Anyway, even with more context, it may still lack the needed info as
> > such csum failure message is rate limited.
> >
> > The mirror num 2 means it's the first rebuild try failed.
> >
> > Since only the first rebuild try failed, and there are some corrected
> > data read, it looks btrfs can still rebuild the data.
> >
> > Since you have already observed some EIO, it looks like write hole is
> > involved, screwing up the rebuild process.
> > But it's still very strange, as I'm expecting more mirror number other
> > than 2.
> > For your 6 disks with 1 bad disk, we still have 5 ways to rebuild data,
> > only showing mirror num 2 doesn't look correct to me.
> 
> I'm sort of curious why so many files have been affected. It seems
> like most of the file system has become unreadable, but I was under
> the impression that if the write hole occurred it would at least not
> damage too much data at once. Is that incorrect?

There are still unfixed bugs in btrfs parity RAID:

	https://www.spinics.net/lists/linux-btrfs/msg94594.html

If you have an array where some of the drives go offline for a while and
come back online, then you will see a lot of what looks like disk-level
corruption.  The unwritten blocks on drives that come back online are
treated as corrupted data (csums or transid fields don't match expected
values recorded on the other drives) and btrfs will attempt to repair
them.
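
Roughly, the check that makes those blocks look corrupt works like this
(a loose sketch of the idea only; field names are illustrative, not
btrfs's actual on-disk format):

	def block_looks_wrong(block, expected_csum, expected_transid):
	    if block.csum != expected_csum:
	        return True    # contents don't match the recorded checksum
	    if block.transid != expected_transid:
	        return True    # stale copy: this drive missed later writes
	    return False       # the version the rest of the array expects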

If you have parent transid verify failures, you are very likely to also
have correctable data errors made uncorrectable due to the above raid5/6
bug.  The two visible error cases are two different possible consequences
of the same low-level write loss events.  This means that at some point,
you had two disks offline for a while, but the other disks in the array
were still getting updates (array failures are never simple--multiple
modes of failure at different times during a single event are the norm).

If you have corrupted data on raid5/6 on btrfs, some of it won't come
back due to the data recovery corruption bug linked above.  Until this
bug is fixed, the only alternative is to restore the lost data from
backups.  Replacing the missing drive before fixing the correction bug
in the kernel will damage some more data, so data that is theoretically
readable now may be lost in the future as you replace drives; however,
losses should be 1% or less, so in-place raid5/6 recovery can still be
quicker than the full mkfs+restore that a comparable failure on raid0
would require.

Note that the raid5/6 write hole is a separate issue.  It's possible
for both issues to occur at the same time in a failing array, but the
correction bug will affect several orders of magnitude more data than
the write hole.

raid1 and raid1c3 have no such problems.  The parent transid verify errors
come from btrfs metadata, which in your filesystem is raid1c3, so they
would have been easily and correctly repaired as they were encountered.

> > BTW, since your free space cache is already corrupted, it's recommended
> > to clear the space cache.
> 
> It's strange to me that the free space cache is giving an error, since
> I cleared it previously and the most recent unmount was clean.

Free space cache is stored in data block groups and subject to all
of the above btrfs parity raid data integrity problems.  Do not use
space_cache=v1 with raid5 or raid6.  Better not to use space_cache=v1
at all, but v1 + raid5/6 is bad in ways that go beyond merely being slow
and unreliable.

Free space tree (space_cache=v2) is stored in btrfs metadata, so it will
work properly with raid1 or raid1c3 metadata.  Probably faster too,
and nothing can break v2 that doesn't also destroy the filesystem.

All that said, there are internal data integrity checks in free space
cache (v1), so it's possible that the only bad thing that happens here
is that you get a bunch of free space cache invalidation error messages.
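
If you do want to switch to v2, something like the following should work
(device and mount point are placeholders for your setup; space_cache=v2
needs a reasonably recent kernel):

	# with the filesystem unmounted, drop the old v1 cache
	btrfs check --clear-space-cache v1 /dev/sdX

	# then mount once with the v2 option so the free space tree is built
	mount -o space_cache=v2 /dev/sdX /mnt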

> > For now, since it looks like write hole is involved, the only way to
> > solve the problem is to remove all offending files (including a super
> > large file in root 5).
> >
> > You can use `btrfs inspect logical-resolve <bytenr> <mnt>` to see all
> > the involved files.
> >
> > The full <bytenr> are the bytenr shown in btrfs check --check-data-csum
> > output.
> 
> The strange thing is if you use `btrfs inspect logical-resolve` on all
> of the bytenrs mentioned in the check output, I get that all of the
> corruption is in the same file (see http://ix.io/2cJP), but this does
> not seem consistent with the uncorrectable csum errors the scrub is
> detecting.

The uncorrectable csum errors, and changes in errors over time, are
probably the correction bug in action.  Scrub also produces highly
questionable error statistics on raid5/6.  That may be a distinct bug
from the correction/corruption bug--it's hard to tell without fixing
all the current bugs and testing again.

Note: even if scrub is fully debugged, it is limited by the btrfs
on-disk format.  Corruption in data blocks with csums can always be
corrected up to raid5/6 drive-loss limits.  nodatasum files cannot be
reliably corrected.  Free space cache will be corrupted, but btrfs
should detect this and invalidate/rebuild the cache (but don't use
space_cache=v1 anyway).  Even in the best case, scrub will count some
csum errors against the wrong disks (though that best case is still
better than what scrub does now).

In a RAID5/6 stripe that is completely filled with data blocks
belonging to files that have csums, every data block in the stripe can
be individually tested against its csum.  If all the data blocks have
correct csums, but the parity block on disk does not match the computed
parity of the data, then we know that the parity block is corrupted
because we eliminated every other possible corrupted block.

If one of the data blocks in a RAID5/6 stripe has an incorrect csum then
we can try to recover the data using the parity block.  If that recovered
data fails the csum check too, then we know both data and parity blocks
are corrupt, since all other blocks in the raid stripe have good csums.
RAID6 has another parity block and some more combinations to try,
but eventually ends up either recovering the entire stripe or knowing
exactly which blocks were corrupt.
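
A rough sketch of that elimination logic for the raid5 case (simplified
Python, not the actual kernel code; csum_ok is an assumed helper that
checks a block against the csum recorded for its slot in the stripe):

	def xor_blocks(blocks):
	    """raid5 parity is the byte-wise XOR of the data blocks."""
	    out = bytearray(len(blocks[0]))
	    for blk in blocks:
	        for i, b in enumerate(blk):
	            out[i] ^= b
	    return bytes(out)

	def repair_stripe(data_blocks, parity, csum_ok):
	    bad = [i for i, blk in enumerate(data_blocks) if not csum_ok(i, blk)]

	    if not bad:
	        # Every data block matches its csum.  If parity still
	        # disagrees, the parity block is corrupt by elimination.
	        if parity != xor_blocks(data_blocks):
	            return "parity corrupt, rewrite parity"
	        return "stripe clean"

	    if len(bad) == 1:
	        # Rebuild the one bad block from parity plus the other data
	        # blocks, then verify the result against its csum.
	        i = bad[0]
	        others = [blk for j, blk in enumerate(data_blocks) if j != i]
	        rebuilt = xor_blocks(others + [parity])
	        if csum_ok(i, rebuilt):
	            return "data block %d recovered" % i
	        # The rebuilt data fails its csum too, so parity must also
	        # be bad: data and parity are both corrupt.
	        return "data and parity both corrupt, unrecoverable on raid5"

	    # raid5 has only one parity block; more than one bad data block
	    # is beyond its recovery limit.
	    return "too many corrupted blocks for raid5"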

If:

	- a data block in a RAID5/6 stripe has no csum (either because
	it is unoccupied, part of free space cache, or part of a
	nodatasum file),

	- all the data blocks that do have csums in the RAID stripe
	are OK (otherwise we would know that those blocks were the
	corrupted ones), and

	- a parity mismatch is detected in the RAID stripe, e.g. by scrub,

then btrfs cannot determine whether it is the parity block or one of the
no-csum data blocks that is corrupted.  The parity mismatch can be
detected, but any
of the drives without a csum on its data block could have contributed to
the mismatch, and there is no way to tell which no-csum data block(s) is
(are) correct.  This will cause scrub to place csum error counts on the
wrong disks, e.g. blaming the disk that happens to hold the parity block
for the raid stripe when one of the other disks is the one flipping bits.
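
Extending that sketch (same assumed helpers, plus a hypothetical has_csum
to say whether a slot has a csum at all), the ambiguous case is the last
branch here:

	def scrub_stripe(data_blocks, parity, has_csum, csum_ok):
	    no_csum = [i for i in range(len(data_blocks)) if not has_csum(i)]
	    bad = [i for i in range(len(data_blocks))
	           if has_csum(i) and not csum_ok(i, data_blocks[i])]

	    if bad:
	        # A csummed block failed: same rebuild-and-verify path as
	        # repair_stripe() above.
	        return "try to rebuild block(s) %s from parity" % bad

	    if parity == xor_blocks(data_blocks):
	        return "stripe clean"

	    if not no_csum:
	        # Every block is covered by a csum, so elimination works.
	        return "parity corrupt, rewrite parity"

	    # Parity mismatch, all csummed blocks look fine, but some blocks
	    # have nothing to check them against.  Any of those blocks, or
	    # the parity block itself, could be the corrupted one, and the
	    # error gets charged to whichever disk the code decides to blame.
	    return "mismatch, culprit ambiguous: blocks %s or parity" % no_csum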

None of this explains why scrub reports "read" errors on healthy drives
when there is data corruption on other drives.  That part is a bug; the
only question is whether it's the _same_ bug as the correction/corruption
bug, and that won't be known until at least one of the bugs is fixed.

> I've been calculating the offsets of the files mentioned in the
> relocation csum errors (by adding the block group and offset),
> resolving the files with `btrfs inspect logical-resolve` and deleting
> them. But it seems like the set of files I'm deleting is also totally
> unrelated to the set of files the scrub is detecting errors in. Given
> the frequency of relocation errors, I fear I will need to delete
> almost everything on the file system for the deletion to complete. I
> can't tell if I should expect these errors to be fixable since the
> relocation isn't making any attempt to correct them as far as I can
> tell.


