Re: USB reset + raid6 = majority of files unreadable

On Tue, Feb 25, 2020 at 8:37 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
> It's great that your metadata is safe.
>
> The biggest concern is no longer a concern now.

Glad to hear.

> More context would be welcomed.

Here's a string of uncorrectable errors detected by the scrub: http://ix.io/2cJM

Here is another attempt to read a file giving an I/O error: http://ix.io/2cJS
The last two lines are produced when trying to read the file a second time.

Here's the state of the currently running scrub: http://ix.io/2cJU
I had to cancel and resume the scrub to run `btrfs check` earlier, but
otherwise it has been uninterrupted.

> Anyway, even with more context, it may still lack the needed info as
> such csum failure message is rate limited.
>
> The mirror num 2 means it's the first rebuild try failed.
>
> Since only the first rebuild try failed, and there are some corrected
> data read, it looks btrfs can still rebuild the data.
>
> Since you have already observed some EIO, it looks like write hole is
> involved, screwing up the rebuild process.
> But it's still very strange, as I'm expecting more mirror number other
> than 2.
> For your 6 disks with 1 bad disk, we still have 5 ways to rebuild data,
> only showing mirror num 2 doesn't look correct to me.

I'm curious why so many files have been affected. Most of the file
system seems to be unreadable, but I was under the impression that a
write hole would at least not damage this much data at once. Is that
incorrect?

> BTW, since your free space cache is already corrupted, it's recommended
> to clear the space cache.

It's strange to me that the free space cache is giving an error, since
I cleared it previously and the most recent unmount was clean.
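For reference, this is roughly how I've been clearing it (a sketch with a placeholder device path /dev/sdX; the one-shot variant assumes the v1 space cache and an unmounted filesystem):

```shell
# One-shot clear of the v1 free space cache; the filesystem must be
# unmounted, and /dev/sdX stands in for one of the array's member devices.
btrfs check --clear-space-cache v1 /dev/sdX

# Alternatively, mount once with clear_cache to drop and rebuild the
# cache at mount time:
mount -o clear_cache /dev/sdX /mnt
```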

> For now, since it looks like write hole is involved, the only way to
> solve the problem is to remove all offending files (including a super
> large file in root 5).
>
> You can use `btrfs inspect logical-resolve <bytenr> <mnt>` to see all
> the involved files.
>
> The full <bytenr> are the bytenr shown in btrfs check --check-data-csum
> output.

The strange thing is that when I run `btrfs inspect logical-resolve`
on all of the bytenrs mentioned in the check output, it reports that
all of the corruption is in the same file (see http://ix.io/2cJP),
which does not seem consistent with the uncorrectable csum errors the
scrub is detecting.

I've been calculating the logical offsets of the files mentioned in
the relocation csum errors (by adding the block group start to the
offset), resolving the files with `btrfs inspect logical-resolve`, and
deleting them. But the set of files I'm deleting seems totally
unrelated to the set of files the scrub is detecting errors in. Given
the frequency of relocation errors, I fear I will need to delete
almost everything on the file system before the deletion completes. I
also can't tell whether I should expect these errors to be fixable,
since the relocation makes no attempt to correct them as far as I can
tell.
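For the record, the per-error arithmetic is just block-group start plus offset; a minimal sketch (the example numbers are made up, and the resolve step must of course run against the real mount):

```shell
# Hypothetical values taken from one relocation csum error line:
block_group=298844160   # start of the block group named in the error
offset=122880           # offset within that block group

# Logical address to feed to logical-resolve:
bytenr=$((block_group + offset))
echo "$bytenr"

# On the real filesystem (not runnable here):
# btrfs inspect logical-resolve -v "$bytenr" /mnt
```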
