Re: USB reset + raid6 = majority of files unreadable

On Wed, 4 Mar 2020 at 06:32, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:

> a. the write hole doesn't happen with raid1, and your metadata is
> raid1 so any file system corruption is not related to the write hole

That makes sense and I had no problem moving the metadata blocks from
the failed disk onto the working ones.

> c. to actually be affected by the write hole problem, the stripe with
> mismatching parity strip must have a missing data strip such as a bad
> sector making up one of the strips, or a failed device. If neither of
> those are the case, it's not the write hole, it's something else.

Normally there would not have been any power failures as the machine
is protected by a UPS and shuts down automatically when the battery
runs low.  Certainly there were none between the filesystem being
created and the disk failure.  The only exception is that after the
disk failure, and after the device remove had failed, I had to turn
the power off because the machine would not shut down.  Since the
device remove had moved very little data before failing, I used
balance to move data away from the failed disk, i.e. to restore
redundancy and mitigate the risk of a further disk failure.  That is
how I came to migrate the metadata ahead of the data and then use a
range filter to migrate the data in stages.  That is also where I
discovered that once migrating a range of block groups had failed,
all subsequent attempts to use balance resulted in an infinite loop,
which is what led to the need to turn the power off.  I was
subsequently able to avoid a repeat of that by applying a patch from
this list that makes the balance cancellable in more places, and I
also discovered that clearing the space cache avoided the loop
anyway.
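
For anyone who wants to do the same, the staged balance was of
roughly this shape (the device id, mount point and range values below
are placeholders, not the ones I actually used):

    # Move the (raid1) metadata block groups off the failing device
    # first; devid 3 and /mnt/pool are illustrative only.
    btrfs balance start -mdevid=3 /mnt/pool

    # Then migrate the data in stages with a range filter, one slice
    # of the virtual address space at a time (here the first 1 TiB).
    btrfs balance start -ddevid=3,vrange=0..1099511627776 /mnt/pool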

> d. before there's a device or sector failure, a scrub following a
> crash or power loss will correct the problem resulting from the write
> hole.

Worth knowing for the future.
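
If I understand correctly, that would just be an ordinary scrub once
the array is mounted writable again, something like this (the mount
point is a placeholder):

    # Run a scrub in the foreground (-B) with per-device stats (-d).
    btrfs scrub start -Bd /mnt/pool

    # Or, for a backgrounded scrub, check progress later with:
    btrfs scrub status /mnt/pool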

> It can't fix them when the file system is mounted read only.

I had mounted it r/w by then.

> The most obvious case of corruption is a checksum mismatch (the
> on-the-fly checksum for a node/leaf/block compared to the recorded
> checksum). Btrfs always reports this.

And it did, but only for the relocation tree that was being built as
part of the balance.  I am sure you or Qu said in a previous e-mail
that this is a temporary structure, built only during that operation,
so it should not have been corrupted by earlier problems.  As no
media errors were logged either, that must surely mean that either
there is a bug in constructing the tree, or corrupt data was copied
into it from elsewhere and only detected after the copy rather than
before.

> So that leaves the less obvious cases of corruption where some
> metadata or data is corrupt in memory, and a valid checksum is
> computed on already corrupt data/metadata, and then written to disk.

But if the relocation tree is constructed afresh during the balance
operation rather than being a permanent structure, then the chance of
flipped bits in memory corrupting it on successive attempts is surely
very small indeed.

> At least Btrfs gave you a chance. But that's the gotcha
> of bad RAM or other sources of bit flips in the storage stack.

I am not complaining about checksums; of course it is better to know
when your data has been corrupted.  I just want btrfs to be as robust
as possible.

> From six days ago, your dmesg:
>
> Sep 27 15:16:08 meije kernel:       Not tainted 5.1.10-arch1-1-ARCH #1

Sorry for the confusion from having thread-jacked.  My kernel history is:

Sep 04 13:53:53 meije kernel: Linux version 5.1.10-arch1-1-ARCH
Oct 14 14:31:56 meije kernel: Linux version 5.3.6-arch1-1-ARCH
Nov 29 12:25:21 meije kernel: Linux version 5.4.0-arch1-1
Dec 30 17:30:49 meije kernel: Linux version 5.4.6-arch3-1
Jan 04 15:54:22 meije kernel: Linux version 5.4.7-arch1-1
Jan 09 15:43:56 meije kernel: Linux version 5.4.8-arch1-1
Jan 22 10:15:30 meije kernel: Linux version 5.4.13-arch1-1
Jan 26 17:23:36 meije kernel: Linux version 5.4.15-arch1-1
Jan 29 22:56:42 meije kernel: Linux version 5.5.0-arch1-1
Feb 10 16:28:52 meije kernel: Linux version 5.5.2-arch2-2
Feb 14 20:23:08 meije kernel: Linux version 5.5.3-arch1-1
Feb 16 16:06:43 meije kernel: Linux version 5.5.3-arch1-1
Feb 18 08:36:49 meije kernel: Linux version 5.5.4-arch1-1
Feb 26 17:11:08 meije kernel: Linux version 5.5.6-arch1-1
Mar 03 20:45:28 meije kernel: Linux version 5.5.7-arch1-1

> Actually what I should have asked is whether you ever ran 5.2 - 5.2.14
> kernels because that series had a known corruption bug in it, fixed in
> 5.2.15

No, I skipped the whole of 5.2 because I saw messages about corruption
on this list.

> I think btrfs filesystem usage doesn't completely support raid56 is all
> it's saying.
>
> 'btrfs fi df' and 'btrfs fi show' should show things correctly

They do, and the info for individual devices in the output of btrfs fi
usage also looks completely believable.  It's only the summary at the
top that's obviously incorrect.
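
In case it helps anyone cross-checking, the commands I am comparing
are simply these (the mount point is a placeholder):

    # These two agree with each other and with reality:
    btrfs filesystem df /mnt/pool
    btrfs filesystem show /mnt/pool

    # The per-device section here looks right too; only the summary
    # block at the top mis-reports the raid56 allocation:
    btrfs filesystem usage /mnt/pool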

> I don't understand the question. The device replace command includes
> 'device add' and 'device remove' in one step, it just lacks the
> implied resize that happens with add and remove.

When I did the add and remove separately, the add succeeded and the
remove failed (initially), having moved very little data.  If that
were to happen with those same steps within a replace, would it
simply stop where it found the problem, leaving the new device added
and the old one not yet removed, or would it try to back out the
whole operation?
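
For context, the single-step alternative I am asking about would have
looked something like this (device paths, device id and mount point
are placeholders):

    # Replace the failing device with the new one in one operation;
    # -r avoids reading from the failing source device where another
    # good copy exists.
    btrfs replace start -r /dev/sdd /dev/sde /mnt/pool
    btrfs replace status /mnt/pool

    # replace does not do the implied resize, so if the new device is
    # larger the extra space has to be claimed explicitly:
    btrfs filesystem resize 5:max /mnt/pool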

> The free space cache isn't that important. It can be discarded and
> reconstructed. It's an optimization.

Of course, but if it had not been for Jonathan mentioning that btrfs
check had found his space cache to be corrupt, I would never have
hypothesised that the same might be happening to me, as there were no
messages in the log about the cache, nor anything that looked like an
error, only an infinite loop.  I think what was happening was that
moving the block group was failing with an "out of space" error,
which the loop simply retried in the hope that some space had become
available in the meantime, and that the out-of-space error was in
turn caused by the corrupt space cache.
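
For anyone who hits the same loop, dropping the cache so that it gets
rebuilt can be done in either of these ways (device and mount point
are placeholders):

    # Offline, with the filesystem unmounted:
    btrfs check --clear-space-cache v1 /dev/sdb

    # Or once at mount time; the v1 cache is then rebuilt as block
    # groups are next written:
    mount -o clear_cache /dev/sdb /mnt/pool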

So in summary, I believe I was suffering from a situation in which,
for a very small number of data blocks, something went wrong in the
reconstruction process, or some associated metadata was bad, such
that:

1. Building the relocation tree failed with a checksum error.
2. The free space cache became corrupt.

Regards,
Steve.


