Re: USB reset + raid6 = majority of files unreadable

On Sat, 29 Feb 2020 at 06:31, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:

> s/might/should

I do think it is worth considering the possibility that the "write
hole", because it is well documented, gets blamed for every case where
data proves to be unrecoverable, when some of those cases may actually
be due to a bug or bugs.  From what I have read, the write hole arises
from uncertainty over which of several discs actually got written to,
so when the copies don't match there is no way to know which one is
right.  In the case of a disc failure, though, surely the right copy
is the one that doesn't involve the failed disc?  Or is there
something else I don't understand?
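
To make the question concrete, here is a toy sketch (plain XOR parity,
nothing to do with btrfs's actual raid56 code) of why a single
known-missing device looks unambiguous to me:

# Toy illustration of RAID5-style reconstruction with one known-missing
# block per stripe.  Generic XOR parity only, not btrfs's raid56 layout.

from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def reconstruct_missing(stripe, missing_index):
    """Rebuild the block at missing_index from the surviving blocks.

    stripe is the data blocks plus one parity block; XORing every
    surviving block together yields the missing one, whichever it is.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)

# Example: three data blocks and their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Lose any single block and the survivors still determine it exactly.
assert reconstruct_missing(stripe, 1) == b"BBBB"

If exactly one block of a stripe is missing, the survivors determine
it completely, which is why I don't see where the ambiguity comes from
in the plain device-failure case.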

> I'm curious why you had to use force, but yes that should check all of
> them. If this is a mounted file system, there's 'btrfs scrub' for this
> purpose though too and it can be set to run read-only on a read-only
> mounted file system.

In the case of 'btrfs check' the filesystem was mounted r/o but I had
things reading from it so I didn't want to unmount it completely.  It
requires --force to work on a mounted filesystem even if the mount is
r/o.

I did try running a scrub but had to abandon it as it wasn't proving
very useful.  It wasn't fixing the errors and wasn't providing any
messages that would help diagnose or fix them some other way - it only
seems to provide a count of the errors it didn't fix.  That seems to
be a general pattern: there appear to be plenty of ways an overall
'failed' status can be returned to userspace, usually without anything
being logged.  That obviously makes sense if the request was to do
something stupid, but if the error return is because corruption has
been found, would it not be better to log an error?
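
As far as I can tell, about all userspace can do afterwards is capture
the exit status and the per-device counters, along these lines (a
rough sketch only; exit-code meanings and the stats output vary
between btrfs-progs versions, and the mount point is illustrative):

# Rough sketch: run a foreground scrub and dump whatever error
# information userspace can see afterwards.

import subprocess
import sys

MOUNTPOINT = "/mnt/array"  # illustrative path

def run(cmd):
    """Run a command, echo its output, and return its exit status."""
    print("+", " ".join(cmd))
    proc = subprocess.run(cmd, capture_output=True, text=True)
    print(proc.stdout, end="")
    if proc.stderr:
        print(proc.stderr, end="", file=sys.stderr)
    return proc.returncode

# -B keeps scrub in the foreground so the exit status reflects the
# result; -d asks for per-device statistics in the final report.
scrub_rc = run(["btrfs", "scrub", "start", "-Bd", MOUNTPOINT])
print("scrub exit status:", scrub_rc)

# Per-device error counters kept by the kernel (read/write/flush/
# corruption/generation).  Non-zero values at least say which device
# to blame, even if they don't name the affected files.
run(["btrfs", "device", "stats", MOUNTPOINT])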

> That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
> is EOL but I don't think that's related to the usage info. Still, tons
> of btrfs bugs fixed between 5.1 and 5.5...
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1
>
> Including raid56 specific fixes:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1

This was in response to my posting some dodgy output from 'btrfs fi
usage'.  My output was from btrfs-progs v5.4 which, when I checked
yesterday, seemed to be the latest.  I am also running Linux 5.5.7.
The kernel may have been slightly older when the disc failed but would
still have been 5.5.x.

Since my previous e-mail I have managed to get a 'btrfs device remove
missing' to work by reading all the files from userspace, deleting
those that returned an I/O error and restoring them from backup (a
sketch of that read pass is further down).  Even after that, the
summary information is still wacky:

WARNING: RAID56 detected, not implemented
Overall:
    Device size:   16.37TiB
    Device allocated:   30.06GiB
    Device unallocated:   16.34TiB
    Device missing:      0.00B
    Used:   25.40GiB
    Free (estimated):      0.00B (min: 8.00EiB)
    Data ratio:       0.00
    Metadata ratio:       2.00
    Global reserve: 512.00MiB (used: 0.00B)

Is the clue in the warning message?  It looks like it is simply not
counting any of the RAID5 block groups, and presumably with Data ratio
coming out as 0.00 the free-space estimate cannot be computed sensibly
either.
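
For reference, the read pass I mentioned above amounted to something
like this (a simplified sketch; the path is illustrative and the
deletions were of course followed by a restore from backup):

# Walk the filesystem, try to read every file, and collect the paths
# that fail with an I/O error so they can be deleted and restored.

import errno
import os

MOUNTPOINT = "/mnt/array"  # illustrative path

unreadable = []

for dirpath, dirnames, filenames in os.walk(MOUNTPOINT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with open(path, "rb") as f:
                while f.read(1 << 20):  # read in 1 MiB chunks to EOF
                    pass
        except OSError as exc:
            if exc.errno == errno.EIO:
                unreadable.append(path)

for path in unreadable:
    print(path)
    # os.remove(path)  # deletion left commented out in this sketch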

Point taken about device replace.  What would device replace do if the
remove step failed in the same way that device remove has been failing
for me recently?

I'm a little disappointed we didn't get to the bottom of the bug that
was causing the free space cache to become corrupted when a balance
operation failed, but when I asked what I could do to help I got no
reply to that part of my message (not just from you, from anyone on
the list).


