Re: [PATCH][RFC] btrfs: introduce rescue=onlyfs

Hi!

On 7/1/20 9:53 PM, Josef Bacik wrote:
> On 7/1/20 3:43 PM, waxhead wrote:
>>
>>
>> Josef Bacik wrote:
>>> One of the things that came up consistently in talking with Fedora about
>>> switching to btrfs as default is that btrfs is particularly vulnerable
>>> to metadata corruption.  If any of the core global roots are corrupted,
>>> the fs is unmountable and fsck can't usually do anything for you without
>>> some special options.
>>>
>>> Qu addressed this sort of with rescue=skipbg, but that's poorly named as
>>> what it really does is just allow you to operate without an extent root.
>>> However there are a lot of other roots, and I'd rather not have to do
>>>
>>> mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
>>>
>>> Instead take his original idea and modify it so it just works for
>>> everything.  Turn it into rescue=onlyfs, and then any major root we fail
>>> to read just gets left empty and we carry on.
>>>
>>> Obviously if the fs roots are screwed then the user is in trouble, but
>>> otherwise this makes it much easier to pull stuff off the disk without
>>> needing our special rescue tools.  I tested this with my TEST_DEV that
>>> had a bunch of data on it by corrupting the csum tree and then reading
>>> files off the disk.
>>>
>>> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
>>> ---
>>
>> Just an idea inspired from RAID1c3 and RAID1c4, how about introducing DUP2 
>> and/or even DUP3 making multiple copies of the metadata to increase the chance 
>> to recover metadata on even a single storage device?
> 
> Because this only works on HDD.  On SSDs concurrent writes will often be 
> shunted to the same erase block, and if the whole erase block goes, so do all of 
> your copies.  This is why we default to 'single' for SSDs.
> 
> The one thing I _do_ want to do is make better use of the backup roots.  Right 
> now we always free the pinned extents once the transaction commits,

For other readers, who might think something actively needs to be
freed: it is more or less the opposite. AIUI, the in-memory yolo
blacklist gets emptied. The space was already officially freed, but it
stayed blacklisted for new writes until the transaction committed.

This distinction matters, because it explains why the in-memory
blacklist also disappears when you reboot, and why that's fine (currently).
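
To make that concrete, here is a toy sketch of the behaviour as I
described it above. It is not btrfs code: the class, the method names
and the use of plain integers as extents are all made up for
illustration.

```python
class ToyAllocator:
    """Toy model only: extents are plain integers, not btrfs structures."""

    def __init__(self, free):
        self.free = set(free)   # space the metadata already considers free
        self.pinned = set()     # in-memory only: freed, but not yet reusable

    def free_extent(self, extent):
        # Freed in the running transaction: it counts as free space, but it
        # is pinned so the last committed tree is not overwritten.
        self.free.add(extent)
        self.pinned.add(extent)

    def alloc(self):
        # New writes may only land on space that is free and not pinned.
        candidates = sorted(self.free - self.pinned)
        if not candidates:
            raise RuntimeError("no allocatable space")
        extent = candidates[0]
        self.free.remove(extent)
        return extent

    def commit(self):
        # Nothing is "freed" here; the in-memory blacklist is just emptied.
        self.pinned.clear()


def demo():
    a = ToyAllocator(free=[10])
    a.free_extent(5)
    first = a.alloc()    # extent 5 is free but pinned, so 10 is handed out
    a.commit()
    second = a.alloc()   # after the commit, extent 5 becomes usable
    return first, second
```

A reboot corresponds to constructing a fresh `ToyAllocator` from the
free set alone: the pinned set is simply gone, and nothing leaks.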

> which makes 
> the backup roots useless as we're likely to re-use those blocks.  With Nikolay's 
> patches we can now async drop pinned extents, which I've implemented here for an 
> unrelated issue.  We could take that work and simply hold pinned extents

It could be called 'cow light'.

https://imgproc.airliners.net/photos/airliners/9/3/2/0693239.jpg?v=v40

Jokes aside, that would be great, of course, and much better than giving
up and removing all the backup roots related tooling because it's just
problematic right now.

About the 'simply' part of the story: I was thinking about this while
writing the reply to "Buggy disk firmware (fsync/FUA) and power-loss
btrfs survability", and I ended up at just dumping the mappings to disk
(since they need to survive a reboot to prevent leaking space). And
that means space cache v1 again, but this time for pinned extents...

So, I'm curious to hear the out-of-the-box idea for solving this
without introducing a problem like space cache v1 again and also
without using proper metadata trees. :)

> for 
> several transactions so that old backup roots and all of their nodes don't get 
> over-written until they cycle out.  This would go a long way towards making us 
> more resilient under metadata corruption conditions.  Thanks,
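
For the sake of discussion, a minimal sketch of "hold pinned extents for
several transactions" could look like the following. Everything in it is
hypothetical (the class name, the `hold` parameter, integers as
extents); it only shows the bookkeeping, not how btrfs would do it.

```python
from collections import deque


class DelayedUnpin:
    """Toy model: keep pinned extents blocked for `hold` more commits."""

    def __init__(self, hold):
        self.hold = hold         # hypothetical knob: commits to keep pins for
        self.current = set()     # pinned in the running transaction
        self.history = deque()   # pinned sets of already committed transactions

    def pin(self, extent):
        self.current.add(extent)

    def commit(self):
        # Retire the running transaction's pins into the history, then
        # release only the sets that have aged past the hold window.
        self.history.append(self.current)
        self.current = set()
        released = set()
        while len(self.history) > self.hold:
            released |= self.history.popleft()
        return released

    def is_pinned(self, extent):
        return extent in self.current or any(extent in s for s in self.history)
```

Note that this keeps everything in memory, so it does nothing for the
"needs to survive a reboot" question raised earlier in this mail.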

K


