Re: BTRFS state on kernel 5.2

On Mon, Sep 2, 2019 at 11:21 AM waxhead <waxhead@xxxxxxxxxxxxxx> wrote:
> 2. DEFRAG: (status page)
> The status page marks defrag as "mostly ok" for stability and "ok" for
> performance. While I understand that extents get unshared, I don't see
> how this will affect stability. Performance (as in space efficiency), on
> the other hand, is more likely to be affected. Also, it is not (perfectly)
> clear what the difference in consequence is between using the autodefrag
> mount option and "btrfs filesystem defrag". Can someone please consider
> rewriting this?

It needs "OK - see Gotchas" because shared extents becoming unshared
could be hugely problematic if you're not expecting it.
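
A rough way to see the effect before committing to it (the path is just
an example):

    # 'Set shared' shows how much is currently shared with snapshots
    btrfs filesystem du -s /mnt/subvol
    # defragment rewrites extents, which unshares them
    btrfs filesystem defragment -r /mnt/subvol
    # afterwards the shared figure drops and exclusive usage grows
    btrfs filesystem du -s /mnt/subvol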

> 3. SCRUB + RAID56: (status page)
> The status page says it is mostly ok for both stability and performance.
> It is not stated what the problem is with stability; does this have to
> do with the write hole?

I think the concerns need to be split out for metadata and data. The
main gotcha is that after a crash you need to do a scrub, and there are
no partial scrubs.

In the case of data, at least there's still a warning on bad
reconstruction (from a corrupt strip), because the data csums won't
match.
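
In other words the safe habit, as I understand it, is a full scrub
after any unclean shutdown on raid56 (the mount point is an example):

    # kick off a scrub of the whole filesystem
    btrfs scrub start /mnt
    # watch progress and the error counters
    btrfs scrub status /mnt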


> 5. DEVICE REPLACE: (Using_Btrfs_with_Multiple_Devices page)
> It is not clear what to do to recover from a device failure on BTRFS.
> If a device is partly working then you can run the replace functionality
> and hopefully you're good to go afterwards. Ok, fine; if this however
> does not work, or you have a completely failed device, it is a different
> story. My understanding of it is:
> If not enough free space (or devices) is available to restore redundancy,
> you first need to add a new device, and then you need to A: first run a
> metadata balance (to ensure that the filesystem structures are redundant)
> and then B: run a data balance to restore redundancy for your data.
> Are there any filters that can be applied to only restore chunks which
> have a missing mirror / stripe member?

It is a bit boolean in that it depends on several variables, and is
another reason why a btrfsd service, to help do smarter things that
depend on policy decisions, would be a very useful future addition. But
sorta what you're getting at is that we're not sure what the medium-
and long-term plans are.
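
For what it's worth, a minimal sketch of the recovery sequence as I
understand it today, assuming a completely dead device (device names
and the devid are illustrative):

    # mount degraded if the dead device is already gone
    mount -o degraded /dev/sdb /mnt
    # preferred: replace the missing device by devid
    # (devid comes from 'btrfs filesystem show /mnt')
    btrfs replace start 3 /dev/sdd /mnt
    # fallback: add a new device, then drop the missing one, which
    # relocates the chunks that lived on it
    btrfs device add /dev/sdd /mnt
    btrfs device remove missing /mnt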


>
> 6. RAID56 (status page)
> RAID56 has had the write hole problem for a long time now, but it
> is not well explained what the consequence of it is for data -
> especially if you have metadata stored in raid1/10.
> If you encounter a power loss / kernel panic during write - what will
> actually happen?
> Will a fresh file simply be missing or corrupted (as in partly written)?
> If you overwrite/append to an existing file - what is the consequence
> then? Will you end up with... A: the old data, or B: corrupted or zeroed
> data?! This is not made clear in the comment and it would be great if
> we, the BTRFS users, could understand what the risk of hitting the write
> hole actually is.

If you do an immediate scrub, any corruption should be detected and
fixed by reconstruction, before there are any device failures. If a
device fails before the scrub, it's possible data is corrupt, but the
last time I tested this I got EIO with csum mismatches for affected
files, not corrupt data returned to user space. Worse is if metadata is
affected: nothing can be done if a device has failed and there's
corruption in raid5 metadata.
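
That's part of why the common recommendation is to keep metadata out of
raid5/6 entirely, e.g. raid1 metadata with raid5 data (devices and the
mount point are just examples):

    mkfs.btrfs -m raid1 -d raid5 /dev/sdb /dev/sdc /dev/sdd
    # or convert metadata on an existing filesystem
    btrfs balance start -mconvert=raid1 /mnt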

I'm not entirely clear on the COW guarantees between metadata and
data, even in the idealized case where hardware doesn't lie, does what
the file system expects, and all devices complete commits at the same
time. And then, when any of those things isn't true, what the
consequences are. It probably needs its own separate grid. But if
someone understood it clearly, someone else could make the
explanation pretty.


> 7. QUOTAS, QGROUPS (status page)
> Again marked as "mostly ok" for stability. Is there any risk of
> data loss or irrecoverable failure? If not, I think it should be marked
> as stable - the only note seems to be performance related.

Pretty sure all the performance issues are supposed to be fixed by
kernel 5.2 or 5.3. But that probably needs testing to confirm it.
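
Anyone wanting to help confirm can just turn quotas on and retest their
usual snapshot-heavy workload (the mount point is an example):

    btrfs quota enable /mnt
    # force a full accounting pass, wait for it, then list the qgroups
    btrfs quota rescan -w /mnt
    btrfs qgroup show /mnt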

>
> 8. PER SUBVOLUME REDUNDANCY LEVEL:
> What is the state / plan for per-subvolume (or object-level) redundancy
> levels - is that on the agenda somewhere?

No one has started that work as far as I'm aware.

>
> 9. ADDING EXISTING FILESYSTEM TO THE POOL?:
> Is it somehow possible, or will it ever be possible, to add an existing
> BTRFS filesystem to a pool?

I haven't heard anything like this, so I suspect no one is working on
it. A Btrfs subvolume is just a file tree; it's not a self-contained
file system. All subvolumes share the extent, csum, chunk and dev
trees. So this would need some way to import it. Not sure.
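
The closest thing today, as far as I know, is to move the data rather
than the filesystem: send/receive the subvolumes into the pool, then
reuse the old device (paths and devices are examples):

    btrfs subvolume snapshot -r /old/home /old/home.ro
    btrfs send /old/home.ro | btrfs receive /pool/
    # once the old filesystem is emptied, its device can join the pool
    wipefs -a /dev/sdc
    btrfs device add /dev/sdc /pool
    btrfs balance start /pool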

> 10. PURE BTRFS BOOTLOADER?
> This probably belongs somewhere else, but has someone considered the
> very idea of a pure BTRFS bootloader which only supports booting up a
> BTRFS filesystem in as failsafe a way as possible? It is a pain to
> ensure that grub is installed on all devices and updated as you
> add/remove devices from the pool, and a "butterboot"-loader would be
> fantastic.

Bootloaders are f'n hard. I don't see the advantage of starting
something from scratch that's this narrowly purposed.

Realistically, as ugly as it is, we're better off with every drive
having a large EFI system partition (or a plain boot volume if BIOS),
plus a daemon that keeps them all in sync, and a simple bootloader like
sd-boot to locate, load, and execute the kernel, letting kernel code
worry about all the complex Btrfs device discovery and how to handle
degradedness.
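
The sync part doesn't need to be fancy; a hypothetical cron/systemd job
could be as small as this (the mount points are made up):

    # mirror the primary ESP to the secondary ones
    for esp in /efi2 /efi3; do
        rsync -a --delete /efi/ "$esp"/
    done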

By the way, GRUB 2.04 should have Btrfs raid5/6 support. And I'm
guessing it supports degraded operation similar to mdadm raid5/6,
which GRUB has supported for a long time.
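
Until then, for BIOS installs at least, keeping GRUB on every member
disk is just a loop (device names are examples):

    for disk in /dev/sda /dev/sdb /dev/sdc; do
        grub-install "$disk"
    done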

> 12. SPACE CACHE: (Manpage/btrfs(5) page):
> I have been using space cache v2 for a long time. No issues (that I know
> about) yet. That page states that the safe default space cache is v1.
> What is the current recommended default?

v2 has been expected to become the default for a long time now. It'd be
useful if someone could benchmark v2 versus no space cache: runtime
performance with various loads, mount time, and memory usage.
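
Switching (and benchmarking) is just a mount option; the v2 cache is
created on the first mount that asks for it (device and mount point are
examples):

    # converts to the free space tree (v2) on first use
    mount -o space_cache=v2 /dev/sdb /mnt
    # for the "no cache at all" comparison point
    mount -o nospace_cache /dev/sdb /mnt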

> 13. NODATACOW:
> As far as I can remember there were some issues regarding NOCOW
> files/directories on the mailing list a while ago. I can't find any
> issues related to nocow on the wiki (I might not have searched enough),
> but I don't think they are fixed, so maybe someone can verify that.
> And by the way... are NOCOW files still not checksummed? If yes, are
> there plans to add that? (It would be especially nice to know whether a
> nocow file is correct or not.)

I think we're better off optimizing COW and getting rid of nocow. It's
really a workaround for things becoming slow due to massive
fragmentation. There's a bug (or unexpected behavior) where NOCOW
files can become compressed when defragmented while the compress mount
option is used. There's a fix that prevents this, I think in 5.2 or
5.3.
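
For reference, nocow is set per file or inherited from a directory via
chattr, and yes, those files get no checksums (the path is just an
example):

    # new files created in this directory inherit nocow
    chattr +C /srv/vm-images
    lsattr -d /srv/vm-images
    # note: +C only takes full effect on new or empty files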


-- 
Chris Murphy


