Re: Status of RAID5/6

On 2018-04-02 11:18, Goffredo Baroncelli wrote:
On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
[...]
It is possible to combine writes from a single transaction into full
RMW stripes, but this *does* have an impact on fragmentation in btrfs.
Any partially-filled stripe is effectively read-only and the space within
it is inaccessible until all data within the stripe is overwritten,
deleted, or relocated by balance.

btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
update, but that has a significant write magnification effect (and before
kernel 4.14, non-trivial CPU load as well).

btrfs could also just allocate the full stripe to an extent, but emit
only extent ref items for the blocks that are in use.  No fragmentation
but lots of extra disk space used.  Also doesn't quite work the same
way for metadata pages.

If btrfs adopted the ZFS approach, the extent allocator and all higher
layers of the filesystem would have to know about--and skip over--the
parity blocks embedded inside extents.  Making this change would mean
that some btrfs RAID profiles start interacting with stuff like balance
and compression which they currently do not.  It would create a new
block group type and require an incompatible on-disk format change for
both reads and writes.

I thought that a possible solution would be to create BGs with different numbers of data disks. E.g., suppose we have a RAID6 system with 6 disks, of which 2 are parity disks; we would allocate 3 BGs:

BG #1: 1 data disk, 2 parity disks
BG #2: 2 data disks, 2 parity disks
BG #3: 4 data disks, 2 parity disks

For simplicity, assume the disk-stripe length is 4 KB.

So if you have a write with a length of 4 KB, it should be placed in BG #1; if you have a write with a length of 3*4 KB, the first 8 KB should be placed in BG #2, and the remaining 4 KB in BG #1.
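The placement rule above can be sketched as a greedy split that fills full stripes in the widest block group first. This is only an illustration of the proposal, not btrfs code: the function name `place_write`, its parameters, and the widest-first order are assumptions, and the write length is assumed to be a multiple of the 4 KB disk-stripe length.

```python
def place_write(length_kb, stripe_kb=4, data_disk_counts=(4, 2, 1)):
    """Split a write into full stripes, widest block group first.

    Hypothetical helper: returns a list of (data_disks, kb_placed) pairs,
    where data_disks identifies the block group by its number of data disks.
    """
    if length_kb <= 0 or length_kb % stripe_kb:
        raise ValueError("length must be a positive multiple of stripe_kb")

    placement = []
    remaining = length_kb
    for data_disks in sorted(data_disk_counts, reverse=True):
        full_stripe_kb = data_disks * stripe_kb   # data held by one full stripe
        stripes = remaining // full_stripe_kb     # whole stripes that fit here
        if stripes:
            placed = stripes * full_stripe_kb
            placement.append((data_disks, placed))
            remaining -= placed

    if remaining:
        # Only reachable when no 1-data-disk block group is available.
        raise ValueError("write cannot be covered by full stripes")
    return placement


if __name__ == "__main__":
    print(place_write(4))    # 4 KB write
    print(place_write(12))   # 3*4 KB write
```

With the three block groups from the example, a 12 KB write is split into 8 KB for the 2-data-disk BG and 4 KB for the 1-data-disk BG, so no stripe is left partially filled.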

This would avoid wasting space, even if fragmentation would increase (but does fragmentation matter with modern solid-state disks?).
Yes, fragmentation _does_ matter even with storage devices that have a uniform seek latency (such as SSDs), because less fragmentation means fewer I/O requests have to be made to load the same amount of data. Contrary to popular belief, uniform seek-time devices still perform better doing purely sequential I/O than random I/O, because larger requests can be made; the difference is just small enough that it only matters if you're constantly using all the disk bandwidth.
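To make the request-count argument concrete, here is a rough sketch that counts the I/O requests needed to read a list of extents when physically adjacent extents can be merged into one request. The helper `count_requests` and the 128 KB maximum request size are assumptions for illustration only; real limits depend on the device and the block layer.

```python
def count_requests(extents, max_request_kb=128):
    """Count I/O requests needed to read extents given in file order.

    extents: list of (start_kb, length_kb) physical locations.
    Physically contiguous extents are merged into one run, and each run
    is split into requests of at most max_request_kb (assumed limit).
    """
    requests = 0
    run_kb = 0          # length of the current contiguous run
    prev_end = None     # physical end of the previous extent
    for start, length in extents:
        if prev_end is not None and start == prev_end:
            run_kb += length                          # extends the current run
        else:
            requests += -(-run_kb // max_request_kb)  # ceil-divide finished run
            run_kb = length
        prev_end = start + length
    requests += -(-run_kb // max_request_kb)          # flush the last run
    return requests


if __name__ == "__main__":
    contiguous = [(i * 4, 4) for i in range(256)]   # 1 MiB, back to back
    fragmented = [(i * 8, 4) for i in range(256)]   # 1 MiB, 4 KB gaps between
    print(count_requests(contiguous), count_requests(fragmented))
```

Under these assumptions the same 1 MiB of data takes 8 requests when laid out contiguously and 256 when every 4 KB block is separated from the next.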

Also, you're still going to be wasting space; it's just that less space will be wasted, and it will be wasted at the chunk level instead of the block level. That opens up a whole new set of issues to deal with, most significantly that it becomes functionally impossible, without brute-force search techniques, to determine when you will hit the common case of -ENOSPC due to being unable to allocate a new chunk.

From time to time, a re-balance should be performed to empty BG #1 and BG #2. Otherwise a new BG would have to be allocated.

The cost should be comparable to logging/journaling (any data shorter than a full stripe has to be written twice); the implementation should be quite easy, because btrfs already supports BGs with different sets of disks.
