On Fri, Mar 27, 2020 at 11:29:52AM +0100, Holger Hoffstätte wrote:
> On 3/26/20 11:21 PM, Hans van Kranenburg wrote:
> > 2) Metadata "cluster allocator" write behavior:
> >
> >     *empty_cluster = SZ_64K    # nossd
> >     *empty_cluster = SZ_2M     # ssd
> >
> > This happens in extent-tree.c.
>
> 2M used to be a common erase block size on SSDs. Or maybe it's just
> a nice round number.. ¯\_(ツ)_/¯

As a side effect, 2M write clusters close the write hole on raid5/6 if
you have an array that is a power-of-2 data disks wide (so that the 2M
cluster is an exact multiple of the full stripe width). This capability
is wasted when it's only available through the 'ssd' mount option.

The behavior could be quite useful if it were properly integrated with
the raid5/6 code: set *empty_cluster to the block group data width, make
sure it's aligned to raid5/6 stripe boundaries, and use it for both data
and metadata.

It works by effectively making partially filled clusters read-only. If
we can guarantee that clusters are aligned to raid5/6 data/parity block
boundaries, then btrfs can't allocate new data in partially filled
raid5/6 stripes, so it won't break the parity relation and won't have a
write hole.

> cheers,
> Holger
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08635bae0b4ceb08fe4c156a11c83baec397d36d
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba8a9d07954397f0645cf62bcc1ef536e8e7ba24
