On 1/6/20 1:13 AM, Qu Wenruo wrote:
> Although btrfs_calc_avail_data_space() tries to estimate how many data
> chunks it can allocate, the estimation is far from perfect:
>
> - Metadata over-commit is not considered at all
> - Chunk allocation doesn't take RAID5/6 into consideration
>
> Although the current per-profile available space code can't handle
> metadata over-commit by itself, the virtual chunk infrastructure can be
> re-used to address the above problems.
>
> This patch changes btrfs_calc_avail_data_space() to do the following:
>
> - Do metadata virtual chunk allocation first
>   This addresses the over-commit behavior. If the current metadata
>   chunks already have enough free space, we can skip this step
>   completely.
>
> - Allocate as many data virtual chunks as possible
>   Just like what we did in the per-profile available space estimation.
>   Here we only need to calculate one profile, since statfs() is a
>   relatively cold path.
>
> Now statfs() should be able to report a near-perfect estimation of the
> available data space, and can handle RAID5/6 better.
>
> [BENCHMARK]
> For the performance difference, here is the benchmark:
>
> Disk layout:
> - devid 1: 1G
> - devid 2: 2G
> - devid 3: 3G
> - devid 4: 4G
> - devid 5: 5G
> metadata: RAID1
> data:     RAID5
>
> This layout should be the worst case for RAID5, as the estimation can
> go from a 5-disk RAID5 down to a 2-disk RAID5 with unusable space left
> over.
>
> Then use ftrace to trace the execution time of btrfs_statfs() after
> allocating a 1G data chunk. Both runs have 12 samples.
>
>                              avg        std
> Patched:                   17.59 us    7.04
> Without patch (v5.5-rc2):  14.98 us    6.16
>
> When the fs is cold, there is a small performance penalty for this
> particular case, as we need several more iterations to calculate the
> correct RAID5 data space. But it's still pretty good, and won't block
> the chunk allocator for any observable time.
>
> When the fs is hot, the performance bottleneck is chunk_mutex, whose
> most common and longest holder would be __btrfs_chunk_alloc(). In that
> case, we may sleep much longer, as __btrfs_chunk_alloc() can trigger IO.
>
> Since the new implementation is not observably slower than the old one,
> and won't cause a meaningful delay for the chunk allocator, it should
> be more or less OK.
>
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
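[For reference, a rough sketch of the two-step estimation described in the
changelog. This is not the actual patch; the helpers
metadata_has_enough_free_space(), alloc_virtual_metadata_chunk(),
alloc_virtual_data_chunk() and release_virtual_chunks() are hypothetical
stand-ins for the virtual chunk infrastructure it refers to.]

/*
 * Sketch only: estimate available data space by first accounting for
 * metadata over-commit, then greedily placing virtual data chunks.
 */
static u64 estimate_avail_data_space(struct btrfs_fs_info *fs_info,
				     u64 metadata_needed)
{
	u64 avail = 0;
	u64 chunk_size;

	/*
	 * Step 1: handle metadata over-commit.  If existing metadata
	 * chunks already have enough free space, skip the virtual
	 * metadata allocation entirely.
	 */
	if (!metadata_has_enough_free_space(fs_info, metadata_needed))
		alloc_virtual_metadata_chunk(fs_info, metadata_needed);

	/*
	 * Step 2: allocate virtual data chunks until no more fit,
	 * summing up their usable size.  Only the data profile needs
	 * to be calculated, and statfs() is a relatively cold path,
	 * so the extra iterations are acceptable.
	 */
	while ((chunk_size = alloc_virtual_data_chunk(fs_info)) > 0)
		avail += chunk_size;

	release_virtual_chunks(fs_info);
	return avail;
}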
Sorry, I should have been more specific. Taking the chunk_mutex here means all the people running statfs at the same time are going to be serialized. Can we just do what overcommit does and take the already calculated free space instead of doing the calculation again?

Thanks,
Josef
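
[For illustration, a rough sketch of the lockless alternative Josef is
suggesting: read the counters the allocator already maintains, the way the
over-commit logic does, instead of taking chunk_mutex and re-running the
virtual chunk allocation. btrfs_find_space_info() and
btrfs_space_info_used() are real btrfs helpers of that era;
precomputed_per_profile_avail() is a hypothetical accessor for the
per-profile available space the patch series keeps up to date. This is a
sketch under those assumptions, not the actual implementation.]

/* Sketch only: derive statfs() data availability from existing counters. */
static u64 statfs_avail_from_space_info(struct btrfs_fs_info *fs_info)
{
	struct btrfs_space_info *sinfo;
	u64 avail;

	sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
	if (!sinfo)
		return 0;

	spin_lock(&sinfo->lock);
	/* Free space inside already allocated data chunks ... */
	avail = sinfo->total_bytes - btrfs_space_info_used(sinfo, true);
	/* ... plus the precomputed estimate for still-unallocated space. */
	avail += precomputed_per_profile_avail(fs_info, BTRFS_BLOCK_GROUP_DATA);
	spin_unlock(&sinfo->lock);

	return avail;
}

This only serializes on the space_info spinlock, so concurrent statfs()
callers would no longer queue up behind chunk allocation.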
