On 3/16/19 7:07 AM, Andrei Borzenkov wrote:
> 15.03.2019 23:31, Hans van Kranenburg wrote:
> ...
>>>
>>>>> If so, shouldn't it be really balancing (spreading) the data among all
>>>>> the drives to use all the IOPS capacity, even when the raid5 redundancy
>>>>> constraint is currently satisfied?
>>>
>>> btrfs divides the disks into chunks first, then spreads the data across
>>> the chunks. The chunk allocation behavior spreads chunks across all the
>>> disks. When you are adding a disk to raid5, you have to redistribute all
>>> the old data across all the disks to get balanced IOPS and space usage,
>>> hence the full balance requirement.
>>>
>>> If you don't do a full balance, it will eventually allocate data on
>>> all disks, but it will run out of space on sdb, sdc, and sde first,
>>> and then be unable to use the remaining 2TB+ on sdd.
>>
>> Also, if you have a lot of empty space in the current allocations, btrfs
>> balance has the tendency to first start packing everything together
>> before allocating new (4 disk wide) block groups.
>>
>> This is annoying, because it can result in moving the same data multiple
>> times during balance (into empty space of another existing block group,
>> and then when that one has its turn again, etc).
>>
>> So you want to get rid of empty space in existing block groups as soon
>> as possible. btrfs-balance-least-used (also an example from python-btrfs)
>> can do this, by processing them in order of most empty first.
>>
>
> But if I understand the above correctly it will still attempt to move
> data in next most empty chunks first.

Balance feeds data back to the filesystem as new writes, so it will try
filling up the existing block groups with the lowest vaddr first (when
running in nossd/ssd mode). Newly added block groups (chunks) always get
a vaddr that is higher than everything else, so they are chosen last.
That means the lower numbered block groups keep getting packed with data
while balance keeps emptying and removing the ones it is working on.

> Is there any way to force
> allocation of new chunks? Or, better, force usage of chunks with given
> stripe width as balance target?

Nope. Or, the other way around, blacklisting everything that you know you
want to get rid of as a target. Currently that's not possible. It would
require knobs that influence the extent allocator (e.g. prefer writing
into the chunk with the highest num_stripes first).

Conversion has a similar problem. For every chunk that gets converted,
you get a new empty one with the new target profile, and it's quite
possible that you first rewrite data a few times (depending on how
compacted everything already was) into the existing old-profile chunks
before actually starting to use the new profile.

Having a lot of empty space in existing block groups is something that
mainly happens after removing a lot of data. In that case, if you care,
compacting everything together with the least amount of data movement is
why I added the balance-least-used algorithm; a rough sketch of the idea
is at the end of this mail.

Since we're not using the "cluster" allocator for data any more (the
ssd-option related change in 4.14), normal operation with equal amounts
of data being removed and added all the time does not result in
overallocation any more.

> This thread actually made me wonder - is there any guarantee (or even
> tentative promise) about RAID stripe width from btrfs at all? Is it
> possible that RAID5 degrades to mirror by itself due to unfortunate
> space distribution?

For RAID5, the minimum stripe width is two disks.
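
A quick way to see what is actually there is to walk the chunk tree and
look at the number of stripes of each chunk. A sketch with python-btrfs
follows; I'm writing the names (FileSystem, chunks(), vaddr, length,
num_stripes) down from memory here, so double-check them against the
examples in the python-btrfs repo before using it:

  # Print the stripe count of every chunk, to spot raid5 chunks that
  # ended up narrower than the number of disks in the filesystem.
  import sys
  import btrfs

  fs = btrfs.FileSystem(sys.argv[1])  # mountpoint, e.g. /mnt
  for chunk in fs.chunks():
      print("chunk vaddr {} length {} num_stripes {}".format(
          chunk.vaddr, chunk.length, chunk.num_stripes))
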
So yes, if you add two disks and don't forcibly rewrite all your data,
it will happily start adding two-disk RAID5 block groups once the other
disks are full.
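
And since balance-least-used came up above, the idea boils down to
something like the sketch below. This is only an illustration; the
maintained version is the btrfs-balance-least-used example that ships
with python-btrfs, and as above the python-btrfs names (chunks(),
block_group(), .type, .used) are written down from memory, so verify
them before relying on this:

  # Compact data by balancing the emptiest data block groups first, so
  # the least amount of data has to be moved per block group freed up.
  import subprocess
  import sys
  import btrfs

  BTRFS_BLOCK_GROUP_DATA = 1 << 0  # data bit in the chunk type flags

  mountpoint = sys.argv[1]
  fs = btrfs.FileSystem(mountpoint)

  # Collect all data block groups together with their usage counters.
  block_groups = []
  for chunk in fs.chunks():
      if not chunk.type & BTRFS_BLOCK_GROUP_DATA:
          continue
      block_groups.append(fs.block_group(chunk.vaddr, chunk.length))

  # Most empty first.
  block_groups.sort(key=lambda bg: bg.used)

  for bg in block_groups:
      # The vrange filter makes balance only touch block groups that
      # overlap this virtual address range, i.e. exactly this one.
      vrange = "-dvrange={}..{}".format(bg.vaddr, bg.vaddr + bg.length)
      subprocess.run(["btrfs", "balance", "start", vrange, mountpoint],
                     check=True)

Hans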
