On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote: > > > On 10.04.19 г. 22:56 ч., Josef Bacik wrote: > > With the per-inode block rsvs we started refilling the reserve based on > > the calculated size of the outstanding csum bytes and extents for the > > inode, including the amount we were adding with the new operation. > > > > However generic/224 exposed a problem with this approach. With 1000 > > files all writing at the same time we ended up with a bunch of bytes > > being reserved but unusable. > > > > When you write to a file we reserve space for the csum leaves for those > > bytes, the number of extent items required to cover those bytes, and a > > single credit for updating the inode at ordered extent finish for that > > range of bytes. This is held until the ordered extent finishes and we > > release all of the reserved space. > > > > If a second write comes in at this point we would add a single > > reservation for the new outstanding extent and however many reservations > > for the csum leaves. > > If a second write comes we won't do anything different than the first > i.e calculate the number of extent items + csums bytes required, add > them to the block reservation and call btrfs_inode_rsv_refill which > should refill the delta necessary for the 2nd write. > > > At this point we find the delta of how much we > > have reserved and how much outstanding size this is and attempt to > > reserve this delta. If the first write finishes it will not release any > > space, because the space it had reserved for the initial write is still > > needed for the second write. However some space would have been used, > > Each and every reservation is responsible for itself how come the first > one will know some of its space is required for the second, hence it > won't be released? > Write 1 comes in, sets the size to 3mib, reserves 3mib. Write 2 comes in, sets the size to 5 mib, attempts to reserve 2mib. - can't reserve because there's not enough space, starts flushing. Write 1 finishes, used 1mib of it's 3mib reservation Write 1 sets the size to 3mib We still have 2mib in reserves, which is less than 3mib, so no bytes are released to the space info. Now multiply this by 1000, you have 1000 files with 2mib sitting in their reservations, but they need 2mib, and there's no space to be squeezed from the rest of the fs, so they start to ENOSPC out one by one. With the new thing we get this Write 1 comes in, reserves 3mib, sets the size to 3mib. Write 2 comes in, attempts to reserve 3mib. - can't reserve because there's not enough space, starts flushing. Write 1 finishes, used 1mib of it's 3mib reservation Write 1 sets the size to 0mib Write 1 releases 2mib to the space_info, allowing the next waiter to claim 2mib of whatever it's reservation was. Multiply this by 1000. As we get smaller and smaller amounts of metadata space to work with, we get less and less writes happening in parallel, because we only have X inodes worth of reservations to be in flight at any given time. Thanks, Josef
