Re: [PATCH 0/5] Deal with a few ENOSPC corner cases

On 9.03.20 at 22:23, Josef Bacik wrote:
> Nikolay has been digging into a failure of generic/320 on ppc64.  This has
> shaken out a variety of issues, and he's done a good job at running all of the
> weird corners down and then testing my ideas to get them all fixed.  This is the
> series that has survived the longest, so we're declaring victory.
> 
> First there is the global reserve stealing logic.  The way unlink works is it
> attempts to start a transaction with a normal reservation amount, and if this
> fails with ENOSPC we fall back to stealing from the global reserve.  This is
> problematic for the same reason as previous iterations of the ENOSPC
> handling: the thundering herd.  We get a bunch of failures all at once,
> everybody tries to allocate from the global reserve, some win and some lose,
> and we get an ENOSPC.
> 
> To fix this we need to integrate this logic into the normal ENOSPC
> infrastructure.  The idea is simple: we add a new flushing state that indicates
> we are allowed to steal from the global reserve.  We still go through all of the
> normal flushing work, and at the moment we begin to fail all the tickets, we try
> to satisfy any tickets that are allowed to steal by stealing from the global
> reserve.  If this works we start the flushing system over again, just like we
> would with a normal ticket satisfaction.  This serializes our global reserve
> stealing, so we don't have the thundering herd problem.
> 
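The ordering described in the paragraph above can be sketched as a toy model. This is not the btrfs code: `Ticket`, `SpaceInfo`, the `can_steal` flag and the byte counts are all hypothetical names invented for illustration, and "flushing" is reduced to granting tickets from a fixed pool. It only shows the serialization idea: stealing happens at the single point where tickets would otherwise be failed, and a successful steal restarts flushing.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    bytes_needed: int
    can_steal: bool = False      # hypothetical stand-in for the new flushing state
    done: bool = False
    error: Optional[str] = None


class SpaceInfo:
    """Toy model of a ticket queue; not the kernel data structure."""

    def __init__(self, free_bytes, global_rsv_bytes):
        self.free = free_bytes
        self.global_rsv = global_rsv_bytes
        self.tickets = deque()

    def flush(self):
        # Stand-in for the normal flushing work: grant tickets in FIFO
        # order for as long as the head of the queue fits in free space.
        while self.tickets and self.tickets[0].bytes_needed <= self.free:
            ticket = self.tickets.popleft()
            self.free -= ticket.bytes_needed
            ticket.done = True

    def try_steal(self):
        # Reached only when flushing can make no more progress. Satisfy
        # one steal-capable ticket from the global reserve, if one fits.
        for ticket in self.tickets:
            if ticket.can_steal and ticket.bytes_needed <= self.global_rsv:
                self.global_rsv -= ticket.bytes_needed
                self.tickets.remove(ticket)
                ticket.done = True
                return True
        return False

    def run(self):
        while self.tickets:
            self.flush()
            if not self.tickets:
                break
            if self.try_steal():
                continue         # a steal worked: start flushing over again
            # Nothing left to try: fail every remaining ticket.
            for ticket in self.tickets:
                ticket.error = "ENOSPC"
            self.tickets.clear()
```

Because a single loop decides who may steal, waiters never race each other for the global reserve; the ones that do not fit fail in one place instead of some winning and some losing at random.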
> This isn't the only problem, however.  Nikolay also noticed that we would
> sometimes have huge amounts of space in the trans block rsv and we would ENOSPC
> out.  This is because the may_commit_transaction() logic didn't take into
> account the space that would be reclaimed by all of the outstanding trans
> handles being required to stop in order to commit the transaction.
> 
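A rough way to picture the accounting gap described above, as a sketch only: the function, its parameters and the numbers are hypothetical, and the real may_commit_transaction() considers more than this. The point is that space held in the trans block rsv is released once the outstanding handles stop for the commit, so leaving it out of the estimate can make a worthwhile commit look pointless.

```python
def commit_would_help(bytes_needed, pinned_bytes, trans_rsv_reserved,
                      count_trans_rsv=True):
    """Hypothetical model: decide whether committing frees enough space.

    pinned_bytes       -- space that becomes free once the commit completes
    trans_rsv_reserved -- space held in the trans block rsv by open handles
    count_trans_rsv    -- False models the old behaviour, which ignored it
    """
    reclaimable = pinned_bytes
    if count_trans_rsv:
        # Handles must stop before the commit, returning their reservations.
        reclaimable += trans_rsv_reserved
    return reclaimable >= bytes_needed
```

With `bytes_needed=100`, `pinned_bytes=10` and `trans_rsv_reserved=200`, the model returns False when the reserved space is ignored and True when it is counted, which is the "huge amounts of space in the trans block rsv, yet ENOSPC" situation in miniature.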
> Another corner here was that priority tickets could race in and make
> may_commit_transaction() think that it had no work left to do, and thus not
> commit the transaction.
> 
> Those fixes all address the failures that Nikolay was seeing.  The last two
> patches are just cleanups around how we handle priority tickets.  We shouldn't
> even be serializing priority tickets behind normal tickets, only behind other
> priority tickets.  And finally there would be a small window where priority
> tickets would fail out if there were multiple priority tickets and one of them
> failed.  This is addressed by the previous patch.
> 
> Nikolay has put these through many iterations of generic/320, and so far it
> hasn't failed.  Thanks,
> 
> Josef
> 

This patchset causes regressions on the following tests:

btrfs/132 btrfs/170 btrfs/177 generic/102 generic/103 generic/170
generic/172 generic/275 generic/299 generic/464 generic/551

Please don't merge for now.


