On 8 May 2019, at 3:15, Nikolay Borisov wrote:
> On 7.05.19 г. 20:27 ч., Josef Bacik wrote:
>> We have been seeing issues in production where a cleaner script will
>> end
>> up unlinking a bunch of files that have pending iputs. This means
>> they
>> will get their final iput's run at btrfs-cleaner time and thus are
>> not
>> throttled, which impacts the workload.
>>
>> Since we are unlinking these files we can just drop the delayed iput
>> at
>> unlink time. We are already holding a reference to the inode so this
>> will not be the final iput and thus is completely safe to do at this
>> point. Doing this means we are more likely to be doing the final
>> iput
>> at unlink time, and thus will get the IO charged to the caller and
>> get
>> throttled appropriately without affecting the main workload.
>>
>> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
>> ---
>> fs/btrfs/inode.c | 22 ++++++++++++++++++++++
>> 1 file changed, 22 insertions(+)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index b6d549c993f6..e58685b5d398 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -4009,6 +4009,28 @@ static int __btrfs_unlink_inode(struct
>> btrfs_trans_handle *trans,
>> ret = 0;
>> else if (ret)
>> btrfs_abort_transaction(trans, ret);
>> +
>> + /*
>> + * If we have a pending delayed iput we could end up with the final
>> iput
>> + * being run in btrfs-cleaner context. If we have enough of these
>> built
>> + * up we can end up burning a lot of time in btrfs-cleaner without
>> any
>> + * way to throttle the unlinks. Since we're currently holding a
>> ref on
>> + * the inode we can run the delayed iput here without any issues as
>> the
>> + * final iput won't be done until after we drop the ref we're
>> currently
>> + * holding.
>> + */
>
> FWIW the caller is not really holding an explicit reference, rather
> there is a reference held by the dentry which is going to be disposed
> of
> by vfs. Considering this I'd say this is a false claim. I.e "we" do
> not
> hold a reference.
It's impossible to call this function without a reference held on the
inode, kind of nit-picking on "we" vs "the vfs".
>
>> + if (!list_empty(&inode->delayed_iput)) {
>> + spin_lock(&fs_info->delayed_iput_lock);
>> + if (!list_empty(&inode->delayed_iput)) {
>> + list_del_init(&inode->delayed_iput);
>> + spin_unlock(&fs_info->delayed_iput_lock);
>> + iput(&inode->vfs_inode);
>> + if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
>> + wake_up(&fs_info->delayed_iputs_wait);
>> + } else {
>> + spin_unlock(&fs_info->delayed_iput_lock);
>> + }
>> + }
>
> OTOH this really feels like a hack and this stems from the fact that
> iput is rather rudimentary. Additionally you are essentially
> opencoding
> the body of btrfs_run_delayed_iputs. I was going to suggest to
> introduce
> a new helper factoring out the common code but it will get ugly due to
> the spin lock being dropped before doing the iput.
>
> But then I'm really starting to question the utility of delayed iputs.
> Presumably it was added to defer the expensive final iput in the
> cleaner
> context or avoid some deadlocks (but we don't know which exactly).
> Yet,
> here we are some time later where you are essentially saying "this
> mechanism is suboptimal because it's dumb and instead of improving
> things it's making them worse in certain cases, so let's unload it a
> bit
> by doing an iput here".
The final iput is pretty expensive, since it potentially does the full
truncate of an arbitrary sized file. There are a lot of contexts it
can't be called from, so the delayed iput code saves us from some
impossible situations. It originally came here:
commit 24bbcf0442ee04660a5a030efdbb6d03f1c275cb
Author: Yan, Zheng <zheng.yan@xxxxxxxxxx>
Date: Thu Nov 12 09:36:34 2009 +0000
Btrfs: Add delayed iput
But we've expanded usage to solve a few different deadlocks.
-chris