On 17.08.2018 00:07, Liu Bo wrote:
> Btrfs's btree locking has two modes, spinning mode and blocking mode.
> While searching the btree, locks are always acquired in spinning mode
> and then converted to blocking mode if necessary; in some hot paths we
> may switch the locking back to spinning mode via btrfs_clear_path_blocking().
>
> When acquiring locks, both readers and writers need to wait for any
> blocking readers and writers to finish before doing read_lock()/write_lock().
>
> The problem is that btrfs_clear_path_blocking() first needs to switch
> the nodes in the path to blocking mode (via btrfs_set_path_blocking())
> to keep lockdep happy, before doing its actual job of clearing the
> blocking state.
>
> When switching to blocking mode from spinning mode, it consists of
>
> step 1) bumping up blocking readers counter and
> step 2) read_unlock()/write_unlock(),
>
> and this has caused a serious ping-pong effect when there are a great
> number of concurrent readers/writers, as waiters are woken up only to
> go back to sleep immediately.
I think this ping-pong needs to be explained in a bit more detail.
Looking at the code, it seems the issue happens when we have a path
locked in spinning mode (via btrfs_tree_lock) - in this case
blocking_readers/blocking_writers == 0 and the write_lock is held. We
could then have multiple requests to lock the tree (in this case the
root), and since the current holder is spinning, btrfs_tree_lock
callers will block on the write_lock. Subsequently, when the holder
switches to blocking, it calls
btrfs_set_path_blocking->btrfs_set_lock_blocking_rw, which of course
increments the blocking readers/writers count and calls the unlock, at
which point a "thundering herd" problem occurs on the lock. Am I correct?
Furthermore, I think the problem occurs not in btrfs_clear_path_blocking
but rather in the initial call to btrfs_set_path_blocking.
The way I see it, the following sequence of operations occurs:
1. Call btrfs_set_path_blocking: increments
blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
the lock types for the nodes in the path are
BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING.
2. Some time later we call btrfs_clear_path_blocking,
which also calls btrfs_set_path_blocking; however, this time the
function doesn't do anything, because the lock types in the path struct
are already blocking and neither of the two conditions inside
btrfs_set_lock_blocking_rw will match, hence that code won't be executed.
Did I miss something?
>
> 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> files creation script,
>
> MNT=/mnt/btrfs
> mkfs.btrfs -f /dev/sdf
> mount /dev/sdf $MNT
> time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \
> -d $MNT/0 -d $MNT/1 \
> -d $MNT/2 -d $MNT/3 \
> -d $MNT/4 -d $MNT/5 \
> -d $MNT/6 -d $MNT/7 \
> -d $MNT/8 -d $MNT/9 \
> -d $MNT/10 -d $MNT/11 \
> -d $MNT/12 -d $MNT/13 \
> -d $MNT/14 -d $MNT/15
>
> w/o patch:
> real 2m27.307s
> user 0m12.839s
> sys 13m42.831s
>
> w/ patch:
> real 1m2.273s
> user 0m15.802s
> sys 8m16.495s
>
> 2) dbench also proves the improvement,
> dbench -t 120 -D /mnt/btrfs 16
>
> w/o patch:
> Throughput 158.363 MB/sec
>
> w/ patch:
> Throughput 449.52 MB/sec
>
> 3) xfstests didn't show any additional failures.
>
> One thing to note is that callers may set leave_spinning to have all
> nodes in the path stay in spinning mode, which means the callers are
> prepared not to sleep before releasing the path. But it causes no
> problems if they simply don't sleep while holding blocking locks;
> IOW, we can just get rid of leave_spinning.
>
> Signed-off-by: Liu Bo <bo.liu@xxxxxxxxxxxxxxxxx>
> ---
> fs/btrfs/ctree.c | 57 ++++--------------------------------------------
> fs/btrfs/ctree.h | 2 --
> fs/btrfs/delayed-inode.c | 3 ---
> 3 files changed, 4 insertions(+), 58 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index d436fb4c002e..8b31caa60b0a 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
> }
> }
>
> -/*
> - * reset all the locked nodes in the patch to spinning locks.
> - *
> - * held is used to keep lockdep happy, when lockdep is enabled
> - * we set held to a blocking lock before we go around and
> - * retake all the spinlocks in the path. You can safely use NULL
> - * for held
> - */
> -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
> - struct extent_buffer *held, int held_rw)
> -{
> - int i;
> -
> - if (held) {
> - btrfs_set_lock_blocking_rw(held, held_rw);
> - if (held_rw == BTRFS_WRITE_LOCK)
> - held_rw = BTRFS_WRITE_LOCK_BLOCKING;
> - else if (held_rw == BTRFS_READ_LOCK)
> - held_rw = BTRFS_READ_LOCK_BLOCKING;
> - }
> - btrfs_set_path_blocking(p);
> -
> - for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
> - if (p->nodes[i] && p->locks[i]) {
> - btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
> - if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
> - p->locks[i] = BTRFS_WRITE_LOCK;
> - else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
> - p->locks[i] = BTRFS_READ_LOCK;
> - }
> - }
> -
> - if (held)
> - btrfs_clear_lock_blocking_rw(held, held_rw);
> -}
> -
> /* this also releases the path */
> void btrfs_free_path(struct btrfs_path *p)
> {
> @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
> }
> }
>
> - btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
> btrfs_tree_read_unlock_blocking(eb);
> free_extent_buffer(eb);
>
> @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> btrfs_set_path_blocking(p);
> reada_for_balance(fs_info, p, level);
> sret = split_node(trans, root, p, level);
> - btrfs_clear_path_blocking(p, NULL, 0);
>
> BUG_ON(sret > 0);
> if (sret) {
> @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> btrfs_set_path_blocking(p);
> reada_for_balance(fs_info, p, level);
> sret = balance_level(trans, root, p, level);
> - btrfs_clear_path_blocking(p, NULL, 0);
>
> if (sret) {
> ret = sret;
> @@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> }
> cow_done:
> p->nodes[level] = b;
> - btrfs_clear_path_blocking(p, NULL, 0);
> + /*
> + * Leave path with blocking locks to avoid a massive
> + * number of lock context switches, this is done on purpose.
> + */
>
> /*
> * we have a lock on b and as long as we aren't changing
> @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> if (!err) {
> btrfs_set_path_blocking(p);
> btrfs_tree_lock(b);
> - btrfs_clear_path_blocking(p, b,
> - BTRFS_WRITE_LOCK);
> }
> p->locks[level] = BTRFS_WRITE_LOCK;
> } else {
> @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> if (!err) {
> btrfs_set_path_blocking(p);
> btrfs_tree_read_lock(b);
> - btrfs_clear_path_blocking(p, b,
> - BTRFS_READ_LOCK);
> }
> p->locks[level] = BTRFS_READ_LOCK;
> }
> @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> btrfs_set_path_blocking(p);
> err = split_leaf(trans, root, key,
> p, ins_len, ret == 0);
> - btrfs_clear_path_blocking(p, NULL, 0);
>
> BUG_ON(err > 0);
> if (err) {
> @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> while (b) {
> level = btrfs_header_level(b);
> p->nodes[level] = b;
> - btrfs_clear_path_blocking(p, NULL, 0);
>
> /*
> * we have a lock on b and as long as we aren't changing
> @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> if (!err) {
> btrfs_set_path_blocking(p);
> btrfs_tree_read_lock(b);
> - btrfs_clear_path_blocking(p, b,
> - BTRFS_READ_LOCK);
> }
> b = tree_mod_log_rewind(fs_info, p, b, time_seq);
> if (!b) {
> @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
> path->locks[level - 1] = BTRFS_READ_LOCK;
> path->nodes[level - 1] = cur;
> unlock_up(path, level, 1, 0, NULL);
> - btrfs_clear_path_blocking(path, NULL, 0);
> }
> out:
> path->keep_locks = keep_locks;
> @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> if (!ret) {
> btrfs_set_path_blocking(path);
> btrfs_tree_read_lock(next);
> - btrfs_clear_path_blocking(path, next,
> - BTRFS_READ_LOCK);
> }
> next_rw_lock = BTRFS_READ_LOCK;
> }
> @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> if (!ret) {
> btrfs_set_path_blocking(path);
> btrfs_tree_read_lock(next);
> - btrfs_clear_path_blocking(path, next,
> - BTRFS_READ_LOCK);
> }
> next_rw_lock = BTRFS_READ_LOCK;
> }
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 318be7864072..1aeed3c0e949 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
> struct btrfs_path *btrfs_alloc_path(void);
> void btrfs_free_path(struct btrfs_path *p);
> void btrfs_set_path_blocking(struct btrfs_path *p);
> -void btrfs_clear_path_blocking(struct btrfs_path *p,
> - struct extent_buffer *held, int held_rw);
> void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
>
> int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index f51b509f2d9b..db9f45082c61 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
> i++;
> }
>
> - /* reset all the locked nodes in the patch to spinning locks. */
> - btrfs_clear_path_blocking(path, NULL, 0);
> -
> /* insert the keys of the items */
> setup_items_for_insert(root, path, keys, data_size,
> total_data_size, total_size, nitems);
>