Re: [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking

On Mon, Aug 20, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
> 
> 
> On 20.08.2018 09:07, Liu Bo wrote:
> > On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
> >>
> >>
> >> On 17.08.2018 00:07, Liu Bo wrote:
> >>> Btrfs's btree locking has two modes, spinning mode and blocking mode.
> >>> While searching the btree, locks are always acquired in spinning mode
> >>> and then converted to blocking mode if necessary, and in some hot
> >>> paths we may switch the locking back to spinning mode by
> >>> btrfs_clear_path_blocking().
> >>>
> >>> When acquiring a lock, both readers and writers need to wait for any
> >>> blocking readers and writers to finish before doing
> >>> read_lock()/write_lock().
> >>>
> >>> The problem is that btrfs_clear_path_blocking() first needs to switch
> >>> the nodes in the path to blocking mode (by btrfs_set_path_blocking())
> >>> to make lockdep happy, before doing its actual job of clearing the
> >>> blocking state.
> >>>
> >>> Switching from spinning mode to blocking mode consists of
> >>>
> >>> step 1) bumping up the blocking readers/writers counter and
> >>> step 2) read_unlock()/write_unlock(),
> >>>
> >>> which causes a serious ping-pong effect when there is a large number
> >>> of concurrent readers/writers, as waiters are woken up only to go
> >>> back to sleep immediately.
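
[ For reference, the spinning -> blocking switch above is roughly the
following; this is a simplified sketch of the write side of
btrfs_set_lock_blocking_rw(), with the lockdep and nested-lock details
left out: ]

	/* step 1: advertise ourselves as a blocking writer */
	atomic_inc(&eb->blocking_writers);
	/*
	 * step 2: drop the spinning rwlock; whoever was spinning on
	 * eb->lock now gets it, sees blocking_writers > 0, drops it
	 * again and goes to sleep on the wait queue -- that wake-up
	 * plus immediate sleep is the ping-pong.
	 */
	write_unlock(&eb->lock);
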
> >>
> >> I think this ping-pong needs to be explained in a bit more detail.
> >>
> >> Looking at the code it seems the issue happens when we have a path
> >> locked in spinning mode (via btrfs_tree_lock) - in this case we have
> >> blocking_readers/blocking_writers == 0 and the write_lock acquired.
> >> Then we could potentially have multiple requests to lock the tree (in
> >> this case the root) and since the current holder is spinning,
> >> btrfs_tree_lock callers will block on the write_lock. Subsequently,
> >> when the holder switches to blocking it calls
> >> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw, which of course
> >> increments the blocking readers/writers counter and calls the unlock,
> >> at which point a "thundering herd" problem occurs on the lock. Am I
> >> correct?
> >>
> > 
> > That's correct, but the "thundering herd" problem is not really the
> > problem this patch is trying to address; see more below.
> > 
> >> Furthermore, I think the problem occurs not in
> >> btrfs_clear_path_blocking but rather in the initial call to
> >> btrfs_set_path_blocking.
> >>
> >> The way I see it the following sequence of operations occur:
> >>
> >> 1. Call btrfs_set_path_blocking: increments
> >> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
> >> the lock types for the nodes in the path are
> >> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
> >>
> >> 2. Some time later we call btrfs_clear_path_blocking,
> >> which also calls btrfs_set_path_blocking; however, this time the
> >> function doesn't do anything, because the lock types in the path
> >> struct are already blocking and neither of the 2 conditions inside
> >> btrfs_set_lock_blocking_rw will match, hence this code won't be
> >> executed.
> >>
> >> Did I miss something ?
> >>
> > 
> > Thanks for sharing your thoughts.
> > 
> > We have to set the path to blocking as the subsequent code needs to
> > sleep; it's the cost we have to pay by design, and I'm OK with it.
> > 
> > The problem is that btrfs_clear_path_blocking() always tries to set
> > the path to blocking before clearing the blocking state, as it needs
> > to ensure the spin locks on each level are acquired in the right
> > order.  And if all the nodes in the path are already in spinning mode
> > (say should_cow_block() decides not to COW), setting the path to
> > blocking doesn't really make sense and causes others waiting on the
> > lock to do a wake-up and a go-to-sleep.
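
[ Concretely, the waiter side in btrfs_tree_lock() looks roughly like
the following (simplified, nested-lock handling dropped); this is where
the wake-up plus immediate go-to-sleep happens: ]

again:
	wait_event(eb->write_lock_wq,
		   atomic_read(&eb->blocking_writers) == 0);
	write_lock(&eb->lock);
	if (atomic_read(&eb->blocking_writers)) {
		/*
		 * The holder went blocking after we were woken up;
		 * drop the lock and go back to sleep.
		 */
		write_unlock(&eb->lock);
		goto again;
	}
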
> 
> Ok, for the case of btrfs_clear_path_blocking in btrfs_search_slot I
> agree this could happen, but it can't in any of the other hunks you
> touched, since they all call btrfs_clear_path_blocking AFTER
> btrfs_set_path_blocking was called. So the problem can't stem from
> them, since they take the perf hit at the time btrfs_set_path_blocking
> is called. IMO this is important context that needs to be included in
> the changelog. In short, the only problematic code is the one in
> btrfs_search_slot.
> 
> Another less intrusive idea is to touch only btrfs_search_slot by
> introducing a boolean flag to indicate we have set the path to
> blocking/COWed blocks.
> 
> Then the fix will be a simple:
> 
> if (cowed/path_blocking/whatever)
>     btrfs_clear_path_blocking(p, NULL, 0);

if (cow)
	btrfs_clear_path_blocking(p, NULL, 0);

I did exactly this at first, and it ended up at 1m20s, while killing
all of the btrfs_clear_path_blocking() calls gave a stable 1m2s, so I
confirmed that even switching the path back from blocking to spinning
also hurts a little bit.
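
Roughly, what I tried first looked like the following (a sketch only,
the names are approximate and this is not the exact diff):

	/* in btrfs_search_slot() */
	bool cowed = false;
	...
	if (should_cow_block(trans, root, b)) {
		/* COW may sleep, so the path goes blocking here */
		btrfs_set_path_blocking(p);
		...
		cowed = true;
	}
cow_done:
	p->nodes[level] = b;
	/*
	 * Only pay for the blocking -> spinning switch if we
	 * actually went blocking above.
	 */
	if (cowed)
		btrfs_clear_path_blocking(p, NULL, 0);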

> 
> Yours is also a valid approach, albeit a more heavyweight one. Having
> said that, I'm all in favor of removing code so long as it does not
> introduce any regressions :)

Since all the callers that were running in a spinning-lock context must
already be prepared to cooperate with a blocking-lock context, it
should be fine IMO.

Thanks for the comments again.

thanks,
-liubo

> 
> 
> > 
> > My trace showed that there could be up to 50 switches for a single
> > lock waiter, and the difference in finish times also proved how big
> > the impact is.
> >
> > thanks,
> > -liubo
> >>>
> >>> 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> >>> files creation script,
> >>>
> >>> MNT=/mnt/btrfs
> >>> mkfs.btrfs -f /dev/sdf
> >>> mount /dev/sdf $MNT
> >>> time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
> >>>         -d  $MNT/0  -d  $MNT/1 \
> >>>         -d  $MNT/2  -d  $MNT/3 \
> >>>         -d  $MNT/4  -d  $MNT/5 \
> >>>         -d  $MNT/6  -d  $MNT/7 \
> >>>         -d  $MNT/8  -d  $MNT/9 \
> >>>         -d  $MNT/10  -d  $MNT/11 \
> >>>         -d  $MNT/12  -d  $MNT/13 \
> >>>         -d  $MNT/14  -d  $MNT/15
> >>>
> >>> w/o patch:
> >>> real    2m27.307s
> >>> user    0m12.839s
> >>> sys     13m42.831s
> >>>
> >>> w/ patch:
> >>> real    1m2.273s
> >>> user    0m15.802s
> >>> sys     8m16.495s
> >>>
> >>> 2) dbench also proves the improvement,
> >>> dbench -t 120 -D /mnt/btrfs 16
> >>>
> >>> w/o patch:
> >>> Throughput 158.363 MB/sec
> >>>
> >>> w/ patch:
> >>> Throughput 449.52 MB/sec
> >>>
> >>> 3) xfstests didn't show any additional failures.
> >>>
> >>> One thing to note is that callers may set leave_spinning to keep all
> >>> nodes in the path in spinning mode, which means those callers are
> >>> prepared to not sleep before releasing the path. But holding blocking
> >>> locks won't cause problems for callers that don't want to sleep, IOW,
> >>> we can just get rid of leave_spinning.
> >>>
> >>> Signed-off-by: Liu Bo <bo.liu@xxxxxxxxxxxxxxxxx>
> >>> ---
> >>>  fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
> >>>  fs/btrfs/ctree.h         |  2 --
> >>>  fs/btrfs/delayed-inode.c |  3 ---
> >>>  3 files changed, 4 insertions(+), 58 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> >>> index d436fb4c002e..8b31caa60b0a 100644
> >>> --- a/fs/btrfs/ctree.c
> >>> +++ b/fs/btrfs/ctree.c
> >>> @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
> >>>  	}
> >>>  }
> >>>  
> >>> -/*
> >>> - * reset all the locked nodes in the patch to spinning locks.
> >>> - *
> >>> - * held is used to keep lockdep happy, when lockdep is enabled
> >>> - * we set held to a blocking lock before we go around and
> >>> - * retake all the spinlocks in the path.  You can safely use NULL
> >>> - * for held
> >>> - */
> >>> -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
> >>> -					struct extent_buffer *held, int held_rw)
> >>> -{
> >>> -	int i;
> >>> -
> >>> -	if (held) {
> >>> -		btrfs_set_lock_blocking_rw(held, held_rw);
> >>> -		if (held_rw == BTRFS_WRITE_LOCK)
> >>> -			held_rw = BTRFS_WRITE_LOCK_BLOCKING;
> >>> -		else if (held_rw == BTRFS_READ_LOCK)
> >>> -			held_rw = BTRFS_READ_LOCK_BLOCKING;
> >>> -	}
> >>> -	btrfs_set_path_blocking(p);
> >>> -
> >>> -	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
> >>> -		if (p->nodes[i] && p->locks[i]) {
> >>> -			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
> >>> -			if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
> >>> -				p->locks[i] = BTRFS_WRITE_LOCK;
> >>> -			else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
> >>> -				p->locks[i] = BTRFS_READ_LOCK;
> >>> -		}
> >>> -	}
> >>> -
> >>> -	if (held)
> >>> -		btrfs_clear_lock_blocking_rw(held, held_rw);
> >>> -}
> >>> -
> >>>  /* this also releases the path */
> >>>  void btrfs_free_path(struct btrfs_path *p)
> >>>  {
> >>> @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
> >>>  		}
> >>>  	}
> >>>  
> >>> -	btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
> >>>  	btrfs_tree_read_unlock_blocking(eb);
> >>>  	free_extent_buffer(eb);
> >>>  
> >>> @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >>>  		btrfs_set_path_blocking(p);
> >>>  		reada_for_balance(fs_info, p, level);
> >>>  		sret = split_node(trans, root, p, level);
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>  		BUG_ON(sret > 0);
> >>>  		if (sret) {
> >>> @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >>>  		btrfs_set_path_blocking(p);
> >>>  		reada_for_balance(fs_info, p, level);
> >>>  		sret = balance_level(trans, root, p, level);
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>  		if (sret) {
> >>>  			ret = sret;
> >>> @@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  		}
> >>>  cow_done:
> >>>  		p->nodes[level] = b;
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>> +		/*
> >>> +		 * Leave path with blocking locks to avoid massive
> >>> +		 * lock context switch, this is made on purpose.
> >>> +		 */
> >>>  
> >>>  		/*
> >>>  		 * we have a lock on b and as long as we aren't changing
> >>> @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  					if (!err) {
> >>>  						btrfs_set_path_blocking(p);
> >>>  						btrfs_tree_lock(b);
> >>> -						btrfs_clear_path_blocking(p, b,
> >>> -								  BTRFS_WRITE_LOCK);
> >>>  					}
> >>>  					p->locks[level] = BTRFS_WRITE_LOCK;
> >>>  				} else {
> >>> @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  					if (!err) {
> >>>  						btrfs_set_path_blocking(p);
> >>>  						btrfs_tree_read_lock(b);
> >>> -						btrfs_clear_path_blocking(p, b,
> >>> -								  BTRFS_READ_LOCK);
> >>>  					}
> >>>  					p->locks[level] = BTRFS_READ_LOCK;
> >>>  				}
> >>> @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  				btrfs_set_path_blocking(p);
> >>>  				err = split_leaf(trans, root, key,
> >>>  						 p, ins_len, ret == 0);
> >>> -				btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>  				BUG_ON(err > 0);
> >>>  				if (err) {
> >>> @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >>>  	while (b) {
> >>>  		level = btrfs_header_level(b);
> >>>  		p->nodes[level] = b;
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>  		/*
> >>>  		 * we have a lock on b and as long as we aren't changing
> >>> @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >>>  			if (!err) {
> >>>  				btrfs_set_path_blocking(p);
> >>>  				btrfs_tree_read_lock(b);
> >>> -				btrfs_clear_path_blocking(p, b,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			b = tree_mod_log_rewind(fs_info, p, b, time_seq);
> >>>  			if (!b) {
> >>> @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
> >>>  		path->locks[level - 1] = BTRFS_READ_LOCK;
> >>>  		path->nodes[level - 1] = cur;
> >>>  		unlock_up(path, level, 1, 0, NULL);
> >>> -		btrfs_clear_path_blocking(path, NULL, 0);
> >>>  	}
> >>>  out:
> >>>  	path->keep_locks = keep_locks;
> >>> @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >>>  			if (!ret) {
> >>>  				btrfs_set_path_blocking(path);
> >>>  				btrfs_tree_read_lock(next);
> >>> -				btrfs_clear_path_blocking(path, next,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			next_rw_lock = BTRFS_READ_LOCK;
> >>>  		}
> >>> @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >>>  			if (!ret) {
> >>>  				btrfs_set_path_blocking(path);
> >>>  				btrfs_tree_read_lock(next);
> >>> -				btrfs_clear_path_blocking(path, next,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			next_rw_lock = BTRFS_READ_LOCK;
> >>>  		}
> >>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >>> index 318be7864072..1aeed3c0e949 100644
> >>> --- a/fs/btrfs/ctree.h
> >>> +++ b/fs/btrfs/ctree.h
> >>> @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
> >>>  struct btrfs_path *btrfs_alloc_path(void);
> >>>  void btrfs_free_path(struct btrfs_path *p);
> >>>  void btrfs_set_path_blocking(struct btrfs_path *p);
> >>> -void btrfs_clear_path_blocking(struct btrfs_path *p,
> >>> -			       struct extent_buffer *held, int held_rw);
> >>>  void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
> >>>  
> >>>  int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> >>> index f51b509f2d9b..db9f45082c61 100644
> >>> --- a/fs/btrfs/delayed-inode.c
> >>> +++ b/fs/btrfs/delayed-inode.c
> >>> @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
> >>>  		i++;
> >>>  	}
> >>>  
> >>> -	/* reset all the locked nodes in the patch to spinning locks. */
> >>> -	btrfs_clear_path_blocking(path, NULL, 0);
> >>> -
> >>>  	/* insert the keys of the items */
> >>>  	setup_items_for_insert(root, path, keys, data_size,
> >>>  			       total_data_size, total_size, nitems);
> >>>
> > 


