Re: btrfs: hanging processes - race condition?

On Mon, 14 Jun 2010, Chris Mason wrote:

On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng  wrote:
On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
List,

I ran into a hang issue (race condition: CPU usage is high while the server is
idle, meaning that btrfs is hanging, and iowait is high as well) running
2.6.34 on Debian lenny on an x86_64 server (dual Opteron 275 w/ 16GB RAM).
The btrfs filesystem lives on 18x300GB SCSI spindles, configured as RAID-0,
as shown below:

Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
         Total devices 18 FS bytes used 2.94TB
         devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
         devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
         devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
         devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
         devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
         devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
         devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
         devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
         devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
         devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
         devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
         devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
         devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
         devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
         devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
         devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
         devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
         devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
Btrfs v0.19-16-g075587c-dirty

The filesystem, mounted on /mnt/btrfs, is hanging: no existing or new
process can access it, although 'df' still displays the disk usage (3TB out
of 5). The disks appear to be physically healthy. Please note that a
significant number of files were placed on this filesystem, between 20 and
30 million.

The relevant kernel messages are displayed below:

INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
  ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
  ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
  0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
Call Trace:
  [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
  [<ffffffff811470be>] ? get_request_wait+0xab/0x140
  [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
  [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
  [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
  [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
  [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
  [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
  [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
  [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
  [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
  [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
  [<ffffffff81040331>] ? kthread+0x79/0x81
  [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
  [<ffffffff810402b8>] ? kthread+0x0/0x81
  [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10
This looks like the issue we saw too: http://lkml.org/lkml/2010/6/8/375.
It is reproducible in our setup.

I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
The code in the first do-while loop in btrfs_commit_transaction()
sets the current process to the TASK_UNINTERRUPTIBLE state, then calls
btrfs_start_delalloc_inodes(), btrfs_wait_ordered_extents() and
btrfs_run_ordered_operations(). All of these functions may call
cond_resched(), i.e. they may end up scheduling while the task is still
marked uninterruptible.
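For reference, a simplified sketch of the pattern being described
(reconstructed, not the exact 2.6.34 source; flush_pending_work() and the
loop condition are stand-ins):

	DEFINE_WAIT(wait);

	do {
		/* task state is now TASK_UNINTERRUPTIBLE */
		prepare_to_wait(&cur_trans->writer_wait, &wait,
				TASK_UNINTERRUPTIBLE);

		/*
		 * helpers such as btrfs_start_delalloc_inodes(),
		 * btrfs_wait_ordered_extents() and
		 * btrfs_run_ordered_operations() may hit cond_resched()
		 * in here, i.e. schedule() while the task is still
		 * uninterruptible
		 */
		flush_pending_work(cur_trans);	/* stand-in for the calls above */

		if (cur_trans->num_writers > 1)
			schedule_timeout(MAX_SCHEDULE_TIMEOUT);

		finish_wait(&cur_trans->writer_wait, &wait);
	} while (cur_trans->num_writers > 1);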
Hi,
When I test random writes, I see a lot of threads jump into btree_writepages()
and do nothing, and I/O throughput is zero for some time. It looks like there
is a livelock. See the code of btree_writepages():
	if (wbc->sync_mode == WB_SYNC_NONE) {
		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
		u64 num_dirty;
		unsigned long thresh = 32 * 1024 * 1024;

		if (wbc->for_kupdate)
			return 0;

		/* this is a bit racy, but that's ok */
		num_dirty = root->fs_info->dirty_metadata_bytes;
		if (num_dirty < thresh)
			return 0;
	}
The thresh check above is quite intrusive. In my test, the livelock is caused
by that check: dirty_metadata_bytes stays below 32MB, so the function returns
without writing anything. Without the check, I can't see the livelock. Not
sure if this is related to the hang.
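To make the symptom concrete, here is a toy user-space model of the
no-progress loop I'm describing (all names are stand-ins, not the real
writeback code; it only mimics the threshold check quoted above):

	#include <stdio.h>

	#define THRESH	(32UL * 1024 * 1024)

	static unsigned long dirty_metadata_bytes = 8UL * 1024 * 1024; /* < 32MB */

	/* models btree_writepages(): skip background writeback below thresh */
	static unsigned long metadata_writepages(int for_sync)
	{
		unsigned long written;

		if (!for_sync && dirty_metadata_bytes < THRESH)
			return 0;			/* wrote nothing */
		written = dirty_metadata_bytes;
		dirty_metadata_bytes = 0;		/* pretend we flushed */
		return written;
	}

	int main(void)
	{
		int pass;

		/* background writeback keeps retrying while pages stay dirty,
		 * but every pass writes nothing: no forward progress */
		for (pass = 1; pass <= 5; pass++) {
			unsigned long written = metadata_writepages(0);

			printf("pass %d: wrote %lu bytes, %lu still dirty\n",
			       pass, written, dirty_metadata_bytes);
		}

		/* a sync (WB_SYNC_ALL-style) flush does make progress */
		printf("sync pass: wrote %lu bytes\n", metadata_writepages(1));
		return 0;
	}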

How much RAM do you have?  The goal of the check is to avoid writing
metadata blocks, because once we write them we have to do more IO to COW
them again if they are changed later.

This server has 16GB of RAM on an x86_64 (dual Opteron 275, ECC memory).

It shouldn't be looping hard in btrfs there; what was the workload?

The workload was the extraction of large tarballs (one at a time, about 300+ files extracted per second from a single tarball, which is pretty good). As you might expect, the disks were tested (read and write) for physical errors before I reported this bug.


Jerome J. Ibanes
