On 2018-12-17 22:00, Martin Raiber wrote:
I had lockups with this patch as well. If you put e.g. a loop device
on top of a btrfs file, loop sets PF_LESS_THROTTLE on its worker
thread to avoid a feedback loop causing delays. The task balancing
dirty pages in btrfs_finish_ordered_io doesn't have that flag and
causes slow-downs. In my case it managed to create a feedback loop
where it queues further btrfs_finish_ordered_io work and gets stuck
completely.
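For reference, the loop driver sets that flag on its kthread worker
roughly like this (quoted from drivers/block/loop.c from memory, so
treat it as a sketch rather than an exact excerpt):

static int loop_kthread_worker_fn(void *worker_ptr)
{
	/* let this worker dirty pages past the normal per-task limit */
	current->flags |= PF_LESS_THROTTLE;
	return kthread_worker_fn(worker_ptr);
}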
The data writepage endio queues a work item for
btrfs_finish_ordered_io() in a separate workqueue and clears the
page's writeback bit, so throttling in btrfs_finish_ordered_io()
should not slow down the flusher thread. One suspicious point: while
a caller is waiting for a range of ordered extents to complete, it
will be blocked until balance_dirty_pages_ratelimited() makes some
progress, since we finish ordered extents in
btrfs_finish_ordered_io().
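To make that concrete, with this patch the tail of the function looks
roughly like the following (a simplified reconstruction from memory,
not the literal diff; almost all of the real function is omitted):

static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
{
	struct inode *inode = ordered_extent->inode;
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);

	/*
	 * Insert the file extent item, checksums and updated inode item
	 * for this ordered extent -- the step that dirties metadata
	 * (btree) pages.
	 */

	/* mark the ordered extent complete, waking waiters sleeping in
	 * btrfs_start_ordered_extent() */
	btrfs_remove_ordered_extent(inode, ordered_extent);

	/*
	 * Added by this patch: throttle on dirty metadata pages, which
	 * eventually reaches balance_dirty_pages_ratelimited() on the
	 * btree inode's mapping. While this worker sleeps here, other
	 * finish_ordered_io works queued behind it are delayed, so
	 * ordered extents that an fsync is still waiting for stay
	 * incomplete.
	 */
	btrfs_btree_balance_dirty(fs_info);
	return 0;
}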
Do you have call stack information for the stuck processes, or are
they using fsync/sync frequently? If that is the case, maybe we
should pull this out and try to balance dirty metadata pages
somewhere else.
Yeah, like:
[875317.071433] Call Trace:
[875317.071438] ? __schedule+0x306/0x7f0
[875317.071442] schedule+0x32/0x80
[875317.071447] btrfs_start_ordered_extent+0xed/0x120
[875317.071450] ? remove_wait_queue+0x60/0x60
[875317.071454] btrfs_wait_ordered_range+0xa0/0x100
[875317.071457] btrfs_sync_file+0x1d6/0x400
[875317.071461] ? do_fsync+0x38/0x60
[875317.071463] ? btrfs_fdatawrite_range+0x50/0x50
[875317.071465] do_fsync+0x38/0x60
[875317.071468] __x64_sys_fsync+0x10/0x20
[875317.071470] do_syscall_64+0x55/0x100
[875317.071473] entry_SYSCALL_64_after_hwframe+0x44/0xa9
so I guess the problem is that calling balance_dirty_pages causes
fsyncs to the same btrfs (via my unusual setup of loop+fuse)? Those
fsyncs are deadlocked because they are called indirectly from
btrfs_finish_ordered_io... It is an unusual setup, which is why I did
not post it to the mailing list initially.
To me this does not look like a real deadlock. An fsync call involves
two steps: (1) flush the dirty data pages, and (2) update the
corresponding metadata to point to the flushed data. Since step 1
consumes dirty pages and step 2 produces more dirty pages, this patch
leaves step 1 unchanged and blocks step 2 in
btrfs_finish_ordered_io(), which seems reasonable for an OOM fix. The
problem is that if other processes keep writing new data, the fsync
call may have to wait a long time for the metadata update, even
though its own dirty data was flushed long ago.
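Condensed heavily, the fsync path in question is (just a sketch to
show the two steps; the real btrfs_sync_file() does quite a bit
more):

int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
	struct inode *inode = file_inode(file);

	/* step 1: flush the dirty data pages in [start, end] */
	btrfs_fdatawrite_range(inode, start, end);

	/*
	 * step 2: wait for the ordered extents, i.e. for
	 * btrfs_finish_ordered_io() to update the metadata for the
	 * flushed data. With this patch the wait also absorbs whatever
	 * time the endio worker spends in
	 * balance_dirty_pages_ratelimited(), so a heavy writer elsewhere
	 * can delay this fsync even though its own data was flushed
	 * long ago.
	 */
	btrfs_wait_ordered_range(inode, start, end - start + 1);

	/* ... then log/commit the metadata ... */
	return 0;
}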
Back to the deadlock problem: what Chris found is a real deadlock,
and it can be fixed by adding a check for the free space inode.
For the fsync delay problem, it depends on whether we want to keep
the amount of dirty metadata pages within the dirty_bytes limit here.
If we don't, we can revert this patch and try flushing dirty metadata
pages somewhere else; that should be enough to fix the OOM. But if we
do want dirty metadata pages to respect the dirty_bytes limit, we
still need some throttling here. Setting __GFP_WRITE on extent_buffer
pages is bad, since most of the time we use them as a read cache to
speed up b-tree traversal. Adding PF_LESS_THROTTLE in
btrfs_finish_ordered_io() might help, since it creates a window where
new writes are blocked but the flushing and metadata updates stay
unblocked, while still keeping some kind of throttling. But it needs
more testing to know the worst-case fsync delay.
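Something like the following, as a rough, untested sketch of the idea
(__finish_ordered_io() is only a placeholder for the existing body of
the function; PF_LESS_THROTTLE and current_restore_flags() are the
existing kernel facilities):

static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
{
	struct btrfs_fs_info *fs_info = btrfs_sb(ordered_extent->inode->i_sb);
	unsigned int pflags = current->flags & PF_LESS_THROTTLE;
	int ret;

	/*
	 * Like the loop worker, let this thread go somewhat over the
	 * dirty limits so it keeps making progress while ordinary
	 * writers are blocked in balance_dirty_pages().
	 */
	current->flags |= PF_LESS_THROTTLE;

	ret = __finish_ordered_io(ordered_extent);	/* placeholder */

	/* still throttle, but with the relaxed per-task limit */
	btrfs_btree_balance_dirty(fs_info);

	current_restore_flags(pflags, PF_LESS_THROTTLE);
	return ret;
}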