Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Thanks for your test script.

I succeeded in pinning down the problem.

The problem is, btrfs is invalidate pages that belongs to ordered extent(goes through writeback)

The ftrace(with more tracepoints to trace qgroup->reserved change) shows:
------
btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096, reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096, reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096, reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096, reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096, reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096


^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.

btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880, reserved=5222400, op=release

^^^^^ Here ordered_io finishes, write the whole 5M data, while QGROUP_RESERVED only returned 5M - 20K.
------

The problem is at btrfs_add_delayed_data_ref(), we use the assumption that, qgroup reserved space is the same as extent size(5M)
But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
Leaking the 20K.

Calltrace:
btrfs_finish_ordered_io()
|- inserted_reserved_file_extent()
   |- btrfs_alloc_reserved_file_extent()
   |  ^^ Here we tell delayed_ref to free 5M later
   |- btrfs_qgroup_release_data()
      ^^ We only released (5M - 20K), as the first 5 pages
         are freed by btrfs_invalidatepage()

If btrfs is designed to freeing pages which belongs to ordered_extent, then it totally breaks the qgroup assumption.

Then we have several ways to fix it:
1) Fix race between ordered extent(writeback) with invalidatepage()
   Make btrfs_invalidatepage() skips pages that are already in
   ordered_extent, then we're OK.

   This is much like your original fix, but I'm not sure if DIRTY page
   flag is bond to ordered_extent.
   And more comment will help.

2) Make btrfs_qgroup_release_data() return accurate num bytes
   And then pass it to delayed_ref head.

   This is quite anti-intuition. If we're adding a new extent, how
   could it happen that some pages are already invalidated?


Anyway, awesome work, exposed a quite strange race and makes us re-think the base of qgroup reserve space framework.

Thanks,
Qu

At 09/28/2016 10:19 AM, Goldwyn Rodrigues wrote:


On 09/27/2016 08:44 PM, Qu Wenruo wrote:

<snipped>

Finally reproduced it.

Although in a low chance, about 1/10.

Under most case, the final remove gets executed without error.

Change 50m to 500m while setting the qgroup limit, the probability will
increase.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux