On Sun, Jun 8, 2014 at 12:11 PM, Filipe David Manana <fdmanana@xxxxxxxxx> wrote:
> On Sun, Jun 8, 2014 at 11:54 AM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
>> On Fri, Jun 06, 2014 at 01:11:17PM +0100, Filipe David Manana wrote:
>>> On Fri, Jun 6, 2014 at 7:25 AM, Liu Bo <bo.li.liu@xxxxxxxxxx> wrote:
>>> > Several reports about leaf corruption has been floating on the list, one of them
>>> > points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
>>> > after __btrfs_drop_extents(), it's really a rare case but it does exist.
>>> >
>>> > The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
>>> >
>>> > So in btrfs_next_leaf(), we release the current path to re-search the last key of
>>> > the leaf for locating next leaf, and we've taken it into account that there might
>>> > be balance operations between leafs during this 'unlock and re-lock' dance, so
>>> > we check the path again and advance it if there are now more items available.
>>> > But things are a bit different if that last key happens to be removed and balance
>>> > gets a bigger key as the last one, and btrfs_search_slot will return it with
>>> > ret > 0,
>>>
>>> Hi Liu,
>>>
>>> That makes a lot of sense outside the context of btrfs_drop_extents().
>>>
>>> If I understand you correctly, you're saying that we have a file
>>> extent item that gets deleted after we release the path in
>>> btrfs_next_leaf and before btrfs_search_slot finishes for doing a
>>> lookup for that item.
>>
>> Hi Filipe,
>>
>> Not just "finishes", even before btrfs_search_slot "starts".
>
> Yes :)
>
>>
>>>
>>> But who calls btrfs_drop_extents(), is supposed to have locked the
>>> file range in the inode's io tree, right? Isn't the goal of locking
>>> ranges in the io tree to serialize manipulation of file extent items
>>> within a certain range? Unless there's some caller of _drop_extents()
>>> that isn't locking the range in the io tree, I'm not seeing how we can
>>> get the file extent item deleted or updated by another task.
>>
>> Oh, good question! I must admit I forget about that locking, just check the log,
>> so here is the answer -- the locking range and the extent range is not
>> consistent.
>>
>> In __btrfs_drop_extents(), there is such a case,
>>
>> /*
>> * | ---- range to drop ----- |
>> * | -------- extent -------- |
>> */
>> if (start > key.offset && end >= extent_end) {
>>
>> we do actually lock the range, but in this case, the extent is not included in
>> our range, so this give a chance for others to hack in to do 'confusing-us' things.
>
> Hum, yes. I thought about the cases where others split our extent
> item, but that shouldn't remove the key.
>
> There is one case (I thought about after) where it's obvious the
> issue you found in btrfs_next_leaf can affect btrfs_drop_extents: we
> have NO_HOLES set, our first tree seach returns us an extent that ends
> before our target range's start (therefore outside the range we locked
> in the io tree) and that extent is the last item in a leaf. And then
> the problem you described happens between the unlock and re-lock.
>
>>
>> I can add these to the commit log if you want, others may also feel confused.
>
> It is fine for me as it is.
>
> Thanks Liu
>
>>
>> thanks,
>> -liubo
>>
>>>
>>> Thanks
>>>
>>> > IOW, nothing change in this leaf except the new last key, then we think
>>> > we're okay because there is no more item balanced in, fine, we thinks we can
>>> > go to the next leaf.
>>> >
>>> > However, we should return that bigger key, otherwise we deserve leaf corruption,
>>> > for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
>>> > it has dropped all extent matched the required range and finish_ordered_io can
>>> > safely insert a new extent, but it actually doesn't and ends up a leaf
>>> > corruption.
>>> >
>>> > This takes the special case into account.
>>> >
>>> > Signed-off-by: Liu Bo <bo.li.liu@xxxxxxxxxx>
Reviewed-by: Filipe Manana <fdmanana@xxxxxxxxx>
Can affect other users of btrfs_next_leaf too that aren't necessarily
modifying anything in tree (like _drop_extents), makes readers miss
items.
>>> > ---
>>> > fs/btrfs/ctree.c | 18 ++++++++++++++++++
>>> > 1 file changed, 18 insertions(+)
>>> >
>>> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>>> > index 1bcfcdb..bbb256c 100644
>>> > --- a/fs/btrfs/ctree.c
>>> > +++ b/fs/btrfs/ctree.c
>>> > @@ -5736,6 +5736,24 @@ again:
>>> > ret = 0;
>>> > goto done;
>>> > }
>>> > + /*
>>> > + * So the above check misses one case:
>>> > + * - after releasing the path above, someone has removed the item that
>>> > + * used to be at the very end of the block, and balance between leafs
>>> > + * gets another one with bigger key.offset to replace it.
>>> > + *
>>> > + * This one should be returned as well, or we can get leaf corruption
>>> > + * later(esp. in __btrfs_drop_extents()).
>>> > + *
>>> > + * And a bit more explanation about this check,
>>> > + * with ret > 0, the key isn't found, the path points to the slot
>>> > + * where it should be inserted, so the path->slots[0] item must be the
>>> > + * bigger one.
>>> > + */
>>> > + if (nritems > 0 && ret > 0 && path->slots[0] == nritems - 1) {
>>> > + ret = 0;
>>> > + goto done;
>>> > + }
>>> >
>>> > while (level < BTRFS_MAX_LEVEL) {
>>> > if (!path->nodes[level]) {
>>> > --
>>> > 1.8.1.4
>>> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Filipe David Manana,
>>>
>>> "Reasonable men adapt themselves to the world.
>>> Unreasonable men adapt the world to themselves.
>>> That's why all progress depends on unreasonable men."
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
> Unreasonable men adapt the world to themselves.
> That's why all progress depends on unreasonable men."
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html