On 04/27/2012 01:14 AM, Chris Mason wrote:
> On Thu, Apr 26, 2012 at 02:39:23PM +0800, Liu Bo wrote:
>> > Here is a simple scenario:
>> > $ dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k count=20;sync
>> > $ btrfs fi df /mnt/btrfs
>> >
>> > we get 20K used, but then
>> > $ dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k count=4 seek=4 conv=notrunc;sync
>> > $ btrfs fi df /mnt/btrfs
>> >
>> > we get 24K used.
>> > Here is the problem, it is possible that an _unshared_ file with lots of
>> > fragments costs nearly double space than its i_size, like:
>> > 0k 20k
>> > | --- extent --- | turned to be on disk <--- extent ---> <-- A -->
>> > | - A - | | -------------- | | ----- |
>> > 1k 19k 20k + 18k = 38k
>> >
>> > but what users want is <--- extent ---> <-- A -->
>> > | --- | | -- | | ----- |
>> > 1k + 1k + 18k = 20k
>> > so 18k is wasted.
>> >
>> > With the current backref design, there is no easy way to fix this, because it
>> > needs to touch several subtle parts, such as delayed ref stuff, extent backref.
>> >
>> > So here I give it a try by splitting the extent which we're processing(the idea
>> > comes from Chris :)).
>> >
>> > The benifits:
>> > As the above example shows, we'll get three individual extents: 1k + 1k + 18k,
>> > with their checksums are well splitted.
>> >
>> > The defects:
>> > Yes, it makes the code much uglier. And since we've disabled the merging of
>> > delayed refs, we'll get some performance regression.
>> >
>> > NOTE:
>> > The patch may still have some bugs since we need more time to tune the subtle
>> > things.
>
> Thanks for working on this. Could you please explain in detail what the
> pinned extents do?
Sure.
Let's take the above case:
0k 20k
| --- extent --- |
| - A - |
1k 19k
And we assume that this extent starts from disk_bytenr on its FS logical offset.
By splitting the [0k, 20k) extent, we'll get three delayed refs into the delayed-ref rbtree:
a) [0k, 20k), in which only [disk_bytenr+1k, disk_bytenr+19k) will be freed at the end.
b) [0k, 1k), which will _not_ allocate a new extent but use the remained space of [0k, 20k).
c) [19k, 20k), ditto.
And another ref [1k,19k) will get a new allocated space by our normal endio routine.
What I want is
free [0k, 20k), set this range DIRTY in the pinned_extents tree.
alloc [0k, 1k), clear this range DIRTY in the pinned_extents tree.
alloc [19k, 20k), ditto.
However, in my stress test, this three refs may not be ordered by a)->b)->c), but b)->a)->c) instead.
That would be a problem, because it will confuse our space_info's counter: bytes_reserved, bytes_pinned.
So I use EXTENT_PINNED flag to indicate this range will use pinned space rather than reserved space.
As for b)->a)->c), the logic will be:
b) alloc [0k, 1k), since no range covers this, set it PINNED in the pinned_extents tree.
a) free [0k, 20k), set this range DIRTY in the pinned_extents tree, and find [0k, 1k) tagged with PINNED,
clear [0k, 1k) PINNED & DIRTY in the pinned_extents tree, and update our space_info's counter: bytes_pinned.
c) alloc [19k, 20k), clear this range DIRTY in the pinned_extents tree.
Maybe I'd better pick up a new name for EXTENT_PINNED. :)
Actually I'm not satisfied with the patch as I do not want to make the code any uglier at all, so I kept it as RFC...
Any thoughts?
thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html