On 2019/1/3 8:57 AM, Hans van Kranenburg wrote:
> Hi,
>
> On 1/3/19 1:14 AM, Qu Wenruo wrote:
>>
>>
>> On 2019/1/3 12:21 AM, David Sterba wrote:
>>> On Fri, Dec 28, 2018 at 05:28:13PM +0800, Qu Wenruo wrote:
>>>> On 2018/12/28 5:15 PM, Nikolay Borisov wrote:
>>>>> On 28.12.18 10:37, Qu Wenruo wrote:
>>>>>> This patchset can be fetched from:
>>>>>> https://github.com/adam900710/linux/tree/bg_tree
>>>>>> Which is based on the v4.20-rc1 tag.
>>>>>>
>>>>>> This patchset will hugely reduce the mount time of a large fs by
>>>>>> putting all block group items into their own tree.
>
> Thanks a lot for writing a proof of concept for this! This is great.

It's not a PoC anymore. :) It passes all tests without regression.

>
>>>>>> The old behavior tries to read out all block group items at mount
>>>>>> time; however, since the keys of block group items are scattered
>>>>>> across tons of extent items, we must call btrfs_search_slot() for
>>>>>> each block group.
>>>>>>
>>>>>> It works fine for a small fs, but when the number of block groups
>>>>>> goes beyond 200, such tree searches become random reads, causing an
>>>>>> obvious slowdown.
>>>>>>
>>>>>> On the other hand, btrfs_read_chunk_tree() is still very fast, since
>>>>>> we put CHUNK_ITEMs into their own tree and pack them next to each
>>>>>> other.
>>>>>>
>>>>>> Following this idea, we can do the same thing for block group items:
>>>>>> instead of triggering btrfs_search_slot() for each block group, we
>>>>>> just call btrfs_next_item(), and in most cases we can finish in
>>>>>> memory, hugely speeding up mount (see BENCHMARK below).
>
> +many, this is a usability "bug" that comes up regularly on the mailing
> list and IRC, people asking why their filesystem takes long to mount.
>
> I also have some filesystems that I have to set noauto in fstab, and
> then after booting mount manually, and then do some other manual tasks,
> because having them mount automatically during boot causes timeouts and
> stuff.
>
>>>>>> The only disadvantage is that this method introduces an
>>>>>> incompatible feature, so an existing fs can't use this feature
>>>>>> directly.
>>>>>> Either specify it at mkfs time, or use the btrfs-progs offline
>>>>>> convert tool (*).
>>>>>
>>>>> What if we start recording block group items in the chunk tree?
>>>>
>>>> Then the chunk tree will be too hot.
>>>>
>>>> Currently the chunk tree is pretty stable; it only gets modified at
>>>> bg creation/deletion time.
>>>>
>>>> Considering how important the chunk tree is, I prefer to keep the
>>>> chunk root as cold as possible.
>
> This makes sense.
>
>>>> On the other hand, block group items are pretty hot (although less
>>>> hot than the old extent tree), so it still makes sense to put them
>>>> into one tree, allowing the chunk tree to be as cold as ice, while
>>>> keeping block group items relatively safe compared to the old extent
>>>> tree.
>>>
>>> A feature like this should come with an analysis of both approaches in
>>> advance. Both have pros and cons that we need to weigh. E.g. I'm now
>>> more for storing the items in an existing tree, possibly creating a
>>> new tree item that would pack the bg items together at the beginning
>>> of the tree.
>>>
>>> The update frequency of the tree is an aspect that I hadn't considered
>>> before, but I think it's a good point.
>>>
>>> The tree holding the bg items can be considered fundamental and
>>> requires a backup pointer in the superblock. So this would need more
>>> work.
>>
>> Right, for backup roots it indeed makes sense.
>
> I don't really understand why this backup roots mechanism keeps being a
> thing, because technically btrfs cannot guarantee at all that there will
> be anything usable left right after the old metadata extents are
> unpinned...?

It's a fail-safe feature for btrfs restore or similar recovery methods.

But on the other hand, for RO usage we don't even need block group items
at all, just like my skip_bg mount option.
And for RW mounts, if all other trees are OK, btrfs check should be able
to rebuild it.

>
>> However, for the other key type method, I don't really see any pro.
>>
>> To pack bg items together, a new key type is needed anyway.
>
> Not if the block group items are the only thing in the tree...?

That's exactly the method this patchset is using.

>
>> With a new key type, no matter where the new bg items are, older
>> kernels won't be compatible, so it's still an INCOMPAT feature.
>>
>> And whatever tree holds the block group items will be as hot as the
>> extent tree used to be, raising the corruption possibility of whatever
>> that existing tree is. Or it will slow the tree down.
>>
>> So at least from my point of view, storing (new) bg items in an
>> existing tree doesn't make sense.
>>
>> However, I think we should put more discussion into the possible new
>> block group item structure design.
>> E.g. remove the chunk_objectid member, or even give each block group
>> its own tree.
>
> Just thinking out loud...
>
> It seems to me that keeping the same key type and btrfs_block_group_item
> struct and the same key values as they have in the extent tree would be
> desirable if both old and new code have to co-exist in the kernel.

Yes, that's the main point of the current implementation.

>
> This is easy to review...
>
> -	struct btrfs_root *root = fs_info->extent_root;
> +	struct btrfs_root *root;
>
> [...]
>
> +	if (btrfs_fs_incompat(fs_info, BG_TREE))
> +		root = fs_info->bg_root;
> +	else
> +		root = fs_info->extent_root;
>
> ...but creating a new different struct and key type would cause much
> more invasive code changes and duplication (and bugs) all over the
> place, or wrappers to handle either scenario.
>
> I mean, who cares about some unused chunk_objectid field on a multi-TiB
> filesystem...

Well, we already have a similar feature: skinny_metadata reduces the
metadata backref size, and it's already mainlined.

>
> I'd vote for doing things, and more "design for today".

Makes sense.
> Otherwise the same might happen that also happens with some other
> topics every time... it ends up with the idea to rewrite half of btrfs,
> and then in the end nothing happens at all, and the users are still
> unhappy. ;-)

Well, in that case I'd prefer to create another fs, with a "better"
design from the very beginning.

Thanks,
Qu

>
> Even when splitting the extent tree into multiple trees ever, it would
> still be a good idea to have this BG_TREE.
>
