Hi!

On 6/28/20 3:33 PM, Pablo Fabian Wagner Boian wrote:
> Hi.
>
> Recently, it came to my knowledge that btrfs relies on disks honoring
> fsync. So, when a transaction is issued, all of the tree structure is
> updated (not in-place, bottom-up) and, lastly, the superblock is
> updated to point to the new tree generation. If reordering happens (or
> buggy firmware just drops its cache contents without updating the
> corresponding sectors) then a power-loss could render the filesystem
> unmountable.
>
> Upon more reading, ZFS seems to implement a circular buffer in which
> new pointers are updated one after another. That means that, if older
> generations (in btrfs terminology) of the tree are kept on disk you
> could survive such situations by just using another (older) pointer.

Btrfs does not keep older generations of trees on disk. *)

Immediately after completing a transaction, the space that was used by
the previous metadata can be overwritten again. IIRC, when using the
discard mount option, it's even directly freed up at the disk level by
unallocating the physical space, e.g. by the FTL in an SSD. So, even
while not overwritten yet, reading it back gives you zeros.

*) Well, only for fs trees, and only if you explicitly ask for it, by
making subvolume snapshots/clones.

> I seem to recall having read somewhere that the btrfs superblock
> maintains four pointers to such older tree generations.

Yes, and they're absolutely useless and dangerous to use, **) since
even if you manage to mount a filesystem using one of them, any
metadata in any distant corner of a tree could have been overwritten
already. So, when trying that, you should directly umount again and
run a thorough btrfs check to verify that everything is still present.
But... ugh.

**) Except for one case, which is the filesystem or hardware royally
messing up a transaction commit, and then only when using generation
N-1 to recover, while there has not been any write to the filesystem
in between...
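To illustrate why those backup pointers go stale, here's a deliberately
tiny toy model of a copy-on-write commit (made-up names, nothing like
the real btrfs on-disk format): old metadata space becomes reusable the
instant the superblock pointer flips, so a backup root can end up
pointing at a block that now holds something completely different.

```python
# Toy model of a CoW commit. Not btrfs code; just shows why a backup
# root pointer is unsafe: the space the previous generation occupied
# is immediately reusable after the next commit.

class ToyFs:
    def __init__(self):
        self.disk = {}            # block number -> payload
        self.free = list(range(100))
        self.sb_root = None       # "superblock" pointer to current root
        self.backup_roots = []    # ring of up to 4 old root pointers

    def commit(self, payload):
        # CoW: write the new tree generation into free space first...
        blk = self.free.pop()
        self.disk[blk] = payload
        # ...then atomically flip the superblock pointer. (This flip is
        # the step that relies on the disk honoring flush ordering.)
        if self.sb_root is not None:
            self.backup_roots = (self.backup_roots + [self.sb_root])[-4:]
            # The old generation's space is reusable right away:
            self.free.append(self.sb_root)
        self.sb_root = blk

fs = ToyFs()
fs.commit("gen 1")
fs.commit("gen 2")
old = fs.backup_roots[-1]   # backup pointer to where "gen 1" lived
fs.commit("gen 3")          # reuses that freed block
# fs.disk[old] now holds "gen 3" data; the backup pointer is stale.
```

In the real filesystem it's worse than this sketch suggests: a tree is
many blocks, so even if the root block survives, any leaf anywhere in
the tree may already have been overwritten.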
So, if you try to mount it and it fails halfway, then it's likely
already too late, because the mount could already have done some stuff
like cleaning up orphan objects, or whatever else causes writes during
mount.

> My question is: is the statement in this last paragraph true? If not:
> could it be implemented in btrfs to not depend on correct fsync
> behaviour? I assume it would require an on-disk format change. Lastly:
> are there any downsides in this approach?

Btrfs could be changed to use the same snapshotting techniques in the
background as are already present for fs trees. In the very beginning
of btrfs, this was actually done for a little while, by adding a new
tree root item in metadata tree 1 and then, after transaction commit,
removing the previous one. However, this was soon replaced by
in-memory magic that does not need to actually make changes to tree 1,
because of the processing overhead. (See commit
5d4f98a28c7d334091c1b7744f48a1acdd2a4ae0 "Btrfs: Mixed back reference
(FORWARD ROLLING FORMAT CHANGE)")

The btrfs wiki apparently still lives in 2009, and it has a section
about how it worked before:
https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging

The filesystem trees (subvolumes) are reference counted, which makes
it possible to snapshot them and then properly do long-term on-disk
administration of which little parts of metadata are shared or not
between trees, so that when removing subvolumes or (part of) their
contents, the fs knows which metadata pages to free up.

The other trees (like the extent tree) are 'cowonly', which means that
all new writes go to new empty space, so the fs can crash and recover
(yes, if the hardware behaves like it expects, like you already said).
But, instead of using reference counting, there's an in-memory
blacklist of 'pinned extents', which lists disk space that should not
be overwritten yet, while there is no 'real' on-disk information about
it.
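A minimal sketch of that 'pinned extents' idea (invented names, not the
actual btrfs structures): blocks freed by CoW during a transaction are
blacklisted in memory so the allocator can't hand them out again before
the commit completes. The blacklist itself is lost on a crash, which is
fine, because after a crash the old superblock still references exactly
those blocks.

```python
# Sketch of pinning for cowonly trees. Names are made up; not btrfs
# code. Blocks freed during a transaction are 'pinned' in memory and
# only become allocatable again once the transaction commits.

class Allocator:
    def __init__(self, nblocks):
        self.free = list(range(nblocks))
        self.pinned = set()   # in-memory only; lost on crash (fine!)

    def alloc(self):
        # Never hand out a pinned block during the running transaction.
        for i, blk in enumerate(self.free):
            if blk not in self.pinned:
                return self.free.pop(i)
        raise RuntimeError("no unpinned space")

    def cow_free(self, blk):
        # Old copy of a cowonly tree block: reusable, but only after
        # the current transaction commits.
        self.pinned.add(blk)
        self.free.append(blk)

    def commit_transaction(self):
        # The superblock now points at the new generation; the pinned
        # blocks are no longer referenced and may be overwritten.
        self.pinned.clear()

a = Allocator(2)
old = a.alloc()       # block holding pre-transaction metadata
new = a.alloc()       # CoW writes the updated copy here
a.cow_free(old)       # old copy freed, but pinned until commit
try:
    a.alloc()         # refuses to overwrite the old metadata
except RuntimeError:
    pass
a.commit_transaction()
reused = a.alloc()    # now the old block may be handed out again
```

The real code, of course, tracks extents rather than single blocks and
also handles data extents, but the invariant is the same: nothing
referenced by the last committed superblock gets overwritten mid-transaction.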
The obvious downside of making all trees fully snapshottable and
reference counted is that this would lead to an absolutely gigantic
performance disaster, probably bringing users' actual use of their
filesystems to a screeching halt, while hammering the disk all day
long. But yes, in that case you could theoretically snapshot the
ENTIRE filesystem. Would be fun to do as an experiment. \:D/

> I have skimmed the mailing list but couldn't find concise answers.
> Bear in mind that I'm just a user so I would really appreciate a very
> brief explanation attached to any technical aspect in the response. If
> any of these questions have no merit (or this isn't the appropriate
> place to ask) I'm sorry for the noise and, please, ignore this mail.

It's not noise. Instead, it's a very good question.

So, when browsing btrfs source code and history, some of the relevant
words to look for are 'reference counted', 'pinned' and 'cowonly'.

Have fun,
Hans
