Hans van Kranenburg wrote:
Hi!
On 6/28/20 3:33 PM, Pablo Fabian Wagner Boian wrote:
Hi.
Recently, it came to my knowledge that btrfs relies on disks honoring
fsync (i.e. cache flush requests). So, when a transaction is committed,
all of the tree structure is updated (not in-place, bottom-up) and,
lastly, the superblock is updated to point to the new tree generation.
If reordering happens (or buggy firmware just drops its cache contents
without having written out the corresponding sectors), then a power loss
could render the filesystem unmountable.
Upon more reading, ZFS seems to implement a circular buffer of root
pointers (uberblocks), into which each new pointer is written one after
another. That means that, if older generations (in btrfs terminology) of
the tree are still kept on disk, you could survive such situations by
just using another (older) pointer.
Btrfs does not keep older generations of trees on disk. *) Immediately
after completing a transaction, the space that was used by the previous
metadata can be overwritten again. IIRC, when using the discard mount
option, it's even freed up directly at the disk level, with e.g. the FTL
in an SSD unallocating the physical space. So, even while not overwritten
yet, reading it back gives you zeros.
*) Well, only for fs trees, and only if you explicitly ask for it, when
making subvolume snapshots/clones.
So just out of curiosity... if BTRFS internally, at every successful
mount, did a 'btrfs subvolume snapshot /mountpoint /mountpoint/fsbackup1'
(snapshot, not create - create only makes a new, empty subvolume), you
would always have a good filesystem tree to fall back to?! Would this be
correct?!
And if so - this would mean that you would lose everything that
happened since the last mount, but compared to having a catastrophic
failure this sounds much, much better.
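From userspace the idea above can be sketched roughly as follows. This
is only an illustration with made-up paths, and note that cloning an
existing subvolume is done with 'btrfs subvolume snapshot' ('btrfs
subvolume create' only makes a new, empty subvolume):

```shell
# Hypothetical mount-time fallback snapshot; /mountpoint is illustrative.
# The command is built as a string and only echoed (dry run), since
# actually running it needs a real btrfs filesystem and root privileges.
MNT=${MNT:-/mountpoint}
CMD="btrfs subvolume snapshot -r $MNT $MNT/fsbackup1"
echo "$CMD"
```

The -r makes the fallback read-only, so it cannot be modified by
accident; a real implementation would also have to delete or rotate the
previous fsbackup1 first.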
And if I, as just a regular BTRFS user with my (possibly distorted)
view, see this correctly: if you would leave the top level subvolume (5)
untouched and avoid updates to it except for creating child subvolumes,
you would reduce the risk of catastrophic failure in case an fsync does
not work out, as only the child subvolumes (which are regularly updated)
would be at risk.
And if BTRFS internally made alternating snapshots of the root
subvolume (5)'s child subvolumes, you would lose at most 2 x 30 sec
(or whatever the commit interval is set to) of data.
E.g. keep only child subvolumes on the top level (5).
And if we pretend the top level has a child subvolume called rootfs,
then BTRFS could internally auto-snapshot (5)/rootfs, alternating
between (5)/rootfs_autobackup1 and (5)/rootfs_autobackup2.
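As a userspace sketch (not what a kernel-internal implementation would
do), such an alternating scheme might look like this. All names here
(MNT, STATE, rootfs, the _autobackup slots) are made up, and the script
defaults to a dry run that only prints the btrfs commands:

```shell
#!/bin/sh
# Sketch of the alternating auto-snapshot idea; paths are illustrative.
# DRY_RUN=1 (the default here) only prints the btrfs commands; set
# DRY_RUN=0 to actually run them on a real filesystem (as root).
MNT=${MNT:-/mnt/toplevel}             # top level subvolume (5) mounted here
STATE=${STATE:-/tmp/rootfs_snap_slot} # remembers which slot was used last
DRY_RUN=${DRY_RUN:-1}

run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

# Alternate between slot 1 and slot 2, so one known-good snapshot always
# survives while the other one is being replaced.
last=$(cat "$STATE" 2>/dev/null || echo 2)
if [ "$last" = 1 ]; then next=2; else next=1; fi

run btrfs subvolume delete "$MNT/rootfs_autobackup$next"
run btrfs subvolume snapshot -r "$MNT/rootfs" "$MNT/rootfs_autobackup$next"
echo "$next" > "$STATE"
```

Deleting the stale slot before snapshotting into it is the point of
keeping two: at any moment at least one complete, read-only backup
snapshot exists on disk.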
Do I understand this correctly, or would there be any (significant)
performance drawback to this? Quite frankly, I assume there is, or else
I guess it would have been done already, but it never hurts (that much)
to ask...