Re: Buggy disk firmware (fsync/FUA) and power-loss btrfs survivability

Hi!

On 6/28/20 3:33 PM, Pablo Fabian Wagner Boian wrote:
> Hi.
> 
> Recently, it came to my knowledge that btrfs relies on disks honoring
> fsync. So, when a transaction is issued, all of the tree structure is
> updated (not in-place, bottom-up) and, lastly, the superblock is
> updated to point to the new tree generation. If reordering happens (or
> buggy firmware just drops its cache contents without updating the
> corresponding sectors) then a power-loss could render the filesystem
> unmountable.
> 
> Upon more reading, ZFS seems to implement a circular buffer in which
> new pointers are updated one after another. That means that, if older
> generations (in btrfs terminology) of the tree are kept on disk you
> could survive such situations by just using another (older) pointer.

Btrfs does not keep older generations of trees on disk. *) Immediately
after completing a transaction, the space that was used by the previous
metadata can be overwritten again. IIRC, when using the discard mount
option, that space is even released right away at the device level,
e.g. by the FTL in an SSD unmapping the physical blocks. So, even while
it has not been overwritten yet, reading it back can give you zeros.

*) Well, only for fs trees, and only if you explicitly ask for it, when
making subvolume snapshots/clones.
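
As a toy illustration of the point above (this is not btrfs code; the
ToyDisk class and its tiny two-block disk are made up for this sketch),
the following Python model commits by writing a new root into free
space, flipping the superblock pointer, and handing the old root's
block straight back to the free list. A stale pointer to the old root
still reads fine right after the commit, but not one transaction later.

```python
class ToyDisk:
    """Toy copy-on-write store: one 'superblock' pointer plus numbered blocks."""

    def __init__(self, nblocks):
        self.blocks = [None] * nblocks      # block contents; None = never written
        self.free = list(range(nblocks))    # free block numbers, lowest first
        self.super_root = None              # block number of the current root
        self.generation = 0

    def commit(self, payload):
        """Write a new root to free space, flip the pointer, free the old root."""
        new_root = self.free.pop(0)         # IndexError if the toy disk is full
        self.generation += 1
        self.blocks[new_root] = (self.generation, payload)
        old_root = self.super_root
        self.super_root = new_root          # stands in for the superblock write
        if old_root is not None:
            self.free.append(old_root)      # old space is reusable immediately
            self.free.sort()
        return old_root


disk = ToyDisk(2)
disk.commit("A")
stale = disk.super_root                     # remember the root of generation 1
disk.commit("B")
print("after one commit: ", disk.blocks[stale])
disk.commit("C")
print("after two commits:", disk.blocks[stale])
```

In this model the second print shows a block from a newer generation
sitting where the old root used to be, which is the situation an old
root pointer can run into.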

> I seem to recall having read somewhere that the btrfs superblock
> maintains four pointers to such older tree generations.

Yes, and they're absolutely useless and dangerous to use, **) since even
if you manage to mount a filesystem using one of them, any metadata in
any distant corner of a tree could have been overwritten already. So,
when trying that, you should unmount again right away and run a
thorough btrfs check to verify that everything is present. But... ugh.

**) Except for one case: the filesystem or hardware royally messing up
a transaction commit, and then only when using generation N-1 to
recover, while there has not been any write to the filesystem in
between. So, if you try to mount it and it fails halfway, then it's
likely already too late, because the mount could already have done some
stuff like cleaning up orphan objects, or whatever else causes writes
during mount.
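
To make "verify that everything is present" a bit more concrete, here
is a minimal Python sketch of such a walk. It is a toy model, not what
btrfs check actually does: the dict layout and the field names "gen"
and "children" are hypothetical, and it assumes every parent pointer
records the generation it expects its child block to have.

```python
def verify_tree(blocks, root_nr, root_gen):
    """Walk a toy tree from a candidate root and collect inconsistencies.

    blocks maps block number -> {"gen": int, "children": [(nr, gen), ...]}.
    Returns a list of (block number, problem) tuples; empty means the walk
    found every referenced block with the expected generation.
    """
    problems = []
    todo = [(root_nr, root_gen)]
    seen = set()
    while todo:
        nr, expected_gen = todo.pop()
        if nr in seen:
            continue
        seen.add(nr)
        block = blocks.get(nr)
        if block is None:
            problems.append((nr, "missing"))
        elif block["gen"] != expected_gen:
            # Something newer (or older) now lives where this tree points.
            problems.append((nr, "generation mismatch"))
        else:
            todo.extend(block["children"])
    return problems


# Old root at block 1 (generation 5); block 3 was reused by generation 7.
toy_blocks = {
    1: {"gen": 5, "children": [(2, 5), (3, 4)]},
    2: {"gen": 5, "children": []},
    3: {"gen": 7, "children": []},
}
print(verify_tree(toy_blocks, 1, 5))
```

The point of the sketch is only that the damage can sit anywhere below
the root, so a successful mount by itself proves very little.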

> My question is: is the statement in this last paragraph true? If not:
> could it be implemented in btrfs to not depend on correct fsync
> behaviour? I assume it would require an on-disk format change. Lastly:
> are there any downsides in this approach?

Btrfs could be changed to use the same snapshotting techniques in the
background as are already present for fs trees. In the very beginning of
Btrfs, this was actually used for a little bit, by adding a new tree
root item in metadata tree 1 and then removing the previous one after
transaction commit. However, because of the processing overhead, this
was soon replaced by in-memory magic that does not need to actually make
changes to tree 1. (See commit
5d4f98a28c7d334091c1b7744f48a1acdd2a4ae0 "Btrfs: Mixed back reference
(FORWARD ROLLING FORMAT CHANGE)")

The btrfs wiki apparently still lives in 2009, and it has a section
about how it worked before:

https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging

The filesystem trees (subvolumes) are reference counted, which makes it
possible to snapshot them and then properly do long term on-disk
administration of which little parts of metadata are shared or not
between trees, so that when removing subvolumes or (part of) their
contents, the fs knows what metadata pages to free up.
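
A minimal Python sketch of that idea, with heavy simplifications: a
flat list of page ids stands in for a whole tree, and the names
RefcountedPages, snapshot, cow_write and delete_tree are invented for
this example. The real back reference bookkeeping in btrfs is
considerably more involved.

```python
class RefcountedPages:
    """Toy store that tracks how many trees reference each metadata page."""

    def __init__(self):
        self.refs = {}       # page id -> reference count
        self.next_id = 0

    def alloc(self):
        page = self.next_id
        self.next_id += 1
        self.refs[page] = 1
        return page

    def share(self, page):
        self.refs[page] += 1

    def drop(self, page):
        """Drop one reference; return True if the page became free."""
        self.refs[page] -= 1
        if self.refs[page] == 0:
            del self.refs[page]
            return True
        return False


def snapshot(store, tree):
    """A snapshot shares every page of the source tree."""
    for page in tree:
        store.share(page)
    return list(tree)


def cow_write(store, tree, index):
    """Modify one page copy-on-write; return True if the old page was freed."""
    old = tree[index]
    tree[index] = store.alloc()
    return store.drop(old)


def delete_tree(store, tree):
    """Remove a tree and return the pages that nobody references anymore."""
    freed = [page for page in tree if store.drop(page)]
    tree.clear()
    return freed


store = RefcountedPages()
subvol = [store.alloc() for _ in range(3)]
snap = snapshot(store, subvol)
cow_write(store, subvol, 1)          # subvol and snap now differ in one page
print(delete_tree(store, snap))      # only the page unique to snap is freed
```

Deleting the snapshot frees just the one page the subvolume no longer
shares with it; the shared pages stay allocated.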

The other trees (like the extent tree) are 'cowonly', which means that
all new writes go to new, empty space, so the fs can crash and recover
(yes, if the hardware behaves as the fs expects, like you already said).
But, instead of using reference counting, there's an in-memory blacklist
of 'pinned extents', which lists disk space that should not be
overwritten yet, while there is no 'real' on-disk information about it.
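
The pinning idea can be sketched in a few lines of Python. This is a
toy model under my own assumptions, not the btrfs allocator: the class
name PinningAllocator and its methods are hypothetical, and blocks are
plain integers.

```python
class PinningAllocator:
    """Toy allocator: blocks released in a transaction stay pinned until commit."""

    def __init__(self, nblocks):
        self.free = set(range(nblocks))
        self.pinned = set()          # lives in memory only, never written out

    def alloc(self):
        block = min(self.free)       # ValueError if nothing is free
        self.free.remove(block)
        return block

    def release(self, block):
        """The running transaction no longer needs this block.

        The last committed state may still point at it, so it must not be
        handed out again before the next commit completes.
        """
        self.pinned.add(block)

    def commit(self):
        """Call once the new superblock is safely on disk."""
        self.free |= self.pinned
        self.pinned.clear()


alloc = PinningAllocator(3)
old = alloc.alloc()
alloc.commit()                       # 'old' is now part of the committed state
new = alloc.alloc()                  # copy-on-write: new location for the data
alloc.release(old)                   # old copy is pinned, not free
other = alloc.alloc()
print(old, new, other)               # 'other' never reuses 'old' here
alloc.commit()
print(alloc.alloc())                 # only now can the old block come back
```

Until commit() runs, the released block is in neither the free set nor
any on-disk structure of this model; it is only remembered in memory.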

The obvious downside of making all trees fully snapshottable and
reference counted is that this would lead to a total absolute gigantic
performance disaster, probably bringing users' ability to actually use
their filesystems to a screeching halt, while hammering on disk all day
long. But yes, in that case you could theoretically snapshot the ENTIRE
filesystem. Would be fun to do as an experiment. \:D/

> I have skimmed the mailing list but couldn't find concise answers.
> Bear in mind that I'm just a user so I would really appreciate a very
> brief explanation attached to any technical aspect in the response. If
> any of these questions have no merit (or this isn't the appropriate
> place to ask) I'm sorry for the noise and, please, ignore this mail.

It's not noise. Instead, it's a very good question.

So, when browsing btrfs source code and history, some of the relevant
words to look for are 'reference counted', 'pinned' and 'cowonly'.

Have fun,
Hans


