On 2017-05-02 16:15, Kai Krakow wrote:
On Tue, 2 May 2017 21:50:19 +0200,
Goffredo Baroncelli <kreijack@xxxxxxxxx> wrote:
On 2017-05-02 20:49, Adam Borowski wrote:
It could be some daemon that waits for btrfs to become complete.
Do we have something?
Such a daemon would also have to read the chunk tree.
I don't think that a daemon is necessary. As a proof of concept, in the
past I developed a mount helper [1] which handles the mounting of a
btrfs filesystem: the helper first checks whether the filesystem is a
multi-device one; if so, it waits until all the devices have
appeared. Finally, it mounts the filesystem.
It's not so simple -- such a btrfs device would have THREE states:
1. not mountable yet (multi-device with not enough disks present)
2. mountable ro / rw-degraded
3. healthy
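To make the three states concrete, here is a minimal sketch. It assumes the
state can be derived from simple device counts, which is a simplification:
in practice the answer depends on the chunk tree and the RAID profiles in
use. The names VolumeState, classify, and min_required are hypothetical and
not part of any btrfs tool.

```python
from enum import Enum


class VolumeState(Enum):
    NOT_MOUNTABLE = 1   # multi-device fs with not enough disks present
    DEGRADED = 2        # mountable ro / rw-degraded
    HEALTHY = 3         # all devices present


def classify(present: int, total: int, min_required: int) -> VolumeState:
    """Map device counts to one of the three states (illustrative only).

    min_required is an assumed per-filesystem threshold; a real check
    would have to inspect the chunk tree instead of counting devices.
    """
    if present >= total:
        return VolumeState.HEALTHY
    if present >= min_required:
        return VolumeState.DEGRADED
    return VolumeState.NOT_MOUNTABLE
```

For example, a two-device volume with one disk present would be DEGRADED if
one device is assumed to suffice, and NOT_MOUNTABLE if both are required.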
My mount.btrfs could be "programmed" to wait for a timeout and then
mount the filesystem as degraded if not all devices are present.
This is a very simple strategy, but it could be expanded.
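The timeout strategy could be sketched roughly as follows. This is not the
code of the actual helper [1]; the function name, the all_devices_present
callback, and the default timeout are assumptions made for illustration.
The sketch only builds the mount command line and does not run it. The
"degraded" mount option itself is a real btrfs option.

```python
import time
from typing import Callable, List


def wait_then_build_mount_cmd(
    all_devices_present: Callable[[], bool],  # hypothetical probe
    device: str,
    mountpoint: str,
    timeout: float = 30.0,   # assumed default, not from the helper
    poll: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until all devices show up; fall back to degraded on timeout."""
    deadline = clock() + timeout
    while True:
        if all_devices_present():
            # Every member device is there: normal mount.
            return ["mount", "-t", "btrfs", device, mountpoint]
        if clock() >= deadline:
            # Timed out with devices still missing: mount degraded.
            return ["mount", "-t", "btrfs", "-o", "degraded",
                    device, mountpoint]
        sleep(poll)
```

The returned list could be handed to subprocess.run() by the caller; keeping
the decision separate from the execution makes the policy easy to test.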
I am inclined to think that the current approach doesn't fit the
btrfs requirements well. The roles and responsibilities are spread across
too many layers (udev, systemd, mount)... I hoped that my helper could be
adopted in order to concentrate all the responsibility in a single
binary; this would reduce the number of interfaces with the other
subsystems (e.g. systemd, udev).
For example, it would be possible to implement a sane check that
prevents mounting a btrfs filesystem if two devices expose the same
UUID...
Ideally, the btrfs wouldn't even appear in /dev until it was assembled
by udev. But apparently that's not the case, and I think this is where
the problems come from. I wish btrfs would not show up as device nodes
in /dev that the mount command identifies as btrfs. Instead, btrfs
would expose (probably through udev) a device node
in /dev/btrfs/fs_identifier when it is ready.
Apparently, the core problem of how to handle degraded btrfs still
remains. Maybe it could be solved by adding more stages of btrfs nodes,
like /dev/btrfs-incomplete (for unusable btrfs), /dev/btrfs-degraded
(for btrfs still missing devices but at least one stripe of btrfs raid
available) and /dev/btrfs as the final stage. That way, a mount process
could wait for a while, and if the device doesn't appear, it tries the
degraded stage instead. If the fs is opened from the degraded dev node
stage, udev (or other processes) that scan for devices should stop
assembling the fs if they still do so.
That won't work though because BTRFS is a _filesystem_ not a block
layer. We don't have any way of hiding things. Even if we did, we
would still need to parse the superblocks and chunk tree, and at that
point, it just makes more sense to try to mount the FS instead. IOW,
the correct way to determine if a BTRFS volume is mountable is to try to
mount it, not to wait and try to find all the devices.
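As a rough illustration of "just try to mount it", a caller could retry the
mount attempt a few times instead of probing for devices. This is only a
sketch: mount_with_retries and the try_mount callback are hypothetical
names, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable


def mount_with_retries(
    try_mount: Callable[[], bool],  # hypothetical: True if mount succeeded
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Attempt the mount up to `attempts` times, pausing between tries."""
    for i in range(attempts):
        if try_mount():
            return True
        if i < attempts - 1:
            # Give late-appearing devices a chance before the next try.
            sleep(delay)
    return False
```

Here try_mount might wrap a call to mount(8) and report whether it exited
successfully; the kernel then decides mountability, not the caller.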
bcache has a similar approach by hiding an fs within a protective
superblock. Unless bcache is set up, the fs won't show up in /dev, and
that fs won't be visible by other means. Btrfs should do something
similar and only show a single device node if assembled completely. The
component devices would have superblocks ignored by mount, and only the
final node would expose a virtual superblock and the compound device
after it. Of course, this makes things like compound device resizing
more complicated, maybe even impossible.
Except there is no 'btrfs' device node for a filesystem. The only node
is /dev/btrfs-control, which is used for a small handful of things that
don't involve the mountability of any filesystem. To reiterate, we are
_NOT_ a block layer, so there is _NO_ associated block device for an
assembled multi-device volume, nor should there be.
If I'm not totally wrong, I think this is also how zfs exposes its
pools. You need user space tools to make the fs pools visible in the
tree. If zfs is incomplete, there's nothing to mount, and thus no race
condition. But I never tried zfs seriously, so I do not know.
For zvols, yes, this is how it works. For actual filesystem datasets,
it behaves almost identically to BTRFS AFAIK.