On 2018-01-27 17:42, Tomasz Pala wrote:
On Sat, Jan 27, 2018 at 14:26:41 +0100, Adam Borowski wrote:
It's quite obvious who's the culprit: every single remaining rc system
manages to mount degraded btrfs without problems. They just don't try to
outsmart the kernel.
Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, a crypto layer, network-attached
storage, etc.
I think you mean any setup that isn't sensibly layered. BCP for over a
decade has been to put multipathing at the bottom, then crypto, then
software RAID, then LVM, and then whatever filesystem you're using.
Multipathing has to be the bottom layer for a given node because it
interacts directly with hardware topology which gets obscured by the
other layers. Crypto essentially has to be next, otherwise you leak
info about the storage stack. Swapping LVM and software RAID ends up
giving you a setup which is difficult for most people to understand and
therefore is hard to reliably maintain.
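As a rough sketch, building such a stack looks something like this (device
and volume names are hypothetical, and the second path would be prepared
the same way):

    # multipathing at the bottom, directly over the hardware topology
    multipath -ll                             # confirm mpatha/mpathb exist
    # crypto next, so nothing above it leaks info about the stack
    cryptsetup luksFormat /dev/mapper/mpatha
    cryptsetup open /dev/mapper/mpatha crypt0
    # software RAID over the crypto devices
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/mapper/crypt0 /dev/mapper/crypt1
    # LVM over the RAID
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate -L 100G -n data vg0
    # and finally whatever filesystem you're using
    mkfs.xfs /dev/vg0/data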
Other init systems enforce things being this way because it maintains
people's sanity, not because they have significant difficulty doing
things differently (and in fact, it is _trivial_ to change the ordering
in some of them; re-ordering N layers with OpenRC on Gentoo, for example,
quite literally requires changing N-1 lines across N files, as sketched
below), provided each layer occurs exactly once for a given device and
the relative ordering is the same on all devices. And you know what?
Given my own experience with systemd, it has exactly the same constraint
on relative ordering. I've tried to run split setups with LVM and
dm-crypt where one device had dm-crypt as the bottom layer and the other
had it as the top layer, and things locked up during boot on _every_
generalized init system I tried.
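To give a concrete sketch of the OpenRC case (assuming the stock Gentoo
service names dmcrypt, mdraid and lvm, with one rc_need override per
dependent service in /etc/conf.d):

    # /etc/conf.d/mdraid -- RAID waits for the crypto layer
    rc_need="dmcrypt"

    # /etc/conf.d/lvm -- LVM waits for the RAID layer
    rc_need="mdraid"

Swapping any two layers means editing just those lines.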
Recently I've started mdadm on top of a bunch of LVM volumes, with others
using btrfs and still others prepared for crypto. And you know what? systemd
assembled everything just fine.
So, with an argument just like yours:
It's quite obvious who's the culprit: every single remaining filesystem
manages to mount under systemd without problems. They just expose
information about their state.
No, they don't (except ZFS). There is no 'state' to expose for anything
but BTRFS (and ZFS), except possibly whether the filesystem needs to be
checked or not. You're conflating filesystems and volume management.
The alternative way of putting what you just said is:
Every single remaining filesystem manages to mount under systemd without
problems, because it doesn't try to treat them as a block layer.
This is not a systemd issue, but apparently a btrfs design choice: allowing
any single component device's name to be used as the name of the volume itself.
And what other user interface would you propose? The only alternative I see
is inventing a device manager (like you're implying below that btrfs does),
which would needlessly complicate the usual single-device case.
The 'needless complication', as you named it, usually should be the default.
Avoiding LVM? Then take care of repartitioning yourself. Avoiding mdadm?
Then there's no easy way to RAID the drive (there are device-mapper tricks,
they are just way more complicated). Even attaching an SSD cache is not
trivial without preparation (for bcache the preparation is absolutely
necessary; it's much easier with LVM in place).
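For comparison, with a VG already in place, attaching a dm-cache SSD cache
is roughly this (VG/LV names hypothetical):

    # add the SSD to the existing volume group
    vgextend vg0 /dev/nvme0n1p1
    # carve a cache pool out of it
    lvcreate --type cache-pool -L 20G -n cache0 vg0 /dev/nvme0n1p1
    # attach it to the data LV
    lvconvert --type cache --cachepool vg0/cache0 vg0/data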
For a bog-standard client system, all of those _ARE_ overkill (and
actually, so is BTRFS in many cases too; it's just that we're the only
option for mainline filesystem-level snapshots at the moment).
If btrfs pretends to be device manager it should expose more states,
But it doesn't pretend to.
Why does mounting sda2 require sdb2 in my setup, then?
First off, it shouldn't, provided you've passed the `degraded` mount
option and aren't using a profile that can't tolerate any missing devices.
It doesn't work in your case because you are using systemd.
Second, BTRFS is not a volume manager, it's a filesystem with
multi-device support. The difference is that it's not a block layer,
despite the fact that systemd is treating it as such. Yes, BTRFS has
failure modes that result in regular operations being refused based on
what storage devices are present, but so does every single distributed
filesystem in existence, and none of those are volume managers either.
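To make the first point concrete, with a two-device raid1 profile and sdb2
absent, this is roughly what you'd see at the CLI (a sketch, not your exact
setup):

    # the kernel refuses a plain mount when a raid1 member is missing
    mount /dev/sda2 /mnt                 # fails
    # explicitly opting in to a degraded mount lets it proceed
    mount -o degraded /dev/sda2 /mnt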
especially "ready to be mounted, but not fully populated" (i.e.
"degraded mount possible"). Then systemd could _fallback_ after timing
out to degraded mount automatically according to some systemd-level
option.
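Something like this imaginary fstab knob, say (the option name is made up;
nothing like it exists today):

    # hypothetical syntax for the proposed fallback behaviour
    /dev/sda2  /data  btrfs  x-systemd.degraded-fallback=90s  0  0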
You're assuming that btrfs somehow knows this itself.
"It's quite obvious who's the culprit: every single volume manager keeps
track of its component devices".
Unlike the bogus assumption systemd makes, namely that by counting
devices you can know whether a degraded or non-degraded mount is possible,
it is in general not possible to know whether a mount attempt will succeed
without actually trying.
There is a term for such a situation: broken by design.
So in other words, it's broken by design to try to connect to a remote
host without pinging it first to see if it's online? Or to try to send
a signal to a given process without first checking that it's still
running, or to open a file without first checking if we have permission
to read it, or to try to mount any other filesystem without first
checking if the superblock is valid?
In all of those cases, there is no advantage to trying to figure out if
what you're trying to do is going to work before doing it, because every
one of those operations is functionally atomic (it either happens or it
doesn't, period), and has a clear-cut return code that tells you
directly if it succeeded or not.
There's a name for the type of design you're saying we should have here:
it's called a time-of-check-to-time-of-use (TOCTOU) race condition. It's
one of the easiest types of race conditions to find, and also one of the
easiest to fix. Ask any sane programmer, and he will say that _that_ is
broken by design.
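A trivial shell-level illustration of the difference (file name
hypothetical):

    # check-then-use: racy, the file can change or vanish between steps
    if [ -r /etc/foo.conf ]; then
        cat /etc/foo.conf
    fi
    # just-try: do the operation and look at the exit status instead
    cat /etc/foo.conf || echo "cannot read config" >&2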
Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
naive counting of this kind; it had to be replaced by actually checking
whether at least one copy of every block group is present.
And you still blame systemd for using BTRFS_IOC_DEVICES_READY?
Given that it's been proven that it doesn't work and the developers
responsible for its usage don't want to accept that it doesn't work? Yes.
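For the record, the check in question is the same one exposed by the btrfs
CLI: an ioctl on /dev/btrfs-control whose result says whether the kernel
has seen every member device of the filesystem:

    # exit status 0: the kernel believes all devices of the filesystem
    # containing sda2 have been scanned; non-zero otherwise
    btrfs device ready /dev/sda2
    echo $?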
[...]
just slow to initialize (USB...). So, systemd asks sda how many devices
there are; the answer is "3" (sdb and sdc would answer the same, BTW). It can
even ask for UUIDs -- all devices are present. So, mount will succeed,
right?
Systemd doesn't count anything; it asks BTRFS_IOC_DEVICES_READY as
implemented in btrfs/super.c.
I.e., the thing systemd can safely do is stop trying to rule everything
and refrain from telling the user whether he can mount something or not.
Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.
Or maybe we should just remove it completely, because checking it _IS
WRONG_, which is why no other init system does it, and in fact no
_human_ who has any kind of basic knowledge of how BTRFS operates does
it either.