Re: Help interpreting RAID1 space allocation

Chris Murphy posted on Sat, 24 Aug 2013 23:18:26 -0600 as excerpted:

> On Aug 24, 2013, at 11:24 AM, Joel Johnson <mrjoel@xxxxxxxxx> wrote:
>> 
>> Similar to what Duncan described in his response, on a hot-remove
>> (without doing the proper btrfs device delete), there is no opportunity
>> for a rebalance or metadata change on the pulled drives, so I would
>> expect there to be a signature of some sort for consistency checking
>> before readding it. At least, btrfs shouldn't add the readded device
>> back as an active device when it's really still inconsistent and not
>> being used, even if it indicates the same UUID.
> 
> Question: On hot-remove, does 'mount' show the volume as degraded?
> 
> I find the degraded mount option confusing. What does it mean to use -o
> degraded when mounting a volume for which all devices are present and
> functioning?

The degraded mount option does indeed simply ALLOW mounting without all 
devices.  If all devices can be found, btrfs will still integrate them 
all, regardless of the mount option.

Looked at that way, having the degraded option remain even when all 
devices were found and integrated makes sense.  At that point it simply 
records the historical fact that degraded was included when mounting, 
and thus that the filesystem WOULD have come up without all devices, 
had it been unable to find them all.
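
To make the distinction concrete, something like this illustrates what 
I mean (device names are examples only, and treat it as a sketch, I've 
not re-run exactly this):

  # both devices present; mounting with the option still assembles
  # the full two-device raid1
  mount -o degraded /dev/sdb1 /mnt/test

  # the option shows up in the mount table either way
  grep /mnt/test /proc/mounts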

And a hot-remove won't change the options the filesystem was mounted 
with either, so degraded won't (or at least shouldn't; I don't think it 
does, but I didn't actually check that case personally) magically 
appear in the options due to the hot-remove.

However, I /believe/ btrfs filesystem show should display MISSING when a 
device has been hot-removed, until it's added again.  That's what I 
understand Joel to be saying, at least, and it's consistent with my 
understanding of the situation.
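
In other words, to actually check for a dropped device, the command I'd 
reach for is btrfs filesystem show (again a sketch, I've not captured 
the exact output here):

  # run against the mounted filesystem (or with no argument, to scan
  # all filesystems) after the device has been pulled
  btrfs filesystem show /mnt/test
  # the pulled device should be flagged as missing in the device
  # listing, rather than still appearing as a normal active member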

(I would have tested that when I did my original testing, except I 
didn't know my way around multi-device btrfs well enough to properly 
grok either the commands I really should have been running or their 
output.  I did run the commands, but I still had the other device 
attached even tho I'd originally mounted degraded, so it didn't show up 
as missing, and I didn't understand the significance of what I was 
seeing.  All I knew was that the results I got from the separate 
degraded writes followed by a non-degraded mount were NOT what I 
expected.  So I simply resolved to steer well clear of degraded 
mounting in the first place if I could help it, and to wipe and 
clean-add in the event something happened and I really NEEDED that 
degraded mount.)

>> Based on my experience with this and Duncan's feedback, I'd like to see
>> the wiki have some warnings about dealing with multidevice filesystems,
>> especially surrounding the degraded mount option.

Well, as I got told at one point, it's a wiki, knock yourself out. =:^/

Tho... in fairness, while I intend to register and do some of these 
changes at some point, in practice, I'm far more comfortable on 
newsgroups and mailing lists than in web forums or editing wikis, so 
unfortunately I've not gotten "the properly rounded tuit" yet. =:^(

But seriously, Joel, I agree it needs done, and if you get to it before I 
do... there'll be less I need to do.  So if you have the time and 
motivation to do it, please do so! =:^)  Plus you appear to be doing a 
bit more thorough testing with it than I did, so you're arguably better 
placed to do it anyway.

> To me, degraded is an array or volume state, not up to the user to set
> as an option. So I'd like to know if the option is temporary, to more
> easily handle a particular problem for now, but the intention is to
> handle it better (differently) in the future.

Hopefully the above helped with that.  AFAIK the degraded mount-option 
will remain more or less as it is -- simply allowing the filesystem to 
start instead of error-out if it can't find all devices, but effectively 
doing nothing if it does find all devices.

Meanwhile, I'm used to running beta and at times alpha software, and what 
we have now is clearly classic alpha, not all primary features 
implemented yet, let alone all the sharp edges removed and the chrome 
polished up.  Classic beta has all the baseline features (and we are 
getting close) but still has sharp edges/bugs that can hurt if one isn't 
careful around them.  I honestly expect btrfs should be hitting that by 
end of year or certainly early next, as it really is getting close now.

What that means in context is that I expect and hope the last few 
primary features will get added soon: finishing up raid5/6 mode, adding 
full N-way mirroring (not just the 2-way currently referred to as 
raid1), possibly dedup, and finishing up send/receive (it's there, but 
rather too buggy to be entirely practical at present)... AFAIK, that's 
about it on the primary features list.

Then it's beta, with the full focus turning to debugging and getting rid 
of those sharp corners, and I expect THAT is when we'll see some of these 
really bare and sharp-cornered features such as multi-device raidN get 
rounded out in userspace, with the tools actually turning into something 
reasonably usable, not the bare-bones alpha proof-of-concept userspace 
tools we have for the multi-device features at present.

> For raid1 and raid10 this seems a problem for a file system that can
> become very large. The devices have enough information to determine
> exactly how far behind temporarily kicked devices are; it seems they
> effectively have an mdraid write-intent bitmap.

With atomic tree updates not taking effect until the root node is finally 
written, and with btrfs keeping a list of the last several root nodes as 
it has actually been doing for several versions now (since 3.0 at least, 
I believe), I /believe/ it's even better than a 1-deep write-intent 
bitmap, as it's effectively an N-deep stack of such bitmaps. =:^)
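
The raw information is on the devices already, even if nothing consumes 
it that way yet.  For instance (hedging here, I've not walked through 
this on a re-added device myself), comparing superblock generation 
numbers should show how far behind a temporarily kicked device is:

  # dump each member's superblock; the tool name varies by
  # btrfs-progs version (older versions ship btrfs-show-super,
  # newer ones use btrfs inspect-internal dump-super)
  btrfs-show-super /dev/sdb1 | grep generation
  btrfs-show-super /dev/sdc1 | grep generation
  # a lower generation on one device means that device is that many
  # committed transactions behind the other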

The problem is, as I explained above, btrfs is still effectively alpha, 
and the tools we are using to work with it are effectively bare-bones 
proof-of-concept alpha level tools, since not all features have yet been 
fully implemented, let alone having time to flesh anything out properly.

It'll take time...

> I think it's a problem if there isn't a write-intent bitmap equivalent
> for btrfs raid1/raid10, and right now there doesn't seem to be one.

As I explained I believe btrfs has even better.  It's simply that there's 
no proper tools available to use it yet...

> A compulsory rebalance means hours or days of rebalance just because one
> drive was dropped for a short while.

I think I consider myself lucky.  One thing I learned in my years of 
playing with mdraid is how to make proper use of partitions: keep only 
the ones I actually need active and their filesystems mounted, and 
activate/mount read-only where possible.  That way, if a device did 
drop out for whatever reason, between the split-up mounts (meaning 
relatively little actual data was affected) and the write-intent 
bitmaps, I was back online right away.

While btrfs doesn't YET expose its root-node stack as a stack of 
write-intent bitmaps, as it COULD and I believe eventually WILL, 
there's one big difference from my mdraid and write-intent-bitmap days: 
my btrfs filesystems today are on SSD, and they're MUCH faster.

*So* much faster that, between the multiple relatively small partitions 
(fully independent filesystems, no subvolumes; I don't want all my eggs 
in one filesystem-tree basket) and the fact that they're on SSD...

Here, a full filesystem balance typically takes on the order of seconds 
to a minute, depending on the filesystem/partition.  That's rewriting ALL 
data and metadata on the filesystem!
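
For the curious, that's simply something like (the mountpoint is an 
example; older btrfs-progs may want the 'btrfs filesystem balance' 
spelling instead):

  # rewrite every chunk, data and metadata, across the devices
  time btrfs balance start /home
  # for longer-running balances, progress can be checked with
  btrfs balance status /home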

So while I understand the concept of a full multi-terabyte filesystem 
rebalance taking on the order of days, the contrast between that concept, 
and the reality of a few gigabytes of data in its own dedicated 
filesystem on ssd rebalancing in a few tens of seconds here...

Makes a world of difference!

Let's just say I'm glad it isn't the other way around! =:^)

>> This then drives the question, how does one check the degraded state of
>> a filesystem if not the mount flag. I (quite likely with an md-raid
>> bias) expected to use the 'filesystem show' output, listing the devices
>> as well as a status flag of fully-consistent or rebalance-in-progress.
>> If that's not the correct or intended location, then provide
>> documentation on how to properly check the consistency state and
>> degraded state of a filesystem.
> 
> Yeah I think something functionally equivalent to a combination of mdadm
> -D and -E. mdadm distinguishes between array status/metadata vs member
> device status/metadata with those two commands.

While the bare-bones alpha state of the tools explains the current 
situation, nevertheless I believe these sorts of conversations are 
important, as they may very well help drive the shaping of the tools as 
they flesh out.

And yes, I do hope that the btrfs tools eventually get something 
comparable to mdadm -D and -E.  But I think it's equally important to 
realize that mdadm is actually a second-generation solution: what we're 
looking at in mdadm is the end product of several years of maturing, 
plus the raidtools solution before that, and even those were patterned 
after commercial and other raid-product administration tools from 
before that.
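
For anyone who hasn't used the mdadm side, the sort of thing being 
asked for looks roughly like this (example device names, and the btrfs 
lines assume reasonably current btrfs-progs):

  # array-level view: clean/degraded state, rebuild progress, members
  mdadm -D /dev/md0
  # per-member metadata: event count, device role, how far behind it is
  mdadm -E /dev/sda2

  # the nearest btrfs equivalents today are rather thinner
  btrfs filesystem show /mnt/test
  btrfs device stats /mnt/test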

Meanwhile, while there are some analogies between btrfs and md, and 
others between btrfs and zfs, this whole field of having the filesystem 
do everything btrfs is attempting is relatively new ground, and we 
cannot and should not expect to directly compare the state of btrfs 
tools, even once it's first declared stable, with the state of either 
mdadm or zfs tools today.  It'll take some time to get there.

But get there it eventually will, I believe. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman




