Hi everyone,

I suppose I have an answer to my initial question. Thanks for all the
discussion.

I'd just like to stress how important it is, in my opinion, for btrfs to
recognize that drives are missing or dead and to halt all operations that
would advance the metadata when a portion of the drives is temporarily
disconnected, even if that requires a tool to restore consistency after
this sort of failure.

I mentioned the btrfs rescue command with the mismatching fsid message.
After dd'ing /dev/zero to all but the boot drive, the fsid mismatch went
away, but the tool still segfaults on the filesystem after losing half of
the drives, so at best the fsid mismatch error was just cosmetic.

-Dave

> -----Original Message-----
> From: linux-btrfs-owner@xxxxxxxxxxxxxxx
> [mailto:linux-btrfs-owner@xxxxxxxxxxxxxxx] On Behalf Of Duncan
> Sent: Thursday, December 24, 2015 5:23 PM
> To: linux-btrfs@xxxxxxxxxxxxxxx
> Subject: Re: Loss of connection to Half of the drives
>
> Chris Murphy posted on Thu, 24 Dec 2015 13:57:35 -0700 as excerpted:
>
> >> All this makes me ask why? Why implement Raid10 in this non-standard
> >> fashion and create this mess of compromise?
> >
> > Because it was a straightforward extension of how the file system
> > already behaves. To implement drive based copies rather than chunk
> > based copies is a totally different strategy that actually negates how
> > btrfs does allocation, and would require things like logically
> > checking for mirrored pairs being the same size +/- maybe 1% similar
> > to mdadm.
> >
> > And keep in mind the raid10 multiple device failure is not fixed, not
> > just any additional failure is OK. It just depends on aviation's
> > equivalent of "big sky theory" for air traffic separation. Yes the
> > probability of mirror A's two drives dying is next to zero, but it's
> > not zero. If you're building arrays depending on it being zero, well
> > that's not a good idea.
> > The way to look at it is more of a bonus of uptime, rather than
> > depending on it in design. You design for its scalable performance,
> > which it does have.
>
> This.
>
> Raid10 doesn't guard against any random two devices going down, let
> alone a random half of all devices, and anyone running a raid10 with the
> assumption that it does is simply asking for trouble.
>
> What it /does/ do, in the device-scope raid10 case, is minimize the
> /chance/ that two devices down will take out the entire array,
> particularly on big raid10 arrays, because the chance of any random two
> devices being the two devices mirroring the same content goes down as
> the number of total devices goes up.
>
> But as Chris Murphy says, btrfs is inherently chunk-scope, not
> drive-scope. In fact, that's a very large part of its multi-device
> flexibility in the first place. And raid10 functionality was a
> straightforward extension of the existing raid1 and raid0 functionality,
> simply combining them into one at the same filesystem level with
> comparatively little extra code. And that, again, was due to the
> incredible flexibility that chunk-scope granularity exposes.
>
> Of course one drawback is that with chunk-scope allocation, the
> per-device allocation of successive chunks is likely to vary, meaning
> you lose the low device-scope chance of two random devices taking the
> entire array down, because the chance of those two random devices
> containing /both/ mirrors of _some_ chunk-strips is much higher than it
> is with device-scope allocation and both copies of the device-scope
> mirror, but that's a taken tradeoff that allowed the exposure of
> striped-mirrors raid10 functionality in the first place, and as Chris
> and I are both saying, any admin relying on chance to cover his *** in
> the two-device failure case on a raid10 is already asking for trouble.
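
[Inline note: a rough sketch of the two-device failure odds described in
the quoted paragraphs above. With fixed mirror pairs on n devices, the
second failed device is the first one's partner with probability
1/(n - 1). The chunk-scope function is a toy model only: it assumes every
chunk re-pairs all devices at random, which is an assumption made for
illustration and not how the btrfs allocator actually chooses devices.
Both function names are hypothetical.]

```python
import random
from fractions import Fraction


def fixed_pair_loss_probability(n_devices):
    """Chance that two random failed devices are a mirror pair when
    mirror pairs are fixed per device (device-scope raid10)."""
    if n_devices < 4 or n_devices % 2:
        raise ValueError("need an even number of devices, at least 4")
    # Whichever device fails first, exactly one of the remaining
    # n - 1 devices holds its mirror.
    return Fraction(1, n_devices - 1)


def chunk_scope_loss_probability(n_devices, n_chunks, seed=0):
    """Toy model (assumption, NOT the btrfs allocator): each chunk
    pairs all devices at random. Returns the fraction of two-device
    combinations that hold both mirrors of at least one chunk."""
    if n_devices < 4 or n_devices % 2:
        raise ValueError("need an even number of devices, at least 4")
    rng = random.Random(seed)
    devices = list(range(n_devices))
    fatal_pairs = set()
    for _ in range(n_chunks):
        rng.shuffle(devices)
        # Adjacent devices in the shuffled order mirror each other
        # for this chunk.
        for i in range(0, n_devices, 2):
            fatal_pairs.add(frozenset(devices[i:i + 2]))
    total_pairs = n_devices * (n_devices - 1) // 2
    return Fraction(len(fatal_pairs), total_pairs)


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, fixed_pair_loss_probability(n),
              chunk_scope_loss_probability(n, n_chunks=100))
```

[In this toy model a single chunk gives the same odds as fixed pairs, and
the odds can only rise as more chunks are allocated with different
pairings, which is the tradeoff the quoted text describes.]
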
>
> But there are known workarounds for that problem, the layers on top of
> layers scenario, raid0+1 or raid1+0, each with its own advantages and
> disadvantages. Of course, btrfs arguably being a layering violation
> incorporating both filesystem and block level layers, tho it's done with
> specific advantages in mind, does by definition of implementation have
> to be the top layer, which does impose some limits if other btrfs
> features such as checksumming and data integrity are wanted, but it
> remains simply a question of matching the tradeoffs the technology makes
> against the ones you're willing to make, within the limitations of the
> available tradeoffs pool, of course.
>
>
> Meanwhile, there has been discussion of enhancements to the chunk
> allocator that would let you pick allocation schemes. Presumably, this
> would include the ability to nail down mirror allocation to specific
> devices, which seems to be the requested feature here. However, while
> definitely possible within the flexible framework btrfs' chunk-scope
> allocation provides, to my knowledge at least, this isn't anywhere on
> the existing near or intermediate term roadmap, so implementation by
> current developers is likely out beyond the five year time frame, along
> with a lot of other such features, making it effectively "bluesky", aka,
> possible, and would be nice, but no near or intermediate term plans, tho
> if someone with that itch to scratch appears with the patches ready to
> go, who moreover is willing to join the btrfs team and help maintain
> them longer term, assuming there's no huge personality clash, the
> feature could be implemented rather sooner, perhaps with initial
> implementation in a year or two and relative stability in two to three.
>
> In that regard, it's more ENOTIMPLEMENTED, rather than EBLACKLISTED.
> There's all sorts of features that /could/ be implemented, and this one
> simply hasn't been a priority for existing developers, given the other
> features they've found to be more pressing. But it may indeed eventually
> come, five or ten years out, sooner if a suitable developer with
> suitable interest and social compatibility with existing devs is found
> to champion the cause.
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master -- and if you use the
> program, he is your master." Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> in the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
