Chris Murphy posted on Thu, 24 Dec 2015 13:57:35 -0700 as excerpted:

>> All this makes me ask why? Why implement Raid10 in this non-standard
>> fashion and create this mess of compromise?
>
> Because it was a straightforward extension of how the file system
> already behaves. To implement drive based copies rather than chunk based
> copies is a totally different strategy that actually negates how btrfs
> does allocation, and would require things like logically checking for
> mirrored pairs being the same size +/- maybe 1% similar to mdadm.
>
> And keep in mind the raid10 multiple device failure is not fixed, not
> just any additional failure is OK. It just depends on aviation's
> equivalent of "big sky theory" for air traffic separation. Yes the
> probability of mirror A's two drives dying is next to zero, but it's not
> zero. If you're building arrays depending on it being zero, well that's
> not a good idea. The way to look at it is more of a bonus of uptime,
> rather than depending on it in design. You design for it's scaleable
> performance, which it does have.

This.

Raid10 doesn't guard against any two random devices going down, let alone a random half of all devices, and anyone running a raid10 on the assumption that it does is simply asking for trouble.

What it /does/ do, in the device-scope raid10 case, is minimize the /chance/ that two devices going down will take out the entire array, particularly on big raid10 arrays, because the chance that any two random devices are the pair mirroring the same content goes down as the total number of devices goes up. (With n devices arranged as n/2 fixed mirror pairs, that chance is 1 in n-1.)

But as Chris Murphy says, btrfs is inherently chunk-scope, not drive-scope. In fact, that's a very large part of its multi-device flexibility in the first place. And raid10 functionality was a straightforward extension of the existing raid1 and raid0 functionality, simply combining them at the same filesystem level with comparatively little extra code.
And that, again, was due to the incredible flexibility that chunk-scope granularity exposes.

Of course, one drawback is that with chunk-scope allocation, the per-device allocation of successive chunks is likely to vary. That means you lose device-scope raid10's low chance of two random failed devices taking the entire array down, because the chance that those two devices hold /both/ mirrors of _some_ chunk-strip is much higher than the chance that they are the two halves of one device-scope mirror. But that's a deliberate tradeoff, one that allowed striped-mirrors raid10 functionality to be exposed in the first place, and as Chris and I are both saying, any admin relying on chance to cover his *** in the two-device failure case on a raid10 is already asking for trouble.

But there are known workarounds for that problem: the layers-on-top-of-layers approach, raid0+1 or raid1+0, each with its own advantages and disadvantages. Of course, btrfs is arguably a layering violation, incorporating both the filesystem and block-device layers (tho that's done with specific advantages in mind), so by the nature of its implementation it has to be the top layer. That does impose some limits if other btrfs features such as checksumming and data integrity are wanted, but it remains simply a question of matching the tradeoffs the technology makes against the ones you're willing to make, within the limits of the available tradeoffs, of course.

Meanwhile, there has been discussion of enhancements to the chunk allocator that would let you pick allocation schemes. Presumably, this would include the ability to nail down mirror allocation to specific devices, which seems to be the requested feature here.
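To put rough numbers on the device-scope vs. chunk-scope odds discussed above, here's a small Python sketch. It rests on a deliberately simplified model, which is my assumption and not how the btrfs allocator actually works: every chunk spans all n devices (n even), and its mirror pairs are an independent, uniformly random pairing of the devices. Device-scope raid10, or mirrors pinned to fixed device pairs, is then just the one-pairing case. The helper names are hypothetical, made up for illustration.

```python
from fractions import Fraction
import random


def device_scope_loss_prob(n: int) -> Fraction:
    """Chance that two random failed devices are one fixed mirror pair.

    Assumes n devices (n even) split into n/2 fixed mirror pairs, as in
    classic device-scope raid10 or mirrors pinned to specific devices.
    """
    if n < 4 or n % 2:
        raise ValueError("need an even number of devices, at least 4")
    mirror_pairs = n // 2
    failure_pairs = n * (n - 1) // 2  # C(n, 2) ways to pick two failed devices
    return Fraction(mirror_pairs, failure_pairs)  # reduces to 1/(n-1)


def chunk_scope_loss_prob(n: int, chunks: int) -> float:
    """Chance that two random failed devices both hold SOME chunk-strip.

    Simplified model (assumption): each chunk pairs the devices by an
    independent uniformly random perfect matching, so a given device pair
    is mirrored within one chunk with probability 1/(n-1).
    """
    p_one_chunk = float(device_scope_loss_prob(n))
    return 1.0 - (1.0 - p_one_chunk) ** chunks


def simulate_chunk_scope(n: int, chunks: int, trials: int = 10_000,
                         seed: int = 0) -> float:
    """Monte Carlo check of chunk_scope_loss_prob under the same model."""
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        failed = set(rng.sample(range(n), 2))  # the two dead devices
        for _ in range(chunks):
            order = list(range(n))
            rng.shuffle(order)  # consecutive entries form mirror pairs
            if any({order[i], order[i + 1]} == failed for i in range(0, n, 2)):
                losses += 1  # some chunk-strip lost both copies
                break
    return losses / trials


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, float(device_scope_loss_prob(n)),
              chunk_scope_loss_prob(n, 10), simulate_chunk_scope(n, 10))
```

With 8 devices, for example, the device-scope figure is 1/7 (about 14%), while under the random-pairing model ten chunks already push the chunk-scope figure to roughly 79%, and it approaches certainty as the chunk count grows. The real allocator chooses devices by free space rather than at random, so treat these as illustrative of the trend, not as predictions for a real filesystem.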
However, while pinning mirrors to specific devices is definitely possible within the flexible framework that btrfs' chunk-scope allocation provides, to my knowledge at least it isn't anywhere on the existing near- or intermediate-term roadmap. So implementation by current developers is likely out beyond the five-year time frame, along with a lot of other such features, which makes it effectively "bluesky": possible, and it would be nice, but there are no near- or intermediate-term plans. That said, if someone with that itch to scratch appears with patches ready to go, and is moreover willing to join the btrfs team and help maintain them longer term, then, assuming there's no huge personality clash, the feature could be implemented rather sooner, perhaps with an initial implementation in a year or two and relative stability in two to three.

In that regard, it's more ENOTIMPLEMENTED than EBLACKLISTED. There are all sorts of features that /could/ be implemented, and this one simply hasn't been a priority for existing developers, given the other features they've found more pressing. But it may indeed eventually come, five or ten years out, or sooner if a suitable developer with suitable interest and social compatibility with the existing devs is found to champion the cause.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
