Re: RAID1, SSD+non-SSD

On Sat, Feb 7, 2015 at 1:39 AM, Duncan <1i5t5.duncan@xxxxxxx> wrote:
> Brian B posted on Fri, 06 Feb 2015 15:01:30 -0500 as excerpted:
>
>> The only reason I'm doing the [btrfs] RAID1 is for the self-healing. I
>> realize writing large amounts of data will be slower than the SSD
>> alone, but is it possible to set it up to only read from the magnetic
>> drive if there's an error reading from the SSD?
>
> Chris Murphy is correct.  Btrfs raid1 doesn't have the write-mostly
> option that mdraid has.
>
> I'll simply expand on what he mentioned with two points, #1 being the
> more important for your case.
>
> 1) The btrfs raid1 read-mode device choice algorithm is known to be sub-
> optimal, and the plan is to change and optimize it in the longer term.
> Basically, it's an easy first implementation that's simple enough to be
> reasonably bug-free and to stay out of the developers' way while they
> work on other things, while still allowing easy testing of both
> devices.
>
> Specifically, it's a very simple even/odd parity assignment based on the
> PID making the request.  Thus, a single PID read task will consistently
> read from the same device (unless a block checksum fails, in which case
> it tries the other device), no matter how much there is to read
> and how backed up that device might be, or how idle the other one might
> be. Even a second read task from another PID, or a 10th, or the 100th, if
> they're all even or all odd parity PIDs, will all be assigned to read
> from the same device, even if the other one is entirely idle.
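>
> A toy sketch of that selection policy (hypothetical Python, purely
> illustrative; the real logic lives in the kernel, in
> fs/btrfs/volumes.c):
>
>   def pick_mirror(pid, num_mirrors=2):
>       # Mirror choice depends only on the reader's PID parity, not on
>       # device load or idleness.  (On a checksum failure the read is
>       # retried from the other mirror.)
>       return pid % num_mirrors
>
>   # Every even-PID reader hits mirror 0 and every odd-PID reader hits
>   # mirror 1, so same-parity readers all pile onto one device:
>   for pid in (1000, 1002, 1004, 1001):
>       print(pid, "-> mirror", pick_mirror(pid))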
>
> Which ends up being the worst case for a multi-threaded, read-heavy
> task where all the read threads happen to have even (or all odd) PIDs,
> say if read and compute threads are paired and always spawned in the
> same order, with nothing else going on to throw the parity ordering
> off.  But that's how it's currently implemented.  =:^(
>
> And it /does/ make for easily repeatable test results, while being
> simple enough to stay out of the way while development interest focuses
> elsewhere, both of which are pretty important factors early in a
> project of this scope. =:^)
>
>
> Obviously, that's going to be bad news for you, too, unless your use-case
> is specific enough that you can tune the read PIDs to favor the parity
> that hits the SSD. =:^(
>
>
> The claim is made that btrfs is stabilizing, and in fact, as a regular
> here for some time, I can vouch for that.  But I think it's reasonable to
> argue that until this sort of read-scheduling algorithm is replaced with
> something a bit more optimized, and of course that replacement well
> tested, it's definitely premature to call btrfs fully stable.  This sort
> of painfully bad (in some cases) mis-optimization just doesn't fit with
> "stable", and until development quiets down far enough that the devs
> feel comfortable focusing on something like this, it's hard to argue
> that development has quieted down enough to fairly call it stable in
> the first place.
>
> Well, my opinion anyway.
>
> So the short of it is, at least until btrfs optimizes this a bit
> better, for an SSD paired with spinning rust in raid1, use some sort of
> caching mechanism, as Chris Murphy suggested: bcache or dm-cache.
>
> Tho you'll want to compare notes with someone who has already tried it,
> as there were some issues with at least btrfs and bcache earlier.  I
> believe they're fixed now, but as explained above, btrfs itself isn't
> really entirely stable yet, so I'd definitely recommend keeping backups,
> and comparing notes with others who have tried it.  (I know there's some
> on the list, tho they may not see this.  But hopefully they'll respond to
> a new thread with bcache or dm-cache in the title, if you decide to go
> that way.)
>
>
> 2) While this doesn't make a significant difference in the two-device
> btrfs raid1 case, it does with three or more devices in the btrfs raid1,
> and with other raid forms the difference is even stronger.  I noticed you
> wrote RAID1 in ALL CAPS form.  Btrfs' raid implementations aren't quite
> like traditional RAID, and I recall a dev (Chris Mason, actually, IIRC)
> pointing out that the choice to use small-letters raidX nomenclature was
> deliberate, in order to remind people that there is a difference.
>
> Specifically for btrfs raid1, as contrasted to, for instance, md/RAID-1,
> at present btrfs raid1 is always pair-mirrored, regardless of the number
> of devices (above two, of course).  While a three-device md/RAID-1 will
> have three mirrors and a four-device md/RAID-1 will have four, simply
> adding redundant mirrors while maintaining capacity (in the simple all-
> the-same-size case, anyway), a three-device btrfs raid1 will have 1.5x
> the capacity of a two-device btrfs raid1, and a four-device btrfs raid1
> will have twice the two-device capacity, while maintaining a constant
> pair-mirroring regardless of the number of devices in the btrfs raid1.
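>
> To put numbers on that, a rough sketch in Python (assuming equal-size
> devices and ignoring chunk-allocation details):
>
>   def md_raid1_capacity(n_devices, device_size):
>       # md/RAID-1 keeps n_devices full mirrors, so usable space
>       # stays at one device's worth no matter how many are added.
>       return device_size
>
>   def btrfs_raid1_capacity(n_devices, device_size):
>       # btrfs raid1 stores exactly two copies of each chunk, so
>       # usable space is half the total pool.
>       return n_devices * device_size / 2.0
>
>   for n in (2, 3, 4):
>       print(n, "devices:", md_raid1_capacity(n, 1.0), "vs",
>             btrfs_raid1_capacity(n, 1.0))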
>
> For btrfs raid10, the pair-mirroring is still there, but with odd
> numbers of devices the striping is also uneven, because of the odd
> device out in the mirroring and the difference in chunk size between
> data and metadata chunks.
>
> And of course there's the difference that data and metadata are treated
> separately in btrfs, and don't have to have the same raid levels, nor are
> they the same by default.  A filesystem agnostic raid such as mdraid or
> dmraid will by definition treat data and metadata alike as it won't be
> able to tell the difference -- if it did it wouldn't be filesystem
> agnostic.
>
>
> Now that btrfs raid56 mode is basically complete with kernel 3.19, the
> next thing on the raid side of the roadmap is N-way-mirroring.  I'm
> really looking forward to that, as I like btrfs' self-repair
> capabilities too, but for me the ideal balance is three-way-mirroring,
> just in case two copies fail checksum.  Tho the fact of the matter is,
> btrfs is only now getting to the point where a third mirror has some
> reasonable chance of being useful, as until now btrfs itself was unstable
> enough that the chances of it having a bug were far higher than of both
> devices going bad for a checksummed block at the same time.  But btrfs
> really is much more stable than it was, and it's stable enough now that
> the possibility of a third mirror really should start making statistical
> sense pretty soon, if it doesn't already.
>
> But given the time raid56 took, I'm not holding my breath.  I guess
> they'll be focused on the remaining raid56 bugs thru 3.20, and figure
> it'll be at least three kernel cycles later, so second half of the year
> at best, before we see N-way-mirroring in mainstream.  This time next
> year would actually seem more reasonable, and 2H-2016 or into 2017
> wouldn't surprise me in the least, again, given the time raid56 mode
> took.  Hopefully it'll be there before 2018...
>
>
> Tho as I said, for the two-device case, if both data and metadata are
> raid1 mode, those differences can for the most part be ignored.  Thus,
> this point is mostly for others reading, and for you in the future should
> you end up working with a btrfs raid1 with more than two devices.  I
> mostly mentioned it due to seeing that all-caps RAID1.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>

Thanks, very informative about the read algorithm.  Sounds like it makes
more sense to simply do backups to the slower drive and manually
restore from those if I ever have a checksum error.

My main goal here was protection from undetected sector corruption
("bitrot", etc.) without having to halve my SSD capacity.  But on btrfs
I suppose it's impossible for bitrot errors to creep into backups,
because I'd get a checksum error before that happened, right?  Then I
could just restore from a previous backup.
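
At least, that's how I picture it, something like this toy model (plain
Python, nothing btrfs-specific; btrfs actually uses a per-block crc32c,
but the effect is the same): a block that fails its checksum produces a
read error instead of handing back rotten data, so the backup run fails
loudly instead of silently copying corruption.

  import zlib

  def checksummed_read(block, stored_crc):
      # Verify the stored checksum first; a mismatch surfaces as an
      # I/O error rather than returning the corrupted bytes.
      if zlib.crc32(block) != stored_crc:
          raise OSError("checksum mismatch: read fails, backup aborts")
      return block

  good = b"payload"
  crc = zlib.crc32(good)
  assert checksummed_read(good, crc) == good
  try:
      checksummed_read(b"paYload", crc)   # simulated bitrot
  except OSError as err:
      print(err)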