Re: Raid10 performance issues during copy and balance, only half the spindles used for reading data

Hi Duncan,

Thank you for that explanation.

> The existing algorithm is a very simple even/odd PID-based algorithm.
> Thus, single-thread testing will indeed always read from the same side of
> the pair-mirror (since btrfs raid1 and raid10 are pair-mirrored, no N-way-
> mirroring available yet, tho it's the next new feature on the raid roadmap
> now that raid56 is essentially code-complete altho not yet well bug-
> flushed, with 3.19).  With a reasonably balanced mix of even/odd-PID
> readers, however, you should indeed get reasonable balanced read activity.

OK, that is really not a good design for production:
the evenness/oddness of a PID is not a good criterion for predicting which PIDs
will request I/O from the pool at a given time.
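
If I read it right, the selection boils down to something like the following
(a simplified sketch only, with made-up names, not the actual kernel code):

#include <sys/types.h>

/* Simplified sketch of the current selection: the reading task's PID
 * parity picks one side of the pair-mirror, so a single-threaded
 * reader always ends up on the same device.  Not the real btrfs code. */
static int pick_mirror_by_pid(pid_t pid, int num_copies)
{
        return (int)(pid % num_copies); /* 2 copies: even PID -> 0, odd PID -> 1 */
}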

In raid10 (and the N-way mirroring to come), all read requests from all PIDs should
probably be queued and spread across the available redundant stripesets
(or simple mirrors, if a "striped mirrors" 1+0 layout à la ZFS is used or were possible).
The same would apply to a pure-SSD pool, though any fancy "near/far/seek time"
considerations become obsolete there.
Something like that...
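
Just to make that idea concrete, a toy sketch of what I mean (all names
invented, nothing btrfs-specific):

/* Toy illustration: hand out queued read requests round-robin over the
 * redundant copies/stripesets instead of keying on the submitter's PID. */
struct read_dispatcher {
        unsigned int next_copy;   /* copy that receives the next read */
        unsigned int num_copies;  /* number of redundant copies/stripesets */
};

static unsigned int dispatch_read(struct read_dispatcher *d)
{
        unsigned int copy = d->next_copy;

        d->next_copy = (d->next_copy + 1) % d->num_copies;
        return copy;
}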

An option parameter, analogous to the (single-spindle, pre-SSD) I/O scheduler choices, could look like this:

elevator=cfq
(for btrfs: "try to balance reads between devices via a common read queue", i.e. max out all
resources and distribute them fairly to the requesting applications; optimal for large reads,
and for several of them running in parallel, such as balance / send/receive / copy/tar from
the pool to an external backup at the same time)

elevator=noop
(assign by even/odd PID, the current behaviour, useful for testing)

elevator=jumpy
(e.g. assign a read to the stripeset that currently has the smallest number of other reads on it,
and every random x seconds switch stripesets if the number of "customers" on the other 1..N
redundant stripesets has decreased, similar to core switching of a long-running process in the
kernel; optimised for smaller r/w operations such as many users accessing a central server;
a rough sketch follows below)

etc.
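
The "jumpy" idea could look roughly like this (again only a sketch with
invented names, no relation to actual btrfs structures):

/* Sketch of "jumpy": send a new read to whichever copy currently has
 * the fewest reads in flight. */
struct copy_state {
        unsigned int inflight_reads;  /* reads currently queued on this copy */
};

static unsigned int pick_least_loaded(const struct copy_state *copies,
                                      unsigned int num_copies)
{
        unsigned int best = 0;
        unsigned int i;

        for (i = 1; i < num_copies; i++)
                if (copies[i].inflight_reads < copies[best].inflight_reads)
                        best = i;
        return best;
}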

That would leave room to experiment in the years until 2020, as you outlined, and to review
whether mdadm's raid10 near/far/offset layouts are still worth considering when most future
storage will be non-rotational...
Some kind of self-optimisation should also be included: if the filesystem knew
how much data is to be read, it could decide whether it makes sense to try the different
methods and pick the fastest, much like GParted's block-size adaptation (a rough sketch of
what I mean follows below)...
Again, it would be interesting to see whether the impact on non-rotational storage is insignificant.
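
And the self-tuning part, as a very rough sketch (the policy enum and the
measurement callback are invented purely for illustration):

/* Very rough sketch of self-tuning: time a fixed sample workload under
 * each candidate read policy and keep whichever was fastest. */
enum read_policy {
        POLICY_PID_PARITY,    /* current even/odd behaviour */
        POLICY_ROUND_ROBIN,   /* "cfq"-like common queue */
        POLICY_LEAST_LOADED,  /* "jumpy" */
        POLICY_COUNT
};

/* supplied by the caller: run the sample workload, return elapsed nanoseconds */
typedef unsigned long long (*measure_fn)(enum read_policy p);

static enum read_policy pick_fastest_policy(measure_fn measure)
{
        enum read_policy best = POLICY_PID_PARITY;
        unsigned long long best_ns = measure(best);
        int p;

        for (p = POLICY_PID_PARITY + 1; p < POLICY_COUNT; p++) {
                unsigned long long ns = measure((enum read_policy)p);

                if (ns < best_ns) {
                        best_ns = ns;
                        best = (enum read_policy)p;
                }
        }
        return best;
}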

In my use case it is a simple rsync, cp -a or nemo/nautilus copy between ZFS and btrfs pools.
Those are single-threaded, I guess, and I understand that btrfs has nothing like the ton of
z_read processes that probably accounts for the "flying" ZFS reads on a 6-disk raidz2, 3x2 or
2x3 vdev layout compared to the btrfs reads.


I still find it strange that a balance also only uses half the spindles,
but it is explainable if the same logic is used as for any other read from
the array. At least scrub reads all the data and not just one copy ;-)

Interestingly enough, all my other btrfs filesystems are single-SSD for the operating system, with auto-snapshots so I can revert...
and one is a 2-disk raid0 for throwaway data, so I never had a setup that would expose this behaviour...

Goodbye,

Sven.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



