On 8/16/2012 2:52 AM, David Brown wrote:
> On 16/08/2012 07:58, Stan Hoeppner wrote:
>> On 8/15/2012 9:56 PM, vincent Ferrer wrote:
>>
>>> - My storage server has up to 8 cores running linux kernel 2.6.32.27.
>>> - I created a raid5 device of 10 SSDs.
>>> - It seems I only have single raid5 kernel thread, limiting my
>>> WRITE throughput to single cpu core/thread.
>>
>> The single write threads of md/RAID5/6/10 are being addressed by patches
>> in development. Read the list archives for progress/status. There were
>> 3 posts to the list today regarding the RAID5 patch.
>>
>>> Question : What are my options to make my raid5 thread use all the
>>> CPU cores ?
>>> My SSDs can do much more but single raid5 thread
>>> from mdadm is becoming the bottleneck.
>>>
>>> To overcome above single-thread-raid5 limitation (for now) I
>>> re-configured.
>>> 1) I partitioned each of my 10 SSDs into 8 partitions.
>>> 2) I created 8 raid5 arrays (8 raid5 threads), each built from
>>> one partition on each of the 10 SSDs.
>>> 3) My WRITE performance quadrupled because I have 8 RAID5
>>> threads.
>>> Question: Is this workaround a normal practice or may give me
>>> maintenance problems later on.
>>
>> No it is not normal practice. I 'preach' against it regularly when I
>> see OPs doing it. It's quite insane. The glaring maintenance problem
>> is that when one SSD fails, and at least one will, you'll have 8 arrays
>> to rebuild vs one. This may be acceptable to you, but not to the
>> general population. With rust drives, and real workloads, it tends to
>> hammer the drive heads prodigiously, increasing latency and killing
>> performance, and decreasing drive life. That's not an issue with SSD,
>> but multiple rebuilds is. That and simply keeping track of 80
>> partitions.
>>
>
> The rebuilds will, I believe, be done sequentially rather than in
> parallel. And each rebuild will take 1/8 of the time a full array
> rebuild would have done. So it really should not be much more time or
> wear-and-tear for a rebuild of this monster setup, compared to a single
> raid5 array rebuild. (With hard disks, it would be worse due to head
> seeks - but still not as bad as you imply, if I am right about the
> rebuilds being done sequentially.)
>
> However, there was a recent thread here about someone with a similar
> setup (on hard disks) who had a failure during such a rebuild and had
> lots of trouble. That makes me sceptical of this sort of multiple array
> setup (in addition to Stan's other points).
>
> And of course, all Stan's other points about maintenance, updates to
> later kernels with multiple raid5 threads, etc., still stand.
>
>> There are a couple of sane things you can do today to address your
>> problem:
>>
>> 1. Create a RAID50, a layered md/RAID0 over two 5 SSD md/RAID5 arrays.
>> This will double your threads and your IOPS. It won't be as fast as
>> your Frankenstein setup and you'll lose one SSD of capacity to
>> additional parity. However, it's sane, stable, doubles your
>> performance, and you have only one array to rebuild after an SSD
>> failure. Any filesystem will work well with it, including XFS if
>> aligned properly. It gives you an easy upgrade path: as soon as the
>> threaded patches hit, a simple kernel upgrade will give your two RAID5
>> arrays the extra threads, so you're simply out one SSD of capacity. You
>> won't need to, and probably won't want to rebuild the entire thing after
>> the patch. With the Frankenstein setup you'll be destroying and
>> rebuilding arrays. And if these are consumer grade SSDs, you're much
>> better off having two drives worth of redundancy anyway, so a RAID50
>> makes good sense all around.
>>
>> 2. Make 5 md/RAID1 mirrors and concatenate them with md/RAID linear.
>> You'll get one md write thread per RAID1 device utilizing 5 cores in
>> parallel. The linear driver doesn't use threads, but passes offsets to
>> the block layer, allowing infinite core scaling. Format the linear
>> device with XFS and mount with inode64. XFS has been fully threaded for
>> 15 years. Its allocation group design along with the inode64 allocator
>> allows near linear parallel scaling across a concatenated device[1],
>> assuming your workload/directory layout is designed for parallel file
>> throughput.
>>
>> #2, with a parallel write workload, may be competitive with your
>> Frankenstein setup in both IOPS and throughput, even with 3 fewer RAID
>> threads and 4 fewer SSD "spindles". It will outrun the RAID50 setup
>> like it's standing still. You'll lose half your capacity to redundancy
>> as with RAID10, but you'll have 5 write threads for md/RAID1, one per
>> SSD pair. One core should be plenty to drive a single SSD mirror, with
>> plenty of cycles to spare for actual applications, while sparing 3 cores
>> for apps as well. You'll get unlimited core scaling with both md/linear
>> and XFS. This setup will yield the best balance of IOPS and throughput
>> performance for the amount of cycles burned on IO, compared to
>> Frankenstein and the RAID50.
>
> For those that don't want to use XFS, or won't have balanced directories
> in their filesystem, or want greater throughput of larger files (rather
> than greater average throughput of multiple parallel accesses), you can
> also take your 5 raid1 mirror pairs and combine them with raid0. You
> should get similar scaling (the cpu does not limit raid0). For some
> applications (such as mail server, /home mount, etc.), the XFS over a
> linear concatenation is probably unbeatable. But for others (such as
> serving large media files), a raid0 over raid1 pairs could well be
> better. As always, it depends on your load - and you need to test with
> realistic loads or at least realistic simulations.

Sure, a homemade RAID10 would work, as it avoids the md/RAID10 single
write thread. I intentionally avoided mentioning this option for a few
reasons:

1. Anyone needing 10 SATA SSDs obviously has a parallel workload.
2. With a concat, any thread will have up to 200-500MB/s available
   (one SSD). I can't see a single thread needing 4.5GB/s of B/W.
   If one does, md/RAID isn't capable of it, not on COTS hardware.
3. With a parallel workload requiring this many SSDs, XFS is a must.
4. With a concat, mkfs.xfs is simple: no stripe aligning, etc.

~$ mkfs.xfs /dev/md0
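
For reference, a rough sketch of how the concat from option #2 could be
assembled: 5 md/RAID1 pairs, an md/linear device over them, then XFS
mounted with inode64. The device names (/dev/sda through /dev/sdj), the
md numbering and the /mnt/data mount point are assumptions for
illustration, not taken from the original system. The script is a dry
run: it only prints the commands, so they can be reviewed before
anything is executed.

```shell
#!/bin/sh
# Dry run: prints the commands instead of executing them.
# Device names, md numbers and mount point are hypothetical.

plan_concat() {
    # Arguments: an even number of SSD device paths, paired in order.
    n=1
    mirrors=""
    while [ "$#" -ge 2 ]; do
        # One md/RAID1 mirror (and thus one write thread) per SSD pair.
        echo "mdadm --create /dev/md$n --level=1 --raid-devices=2 $1 $2"
        mirrors="$mirrors /dev/md$n"
        shift 2
        n=$((n + 1))
    done
    # Concatenate the mirrors with md/linear; this becomes /dev/md0.
    echo "mdadm --create /dev/md0 --level=linear --raid-devices=$((n - 1))$mirrors"
    # No stripe alignment options are needed on a concat.
    echo "mkfs.xfs /dev/md0"
    echo "mount -o inode64 /dev/md0 /mnt/data"
}

plan_concat /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde \
            /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
```

Which SSDs get paired together is a judgment call; pairing drives that
sit on different controllers is one option if the hardware allows it.
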
--
Stan