On 6/23/12 10:38 AM, Sage Weil wrote:
> On Fri, 22 Jun 2012, Alexandre DERUMIER wrote:
>> Hi Sage, thanks for your response.
>>
>>> If you turn off the journal completely, you will see bursty write commits from the perspective of the client, because the OSD is periodically doing a sync or snapshot and only acking the writes then. If you enable the journal, the OSD will reply with a commit as soon as the write is stable in the journal. That's one reason why it is there: file system commits are heavyweight and slow.
>>
>> Yes, of course, I don't want to deactivate the journal; using a journal on a fast SSD or NVRAM is the right way.
>>
>>> If we left the file system to its own devices and did a sync every 10 seconds, the disk would sit idle while a bunch of dirty data accumulated in cache, and then the sync/snapshot would take a really long time. This is horribly inefficient (the disk is idle half the time), and useless (the delayed write behavior makes sense for local workloads, but not servers where there is a client on the other end batching its writes). To prevent this, 'filestore flusher' will prod the kernel to flush out any written data to the disk quickly. Then, when we get around to doing the sync/snapshot it is pretty quick, because only fs metadata and just-written data needs to be flushed.
>>
>> Hmm, I disagree. If you flush quickly, it works fine with a sequential write workload. But if you have a lot of random writes with 4k blocks, for example, you are going to have a lot of disk seeks. The way ZFS or NetApp SAN storage works, they take random writes into a fast journal and then flush them sequentially every 20s to slow storage.
>
> Oh, I see what you're getting at. Yes, that is not ideal for small random writes. There is a branch in ceph.git called wip-flushmin that just sets a minimum write size for the flush that will probably do a decent job of dealing with this: small writes won't get flushed, large ones will. Picking the right value will depend on how expensive seeks are for your storage system.
>
> You'll want to cherry-pick just the top commit on top of whatever it is you're running...
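
To make the minimum-flush-size idea concrete, here is a rough sketch in Python. It is not the code in wip-flushmin; the `FlushPolicy` class, the `min_flush_bytes` parameter, and the 64 KiB value in the usage note are all made up for illustration. It only models the decision described above: writes at or above the threshold are flushed right away, smaller ones are left for the periodic sync/snapshot.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FlushPolicy:
    """Hypothetical model of a minimum-write-size flush threshold."""

    # Assumed tunable: writes smaller than this are not flushed eagerly.
    min_flush_bytes: int
    flushed: List[Tuple[int, int]] = field(default_factory=list)
    deferred: List[Tuple[int, int]] = field(default_factory=list)

    def on_write(self, offset: int, length: int) -> bool:
        """Record a write; return True if it should be flushed immediately."""
        if length >= self.min_flush_bytes:
            # Large write: worth pushing to disk now.
            self.flushed.append((offset, length))
            return True
        # Small write: leave it dirty in cache until the next sync.
        self.deferred.append((offset, length))
        return False

    def on_sync(self) -> List[Tuple[int, int]]:
        """Periodic sync/snapshot: return and clear the deferred writes."""
        pending = list(self.deferred)
        self.deferred.clear()
        return pending
```

With `FlushPolicy(min_flush_bytes=65536)`, a 4k random write is deferred and a 1 MB write is flushed immediately. Where the threshold should sit depends on how expensive seeks are on the underlying storage, as noted above.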
I was just talking with Elder on IRC yesterday about looking into how much small network transfers are hurting us in cases like these. Even with SSD-based OSDs I haven't seen a very dramatic improvement in small request performance. How tough would it be to aggregate requests into larger network transactions? There would be a latency penalty, of course, but we could flush a client-side dirty cache pretty quickly and still benefit if we are getting bombarded with lots of tiny requests.
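
As a rough illustration of the aggregation idea, here is a sketch in Python of a client-side batcher. Nothing in it is existing Ceph code: `RequestBatcher`, `max_bytes`, `max_delay`, and the `send` callback are hypothetical names. Small requests are buffered and sent as one batch once enough bytes have accumulated or the oldest request has waited too long, which bounds the added latency.

```python
import time
from typing import Callable, List, Optional


class RequestBatcher:
    """Hypothetical client-side aggregator for small requests."""

    def __init__(self, send: Callable[[List[bytes]], None],
                 max_bytes: int = 64 * 1024,
                 max_delay: float = 0.002,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send            # called with one list of payloads per batch
        self._max_bytes = max_bytes  # size trigger (assumed value)
        self._max_delay = max_delay  # latency bound in seconds (assumed value)
        self._clock = clock
        self._pending: List[bytes] = []
        self._pending_bytes = 0
        self._oldest: Optional[float] = None

    def submit(self, payload: bytes) -> None:
        """Queue a request; flush right away if the batch is large enough."""
        if self._oldest is None:
            self._oldest = self._clock()
        self._pending.append(payload)
        self._pending_bytes += len(payload)
        if self._pending_bytes >= self._max_bytes:
            self.flush()

    def poll(self) -> None:
        """Call periodically; flush if the oldest request has waited too long."""
        if self._oldest is not None and \
                self._clock() - self._oldest >= self._max_delay:
            self.flush()

    def flush(self) -> None:
        """Send everything queued so far as a single batch."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._pending_bytes = 0
        self._oldest = None
        self._send(batch)
```

The size trigger helps when the client is bombarded with tiny requests; the delay trigger keeps a lone request from sitting in the cache for long.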
Mark