Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)




Hi,
maybe this can help:

I have tuned
filestore queue max ops = 50000

now I'm able to achieve 4000 io/s (with some spikes)

with 3 nodes with 1 osd each (1 x 15k drive per osd), journal on tmpfs
or
3 nodes with 5 osds each (1 x 15k drive per osd), journal on tmpfs

same result for both configurations.
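For reference, this kind of tuning goes in the [osd] section of ceph.conf (a minimal sketch; only the queue setting is the one actually tested above, and the journal path is just an illustration of the tmpfs setup):

```ini
[osd]
    ; raise the filestore op queue depth (the value tested above)
    filestore queue max ops = 50000
    ; journal on tmpfs, as in the test setup (path is an example)
    osd journal = /mnt/tmpfs/osd.$id.journal
```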




----- Original Message ----- 

From: "Alexandre DERUMIER" <aderumier@xxxxxxxxx> 
To: "Sage Weil" <sage@xxxxxxxxxxx> 
Cc: "Mark Nelson" <mark.nelson@xxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, "Stefan Priebe" <s.priebe@xxxxxxxxxxxx> 
Sent: Sunday, 24 June 2012 10:10:48 
Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter) 

ok, I have done tests with more than 1 client. 

3 kvm guests on 3 different kvm host servers, and 3 kvm guests on the same server. 

I get the same result, around 2000 io/s shared between the clients. So it doesn't scale. 


I have also tried with 3 nodes x 5 osds (1 x 15k drive per osd) + 5 tmpfs journals 
and 3 nodes x 1 osd (hardware raid0 of 5 x 15k disks) + 1 tmpfs journal 

the results are the same, around 2000 io/s 


But 
if I try with 3 nodes x 1 osd with a single 15k drive, 
I get around 500 io/s 


I also know that Stefan Priebe has achieved around 12000 io/s with SSDs as OSDs. 

So it seems related to OSD drive speed. 

So, are we sure that the journal is acking to the client before flushing to disk? 



The benchmark used is fio with direct I/O, writing 100MB in 4K blocks (so the journal is big enough to handle all the writes). 

fio --filename=[disk] --direct=1 --rw=randwrite --bs=4k --size=100M --numjobs=50 --runtime=30 --group_reporting --name=file1 
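As a rough sanity check of that claim (a minimal sketch, using only the figures above):

```python
# Rough sanity check: total IOs in the 100MB fio run, and how long
# the run takes at the ~2000 io/s observed above.
size_mb = 100
block_kb = 4

total_ios = (size_mb * 1024) // block_kb   # 100 MB split into 4 KB writes
runtime_s = total_ios / 2000               # at the observed ~2000 io/s

print(total_ios)   # 25600 IOs, so a journal >= 100MB absorbs the whole run
print(runtime_s)   # 12.8 s, well inside the 30 s fio runtime
```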


----- Original Message ----- 

From: "Alexandre DERUMIER" <aderumier@xxxxxxxxx> 
To: "Sage Weil" <sage@xxxxxxxxxxx> 
Cc: "Mark Nelson" <mark.nelson@xxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, "Stefan Priebe" <s.priebe@xxxxxxxxxxxx> 
Sent: Saturday, 23 June 2012 20:21:05 
Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter) 

>>Is that 2000 ios from a single client? You might try multiple clients and 
>>see if the sum of the ios will scale any higher. 

yes, from a single client (a qemu-kvm guest). 

Tomorrow I'll retry with 3 qemu-kvm guests, on the same host and on 3 different hosts. 
I'll also try on a machine with more CPU to compare (I see a lot of CPU used by my kvm guest process, more than with iscsi). 

I'll keep you in touch. 

Thanks 

Alexandre 

----- Original Message ----- 

From: "Sage Weil" <sage@xxxxxxxxxxx> 
To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx> 
Cc: "Mark Nelson" <mark.nelson@xxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, "Stefan Priebe" <s.priebe@xxxxxxxxxxxx> 
Sent: Saturday, 23 June 2012 20:12:49 
Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter) 

On Sat, 23 Jun 2012, Alexandre DERUMIER wrote: 
> >>I was just talking with Elder on IRC yesterday about looking into how 
> >>much small network transfers are hurting us in cases like these. Even 
> >>with SSD based OSDs I haven't seen a very dramatic improvement in small 
> >>request performance. How tough would it be to aggregate requests into 
> >>larger network transactions? There would be a latency penalty of 
> >>course, but we could flush a client side dirty cache pretty quickly and 
> >>still benefit if we are getting bombarded with lots of tiny requests. 
> 
> Yes, I see no improvement with the journal on tmpfs ... this is strange. 
> 
> Also, I have tried with rbd_cache=true, so IOs should already be aggregated into bigger transactions. 
> But I didn't see any improvement. 
> 
> I'm around 2000 io/s. 
> 
> Do you know what the bottleneck is? The rbd protocol (some kind of overhead 
> for each io?) 

Is that 2000 ios from a single client? You might try multiple clients and 
see if the sum of the ios will scale any higher. That will tell us 
whether it is in the messenger or osd request pipeline. The latter 
definitely needs some work, although there may be a quick fix to the msgr 
that will buy us some too. 

sage 


> 
> 
> ----- Original Message ----- 
> 
> From: "Mark Nelson" <mark.nelson@xxxxxxxxxxx> 
> To: "Sage Weil" <sage@xxxxxxxxxxx> 
> Cc: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, "Stefan Priebe" <s.priebe@xxxxxxxxxxxx> 
> Sent: Saturday, 23 June 2012 18:40:28 
> Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter) 
> 
> On 6/23/12 10:38 AM, Sage Weil wrote: 
> > On Fri, 22 Jun 2012, Alexandre DERUMIER wrote: 
> >> Hi Sage, 
> >> thanks for your response. 
> >> 
> >>>> If you turn off the journal completely, you will see bursty write commits 
> >>>> from the perspective of the client, because the OSD is periodically doing 
> >>>> a sync or snapshot and only acking the writes then. 
> >>>> If you enable the journal, the OSD will reply with a commit as soon as the 
> >>>> write is stable in the journal. That's one reason why it is there--file 
> >>>> system commits are heavyweight and slow. 
> >> 
> >> Yes of course, I don't want to disable the journal; using a journal on a fast SSD or NVRAM is the right way. 
> >> 
> >>>> If we left the file system to its own devices and did a sync every 10 
> >>>> seconds, the disk would sit idle while a bunch of dirty data accumulated 
> >>>> in cache, and then the sync/snapshot would take a really long time. This 
> >>>> is horribly inefficient (the disk is idle half the time), and useless (the 
> >>>> delayed write behavior makes sense for local workloads, but not servers 
> >>>> where there is a client on the other end batching its writes). To prevent 
> >>>> this, 'filestore flusher' will prod the kernel to flush out any written 
> >>>> data to the disk quickly. Then, when we get around to doing the 
> >>>> sync/snapshot it is pretty quick, because only fs metadata and 
> >>>> just-written data needs to be flushed. 
> >> 
> >> mmm, I disagree. 
> >> 
> >> If you flush quickly, it works fine with a sequential write workload. 
> >> 
> >> But if you have a lot of random writes with 4k blocks, for example, you are 
> >> going to have a lot of disk seeks. The way zfs or netapp san 
> >> storage works, they take random writes into a fast journal and then flush them 
> >> sequentially every 20s to slow storage. 
> > 
> > Oh, I see what you're getting at. Yes, that is not ideal for small random 
> > writes. There is a branch in ceph.git called wip-flushmin that just sets 
> > a minimum write size for the flush that will probably do a decent job of 
> > dealing with this: small writes won't get flushed, large ones will. 
> > Picking the right value will depend on how expensive seeks are for your 
> > storage system. 
> > 
> > You'll want to cherry-pick just the top commit on top of whatever it is 
> > you're running... 
> 
> I was just talking with Elder on IRC yesterday about looking into how 
> much small network transfers are hurting us in cases like these. Even 
> with SSD based OSDs I haven't seen a very dramatic improvement in small 
> request performance. How tough would it be to aggregate requests into 
> larger network transactions? There would be a latency penalty of 
> course, but we could flush a client side dirty cache pretty quickly and 
> still benefit if we are getting bombarded with lots of tiny requests. 
> 
> Mark 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@xxxxxxxxxxxxxxx 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
> 
> 
> -- 
> Alexandre D e rumier 
> 
> Ingénieur Systèmes et Réseaux 
> 
> 
> Fixe : 03 20 68 88 85 
> 
> Fax : 03 20 68 90 88 
> 
> 
> 45 Bvd du Général Leclerc 59100 Roubaix 
> 12 rue Marivaux 75002 Paris 

