On Tue, Mar 25, 2014 at 12:13:50PM +0000, Martin wrote:
> On 25/03/14 01:49, Marc MERLIN wrote:
> > I had a tree with some amount of thousand files (less than 1 million)
> > on top of md raid5.
> >
> > It took 18H to rm it in 3 tries:
I ran another test after typing the original Email:
gargamel:/mnt/dshelf2/backup/polgara# time du -sh 20140312-feisty/; time find 20140312-feisty/ | wc -l
17G 20140312-feisty/
real 245m19.491s
user 0m2.108s
sys 1m0.508s
728507 <- number of files
real 11m41.853s <- 11 min to re-stat them when they should ideally all be in cache
user 0m1.040s
sys 0m4.360s
4 hours to stat 700K files. That's bad...
Even 11 minutes to re-stat them just to count them looks bad.
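
To put those two timings in per-entry terms, here is a small Python sketch. It
assumes the 728507 lines counted by find (files plus directories) are the same
entries du had to stat; the helper name parse_real is made up for illustration.

```python
def parse_real(ts):
    """Convert a `time` value such as '245m19.491s' to seconds."""
    minutes, seconds = ts.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

# Numbers copied from the transcript above.
entries = 728507                      # lines printed by `find | wc -l`
du_secs = parse_real("245m19.491s")   # cold run: du -sh
find_secs = parse_real("11m41.853s")  # second pass: find | wc -l

du_rate = entries / du_secs
find_rate = entries / find_secs

print(f"du:   {du_rate:.1f} entries/s, {1000 / du_rate:.1f} ms per entry")
print(f"find: {find_rate:.1f} entries/s, {1000 / find_rate:.2f} ms per entry")
```

That works out to roughly 50 entries per second for the cold du run (about
20 ms per entry) and roughly 1000 per second for the second pass.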
> > I checked that btrfs scrub is not running.
> > What else can I check from here?
>
> "noatime" set?
I have relatime:
gargamel:/mnt/dshelf2/backup/polgara# df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/dshelf2 7814041600 3026472436 4760588292 39% /mnt/dshelf2/backup
gargamel:/mnt/dshelf2/backup/polgara# grep /mnt/dshelf2/backup /proc/mounts
/dev/mapper/dshelf2 /mnt/dshelf2/backup btrfs rw,relatime,compress=lzo,space_cache 0 0
> What's your cpu hardware wait time?
Sorry, not sure how to get that.
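
One possible way to get an I/O wait figure, assuming that is what was meant by
"hardware wait time", is to sample the aggregate "cpu" line of /proc/stat twice
and look at the iowait counter (the fifth value). The sketch below is only an
illustration; the function names are made up, and the %iowait column printed by
iostat reports the same kind of number.

```python
import time

def cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")

def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples.

    Only the first eight counters are summed (user, nice, system, idle,
    iowait, irq, softirq, steal); guest time is already part of user/nice.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = cpu_times(f.read())
    return iowait_percent(before, after)
```

Calling sample_iowait() while the du is running would show how much of the
CPU time is spent waiting on the disks during that interval.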
> And is not *the 512kByte raid chunk* going to give you horrendous write
> amplification?! For example, rm updates a few bytes in one 4kByte
> metadata block and the system has to then do a read-modify-write on
> 512kBytes...
That's probably not great, but
1) rm -rf should bunch a lot of writes together before they start
hitting the block layer, so I'm not sure that is too much of a
problem with the caching layer in between
2) this does not explain 4H to just run du with relatime, which
shouldn't generate any writes, correct?
iostat seems to confirm:
gargamel:~# iostat /dev/md8 1 20
Linux 3.14.0-rc5-amd64-i915-preempt-20140216c (gargamel.svh.merlins.org) 03/25/2014 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
75.19 0.00 10.13 8.61 0.00 6.08
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md8 98.00 392.00 0.00 392 0
md8 96.00 384.00 0.00 384 0
md8 83.00 332.00 0.00 332 0
md8 153.00 612.00 0.00 612 0
md8 82.00 328.00 0.00 328 0
md8 55.00 220.00 0.00 220 0
md8 69.00 276.00 0.00 276 0
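
Reading those samples a bit more closely: dividing kB_read/s by tps gives the
average size of each read. The Python sketch below does that for the rows
above (the rows are copied from the iostat output; the function name summarize
is made up).

```python
IOSTAT_ROWS = """\
md8 98.00 392.00 0.00 392 0
md8 96.00 384.00 0.00 384 0
md8 83.00 332.00 0.00 332 0
md8 153.00 612.00 0.00 612 0
md8 82.00 328.00 0.00 328 0
md8 55.00 220.00 0.00 220 0
md8 69.00 276.00 0.00 276 0
"""

def summarize(rows):
    """Return (kB per transfer for each sample, mean tps, total kB/s written)."""
    sizes, tps_values, written = [], [], 0.0
    for line in rows.splitlines():
        _dev, tps, kb_read_s, kb_wrtn_s, *_ = line.split()
        tps_values.append(float(tps))
        sizes.append(float(kb_read_s) / float(tps))
        written += float(kb_wrtn_s)
    return sizes, sum(tps_values) / len(tps_values), written

sizes, mean_tps, written = summarize(IOSTAT_ROWS)
print("kB per read:", sizes)
print(f"mean tps: {mean_tps:.1f}, kB/s written: {written}")
```

Every sample comes out at 4 kB per transfer with no writes at all, at around
90 transfers per second on average. That would be consistent with the du being
limited by small scattered reads rather than by bandwidth or by writes, though
these few seconds of samples alone don't prove it.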
> Also, the 64MByte chunk bit-intent map will add a lot of head seeks to
> anything you do on that raid. (The map would be better on a separate SSD
> or other separate drive.)
That's true for writing, but not reading, right?
> So... That sort of setup is fine for archived data that is effectively
> read-only. You'll see poor performance for small writes/changes.
So I agree with you that the write case can be improved, especially since I also have a layer
of dmcrypt in the middle:
gargamel:/mnt/dshelf2/backup/polgara# cryptsetup luksDump /dev/md8
LUKS header information for /dev/md8
Cipher name: aes
Cipher mode: xts-plain64
Hash spec: sha1
Payload offset: 8192
(I used cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64)
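
For reference, the payload offset is counted in 512-byte sectors, so 8192
sectors is 4 MiB. Whether that lines up with the raid geometry depends on the
chunk size and on the number of data disks, which isn't stated in this thread.
The Python sketch below checks both; the data_disks values are hypothetical
examples, not the real array layout.

```python
SECTOR = 512  # bytes per sector as counted by the LUKS payload offset

def payload_alignment(payload_offset_sectors, chunk_kib, data_disks):
    """Return (offset in bytes, aligned to chunk?, aligned to full stripe?).

    data_disks is the number of data-bearing disks (raid5: members - 1).
    """
    offset = payload_offset_sectors * SECTOR
    chunk = chunk_kib * 1024
    stripe = chunk * data_disks
    return offset, offset % chunk == 0, offset % stripe == 0

# 8192 and 512 come from the thread; 4 and 3 data disks are assumptions.
for disks in (4, 3):
    offset, on_chunk, on_stripe = payload_alignment(8192, 512, disks)
    print(f"{disks} data disks: offset={offset} bytes, "
          f"chunk-aligned={on_chunk}, stripe-aligned={on_stripe}")
```

With a 512 KiB chunk the 4 MiB offset is always chunk-aligned; it is only
aligned to a full stripe when the stripe width divides 4 MiB evenly.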
I'm still not convinced that a lot of file IO doesn't get collated in memory
before hitting disk in bigger blocks, but maybe not.
If I were to recreate this array entirely, what would you use for the raid creation
and cryptsetup?
More generally, before I go through all that trouble (it will likely
take a week of copying data back and forth), I'd first like to debug why my
reads are so slow.
Thanks,
Marc
On Tue, Mar 25, 2014 at 02:57:57PM +0100, Xavier Nicollet wrote:
> On 25 March 2014 at 12:13, Martin wrote:
> > On 25/03/14 01:49, Marc MERLIN wrote:
> > > It took 18H to rm it in 3 tries:
>
> > And is not *the 512kByte raid chunk* going to give you horrendous write
> > amplification?! For example, rm updates a few bytes in one 4kByte
> > metadata block and the system has to then do a read-modify-write on
> > 512kBytes...
>
> My question would be naive, but would it be possible to have a syscall or something to do
> a fast "rm -rf" or du ?
Well, that wouldn't hurt either, even if it wouldn't address my underlying problem.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901