On Fri, Jan 3, 2020 at 2:08 AM Leszek Dubiel <leszek@xxxxxxxxx> wrote:
>
> >> # iotop -d30
> >>
> >> Total DISK READ:   34.12 M/s | Total DISK WRITE:   40.36 M/s
> >> Current DISK READ: 34.12 M/s | Current DISK WRITE: 79.22 M/s
> >>   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO>    COMMAND
> >>  4596  be/4  root     34.12 M/s   37.79 M/s  0.00 %  91.77 %  btrfs
>
> > Not so bad for many small file reads and writes with HDD. I've see
> > this myself with single spindle when doing small file reads and
> > writes.

It's not small files directly. It's the number of write requests per
second, resulting in high-latency seeks. And the reason for the
seeking needs a second opinion, to be certain it's related to small
files.

I'm not really sure why there are hundreds of write requests per
second. Seems to me that with thousands of small files, Btrfs can
aggregate them into a single, mostly sequential write, and do the
same for metadata writes; yes, there is some back-and-forth seeking
since metadata and data block groups are in different physical
locations. But hundreds of times per second? Hmmm. I'm suspicious
why. It must be trying to read and write hundreds of small files *in
different locations*, causing the seeks and the ensuing latency.

The typical workaround for this these days is to add more disks or
add an SSD. If you add a fourth disk, you reduce your one bottleneck:

> root@wawel:~# btrfs dev usag /
> /dev/sda2, ID: 2
>    Device size:             5.45TiB
>    Device slack:              0.00B
>    Data,RAID1:              2.62TiB
>    Metadata,RAID1:         22.00GiB
>    Unallocated:             2.81TiB
>
> /dev/sdb2, ID: 3
>    Device size:             5.45TiB
>    Device slack:              0.00B
>    Data,RAID1:              2.62TiB
>    Metadata,RAID1:         21.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             2.81TiB
>
> /dev/sdc3, ID: 4
>    Device size:            10.90TiB
>    Device slack:              3.50KiB
>    Data,RAID1:              5.24TiB
>    Metadata,RAID1:         33.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             5.62TiB

OK, this is important. Two equal-size drives, and the third is much
larger. This means writes are going to be I/O bound on that single
large device: Btrfs allocates each RAID1 chunk to the two devices
with the most unallocated space, so the large drive lands in every
new chunk and is always written to. The reads get spread out
somewhat.

Again, maybe the everyday workload is the one to focus on, because
it's not such a big deal for a device replace to take overnight. Even
so, it would be good for everyone's use case if it turns out there's
some optimization possible to avoid hundreds of write requests per
second, just because of small files.

--
Chris Murphy
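
To put actual numbers on the write-request-rate theory above, iostat
from the sysstat package shows per-device request rates while the
workload runs; the interval and device names here are just examples
matching the layout above:

# iostat -dxm sda sdb sdc 5

The w/s column is write requests per second for each device; a
spinning disk sitting near 100 in %util while the average write
request size stays small is the seek-bound pattern described above.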
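
And if a fourth disk does get added, the usual sequence is a device
add followed by a balance, so that existing block groups get spread
across all four members; the partition name here is only a
placeholder:

# btrfs device add /dev/sdd2 /
# btrfs balance start /

A full balance on this much data will itself take a long time on
spinning disks, so it can be worth running it in stages with usage
filters (for example -dusage=50) to limit how much gets rewritten in
one pass.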
