On Thu, Jan 02, 2020 at 04:22:37PM -0700, Chris Murphy wrote:
> On Thu, Jan 2, 2020 at 3:39 PM Leszek Dubiel <leszek@xxxxxxxxx> wrote:
> >
> > > Almost no reads, all writes, but slow. And rather high write requests
> > > per second, almost double for sdc. And sdc is near its max
> > > utilization so it might be near its iops limit?
> >
> > > ~210 rareq-sz = 210KiB is the average size of the read request for
> > > sda and sdb
> >
> > > Default mkfs and default mount options? Or other and if so what other?
> > >
> > > Many small files on this file system? Or possibly large files with a
> > > lot of fragmentation?
> >
> > Default mkfs and default mount options.
> >
> > This system could have a few million (!) small files.
> > On reiserfs it takes about 40 minutes to do "find /".
> > Rsync runs for 6 hours to back up the data.
>
> There is a mount option: max_inline=<bytes> which the man page says
> (default: min(2048, page size) )

It's half the page size, per a commit from some years ago.  For
compressed files the limit applies to the compressed data size (i.e. you
can have a 4095-byte inline file with max_inline=2048 due to the
compression).

> I've never used it, so in theory the max_inline byte size is 2KiB.
> However, I have seen substantially larger inline extents than 2KiB
> when using a nodesize larger than 16KiB at mkfs time.
>
> I've wondered whether it makes any difference for the "many small
> files" case to do more aggressive inlining of extents.
>
> I've seen with 16KiB leaf size, often small files that could be
> inlined are instead put into a data block group, taking up a minimum
> 4KiB block size (on x86_64 anyway). I'm not sure why, but I suspect
> there just isn't enough room in that leaf to always use inline
> extents, and yet there is enough room to just reference it as a data
> block group extent. When using a larger node size, a larger percentage
> of small files ended up using inline extents. I'd expect this to be
> quite a bit more efficient, because it eliminates a time-expensive (on
> HDD anyway) seek.

Putting a lot of inline file data into metadata pages makes them less
dense, which is either good or bad depending on which bottleneck you're
currently hitting.

If you have snapshots, there is an up-to-300x metadata write
amplification penalty to update extent item references every time a
shared metadata page is unshared.  Inline extents reduce that write
amplification.

On the other hand, if you are doing a lot of 'find'-style tree sweeps,
then inline extents will reduce their efficiency, because more pages
will have to be read to scan the same number of dirents and inodes.

For workloads that reiserfs was good at, there's no reliable rule of
thumb to guess which is better--you have to try both and measure the
results.

> Another optimization is compress=zstd:1, which is the lowest
> compression setting. That'll increase the chance a file can use inline
> extents, in particular with a larger nodesize.
>
> And still another optimization, at the expense of much more
> complexity, is LVM cache with an SSD. You'd have to pick a suitable
> policy for the workload, but I expect that if the iostat utilization
> you see is often near max in normal operation, you'll see improved
> performance. SSDs can handle way higher iops than HDDs. But a lot of
> this optimization stuff is use case specific. I'm not even sure what
> your mean small file size is.
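To experiment with more aggressive inlining (before getting to the cache
question below), the knobs are the mkfs-time nodesize plus a couple of
mount options.  A rough, untested sketch--/dev/sdX and /mnt are
placeholders, and nodesize can only be set at mkfs time, so this means a
fresh filesystem and a restore from backup:

    mkfs.btrfs --nodesize 64k /dev/sdX      # 64KiB is the largest nodesize btrfs supports
    mount -o compress=zstd:1,max_inline=2048 /dev/sdX /mnt

On a 4KiB-page machine max_inline=2048 is just the current default
spelled out explicitly; the variables worth testing here are the
nodesize and compression, since both change how many of the small files
end up as inline extents.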
I've found an interesting result in cache configuration testing: btrfs's
writes with datacow seem to be very well optimized, to the point that
adding a writeback SSD cache between btrfs and a HDD makes btrfs commits
significantly slower.

A writeback cache adds latency to the write path without removing many
seeks--btrfs already does writes in big contiguous bursts--so the extra
latency makes the writeback cache slow compared to writethrough.  A
writethrough SSD cache helps with reads (which are very seeky and
benefit a lot from caching) without adding latency to writes, and btrfs
reads a _lot_ during commits.  (There's a rough lvmcache sketch at the
end of this mail.)

> > # iotop -d30
> >
> > Total DISK READ:   34.12 M/s | Total DISK WRITE:   40.36 M/s
> > Current DISK READ: 34.12 M/s | Current DISK WRITE: 79.22 M/s
> >   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
> >  4596 be/4  root      34.12 M/s  37.79 M/s   0.00 %  91.77 %  btrfs
>
> Not so bad for many small file reads and writes with HDD. I've seen
> this myself with a single spindle when doing small file reads and
> writes.
>
> --
> Chris Murphy
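As for trying lvmcache: the cache mode is just an lvconvert/lvchange
option, so it is cheap to flip between writethrough and writeback and
measure both.  A rough sketch with placeholder names (vg0 is the VG,
vg0/data the slow LV on the HDDs, /dev/sdX the SSD)--sizes are
arbitrary, not tuned for this workload:

    lvcreate -L 100G -n cache0 vg0 /dev/sdX
    lvcreate -L 1G -n cache0meta vg0 /dev/sdX
    lvconvert --type cache-pool --poolmetadata vg0/cache0meta vg0/cache0
    lvconvert --type cache --cachepool vg0/cache0 --cachemode writethrough vg0/data

    # flip modes later to compare:
    lvchange --cachemode writeback vg0/data
    lvchange --cachemode writethrough vg0/data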
