On Tue, Nov 15, 2016 at 07:26:53AM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 16:10, Zygo Blaxell wrote:
> >Why is deduplicating thousands of blocks of data crazy? I already
> >deduplicate four orders of magnitude more than that per week.
> You missed the 'tiny' quantifier. I'm talking really small blocks, on the
> order of less than 64k (so, IOW, stuff that's not much bigger than a few
> filesystem blocks), and that is somewhat crazy because it not only ends up
> taking _really_ long compared to larger chunks (because you're running
> more independent hashes than with bigger blocks), but also often splits
> extents unnecessarily and contributes to fragmentation, which will lead to
> all kinds of other performance problems on the FS.
Like I said, millions of extents per week...
64K is an enormous dedup block size, especially if it comes with a 64K
alignment constraint as well.
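For what it's worth, the kernel interface itself doesn't impose anything
close to 64K: the generic FIDEDUPERANGE ioctl (the VFS version of the older
BTRFS_IOC_FILE_EXTENT_SAME) accepts any offset and length aligned to the
filesystem block size, which is 4K here. A minimal sketch of a single dedup
op--written for this mail as an illustration, not lifted from any existing
dedup tool:

/* Dedup one range of SRC against one range of DST via FIDEDUPERANGE.
 * Offsets and length are assumed to be aligned to the filesystem block
 * size (4K on this filesystem); nothing here requires 64K alignment.
 * usage: dedupe-one SRC DST SRC_OFF DST_OFF LEN
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>        /* FIDEDUPERANGE, struct file_dedupe_range */

static int dedupe_range(int src_fd, __u64 src_off,
                        int dst_fd, __u64 dst_off, __u64 len)
{
        /* one struct file_dedupe_range followed by one dest info entry */
        struct file_dedupe_range *args =
                calloc(1, sizeof(*args) + sizeof(args->info[0]));
        int ret;

        if (!args)
                return -1;
        args->src_offset = src_off;
        args->src_length = len;
        args->dest_count = 1;
        args->info[0].dest_fd = dst_fd;
        args->info[0].dest_offset = dst_off;

        ret = ioctl(src_fd, FIDEDUPERANGE, args);
        if (ret < 0)
                perror("FIDEDUPERANGE");
        else if (args->info[0].status == FILE_DEDUPE_RANGE_SAME)
                printf("deduped %llu bytes\n",
                       (unsigned long long)args->info[0].bytes_deduped);
        else
                fprintf(stderr, "not deduped, status %d\n",
                        (int)args->info[0].status);
        free(args);
        return ret;
}

int main(int argc, char **argv)
{
        if (argc != 6) {
                fprintf(stderr, "usage: %s SRC DST SRC_OFF DST_OFF LEN\n",
                        argv[0]);
                return 1;
        }
        /* dest files must be open for writing; the source can be read-only */
        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_RDWR);

        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }
        return dedupe_range(src, strtoull(argv[3], NULL, 0),
                            dst, strtoull(argv[4], NULL, 0),
                            strtoull(argv[5], NULL, 0)) ? 1 : 0;
}

The kernel locks and compares both ranges itself before sharing anything,
so the only granularity that matters to the interface is the filesystem
block size.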
These are the top ten duplicate block sizes from a sample of 95251
dedup ops on a medium-sized production server with a 4TB filesystem
(about one machine-day of data):
 total bytes  extent count  dup size
  2750808064         20987    131072
   803733504          1533    524288
   123801600           975    126976
   103575552          8429     12288
    97443840           793    122880
    82051072         10016      8192
    77492224         18919      4096
    71331840           645    110592
    64143360           540    118784
    63897600           650     98304

   all bytes   all extents  average dup size
  6129995776         95251             64356
128K and 512K are the most common sizes because of btrfs extent size
limits: compression caps compressed extents at 128K, and uncompressed
extents seem to be capped at 512K for some reason. 12K is #4, and three
of the top ten sizes are below 16K. The average dup size is just a
little below 64K.
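Those extent size caps are easy enough to see directly: FIEMAP reports the
logical length of every extent in a file, so anything written with
compression shows up as a run of extents no longer than 128K. filefrag -v
does the same job; the sketch below is just the raw ioctl, hand-written for
this mail as an illustration:

/* Print the logical length of each extent of FILE via FS_IOC_FIEMAP.
 * On a btrfs file written with compression the lengths max out at 128K.
 * Only the first 512 extents are shown; loop with fm_start advanced
 * past the last extent if you care about bigger files.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>        /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>    /* struct fiemap, struct fiemap_extent */

#define MAX_EXTENTS 512

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s FILE\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   MAX_EXTENTS * sizeof(struct fiemap_extent));
        if (!fm)
                return 1;
        fm->fm_length = FIEMAP_MAX_OFFSET;      /* whole file */
        fm->fm_extent_count = MAX_EXTENTS;
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush delalloc first */

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("FIEMAP");
                return 1;
        }

        for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
                printf("logical %llu length %llu flags 0x%x\n",
                       (unsigned long long)fm->fm_extents[i].fe_logical,
                       (unsigned long long)fm->fm_extents[i].fe_length,
                       fm->fm_extents[i].fe_flags);
        free(fm);
        return 0;
}

Compressed extents carry the FIEMAP_EXTENT_ENCODED flag, so they're easy to
tell apart from the uncompressed ones in the output.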
These are the duplicates with block sizes of 64K and smaller:
 total bytes  extent count  extent size
    41615360           635        65536
    46264320           753        61440
    45817856           799        57344
    41267200           775        53248
    45760512           931        49152
    46948352          1042        45056
    43417600          1060        40960
    47296512          1283        36864
    59277312          1809        32768
    49029120          1710        28672
    43745280          1780        24576
    53616640          2618        20480
    43466752          2653        16384
   103575552          8429        12288
    82051072         10016         8192
    77492224         18919         4096

 all bytes <=64K   extents <=64K   average dup size <=64K
       870641664           55212                    15769
14% of my duplicate bytes (870641664 of 6129995776) are in blocks smaller
than 64K, or in blocks not aligned to a 64K boundary within a file. That's
too large a space saving to ignore on machines with constrained storage.
It may be worthwhile skipping 4K and 8K dedups--at 250 ms per dedup,
they're 30% of the total run time and only 2.6% of the total dedup bytes.
On the other hand, this machine is already deduping everything fast enough
to keep up with new data, so there's no performance problem to solve here.
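The 30% / 2.6% figures fall straight out of the tables above--the per-op
cost is flat, so the time share is just the op-count share. A trivial
check, with the numbers hard-coded from this mail:

/* Sanity-check the 4K+8K share of dedup run time and dedup bytes,
 * using the per-size totals from the tables above. */
#include <stdio.h>

int main(void)
{
        const double small_ops   = 18919 + 10016;              /* 4K + 8K ops */
        const double total_ops   = 95251;
        const double small_bytes = 77492224.0 + 82051072.0;    /* 4K + 8K bytes */
        const double total_bytes = 6129995776.0;

        printf("run time share:   %.1f%%\n", 100.0 * small_ops / total_ops);
        printf("dedup byte share: %.1f%%\n", 100.0 * small_bytes / total_bytes);
        return 0;
}

which prints 30.4% and 2.6%.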
