On 2016-12-08 21:54, Chris Murphy wrote:
On Thu, Dec 8, 2016 at 7:26 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
On Thu, Dec 08, 2016 at 05:45:40PM -0700, Chris Murphy wrote:
OK, something's wrong.
Kernel 4.8.12 and duperemove v0.11.beta4. Brand new file system
(mkfs.btrfs -dsingle -msingle, default mount options) and two
identical files separately copied.
[chris@f25s]$ ls -li /mnt/test
total 2811904
260 -rw-r--r--. 1 root root 1439694848 Dec 8 17:26 Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso
259 -rw-r--r--. 1 root root 1439694848 Dec 8 17:26 Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso2
[chris@f25s]$ filefrag /mnt/test/*
/mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso: 3 extents found
/mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso2: 2 extents found
[chris@f25s duperemove]$ sudo ./duperemove -dv /mnt/test/*
Using 128K blocks
Using hash: murmur3
Gathering file list...
Using 4 threads for file hashing phase
[1/2] (50.00%) csum: /mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso
[2/2] (100.00%) csum: /mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso2
Total files: 2
Total hashes: 21968
Loading only duplicated hashes from hashfile.
Using 4 threads for dedupe phase
[0xba8400] (00001/10947) Try to dedupe extents with id e47862ea
[0xba84a0] (00003/10947) Try to dedupe extents with id ffed44f2
[0xba84f0] (00002/10947) Try to dedupe extents with id ffeefcdd
[0xba8540] (00004/10947) Try to dedupe extents with id ffe4cf64
[0xba8540] Add extent for file "/mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso" at offset 1182924800 (4)
[0xba8540] Add extent for file "/mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso2" at offset 1182924800 (5)
[0xba8540] Dedupe 1 extents (id: ffe4cf64) with target: (1182924800, 131072), "/mnt/test/Fedora-Workstation-Live-x86_64-25_Beta-1.1.iso"
Ew, it's deduping these two 1.4GB files 128K at a time, which results
in roughly 11,000 ioctl calls (10947, per the log above). Each of those
calls has to lock the two inodes, read the file contents, remap the
blocks, and so on, instead of finding the maximal identical range and
making a single call for the whole range.
That's probably why it's taking forever to dedupe.
Yes, but it looks like it's also heavily fragmenting the files as a
result.
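For comparison, here's what one call over the whole range looks like.
This is a minimal sketch, not duperemove's actual code, assuming Linux
4.5 or later where linux/fs.h exposes the VFS FIDEDUPERANGE ioctl; as
far as I know btrfs clamps a single call to 16MiB, hence the loop on
bytes_deduped (still roughly 86 calls for a 1.4GB file, not ~11000):

/*
 * Minimal sketch: dedupe <dest> against <src> with whole-range
 * FIDEDUPERANGE calls instead of one call per 128K block.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    struct stat st;
    int src, dst;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dest>\n", argv[0]);
        return 1;
    }
    src = open(argv[1], O_RDONLY);
    dst = open(argv[2], O_RDWR);
    if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
        perror("open/fstat");
        return 1;
    }

    /* One destination per call; info[] is a flexible array member. */
    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
    if (!r)
        return 1;
    r->dest_count = 1;
    r->info[0].dest_fd = dst;

    __u64 off = 0;
    while (off < (__u64)st.st_size) {
        r->src_offset = off;
        r->src_length = (__u64)st.st_size - off;  /* kernel may clamp */
        r->info[0].dest_offset = off;
        r->info[0].bytes_deduped = 0;  /* kernel accumulates into this */
        if (ioctl(src, FIDEDUPERANGE, r) < 0) {
            perror("FIDEDUPERANGE");
            return 1;
        }
        if (r->info[0].status == FILE_DEDUPE_RANGE_DIFFERS) {
            fprintf(stderr, "contents differ at offset %llu\n",
                    (unsigned long long)off);
            return 1;
        }
        if (r->info[0].bytes_deduped == 0)
            break;  /* kernel made no progress; stop */
        off += r->info[0].bytes_deduped;
    }
    printf("deduped %llu bytes\n", (unsigned long long)off);
    return 0;
}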
This kind of reinforces what I've been telling people recently, namely
that while generic batch deduplication generally works, it's quite often
better to write a custom tool that understands your data set and knows
how to handle it efficiently.
As an example, one of the cases where I use deduplication is on a set of
directories that are overlapping subsets of a larger tree. So, the
directories look something like this:
+ a
| + file1
| \ file2
+ b
| + file3
| \ file2
\ c
  + file1
  \ file3
In this case, I know that if a/file1 and c/file1 have the same mtime and
size, they're (supposed to be) copies of the same file. Given that, the
tool I use just checks for duplicate names with the same size and mtime,
counts on the ioctl's built-in comparison to verify that the files are
actually identical (throwing a warning if they aren't), and submits the
ranges so that any given file ends up with the fewest possible number of
extents, all of roughly the same size. On average, even with the fancy
extent-sizing logic, this still takes less than a quarter of the time
that duperemove took on the same data set.
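For the curious, here's a minimal sketch of just the matching side,
assuming the simplest possible layout (two flat directories, no
recursion) and leaving out the extent-sizing pass entirely. This is
illustrative, not the real tool; the point is that candidate selection
is just a couple of stat() calls per file, and the ioctl's own byte
comparison is what actually proves the contents match:

/*
 * Minimal sketch of name/size/mtime matching across two flat
 * directories; matched pairs are handed to FIDEDUPERANGE, whose
 * byte comparison is the real identity check.
 */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>

static void dedupe_pair(const char *a, const char *b, __u64 len)
{
    int src = open(a, O_RDONLY);
    int dst = open(b, O_RDWR);
    struct file_dedupe_range *r =
        calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));

    if (src < 0 || dst < 0 || !r)
        goto out;
    r->dest_count = 1;
    r->info[0].dest_fd = dst;

    __u64 off = 0;
    while (off < len) {
        r->src_offset = off;
        r->src_length = len - off;  /* kernel may clamp per call */
        r->info[0].dest_offset = off;
        r->info[0].bytes_deduped = 0;
        if (ioctl(src, FIDEDUPERANGE, r) < 0)
            break;
        if (r->info[0].status == FILE_DEDUPE_RANGE_DIFFERS) {
            /* Metadata matched but contents didn't: warn, don't trust. */
            fprintf(stderr, "warning: %s and %s differ\n", a, b);
            break;
        }
        if (r->info[0].bytes_deduped == 0)
            break;
        off += r->info[0].bytes_deduped;
    }
out:
    free(r);
    if (src >= 0) close(src);
    if (dst >= 0) close(dst);
}

int main(int argc, char **argv)
{
    DIR *d;
    struct dirent *e;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <dirA> <dirB>\n", argv[0]);
        return 1;
    }
    d = opendir(argv[1]);
    if (!d) {
        perror(argv[1]);
        return 1;
    }
    while ((e = readdir(d)) != NULL) {
        char a[4096], b[4096];
        struct stat sa, sb;

        if (e->d_name[0] == '.')
            continue;
        snprintf(a, sizeof(a), "%s/%s", argv[1], e->d_name);
        snprintf(b, sizeof(b), "%s/%s", argv[2], e->d_name);
        /* Same name in both trees, same size and mtime: assume copies. */
        if (stat(a, &sa) == 0 && stat(b, &sb) == 0 &&
            S_ISREG(sa.st_mode) && S_ISREG(sb.st_mode) &&
            sa.st_size == sb.st_size && sa.st_mtime == sb.st_mtime)
            dedupe_pair(a, b, (__u64)sa.st_size);
    }
    closedir(d);
    return 0;
}

Since there's no block-hashing phase at all, that's presumably where
most of the time difference against duperemove comes from.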