Re: Data Deduplication with the help of an online filesystem check

Thomas Glanzmann wrote:
Ric,

I would not categorize it as offline, but just not as inband (i.e., you can run a low priority background process to handle dedup).

Offline windows are extremely rare in production sites these days and
it could take a very long time to do dedup at the block level over a
large file system :-)

Let me rephrase: by offline I meant asynchronous, during off hours.

Hi, over the last half year I have thought a bit about doing dedup for my backup program: not only with fixed blocks (which is implemented), but with moving blocks (at every offset in a file: 1 byte, 2 bytes, ...). That means I need *lots* of comparisons (size of file - block size). Even though it's not the same case, it must be very fast, and that's the same problem as the one discussed here.
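To make the number of comparisons concrete, here is a minimal Python sketch of checksumming a block at every byte offset. The function name `window_checksums` is made up for illustration, and the plain byte sum is only a stand-in: the post does not say which checksum is used, and a real one would need to be much stronger. The point is only that the checksum can be updated per offset instead of recomputed.

```python
def window_checksums(data: bytes, block_size: int):
    """Yield (offset, checksum) for every block_size window in data.

    Assumption: a byte sum truncated to 24 bits stands in for the
    real checksum, which the post does not specify.
    """
    if block_size <= 0 or len(data) < block_size:
        return
    mask = (1 << 24) - 1
    total = sum(data[:block_size])
    yield 0, total & mask
    # Slide the window one byte at a time: add the byte that enters,
    # subtract the byte that leaves.
    for offset in range(1, len(data) - block_size + 1):
        total += data[offset + block_size - 1] - data[offset - 1]
        yield offset, total & mask


if __name__ == "__main__":
    sample = bytes(range(10))
    for offset, checksum in window_checksums(sample, 4):
        print(offset, checksum)
```

A file of n bytes has n - block_size + 1 windows, so the per-window lookup has to be cheap.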

My solution (not yet implemented) is as follows (I hope I remember it correctly):

I calculate a 24-bit checksum (another size would work too).

This means, I can have 2^24 different checksums.

Therefore, I hold a bit vector in memory, one bit for each possible checksum (I'm in a hotel and have no calculator, so I hope I remember well: I had 0.5 GB in mind, which would fit a 32-bit checksum; 2^24 bits is only 2 MiB). This vector is initialized with zeros.

For each calculated checksum of a block, I set the corresponding bit in the bit vector.

It's very fast to check whether a block with a given checksum exists in the filesystem (the backup, in my case) by checking the corresponding bit in the bit vector.

If the bit is not set, it's a new block.

If the bit is set, a separate 'real' check is needed to see whether it's really the same block (which is slow, but that happens <<1% of the time).
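The scheme above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the names `ChecksumBitVector` and `store_block` are hypothetical, `zlib.crc32` truncated to 24 bits stands in for the unspecified checksum, and a dictionary of stored blocks stands in for the slow 'real' comparison against the backup.

```python
import zlib

CHECKSUM_BITS = 24  # as in the post; another size works too


class ChecksumBitVector:
    """One bit per possible checksum value, initialized to zero."""

    def __init__(self, bits: int = CHECKSUM_BITS):
        self.mask = (1 << bits) - 1
        self.vector = bytearray((1 << bits) // 8)  # 2 MiB for 24 bits

    def checksum(self, block: bytes) -> int:
        # Assumption: CRC32 truncated to the low `bits` bits.
        return zlib.crc32(block) & self.mask

    def maybe_seen(self, block: bytes) -> bool:
        c = self.checksum(block)
        return bool(self.vector[c >> 3] & (1 << (c & 7)))

    def add(self, block: bytes) -> None:
        c = self.checksum(block)
        self.vector[c >> 3] |= 1 << (c & 7)


def store_block(block: bytes, seen: ChecksumBitVector, stored: dict) -> bool:
    """Return True if the block was new and stored, False if a duplicate."""
    key = seen.checksum(block)
    if seen.maybe_seen(block):
        # Bit is set: do the slow 'real' check against stored blocks.
        if block in stored.get(key, []):
            return False
    # Bit not set, or checksum collision with different content: new block.
    seen.add(block)
    stored.setdefault(key, []).append(block)
    return True


if __name__ == "__main__":
    seen, stored = ChecksumBitVector(), {}
    print(store_block(b"a" * 4096, seen, stored))
    print(store_block(b"a" * 4096, seen, stored))
```

A cleared bit proves the block is new; a set bit only says the block may exist, so the byte comparison stays necessary. How often the slow path runs depends on how full the vector gets.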

I hope it is possible to understand my thoughts. I'm in a hotel and may not be able to follow the emails on this list over the next hours or days.

Regards, HJC
1/3 is not sufficient for dedup in my opinion - you can get that with normal compression at the block level.

1/3 is what real-time data from a production environment in a
mixed VM setup gives me, without compression.

        Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
