Re: Data Deduplication with the help of an online filesystem check

On Tue, 5 May 2009 07:29:45 +1000
Dmitri Nikulin <dnikulin@xxxxxxxxx> wrote:

> On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes <hjclaes@xxxxxx> wrote:
> > Hi, during the last half year I have thought a little about doing dedup for
> > my backup program: not only with fixed blocks (which is implemented), but
> > with moving blocks (at every offset in a file: 1 byte, 2 bytes, ...). That
> > means I have to do *lots* of comparisons (size of file - blocksize).
> > Even though it's not quite the same, it must be very fast, and that's the
> > same problem as the one discussed here.
> >
> > My solution (not yet implemented) is as follows (hopefully I remember well):
> >
> > I calculate a 24-bit checksum (the size could be different).
> >
> > This means I can have 2^24 different checksums.
> >
> > Therefore, I hold a bit vector of 0.5 GB in memory (I hope I remember well,
> > I'm just in a hotel and have no calculator): one bit for each possibility.
> > This vector is initialized with zeros.
> >
> > For each calculated checksum of a block, I set the corresponding bit in the
> > bit vector.
> >
> > It's very fast to check whether a block with a given checksum already exists
> > in the filesystem (the backup, in my case) by testing the appropriate bit in
> > the bit vector.
> >
> > If it doesn't exist, it's a new block.
> >
> > If it exists, there needs to be a separate 'real' check whether it's really
> > the same block (which is slow, but that happens <<1% of the time).
> 
> Which means you have to refer to each block in some unique way from
> the bit vector, making it a block pointer vector instead. That's only
> 64 times more expensive for a 64 bit offset...
> 
That was not the idea: only a bit vector, not a pointer vector. A pointer vector would be too big to hold in RAM. On a hit I therefore need to go to disk after computing the more exact checksum (I wanted to use md5). The bit vector is only there to give a very quick answer in most cases (a speedup).
But I have no idea whether it fits this use case. I'm not a filesystem developer ;-)
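
Roughly, the pre-filter could look like the minimal C sketch below. It is an
illustration only, not code from the backup program: it assumes a 24-bit
checksum as described above and a hypothetical FNV-1a hash truncated to 24 bits
(note that 2^24 bits is only 2 MiB; the 0.5 GB figure quoted above would
correspond to a 32-bit checksum). The slow 'real' comparison (md5 plus a disk
read) is left to the caller.

#include <stdint.h>
#include <stdlib.h>

#define CHECKSUM_BITS 24
#define VECTOR_BYTES  (1u << (CHECKSUM_BITS - 3))   /* 2^24 bits = 2 MiB */

static uint8_t *seen;   /* one bit per possible checksum, all zero initially */

static void filter_init(void)
{
    seen = calloc(VECTOR_BYTES, 1);
}

/* Hypothetical fast 24-bit block checksum: FNV-1a truncated to 24 bits. */
static uint32_t block_checksum(const uint8_t *block, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = (h ^ block[i]) * 16777619u;
    return h & ((1u << CHECKSUM_BITS) - 1);
}

/*
 * Returns 1 if the block is definitely new (bit was clear, now set),
 * 0 if a block with the same checksum was seen before, in which case
 * the caller must do the slow 'real' comparison (md5 + disk read).
 */
static int block_maybe_new(const uint8_t *block, size_t len)
{
    uint32_t c = block_checksum(block, len);
    uint8_t mask = (uint8_t)(1u << (c & 7));
    if (seen[c >> 3] & mask)
        return 0;
    seen[c >> 3] |= mask;
    return 1;
}

A caller would run filter_init() once, then call block_maybe_new() for every
block; only a return of 0 triggers the expensive verification.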

> Since the overwhelming majority of combinations will never appear in
> practice, you are much better served with a self-sizing data structure
> like a hash map or even a binary tree, or a hash map with each bucket
> being a binary tree, etc... You can use any sized hash and it won't
> affect the number of nodes you have to store. You can trade off CPU
> against RAM easily as required, just by selecting an appropriate data
> structure. A bit vector and especially a pointer vector have extremely
> bad "any" case RAM requirements because even if you're deduping a mere
> 10 blocks you're still allocating and initialising 2^24 offsets. The
> least you could do is adaptively switch to a more efficient data
> structure if you see the number of blocks is low enough.
> 
> -- 
> Dmitri Nikulin
> 
> Centre for Synchrotron Science
> Monash University
> Victoria 3800, Australia
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
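
For comparison, the self-sizing structure Dmitri suggests could look roughly
like the following C sketch: a chained hash map that only allocates nodes for
checksums actually seen, so memory scales with the number of unique blocks
rather than with 2^24. The names, the 64-bit checksum, and the fixed bucket
count are illustrative assumptions; a real implementation would grow the table
or switch structures as he describes.

#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 4096   /* could itself grow when the load factor rises */

struct dedup_node {
    uint64_t checksum;       /* full hash of a block already in the backup */
    uint64_t block_offset;   /* where that block lives on disk */
    struct dedup_node *next;
};

static struct dedup_node *buckets[NBUCKETS];

/*
 * If the checksum is already known, return 1 and hand back the stored
 * offset so the caller can verify byte for byte; otherwise remember it
 * and return 0 (definitely a new block so far).
 */
static int dedup_lookup_or_insert(uint64_t checksum, uint64_t block_offset,
                                  uint64_t *known_offset)
{
    struct dedup_node **slot = &buckets[checksum % NBUCKETS];

    for (struct dedup_node *n = *slot; n; n = n->next) {
        if (n->checksum == checksum) {
            *known_offset = n->block_offset;
            return 1;
        }
    }

    struct dedup_node *n = malloc(sizeof(*n));
    n->checksum = checksum;
    n->block_offset = block_offset;
    n->next = *slot;
    *slot = n;
    return 0;
}

A return of 1 still only means 'possible duplicate': the caller reads the block
at *known_offset and compares byte for byte, just like the <<1% fallback above.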


-- 
Heinz-Josef Claes <hjclaes@xxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
