Morey Roof <moreyroof@xxxxxxxxx> writes:

> I have been thinking about a new feature to start work on that I am
> interested in and I was hoping people could give me some feedback and
> ideas of how to tackle it. Anyways, I want to create a data
> deduplication system that can work in two different modes. One mode
> is that when the system is idle or not beyond a set load point a
> background process would scan the volume for duplicate blocks. The
> other mode would be used for systems that are nearline or backup
> systems that don't really care about the performance and it would do
> the deduplication during block allocation.

Seems like a special case of compression? Perhaps compression would
help more?

> One of the ways I was thinking of to find the duplicate blocks would
> be to use the checksums as a quick compare. If the checksums match
> then do a complete compare before adjusting the nodes on the files.
> However, I believe that I will need to create a tree based on the
> checksum values.

If you really want to do deduplication: it might be advantageous to do
this on larger units. If you assume that data is usually shared between
similar files (which is a reasonable assumption) and do the
deduplication on whole files, you can also use the size as an index and
avoid checksumming all files with a unique size.

I wrote a user level duplicated file checker some time ago that used
this trick successfully.

-Andi
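
[Editor's note: a minimal user-space sketch of the size-first scan Andi
describes, combined with the checksum-then-verify step from the original
proposal. This is not the tool Andi actually wrote; the use of Python,
SHA-256, and the function names below are illustrative assumptions.]

```python
#!/usr/bin/env python3
# Hypothetical sketch: group files by size, checksum only the groups
# with more than one member, then do a full byte-for-byte compare
# before reporting two files as duplicates.

import hashlib
import os
import sys
from collections import defaultdict
from filecmp import cmp


def sha256_of(path, bufsize=1 << 20):
    """Checksum a file in chunks so large files do not blow up memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    # Pass 1: size acts as a free index; a file with a unique size can
    # never have a duplicate, so it is never read or checksummed at all.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: checksum only the files whose size collides.
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)

        # Pass 3: checksums can collide in principle, so confirm with a
        # full compare before declaring a duplicate.
        for candidates in by_hash.values():
            first = candidates[0]
            dups = [p for p in candidates[1:] if cmp(first, p, shallow=False)]
            if dups:
                yield [first] + dups


if __name__ == "__main__":
    for group in find_duplicates(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("duplicates:", *group)
```

The same checksum-then-verify shape is what the original proposal
suggests at the block level, with the filesystem's existing per-block
checksums playing the role of the cheap first-pass filter.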
