On Wed, Mar 17, 2010 at 4:25 PM, Hubert Kario <hka@xxxxxxxxxx> wrote:
> On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote:
>> Hi,
>>
>> just want to add one correction to your thoughts:
>>
>> Storage is not cheap if you think about enterprise storage on a SAN,
>> replicated to another data centre. Using dedup on the storage boxes leads
>> to performance issues and other problems - only NetApp is offering this at
>> the moment and it's not heavily used (because of the issues).
>
> there are at least two other suppliers with inline dedup products, and there
> is an OSS solution: lessfs
>
>> So I think it would be a big advantage for professional use to have dedup
>> built into the filesystem - processors are faster and faster today and are
>> not the cost drivers any more. I do not think it's a problem to "spend" one
>> core of a 2-socket box with 12 cores for this purpose.
>> Storage is cost intensive:
>> - SAN boxes are expensive
>> - RAID5 in two locations is expensive
>> - FC lines between locations are expensive (depending very much on where
>> you are).
>
> In-line dedup is expensive in two ways: first you have to cache the data
> going to disk and generate a checksum for it, then you have to check whether
> such a block is already stored -- if the database doesn't fit into RAM (for a
> VM host that's more than likely) that requires at least a few disk seeks, if
> not a few dozen for really big databases. Then you should read the
> block/extent back and compare it bit for bit. And only then write the data to
> disk. That reduces your IOPS by at least an order of magnitude, if not more.

Sun decided that with SHA256 (which ZFS uses for normal checksumming) collisions
are unlikely enough to skip the read/compare step:
http://blogs.sun.com/bonwick/entry/zfs_dedup . That's not the case, of course,
with the CRC32 that btrfs uses, but a switch to a stronger hash would be
recommended to reduce collisions anyway. And yes, for the truly paranoid, a
forced verification (after the hashes match) is always an option.

>
> For post-process dedup you can go as fast as your HDDs will allow, and then,
> when your machine is mostly idle, you can go and churn through the data.
>
> IMHO in-line dedup is a good thing only for backup storage -- where there is
> a high probability that the stored data is a duplicate (and with a 1:10 dedup
> ratio, that probability is 90%).
>
> So the CPU cost is only one factor. HDDs are a major bottleneck too.
>
> All things considered, it would be best to have both post-process and in-line
> data deduplication, but I think that in-line dedup will see much less use.
>
>>
>> Naturally, you would not use this feature for all kinds of use cases (e.g.
>> a heavily used database), but I think there is enough need.
>>
>> my 2 cents,
>> Heinz-Josef Claes
> --
> Hubert Kario
> QBS - Quality Business Software
> 02-656 Warszawa, ul. Ksawerów 30/85
> tel. +48 (22) 646-61-51, 646-74-24
> www.qbs.com.pl
>
> Quality Management System
> compliant with ISO 9001:2000
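
P.S. To make the verify-or-trust-the-hash decision concrete, here is a rough,
untested sketch of an in-line dedup write path (hash, index lookup, optional
bit-for-bit verification). The block size, class, and function names are made
up for illustration and do not reflect actual btrfs or ZFS code:

    import hashlib

    BLOCK_SIZE = 4096  # illustrative block size, not a real btrfs extent size

    class ToyDedupStore:
        """Toy in-memory block store illustrating in-line dedup decisions."""

        def __init__(self, verify=True):
            self.blocks = {}      # digest -> block data (stands in for "disk")
            self.refcount = {}    # digest -> number of references
            self.verify = verify  # forced bit-for-bit check after a hash match

        def write_block(self, data: bytes) -> str:
            assert len(data) <= BLOCK_SIZE
            digest = hashlib.sha256(data).hexdigest()  # strong hash, as ZFS uses

            if digest in self.blocks:
                # Hash matched an existing block. With SHA-256 one may skip the
                # read-back (ZFS's choice); with a weak hash such as CRC32 the
                # compare is mandatory, costing an extra read per write.
                if not self.verify or self.blocks[digest] == data:
                    self.refcount[digest] += 1
                    return digest  # deduplicated: no new data written
                raise RuntimeError(
                    "hash collision; a real implementation would store the "
                    "block separately instead of deduplicating it")

            # New block: pay the normal write cost.
            self.blocks[digest] = data
            self.refcount[digest] = 1
            return digest

    if __name__ == "__main__":
        store = ToyDedupStore(verify=True)
        a = store.write_block(b"A" * 4096)
        b = store.write_block(b"A" * 4096)  # duplicate: only a reference is added
        print(a == b, store.refcount[a])    # True 2

Even in this toy version you can see where the cost goes: the index lookup and
the optional read-back happen on every write, which is exactly the IOPS penalty
discussed above when the index does not fit in RAM.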
