I've been experimenting lately with the btrfs RAID1 implementation and have to say that it is performing quite well, but there are a few problems:

* when I purposefully damage the partitions on which btrfs stores data (for example, by changing the case of letters), it reads the other copy and returns correct data. It doesn't report this in dmesg every time, but it does correct the copy with the wrong checksum
* when both copies are damaged, it returns the damaged block as it is written(!) and only adds a warning to dmesg, with exactly the same wording as for single-block corruption(!!)
* from what I could find, btrfs doesn't record anywhere the number of detected and fixed corruptions

I don't know if this is the final design. While the first and last points are minor inconveniences, the second one is quite major: at this time btrfs doesn't prevent silent corruption from going unnoticed. I think that reading from such blocks should return EIO (unless mounted with nodatasum), or there should at least be a broadcast message noting that a corrupted block is being returned to userspace.

I've also been thinking about tiered storage (meaning 2+ tiers, not only two-tiered) and have some ideas about it.

I think that three different mechanisms need to work together to achieve high performance:

* the ability to store all metadata on selected volumes (probably read-optimised SSDs)
* the ability to store all newly written data on selected volumes (write-optimised SSDs)
* the ability to differentiate between often-written, often-read and infrequently accessed data (and, based on this information, the ability to move the data to fast SSDs, slow SSDs, fast RAID, slow RAID or MAID)

While the first two are rather straightforward, the third one needs some explanation. I think that for this to work we should save not only the time of the last access to a file and its last change time but also a few past values (I think that at least 8 to 16 ctimes and atimes are necessary, but this will need testing).

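To make the third point a bit more concrete, here is a rough userspace sketch in Python of what keeping a short history of timestamps could look like. None of this exists in btrfs: the names (AccessHistory, classify), the 16-entry history, the one-week window and the threshold of 4 events are all assumptions picked for illustration, and the real values would need the testing mentioned above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Iterable

HISTORY_LEN = 16  # assumption: upper end of the suggested 8 to 16 range


@dataclass
class AccessHistory:
    """Hypothetical per-file record of the most recent atimes and ctimes."""
    atimes: Deque[float] = field(default_factory=lambda: deque(maxlen=HISTORY_LEN))
    ctimes: Deque[float] = field(default_factory=lambda: deque(maxlen=HISTORY_LEN))

    def record_read(self, when: float) -> None:
        # The deque silently drops the oldest entry once HISTORY_LEN is reached.
        self.atimes.append(when)

    def record_write(self, when: float) -> None:
        self.ctimes.append(when)


def events_in_window(times: Iterable[float], now: float, window: float) -> int:
    """Count timestamps that fall within the last `window` seconds."""
    return sum(1 for t in times if 0 <= now - t <= window)


def classify(history: AccessHistory, now: float,
             window: float = 7 * 24 * 3600.0, hot_threshold: int = 4) -> str:
    """Return 'often-written', 'often-read' or 'infrequent' (assumed labels)."""
    writes = events_in_window(history.ctimes, now, window)
    reads = events_in_window(history.atimes, now, window)
    if writes >= hot_threshold:
        return "often-written"
    if reads >= hot_threshold:
        return "often-read"
    return "infrequent"
```

Writes are checked before reads here, so a file that is both read and written often counts as often-written. Which class ends up on which kind of volume is left open on purpose.
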
I'm not sure how and exactly when to move this data around to keep the arrays balanced, but a userspace daemon would be the most flexible.

This solution won't work well for file systems with a few very large files of which only small parts change often; in other words, it won't be doing block-level tiered storage. From what I know, databases would benefit most from such a configuration, but then most databases can already partition tables into different files based on access rate. As such, file-level granularity would make this mechanism easy to implement while still being useful.

On second thought, it won't be exactly file-level granular: if we introduce snapshots into the mix, the new version can have its data regularly accessed while the old snapshot won't, and this way the obsolete blocks can be moved to slow storage.

--
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
