On Sat, Mar 31, 2018 at 11:36:50AM +0300, Andrei Borzenkov wrote:
> 31.03.2018 11:16, Goffredo Baroncelli wrote:
> > On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
> >>> The key is that if a data write is interrupted, all the transaction
> >>> is interrupted and aborted. And due to the COW nature of btrfs, the
> >>> "old state" is restored at the next reboot.
> >
> >> This is not presently true with raid56 and btrfs. RAID56 on btrfs uses
> >> RMW operations which are not COW and don't provide any data integrity
> >> guarantee. Old data (i.e. data from very old transactions that are not
> >> part of the currently written transaction) can be destroyed by this.
> >
> > Could you elaborate a bit?
> >
> > Generally speaking, updating a part of a stripe requires an RMW cycle, because
> > - you need to read all data stripes (with parity in case of a problem)
> > - then you should write
> >   - the new data
> >   - the new parity (calculated on the basis of the first read, and the new data)
> >
> > However the "old" data should be untouched; or are you saying that the "old" data is rewritten with the same data?
> >
>
> If an old data block becomes unavailable, it can no longer be reconstructed
> because the old contents of the "new data" and "new parity" blocks are lost.
> Fortunately, if checksums are in use it does not cause silent data
> corruption, but it effectively means data loss.
>
> Writing of data belonging to an unrelated transaction affects previous
> transactions precisely due to the RMW cycle. This fundamentally violates
> the btrfs claim of always having either an old or a new consistent state.

Correct.

To fix this, any RMW stripe update on raid56 has to be written to a log
first. All RMW updates must be logged because a disk failure could
happen at any time.

Full stripe writes don't need to be logged because all the data in the
stripe belongs to the same transaction, so if a disk fails the entire
stripe is either committed or it is not.
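To make the failure concrete, here is a toy model of one stripe on a 3-disk RAID5 (two data blocks plus XOR parity). It is only a sketch, not btrfs code: the block contents, the `xor` helper and the dict standing in for a log are all made up for illustration. It shows an interrupted RMW followed by a disk failure destroying a block that the current transaction never touched, and the same sequence surviving when the update is logged first.

```python
from functools import reduce


def xor(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# One stripe: old_a was committed long ago, old_b is a free (zeroed) block.
old_a = b"AAAA"
old_b = b"\x00\x00\x00\x00"
parity = xor(old_a, old_b)

# Before the update, old_a can be rebuilt from the other two blocks.
assert xor(old_b, parity) == old_a

# RMW update: a new transaction writes new_b into the free block, which
# also requires rewriting the parity block in place.
new_b = b"BBBB"
new_parity = xor(old_a, new_b)

# Crash after new_b reached the disk but before new_parity did.
disk_b, disk_p = new_b, parity

# The disk holding old_a then fails; rebuild it from what is on disk.
rebuilt_a = xor(disk_b, disk_p)
a_survives = rebuilt_a == old_a  # False: old committed data is gone

# With a log (hypothetical): both blocks are logged before the in-place
# write, and replaying the log after the crash completes the update.
log = {"b": new_b, "p": new_parity}
disk_b, disk_p = log["b"], log["p"]
a_survives_logged = xor(disk_b, disk_p) == old_a  # True
```

Checksums would catch the bad `rebuilt_a` in the first case, so the corruption is not silent, but the data is still lost.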
One way to avoid the logging is to change the btrfs allocation parameters so that the filesystem doesn't allocate data in RAID stripes that are already occupied by data from older transactions. This is similar to what 'ssd_spread' does, although the ssd_spread option wasn't designed for this and won't be effective on large arrays. This avoids modifying stripes that contain old committed data, but it also means the free space on the filesystem will become heavily fragmented over time. Users will have to run balance *much* more often to defragment the free space.
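A rough sketch of such an allocation policy and of the fragmentation it causes. This is not how the btrfs allocator is structured; the `Stripe` class and the `allocate` and `commit` functions are hypothetical names used only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    capacity: int       # data blocks per stripe
    committed: int = 0  # blocks written by already-committed transactions
    pending: int = 0    # blocks written by the current transaction

    @property
    def free(self) -> int:
        return self.capacity - self.committed - self.pending


def allocate(stripes, nblocks):
    """Place nblocks, skipping every stripe that holds committed data."""
    usable = [s for s in stripes if s.committed == 0]
    if sum(s.free for s in usable) < nblocks:
        raise RuntimeError("no stripe free of committed data; balance needed")
    placed = []
    for i, s in enumerate(stripes):
        if nblocks == 0:
            break
        if s.committed > 0 or s.free == 0:
            continue
        take = min(s.free, nblocks)
        s.pending += take
        nblocks -= take
        placed.append((i, take))
    return placed


def commit(stripes):
    """End the transaction: pending blocks become old committed data."""
    for s in stripes:
        s.committed += s.pending
        s.pending = 0


# Three small transactions each dirty a different stripe.
stripes = [Stripe(capacity=4) for _ in range(3)]
for _ in range(3):
    allocate(stripes, 1)
    commit(stripes)

# 9 of 12 blocks are still free, but none of them is usable without RMW.
free_blocks = sum(s.free for s in stripes)
```

After the three one-block transactions, a fourth `allocate(stripes, 1)` raises even though most of the space is free, which is the situation a balance would have to clean up.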
