Zygo Blaxell posted on Mon, 03 Nov 2014 23:31:45 -0500 as excerpted:

> On Mon, Nov 03, 2014 at 10:11:18AM -0700, Chris Murphy wrote:
>>
>> On Nov 2, 2014, at 8:43 PM, Zygo Blaxell <zblaxell@xxxxxxxxxxxxxxx>
>> wrote:
>> > btrfs seems to assume the data is correct on both disks (the
>> > generation numbers and checksums are OK) but gets confused by equally
>> > plausible but different metadata on each disk. It doesn't take long
>> > before the filesystem becomes data soup or crashes the kernel.
>>
>> This is a pretty significant problem to still be present, honestly. I
>> can understand the "catchup" mechanism is probably not built yet,
>> but clearly the two devices don't have the same generation. The lower
>> generation device should probably be booted/ignored or declared missing
>> in the meantime to prevent trashing the file system.
>
> The problem with generation numbers is when both devices get divergent
> generation numbers but we can't tell them apart

[snip very reasonable scenario]

> Now we have two disks with equal generation numbers.
> Generations 6..9 on sda are not the same as generations 6..9 on sdb, so
> if we mix the two disks' metadata we get bad confusion.
>
> It needs to be more than a sequential number. If one of the disks
> disappears we need to record this fact on the surviving disks, and also
> cope with _both_ disks claiming to be the "surviving" one.

Zygo's absolutely correct. There is an existing catchup mechanism, but
the tracking is /purely/ sequential generation number based, and if the
two generation sequences diverge, "Welcome to the (data) Twilight Zone!"

I noted this in my own early pre-deployment raid1 mode testing as well,
except that I didn't at that point know about sequence numbers and never
got as far as letting the filesystem make data soup of itself.

What I did was this:

1) Create a two-device raid1 data and metadata filesystem, mount it and
stick some data on it.
2) Unmount, pull a device, mount degraded the remaining device.
3) Change a file.
4) Unmount, switch devices, mount degraded the other device.
5) Change the same file in a different/incompatible way.
6) Unmount, plug both devices in again, mount (not degraded).
7) Wait for the sync I was used to from mdraid, which of course didn't
occur.
8) Check the file to see which version showed up. I don't recall which
version it was, but it wasn't the common pre-change version.
9) Unmount, pull each device one at a time, mounting the other one
degraded and checking the file again.
10) The file on each device remained different, without a warning or
indication of any problem at all when I mounted undegraded in 6/7.

Had I initiated a scrub, presumably it would have seen the difference,
and if one was a newer generation, it would have taken it, overwriting
the other. I don't know what it would have done if both were the same
generation, tho the file being small (just a few-line text file, big
enough to test the effect of differing edits), I guess it would take one
version or the other. If the file was large enough to be multiple
extents, however, I've no idea whether it'd take one or the other, or
possibly combine the two, picking extents where they differed more or
less randomly.

By that time, the lack of any warning, and of any definite resolution to
one version or the other, even after mounting undegraded and accessing
the file with incompatible versions on each of the two devices, was
bothering me sufficiently that I didn't test any further.

Since it's just me I have to worry about (unlike a multi-admin corporate
scenario where you can never be /sure/ what the other admins will do
regardless of agreed procedure), I simply set myself a set of rules very
similar to what Zygo proposed:

1) If for whatever reason I ever split a btrfs raid1 with the intent or
even the possibility of bringing the pieces back together again, if at
all possible, never mount the split pieces writable -- mount read-only.
2) If a writable mount is required, keep the writable mounts to one
device of the split. As long as the other device is never mounted
writable, it will have an older generation when they're reunited, and a
scrub should take care of things, reliably resolving to the updated
written device and rewriting the older generation on the other device.

What I'd do here is physically put the removed side of the raid1 in
storage, far enough from the remaining side that I couldn't possibly get
them mixed up. I'd clearly label it as well, creating a "defense in
depth" of at least two layers: the labeling, and the physical separation
and storage of the read-only device.

3) If for whatever reason the originally read-only side must be mounted
writable, very clearly mark the originally mounted-writable device
POISONED/TOXIC!! *NEVER* *EVER* let such a POISONED device anywhere near
its original raid1 mate until it is wiped, such that there's no
possibility of btrfs getting confused and contaminated with the poisoned
data.

Given how unimpressed I was with btrfs' ability to do the right thing in
such cases, I'd be tempted to wipefs the device, then dd from /dev/zero
to it, then badblocks write-pattern test a couple patterns, then (if it
was a full physical device, not just a partition) hardware secure-erase
it, then mkfs it to ext4 or vfat, then dd from /dev/zero to it again and
again hardware secure-erase it, then FINALLY mkfs.btrfs it again.

Of course, this being an ssd, a single mkfs.btrfs would issue a trim and
that should suffice, but I was really REALLY not impressed with btrfs'
ability to reliably do the right thing, and would effectively be tearing
up the schoolbooks (at least the workbooks, since they couldn't be bought
back) and feeding them to the furnace at the end of the year, as I used
to do when I was a kid, not because it made a difference, but because it
was so emotionally rewarding! =:^)

Or maybe I'd make that an excuse to try dban[1].
But I'd probably just dd from /dev/zero or secure-erase it, or
badblocks-write-test a couple patterns if I wanted to badblocks-test it
anyway, or mkfs.btrfs it to get the trim from that. But I'd have fun
doing it. =:^)

And then I'd plug it back in and btrfs replace the missing device.

Anyway, the point is, either don't reintroduce absent devices once split
out of a btrfs raid1, or ensure they don't get written and immediately
do a scrub to update them when reintroduced, or, if they were written and
the other device was too, separately, be sure the one is wiped (Destroy
them with Lazers![2]) before doing a full btrfs replace, to keep the
remaining device(s) and the data on them healthy. =:^)

---
[1] https://www.google.com/search?q=dban

[2] Destroy them with Lazers! by Knife Party
https://www.google.com/search?q=destroy+them+with+lazers

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
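As an aside on the point quoted at the top of this message: the reason a
purely sequential generation number cannot detect a split-brain raid1 can
be shown with a toy model. This is only a sketch, not btrfs code. The
`Device` class, its `history` list and both comparison functions are
hypothetical names made up for illustration, and the assumption that
generations 1..5 were shared before the split is mine (the original
scenario was snipped); only the "generations 6..9 differ on sda and sdb"
part comes from the quoted text.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy stand-in for one raid1 member; NOT btrfs's real on-disk format."""
    name: str
    generation: int = 0
    history: list = field(default_factory=list)  # one entry per committed write

    def commit(self, change: str) -> None:
        # Every committed write bumps the sequential generation by one.
        self.generation += 1
        self.history.append(change)


def looks_in_sync(a: Device, b: Device) -> bool:
    # A check based purely on the sequential generation number.
    return a.generation == b.generation


def really_in_sync(a: Device, b: Device) -> bool:
    # Ground truth for this toy model: identical write histories.
    return a.history == b.history


sda, sdb = Device("sda"), Device("sdb")

# Assumed generations 1..5: both devices present, writes land on both.
for n in range(1, 6):
    for dev in (sda, sdb):
        dev.commit(f"shared write {n}")

# Generations 6..9 written to sda alone (mounted degraded, sdb absent).
for n in range(6, 10):
    sda.commit(f"sda-only write {n}")

# Generations 6..9 written to sdb alone (mounted degraded, sda absent).
for n in range(6, 10):
    sdb.commit(f"sdb-only write {n}")

print(sda.generation, sdb.generation)  # both counters end at 9
print(looks_in_sync(sda, sdb))         # True: the counters agree
print(really_in_sync(sda, sdb))        # False: the contents diverged
```

In this model the counter comparison reports the two devices as matching
even though their last four writes differ, which is the situation the
rules above are meant to avoid by never letting both halves of a split
be written.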
