Re: btrfs check inconsistency with raid1, part 1

Kai Krakow posted on Tue, 22 Dec 2015 02:48:04 +0100 as excerpted:

> I just wondered if btrfs allows for the case where both stripes could
> have valid checksums despite of btrfs-RAID - just because a failure
> occurred right on the spot.
> 
> Is this possible? What happens then? If yes, it would mean not to
> blindly trust the RAID without doing the homeworks.

The one case I know of where btrfs could get things wrong is one I 
discovered in my initial pre-deployment btrfs raid1 testing...

1) Create a two-device btrfs raid1 (data and metadata) and put some 
data on it, including a test file with some content to be modified later. 
Sync and unmount normally.

2) Remove one of the two devices.

3) Mount the remaining device degraded-writable (it shouldn't allow 
mounting without degraded) and modify that test file.  Sync and unmount.

4) Switch devices and repeat, modifying that test file in some other 
incompatible way.  Sync and unmount.

To this point, everything should be fine, except that you now have two 
incompatible versions of the test file, potentially with the same 
separate-but-equal generation numbers after the separate degraded-
writable mount/modify/unmount cycles.

5) Plug both devices in and mount normally.  Unless this has changed 
since my tests, btrfs will neither complain in dmesg nor otherwise 
provide any hint that anything is wrong.  If you read the file, it'll 
give you one of the versions, still without any complaint or hint 
that something's wrong.  Again unmount, without writing anything to the 
test file this time.

6) Mount each device individually again (without the other one 
available, so degraded; writable or read-only both work this time) 
and check the file.  Each incompatible copy should remain in place on 
its respective device.  Reading one copy (randomly chosen, or more 
precisely, chosen based on PID even/odd, as that's what the btrfs raid1 
read-scheduler uses to decide which copy to read) doesn't change the 
other one -- btrfs remains oblivious to the incompatible versions.  
Again unmount.

7) Plug both devices in and mount the combined filesystem writable once 
again.  Scrub.
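For reference, the steps above can be sketched as a script.  This is a 
hypothetical outline, not something to run on devices holding data you 
care about: DEVA, DEVB and MNT are placeholder names, every step needs 
root, and the physical detach/reattach of the other device between the 
degraded mounts can't be expressed in shell, so comments mark where it 
happens.

```shell
#!/bin/sh
# Sketch of steps 1-7 above.  DEVA, DEVB and MNT are hypothetical
# placeholders -- use scratch devices you can destroy, and run as root.
DEVA=/dev/sdx
DEVB=/dev/sdy
MNT=/mnt/test

# Refuse to do anything unless the scratch devices actually exist.
if [ ! -b "$DEVA" ] || [ ! -b "$DEVB" ]; then
    echo "scratch devices not present; sketch only"
    exit 0
fi

# 1) Two-device raid1 for both data and metadata, plus a test file.
mkfs.btrfs -f -d raid1 -m raid1 "$DEVA" "$DEVB"
mount "$DEVA" "$MNT"
echo "original content" > "$MNT/testfile"
sync
umount "$MNT"

# 2-3) Physically detach DEVB here, then mount DEVA alone, degraded,
# and modify the test file.
mount -o degraded "$DEVA" "$MNT"
echo "version A" > "$MNT/testfile"
sync
umount "$MNT"

# 4) Reattach DEVB and detach DEVA, then modify the file differently.
mount -o degraded "$DEVB" "$MNT"
echo "version B" > "$MNT/testfile"
sync
umount "$MNT"

# 5-7) Reattach both, mount normally, and scrub.  If the generations
# differ, scrub resolves the divergence toward the newer generation.
mount "$DEVA" "$MNT"
btrfs scrub start -B "$MNT"
umount "$MNT"
```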

Back when I did my testing, I stopped at step 6, as I didn't understand 
that scrub was what I should use to resolve the problem.  However, based 
on quite a bit of later experience keeping a failing device around in 
raid1 mode for a while (more and more sectors replaced with spares; it 
turns out at least the SSD I was working with had far more spares than I 
expected, and even after several months, when I finally gave up and 
replaced it, it was only down to about 85% of spares left, 15% used), 
this should *NORMALLY* not be a problem.  As long as the generations 
differ, btrfs scrub can sort things out and catch up the "behind" 
device, resolving all differences to the latest-generation copy.

8) But suppose both generations happen to be the same -- both devices 
were mounted separately and written so their contents diverged, yet they 
end up at the same generation number when recombined...

From all I know and from everything others told me when I asked at the 
time, which copy you get then is entirely unpredictable, and worse yet, 
you might get btrfs acting on divergent metadata when writing to the 
other device.
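To illustrate why which copy you get looks random: the raid1 read-
scheduler mentioned in step 6 picks a mirror from the low bit of the 
reading process's PID.  A trivial sketch of that selection (just the 
same arithmetic, not actual btrfs code) is:

```shell
#!/bin/sh
# The btrfs raid1 read scheduler effectively picks a mirror as pid % 2.
# Demonstrate the arithmetic with this shell's own PID.
mirror=$(( $$ % 2 ))
echo "pid $$ would read from mirror $mirror"
```

With two divergent-but-equal-generation copies, two readers with PIDs of 
opposite parity would thus see two different files with no error at all.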


The caution, therefore, is to do your best to never let the two copies 
be separately mounted degraded-writable.  If only one copy is 
written to, then its generation will be higher than the other one, and 
scrub should have no problem resolving things.  Even if both copies are 
separately written to incompatibly, in most real-world cases one's going 
to have more generations written than the other and scrub should reliably 
and predictably resolve differences in favor of that one.  The problem 
only appears if they actually happen to have the same generation number, 
relatively unlikely except under controlled test conditions, but that has 
the potential to be a *BIG* problem should it actually occur.

So if for some reason you MUST mount both copies degraded-writable 
separately, the following are your options:

a) don't ever recombine them; instead, do a btrfs replace of the missing 
device with a third device (or a convert to single/dup); use one of the 
options below if you do need to recombine, or...

b) manually verify (using btrfs-show-super or the like) that the supers 
on each don't have the same generation before attempting a recombine, 
or...

c) wipe the one device and treat it as a new device add, so btrfs can't 
get mixed up with differing versions at the same generation number, or...

d) simply take your chances and hope that the generation numbers don't 
match.

(D should in practice be "good enough" if one was only mounted writable a 
very short time, while the other was written to over a rather longer 
period, such that it almost certainly had far more intervening commits 
and thus generations than the other.)
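For option b, the superblock generation check can be scripted.  The 
sketch below assumes the btrfs-show-super output format from btrfs-progs 
of that era (a line reading "generation <number>"; newer btrfs-progs 
spell the command "btrfs inspect-internal dump-super"), and the device 
paths are placeholders:

```shell
#!/bin/sh
# Compare the superblock generations of the two halves of a formerly-
# raid1 pair before recombining them.  DEVA/DEVB are placeholders.
DEVA=/dev/sdx
DEVB=/dev/sdy

get_gen() {
    # btrfs-show-super prints a line like "generation 12345".
    # On newer btrfs-progs, substitute:
    #   btrfs inspect-internal dump-super "$1"
    btrfs-show-super "$1" | awk '$1 == "generation" { print $2 }'
}

if [ -b "$DEVA" ] && [ -b "$DEVB" ]; then
    gen_a=$(get_gen "$DEVA")
    gen_b=$(get_gen "$DEVB")
    if [ "$gen_a" -eq "$gen_b" ]; then
        echo "DANGER: equal generations ($gen_a); do not recombine"
    else
        echo "generations differ ($gen_a vs $gen_b); scrub can resolve"
    fi
else
    echo "devices not present; sketch only"
fi
```

The check is read-only, so it's safe to run on both halves before 
deciding between options a, c, and d.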

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



