Re: btrfs replace seems to corrupt the file system

Mordechay Kaganer posted on Mon, 29 Jun 2015 08:02:01 +0300 as excerpted:

> On Sun, Jun 28, 2015 at 10:32 PM, Chris Murphy <lists@xxxxxxxxxxxxxxxxx>
> wrote:
>> On Sun, Jun 28, 2015 at 1:20 PM, Mordechay Kaganer <mkaganer@xxxxxxxxx>
>> wrote:
>>
>> Use of dd can cause corruption of the original.
>>
> But doing a block-level copy and taking care that the original volume is
> hidden from the kernel while mounting the new one is safe, isn't it?

As long as neither one is mounted while doing the copy, and one or the 
other is hidden before an attempt to mount, it should be safe, yes.

The base problem is that btrfs can be multi-device, and that it tracks 
the devices belonging to the filesystem based on UUID, so as soon as it 
sees another device with the same UUID, it considers it part of the same 
filesystem.  Writes can go to any of the devices it considers a component 
device, and after a write creates a difference, reads can end up coming 
from the stale one.
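
For example, after a block-level copy, you can see the duplication 
directly; both of these scan by UUID, so the copy shows up as a second 
device of the "same" filesystem (device names here are just placeholders):

  # Show the filesystem UUID as seen on each device:
  blkid /dev/sdb1 /dev/sdc1
  # Show which devices btrfs currently considers part of each filesystem:
  btrfs filesystem show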

Meanwhile, unlike many filesystems, btrfs uses the UUID as part of the 
metadata, so changing the UUID isn't as simple as rewriting a superblock; 
the metadata must be rewritten to the new UUID.  There's actually a tool 
available now to do just that, but it's new enough that I'm not even sure 
it's in a release yet; if it is, it'll only be in the very latest 
releases, otherwise it'd be in the integration branch.
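
If memory serves, it's the UUID-change option added to btrfstune; the 
invocation would look something like this, on an unmounted filesystem, 
but do verify your btrfs-progs actually has the option before relying 
on it:

  # Assumes a btrfs-progs new enough to carry the UUID-change option.
  # The filesystem must be unmounted, and since all metadata gets
  # rewritten to the new UUID, it can take a while on a big filesystem.
  btrfstune -u /dev/sdc1    # generate and write a new random UUID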

And FWIW, a different aspect of the same problem can occur in raid1 mode, 
when a device drops out and is later reintroduced, with both devices 
separately mounted rw,degraded and updated in the meantime.  Normally 
btrfs tracks the generation, a monotonically increasing integer, and 
reads from the higher/newer generation, but with separate updates to 
each, if they both happen to be at the same generation when reunited...
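
If you ever need to check, the superblock generation is visible with the 
show-super tool (newer progs fold it into btrfs inspect-internal 
dump-super); comparing the two devices shows which copy is ahead, or 
whether they've collided (device names are placeholders):

  # Compare the superblock generation on each device:
  btrfs-show-super /dev/sdb1 | grep '^generation'
  btrfs-show-super /dev/sdc1 | grep '^generation'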

So for raid1 mode, the recommendation is: if there's a split and one copy 
continues to be updated, make sure the other one isn't separately mounted 
writable before the two are combined again.  If both must be separately 
mounted writable and then recombined, wipe one of them and add it back as 
a new device, thus avoiding any possibility of confusion.
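
A minimal sketch of that wipe-and-re-add, assuming /dev/sdb1 is the copy 
being discarded and /dev/sdc1 is the survivor (placeholder names, and 
double-check which is which before wiping anything):

  # Make the stale copy unrecognizable as part of the filesystem,
  # then mount the survivor degraded and treat the wiped device as
  # brand new, letting btrfs re-replicate onto it:
  wipefs -a /dev/sdb1
  mount -o degraded /dev/sdc1 /mnt
  btrfs device add /dev/sdb1 /mnt
  btrfs device delete missing /mnt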

> Anyway, what is the "straightforward" and recommended way of replacing
> the underlying device on a single-device btrfs not using any raid
> features? I can see 3 options:
> 
> 1. btrfs replace - as far as i understand, it's primarily intended for
> replacing the member disks under btrfs's raid.

It seems this /can/ work.  You demonstrated that much.  But I'm not sure 
whether btrfs replace was actually designed to do a single-device 
replace.  If not, it almost certainly hasn't been tested for it.  Even if 
so, I'm sure I'm not the only one who hadn't thought of using it that 
way, so while it might have been development-tested for single-device 
replace, it's unlikely to have had the same degree of broader testing in 
actual usage, simply because few thought to use it that way.
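
For the record, the invocation that should do a whole-device replace, 
single-device or not (tho given this thread, I'd test it on something 
disposable first; device names are placeholders), looks like:

  # Replace devid 1 (the only device here) with the new device,
  # while the filesystem stays mounted at /mnt:
  btrfs replace start 1 /dev/sdc1 /mnt
  btrfs replace status /mnt
  # If the new device is bigger, grow into the extra space afterward:
  btrfs filesystem resize 1:max /mnt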

Regardless, you seem to have flushed out some bugs.  Now that they're 
visible and the weekend's over, the devs will likely get to work tracking 
them down and fixing them.

> 2. Add a new volume, then remove the old one. Maybe this way we'll need
> to do a full balance after that?

This is the alternative I'd have used in your scenario (but see below).  
Except that a manual balance shouldn't be necessary.  The device add part 
should go pretty fast, as it simply makes more space available.  The 
device remove will go much slower, as in effect it triggers that balance, 
forcing everything over to the just-added, mostly empty device.

You'd do a manual balance if you wanted to convert to raid or some such, 
but from single device to single device, just the add/remove should do it.
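
The add/remove sequence, again with placeholder device names, would be 
something like:

  # Add the new device, then remove the old one; the delete itself
  # migrates all chunks over, so no separate balance is needed:
  btrfs device add /dev/sdc1 /mnt
  btrfs device delete /dev/sdb1 /mnt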

> 3. Block-level copy of the partition, then hide the original from the
> kernel to avoid confusion because of the same UUID. Of course, this way
> the volume is going to be off-line until the copy is finished.

This could work too, but in addition to being forced to keep the 
filesystem offline the entire time, a block-level copy faithfully copies 
any existing problems along with the data.
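
A sketch of that route, assuming /dev/sdb1 is the original and /dev/sdc1 
the target (placeholders), with the filesystem unmounted the whole time:

  # Block-level copy with both filesystems unmounted:
  dd if=/dev/sdb1 of=/dev/sdc1 bs=64M conv=fsync
  # Before mounting the copy, hide the original from the kernel,
  # e.g. by pulling the disk, or for a SATA/SCSI device something like:
  echo 1 > /sys/block/sdb/device/delete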


But what I'd /prefer/ to do would be to take the opportunity to create a 
new filesystem, possibly using different mkfs.btrfs options or at least 
starting new with a fresh filesystem and thus eliminating any as yet 
undetected or still developing problems with the old filesystem.  Since 
the replace or device remove will end up rewriting everything anyway, 
might as well make a clean break and start fresh, would be my thinking.
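
That route is simply a fresh mkfs on the new device with whatever options 
you've been wanting to try, for instance (options purely illustrative):

  # Fresh filesystem on the new device; pick your own options here:
  mkfs.btrfs -L newdata /dev/sdc1
  mount /dev/sdc1 /mnt/new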

You could then use send/receive to copy all the snapshots, etc, over.  
Currently, that would need to be done one at a time, but there's 
discussion of adding a subvolume-recursive mode.
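
Per snapshot, the copy looks roughly like this; snapshots must be 
read-only to be sent, and after the first one you can use -p with the 
previous snapshot as parent, so only the differences get transferred 
(paths and names are placeholders):

  # First (full) send of a read-only snapshot:
  btrfs send /mnt/old/snaps/2015-06-28 | btrfs receive /mnt/new/snaps
  # Subsequent snapshots, sent incrementally against the previous one:
  btrfs send -p /mnt/old/snaps/2015-06-28 /mnt/old/snaps/2015-06-29 \
      | btrfs receive /mnt/new/snaps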

Tho while on the subject of snapshots, it should be noted that btrfs 
operations such as balance don't scale so well with tens of thousands of 
snapshots.  So the recommendation is to try to keep it to 250 snapshots 
or so per subvolume, and under 2000 snapshots total if possible, which at 
250 per subvolume would of course be 8 separate subvolumes.  You can go 
up to 3000 or so if absolutely necessary, but if the total gets anywhere 
near 10k, expect more problems in general, and dramatically increased 
memory and time requirements for balance, check, device replace/remove, 
etc.
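
If you want to see where you stand, btrfs subvolume list with the -s 
switch lists only snapshots, so a rough count is easy:

  # Count snapshots on the filesystem mounted at /mnt:
  btrfs subvolume list -s /mnt | wc -l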

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
