Wow, thank you so much for the detailed response, I really appreciate you taking the time to help me understand this. I definitely misunderstood the situation after reading that bug report. I'll explore these options. On Mon, Dec 9, 2019 at 2:27 AM Filipe Manana <fdmanana@xxxxxxxxx> wrote: > > On Sun, Dec 8, 2019 at 9:32 AM Kyle Ambroff-Kao <kyle@xxxxxxxxxxxxxx> wrote: > > > > As long as the target device has enough capacity for the total bytes > > of the source device, allow the replacement to occur. > > > > Just changing the size validation isn't enough though, since the > > rest of the replacement code just assumes that the source device is > > identical to the target device. The current code just blindly > > copies the total disk size from the source to the target. > > > > A btrfs resize <devid>:max could be performed, but we might as well > > just set the disk size for the new target device correctly in the > > first place before initiating a scrub, which is what this patch does. > > > > When initializing the target device, the size in bytes is calculated > > in the same way that btrfs_init_new_device does it. > > > > When the replace operation completes, btrfs_dev_replace_finishing no > > longer clobbers the target device size with the source device size. > > > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112741 > > Signed-off-by: Kyle Ambroff-Kao <kyle@xxxxxxxxxxxxxx> > > --- > > fs/btrfs/dev-replace.c | 11 +++++------ > > 1 file changed, 5 insertions(+), 6 deletions(-) > > > > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c > > index f639dde2a679..6a7a83ccab56 100644 > > --- a/fs/btrfs/dev-replace.c > > +++ b/fs/btrfs/dev-replace.c > > @@ -216,7 +216,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, > > > > > > if (i_size_read(bdev->bd_inode) < > > - btrfs_device_get_total_bytes(srcdev)) { > > + btrfs_device_get_bytes_used(srcdev)) { > > Hi, > > Unfortunately things are not that simple... > > The new device may have enough space, but that's not enough to guarantee > that the replace operation will succeed. > Consider the following example: > > #!/bin/bash -x > > losetup -d /dev/loop1 &> /dev/null > losetup -d /dev/loop2 &> /dev/null > losetup -d /dev/loop3 &> /dev/null > > rm -f /mnt/disk1 /mnt/disk2 /mnt/disk3 > fallocate -l 8G /mnt/disk1 > fallocate -l 8G /mnt/disk2 > fallocate -l 4G /mnt/disk3 > losetup /dev/loop1 /mnt/disk1 > losetup /dev/loop2 /mnt/disk2 > losetup /dev/loop3 /mnt/disk3 > > mkdir /mnt/test &> /dev/null > umount /mnt/test &> /dev/null > mkfs.btrfs -f /dev/loop1 /dev/loop2 > mount /dev/loop1 /mnt/test > > fallocate -l 1G /mnt/test/foo1 > fallocate -l 1G /mnt/test/foo2 > fallocate -l 1G /mnt/test/foo3 > fallocate -l 1G /mnt/test/foo4 > fallocate -l 1G /mnt/test/foo5 > fallocate -l 1G /mnt/test/foo6 > fallocate -l 1G /mnt/test/foo7 > fallocate -l 1G /mnt/test/foo8 > sync > rm -f /mnt/test/foo{2,3,4,5,6,7} > > umount /mnt/test > mount /dev/loop1 /mnt/test > sync # ensure empty block groups are removed > > btrfs filesystem du /mnt/test > btrfs filesystem df /mnt/test > > btrfs replace start -B /dev/loop2 /dev/loop3 /mnt/test > > umount /mnt/test > > # end of script > > > This will fail, despite the new device having a size of 4Gb which is more > than enough as there's little more than 2Gb used by the filesystem: > > (...) > + btrfs filesystem du /mnt/test > Total Exclusive Set shared Filename > 1.00GiB 1.00GiB - /mnt/test/foo1 > 1.00GiB 1.00GiB - /mnt/test/foo8 > 2.00GiB 2.00GiB 0.00B /mnt/test > > + btrfs filesystem df /mnt/test > Data, RAID0: total=4.85GiB, used=2.00GiB > System, RAID1: total=8.00MiB, used=16.00KiB > Metadata, RAID1: total=256.00MiB, used=112.00KiB > GlobalReserve, single: total=3.25MiB, used=0.00B > > + btrfs replace start -B /dev/loop2 /dev/loop3 /mnt/test > ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/test": Input/output error > (...) > > > If you look at dmesg/syslog you'll also see a lot of error messages regarding > writes beyond the target device's size, which is the replace operation fails > with error -EIO (which is -5): > > $ dmesg > (...) > [242142.682576] loop3: rw=1, want=8904704, limit=8388608 > [242142.682623] attempt to access beyond end of device > [242142.682625] loop3: rw=1, want=8904832, limit=8388608 > [242142.682671] attempt to access beyond end of device > [242142.682672] loop3: rw=1, want=8904960, limit=8388608 > [242142.682716] attempt to access beyond end of device > [242142.682718] loop3: rw=1, want=8905088, limit=8388608 > [242148.882448] BTRFS error (device loop1): > btrfs_scrub_dev(/dev/loop2, 2, /dev/loop3) failed -5 > [242148.903248] ------------[ cut here ]------------ > [242148.903375] WARNING: CPU: 1 PID: 22989 at > fs/btrfs/dev-replace.c:508 btrfs_dev_replace_by_ioctl+0x711/0x8b0 > [btrfs] > [242148.903383] Modules linked in: btrfs xor raid6_pq libcrc32c loop > intel_rapl_msr intel_rapl_common kvm_intel kvm bochs_drm > drm_vram_helper ttm irqbypass drm_kms_helper crct10dif_pclmul > crc32_pclmul ghash_clmulni_intel drm aesni_intel crypto_simd cryptd > joydev glue_helper evdev sg pcspkr parport_pc serio_raw button > qemu_fw_cfg ppdev lp parport ip_tables x_tables autofs4 ext4 > crc32c_generic crc16 mbcache jbd2 sd_mod virtio_scsi ata_generic > crc32c_intel virtio_pci virtio_ring psmouse virtio e1000 ata_piix > libata scsi_mod i2c_piix4 > [242148.903502] CPU: 1 PID: 22989 Comm: btrfs Tainted: G W > 5.4.0-rc8-btrfs-next-65 #1 > [242148.903509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 > [242148.903618] RIP: 0010:btrfs_dev_replace_by_ioctl+0x711/0x8b0 [btrfs] > [242148.903630] Code: 89 f7 e8 82 e7 fa ff 8b 45 b8 e9 8a fe ff ff 48 > c7 c2 83 ff ff ff 48 8b 4d c0 48 89 51 08 e9 25 fe ff ff 0f 0b e9 cd > fd ff ff <0f> 0b e9 68 fe ff ff 48 c7 c6 c0 31 9c c0 48 89 df e8 43 9f > 02 00 > [242148.903637] RSP: 0018:ffffa9c740803cb0 EFLAGS: 00010282 > [242148.903648] RAX: 00000000fffffffb RBX: ffff99d87ab7c000 RCX: > 0000000000000000 > [242148.903655] RDX: 0000000000000000 RSI: ffff99d87ab7ee68 RDI: > 0000000000000246 > [242148.903663] RBP: ffffa9c740803d10 R08: 0000000000000000 R09: > 0000000000000000 > [242148.903669] R10: 0000000000000000 R11: 0000000000000000 R12: > ffff99d8b35ef000 > [242148.903677] R13: 0000000000000000 R14: ffff99d87b38d800 R15: > ffff99d87ab7ee88 > [242148.903686] FS: 00007f1fc75cd8c0(0000) GS:ffff99d8b6a80000(0000) > knlGS:0000000000000000 > [242148.903694] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [242148.903701] CR2: 00007ffe01b4bec8 CR3: 00000001fca5a005 CR4: > 00000000003606e0 > [242148.903717] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [242148.903724] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [242148.903730] Call Trace: > [242148.903864] btrfs_ioctl+0x2928/0x37e0 [btrfs] > [242148.903894] ? __lock_acquire+0x37a/0x1e10 > [242148.903910] ? __lock_acquire+0x37a/0x1e10 > [242148.903961] ? do_sigaction+0xf3/0x240 > [242148.903992] ? do_vfs_ioctl+0xa2/0x700 > [242148.904003] do_vfs_ioctl+0xa2/0x700 > [242148.904017] ? do_sigaction+0xf3/0x240 > [242148.904051] ksys_ioctl+0x70/0x80 > [242148.904063] ? trace_hardirqs_off_thunk+0x1a/0x20 > [242148.904086] __x64_sys_ioctl+0x16/0x20 > [242148.904097] do_syscall_64+0x5c/0x250 > [242148.904114] entry_SYSCALL_64_after_hwframe+0x49/0xbe > [242148.904125] RIP: 0033:0x7f1fc63d5dd7 > [242148.904136] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 > 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 > 01 48 > [242148.904145] RSP: 002b:00007fffd8716078 EFLAGS: 00000202 ORIG_RAX: > 0000000000000010 > [242148.904154] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: > 00007f1fc63d5dd7 > [242148.904162] RDX: 00007fffd8716ee0 RSI: 00000000ca289435 RDI: > 0000000000000003 > [242148.904167] RBP: 0000000000000004 R08: 0000000000000000 R09: > 0000000000000000 > [242148.904173] R10: 0000000000000008 R11: 0000000000000202 R12: > 0000000000000000 > [242148.904181] R13: 0000000000000003 R14: 00007fffd87186a6 R15: > 00005555cb872050 > [242148.904234] irq event stamp: 1705866 > [242148.904246] hardirqs last enabled at (1705865): > [<ffffffff896c65d9>] _raw_spin_unlock_irqrestore+0x59/0x70 > [242148.904258] hardirqs last disabled at (1705866): > [<ffffffff88c048ea>] trace_hardirqs_off_thunk+0x1a/0x20 > [242148.904317] softirqs last enabled at (1705764): > [<ffffffff8949359e>] peernet2id+0x4e/0x70 > [242148.904328] softirqs last disabled at (1705762): > [<ffffffff8949357f>] peernet2id+0x2f/0x70 > [242148.904334] ---[ end trace f2fa14238388d729 ]--- > > > So device replace fails with -EIO and lots of errors in dmesg/syslog, > something that will mislead a user into thinking there's some problem > with the hardware. > Each device is using 2880634880 bytes (it's what btrfs_device_get_bytes_used() > returns) right before the replace operation, a value clearly smaller > then 4Gb (the size > of the new device). > > So, why does this happens? > Because device replace simply copies extents (both data and metadata) directly, > if something is at offset 7Gb in the source device, it tries to copy that extent > also to the offset 7Gb of the target device. This is very simple, reliable and > predictable, it actually decreases the chances for data/metadata loss. > > For extents already on disk before the device replace starts, the process > consists of iterating all chunks and then copy them from the source to > the target device. Once a chunk is processed we don't process it anymore. > > If any extent (data or metadata) is written during the replace procedure, > after we processed the corresponding chunk, the respective extent is still > copied to the target device - this is because during the initialization > of device replace we set up a mechanism that results in once a write is > performed to the source device, it is automatically forwarded to the target > device as well (using the same offset as the source device) - the key places > to see where this happens are: btrfs_dev_replace_start() and > __btrfs_map_block(). > > So, that's why we don't allow the target device to be smaller then the source > device - a device may have only 2Gb of space allocated to chunks for example, > but one 1Gb chunk may be at device offset 0 and the other 1Gb chunk at device > offset 6Gb, so replacing with a 4Gb device will never work. > > In order to allow a smaller device for device replace, I see two approaches: > > 1) The more complex one: have a mechanism to remap chunk offsets. This implies > building some mapping (in memory), so that for the above case for example, > if the first data chunk is at offset 1GB it can be mapped to 1GB in > the target > device as well, but for the second data chunk, we have to remap the offset > (something > 4Gb) to something <= 3Gb on the target device. I think > this may actually > be harder then it may seem. It's not only that copy part (and > dealing with incoming > writes while replace is in progress), because when replace > finishes, you also have > to update the mappings in the chunk tree for every chunk with the > new start offset, > which can take a while and require a lot of metadata space to be > reserved. In other > words, it will require significant changes to how device replace > copies extents just > for one use case. > > 2) A simple solution, but often less efficient: before starting the actual > replace operation, shrink the source device to the size of the > target device - > just use the existing btrfs_shrink_device(), which will relocate chunks > beyond the new size, and if there's not enough space it just returns -ENOSPC. > This means no changes to the actual way replace copies data - it > does extra IO, > due to the relocation but keeps things simple, and it should still > be significantly > more efficient then doing a device remove + device add operation, > maybe except if > all or most of the allocated chunks (in the device to be replaced) > cross or start > beyond an offset matching the new device's size. > > Also, since the shrink can take some time due to relocation of > chunks, we would need > to teach btrfs_shrink_device() to check for device replace cancel > requests as well. > And such request is detected, restore the device's size to the > original value. > > I think option 2 may actually be acceptable for an initial version. Option 1 is > complex and increases the risk for data loss. Also, for option 2, there's the > possible downside of requiring writes to the source device - one might > be replacing > it because the device is not healthy, writes into some regions are > failing, which > can prevent the shrink/relocation process from suceeding, in that case only > a device remove followed by a device add operation would work. > > Either way, we will also need to have test cases in fstests to cover > all possible > scenarios we can think of, including stress tests where we replace the > device with > a smaller one and write/delete/create to files while replace is in > progress. Then > for each scenario scrub the devices, run fsck, etc, to verify > everything is fine. > If you decide to go for option 2, also compare performance against a > device remove > + device add operation. > > Thanks. > > > > btrfs_err(fs_info, > > "target device is smaller than source device!"); > > ret = -EINVAL; > > @@ -243,8 +243,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, > > device->io_width = fs_info->sectorsize; > > device->io_align = fs_info->sectorsize; > > device->sector_size = fs_info->sectorsize; > > - device->total_bytes = btrfs_device_get_total_bytes(srcdev); > > - device->disk_total_bytes = btrfs_device_get_disk_total_bytes(srcdev); > > + device->total_bytes = round_down( > > + i_size_read(bdev->bd_inode), > > + fs_info->sectorsize); > > + device->disk_total_bytes = device->total_bytes; > > device->bytes_used = btrfs_device_get_bytes_used(srcdev); > > device->commit_total_bytes = srcdev->commit_total_bytes; > > device->commit_bytes_used = device->bytes_used; > > @@ -671,9 +673,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, > > memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp)); > > memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid)); > > memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid)); > > - btrfs_device_set_total_bytes(tgt_device, src_device->total_bytes); > > - btrfs_device_set_disk_total_bytes(tgt_device, > > - src_device->disk_total_bytes); > > btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used); > > tgt_device->commit_bytes_used = src_device->bytes_used; > > > > -- > > 2.20.1 > > > > > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.”
