On Fri, Nov 20, 2015 at 02:12:26PM +0000, fdmanana@xxxxxxxxxx wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> During the final phase of a device replace operation, I ran into a
> transaction abort that resulted in the following trace:
>
> [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
> [23919.664742] BTRFS: Transaction aborted (error -2)
> [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
> [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
> [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
> [23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
> [23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
> [23919.696716] Call Trace:
> [23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79
> [23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
> [23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
> [23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
> [23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
> [23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
> [23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
> [23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
> [23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
> [23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
> [23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
> [23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
> [23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
> [23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
> [23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
> [23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
> [23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
> [23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
> [23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
> [23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
> [23919.733330] ---[ end trace 166ef301a335832a ]---
>
> This is due to a race between device replace and chunk allocation, which
> the following diagram illustrates:
>
> CPU 1 CPU 2
>
> btrfs_dev_replace_finishing()
>
> at this point
> dev_replace->tgtdev->devid ==
> BTRFS_DEV_REPLACE_DEVID (0ULL)
>
> ...
>
> btrfs_start_transaction()
> btrfs_commit_transaction()
>
> btrfs_fallocate()
> btrfs_alloc_data_chunk_ondemand()
> btrfs_join_transaction()
> --> starts a new transaction
> do_chunk_alloc()
> lock fs_info->chunk_mutex
> btrfs_alloc_chunk()
> --> creates extent map for
> the new chunk with
> em->bdev->map->stripes[i]->dev->devid
> == X (X > 0)
> --> extent map is added to
> fs_info->mapping_tree
> --> initial phase of bg A
> allocation completes
> unlock fs_info->chunk_mutex
>
> lock fs_info->chunk_mutex
>
> btrfs_dev_replace_update_device_in_mapping_tree()
> --> iterates fs_info->mapping_tree and
> replaces the device in every extent
> map's map->stripes[] with
> dev_replace->tgtdev, which still has
> an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
>
> btrfs_end_transaction()
> btrfs_create_pending_block_groups()
> --> starts final phase of
> bg A creation (update device,
> extent, and chunk trees, etc)
> btrfs_finish_chunk_alloc()
>
> btrfs_update_device()
> --> attempts to update a device
> item with ID == 0ULL
> (BTRFS_DEV_REPLACE_DEVID)
> which is the current ID of
> bg A's
> em->bdev->map->stripes[i]->dev->devid
> --> doesn't find such item
> returns -ENOENT
> --> the device id should have been X
> and not 0ULL
>
> got -ENOENT from
> btrfs_finish_chunk_alloc()
> and aborts current transaction
>
> finishes setting up the target device,
> namely it sets tgtdev->devid to the value
> of srcdev->devid, which is X (and X > 0)
>
> So fix this by taking the device list mutex when processing the chunk's
> extent map stripes to update the device items.
>
> This happened while running fstest btrfs/071.
The code looks good to me.
Reviewed-by: Liu Bo <bo.li.liu@xxxxxxxxxx>
Thanks,
-liubo
>
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
> ---
>
> V2: Made the attribution of a new id from the old device to happen
> after replacing the device for the extent maps with the new device,
> as obviously it was missing before.
>
> V3: Make it really race free. The previous approach was still racy, but in a
> different way. Need to really synchronize the finishing phase of device
> replace that sets up the new device with the final phase of chunk allocation
> that processes the stripes and updates the device items.
>
> V4: Moved critical section a bit further, as there's further usage of a
> stripe's device.
>
> fs/btrfs/volumes.c | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 45f2025..2290469 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4827,20 +4827,32 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
> goto out;
> }
>
> + /*
> + * Take the device list mutex to prevent races with the final phase of
> + * a device replace operation that replaces the device object associated
> + * with the map's stripes, because the device object's id can change
> + * at any time during that final phase of the device replace operation
> + * (dev-replace.c:btrfs_dev_replace_finishing()).
> + */
> + mutex_lock(&chunk_root->fs_info->fs_devices->device_list_mutex);
> for (i = 0; i < map->num_stripes; i++) {
> device = map->stripes[i].dev;
> dev_offset = map->stripes[i].physical;
>
> ret = btrfs_update_device(trans, device);
> if (ret)
> - goto out;
> + break;
> ret = btrfs_alloc_dev_extent(trans, device,
> chunk_root->root_key.objectid,
> BTRFS_FIRST_CHUNK_TREE_OBJECTID,
> chunk_offset, dev_offset,
> stripe_size);
> if (ret)
> - goto out;
> + break;
> + }
> + if (ret) {
> + mutex_unlock(&chunk_root->fs_info->fs_devices->device_list_mutex);
> + goto out;
> }
>
> stripe = &chunk->stripe;
> @@ -4853,6 +4865,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
> memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
> stripe++;
> }
> + mutex_unlock(&chunk_root->fs_info->fs_devices->device_list_mutex);
>
> btrfs_set_stack_chunk_length(chunk, chunk_size);
> btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html