On Fri, May 20, 2016 at 12:44 AM, <fdmanana@xxxxxxxxxx> wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> When we do a device replace, for each device extent we find from the
> source device, we set the corresponding block group to readonly mode to
> prevent writes into it from happening while we are copying the device
> extent from the source to the target device. However just before we set
> the block group to readonly mode some concurrent task might have already
> allocated an extent from it or decided it could perform a nocow write
> into one of its extents, which can make the device replace process
> miss copying an extent since it uses the extent tree's commit root to
> search for extents and only once it finishes searching for all extents
> belonging to the block group does it set the left cursor to the logical
> end address of the block group - this is a problem if the respective
> ordered extents finish while we are searching for extents using the
> extent tree's commit root and no transaction commit happens while we
> are iterating the tree, since it's the delayed references created by the
> ordered extents (when they complete) that insert the extent items into
> the extent tree (using the non-commit root of course).
>
> Example:
>
>           CPU 1                                CPU 2
>
>  btrfs_dev_replace_start()
>    btrfs_scrub_dev()
>      scrub_enumerate_chunks()
>        --> finds device extent belonging
>            to block group X
>
>                        <transaction N starts>
>
>                                    starts buffered write
>                                    against some inode
>
>                                    writepages is run against
>                                    that inode forcing delalloc
>                                    to run
>
>                                    btrfs_writepages()
>                                      extent_writepages()
>                                        extent_write_cache_pages()
>                                          __extent_writepage()
>                                            writepage_delalloc()
>                                              run_delalloc_range()
>                                                cow_file_range()
>                                                  btrfs_reserve_extent()
>                                                    --> allocates an extent
>                                                        from block group X
>                                                        (which is not yet
>                                                         in RO mode)
>                                                  btrfs_add_ordered_extent()
>                                                    --> creates ordered extent Y
>                                        flush_epd_write_bio()
>                                          --> bio against the extent from
>                                              block group X is submitted
>
>  btrfs_inc_block_group_ro(bg X)
>    --> sets block group X to readonly
>
>  scrub_chunk(bg X)
>    scrub_stripe(device extent from srcdev)
>      --> keeps searching for extent items
>          belonging to the block group using
>          the extent tree's commit root
>      --> it never blocks due to
>          fs_info->scrub_pause_req as no
>          one tries to commit transaction N
>      --> copies all extents found from the
>          source device into the target device
>      --> finishes search loop
>
>                                    bio completes
>
>                                    ordered extent Y completes
>                                    and creates delayed data
>                                    reference which will add an
>                                    extent item to the extent
>                                    tree when run (typically
>                                    at transaction commit time)
>
>                                    --> so the task doing the
>                                        scrub/device replace
>                                        at CPU 1 misses this
>                                        and does not copy this
>                                        extent into the new/target
>                                        device
>
>  btrfs_dec_block_group_ro(bg X)
>    --> turns block group X back to RW mode
>
>  dev_replace->cursor_left is set to the
>  logical end offset of block group X
>
> So fix this by waiting for all cow and nocow writes after setting a block
> group to readonly mode.
>
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>

Reviewed-by: Josef Bacik <jbacik@xxxxxx>

Thanks!
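
For anyone following along, the ordering problem in the quoted description can be sketched as a small single-threaded userspace model. This is only an illustration, not the patch: every model_* name below is hypothetical and none of it is kernel code. It also assumes that, after the wait, something makes the queued extent items visible through the commit root (modeled here as a commit); the quoted text does not spell that step out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one block group; not a kernel type. */
struct model_block_group {
    bool read_only;         /* set by the replace task before copying */
    int pending_ordered;    /* extents allocated, ordered extent not yet done */
    int delayed_refs;       /* extent items queued, not yet in the commit root */
    int committed_extents;  /* extent items visible through the commit root */
};

/* Writer side: allocation is refused once the block group is read-only. */
static bool model_reserve_extent(struct model_block_group *bg)
{
    if (bg->read_only)
        return false;
    bg->pending_ordered++;
    return true;
}

/* Ordered extent completion only queues a delayed reference. */
static void model_finish_ordered(struct model_block_group *bg)
{
    assert(bg->pending_ordered > 0);
    bg->pending_ordered--;
    bg->delayed_refs++;
}

/* Assumption: a commit is what moves queued items into the commit root. */
static void model_commit_transaction(struct model_block_group *bg)
{
    bg->committed_extents += bg->delayed_refs;
    bg->delayed_refs = 0;
}

/*
 * Replace side. Returns how many extents the scan of the commit root saw,
 * i.e. how many were copied to the target device.
 */
static int model_replace_block_group(struct model_block_group *bg,
                                     bool wait_for_writes)
{
    int copied;

    bg->read_only = true;

    if (wait_for_writes) {
        /* Drain writes that started before the block group went RO. */
        while (bg->pending_ordered > 0)
            model_finish_ordered(bg);
        model_commit_transaction(bg);
    }

    /* The scan only sees what is in the commit root right now. */
    copied = bg->committed_extents;

    /* Without the wait, the in-flight write completes after the scan. */
    while (bg->pending_ordered > 0)
        model_finish_ordered(bg);

    bg->read_only = false;
    return copied;
}
```

Calling model_reserve_extent() once and then model_replace_block_group() with wait_for_writes set to false leaves one extent queued that the scan never saw, which is the missed copy from the diagram. With the flag set to true, the same extent is counted by the scan.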
Josef
