On Fri, May 20, 2016 at 4:30 PM, Josef Bacik <josef@xxxxxxxxxxxxxx> wrote: > On Fri, May 20, 2016 at 12:45 AM, <fdmanana@xxxxxxxxxx> wrote: >> From: Filipe Manana <fdmanana@xxxxxxxx> >> >> While iterating and copying extents from the source device, the device >> replace code keeps adjusting a left cursor that is used to make sure that >> once we finish processing a device extent, any future writes to extents >> from the corresponding block group will get into both the source and >> target devices. This left cursor is also used for resuming the device >> replace operation at mount time. >> >> However using this left cursor to decide whether writes go into both >> devices or only the source device is not enough to guarantee we don't >> miss copying extents into the target device. There are two cases where >> the current approach fails. The first one is related to when there are >> holes in the device and they get allocated for new block groups while >> the device replace operation is iterating the device extents (more on >> this explained below). The second one is that when that loop over the >> device extents finishes, we start dellaloc, wait for all ordered extents >> and then commit the current transaction, we might have got new block >> groups allocated that are now using a device extent that has an offset >> greater then or equals to the value of the left cursor, in which case >> writes to extents belonging to these new block groups will get issued >> only to the source device. >> >> For the first case where the current approach of using a left cursor >> fails, consider the source device currently has the following layout: >> >> [ extent bg A ] [ hole, unallocated space ] [extent bg B ] >> 3Gb 4Gb 5Gb >> >> While we are iterating the device extents from the source device using >> the commit root of the device tree, the following happens: >> >> CPU 1 CPU 2 >> >> <we are at transaction N> >> >> scrub_enumerate_chunks() >> --> searches the device tree for >> extents belonging to the source >> device using the device tree's >> commit root >> --> 1st iteration finds extent belonging to >> block group A >> >> --> sets block group A to RO mode >> (btrfs_inc_block_group_ro) >> >> --> sets cursor left to found_key.offset >> which is 3Gb >> >> --> scrub_chunk() starts >> copies all allocated extents from >> block group's A stripe at source >> device into target device >> >> btrfs_alloc_chunk() >> --> allocates device extent >> in the range [4Gb, 5Gb[ >> from the source device for >> a new block group C >> >> extent allocated from block >> group C for a direct IO, >> buffered write or btree node/leaf >> >> extent is written to, perhaps >> in response to a writepages() >> call from the VM or directly >> through direct IO >> >> the write is made only against >> the source device and not against >> the target device because the >> extent's offset is in the interval >> [4Gb, 5Gb[ which is larger then >> the value of cursor_left (3Gb) >> >> --> scrub_chunks() finishes >> >> --> updates left cursor from 3Gb to >> 4Gb >> >> --> btrfs_dec_block_group_ro() sets >> block group A back to RW mode >> >> <we are still at transaction N> >> >> --> 2nd iteration finds extent belonging to >> block group B - it did not find the new >> extent in the range [4Gb, 5Gb[ for block >> group C because we are using the device >> tree's commit root or even because the >> block group's items are not all yet >> inserted in the respective btrees, that is, >> the block group is still attached to some >> transaction handle's new_bgs list and >> btrfs_create_pending_block_groups() was >> not called yet against that transaction >> handle, so the device extent items were >> not yet inserted into the devices tree >> >> <we are still at transaction N> >> >> --> so we end not copying anything from the newly >> allocated device extent from the source device >> to the target device >> >> So fix this by making __btrfs_map_block() always redirect writes to the >> target device as well, independently of the left cursor's value. With >> this change the left cursor is now used only for the purpose of tracking >> progress and allow a mount operation to resume a device replace. >> >> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx> > > So if we commit the transaction and the left cursor points at the 4gib > chunk and then we immediately crash, when we start back up we will > re-copy the new extents onto the target device. Or we don't even have > to crash, if we commit the transaction before the 2nd iteration begins > then we will find the new extent and copy it again. Indeed, while I thought about that second case and the result of double copying the same extent, I forgot to mention it in the change log. I went for the dead simple solution as I couldn't find any solution that was also simple (and equally correct, not racy) but avoided redirecting all writes to both devices regardless of the left cursor's current value. Thanks Josef! > I guess this > isn't a problem per-se, but maybe it would be good to keep track of > when we created the new device extent and see if we already have it on > the target device so we can avoid copying stuff we already have. > Anyway this seems ok > > Reviewed-by: Josef Bacik <jbacik@xxxxxx> > > Thanks, > > Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html
