Re: [PATCH 6/6] Btrfs: fix race between device replace and chunk allocation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, May 20, 2016 at 12:45 AM,  <fdmanana@xxxxxxxxxx> wrote:
> From: Filipe Manana <fdmanana@xxxxxxxx>
>
> While iterating and copying extents from the source device, the device
> replace code keeps adjusting a left cursor that is used to make sure that
> once we finish processing a device extent, any future writes to extents
> from the corresponding block group will get into both the source and
> target devices. This left cursor is also used for resuming the device
> replace operation at mount time.
>
> However using this left cursor to decide whether writes go into both
> devices or only the source device is not enough to guarantee we don't
> miss copying extents into the target device. There are two cases where
> the current approach fails. The first one is related to when there are
> holes in the device and they get allocated for new block groups while
> the device replace operation is iterating the device extents (more on
> this explained below). The second one is that when that loop over the
> device extents finishes, we start dellaloc, wait for all ordered extents
> and then commit the current transaction, we might have got new block
> groups allocated that are now using a device extent that has an offset
> greater then or equals to the value of the left cursor, in which case
> writes to extents belonging to these new block groups will get issued
> only to the source device.
>
> For the first case where the current approach of using a left cursor
> fails, consider the source device currently has the following layout:
>
>   [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
>   3Gb             4Gb                         5Gb
>
> While we are iterating the device extents from the source device using
> the commit root of the device tree, the following happens:
>
>         CPU 1                                            CPU 2
>
>                       <we are at transaction N>
>
>   scrub_enumerate_chunks()
>     --> searches the device tree for
>         extents belonging to the source
>         device using the device tree's
>         commit root
>     --> 1st iteration finds extent belonging to
>         block group A
>
>         --> sets block group A to RO mode
>             (btrfs_inc_block_group_ro)
>
>         --> sets cursor left to found_key.offset
>             which is 3Gb
>
>         --> scrub_chunk() starts
>             copies all allocated extents from
>             block group's A stripe at source
>             device into target device
>
>                                                            btrfs_alloc_chunk()
>                                                              --> allocates device extent
>                                                                  in the range [4Gb, 5Gb[
>                                                                  from the source device for
>                                                                  a new block group C
>
>                                                            extent allocated from block
>                                                            group C for a direct IO,
>                                                            buffered write or btree node/leaf
>
>                                                            extent is written to, perhaps
>                                                            in response to a writepages()
>                                                            call from the VM or directly
>                                                            through direct IO
>
>                                                            the write is made only against
>                                                            the source device and not against
>                                                            the target device because the
>                                                            extent's offset is in the interval
>                                                            [4Gb, 5Gb[ which is larger then
>                                                            the value of cursor_left (3Gb)
>
>         --> scrub_chunks() finishes
>
>         --> updates left cursor from 3Gb to
>             4Gb
>
>         --> btrfs_dec_block_group_ro() sets
>             block group A back to RW mode
>
>                              <we are still at transaction N>
>
>     --> 2nd iteration finds extent belonging to
>         block group B - it did not find the new
>         extent in the range [4Gb, 5Gb[ for block
>         group C because we are using the device
>         tree's commit root or even because the
>         block group's items are not all yet
>         inserted in the respective btrees, that is,
>         the block group is still attached to some
>         transaction handle's new_bgs list and
>         btrfs_create_pending_block_groups() was
>         not called yet against that transaction
>         handle, so the device extent items were
>         not yet inserted into the devices tree
>
>                              <we are still at transaction N>
>
>         --> so we end not copying anything from the newly
>             allocated device extent from the source device
>             to the target device
>
> So fix this by making __btrfs_map_block() always redirect writes to the
> target device as well, independently of the left cursor's value. With
> this change the left cursor is now used only for the purpose of tracking
> progress and allow a mount operation to resume a device replace.
>
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>

So if we commit the transaction and the left cursor points at the 4gib
chunk and then we immediately crash, when we start back up we will
re-copy the new extents onto the target device.  Or we don't even have
to crash, if we commit the transaction before the 2nd iteration begins
then we will find the new extent and copy it again.  I guess this
isn't a problem per-se, but maybe it would be good to keep track of
when we created the new device extent and see if we already have it on
the target device so we can avoid copying stuff we already have.
Anyway this seems ok

Reviewed-by: Josef Bacik <jbacik@xxxxxx>

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux