Re: [PATCH] Btrfs: fix race between removing a dev and writing sbs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Aug 9, 2013 at 2:07 PM, Stefan Behrens
<sbehrens@xxxxxxxxxxxxxxxx> wrote:
> On Thu,  8 Aug 2013 21:00:52 +0100, Filipe David Borba Manana wrote:
>> Since all code paths that update the number of devices in the
>> super copy (fs_info->super_copy) first lock the device list
>> (fs_info->fs_devices->device_list_mutex), and write_all_supers()
>> also needs to lock the devices list mutex, make write_all_supers()
>> read the number of devices from the super copy after it locks
>> the device list mutex (and before unlocking it of course).
>>
>> The only code path that doesn't lock the device list mutex
>> before updating the number of devices in the super copy is
>> disk-io.c:next_root_backup(), called by open_ctree() during
>> mount time where concurrency issues can't happen.
>>
>> Signed-off-by: Filipe David Borba Manana <fdmanana@xxxxxxxxx>
>> ---
>>  fs/btrfs/disk-io.c |    2 +-
>>  fs/btrfs/volumes.c |   11 ++++-------
>>  2 files changed, 5 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 254cdc8..c4b24c7 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3313,7 +3313,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>       int total_errors = 0;
>>       u64 flags;
>>
>> -     max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>       do_barriers = !btrfs_test_opt(root, NOBARRIER);
>>       backup_super_roots(root->fs_info);
>>
>> @@ -3322,6 +3321,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>
>>       mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>>       head = &root->fs_info->fs_devices->devices;
>> +     max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>
>>       if (do_barriers) {
>>               ret = barrier_all_devices(root->fs_info);
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 090f57c..eddf386 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1568,11 +1568,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>       if (ret)
>>               goto error_undo;
>>
>> -     /*
>> -      * TODO: the superblock still includes this device in its num_devices
>> -      * counter although write_all_supers() is not locked out. This
>> -      * could give a filesystem state which requires a degraded mount.
>> -      */
>>       ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
>
> The problem that I had seen when I added that comment is something
> different than what you are addressing.
>
> The call to btrfs_rm_dev_item() is the place where the device is removed
> in the filesystem device tree. The transaction is commited.

So, it would only be super correct if the call to btrfs_rm_dev_item()
(and the following code) is run inside the critical section delimited
by the device list mutex (and have the super_copy num devices updated
inside that section too, like I did).

Other than a potentially much longer critical section, or mutex
deadlock (because btrfs_scrub_cancel locks scrub_lock), any reason to
not do it?

>
> root->fs_info->super_copy is not updated and still includes the device
> that is not part of the device tree anymore.
>
> 19 lines later, the device_list_mutex is acquired. Until then, nobody
> prevents write_all_supers() to write the superblock to disk. This means,
> until then, you can create a state on disk with an updated device tree
> and a num_devices value which is too high by one.
>
> If you now crash or the power drops, the on-disk state is not
> consistent. However, this is not a severe problem. btrfs_rm_device()
> relocates all chunks that are located on the removed device. On next
> mount, at first the device items are read which do not include the
> deleted device anymore, afterwards the chunks are checked, whether they
> reference a device that is not present. And this is not the case.
> Therefore this situation is not a severe problem and my comment was
> wrong that says "could require a degraded mount".
>
> But the field num_devices in the superblock will stay wrong for the
> lifetime of the filesystem, causing malfunction of the ioctl
> BTRFS_IOC_DEVICES_READY, and potentially causing trouble in the future
> when somebody adds code that relies on fs_devices->total_devices being
> correct.
>
> It's simply not correct like it is now. And your patch doesn't fix the
> issue that the TODO comment describes.

Thanks for the explanation, very helpful.

Indeed, it doesn't fix the issue you described. I thought more about
fixing the following issue:

1) Write super gets a number of N devices from super_copy, so it will
not panic if it fails to write dbs for N - 1 devices;

2) Then tries to acquire device_list_mutex, but blocks because
btrfs_rm_device() got it first

3) btrfs_rm_device() removes the device from the list, and does all
those things it does and then unlocks the dev list mutex;

4) write_all_supers() acquires the mutex, iterates over all devices in
the list and gets N - 1 errors (failed to write db to all devices)

5) Because N - 1 is less than N, it thinks all is ok, when it's not
because there's actually only N - 1 devices now. Therefore the
BUG_ON() won't get executed.

This is more likely to happen for a small number of devices only (2 ->
1 for e.g.). I will revert re-add your comment, as this fixes
something different.

>
>
>>       if (ret)
>>               goto error_undo;
>> @@ -1588,7 +1583,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>       /*
>>        * the device list mutex makes sure that we don't change
>>        * the device list while someone else is writing out all
>> -      * the device supers.
>> +      * the device supers. Whoever is writing all supers, should
>> +      * lock the device list mutex before getting the number of
>> +      * devices in the super block (super_copy).
>>        */
>>
>>       cur_devices = device->fs_devices;
>> @@ -1612,10 +1609,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>               device->fs_devices->open_devices--;
>>
>>       call_rcu(&device->rcu, free_device);
>> -     mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>
>>       num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>       btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
>> +     mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>
>>       if (cur_devices->open_devices == 0) {
>>               struct btrfs_fs_devices *fs_devices;
>>
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux