On Fri, Aug 9, 2013 at 2:07 PM, Stefan Behrens
<sbehrens@xxxxxxxxxxxxxxxx> wrote:
> On Thu, 8 Aug 2013 21:00:52 +0100, Filipe David Borba Manana wrote:
>> Since all code paths that update the number of devices in the
>> super copy (fs_info->super_copy) first lock the device list
>> (fs_info->fs_devices->device_list_mutex), and write_all_supers()
>> also needs to lock the devices list mutex, make write_all_supers()
>> read the number of devices from the super copy after it locks
>> the device list mutex (and before unlocking it of course).
>>
>> The only code path that doesn't lock the device list mutex
>> before updating the number of devices in the super copy is
>> disk-io.c:next_root_backup(), called by open_ctree() during
>> mount time where concurrency issues can't happen.
>>
>> Signed-off-by: Filipe David Borba Manana <fdmanana@xxxxxxxxx>
>> ---
>> fs/btrfs/disk-io.c | 2 +-
>> fs/btrfs/volumes.c | 11 ++++-------
>> 2 files changed, 5 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 254cdc8..c4b24c7 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3313,7 +3313,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>> int total_errors = 0;
>> u64 flags;
>>
>> - max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>> do_barriers = !btrfs_test_opt(root, NOBARRIER);
>> backup_super_roots(root->fs_info);
>>
>> @@ -3322,6 +3321,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>
>> mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>> head = &root->fs_info->fs_devices->devices;
>> + max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>
>> if (do_barriers) {
>> ret = barrier_all_devices(root->fs_info);
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 090f57c..eddf386 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1568,11 +1568,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>> if (ret)
>> goto error_undo;
>>
>> - /*
>> - * TODO: the superblock still includes this device in its num_devices
>> - * counter although write_all_supers() is not locked out. This
>> - * could give a filesystem state which requires a degraded mount.
>> - */
>> ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
>
> The problem that I had seen when I added that comment is something
> different than what you are addressing.
>
> The call to btrfs_rm_dev_item() is the place where the device is removed
> in the filesystem device tree. The transaction is commited.
So, it would only be super correct if the call to btrfs_rm_dev_item()
(and the following code) is run inside the critical section delimited
by the device list mutex (and have the super_copy num devices updated
inside that section too, like I did).
Other than a potentially much longer critical section, or mutex
deadlock (because btrfs_scrub_cancel locks scrub_lock), any reason to
not do it?
>
> root->fs_info->super_copy is not updated and still includes the device
> that is not part of the device tree anymore.
>
> 19 lines later, the device_list_mutex is acquired. Until then, nobody
> prevents write_all_supers() to write the superblock to disk. This means,
> until then, you can create a state on disk with an updated device tree
> and a num_devices value which is too high by one.
>
> If you now crash or the power drops, the on-disk state is not
> consistent. However, this is not a severe problem. btrfs_rm_device()
> relocates all chunks that are located on the removed device. On next
> mount, at first the device items are read which do not include the
> deleted device anymore, afterwards the chunks are checked, whether they
> reference a device that is not present. And this is not the case.
> Therefore this situation is not a severe problem and my comment was
> wrong that says "could require a degraded mount".
>
> But the field num_devices in the superblock will stay wrong for the
> lifetime of the filesystem, causing malfunction of the ioctl
> BTRFS_IOC_DEVICES_READY, and potentially causing trouble in the future
> when somebody adds code that relies on fs_devices->total_devices being
> correct.
>
> It's simply not correct like it is now. And your patch doesn't fix the
> issue that the TODO comment describes.
Thanks for the explanation, very helpful.
Indeed, it doesn't fix the issue you described. I thought more about
fixing the following issue:
1) Write super gets a number of N devices from super_copy, so it will
not panic if it fails to write dbs for N - 1 devices;
2) Then tries to acquire device_list_mutex, but blocks because
btrfs_rm_device() got it first
3) btrfs_rm_device() removes the device from the list, and does all
those things it does and then unlocks the dev list mutex;
4) write_all_supers() acquires the mutex, iterates over all devices in
the list and gets N - 1 errors (failed to write db to all devices)
5) Because N - 1 is less than N, it thinks all is ok, when it's not
because there's actually only N - 1 devices now. Therefore the
BUG_ON() won't get executed.
This is more likely to happen for a small number of devices only (2 ->
1 for e.g.). I will revert re-add your comment, as this fixes
something different.
>
>
>> if (ret)
>> goto error_undo;
>> @@ -1588,7 +1583,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>> /*
>> * the device list mutex makes sure that we don't change
>> * the device list while someone else is writing out all
>> - * the device supers.
>> + * the device supers. Whoever is writing all supers, should
>> + * lock the device list mutex before getting the number of
>> + * devices in the super block (super_copy).
>> */
>>
>> cur_devices = device->fs_devices;
>> @@ -1612,10 +1609,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>> device->fs_devices->open_devices--;
>>
>> call_rcu(&device->rcu, free_device);
>> - mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>
>> num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>> btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
>> + mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>
>> if (cur_devices->open_devices == 0) {
>> struct btrfs_fs_devices *fs_devices;
>>
>
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html