On 2019/11/19 6:05 PM, Anand Jain wrote:
> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>> [PROBLEM]
>> A degraded btrfs mount will fall back to the SINGLE profile if there
>> are not enough devices:
>
> It's better to keep it like this for now, until there is a fix for the
> write hole. Otherwise, hitting the write hole bug in the case of degraded
> raid1 will be more prevalent.
The write hole is a problem for RAID5/6, not for the degraded chunk
feature itself.
Furthermore, this design tries to avoid allocating chunks on missing
devices.
So even for a 3-device RAID5, new chunks will be allocated using only the
existing devices (as 2-device RAID5), so no new write hole is introduced.
>
> I proposed an RFC a long time ago [1] (in there, there is also a
> commit id which turned the degraded raid1 profile into the single
> profile, without much write-up on it).
>
> [1] [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks
My point with this patchset is:
- Create regular chunks if we have enough devices
- Create degraded chunks only when we don't have enough devices
I guess you missed the point of my preparation patches; your patches
don't do as good a job of avoiding missing devices.
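
To make the policy concrete, here is a minimal standalone sketch of the
selection order. All names in it are illustrative, not actual btrfs code;
the real patch implements this via the two gather_dev_holes() calls shown
in the diff below:

/* Sketch: prefer present rw devices; fall back to missing devices
 * only when the profile's minimum can't be met otherwise. */
#include <stdio.h>

struct dev_slot {
        int present;            /* 0 = missing device */
        long long free_bytes;
};

/* picked[] must have room for ndevs entries. Returns the number of
 * devices chosen, or -1 if even the fallback can't meet devs_min. */
static int pick_devices(const struct dev_slot *devs, int ndevs,
                        int devs_min, int *picked)
{
        int n = 0;

        /* Step 1: gather holes from present (rw) devices only. */
        for (int i = 0; i < ndevs; i++)
                if (devs[i].present && devs[i].free_bytes > 0)
                        picked[n++] = i;

        /* Step 2: last resort -- include missing devices, so the new
         * chunk keeps its RAID profile (degraded) instead of being
         * reduced to SINGLE. */
        if (n < devs_min)
                for (int i = 0; i < ndevs && n < devs_min; i++)
                        if (!devs[i].present)
                                picked[n++] = i;

        return n >= devs_min ? n : -1;
}

int main(void)
{
        /* A 2-device RAID1 mounted degraded: one device missing. */
        const struct dev_slot devs[] = { { 1, 1LL << 30 }, { 0, 0 } };
        int picked[2];
        int n = pick_devices(devs, 2, 2, picked);

        printf("%s\n", n == 2 ? "degraded RAID1 chunk" : "profile reduced");
        return 0;
}

For a 3-device RAID1 with one device missing, step 1 already finds 2
usable devices, so step 2 never runs and no degraded chunk is created.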
>
> Similarly, the patch related to the reappearing missing device [2]
> falls under the same category. I will push for its integration after
> the write hole fix.
>
> [2] [PATCH] btrfs: handle dynamically reappearing missing device
> (test case 154).
That's a different case, and I don't see how it affects this feature.
>
> If you look closely enough, the original author quite nicely made
> sure the write hole bug is very difficult to hit. These fixes
> would make it easy to hit. So it's better to work on the write hole
> first.
If you're talking about RAID5/6, you're talking in the wrong thread.
Go implement a write-ahead log for RAID5/6, or mark all degraded
RAID5/6 chunks read-only at mount time.
>
> I am trying to fix the write hole. The first attempt had limited success
> (it works fine in two-disk raid1 only). Now I'm trying other ways to fix it.
>
>> # mkfs.btrfs -f /dev/test/scratch[12] -m raid1 -d raid1
>> # wipefs -fa /dev/test/scratch2
>> # mount -o degraded /dev/test/scratch1 /mnt/btrfs
>> # fallocate -l 1G /mnt/btrfs/foobar
>> # btrfs ins dump-tree -t chunk /dev/test/scratch1
>>         item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 1674575872) itemoff 15511 itemsize 80
>>                 length 536870912 owner 2 stripe_len 65536 type DATA
>> New data chunks fall back to SINGLE.
>>
>> If the user doesn't balance those SINGLE chunks, then even with the
>> missing device replaced, the fs is no longer fully RAID1, and a
>> missing device can break the tolerance.
>
> As has been discussed quite a lot before, the current
> re-silver/recovery approach for degraded-raid1 (with offload to Single)
> requires a balance. It's kind of known.
I'd call such "well-known" behavior BS.
Every other raid1 implementation can accept a single-device RAID1 and
resilver itself into a full RAID1 setup once more devices are added.
But for BTRFS you're calling the SINGLE profile fallback "well-known"?
It's "well-known" because it's not working properly; that's exactly why
I'm trying to fix it.
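
(For reference, the manual step currently forced on users after replacing
the missing device is a convert balance; the mount point below is just an
example:

# btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/btrfs

The "soft" modifier skips chunks that already match the target profile,
so only the fallback SINGLE chunks get rewritten.)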
>
> Thanks, Anand
>
>
>> [CAUSE]
>> The cause is pretty simple: when mounted degraded, missing devices
>> can't be used for chunk allocation.
>> Thus btrfs has to fall back to the SINGLE profile.
>>
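To be explicit about the mechanism: the profile reducer only counts
rw_devices, so a degraded 2-device RAID1 looks like a 1-device fs and the
RAID1 bit gets masked out. A simplified standalone model (the flag value
is made up for illustration, not the real btrfs constant):

#include <stdio.h>

/* Made-up flag for illustration only. */
#define MODEL_RAID1 (1ULL << 4)

/* Simplified model of btrfs_reduce_alloc_profile(): RAID1 needs at
 * least 2 devices; if num_devices counts only rw devices, a degraded
 * 2-device RAID1 looks like 1 device and is reduced to SINGLE. */
static unsigned long long reduce_profile(unsigned long long flags,
                                         unsigned long long num_devices)
{
        if (num_devices < 2)
                flags &= ~MODEL_RAID1;  /* no profile bits == SINGLE */
        return flags;
}

int main(void)
{
        unsigned long long rw_devices = 1, missing_devices = 1;

        /* Unpatched view: only rw devices counted -> SINGLE. */
        printf("rw only:      %s\n",
               reduce_profile(MODEL_RAID1, rw_devices) ? "RAID1" : "SINGLE");
        /* Patched view: missing devices counted too -> RAID1 kept. */
        printf("rw + missing: %s\n",
               reduce_profile(MODEL_RAID1, rw_devices + missing_devices) ?
               "RAID1" : "SINGLE");
        return 0;
}

Counting missing_devices into num_devices, as the first hunk below does,
is exactly what keeps the RAID1 profile eligible on a degraded mount.
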
>> [ENHANCEMENT]
>> To avoid such problems, this patch will:
>> - Make all profile reducers/updaters consider missing devices as part
>>   of num_devices
>> - Make the chunk allocator fall back to missing_list as a last resort
>>
>> If we have enough rw_devices, then we go through the regular chunk
>> allocation code.
>
>> This can avoid allocating degraded chunks.
>> E.g. for a 3-device RAID1 mounted degraded, we will use the 2 existing
>> devices to allocate chunks, avoiding degraded chunks.
>
>> But if we don't have enough rw_devices, then we check missing devices
>> and allocate degraded chunks.
>> E.g. for a 2-device RAID1 mounted degraded, we have to allocate degraded
>> chunks to keep the RAID1 profile.
>>
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>> ---
>>  fs/btrfs/block-group.c | 10 +++++++---
>>  fs/btrfs/volumes.c     | 18 +++++++++++++++---
>>  2 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index bf7e3f23bba7..1686fd31679b 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -52,11 +52,13 @@ static u64 get_restripe_target(struct btrfs_fs_info *fs_info, u64 flags)
>>   */
>>  static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
>>  {
>> -        u64 num_devices = fs_info->fs_devices->rw_devices;
>> +        u64 num_devices;
>>          u64 target;
>>          u64 raid_type;
>>          u64 allowed = 0;
>>
>> +        num_devices = fs_info->fs_devices->rw_devices +
>> +                      fs_info->fs_devices->missing_devices;
>>          /*
>>           * See if restripe for this chunk_type is in progress, if so try to
>>           * reduce to the target profile
>> @@ -1986,7 +1988,8 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
>>          if (stripped)
>>                  return extended_to_chunk(stripped);
>>
>> -        num_devices = fs_info->fs_devices->rw_devices;
>> +        num_devices = fs_info->fs_devices->rw_devices +
>> +                      fs_info->fs_devices->missing_devices;
>>
>>          stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
>>                  BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
>> @@ -2981,7 +2984,8 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
>>          num_dev = btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
>>          if (!num_dev)
>> -                num_dev = fs_info->fs_devices->rw_devices;
>> +                num_dev = fs_info->fs_devices->rw_devices +
>> +                          fs_info->fs_devices->missing_devices;
>>
>>          return num_dev;
>>  }
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a462d8de5d2a..4dee1974ceb7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5052,8 +5052,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>          max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
>>                               max_chunk_size);
>>
>> -        devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> -                               GFP_NOFS);
>> +        devices_info = kcalloc(fs_devices->rw_devices +
>> +                               fs_devices->missing_devices,
>> +                               sizeof(*devices_info), GFP_NOFS);
>>          if (!devices_info)
>>                  return -ENOMEM;
>>
>> @@ -5067,7 +5068,18 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>                                 max_stripe_size, dev_stripes);
>>          if (ret < 0)
>>                  goto error;
>> -
>> +        /*
>> +         * If rw devices can't fulfill the request, fall back to missing
>> +         * devices as a last resort.
>> +         */
>> +        if (ndevs < devs_min) {
>> +                ret = gather_dev_holes(info, devices_info + ndevs, &ndevs,
>> +                                       &fs_devices->missing_list,
>> +                                       fs_devices->missing_devices,
>> +                                       max_stripe_size, dev_stripes);
>> +                if (ret < 0)
>> +                        goto error;
>> +        }
>>          /*
>>           * now sort the devices by hole size / available space
>>           */
>>
>