Re: Problems balancing BTRFS

On 22/11/2019 14:07, devel@xxxxxxxxxxxxxx wrote:
> On 22/11/2019 13:56, Qu Wenruo wrote:
>>> On 2019/11/22 9:20 PM, devel@xxxxxxxxxxxxxx wrote:
>>> On 22/11/2019 13:10, Qu Wenruo wrote:
>>>>> On 2019/11/22 8:37 PM, devel@xxxxxxxxxxxxxx wrote:
>>>>> So I've been discussing this on IRC, but it looks like more sage
>>>>> advice is needed.
>>>> You're not the only one hitting the bug. (Not sure if that makes you
>>>> feel a little better)
>>>
>>> Hehe.. well, it always helps to know you are not slowly going crazy
>>> by yourself.
>>>
>>>> The csum error is from the data reloc tree, which is a tree that
>>>> records the new (relocated) data.
>>>> So the good news is, your old data is not corrupted, and since we hit
>>>> EIO before switching tree blocks, the corrupted data is just deleted.
>>>>
>>>> I have also seen the bug using just a single device, with DUP metadata
>>>> and SINGLE data, so I believe there is something wrong with the data
>>>> reloc tree.
>>>> The problem here is, I can't find a way to reproduce it, so it will
>>>> take us longer to debug.
>>>>
>>>>
>>>> Despite that, have you seen any other problems? Especially ENOSPC
>>>> (which needs the enospc_debug mount option).
>>>> The only time I hit this, I was debugging an ENOSPC bug in relocation.
>>>>
>>> As far as I can tell, the rest of the filesystem works normally; as I
>>> showed, scrubs come back clean, etc. I have not actively added much new
>>> data, since the whole point is to balance the fs so a scrub does not
>>> take 18 hours.
>> Sorry, my point here is: would you like to try the balance again with
>> the "enospc_debug" mount option?
>>
>> As for balance, we can hit ENOSPC without it being reported whenever
>> there is a more serious problem, like the EIO you hit.
>
> Oh I see .. Sure I can start the balance again.
>
>
>>> So really I am not sure what to do. It only seems to appear during a
>>> balance, which, as far as I know, is a much-needed piece of regular
>>> maintenance for keeping a fs healthy, which is why it is part of the
>>> btrfsmaintenance tools.
>> You don't need to be that nervous just because you cannot balance.
>>
>> Nowadays, balance is no longer all that necessary.
>> In the old days, balance was the only way to delete empty block groups,
>> but now empty block groups are removed automatically, so balance is only
>> needed to address unbalanced disk usage or to convert profiles.
>>
>> In your case, although it's not comfortable to have imbalanced disk
>> usage, it won't hurt too much.
>
> Well, going from 1TB to 6TB devices means there is a lot of weight going
> the wrong way. Initially there was only ~200GB on each of the new disks,
> which was just unacceptable. It was getting better until I hit this
> balance issue, but I am wary of adding too much new data in case the
> failure is symptomatic of something else.
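
(Side note: the imbalance itself is easy to see; I've been watching it
with something like the commands below, where /mnt/media is the same
mount point as further down. Exact invocations are from memory:

  btrfs device usage /mnt/media        # allocated vs. unallocated, per device
  btrfs filesystem usage -T /mnt/media # same information as a table

which is how I can see most of the data still sitting on the old 1TB
devices.)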
>
>
>
>> So for now, you can just disable balance and call it a day.
>> As long as you're still writing into that fs, the fs should become more
>> and more balanced.
>>
>>> Are there any other tests I can try to isolate what the problem is?
>> Forgot to mention: is it always reproducible? And always on the same
>> block group?
>>
>> Thanks,
>> Qu
>
> So far, yes. The balance always fails at the same ino and offset, making
> it impossible to continue.
>
>
> Let me run it with debug on and get back to you.
>
>
> Thanks.
>
>
>
>




OK, so I mounted the fs with enospc_debug:


> /dev/sdb on /mnt/media type btrfs (rw,relatime,space_cache,enospc_debug,subvolid=1001,subvol=/media)
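
For the record, enabling it was just a remount, roughly like this (exact
command from memory):

  # add enospc_debug without unmounting
  mount -o remount,enospc_debug /mnt/media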


Re-ran the balance and it made a little more progress, but then errored out again.
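
(The balance itself was just a plain, unfiltered start, roughly:

  # plain full balance, no filters; a -dusage=NN filter would restrict it
  # to data block groups that are at most NN% full, but I want everything
  btrfs balance start /mnt/media

so nothing exotic on my side.)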


However, I don't see any additional info in dmesg:

[Fri Nov 22 15:13:40 2019] BTRFS info (device sdb): relocating block group 8963019112448 flags data|raid5
[Fri Nov 22 15:14:22 2019] BTRFS info (device sdb): found 61 extents
[Fri Nov 22 15:14:41 2019] BTRFS info (device sdb): found 61 extents
[Fri Nov 22 15:14:59 2019] BTRFS info (device sdb): relocating block group 8801957838848 flags data|raid5
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 1
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131764224 csum 0xd009e874 expected csum 0x00000000 mirror 1
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 2
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131764224 csum 0xd009e874 expected csum 0x00000000 mirror 2
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 1
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 2
[Fri Nov 22 15:15:13 2019] BTRFS info (device sdb): balance: ended with status: -5
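
If it helps with debugging, I can keep hammering on just that one block
group. My understanding (please correct me if this is wrong) is that the
vrange filter restricts a balance to block groups overlapping a logical
address range, and logical-resolve can show which files live there:

  # retry only the failing block group (start..start+1 overlaps just it)
  btrfs balance start -dvrange=8801957838848..8801957838849 /mnt/media

  # map that logical address back to file paths
  btrfs inspect-internal logical-resolve 8801957838848 /mnt/media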


What should I do now to get more information on the issue?


Thanks.



-- 
==

D LoCascio

Director

RooSoft Ltd



