On 22/11/2019 14:07, devel@xxxxxxxxxxxxxx wrote:
> On 22/11/2019 13:56, Qu Wenruo wrote:
>> On 2019/11/22 9:20 PM, devel@xxxxxxxxxxxxxx wrote:
>>> On 22/11/2019 13:10, Qu Wenruo wrote:
>>>> On 2019/11/22 8:37 PM, devel@xxxxxxxxxxxxxx wrote:
>>>>> So I've been discussing this on IRC, but it looks like more sage advice is needed.
>>>> You're not the only one hitting the bug. (Not sure if that makes you
>>>> feel a little better.)
>>>
>>> Hehe.. well, it always helps to know you are not slowly going crazy by yourself.
>>>
>>>>> The csum error is from the data reloc tree, which is a tree that records the
>>>>> new (relocated) data.
>>>>> So the good news is, your old data is not corrupted, and since we hit
>>>>> EIO before switching tree blocks, the corrupted data was simply deleted.
>>>>>
>>>>> I have also seen the bug with just a single device, with DUP metadata and
>>>>> SINGLE data, so I believe there is something wrong with the data reloc tree.
>>>>> The problem is, I can't find a way to reproduce it, so it will take
>>>>> us longer to debug.
>>>>>
>>>>> Apart from that, have you seen any other problems? Especially ENOSPC
>>>>> (needs the enospc_debug mount option).
>>>>> The only time I hit it, I was debugging an ENOSPC bug in relocation.
>>>>>
>>> As far as I can tell, the rest of the filesystem works normally; as I
>>> showed, scrubs come back clean, etc. I have not actively added much new
>>> data, since the whole point is to balance the fs so that a scrub does
>>> not take 18 hours.
>> Sorry, my point here is: would you like to try the balance again with the
>> "enospc_debug" mount option?
>>
>> As for balance, we can hit ENOSPC without it being reported when there is
>> a more serious problem, like the EIO you hit.
>
> Oh, I see.. Sure, I can start the balance again.
>
>>> So really I am not sure what to do. It only seems to appear during a
>>> balance, which as far as I know is a much-needed regular maintenance
>>> tool to keep a fs healthy, which is why it is part of the
>>> btrfsmaintenance tools.
>> You don't need to be that nervous just because you can't balance.
>>
>> Nowadays, balance is no longer that necessary.
>> In the old days, balance was the only way to delete empty block groups,
>> but now empty block groups are removed automatically, so balance is
>> only needed to address unbalanced disk usage or to convert profiles.
>>
>> In your case, although it's not comfortable to have imbalanced disk
>> usage, it won't hurt too much.
>
> Well, going from 1TB to 6TB devices means there is a lot of weighting
> going the wrong way. Initially there was only ~200GB on each of the new
> disks, which was just unacceptable; it was getting better until I hit
> this balance issue. But I am wary of adding too much new data in case
> it is symptomatic of something else.
>
>> So for now, you can just disable balance and call it a day.
>> As long as you're still writing into that fs, the fs should become more
>> and more balanced.
>>
>>> Are there some other tests to try to isolate what the problem is?
>> Forgot to mention: is that always reproducible? And always on the same
>> block group?
>>
>> Thanks,
>> Qu
>
> So far, yes. The balance always fails at the same ino and offset,
> making it impossible to continue.
>
> Let me run it with debug on and get back to you.
>
> Thanks.


OK, so I mounted the fs with enospc_debug:

/dev/sdb on /mnt/media type btrfs (rw,relatime,space_cache,enospc_debug,subvolid=1001,subvol=/media)

I re-ran the balance and it got a little further, but then errored out again.
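For reference, this is roughly the sequence I used (reconstructed from memory, so the exact invocation may have differed slightly):

  # add the debug option Qu asked for, without unmounting
  mount -o remount,enospc_debug /mnt/media

  # confirm the option is active (this is where the line above came from)
  mount | grep /mnt/media

  # restart the balance; no filters, so --full-balance skips the warning countdown
  btrfs balance start --full-balance /mnt/media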
However, I don't see any more info in dmesg:

[Fri Nov 22 15:13:40 2019] BTRFS info (device sdb): relocating block group 8963019112448 flags data|raid5
[Fri Nov 22 15:14:22 2019] BTRFS info (device sdb): found 61 extents
[Fri Nov 22 15:14:41 2019] BTRFS info (device sdb): found 61 extents
[Fri Nov 22 15:14:59 2019] BTRFS info (device sdb): relocating block group 8801957838848 flags data|raid5
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 1
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131764224 csum 0xd009e874 expected csum 0x00000000 mirror 1
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 2
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131764224 csum 0xd009e874 expected csum 0x00000000 mirror 2
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 1
[Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 2
[Fri Nov 22 15:15:13 2019] BTRFS info (device sdb): balance: ended with status: -5

What should I do now to get more information on the issue?
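In case it helps to narrow things down, I could also try re-triggering only the failing block group rather than a full balance, something like the following (if I'm reading the vrange filter docs right; the end value is just start+1, since any range overlapping block group 8801957838848 should select it):

  # balance only the block group that keeps failing, per the dmesg output above
  btrfs balance start -dvrange=8801957838848..8801957838849 /mnt/media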
Thanks.

-- 
==
D LoCascio
Director
RooSoft Ltd