On Sun, 2019-09-29 at 07:37 +0800, Qu Wenruo wrote:
> 
> On 2019/9/29 2:36 AM, Cebtenzzre wrote:
> > On Mon, 2019-09-16 at 17:20 -0400, Cebtenzzre wrote:
> > > On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
> > > > Hi,
> > > > 
> > > > I started a balance of one block group, and I saw this in dmesg:
> > > > 
> > > > BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
> > > > BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > 
> > > > [...]
> > > > 
> > > > I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
> > > > "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
> > > > against a use-after-free that I found when I had KASAN enabled. Would
> > > > that kernel parameter result in a silent retry if it hit the
> > > > use-after-free?
> > > 
> > > Please disregard the quoted message. This behavior does appear to be a
> > > result of using the slub_debug option instead of KASAN. It is not
> > > directly caused by BTRFS.
> > 
> > Actually, I just reproduced this behavior without slub_debug in the
> > cmdline, on Linux 5.3.0 with "[PATCH] btrfs: relocation: Fix KASAN
> > report about use-after-free due to dead reloc tree cleanup race"
> > (https://patchwork.kernel.org/patch/11153729/) applied.
> > 
> > So, this issue is still relevant and possible to trigger, though under
> > different conditions (different volume, kernel version, and cmdline).
> 
> That patch is not meant to solve the while-loop problem, so we still need
> some extra info for this problem.
> 
> Is the problem always reproducible on that fs, or is there still some
> randomness?
> 
> And can you still reproduce it with v5.1/v5.2?
> 
> Thanks,
> Qu
> 

I mentioned that patch because it was the only patch I had applied to my
kernel at the time. The "issue" I was referring to was the looping issue
that I reported in the first email.

I have only come across this behavior without slub_debug once or twice, so
I don't have a large enough sample size to say whether it can happen on
older kernels. It is triggered by running a balance with *just* the right
amount of free space, such that the correct behavior is probably to fail
with ENOSPC.

I might eventually dedicate a volume to reproducing this issue and bisect
the kernel, but I need all of my disks to be usable right now.

-- 
Cebtenzzre <cebtenzzre@xxxxxxxxx>
