On Sun, Mar 29, 2020 at 10:13:05AM -0400, Jason Clara wrote:
> I had a previous post about when trying to do a device delete that
> it would cause my whole system to hang.  I seem to have got past
> that issue.
>
> For that, it seems like even though all the SCRUBs finished without
> any errors I still had a problem with some files.  By forcing a read
> of every single file I was able to detect the bad files in DMESG.
> Not sure though why SCRUB didn't detect this.  BTRFS warning (device
> sdd1): csum failed root 5 ino 14654354 off 163852288 csum 0

That sounds like it could be the raid5/6 bug I reported:

    https://www.spinics.net/lists/linux-btrfs/msg94594.html

To trigger that bug you need pre-existing corruption on the disk.
You can work around it by doing the following:

1.  Read every file, e.g. 'find -type f -exec cat {} + >/dev/null'.
    This avoids dmesg rate limiting, which would otherwise hide some
    of the errors.

2.  If there are read errors in step 1, delete the files that failed
    to read (or restore them from backup).

3.  Run a full scrub to fix parity (which may also inject a few new
    errors).

4.  Repeat until there are no errors at step 1.

(A rough script of this loop is sketched further down in this message.)

The bug will introduce new errors in a small fraction (<0.1%) of
corrupted raid stripes as you do this.  Each pass through the loop
will remove existing errors, but may add a few new ones at the same
time.  The rate of removal is much faster than the rate of addition,
so the loop will eventually terminate at zero errors.  You'll be able
to use the filesystem normally again after that.

This bug is not a regression--there has not been a kernel release
with working btrfs raid5/6 yet.  All releases from 4.15 to 5.5.3 fail
my test case, and versions before 4.15 have worse bugs.  At the
moment, btrfs raid5/6 should only be used by developers who intend to
test, debug, and fix btrfs raid5/6.

> But now when I attempt to delete a device from the array it seems to
> get stuck.  Normally it will show in the log that it has found some
> extents and then another message saying they were relocated.
>
> But for the last few days it has just been repeating the same found
> value and never relocating anything, and the usage of the device
> doesn't change at all.
>
> This line has now been repeating for more than 24 hours, and the
> previous attempt was similar.  [Sun Mar 29 09:59:50 2020] BTRFS info
> (device sdd1): found 133 extents

Kernels starting with 5.1 have a known regression where block group
relocation gets stuck in loops.  Everything in the block group gets
relocated except for shared data backref items; the relocation then
can't move those, and no further progress is made.  This has not been
fixed yet.

> Prior to this run I had tried with an earlier kernel (5.5.10) and had
> the same results.  It starts with finding and then relocating, but
> then it just keeps finding without relocating.  So I upgraded my
> kernel to see if that would help, and it has not.

Use kernel 4.19 for device deletes or other big relocation operations.
(5.0 and 4.20 are OK too, but 4.19 is still maintained and has fixes
for non-btrfs issues.)
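In case it helps, here is a rough sketch of the read/scrub loop from
steps 1-4 as a shell script.  Treat it as an illustration only: the
/mnt/pool mount point is a placeholder you'd replace with your own,
and deleting or restoring the files that fail to read is left as a
manual pause rather than automated.

    #!/bin/sh
    # Rough sketch of the read/scrub loop described above.  /mnt/pool
    # is only a placeholder; point MNT at your own mount point.
    MNT=/mnt/pool

    while true; do
        # Step 1: read every file so csum failures show up in dmesg.
        # find exits non-zero if any cat invocation fails.
        if find "$MNT" -type f -exec cat {} + \
                >/dev/null 2>/tmp/read-errors; then
            echo "clean pass, no read errors left"
            break
        fi

        # Step 2: look at /tmp/read-errors and dmesg, then delete or
        # restore the affected files before continuing.
        printf 'handle the failed files, then press enter to scrub: '
        read dummy

        # Step 3: full scrub to rewrite parity (this is the step that
        # may also inject a few new errors).
        btrfs scrub start -Bd "$MNT"

        # Step 4: back to step 1; repeat until a pass is clean.
    done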
> System Info
> Ubuntu 18.04
> btrfs-progs v5.4.1
> Linux FileServer 5.5.13-050513-generic #202003251631 SMP Wed Mar 25 16:35:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>
> DEVICE USAGE
> /dev/sdd1, ID: 1
>    Device size:    2.73TiB
>    Device slack:   0.00B
>    Data,RAID6:     188.67GiB
>    Data,RAID6:     1.68TiB
>    Data,RAID6:     888.43GiB
>    Unallocated:    1.00MiB
>
> /dev/sdb1, ID: 2
>    Device size:    2.73TiB
>    Device slack:   2.73TiB
>    Data,RAID6:     188.67GiB
>    Data,RAID6:     508.82GiB
>    Data,RAID6:     2.00GiB
>    Unallocated:    -699.50GiB
>
> /dev/sdc1, ID: 3
>    Device size:    2.73TiB
>    Device slack:   0.00B
>    Data,RAID6:     188.67GiB
>    Data,RAID6:     1.68TiB
>    Data,RAID6:     888.43GiB
>    Unallocated:    1.00MiB
>
> /dev/sdi1, ID: 5
>    Device size:    2.73TiB
>    Device slack:   1.36TiB
>    Data,RAID6:     188.67GiB
>    Data,RAID6:     1.18TiB
>    Unallocated:    1.00MiB
>
> /dev/sdh1, ID: 6
>    Device size:    4.55TiB
>    Device slack:   0.00B
>    Data,RAID6:     188.67GiB
>    Data,RAID6:     1.68TiB
>    Data,RAID6:     1.23TiB
>    Data,RAID6:     888.43GiB
>    Data,RAID6:     2.00GiB
>    Metadata,RAID1: 2.00GiB
>    Unallocated:    601.01GiB
>
> /dev/sda1, ID: 7
>    Device size:    7.28TiB
>    Device slack:   0.00B
>    Data,RAID6:     188.67GiB
>    Data,RAID6:     1.68TiB
>    Data,RAID6:     1.23TiB
>    Data,RAID6:     888.43GiB
>    Data,RAID6:     2.00GiB
>    Metadata,RAID1: 2.00GiB
>    System,RAID1:   32.00MiB
>    Unallocated:    3.32TiB
>
> /dev/sdf1, ID: 8
>    Device size:    7.28TiB
>    Device slack:   0.00B
>    Data,RAID6:     188.67GiB
>    Data,RAID6:     1.68TiB
>    Data,RAID6:     1.23TiB
>    Data,RAID6:     888.43GiB
>    Data,RAID6:     2.00GiB
>    Metadata,RAID1: 8.00GiB
>    Unallocated:    3.31TiB
>
> /dev/sdj1, ID: 9
>    Device size:    7.28TiB
>    Device slack:   0.00B
>    Data,RAID6:     188.67GiB
>    Data,RAID6:     1.68TiB
>    Data,RAID6:     1.23TiB
>    Data,RAID6:     888.43GiB
>    Data,RAID6:     2.00GiB
>    Metadata,RAID1: 8.00GiB
>    System,RAID1:   32.00MiB
>    Unallocated:    3.31TiB
>
>
> FI USAGE
> WARNING: RAID56 detected, not implemented
> Overall:
>    Device size:         33.20TiB
>    Device allocated:    20.06GiB
>    Device unallocated:  33.18TiB
>    Device missing:      0.00B
>    Used:                19.38GiB
>    Free (estimated):    0.00B  (min: 8.00EiB)
>    Data ratio:          0.00
>    Metadata ratio:      2.00
>    Global reserve:      512.00MiB  (used: 0.00B)
>
> Data,RAID6: Size:15.42TiB, Used:15.18TiB (98.44%)
>    /dev/sdd1   2.73TiB
>    /dev/sdb1   699.50GiB
>    /dev/sdc1   2.73TiB
>    /dev/sdi1   1.36TiB
>    /dev/sdh1   3.96TiB
>    /dev/sda1   3.96TiB
>    /dev/sdf1   3.96TiB
>    /dev/sdj1   3.96TiB
>
> Metadata,RAID1: Size:10.00GiB, Used:9.69GiB (96.90%)
>    /dev/sdh1   2.00GiB
>    /dev/sda1   2.00GiB
>    /dev/sdf1   8.00GiB
>    /dev/sdj1   8.00GiB
>
> System,RAID1: Size:32.00MiB, Used:1.19MiB (3.71%)
>    /dev/sda1   32.00MiB
>    /dev/sdj1   32.00MiB
>
> Unallocated:
>    /dev/sdd1   1.00MiB
>    /dev/sdb1   -699.50GiB
>    /dev/sdc1   1.00MiB
>    /dev/sdi1   1.00MiB
>    /dev/sdh1   601.01GiB
>    /dev/sda1   3.32TiB
>    /dev/sdf1   3.31TiB
>    /dev/sdj1   3.31TiB
>
>
> FI SHOW
> Label: 'Pool1'  uuid: 99935e27-4922-4efa-bf76-5787536dd71f
>    Total devices 8 FS bytes used 15.19TiB
>    devid    1 size 2.73TiB used 2.73TiB path /dev/sdd1
>    devid    2 size 0.00B used 699.50GiB path /dev/sdb1
>    devid    3 size 2.73TiB used 2.73TiB path /dev/sdc1
>    devid    5 size 1.36TiB used 1.36TiB path /dev/sdi1
>    devid    6 size 4.55TiB used 3.96TiB path /dev/sdh1
>    devid    7 size 7.28TiB used 3.96TiB path /dev/sda1
>    devid    8 size 7.28TiB used 3.97TiB path /dev/sdf1
>    devid    9 size 7.28TiB used 3.97TiB path /dev/sdj1
>
> FI DF
> Data, RAID6: total=15.42TiB, used=15.18TiB
> System, RAID1: total=32.00MiB, used=1.19MiB
> Metadata, RAID1: total=10.00GiB, used=9.69GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
