Thanks, I will give it a try. Your step 1 is actually what I used to detect
the errors the first time, when the delete would cause the system to hang
completely. I then deleted all the bad files and restored them from a
backup. I did do a scrub after that, but didn't repeat step 1 again. I will
try your suggestion and repeat the steps until I see no errors.

Also, I understand the state of RAID 5/6. This pool has all important data
backed up to another RAID1 pool daily. I am actually trying to reduce the
size of this pool to free up a device to add to the RAID1 pool. It was
previously a RAID1 pool that I converted to RAID6, and since then I have
not been able to remove that device.

> On Mar 29, 2020, at 2:55 PM, Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Sun, Mar 29, 2020 at 10:13:05AM -0400, Jason Clara wrote:
>> I had a previous post about a device delete causing my whole system
>> to hang. I seem to have gotten past that issue.
>>
>> For that, it seems that even though all the scrubs finished without
>> any errors, I still had a problem with some files. By forcing a read
>> of every single file I was able to detect the bad files in dmesg.
>> Not sure, though, why scrub didn't detect this:
>>
>> BTRFS warning (device sdd1): csum failed root 5 ino 14654354 off 163852288 csum 0
>
> That sounds like it could be the raid5/6 bug I reported:
>
> https://www.spinics.net/lists/linux-btrfs/msg94594.html
>
> To trigger that bug you need pre-existing corruption on the disk.
>
> You can work around it by:
>
> 1. Read every file, e.g. 'find -type f -exec cat {} + >/dev/null'.
>    This avoids dmesg rate limiting, which would otherwise hide some
>    errors.
>
> 2. If there are read errors in step 1, remove any files that have
>    failures.
>
> 3. Run a full scrub to fix parity (or inject new errors).
>
> 4. Repeat until there are no errors at step 1.
>
> The bug will introduce new errors in a small fraction (<0.1%) of corrupted
> raid stripes as you do this. Each pass through the loop will remove
> existing errors, but may add a few new ones at the same time.
> The rate of removal is much faster than the rate of addition, so the
> loop will eventually terminate at zero errors. You'll be able to use
> the filesystem normally again after that.
>
> This bug is not a regression--there has not been a kernel release with
> working btrfs raid5/6 yet. All releases from 4.15 to 5.5.3 fail my test
> case, and versions before 4.15 have worse bugs. At the moment, btrfs
> raid5/6 should only be used by developers who intend to test, debug,
> and fix btrfs raid5/6.
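For my own notes, putting your steps 1-4 together, this is roughly the pass
I plan to repeat. An untested sketch: /mnt/pool1 stands in for my actual
mount point, and the inode-resolve example reuses the inode from the
warning above.

    dmesg --read-clear > /dev/null                      # flush old kernel messages
    find /mnt/pool1 -type f -exec cat {} + > /dev/null  # step 1: read every file
    dmesg | grep 'csum failed'                          # failures from this pass only
    # map a failing inode from the warning back to a path, e.g.:
    #   btrfs inspect-internal inode-resolve 14654354 /mnt/pool1
    # step 2: delete the failing files and restore them from backup
    btrfs scrub start -Bd /mnt/pool1                    # step 3: full scrub to fix parity
    # step 4: repeat from the top until the grep finds nothing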
>> But now, when I attempt to delete a device from the array, it seems to
>> get stuck. Normally it will show in the log that it has found some
>> extents, and then another message saying they were relocated.
>>
>> But for the last few days it has just been repeating the same "found"
>> value and never relocating anything, and the usage of the device
>> doesn't change at all.
>>
>> This line has now been repeating for more than 24 hours, and the
>> previous attempt was similar:
>>
>> [Sun Mar 29 09:59:50 2020] BTRFS info (device sdd1): found 133 extents
>
> Kernels starting with 5.1 have a known regression where block group
> relocation gets stuck in loops. Everything in the block group gets
> relocated except for shared data backref items; the relocation then
> can't seem to move those, and no further progress is made. This has
> not been fixed yet.
>
>> Prior to this run I had tried with an earlier kernel (5.5.10) and had
>> the same results: it starts with finding and then relocating, but then
>> it only keeps finding and never relocates. So I upgraded my kernel to
>> see if that would help, and it has not.
>
> Use kernel 4.19 for device deletes or other big relocation operations.
> (5.0 and 4.20 are OK too, but 4.19 is still maintained and has fixes
> for non-btrfs issues.)
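Understood. My understanding is that a device delete does not persist
across a reboot, so once I am booted into a 4.19 kernel I can simply
re-run the removal. A rough sketch, again with /mnt/pool1 standing in for
my mount point:

    uname -r                                  # confirm the 4.19 kernel is running
    btrfs device delete /dev/sdb1 /mnt/pool1  # restart the removal
    btrfs device usage /mnt/pool1             # sdb1's allocation should now keep shrinking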
>> System Info:
>> Ubuntu 18.04
>> btrfs-progs v5.4.1
>> Linux FileServer 5.5.13-050513-generic #202003251631 SMP Wed Mar 25 16:35:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>>
>> DEVICE USAGE
>> /dev/sdd1, ID: 1
>>    Device size:    2.73TiB
>>    Device slack:   0.00B
>>    Data,RAID6:     188.67GiB
>>    Data,RAID6:     1.68TiB
>>    Data,RAID6:     888.43GiB
>>    Unallocated:    1.00MiB
>>
>> /dev/sdb1, ID: 2
>>    Device size:    2.73TiB
>>    Device slack:   2.73TiB
>>    Data,RAID6:     188.67GiB
>>    Data,RAID6:     508.82GiB
>>    Data,RAID6:     2.00GiB
>>    Unallocated:    -699.50GiB
>>
>> /dev/sdc1, ID: 3
>>    Device size:    2.73TiB
>>    Device slack:   0.00B
>>    Data,RAID6:     188.67GiB
>>    Data,RAID6:     1.68TiB
>>    Data,RAID6:     888.43GiB
>>    Unallocated:    1.00MiB
>>
>> /dev/sdi1, ID: 5
>>    Device size:    2.73TiB
>>    Device slack:   1.36TiB
>>    Data,RAID6:     188.67GiB
>>    Data,RAID6:     1.18TiB
>>    Unallocated:    1.00MiB
>>
>> /dev/sdh1, ID: 6
>>    Device size:    4.55TiB
>>    Device slack:   0.00B
>>    Data,RAID6:     188.67GiB
>>    Data,RAID6:     1.68TiB
>>    Data,RAID6:     1.23TiB
>>    Data,RAID6:     888.43GiB
>>    Data,RAID6:     2.00GiB
>>    Metadata,RAID1: 2.00GiB
>>    Unallocated:    601.01GiB
>>
>> /dev/sda1, ID: 7
>>    Device size:    7.28TiB
>>    Device slack:   0.00B
>>    Data,RAID6:     188.67GiB
>>    Data,RAID6:     1.68TiB
>>    Data,RAID6:     1.23TiB
>>    Data,RAID6:     888.43GiB
>>    Data,RAID6:     2.00GiB
>>    Metadata,RAID1: 2.00GiB
>>    System,RAID1:   32.00MiB
>>    Unallocated:    3.32TiB
>>
>> /dev/sdf1, ID: 8
>>    Device size:    7.28TiB
>>    Device slack:   0.00B
>>    Data,RAID6:     188.67GiB
>>    Data,RAID6:     1.68TiB
>>    Data,RAID6:     1.23TiB
>>    Data,RAID6:     888.43GiB
>>    Data,RAID6:     2.00GiB
>>    Metadata,RAID1: 8.00GiB
>>    Unallocated:    3.31TiB
>>
>> /dev/sdj1, ID: 9
>>    Device size:    7.28TiB
>>    Device slack:   0.00B
>>    Data,RAID6:     188.67GiB
>>    Data,RAID6:     1.68TiB
>>    Data,RAID6:     1.23TiB
>>    Data,RAID6:     888.43GiB
>>    Data,RAID6:     2.00GiB
>>    Metadata,RAID1: 8.00GiB
>>    System,RAID1:   32.00MiB
>>    Unallocated:    3.31TiB
>>
>> FI USAGE
>> WARNING: RAID56 detected, not implemented
>> Overall:
>>    Device size:         33.20TiB
>>    Device allocated:    20.06GiB
>>    Device unallocated:  33.18TiB
>>    Device missing:      0.00B
>>    Used:                19.38GiB
>>    Free (estimated):    0.00B  (min: 8.00EiB)
>>    Data ratio:          0.00
>>    Metadata ratio:      2.00
>>    Global reserve:      512.00MiB  (used: 0.00B)
>>
>> Data,RAID6: Size:15.42TiB, Used:15.18TiB (98.44%)
>>    /dev/sdd1  2.73TiB
>>    /dev/sdb1  699.50GiB
>>    /dev/sdc1  2.73TiB
>>    /dev/sdi1  1.36TiB
>>    /dev/sdh1  3.96TiB
>>    /dev/sda1  3.96TiB
>>    /dev/sdf1  3.96TiB
>>    /dev/sdj1  3.96TiB
>>
>> Metadata,RAID1: Size:10.00GiB, Used:9.69GiB (96.90%)
>>    /dev/sdh1  2.00GiB
>>    /dev/sda1  2.00GiB
>>    /dev/sdf1  8.00GiB
>>    /dev/sdj1  8.00GiB
>>
>> System,RAID1: Size:32.00MiB, Used:1.19MiB (3.71%)
>>    /dev/sda1  32.00MiB
>>    /dev/sdj1  32.00MiB
>>
>> Unallocated:
>>    /dev/sdd1  1.00MiB
>>    /dev/sdb1  -699.50GiB
>>    /dev/sdc1  1.00MiB
>>    /dev/sdi1  1.00MiB
>>    /dev/sdh1  601.01GiB
>>    /dev/sda1  3.32TiB
>>    /dev/sdf1  3.31TiB
>>    /dev/sdj1  3.31TiB
>>
>> FI SHOW
>> Label: 'Pool1'  uuid: 99935e27-4922-4efa-bf76-5787536dd71f
>>    Total devices 8 FS bytes used 15.19TiB
>>    devid 1 size 2.73TiB used 2.73TiB path /dev/sdd1
>>    devid 2 size 0.00B used 699.50GiB path /dev/sdb1
>>    devid 3 size 2.73TiB used 2.73TiB path /dev/sdc1
>>    devid 5 size 1.36TiB used 1.36TiB path /dev/sdi1
>>    devid 6 size 4.55TiB used 3.96TiB path /dev/sdh1
>>    devid 7 size 7.28TiB used 3.96TiB path /dev/sda1
>>    devid 8 size 7.28TiB used 3.97TiB path /dev/sdf1
>>    devid 9 size 7.28TiB used 3.97TiB path /dev/sdj1
>>
>> FI DF
>> Data, RAID6: total=15.42TiB, used=15.18TiB
>> System, RAID1: total=32.00MiB, used=1.19MiB
>> Metadata, RAID1: total=10.00GiB, used=9.69GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
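And once a full read pass comes back clean, I plan to double-check before
retrying the delete, roughly like this (same /mnt/pool1 stand-in):

    btrfs device stats -z /mnt/pool1   # print the error counters, then reset them
    btrfs scrub start -Bd /mnt/pool1   # should finish with zero csum errors
    btrfs device stats /mnt/pool1      # counters should still be zero afterwards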
