On Thu, Jul 18, 2019 at 1:25 PM Drazen Kacar <drazen.kacar@xxxxxxxxxxx> wrote:
>
> Hello,
>
> I think I've encountered a deadlock between btrfs-transacti and postgres
> process(es). This is system information (btrfs fi usage obtained after
> poweroff and boot):
>
> # cat /etc/redhat-release
> CentOS Linux release 7.6.1810 (Core)
>
> # uname -a
> Linux prod-dbsnap-01 5.2.1-1.el7.elrepo.x86_64 #1 SMP Sun Jul 14 08:15:04 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> btrfs-progs v5.2
>
> # btrfs filesystem usage /data/pg_data
> Overall:
>     Device size:           2.00TiB
>     Device allocated:    345.03GiB
>     Device unallocated:    1.66TiB
>     Device missing:          0.00B
>     Used:                338.07GiB
>     Free (estimated):      1.67TiB  (min: 854.27GiB)
>     Data ratio:               1.00
>     Metadata ratio:           2.00
>     Global reserve:      512.00MiB  (used: 0.00B)
>
> Data,RAID0: Size:332.00GiB, Used:329.22GiB
>    /dev/sdb   83.00GiB
>    /dev/sdc   83.00GiB
>    /dev/sdd   83.00GiB
>    /dev/sde   83.00GiB
>
> Metadata,RAID10: Size:6.50GiB, Used:4.42GiB
>    /dev/sdb    3.25GiB
>    /dev/sdc    3.25GiB
>    /dev/sdd    3.25GiB
>    /dev/sde    3.25GiB
>
> System,RAID10: Size:16.00MiB, Used:48.00KiB
>    /dev/sdb    8.00MiB
>    /dev/sdc    8.00MiB
>    /dev/sdd    8.00MiB
>    /dev/sde    8.00MiB
>
> Unallocated:
>    /dev/sdb  425.74GiB
>    /dev/sdc  425.74GiB
>    /dev/sdd  425.74GiB
>    /dev/sde  425.74GiB
>
> There were three btrfs subvolumes, and on each one there was a Postgres
> database slave doing recovery (single-threaded writing). But there was a
> lot of writing. And prior to starting the Postgres slaves I was restoring a
> base backup from the backup server, which was being done by a number of
> parallel rsync processes (12 at most, I think).
>
> The file system is mounted with:
>
> # grep btrfs /proc/mounts
> /dev/sdd /data/pg_data btrfs rw,noatime,compress-force=zstd:3,space_cache,subvolid=5,subvol=/ 0 0
>
> After several hours I got this in /var/log/messages:
>
> # grep 'INFO: task .*blocked for more than' messages
> Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task btrfs-transacti:1361 blocked for more than 122 seconds.
> Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task postgres:62682 blocked for more than 122 seconds.
> Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task postgres:80145 blocked for more than 122 seconds.
> Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task postgres:87299 blocked for more than 122 seconds.
> Jul 17 16:47:09 prod-dbsnap-01 kernel: INFO: task postgres:93108 blocked for more than 122 seconds.
> Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task btrfs-transacti:1361 blocked for more than 245 seconds.
> Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task postgres:62682 blocked for more than 245 seconds.
> Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task postgres:80145 blocked for more than 245 seconds.
> Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task postgres:87299 blocked for more than 245 seconds.
> Jul 17 16:49:12 prod-dbsnap-01 kernel: INFO: task postgres:93108 blocked for more than 245 seconds.
>
> The full log is in the attachment. It seems to me that there is a deadlock
> between btrfs-transacti and any process which is trying to write
> something (not sure about reading). While this was going on, the CPU
> usage (according to top) was non-existent. There are 4 CPUs (it's a virtual
> machine in VMware) and 3 were 100% idle. The 4th was 100% in
> waiting. (I didn't find out which process was on that CPU, unfortunately.)
>
> I powered off the machine (yesterday), booted this morning, and things
> are working without errors. I stopped the Postgres clusters, though.
>
> I have a few questions:
>
> 1.
> After something like this happens and the machine is rebooted, is there
> a procedure which would lower the probability of encountering the deadlock
> again (maybe btrfs scrub or btrfs defragment or something like that)? This
> happened after heavy write activity, so maybe fragmentation had
> something to do with it.
>
> 2. Should I run btrfsck (or something else) to verify on-disk integrity after a
> problem like this? Or is it just an in-memory problem, so I can assume that
> nothing bad happened to the data on the disks?
>
> 3. I'd like to write a watchdog program to catch deadlocks and reboot
> (probably power-cycle) the VM, but I'm not sure what the appropriate
> check would be. Does it have to write something to the disk, or would
> reading be sufficient? And how do I bypass the OS buffer cache (fsync()
> or O_DIRECT should do it for writing, but I'm not sure about reading)?
>
> What would the appropriate timeout be? (If the operation doesn't
> complete in xx seconds, a reboot should be triggered, but I don't know how
> many seconds to wait when there's heavy load and things are just slow,
> but there's no deadlock.)
>
> Should I put the watchdog process in the real-time class (or whatever the
> equivalent is called on Linux)? Since this is a mainline kernel, I'm not sure
> I can assume that real-time support won't have bugs of its own.
>
> And last, but not least, is there additional data that could help with
> debugging issues like this? (If possible, something that could be
> programmed into the watchdog service.)

It's a regression introduced in 5.2. A fix was just sent:

https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@xxxxxxxxxx/T/#u

Thanks.

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
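[Editor's note: the write-side probe discussed in question 3 could be sketched roughly as follows. This is not from the thread; the probe path, timeout, and function name are placeholders. The idea is a small write followed by fsync(), which forces the data through the page cache, so a hung btrfs transaction commit will stall the call. Running the probe in a helper thread lets the watchdog enforce a deadline even when the filesystem itself never returns.]

```python
import os
import threading

def probe_write(path, timeout_s=300.0):
    """Return True if a small write + fsync to `path` completes within
    timeout_s seconds, False otherwise (e.g. the filesystem is hung).

    The actual I/O runs in a daemon thread; if it blocks forever inside
    the kernel, the caller still gets control back after the timeout."""
    done = threading.Event()

    def worker():
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, b"watchdog probe\n")
            os.fsync(fd)  # force the write past the page cache to disk
        finally:
            os.close(fd)
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout_s)
```

A caller would run this periodically and power-cycle the VM on failure, e.g. `if not probe_write("/data/pg_data/.watchdog"): trigger_reboot()` (where `trigger_reboot` is whatever reset mechanism the environment provides). The timeout has to be tuned generously: on a loaded box fsync() can legitimately take tens of seconds, so a value well above the kernel's 120-second hung-task threshold is a safer starting point.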
