Re: A collection of btrfs lockup stack traces (4.14..4.20.17, but not later kernels)

On Tue, Mar 19, 2019 at 11:49:51PM -0400, Zygo Blaxell wrote:
> On Tue, Mar 19, 2019 at 11:39:59PM -0400, Zygo Blaxell wrote:
> > I haven't been able to easily reproduce these in a test environment;
> > however, they have been happening several times a year on servers in
> > production.
> > 
> > Kernel:  most recent observation on 4.14.105 + cherry-picked deadlock
> > and misc hang fixes:
> > 
> > 	btrfs: wakeup cleaner thread when adding delayed iput
> > 	Btrfs: fix deadlock when allocating tree block during leaf/node split
> > 	Btrfs: use nofs context when initializing security xattrs to avoid deadlock
> > 	Btrfs: fix deadlock with memory reclaim during scrub
> > 	Btrfs: fix deadlock between clone/dedupe and rename
> > 
> > Also observed on 4.20.13, and 4.14.0..4.14.105 (4.14.106 is currently
> > running, but hasn't locked up yet).
> > 
> > Filesystem mount flags:  compress=zstd,ssd,flushoncommit,space_cache=v2.
> > Configuration is either -draid1/-mraid1 or -dsingle/-mraid1.  I've
> > also reproduced a lockup without flushoncommit.
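
(For anyone trying to reproduce: the setup above corresponds roughly to
the following; device names and mount point are placeholders.)

	# illustrative only -- substitute real devices and mount point
	mkfs.btrfs -d raid1 -m raid1 /dev/vdb /dev/vdc
	mount -o compress=zstd,ssd,flushoncommit,space_cache=v2 /dev/vdb /mnt/data
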
> > 
> > The machines that are locking up all run the same workload:
> > 
> > 	rsync receiving data continuously (gigabytes aren't enough,
> > 	I can barely reproduce this once a month with 2TB of rsync
> > 	traffic from 10 simulated clients)
> > 
> > 	bees doing continuous dedupe
> > 
> > 	snapshots daily and after each rsync
> > 
> > 	snapshot deletes as required to maintain free space
> > 
> > 	scrubs twice monthly plus after each crash
> > 
> > 	watchdog does a 'mkdir foo; rmdir foo' every few seconds.
> > 	If this takes longer than 50 minutes, collect a stack trace;
> > 	longer than 60 minutes, reboot the machine.
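
In sketch form the watchdog amounts to something like this (a minimal
sketch, not the real script; the mount point is a placeholder and the
sysrq writes stand in for whatever stack-trace collection you prefer,
only the 50/60 minute thresholds are as described above):

	#!/bin/bash
	# Probe the filesystem with a trivial metadata op every few seconds;
	# dump blocked-task stacks after 50 minutes, reboot after 60 minutes.
	fs=/mnt/data
	while sleep 5; do
		rm -f /run/btrfs-probe-done
		( mkdir "$fs/.probe" && rmdir "$fs/.probe"
		  touch /run/btrfs-probe-done ) &
		start=$(date +%s)
		dumped=0
		until [ -e /run/btrfs-probe-done ]; do
			elapsed=$(( $(date +%s) - start ))
			if [ "$elapsed" -ge 3000 ] && [ "$dumped" -eq 0 ]; then
				echo w > /proc/sysrq-trigger	# blocked tasks to dmesg
				dumped=1
			elif [ "$elapsed" -ge 3600 ]; then
				echo b > /proc/sysrq-trigger	# immediate reboot
			fi
			sleep 10
		done
		wait	# reap the probe subshell
	done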

These deadlocks still occur on the LTS kernels 4.14 and 4.19 (I have
not tested earlier LTS kernels), and also on 4.15..4.18 and 4.20; they
stopped on 5.0 and have not occurred since.

I'm running a bisect to see if I can find where in 5.0 this was fixed,
and whether the fix can be backported to stable.  This might take a
month or so: a negative result means several days of testing without a
deadlock, and I can only spare one VM for the tests.
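
The bisect itself is the straightforward inverted form, since I'm
hunting for a fix rather than a regression (a sketch, assuming the fix
landed between v4.20 and v5.0 in mainline and using git's custom bisect
terms):

	git bisect start --term-old=broken --term-new=fixed
	git bisect broken v4.20
	git bisect fixed v5.0
	# for each step: build the kernel, boot the test VM, run the
	# workload for a few days, then mark the result
	git bisect fixed	# survived several days with no deadlock
	git bisect broken	# the watchdog fired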

In the meantime several other bugs have been fixed and those _have_
been backported to 4.14 and 4.19, so the test case for the bisect is
more aggressive (sketched in script form after the list):

	10x rsync write/updates
	bees dedupe, 4 threads
	snapshot create at random intervals (60-600 seconds)
	snapshot delete at random intervals
	balance start continuously
	balance cancel at random intervals
	scrub start continuously
	scrub cancel at random intervals
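
In script form the snapshot/balance/scrub churn looks roughly like this
(a sketch only; mount point, snapshot directory and interval ranges are
placeholders, and the rsync writers and bees run alongside it):

	#!/bin/bash
	# Sketch of the stress loops above; not the real harness.
	fs=/mnt/data
	mkdir -p "$fs/snaps"

	snapshot_loop() {
		while sleep $(( RANDOM % 541 + 60 )); do	# 60-600 seconds
			btrfs subvolume snapshot -r "$fs" "$fs/snaps/$(date +%s)"
		done
	}

	prune_loop() {
		while sleep $(( RANDOM % 541 + 60 )); do
			old=$(ls -d "$fs"/snaps/* 2>/dev/null | head -n1)
			[ -n "$old" ] && btrfs subvolume delete "$old"
		done
	}

	balance_loop() {
		while :; do
			btrfs balance start --full-balance "$fs" &
			sleep $(( RANDOM % 300 + 30 ))
			btrfs balance cancel "$fs"
			wait
		done
	}

	scrub_loop() {
		while :; do
			btrfs scrub start -B "$fs" &
			sleep $(( RANDOM % 300 + 30 ))
			btrfs scrub cancel "$fs"
			wait
		done
	}

	snapshot_loop & prune_loop & balance_loop & scrub_loop &
	wait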


