On Sun, Nov 10, 2019 at 04:02:11AM -0600, Timothy Pearson wrote:
>
>
> ----- Original Message -----
> > From: "Qu Wenruo" <quwenruo.btrfs@xxxxxxx>
> > To: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>
> > Cc: "linux-btrfs" <linux-btrfs@xxxxxxxxxxxxxxx>
> > Sent: Sunday, November 10, 2019 1:45:14 AM
> > Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>
> > On 2019/11/10 下午3:18, Timothy Pearson wrote:
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Qu Wenruo" <quwenruo.btrfs@xxxxxxx>
> >>> To: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>
> >>> Cc: "linux-btrfs" <linux-btrfs@xxxxxxxxxxxxxxx>
> >>> Sent: Sunday, November 10, 2019 6:54:55 AM
> >>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> >>
> >>> On 2019/11/10 下午2:47, Timothy Pearson wrote:
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Qu Wenruo" <quwenruo.btrfs@xxxxxxx>
> >>>>> To: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>, "linux-btrfs"
> >>>>> <linux-btrfs@xxxxxxxxxxxxxxx>
> >>>>> Sent: Saturday, November 9, 2019 9:38:21 PM
> >>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> >>>>
> >>>>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
> >>>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
> >>>>>> serve a BTRFS filesystem. NFS went into deadlock (D wait) with no apparent
> >>>>>> underlying disk subsystem problems, and when the server was hard rebooted to
> >>>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
> >>>>>> in approximately two weeks earlier (!).
> >>>>>
> >>>>> This means during two weeks, the btrfs is not committed.
> >>>>
> >>>> Is there any hope of getting the data from that interval back via btrfs-recover
> >>>> or a similar tool, or does the lack of commit mean the data was stored in RAM
> >>>> only and is therefore gone after the server reboot?

Writeback will dump out some data blocks between commits; however,
without a commit, there will be no metadata pages on disk that point
to the data. Writeback could keep a fileserver running for a long time
as long as nobody calls a nontrivial fsync() (too complex to be sent
to the log tree) or sync(), or renames a file over another existing
file (all may trigger a commit if reservations fill up); however, as
soon as one of those happens, something should be noticeably failing
as the calls will block.

> >>> If it's deadlock preventing new transaction to be committed, then no
> >>> metadata is even written back to disk, so no way to recover metadata.
> >>> Maybe you can find some data written, but without metadata it makes no
> >>> sense.
> >>
> >> OK, I'll just assume the data written in that window is unrecoverable at this
> >> point then.
> >>
> >> Would the commit deadlock affect only one btrfs filesystem or all of them on the
> >> machine? I take it there is no automatic dmesg spew on extended deadlock?
> >> dmesg was completely clean at the time of the fault / reboot.

Stepping away from btrfs a bit, I've heard rumors of something like
this happening to SSDs (on Windows, so not a btrfs issue). I guess it
may be possible for a log-structured FTL layer to revert to a
significantly earlier disk content state if there are enough free
erase blocks so that the older data isn't destroyed, and the pointer
to the current log record isn't updated in persistent storage due to a
firmware bug.
Obviously this is not relevant if you're not using SSD, and not likely
if you have a multi-disk filesystem (one disk will appear to be
corrupted in that case).

> > It should have some kernel message for things like process hang for over
> > 120s.
> > If you could recover that, it would help us to locate the cause.
> >
> > Normally such deadlock should only affect the unlucky fs which meets the
> > condition, not all filesystems.
> > But if you're unlucky enough, it may happen to other filesystems.
> >
> > Anyway, without enough info, it's really hard to say.
>
> I was able to retrieve complete logs from the kernel for the entire time period. The BTRFS filesystem was online resized five days before the last apparent filesystem commit. Immediately after resize, a couple of csum errors were thrown for a single inode on the resized filesystem, though this was not detected at the time. The underlying hardware did not experience a fault at any point and is passing all diagnostics at this time. Intriguingly, there are a handful of files accessible from after the last known good filesystem commit (Oct. 29), but the vast majority are simply absent.
>
> At this point I'm more interested in making sure this type of event does not happen in the future than anything else. At no point did the kernel print any type of stack trace or deadlock warning. I'm starting to wonder if we hit a bug in the online resize path, but am just guessing at this point. The timing is certainly very close / coincidental.

To detect this kind of failure we use a watchdog script that invokes
mkdir and rmdir every 30 seconds on each filesystem backed by disk
(i.e. btrfs, ext4, and xfs). If the mkdir/rmdir takes too long (*)
then we try to log some information (mostly 'echo w >
/proc/sysrq-trigger') and force a reboot; a sketch of such a script
follows at the end of this message. mkdir and rmdir will eventually
get stuck on btrfs if there is a commit that is not making forward
progress. It's a surprisingly simple and effective bug detector on
ext4 and xfs too. (This doesn't detect the SSD thing--you'd need RAID1
to handle that case).

The lack of kernel messages is unexpected, especially when you have an
NFS process stuck in D state long enough to get admins to force a
reboot. That should have produced at least a stuck task warning if
hung task warnings are enabled in your kernel (a quick check is also
sketched below). Did anyone capture the nfsd process stack trace?

(*) "Too long" can be surprisingly long. Some btrfs algorithms don't
have bounded running time and can delay a commit for several hours if
there are active writers on the system. We log commits that take over
100 seconds, send alarms to admins at one hour, and force an automatic
reboot after 12 hours.

> Thanks
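Purely as an illustration, here is a minimal sketch (in Python) of the
kind of watchdog described above. The mount points, the thresholds, and
the use of sysrq 'b' for the forced reboot are placeholder assumptions,
not the actual production script:

#!/usr/bin/env python3
# Hypothetical sketch of the mkdir/rmdir commit-latency watchdog described
# above. Mount points, thresholds, and the forced-reboot method are
# assumptions for illustration only.
import os
import subprocess
import time

MOUNTPOINTS = ["/srv/export", "/home"]  # hypothetical: one per disk-backed fs
PROBE_INTERVAL = 30           # seconds between probes, per the description
DUMP_THRESHOLD = 100          # seconds: dump blocked-task stacks past this
REBOOT_THRESHOLD = 12 * 3600  # seconds: force a reboot past this

def sysrq(key):
    # Writing to /proc/sysrq-trigger requires root and CONFIG_MAGIC_SYSRQ.
    with open("/proc/sysrq-trigger", "w") as f:
        f.write(key)

def probe(mount):
    """mkdir+rmdir in a child process; return elapsed seconds, or None if
    the child is still stuck once REBOOT_THRESHOLD has passed."""
    testdir = os.path.join(mount, ".commit-watchdog")
    child = subprocess.Popen(
        ["sh", "-c", 'mkdir -p "$1" && rmdir "$1"', "sh", testdir])
    start = time.monotonic()
    dumped = False
    while child.poll() is None:       # poll() never blocks on the stuck fs
        elapsed = time.monotonic() - start
        if elapsed > REBOOT_THRESHOLD:
            return None
        if elapsed > DUMP_THRESHOLD and not dumped:
            sysrq("w")                # 'w': dump stacks of blocked (D state) tasks
            dumped = True
        time.sleep(5)
    return time.monotonic() - start

def main():
    while True:
        for mount in MOUNTPOINTS:
            took = probe(mount)
            if took is None:
                sysrq("w")            # capture final state in dmesg
                sysrq("b")            # 'b': immediate reboot, no sync/unmount
            elif took > DUMP_THRESHOLD:
                print(f"slow commit on {mount}: {took:.0f}s")
        time.sleep(PROBE_INTERVAL)

if __name__ == "__main__":
    main()

Each probe runs in a child process so the watchdog itself never blocks
on the stuck filesystem; a child stuck in D state cannot be killed, so
it is simply abandoned once the reboot threshold has passed.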
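On the stuck task warnings mentioned above: those come from the
kernel's hung task detector, which exists only when the kernel is
built with CONFIG_DETECT_HUNG_TASK and fires only for tasks blocked
longer than kernel.hung_task_timeout_secs (120 seconds by default),
up to kernel.hung_task_warnings reports. A quick way to check whether
the detector could have fired on a given server (again, just a
sketch):

#!/usr/bin/env python3
# Sketch: print the hung task detector settings, if the kernel has it built in.
from pathlib import Path

for name in ("hung_task_timeout_secs", "hung_task_warnings", "hung_task_panic"):
    path = Path("/proc/sys/kernel") / name
    if path.exists():
        print(f"{name} = {path.read_text().strip()}")
    else:
        # A missing sysctl file usually means CONFIG_DETECT_HUNG_TASK is not
        # set, so "task blocked for more than N seconds" messages never appear.
        print(f"{name}: not present (detector not built into this kernel?)")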
