Re: Btrfs filesystem trashed after OOM scenario

On Tue, Sep 24, 2019, 18:34 Chris Murphy, <lists@xxxxxxxxxxxxxxxxx> wrote:
> On Tue, Sep 24, 2019 at 4:04 PM Nick Bowler <nbowler@xxxxxxxxxx> wrote:
> > - Running Linux 5.2.14, I pushed this system to OOM; the oom killer
> > ran and killed some userspace tasks.  At this point many of the
> > remaining tasks were stuck in uninterruptible sleeps.  Not really
> > worried, I turned the machine off and on again to just get everything
> > back to normal.  But I guess now that everything had already gone
> > horribly wrong at this point...
>
> Yeah the kernel oomkiller is pretty much only about kernel
> preservation, not user space preservation.

Indeed I am not bothered at all by needing to turn it off and on again
in this situation.  But filesystems being completely trashed is
another matter...

> > - Upon reboot, the system boots OK but now btrfs is throwing zillions
> > of checksum errors.  After some time the filesystem is remounted
> > readonly and I lose the ability to interact with the system at all, so
> > it gets powered off.
> >
> > - Now the filesystem is unmountable.
>
> The transid errors look like they might be caused by the 5.2 regression
>
> https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@xxxxxxxxxx/T/#u
>
> Fixed since 5.2.15 and 5.3.0.

Yikes, so my decision to update to the latest kernel two weeks ago
was perhaps a very bad one.  Should've stuck with 4.19.y I guess.
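
For anyone wanting to check a machine against the versions mentioned
above, here is a minimal sketch.  It assumes the regression covers
5.2.0 up to (but not including) 5.2.15, which is only what this thread
implies and not something I have verified against the kernel history.
The function names are made up for illustration.

```python
import re


def parse_release(release):
    """Extract (major, minor, patch) from e.g. '5.2.14' or '4.19.34-1-lts'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = m.groups()
    # A missing patch level (e.g. '5.3') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def in_suspect_range(release):
    """True if the release falls in the assumed affected range.

    Assumption: affected means 5.2.0 <= version < 5.2.15, based only on
    the "5.2 regression, fixed since 5.2.15 and 5.3.0" statement.
    """
    return (5, 2, 0) <= parse_release(release) < (5, 2, 15)
```

Passing the output of "uname -r" (or platform.release() in Python) to
in_suspect_range() gives a quick yes/no.  Distribution kernels may carry
backported fixes, so a "yes" is only a hint to look closer.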

> So if you're willing to blow shit up again, you can try to reproduce
> with one of those.

Well I could try but it sounds like this might be hard to reproduce...

> I was also doing oomkiller blow shit up tests a few weeks ago with
> these same problem kernels and never hit this bug, or any others. I
> also had to do a LOT of force power offs because the system just
> became totally wedged in and I had no way of estimating how long it
> would be for recovery so after 30 minutes I hit the power button. Many
> times. Zero corruptions. That's with a single Samsung 840 EVO in a
> laptop relegated to such testing.

Just a thought... the system was still alive and I was able to briefly
inspect the situation and notice that tasks were blocked and
unkillable... until my shell hung too and then I was hosed.  I didn't
hit the power button, though; I rebooted with sysrq+e, sysrq+u,
sysrq+b.  Not sure if that makes a difference.

> Might be a different bug. Not sure. But also, this is with
>
> > [  347.551595] CPU: 3 PID: 1143 Comm: mount Not tainted 4.19.34-1-lts #1
>
> So I don't know how an older kernel will report on the problem caused
> by the 5.2 bug.

This is the kernel from systemrescuecd.  I can try taking a disk image
and mounting it on another machine with a newer Linux version.

Thanks,
  Nick


