Re: btrfs root fs started remounting ro

On Fri, Feb 7, 2020 at 3:31 PM John Hendy <jw.hendy@xxxxxxxxx> wrote:
>
> On Fri, Feb 7, 2020 at 2:22 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Feb 7, 2020 at 10:52 AM John Hendy <jw.hendy@xxxxxxxxx> wrote:
> >
> > > As an update, I'm now running off of a different drive (ssd, not the
> > > nvme) and I got the error again! I'm now inclined to think this might
> > > not be hardware after all, but something related to my setup or a bug
> > > with chromium.
> >
> > Even if there's a Chromium bug, it should result in file system
> > corruption like what you're seeing.
>
> I'm assuming you meant "*shouldn't* result in file system corruption"?

Ha! Yes, of course.


> Indeed. Just reproduced it:
> - https://pastebin.com/UJ8gbgFE

[  126.656696] BTRFS info (device dm-0): turning on discard

I advise removing the discard mount option from /etc/fstab. That
means doing a manual fstrim once in a while (or enabling
fstrim.timer), and it lets you rule discard in or out as a factor in
these problems.
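
For example, assuming a typical root entry (adjust the device,
options, and subvolume to match your fstab):

  # before
  /dev/mapper/root  /  btrfs  rw,noatime,discard,subvol=root  0 0
  # after
  /dev/mapper/root  /  btrfs  rw,noatime,subvol=root  0 0

Then enable periodic trim in its place:

  $ sudo systemctl enable --now fstrim.timer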


> Aside: is there a preferred way for sharing these? The page I read
> about this list said text couldn't exceed 100kb, but my original
> appears to have bounced and the dmesg alone is >100kb... Just want to
> make sure pastebin is cool and am happy to use something
> better/preferred.

Everyone has their own convention. Mine is to put the entire dmesg up
on Google Drive, unedited, and include the URL. Then I extract the
excerpts I think are relevant and paste them into the email body.
That way search engines can find relevant threads.

> Clarification, and apologies for the confusion:
> - the m2.sata in my original post was my primary drive and had an
> issue, then I wiped, mkfs.btrfs from scratch, reinstalled linux, etc.
> and it happened again.
>
> - the ssd I'm now running on was the former boot drive in my last
> computer which I was using as a backup drive for /mnt/vault pool but
> still had the old root fs. After the m2.sata failure, I started
> booting from it. It is not a new fs but >2yrs old.

Got it. Well, it would be really bad luck, but not impossible, to
have two different drives with discard-related firmware bugs. The
point of going through the tedious work of proving it: such devices
get the relevant (mis)feature blacklisted in the kernel for that
make/model, so that no one else experiences it.

>
> If you'd like, let's stick to troubleshooting the ssd for now.
>
> > [   60.697438] BTRFS error (device dm-0): parent transid verify failed
> > on 202711384064 wanted 68719924810 found 448074

448074 is a reasonable generation for a 2 year old file system. I
doubt 68719924810 is.


> $ lsattr /home/jwhendy/.config/chromium/Default/Cookies
> -------------------- /home/jwhendy/.config/chromium/Default/Cookies

No +C attribute, so these files should have csums.
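
(For comparison, a nodatacow file, i.e. one set with 'chattr +C',
shows a C flag in lsattr output, something like
'---------------C----', and Btrfs doesn't checksum the data in such
files.)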


> Yes, though I have turned that off for the SSD ever since I started
> booting from it. That said, I realized that discard is still in my
> fstab... is this a potential source of the transid/csum issues? I've
> now removed that and am about to reboot after I send this.

Maybe.


> I just updated today which put me at 5.5.2, but in theory yes. And as
> I went to check that I get an Input/Output error trying to check the
> pacman log! Here's the dmesg with those new errors included:
> - https://pastebin.com/QzYQ2RRg
>
> I'm still mounted rw, but my gosh... what the heck is happening. The
> output is for a different root/inode:

Understand that Btrfs is like a canary in the coal mine. It's *less*
tolerant of hardware problems than other file systems, because it
doesn't trust the hardware. Everything is checksummed. The instant
there's a problem, Btrfs will start complaining, and if it gets
confused it goes ro in order to stop spreading the corruption.
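
If you want to see what it has recorded so far, the per-device error
counters (write, read, flush, corruption, generation) can be read
with something like:

  $ sudo btrfs device stats /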


>
> $ sudo btrfs insp inod -v 273 /
> ioctl ret=0, bytes_left=4053, bytes_missing=0, cnt=1, missed=0
> //var/log/pacman.log
>
> Is the double // a concern for that file?

No, it's just a convention of the tool's output; nothing to worry
about.


> - ssd: Samsung 850 evo, 250G
> - m2.sata: nvme Samsung 960 evo, 250G

As a first step, stop using the discard mount option. Then delete all
the corrupt files, searching for any other affected inodes the same
way. Once you're sure they're all deleted, do a scrub and report
back. If the scrub finds no errors, then I suggest booting off
install media, running 'btrfs check --mode=lowmem', and reporting
that output to the list as well. Don't use --repair even if problems
are reported. A rough sketch of those steps follows.
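
Something like this; inode 273 is just the one from your paste, so
repeat the resolve step for every inode dmesg complains about, and
substitute your actual device node in the last command:

  # map a complained-about inode to a path, then delete that file
  $ sudo btrfs inspect-internal inode-resolve -v 273 /
  # once the affected files are gone, scrub and check the result
  $ sudo btrfs scrub start /
  $ sudo btrfs scrub status /
  # from install media, with the file system unmounted:
  $ sudo btrfs check --mode=lowmem /dev/mapper/root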

A general rule is to change only one thing at a time when
troubleshooting; that way you have a much easier time finding the
source of the problem. I'm not sure how quickly this problem first
appeared, days or weeks? But you want to run for about that long,
unless the problem happens again, to prove whether any one change
solved it. Ideally you'd then revert the suspected setting to prove
it's the source, but that's tedious and up to you. It's fine to just
never use the discard mount option again if that's what's causing
the problem.

I can't really estimate whether this is a defect in the SSD, a drive
firmware bug that's maybe fixed by a firmware update, or a Btrfs
regression. BTW, I think your laptop has a more recent firmware
update available: 01.31 Rev.A, 13.5 MB, Nov 8, 2019. Could it be
related? *shrug* No idea, but it's vaguely possible. More likely such
things are drive firmware related.

-- 
Chris Murphy


