Re: btrfs root fs started remounting ro

On Fri, Feb 7, 2020 at 2:22 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 7, 2020 at 10:52 AM John Hendy <jw.hendy@xxxxxxxxx> wrote:
>
> > As an update, I'm now running off of a different drive (ssd, not the
> > nvme) and I got the error again! I'm now inclined to think this might
> > not be hardware after all, but something related to my setup or a bug
> > with chromium.
>
> Even if there's a Chromium bug, it should result in file system
> corruption like what you're seeing.

I'm assuming you meant "*shouldn't* result in file system corruption"?

>
> > dmesg after trying to start chromium:
> > - https://pastebin.com/CsCEQMJa
>
> Could you post the entire dmesg, start to finish, for the boot in
> which this first occurred?

Indeed. Just reproduced it:
- https://pastebin.com/UJ8gbgFE

Aside: is there a preferred way to share these? The page I read
about this list said messages couldn't exceed 100kb, but my original
appears to have bounced and the dmesg alone is >100kb... Just want to
make sure pastebin is acceptable; I'm happy to use something
better/preferred.

> This transid isn't realistic, in particular for a filesystem this new.

Clarification, and apologies for the confusion:
- the m2.sata in my original post was my primary drive and had the
issue; I then wiped it, ran mkfs.btrfs from scratch, reinstalled
linux, etc., and it happened again.

- the ssd I'm now running on was the boot drive in my previous
computer, which I had been using as a backup drive for the /mnt/vault
pool but which still had the old root fs. After the m2.sata failure,
I started booting from it. It is not a new fs; it's >2yrs old.

If you'd like, let's stick to troubleshooting the ssd for now.

> [   60.697438] BTRFS error (device dm-0): parent transid verify failed
> on 202711384064 wanted 68719924810 found 448074
> [   60.697457] BTRFS info (device dm-0): no csum found for inode 19064
> start 2392064
> [   60.697777] BTRFS warning (device dm-0): csum failed root 339 ino
> 19064 off 2392064 csum 0x8941f998 expected csum 0x00000000 mirror 1
>
> Expected csum null? Are these files using chattr +C? Something like
> this might help figure it out:
>
> $ sudo btrfs insp inod -v 19064 /home

$ sudo btrfs insp inod -v 19056 /home/jwhendy
ioctl ret=0, bytes_left=4039, bytes_missing=0, cnt=1, missed=0
/home/jwhendy/.config/chromium/Default/Cookies
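
In case it helps anyone following along: my understanding is that the
"root 339" in the csum error is the subvolume ID, so the inode has to
be resolved against that subvolume's mount point. A rough sketch of
mapping the two (long-form command names; adjust paths as needed):

$ # map the root/subvolume ID from the dmesg line to its path
$ sudo btrfs subvolume list / | grep 'ID 339'
$ # then resolve the inode against wherever that subvolume is mounted
$ sudo btrfs inspect-internal inode-resolve -v 19064 /home/jwhendy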

> $ lsattr /path/to/that/file/

$ lsattr /home/jwhendy/.config/chromium/Default/Cookies
-------------------- /home/jwhendy/.config/chromium/Default/Cookies
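
(If I understand +C correctly, it would show as a 'C' in that flag
field, so nodatacow isn't set on the file itself. Directories can
pass +C to newly created files, though, so checking the parent dir
seems worthwhile; a small sketch:)

$ # -d lists the directory entry itself rather than its contents
$ lsattr -d /home/jwhendy/.config/chromium/Default/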

> Report output for both.
>
>
> > Thanks for any pointers, as it would now seem that my purchase of a
> > new m2.sata may not buy my way out of this problem! While I didn't
> > want to reinstall, at least new hardware is a simple fix. Now I'm
> > worried there is a deeper issue bound to recur :(
>
> Yep. And fixing Btrfs is not simple.
>
> > > nvme0n1p3 is encrypted with dm-crypt/LUKS.
>
> I don't think the problem is here, except that I sooner believe
> there's a regression in dm-crypt or Btrfs with discards, than I
> believe two different drives have discard related bugs.
>
>
> > > The only thing I've stumbled on is that I have been mounting with
> > > rd.luks.options=discard and that manually running fstrim is preferred.
>
> This was the case for both the NVMe and SSD drives?

Yes, though I turned that off for the SSD when I started booting from
it. That said, I realized that discard is still in my fstab... is
this a potential source of the transid/csum issues? I've now removed
it and am about to reboot after I send this.

$ mount | grep btrfs
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on / type btrfs
(rw,relatime,compress=lzo,ssd,discard,space_cache,subvolid=263,subvol=/arch)
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on /home/jwhendy
type btrfs (rw,relatime,compress=lzo,ssd,discard,space_cache,subvolid=339,subvol=/jwhendy)
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on /mnt/vault
type btrfs (rw,relatime,compress=lzo,ssd,discard,space_cache,subvolid=265,subvol=/vault)
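
For the record, my plan for moving from continuous discard to
periodic trim (a sketch, assuming the btrfs nodiscard mount option
and the fstrim.timer unit that ships with util-linux on Arch):

$ # drop discard from the live mounts without rebooting
$ sudo mount -o remount,nodiscard /
$ sudo mount -o remount,nodiscard /home/jwhendy
$ # switch to weekly trim via systemd
$ sudo systemctl enable --now fstrim.timer
$ # note: fstrim still needs discards allowed at the dm-crypt layer
$ # (e.g. the LUKS allow-discards option)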

> What was the kernel version this problem first appeared on with NVMe?
> For the (new) SSD you're using 5.5.1, correct?

I just updated today, which put me at 5.5.2, but in theory yes. And
as I went to check that, I got an Input/Output error trying to read
the pacman log! Here's the dmesg with those new errors included:
- https://pastebin.com/QzYQ2RRg

I'm still mounted rw, but my gosh... what the heck is happening? The
output below is for a different root/inode:

$ sudo btrfs insp inod -v 273 /
ioctl ret=0, bytes_left=4053, bytes_missing=0, cnt=1, missed=0
//var/log/pacman.log

Is the double // a concern for that file?

$ sudo lsattr /var/log/pacman.log
-------------------- /var/log/pacman.log
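
In case it helps scope the damage, my understanding is that a
read-only scrub should enumerate every file with bad checksums
without touching anything (a sketch; -B keeps it in the foreground,
-d prints per-device stats, -r makes it read-only):

$ sudo btrfs scrub start -Bdr /
$ # affected paths should show up as csum error lines in dmesg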

> Can you correlate both corruption events to recent use of fstrim?

I've never used fstrim manually on either drive.

> What are the make/model of both drives?

- ssd: Samsung 850 evo, 250G
- m2.sata: nvme Samsung 960 evo, 250G
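
For what it's worth, checking whether discards even propagate through
the dm-crypt layer on each drive seems straightforward (my
understanding: non-zero DISC-GRAN/DISC-MAX on the luks-* device means
they do):

$ lsblk --discard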

> In the meantime, I suggest refreshing backups. Btrfs won't allow files
> with checksums that it knows are corrupt to be copied to user space.
> But it sounds like so far the only files affected are Chrome cache
> files? If so this is relatively straight forward to get back to a
> healthy file system. And then it's time to start iterating some of the
> setup to find out what's causing the problem.

So far, it seems limited to chromium; I'm not sure about the new
input/output error when trying to cat/grep /var/log/pacman.log. I can
also mount my old drive read-only just fine, and I have not done
anything significant on the new one. If/when we get to potentially
destructive operations, I'll certainly refresh backups prior to doing
those.
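
On refreshing backups: a minimal sketch of what I have in mind, with
/mnt/external standing in for a separate healthy drive (a placeholder
path, not one of my real mounts); files with known-bad csums should
just error out with EIO and can be retried/excluded afterwards:

$ # -aHAX preserves perms, hardlinks, ACLs, and xattrs
$ sudo rsync -aHAX --info=progress2 /home/jwhendy/ /mnt/external/home-backup/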

Really appreciate the help!
John

>
> --
> Chris Murphy


