On Fri, Feb 7, 2020 at 5:17 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 7, 2020 at 3:31 PM John Hendy <jw.hendy@xxxxxxxxx> wrote:
> >
> > On Fri, Feb 7, 2020 at 2:22 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Feb 7, 2020 at 10:52 AM John Hendy <jw.hendy@xxxxxxxxx> wrote:
> > >
> > > > As an update, I'm now running off of a different drive (ssd, not the
> > > > nvme) and I got the error again! I'm now inclined to think this might
> > > > not be hardware after all, but something related to my setup or a bug
> > > > with chromium.
> > >
> > > Even if there's a Chromium bug, it should result in file system
> > > corruption like what you're seeing.
> >
> > I'm assuming you meant "*shouldn't* result in file system corruption"?
>
> Ha! Yes, of course.
>
>
> > Indeed. Just reproduced it:
> > - https://pastebin.com/UJ8gbgFE
>
> [ 126.656696] BTRFS info (device dm-0): turning on discard
>
> I advise removing the discard mount option from /etc/fstab. This
> obviates manual fstrim, and makes sure you can't correlate discards to
> these problems.

Done!

/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on / type btrfs (rw,relatime,compress=lzo,ssd,space_cache,subvolid=263,subvol=/arch)
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on /home/jwhendy type btrfs (rw,relatime,compress=lzo,ssd,space_cache,subvolid=339,subvol=/jwhendy)
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on /mnt/vault type btrfs (rw,relatime,compress=lzo,ssd,space_cache,subvolid=265,subvol=/vault)

> > Aside: is there a preferred way for sharing these? The page I read
> > about this list said text couldn't exceed 100kb, but my original
> > appears to have bounced and the dmesg alone is >100kb... Just want to
> > make sure pastebin is cool and am happy to use something
> > better/preferred.
>
> Everyone has their own convention. My preferred convention is to put
> the entire dmesg up on google drive, unedited, and include the URL.
> And then I extract excerpts I think are relevant and paste into the
> email body. That way search engines can find relevant threads.
>

Thanks for that. I'll stick to pastebin for now just for convenience.
Mainly I wanted to make sure that links to these were reasonable, and
it sounds like this is okay for the list. Thanks!

> > Clarification, and apologies for the confusion:
> > - the m2.sata in my original post was my primary drive and had an
> > issue, then I wiped, mkfs.btrfs from scratch, reinstalled linux, etc.
> > and it happened again.
> >
> > - the ssd I'm now running on was the former boot drive in my last
> > computer which I was using as a backup drive for /mnt/vault pool but
> > still had the old root fs. After the m2.sata failure, I started
> > booting from it. It is not a new fs but >2yrs old.
>
> Got it. Well it would be really bad luck but not impossible to have
> two different drives with discard related firmware bugs. But the point
> of going through the tedious work to prove this? Such devices will get
> the relevant (mis)feature blacklisted in the kernel for that
> make/model so that no one else experiences it.
>
>
> > If you'd like, let's stick to troubleshooting the ssd for now.
> >
> > > [ 60.697438] BTRFS error (device dm-0): parent transid verify failed
> > > on 202711384064 wanted 68719924810 found 448074
>
> 448074 is reasonable for a 2 year old file system. I doubt 68719924810 is.
>
>
> > $ lsattr /home/jwhendy/.config/chromium/Default/Cookies
> > -------------------- /home/jwhendy/.config/chromium/Default/Cookies
>
> No +C so these files should have csums.
>
>
> > Yes, though I have turned that off for the SSD ever since I started
> > booting from it. That said, I realized that discard is still in my
> > fstab... is this a potential source of the transid/csum issues? I've
> > now removed that and am about to reboot after I send this.
>
> Maybe.
>
>
> > I just updated today which put me at 5.5.2, but in theory yes.
> > And as I went to check that I get an Input/Output error trying to
> > check the pacman log! Here's the dmesg with those new errors included:
> > - https://pastebin.com/QzYQ2RRg
> >
> > I'm still mounted rw, but my gosh... what the heck is happening. The
> > output is for a different root/inode:
>
> Understand that Btrfs is like a canary in the coal mine. It's *less*
> tolerant of hardware problems than other file systems, because it
> doesn't trust the hardware. Everything is checksummed. The instant
> there's a problem, Btrfs will start complaining, and if it gets
> confused it goes ro in order to stop spreading the corruption.
>
>
> > $ sudo btrfs insp inod -v 273 /
> > ioctl ret=0, bytes_left=4053, bytes_missing=0, cnt=1, missed=0
> > //var/log/pacman.log
> >
> > Is the double // a concern for that file?
>
> No it's just a convention.
>
>
> > - ssd: Samsung 850 evo, 250G
> > - m2.sata: nvme Samsung 960 evo, 250G
>
> As a first step, stop using discard mount option. And delete all the
> corrupt files by searching for other affected inodes. Once you're sure
> they're all deleted, do a scrub and report back. If the scrub finds no
> errors, then I suggest booting off install media and running 'btrfs
> check --mode=lowmem' and reporting that output to the list also. Don't
> use --repair even if there are reported problems.

I tried to remove .config/chromium, but ran into a weird problem. I was
getting an error on `rm` with a TransportSecurity file saying "No such
file or directory." More on that below.

I also removed /var/log/pacman.log, the other offending file from the
previous inode error.
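
A side note on the transid error quoted earlier (wanted 68719924810,
found 448074): the two numbers can be compared bit by bit. The sketch
below is plain arithmetic on the values from the dmesg line; the
function name is made up for illustration. It shows the values differ
in a single bit (bit 36). That observation alone does not establish the
cause; reading it as a flipped bit somewhere between memory and disk is
an assumption, not something confirmed in this thread.

```python
def differing_bits(wanted, found):
    """Return positions (0 = least significant) of the bits that differ."""
    diff = wanted ^ found
    return [pos for pos in range(diff.bit_length()) if (diff >> pos) & 1]


if __name__ == "__main__":
    # Values copied from the "parent transid verify failed" dmesg line.
    wanted, found = 68719924810, 448074
    bits = differing_bits(wanted, found)
    print(f"wanted ^ found = {wanted ^ found:#x}, differing bits: {bits}")
```

In other words, 68719924810 is exactly 448074 + 2^36.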

At this point I tried a `btrfs scrub start /` but it fails (aborted):

[ 126.520270] BTRFS error (device dm-0): parent transid verify failed on 202711384064 wanted 68719924810 found 448074
[ 126.532637] BTRFS info (device dm-0): scrub: not finished on devid 1 with status: -5

Full dmesg at that point:
- https://pastebin.com/9TvvMVpE

Brief aside before we get back to .config/chromium: after I sent the
last message and removed the discard option (but before I deleted these
files), I ran btrfs check from an arch install usb.
- https://pastebin.com/Wdg8aqTY

The first inode resolved to /var/log/journal so I just rm'd the whole
thing. Every subsequent inode on root 263 (/ mountpoint) resulted in the
following, so I think problematic files on / are set:

ERROR: ino paths ioctl: No such file or directory

This inode was also in the output of the btrfs check, and is the same
file I can't delete from above:

root 339 inode 17848 errors 200, dir isize wrong
    unresolved ref dir 17848 index 6 namelen 11 name File System filetype 2 errors 2, no dir index
root 339 inode 4504988 errors 1, no inode item
    unresolved ref dir 17848 index 489287 namelen 17 name TransportSecurity filetype 1 errors 5, no dir item, no inode ref

$ sudo btrfs insp inode -v 17848 /home/jwhendy/
[sudo] password for jwhendy:
ioctl ret=0, bytes_left=4034, bytes_missing=0, cnt=1, missed=0
/home/jwhendy//.local/share/Trash/expunged/3065996973

$ cd .local/share/Trash/expunged/3065996973/
$ ls
ls: cannot access 'TransportSecurity': No such file or directory
TransportSecurity

$ ls -la
ls: cannot access 'TransportSecurity': No such file or directory
total 0
drwx------ 1 jwhendy jwhendy 22 Feb 7 21:42 .
drwx------ 1 jwhendy jwhendy 20 Feb 7 21:46 ..
-????????? ? ? ? ? ? TransportSecurity

Posts online suggest `rm -i -- ./*` but that doesn't work.

$ rm -i -- ./*
rm: cannot remove './TransportSecurity': No such file or directory

I also found a post suggesting this, potentially revealing weird,
non-obvious characters that might be present:

$ ls | od -a
0000000 T r a n s p o r t S e c u r i t
0000020 y nl
0000022

Not sure what to make of that. In other StackOverflow and similar posts,
the `rm -i -- ./*` does the trick. Yet another post suggested moving to
/tmp and rebooting, but I can't move it (same "no such file or
directory" error).

Any input on how to blow this thing up?

> A general rule is to change only one thing at a time when
> troubleshooting. That way you have a much easier time finding the
> source of the problem. I'm not sure how quickly this problem started
> to happen, days or weeks? But you want to go for about that long,
> unless the problem happens again, to prove whether any change solved
> the problem. Ideally, you revert to the suspected setting that causes
> the problem to try and prove it's the source, but that's tedious and
> up to you. It's fine to just not ever use the discard mount option if
> that's what's causing the problem.
>
> I can't really estimate whether that could be a defect in the SSD, or
> a firmware bug that's maybe fixed with a firmware update, or a Btrfs
> regression bug. BTW, I think your laptop has a more recent firmware
> update available. 01.31 Rev.A 13.5 MB Nov 8, 2019. Could it be
> related? *shrug* No idea. But it's vaguely possible. More likely such
> things are drive firmware related.

firmware = BIOS? I can check that. Or if this is intel-ucode, I just
have whatever arch has as current...

Thanks again,
John

>
> --
> Chris Murphy
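
For reference, the `ls | od -a` and `ls -la` checks above can be
combined in one small script. This is only a diagnostic sketch (the
function name is hypothetical): it lists the raw bytes of every name in
a directory and reports whether lstat() succeeds on it. It does not
repair or delete anything. In the od output above the name is 17 plain
ASCII bytes, which matches the "namelen 17" in the btrfs check output,
so hidden characters do not appear to be the issue; the entry is listed
in the directory but cannot be stat'ed.

```python
import os


def inspect_entries(directory):
    """Return (raw_name_bytes, stat_ok) for every entry in `directory`.

    Names are read as bytes so that odd or non-printable characters
    show up in repr() instead of being hidden by the terminal.
    """
    results = []
    raw_dir = os.fsencode(directory)
    for raw_name in sorted(os.listdir(raw_dir)):
        try:
            os.lstat(os.path.join(raw_dir, raw_name))
            stat_ok = True
        except FileNotFoundError:
            # Listed by the directory, but there is no inode to stat.
            stat_ok = False
        results.append((raw_name, stat_ok))
    return results


if __name__ == "__main__":
    for raw_name, stat_ok in inspect_entries("."):
        status = "ok" if stat_ok else "listed, but lstat fails (ENOENT)"
        print(f"{raw_name!r} ({len(raw_name)} bytes): {status}")
```

Run from inside the affected directory, an entry flagged as failing
lstat corresponds to the "-?????????" line that `ls -la` prints.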
