On 2020/2/9 8:59 AM, John Hendy wrote:
> Also, if it's of interest, the zero-log trick was new to me. For my original m2.sata nvme drive, I'd already run all of --init-csum-tree, --init-extent-tree, and --repair (unsure on the order of the first two, but --repair was definitely last) but could then not mount it. I just ran `btrfs rescue zero-log` on it and here is the very brief output from a btrfs check:
>
> $ sudo btrfs check /dev/mapper/nvme
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/nvme
> UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
> [1/7] checking root items
> [2/7] checking extents
> data backref 40762777600 root 256 owner 525787 offset 0 num_refs 0 not found in extent tree
> incorrect local backref count on 40762777600 root 256 owner 525787 offset 0 found 1 wanted 0 back 0x5635831f9a20
> incorrect local backref count on 40762777600 root 4352 owner 525787 offset 0 found 0 wanted 1 back 0x56357e5a3c70
> backref disk bytenr does not match extent record, bytenr=40762777600, ref bytenr=0
> backpointer mismatch on [40762777600 4096]

At this stage, btrfs check --repair should be able to fix it.

Or does it still segfault?

Thanks,
Qu

> ERROR: errors found in extent allocation tree or chunk allocation
> [3/7] checking free space cache
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 87799443456 bytes used, error(s) found
> total csum bytes: 84696784
> total tree bytes: 954220544
> total fs tree bytes: 806535168
> total extent tree bytes: 47710208
> btree space waste bytes: 150766636
> file data blocks allocated: 87780622336
>  referenced 94255783936
>
> If that looks promising... I'm hoping that the ssd we're currently working on will follow suit! I'll await your recommendation for what to do on the previous inquiries for the SSD, and if you have any suggestions for the backref errors on the nvme drive above.
>
> Many thanks,
> John
>
> On Sat, Feb 8, 2020 at 6:51 PM John Hendy <jw.hendy@xxxxxxxxx> wrote:
>>
>> On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>>>
>>> On 2020/2/9 5:57 AM, John Hendy wrote:
>>>> On phone due to no OS, so apologies if this is in html mode. Indeed, I can't mount or boot any longer. I get the error:
>>>>
>>>> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed to recover log tree)
>>>> BTRFS error (device dm-0): open_ctree failed
>>>
>>> That can be easily fixed by `btrfs rescue zero-log`.
>>
>> Whew. This was most helpful and it is wonderful to be booting at least. I think the outstanding issues are:
>> - what should I do about `btrfs check --repair` seg faulting?
>> - how can I deal with this (probably related to the seg fault) ghost file that cannot be deleted?
>> - I'm not sure if you looked at the post --repair log, but there are a ton of these errors that didn't used to be there:
>>
>> backpointer mismatch on [13037375488 20480]
>> ref mismatch on [13037395968 892928] extent item 0, found 1
>> data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0 not found in extent tree
>> incorrect local backref count on 13037395968 root 263 owner 4257169 offset 0 found 1 wanted 0 back 0x5627f59cadc0
>>
>> Here is the latest btrfs check output after the zero-log operation:
>> - https://pastebin.com/KWeUnk0y
>>
>> I'm hoping once that file is deleted, it's a matter of --init-csum-tree and perhaps I'm set? Or --init-extent-tree?
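For reference, the zero-log route discussed above comes down to roughly the following; the device path is the one from the check output above, the filesystem must be unmounted, and this is only a sketch of the discussion rather than a verified recovery recipe:

$ sudo btrfs rescue zero-log /dev/mapper/nvme   # clear the corrupted log tree so the fs can be mounted again
$ sudo btrfs check /dev/mapper/nvme             # read-only check to see which errors remain
$ sudo btrfs check --repair /dev/mapper/nvme    # only if the read-only check still reports the backref errors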
>>
>> Thanks,
>> John
>>
>>> At least, btrfs check --repair didn't make things worse.
>>>
>>> Thanks,
>>> Qu
>>>>
>>>> John
>>>>
>>>> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@xxxxxxxxx> wrote:
>>>>
>>>> This is not going so hot. Updates:
>>>>
>>>> booted from arch install, pre repair btrfs check:
>>>> - https://pastebin.com/6vNaSdf2
>>>>
>>>> btrfs check --mode=lowmem as requested by Chris:
>>>> - https://pastebin.com/uSwSTVVY
>>>>
>>>> Then I did btrfs check --repair, which seg faulted at the end. I've typed the output from pictures I took:
>>>>
>>>> Starting repair.
>>>> Opening filesystem to check...
>>>> Checking filesystem on /dev/mapper/ssd
>>>> [1/7] checking root items
>>>> Fixed 0 roots.
>>>> [2/7] checking extents
>>>> parent transid verify failed on 20271138064 wanted 68719924810 found 448074
>>>> parent transid verify failed on 20271138064 wanted 68719924810 found 448074
>>>> Ignoring transid failure
>>>> # ... repeated the previous two lines maybe hundreds of times
>>>> # ended with this:
>>>> ref mismatch on [12797435904 268505088] extent item 1, found 412
>>>> [1] 1814 segmentation fault (core dumped) btrfs check --repair /dev/mapper/ssd
>>>>
>>>> This was with btrfs-progs 5.4 (the install USB is maybe a month old).
>>>>
>>>> Here is the output of btrfs check after the --repair attempt:
>>>> - https://pastebin.com/6MYRNdga
>>>>
>>>> I rebooted to write this email given the seg fault, as I wanted to make sure that I should still follow up --repair with --init-csum-tree. I had pictures of the --repair output, but Firefox just wouldn't load imgur.com for me to post the pics and was acting really weird. Suspicious, I checked dmesg and found things have gone ro on me :( Here is the dmesg from this session:
>>>> - https://pastebin.com/a2z7xczy
>>>>
>>>> The gist is:
>>>>
>>>> [ 40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7 block=172703744 slot=0, csum end range (12980568064) goes beyond the start range (12980297728) of the next csum item
>>>> [ 40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983 total ptrs 34 free space 29 owner 7
>>>> [ 40.997942] item 0 key (18446744073709551606 128 12979060736) itemoff 14811 itemsize 1472
>>>> [ 40.997944] item 1 key (18446744073709551606 128 12980297728) itemoff 13895 itemsize 916
>>>> [ 40.997945] item 2 key (18446744073709551606 128 12981235712) itemoff 13811 itemsize 84
>>>> # ... there's maybe 30 of these item n key lines in total
>>>> [ 40.997984] BTRFS error (device dm-0): block=172703744 write time tree block corruption detected
>>>> [ 41.016793] BTRFS: error (device dm-0) in btrfs_commit_transaction:2332: errno=-5 IO failure (Error while writing out transaction)
>>>> [ 41.016799] BTRFS info (device dm-0): forced readonly
>>>> [ 41.016802] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
>>>> [ 41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890: errno=-5 IO failure
>>>> [ 41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
>>>> [ 41.023473] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
>>>> [ 41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
>>>> [ 44.509418] systemd-journald[416]: /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal: Journal file corrupted, rotating.
>>>> [ 44.509440] systemd-journald[416]: Failed to rotate /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal: Read-only file system
>>>> [ 44.509450] systemd-journald[416]: Failed to rotate /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal: Read-only file system
>>>> [ 44.509540] systemd-journald[416]: Failed to write entry (23 items, 705 bytes) despite vacuuming, ignoring: Bad message
>>>> # ... then a bunch of these failed journal attempts (of note: /var/log/journal was one of the bad inodes from btrfs check previously)
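As a side note on tracing the "bad inodes" mentioned above back to file names: with the filesystem mounted, something along these lines should work. The inode number and mount point below are only placeholders taken from earlier output in this thread, not a verified mapping.

$ sudo btrfs inspect-internal inode-resolve 4257169 /home/jwhendy   # print the path(s) belonging to that inode in the mounted subvolume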
>>>>
>>>> Kindly let me know what you would recommend. I'm sadly back to an unusable system vs. a complaining/worrisome one. This is similar to the behavior I had with the m2.sata nvme drive in my original experience. After trying all of --repair, --init-csum-tree, and --init-extent-tree, I couldn't boot anymore. After my dm-crypt password at boot, I just saw a bunch of [FAILED] in the text splash output. Hoping to not repeat that with this drive.
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>>>> >
>>>> > On 2020/2/8 12:48 PM, John Hendy wrote:
>>>> > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>>>> > >>
>>>> > >> On 2020/2/8 1:52 AM, John Hendy wrote:
>>>> > >>> Greetings,
>>>> > >>>
>>>> > >>> I'm resending, as this isn't showing in the archives. Perhaps it was the attachments, which I've converted to pastebin links.
>>>> > >>>
>>>> > >>> As an update, I'm now running off of a different drive (ssd, not the nvme) and I got the error again! I'm now inclined to think this might not be hardware after all, but something related to my setup or a bug with chromium.
>>>> > >>>
>>>> > >>> After a reboot, chromium wouldn't start for me and dmesg showed similar parent transid/csum errors to my original post below. I used btrfs-inspect-internal to find the inode traced to ~/.config/chromium/History. I deleted that, and got a new set of errors tracing to ~/.config/chromium/Cookies. After I deleted that and tried starting chromium, I found that my btrfs /home/jwhendy pool was mounted ro just like the original problem below.
>>>> > >>>
>>>> > >>> dmesg after trying to start chromium:
>>>> > >>> - https://pastebin.com/CsCEQMJa
>>>> > >>
>>>> > >> So far, it's only a transid bug in your csum tree.
>>>> > >>
>>>> > >> And two backref mismatches in data backrefs.
>>>> > >>
>>>> > >> In theory, you can fix your problem by `btrfs check --repair --init-csum-tree`.
>>>> > >>
>>>> > >
>>>> > > Now that I might be narrowing in on offending files, I'll wait to see what you think from my last response to Chris. I did try the above when I first ran into this:
>>>> > > - https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@xxxxxxxxxxxxxx/
>>>> >
>>>> > That RO is caused by the missing data backref.
>>>> >
>>>> > Which can be fixed by btrfs check --repair.
>>>> >
>>>> > Then you should be able to delete the offending files. (Or the whole chromium cache, and switch to firefox if you wish :P )
>>>> >
>>>> > But also please keep in mind that the transid mismatch looks to have happened in your csum tree, which means your csum tree is no longer reliable, and may cause -EIO when reading unrelated files.
>>>> >
>>>> > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
>>>> >
>>>> > It can be done altogether by --repair --init-csum-tree, but to be safe, please run --repair only first, then make sure btrfs check reports no error after that. Then go --init-csum-tree.
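Spelled out, that staged sequence would look something like the following; the device path is the one used earlier in the thread, the filesystem must be unmounted, and this is a sketch of the recommendation rather than a guaranteed-safe procedure:

$ sudo btrfs check --repair /dev/mapper/ssd           # step 1: fix the missing data backrefs
$ sudo btrfs check /dev/mapper/ssd                    # step 2: confirm a plain check now reports no errors
$ sudo btrfs check --init-csum-tree /dev/mapper/ssd   # step 3: re-fill the csum tree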
>>>> > >
>>>> > >> But I'm more interested in how this happened.
>>>> > >
>>>> > > Me too :)
>>>> > >
>>>> > >> Have you ever experienced any power loss for your NVME drive?
>>>> > >> I'm not saying btrfs is unsafe against power loss, all fs should be safe against power loss, I'm just curious about whether mount time log replay is involved, or just regular internal log replay.
>>>> > >>
>>>> > >> From your smartctl, the drive experienced 61 unsafe shutdowns with 2144 power cycles.
>>>> > >
>>>> > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get caught off guard by low battery and instant power-off, I kick myself and mean to set up a script to force poweroff before that happens. So, indeed, I've lost power a ton. Surprised it was 61 times, but maybe not over ~2 years. And actually, I mis-stated the age. I haven't *booted* from this drive in almost 2yrs. It's a corporate laptop, issued every 3, so the ssd drive is more like 5 years old.
>>>> > >
>>>> > >> Not sure if it's related.
>>>> > >>
>>>> > >> Another interesting point is, do you remember what's the oldest kernel that ran on this fs? v5.4 or v5.5?
>>>> > >
>>>> > > Hard to say, but arch linux maintains a package archive. The nvme drive is from ~May 2018. The archives only go back to Jan 2019 and the kernel/btrfs-progs was at 4.20 then:
>>>> > > - https://archive.archlinux.org/packages/l/linux/
>>>> >
>>>> > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could cause metadata corruption. And the symptom is transid errors, which also matches your problem.
>>>> >
>>>> > Thanks,
>>>> > Qu
>>>> >
>>>> > >
>>>> > > Searching my Amazon orders, the SSD was in the 2015 time frame, so the kernel version would have been even older.
>>>> > >
>>>> > > Thanks for your input,
>>>> > > John
>>>> > >
>>>> > >>
>>>> > >> Thanks,
>>>> > >> Qu
>>>> > >>>
>>>> > >>> Thanks for any pointers, as it would now seem that my purchase of a new m2.sata may not buy my way out of this problem! While I didn't want to reinstall, at least new hardware is a simple fix. Now I'm worried there is a deeper issue bound to recur :(
>>>> > >>>
>>>> > >>> Best regards,
>>>> > >>> John
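For what it's worth, the unsafe-shutdown and power-cycle counters quoted earlier come straight from the drive's SMART data; a quick way to pull them would be something like the following (the device name is just an example):

$ sudo smartctl -a /dev/nvme0n1 | grep -i -E 'unsafe shutdowns|power cycles'   # show the counters Qu referenced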
>>>> > >>>
>>>> > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@xxxxxxxxx> wrote:
>>>> > >>>>
>>>> > >>>> Greetings,
>>>> > >>>>
>>>> > >>>> I've had this issue occur twice, once ~1mo ago and once a couple of weeks ago. Chromium suddenly quit on me, and when trying to start it again, it complained about a lock file in ~. I tried to delete it manually and was informed I was on a read-only fs! I ended up biting the bullet and re-installing linux due to the number of dead end threads and slow response rates on diagnosing these issues, and the issue occurred again shortly after.
>>>> > >>>>
>>>> > >>>> $ uname -a
>>>> > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40 +0000 x86_64 GNU/Linux
>>>> > >>>>
>>>> > >>>> $ btrfs --version
>>>> > >>>> btrfs-progs v5.4
>>>> > >>>>
>>>> > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
>>>> > >>>> Data, single: total=114.01GiB, used=80.88GiB
>>>> > >>>> System, single: total=32.00MiB, used=16.00KiB
>>>> > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
>>>> > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>>> > >>>>
>>>> > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>>>> > >>>> nvme0n1     259:5   0 232.9G  0 disk
>>>> > >>>> ├─nvme0n1p1 259:6   0   512M  0 part (/boot/efi)
>>>> > >>>> ├─nvme0n1p2 259:7   0     1G  0 part (/boot)
>>>> > >>>> └─nvme0n1p3 259:8   0 231.4G  0 part (btrfs)
>>>> > >>>>
>>>> > >>>> I have the following subvols:
>>>> > >>>> arch: used for / when booting arch
>>>> > >>>> jwhendy: used for /home/jwhendy on arch
>>>> > >>>> vault: shared data between distros on /mnt/vault
>>>> > >>>> bionic: root when booting ubuntu bionic
>>>> > >>>>
>>>> > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>>> > >>>>
>>>> > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>>>> > >>>
>>>> > >>> Edit: links now:
>>>> > >>> - btrfs check: https://pastebin.com/nz6Bc145
>>>> > >>> - dmesg: https://pastebin.com/1GGpNiqk
>>>> > >>> - smartctl: https://pastebin.com/ADtYqfrd
>>>> > >>>
>>>> > >>> btrfs dev stats (not worth a link):
>>>> > >>>
>>>> > >>> [/dev/mapper/old].write_io_errs 0
>>>> > >>> [/dev/mapper/old].read_io_errs 0
>>>> > >>> [/dev/mapper/old].flush_io_errs 0
>>>> > >>> [/dev/mapper/old].corruption_errs 0
>>>> > >>> [/dev/mapper/old].generation_errs 0
>>>> > >>>
>>>> > >>>> If these are of interest, here are reddit threads where I posted the issue and was referred here.
>>>> > >>>> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>>>> > >>>> 2) https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>>> > >>>>
>>>> > >>>> It has been suggested this is a hardware issue. I've already ordered a replacement m2.sata, but for sanity it would be great to know definitively this was the case. If anything stands out above that could indicate I'm not set up properly re: btrfs, that would also be fantastic so I don't repeat the issue!
>>>> > >>>>
>>>> > >>>> The only thing I've stumbled on is that I have been mounting with rd.luks.options=discard and that manually running fstrim is preferred.
>>>> > >>>>
>>>> > >>>> Many thanks for any input/suggestions,
>>>> > >>>> John
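On the discard vs. fstrim point at the end: if the discard mount option is dropped, periodic trimming can be done by hand or via the fstrim.timer unit shipped with util-linux. A minimal sketch, assuming a systemd-based setup like the Arch install described above:

$ sudo fstrim -v /                            # one-off trim of the mounted filesystem
$ sudo systemctl enable --now fstrim.timer    # or let the periodic timer handle it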