On 4/1/20 9:56 PM, Christoph Anton Mitterer wrote:
Hey Josef, et al.
First, many thanks for the quick help before. :-)
On Wed, 2020-04-01 at 16:40 -0400, Josef Bacik wrote:
btrfs rescue zero-log /dev/whatever
This worked nicely, and at first glance (though I haven't diffed any of
the data against backups nor run a scrub yet) it seems to mostly all
be there.
I have a number of questions though...
1) Could this be a bug?
Yes, I know I had a freeze, but here's what happened:
- a few days ago I upgraded from 5.2 and 5.4, respectively, to 5.5.13;
the system had already run for one day without issues before it
suddenly froze; Magic SysRq wasn't working and I had to power off
- I then booted from a rescue USB stick with some kernel 5.4 and btrfs
tools 5.4.1
- did a --mode=normal fsck of the fs: no errors!
- then I did a --clear-space-cache v1
(every now and then I see some free-space warnings in the kernel log,
so I clear the cache from time to time when I have the filesystem
offline; a rough sketch of these commands follows this list)
- unfortunately I didn't do another fsck directly afterwards... if I
had (and had already seen errors by then), we'd now know for sure
there must be some bug
- then I rebooted into the normal system, where it (i.e. the root fs)
failed to mount
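For reference, the checks from the rescue stick would look roughly
like this, with /dev/whatever standing in for the actual device (note
that btrfs-progs calls the default check mode "original"):

btrfs check --mode=original /dev/whatever
btrfs check --clear-space-cache v1 /dev/whatever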
So I could understand something having gotten damaged right at the
freeze, but the fsck right after it seemed fine...
Any ideas?
This was just a corruption of the log tree, so it won't affect your actual data
thankfully.
As for how this happened, well we had a very long-standing problem that I fixed
in 5.4 where we could mistakenly update the tree log with the wrong block and
thus get transid mismatches. But if this happened while on a 5.5 kernel then I
don't know what went wrong. I'll go poke around and see if there's any other
related ways we could make the same mistake.
2) What's the tree log doing? Is it kind of like a journal? And is
basically everything that was in it and not yet fully committed to the
fs now lost?
It's the fsync log, so if you fsync something between transaction commits (every
30 seconds) then that's where the data goes. So assume you lost anything in the
last 30 seconds of the life of the fs.
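To illustrate, an explicit fsync between two commits is exactly what
lands in the log tree; something like this would exercise that path
(xfs_io here is just an example tool, any write plus fsync does it):

xfs_io -f -c "pwrite 0 4k" -c fsync /mnt/somefile

On a crash right after the fsync returns, replaying the log at the
next mount is what would make that write durable; zero-log simply
throws that replay away.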
3) Based on the generations (I assume 1453260 and 1452480 are generation
numbers?), can one tell how much data was lost, i.e. roughly what time
span it covers?
parent transid verify failed on 425230336 wanted 1453260 found 1452480
And can one tell what is pointed to by 425230336?
No way to know really, like I said it's just the fsync log, so you likely didn't
lose anything you care about.
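If you're curious what sits at that logical address, dump-tree can
print a single block (assuming it's still readable at all):

btrfs inspect-internal dump-tree -b 425230336 /dev/whatever

Since that block was presumably only referenced via the now-zeroed
log, there's nothing actionable in it either way.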
4) The open_ctree failed error one saw on the screenshot... was this
then just a follow-up error from the failure to replay the log?
Yup, can't replay the log, open_ctree fails.
5) Was some backup superblock used now, and thus some further
data/metadata lost?
Nope, we just told it to ignore the log, everything before is all fine.
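You can verify that yourself if you like: dump-super prints the
superblock fields, and after zero-log the log_root field should simply
read 0, e.g.:

btrfs inspect-internal dump-super /dev/whatever | grep -i log_root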
And most importantly:
6) Can one now expect that everything which is there/visible is still
valid? Or could there be any file-internal corruption (and is that
likely or not)?
Nothing should be corrupt, everything should be a-ok.
I mean this is what I'd more or less expect from a CoW fs... if it
crashes some data might be gone, but what's still there is 100% valid?
7) Am I advised to re-create the filesystem? Like, could there still be
any hidden errors that fsck doesn't see and that sooner or later build
up and make it explode again?
Or is the whole thing just a minor issue with a well-known/understood
clean-up procedure after a freeze like this?
This is probably the safest form of failure, I wouldn't expect anything else to
be wrong.
Setting it up again (with the recovery) would just be work (now that I
can access the data again)... so if it's advisable, I'd rather go for
that.
8) Any other checks I could/should make, like scrub?
You can if you like, but as I've said, the core of the file system remains
intact.
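If you want the extra assurance anyway, a foreground scrub of the
mounted filesystem would be something like (the mount point is just an
example):

btrfs scrub start -Bd /mnt/point

which re-verifies the checksums of all data and metadata.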
Thanks,

Josef