Re: Kernel Dump scanning directory

On Fri, May 08, 2015 at 04:49:19PM -0500, Anthony Plack wrote:
> 
> > On May 8, 2015, at 4:18 PM, Hugo Mills <hugo@xxxxxxxxxxxxx> wrote:
> > 
> > On Fri, May 08, 2015 at 11:44:39AM -0500, Anthony Plack wrote:
> >> Once again btrfsck --repair /dev/sdm ended with
> >> 
> >> parent transid verify failed on 94238015488 wanted 150690 found 150691
> >> Ignoring transid failure
> >> 
> >> There was no attempt to actually repair the volume, and no indication from the tools why.
> > 
> >   A transid failure means that the superblock has been written out to
> > disk *before* a part of the metadata that forms that transaction, and
> > then the machine has crashed in some way that prevented the
> > late-arriving metadata from hitting the disk. There are two ways that
> > this can happen: it's a bug in the kernel, or the hardware lies about
> > having written data. Both are possible, but the former is more likely.
> 
> Also good to know.  I agree it's likely a bug in the kernel.  I think you just hit on the issue here.  Since this is CoW, we should be able to assume that the superblock is then still "empty."  Or do I misunderstand?

   No, if the superblock didn't have anything in it, the FS wouldn't
be mountable.

   The CoW design works by making all writes to currently-unused
locations. At some point (a transaction boundary, every 30 seconds by
default), the FS ensures that everything written to date is consistent
-- that everything has been flushed to permanent storage -- and only
when all of that data has hit the disk does it write out the updated
superblock(s) that point to the new data structures. Since superblock
writes are atomic, the FS remains consistent at all times: either the
superblock refers to the old data or the new data, but cannot refer to
any partial state in between.

   It doesn't matter what order the data is written out in, *provided*
that there's a guarantee that it's *all* written out before the
superblock is sent.
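
   To make that ordering concrete, here's a minimal sketch in C of
the commit protocol as described above. The names and structures are
invented for illustration -- this is not btrfs's actual code, which
also has to deal with multiple superblock copies, checksums and
write barriers:

    #include <stdint.h>
    #include <unistd.h>

    /* Invented stand-in for a superblock: the block number of the
       root of the current metadata tree, plus a generation
       (transid) that increments on every commit. */
    struct superblock {
        uint64_t root_block;
        uint64_t generation;
    };

    /* Commit one transaction.  All the new metadata has already
       been written to *unused* blocks, so the old tree is still
       intact on disk. */
    int commit_transaction(int fd, struct superblock *sb,
                           uint64_t new_root_block)
    {
        /* Step 1: flush, so every block of the new tree is on
           stable storage.  Nothing so far changes what a reader
           of the *old* superblock sees. */
        if (fsync(fd) != 0)
            return -1;

        /* Step 2: only now update and write the superblock.  A
           single-sector write is atomic, so after a crash we see
           either the old root or the new one, never a mixture. */
        sb->root_block = new_root_block;
        sb->generation++;
        if (pwrite(fd, sb, sizeof(*sb), 0) != (ssize_t)sizeof(*sb))
            return -1;

        /* Step 3: flush again, so the superblock itself is on
           stable storage before the commit is considered done. */
        return fsync(fd);
    }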

   Now, where things go wrong, resulting in a transid failure, is
when the metadata structures hit the disk *after* the superblock is
written, and the machine then crashes partway through the sequence
of writes. The intended design of the FS is such that
this should never happen -- however, there's a class of bug that means
it *can* happen, which is probably what's gone on here. I don't know
enough about the block layer to be able to characterise in more detail
what the problem looks like.
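
   For what it's worth, the check that produces the message you
quoted is conceptually very simple. Roughly (again with invented
names, not the real kernel code):

    #include <stdint.h>
    #include <stdio.h>

    /* Every tree block records the generation (transid) in which
       it was written, and a parent block records the generation it
       expects its child to carry.  A mismatch means the parent and
       child come from different transactions -- i.e. part of some
       transaction never reached the disk. */
    int verify_parent_transid(uint64_t bytenr, uint64_t wanted,
                              uint64_t found)
    {
        if (found == wanted)
            return 0;
        fprintf(stderr, "parent transid verify failed on %llu "
                        "wanted %llu found %llu\n",
                (unsigned long long)bytenr,
                (unsigned long long)wanted,
                (unsigned long long)found);
        return -1;
    }

   In your output, the parent wanted generation 150690 but the block
on disk carries 150691: the two structures come from different
transactions, which is exactly the inconsistency described above.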

> >   Once this failure has happened, the FS is corrupt in a way that's
> > hard to repair reliably. I did raise this question with Chris a while
> > ago, and my understanding from the conversation was that he didn't
> > think that it was possible to fix transid failures in btrfsck.
> 
> I guess I don't understand how that could be designed that way.
> 
> >   Once again: zeroing the log won't help. It doesn't fix everything.
> > In fact, it rarely fixes anything.
> 
> >   The reason there's no documentation on fixing transid failures is
> > because there's no good fix for them.
> 
> Then what is the point of the transactions?  Why do we care about
> transid mismatch?  Why keep something, that if it fails, breaks the
> whole thing?

   A transaction is simply what you get when you write out a new
superblock. It's an atomic change to the FS state. In an ideal world,
the transactions would be fine-grained, and would match the POSIX
filesystem semantics. In practice, the overhead of CoWing all the
metadata for a change at that scale is (relatively) huge. As a result,
the FS collects together larger chunks of writes into a single
transaction.

   Now, there's a further difficulty, which is that if you make
transactions long-lived, you have two choices about the FS behaviour
towards applications: either you wait until the end of the
transaction and then report success to the writer, in which case
performance is appalling; or you report to the application that the
write succeeded as soon as you've accepted it.

   Btrfs uses the latter method for performance reasons. But... if you
do that and the machine crashes before the end of the transaction, the
FS state is still the old one, so the app thinks it's written data
successfully to disk, but there's no record of it in the FS. The log
tree fixes _this_ problem by keeping a minimal record of the writes
made so far in the current transaction, so that when the FS resumes
after a crash or power loss, the log can be examined and used to find
the outstanding writes that have been confirmed to userspace. From
that, the partial transaction can be completed.
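
   In miniature, the log tree behaves like a write-ahead log. A
minimal sketch of the idea (an invented record format, nothing like
the real on-disk layout):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One record per write that was confirmed to userspace but is
       not yet covered by a committed transaction. */
    struct log_record {
        uint64_t inode;
        uint64_t offset;
        uint64_t length;
    };

    /* On mount after a crash: the main trees reflect the last
       *committed* transaction; the log describes the confirmed
       writes made since then.  Replaying them completes the
       partial transaction, after which the log can be discarded. */
    void replay_log(const struct log_record *log, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            printf("redo: inode %llu, %llu bytes at offset %llu\n",
                   (unsigned long long)log[i].inode,
                   (unsigned long long)log[i].length,
                   (unsigned long long)log[i].offset);
    }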

   In the past, there have been a few bugs where the log tree has been
written out with bad data in it, and this has led to non-mountable
filesystems (because it tries replaying the log tree and crashes).
This is the *only* case where btrfs-zero-log helps.

   I hope all that helps explain what all the moving parts are for,
and why they're there.

   Hugo.

-- 
Hugo Mills             | To an Englishman, 100 miles is a long way; to an
hugo@... carfax.org.uk | American, 100 years is a long time.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                        Earle Hitchner
