On Fri, May 08, 2015 at 04:49:19PM -0500, Anthony Plack wrote:
> 
> > On May 8, 2015, at 4:18 PM, Hugo Mills <hugo@xxxxxxxxxxxxx> wrote:
> > 
> > On Fri, May 08, 2015 at 11:44:39AM -0500, Anthony Plack wrote:
> >> Once again btrfsck --repair /dev/sdm ended with
> >> 
> >> parent transid verify failed on 94238015488 wanted 150690 found 150691
> >> Ignoring transid failure
> >> 
> >> no attempt to actually repair the volume. No indication from the tools why.
> > 
> >    A transid failure means that the superblock has been written out to
> > disk *before* a part of the metadata that forms that transaction, and
> > then the machine has crashed in some way that prevented the
> > late-arriving metadata from hitting the disk. There are two ways that
> > this can happen: it's a bug in the kernel, or the hardware lies about
> > having written data. Both are possible, but the former is more likely.
> 
> Also good to know. I agree to the bug in the kernel. I think you just
> hit on the issue here. Since this is COW, we should be able to assume
> that the superblock is then still "empty." Or do I misunderstand?

   No, if the superblock didn't have anything in it, the FS wouldn't
be mountable.

   The CoW design works by making all writes to currently-unused
locations. At some point (a transaction boundary, every 30 seconds by
default), the FS ensures that everything written to date is consistent
-- that everything has been flushed to permanent storage -- and only
when all of that data has hit the disk does it write out the updated
superblock(s) that point to the new data structures. Since superblock
writes are atomic, the FS remains consistent at all times: either the
superblock refers to the old data or the new data, but cannot refer
to any partial state in between. It doesn't matter what order the
data is written out in, *provided* that there's a guarantee that it's
*all* written out before the superblock is sent.
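   To make the ordering concrete, here's a toy model of that commit
sequence -- it is an illustrative sketch only, not btrfs code, and all
the names (Disk, commit, etc.) are made up for the example. The key
property is that the flush barrier completes before the atomic
superblock update:

```python
# Toy model of a CoW transaction commit. The "disk" is a dict; the
# superblock is a single value updated atomically. All names here are
# illustrative, not taken from the btrfs source.

class Disk:
    def __init__(self):
        self.blocks = {}        # block address -> data (on stable media)
        self.superblock = None  # points at the current tree root
        self.cache = {}         # writes accepted but not yet flushed

    def write(self, addr, data):
        # CoW rule: new writes only ever go to currently-unused locations.
        assert addr not in self.blocks, "CoW: never overwrite a live block"
        self.cache[addr] = data

    def flush(self):
        # Barrier: everything written so far reaches permanent storage.
        self.blocks.update(self.cache)
        self.cache.clear()

    def write_superblock(self, root_addr):
        self.superblock = root_addr  # atomic, single-sector write

def commit(disk, new_root_addr, new_blocks):
    for addr, data in new_blocks.items():
        disk.write(addr, data)
    disk.flush()                          # ALL metadata on disk first...
    disk.write_superblock(new_root_addr)  # ...only then point at it

disk = Disk()
commit(disk, 100, {100: "root v1", 101: "leaf v1"})

# Start a second transaction, then "crash" before its superblock write:
disk.write(200, "root v2")
disk.flush()
# -- crash here: the superblock still points at the old, fully
#    consistent tree v1, so nothing is corrupt.
assert disk.superblock == 100
```

   The unused blocks written by the interrupted transaction are simply
garbage to be reclaimed; the mounted state is the old, consistent one.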
   Now, where things go wrong and result in a transid failure is
where the metadata structures are written or modified *after* the
superblock is written, and then the machine crashes partway through
the sequence of writes. The intended design of the FS is such that
this should never happen -- however, there's a class of bug that means
it *can* happen, which is probably what's gone on here. I don't know
enough about the block layer to be able to characterise in more
detail what the problem looks like.

> >    Once this failure has happened, the FS is corrupt in a way that's
> > hard to repair reliably. I did raise this question with Chris a while
> > ago, and my understanding from the conversation was that he didn't
> > think that it was possible to fix transid failures in btrfsck.
> 
> I guess I don't understand how that could be designed that way.
> 
> >    Once again: zeroing the log won't help. It doesn't fix everything.
> > In fact, it rarely fixes anything.
> 
> >    The reason there's no documentation on fixing transid failures is
> > because there's no good fix for them.
> 
> Then what is the point of the transactions? Why do we care about
> transid mismatch? Why keep something, that if it fails, breaks the
> whole thing?

   A transaction is simply what you get when you write out a new
superblock. It's an atomic change to the FS state. In an ideal world,
the transactions would be fine-grained, and would match the POSIX
filesystem semantics. In practice, the overhead of CoWing all the
metadata for a change at that scale is (relatively) huge. As a result,
the FS collects together larger chunks of writes into a single
transaction.
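   The error message itself comes from a sanity check made when
following a tree pointer: a parent records the generation (transid) in
which the child it points at was written, and the check compares that
against what's actually found on disk. A toy sketch of the idea -- a
simplification for illustration, with made-up structures, not the real
kernel check:

```python
# Toy sketch of a parent transid verification. Each tree block carries
# the generation (transid) in which it was written; a parent pointer
# records the generation it expects its child to have. The dicts here
# are illustrative stand-ins for on-disk structures.

def verify_child(parent_ptr, child_block):
    wanted = parent_ptr["generation"]   # what the parent recorded
    found = child_block["generation"]   # what is actually on disk
    if wanted != found:
        return (f"parent transid verify failed on {parent_ptr['addr']} "
                f"wanted {wanted} found {found}")
    return "ok"

# The superblock/parent from transaction 150690 points at a child, but
# the block found on disk belongs to transaction 150691 -- the
# out-of-order write described above:
ptr = {"addr": 94238015488, "generation": 150690}
child = {"generation": 150691}
print(verify_child(ptr, child))
# prints: parent transid verify failed on 94238015488 wanted 150690 found 150691
```

   This is why the failure is so hard to repair: the check only tells
you the tree on disk is internally inconsistent, not where a good copy
of the missing-generation block might be found.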
   Now, there's a further difficulty, which is that if you make
transactions long-lived, you have two choices about the FS behaviour
towards applications: either you wait until the end of the transaction
and then report success to the writer, in which case performance is
appalling; or you can report to the application that the write
succeeded as soon as you've accepted it. Btrfs uses the latter method
for performance reasons. But... if you do that and the machine crashes
before the end of the transaction, the FS state is still the old one,
so the app thinks it's written data successfully to disk, but there's
no record of it in the FS.

   The log tree fixes _this_ problem by keeping a minimal record of
the writes made so far in the current transaction, so that when the FS
resumes after a crash or power loss, the log can be examined and used
to find the outstanding writes that have been confirmed to userspace.
From that, the partial transaction can be completed.

   In the past, there have been a few bugs where the log tree has been
written out with bad data in it, and this has led to non-mountable
filesystems (because the FS tries replaying the log tree and crashes).
This is the *only* case where btrfs-zero-log helps.

   I hope all that helps explain what all the moving parts are for,
and why they're there.

   Hugo.

-- 
Hugo Mills             | To an Englishman, 100 miles is a long way; to an
hugo@... carfax.org.uk | American, 100 years is a long time.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                    Earle Hitchner
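   The log-tree mechanism described above can be sketched as a toy
model too -- again purely illustrative, with invented names, and
ignoring all the real on-disk detail: writes fsync'd mid-transaction
get a durable log record, and only those are replayed after a crash:

```python
# Toy sketch of the log-tree idea: writes confirmed to userspace
# mid-transaction are also recorded in a small durable log, so a crash
# before the transaction commits doesn't lose them. Illustrative only.

class FS:
    def __init__(self):
        self.committed = {}  # state as of the last transaction commit
        self.pending = {}    # current open transaction (in memory)
        self.log = []        # fsync'd writes, durable on disk

    def write(self, name, data, fsync=False):
        self.pending[name] = data
        if fsync:
            self.log.append((name, data))  # durable record before ack
        return "ok"          # reported to the app immediately

    def commit(self):
        # Transaction boundary: everything becomes durable at once.
        self.committed.update(self.pending)
        self.pending.clear()
        self.log.clear()     # log is no longer needed after a commit

    def crash_and_recover(self):
        self.pending.clear()          # in-memory state is gone
        for name, data in self.log:   # log replay on the next mount
            self.committed[name] = data
        self.log.clear()

fs = FS()
fs.write("a", "synced", fsync=True)
fs.write("b", "unsynced")
fs.crash_and_recover()
assert "a" in fs.committed      # fsync'd write survives via log replay
assert "b" not in fs.committed  # unsynced write is lost, as POSIX allows
```

   (In current btrfs-progs the standalone btrfs-zero-log tool has been
folded into `btrfs rescue zero-log`; the caveat above still applies --
it only helps when log replay itself is what's crashing the mount.)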
