Hi Zygo, Thanks for the reply. Here are few responses about btrfs behavior that you had questions about in your email. On Thu, Apr 19, 2018 at 4:41 PM, Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote: > > On Mon, Apr 16, 2018 at 09:35:24AM -0500, Jayashree Mohan wrote: > > Hi, > > > > The following seems to be a crash consistency bug on btrfs, where in > > the link count is not persisted even after a fsync on the original > > file. > > > > Consider the following workload : > > creat foo > > link (foo, A/bar) > > fsync(foo) > > ---Crash--- > > > > Now, on recovery we expect the metadata of foo to be persisted i.e > > have a link count of 2. However in btrfs, the link count is 1 and file > > A/bar is not persisted. The expected behaviour would be to persist the > > dependencies of inode foo. That is to say, shouldn't fsync of foo > > persist A/bar and correctly update the link count? > > Those dependencies are backward. foo's inode doesn't depend on anything > but the data in the file foo, and foo's inode itself. > > "foo" and "A/bar" are dirents that both depend on the inode of foo, which > implies that "A" and "." must be updated atomically with foo's inode. > If you had called fsync(A) then we'd expect A/bar to exist and the inode > to have a link count of 2. If you'd called fsync(.) then...well, you > didn't modify "." at all, so I guess either outcome is valid as long as > the inode link count matches the number of dirents referencing the inode. > > But then...why does foo exist at all? I'd expect at least some tests > would end without foo on disk either, since all that was fsync()ed was the > foo inode, not the foo dirent in the directory '.'. Does btrfs combine > creating foo and updating foo's inode into a single atomic operation? > I vaguely recall that it does exactly that, in order to solve a bug > some years ago. What happens if you add a rename, e.g. > > unlink foo2 # make sure foo2 doesn't exist > creat foo > rename(foo, foo2) > link(foo2, A/bar) > fsync(foo2) > > Do you get foo or foo2? I'd expect foo since you didn't fsync '.', > but maybe rename implies flush and you get foo2. This is the workload we tried: creat foo rename(foo, foo2) mkdir A link(foo2, A/bar) fsync(foo2) Btrfs persists foo2 here, and A/bar is not persisted. Also, the link count of foo2 is 1, while on the other filesystems, A/bar persist as well and foo2 has a link count 2. I don't think it's the rename here that implies the flush - removing the fsync(foo2) ends up not persisting A/bar. So looks like, its the fsync that forces all dependencies to the disk. As a variant of the above workload, consider the following: mkdir A sync creat foo rename(foo, foo2) link(foo2, A/bar) fsync(foo2) The only difference between this workload and the one above is, where directory A is created and whether its persisted previously or not. In this modified workload, foo2 has a link count of 2 and ends up persisting A/bar. Why is A/bar being persisted here even when we did not explicitly call a fsync on directory A? We don't seem to correctly understand the reason behind these different behaviors. It will be useful if you could provide more insight into this. > That's not to say that fsync is not a rich source of filesystem bugs. > btrfs did once have (and maybe still has?) a bug where renames and fsync > can create a dirent with no inode, e.g. > > loop continuously: > creat foo > write(foo, data) > fsync(foo) > rename(foo, bar) > > and crash somewhere in the middle of the loop, which will create a > dirent "foo" that points to a non-existent inode. > > Removing the "fsync" works around the bug. rename() does a flush anyway, > so the fsync() wasn't needed, but fsync() shouldn't _create_ a filesystem > inconsistency, especially when Googling recommends app developers to > sprinkle fsync()s indiscriminately in their code to prevent their data > from being mangled. > > I haven't been tracking to see if that's fixed yet. I last saw it on > 4.11, but I have been aggressively avoiding fsync with eatmydata for > some years now. > > > Note that ext4, xfs and f2fs recover to the correct link count of 2 > > for the above workload. > > Do those filesystems also work if you remove the fsync? That may be > your answer: they could be flushing the other metadata earlier, before > you call fsync(). >> creat foo >> link (foo, A/bar) > >fsync(foo) >>---Crash--- Actually, they don't seem to work if we remove the fsync. The behavior changes and we neither persist A/bar nor update the link count of foo to 2. So this confirms that fsync is forcing the metadata and not the link() syscall itself right? Thanks, Jayashree -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html
