On Fri, 19 Dec 2008, Chris Mason wrote:
> On Fri, 2008-12-19 at 10:48 -0800, Sage Weil wrote:
> > On Fri, 19 Dec 2008, Chris Mason wrote:
> > > On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:
> > > > On Fri, 19 Dec 2008, Yan Zheng wrote:
> > > > > > I noticed some data and metadata getting out of sync on disk, despite
> > > > > > wrapping my writes with btrfs transactions. After digging into it a bit,
> > > > > > it appears to be a larger problem with inode size/data getting written
> > > > > > during a regular commit.
> > > > > > [...]
> > > > >
> > > > > This is the desired behaviour of data=ordered. Btrfs transaction commit
> > > > > doesn't flush data, and metadata won't get updated until data IO completes.
> > > > >
> > > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code
> > > >
> > > > Ah, right, so it is.
> > > >
> > > > I think what I'm looking for then is a mount mode to get the old behavior,
> > > > such that each commit flushes previously written data. Probably a call to
> > > > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something
> > > > along those lines...
> > >
> > > Could you describe the end goal a bit? I'm happy to make modes where
> > > it'll do what you need.
> >
> > The end goal is for data to flush and commit with the transaction that was
> > running when the write() occurred.
> >
> > So, after a sequence like
> >
> >     write A
> >     setxattr B
> >     <crash>
> >
> > you should always see A if you see B.
> >
> > And after a sequence like
> >
> >     ioctl(fd, BTRFS_IOC_TRANS_START)
> >     write A
> >     setxattr B
> >     close(fd)
> >     <crash>
> >
> > you should see either both A and B or neither A nor B.
> >
> > fsync() isn't really appropriate since it forces a commit (or a tree log
> > entry?), and it would still be better to roll lots of operations up
> > together.
> > Either a mount mode that includes dirty data in each
> > transaction commit (and probably disables the tree log?), or a per-file
> > fsync-like operation that commits an individual file's dirty data to the
> > running transaction would do the trick.
>
> A third option is a different type of xattr operation that doesn't go to
> disk until the metadata updates are done at IO end time.
>
> From a performance point of view, it'll be much faster than slowing down
> commit with data writes.
>
> Can that work for you?

I suspect not, since multiple files are involved. It's usually something
like

    write A
    setxattr A
    write B
    setxattr C

and all need to be committed atomically. The model really is a bundle of
arbitrary operations that commit atomically.

Slower commit times aren't as much of a concern because this is on the
storage backend, behind client caches and so forth. I think it's a
reasonable price to pay for the stronger consistency.

Hopefully it's not throwing too big a wrench into the data=ordered
machinery? It sort of looks like this is already what you get when taking
a snapshot (I see the call to wait_ordered_extents in commit_transaction
when snaps_pending).

sage
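
To make the "bundle of arbitrary operations that commit atomically" model
concrete, here is a toy in-memory sketch in C. It is only an illustration
of the semantics being asked for, not btrfs code: struct store and the
store_apply(), store_commit(), store_crash() and store_sees() helpers are
hypothetical names invented for this example, and "durable" simply stands
in for whatever would be on disk after a crash.

```c
/* Toy model of an atomically committed bundle of operations.
 * Illustrative only: none of these names exist in btrfs. */
#include <assert.h>
#include <stdbool.h>

#define MAX_OPS 16

enum op_kind { OP_WRITE, OP_SETXATTR };

struct op {
    enum op_kind kind;
    char object;                 /* hypothetical object name, e.g. 'A' */
};

struct store {
    struct op pending[MAX_OPS];  /* in the running bundle, lost on crash */
    int npending;
    struct op durable[MAX_OPS];  /* committed, survives a crash */
    int ndurable;
};

/* Record an operation in the running bundle. Returns false if it is full. */
bool store_apply(struct store *s, enum op_kind kind, char object)
{
    if (s->npending == MAX_OPS)
        return false;
    s->pending[s->npending].kind = kind;
    s->pending[s->npending].object = object;
    s->npending++;
    return true;
}

/* Commit: every pending operation becomes durable together.
 * Returns false and changes nothing if there is not enough room. */
bool store_commit(struct store *s)
{
    if (s->ndurable + s->npending > MAX_OPS)
        return false;
    for (int i = 0; i < s->npending; i++)
        s->durable[s->ndurable++] = s->pending[i];
    s->npending = 0;
    return true;
}

/* Crash: everything that was not committed is discarded. */
void store_crash(struct store *s)
{
    s->npending = 0;
}

/* After recovery, is the given operation visible? */
bool store_sees(const struct store *s, enum op_kind kind, char object)
{
    for (int i = 0; i < s->ndurable; i++)
        if (s->durable[i].kind == kind && s->durable[i].object == object)
            return true;
    return false;
}
```

In this model, applying write A, setxattr A, write B and setxattr C and
then calling store_crash() leaves none of the four visible, while calling
store_commit() before the crash leaves all four visible. There is no state
in which only some of them are seen, which is the property wanted from the
real commit path.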
