Hello everyone,

It took me much longer than expected to chase down races in my new data=ordered code, but I think I've finally got it, and have pushed it out to the unstable trees. There are no disk format changes included. I still need to make minor mods to the resizing and balancing code, but I wanted to get this stuff out the door.

In general, I'll call data=ordered any system that prevents seeing stale data on the disk after a crash. Stale data here includes null bytes from areas not yet written when we crashed and the contents of old blocks the filesystem had freed in the past.

The old data=ordered code worked something like this:

file_write:
  * modify pages in page cache
  * set delayed allocation bits
  * update in-memory and on-disk i_size

writepage:
  * collect a large delalloc region
  * allocate new extent
  * drop existing extents from the metadata
  * insert new extent
  * start the page io

transaction commit:
  * write and wait on any dirty file data to finish
  * commit the new btree pointers

The end result was very large latencies during transaction commit because it had to wait on all the file data. An fsync of a single file was forced to write out all the dirty metadata and dirty data on the FS. This is how ext3 works today; xfs does something smarter, and ext4 is moving to something similar to xfs.

With the new code, metadata is not modified in the btree until new extents are fully on disk.
It now looks something like this:

file write (start, len):
  * wait on pending ordered extents for the start, len range
  * modify pages in the page cache
  * set delayed allocation bits
  * update in-memory i_size only

writepage:
  * collect a large delalloc extent
  * reserve an extent on disk in the allocation tree
  * create an ordered extent record
  * start the page io

At IO completion (done in a kthread):
  * find the corresponding ordered extent record
  * if fully written, remove old extents from the tree, add new extents to the tree, update on-disk i_size

At commit time:
  * do only metadata IO

The end result of all of this is lower commit latencies and a smoother system.

-chris
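
For anyone who finds it easier to follow in code, here is a rough single-threaded toy model of the new flow. It is only a sketch and not the kernel code: every name in it (ToyInode, OrderedExtent, io_complete and so on) is made up for illustration, the "wait" in file_write is simulated by finishing the pending IO on the spot, and overlapping old extents are dropped whole rather than split.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class OrderedExtent:
    """Hypothetical record tracking one extent whose data IO is in flight."""
    file_offset: int
    length: int
    disk_start: int
    bytes_left: int = field(init=False)

    def __post_init__(self):
        self.bytes_left = self.length

    def overlaps(self, start, length):
        return start < self.file_offset + self.length and self.file_offset < start + length


class ToyInode:
    """Toy stand-in for an inode; not the real btrfs structures."""

    def __init__(self):
        self.mem_i_size = 0     # in-memory i_size
        self.disk_i_size = 0    # i_size that a commit would write out
        self.delalloc = []      # (start, length) ranges dirty in the page cache
        self.extent_tree = {}   # metadata: file offset -> (disk_start, length)
        self.ordered = []       # pending ordered extents
        self.next_disk = 4096   # trivial bump allocator (assumption)

    def file_write(self, start, length):
        # Stand-in for "wait on pending ordered extents": finish their IO now.
        for oe in [o for o in self.ordered if o.overlaps(start, length)]:
            self.io_complete(oe, oe.bytes_left)
        self.delalloc.append((start, length))
        # Only the in-memory i_size changes here.
        self.mem_i_size = max(self.mem_i_size, start + length)

    def writepage(self):
        # Reserve disk space and create an ordered extent per delalloc range.
        # The extent tree is NOT touched yet.
        started = []
        for start, length in self.delalloc:
            oe = OrderedExtent(start, length, self.next_disk)
            self.next_disk += length
            self.ordered.append(oe)
            started.append(oe)
        self.delalloc.clear()
        return started

    def io_complete(self, oe, nbytes):
        oe.bytes_left -= nbytes
        if oe.bytes_left > 0:
            return False  # not fully written, metadata stays as it was
        # Fully on disk: drop old extents (toy: whole extents), insert the new one.
        end = oe.file_offset + oe.length
        for off in [o for o, (_, ln) in self.extent_tree.items()
                    if o < end and oe.file_offset < o + ln]:
            del self.extent_tree[off]
        self.extent_tree[oe.file_offset] = (oe.disk_start, oe.length)
        self.disk_i_size = max(self.disk_i_size, end)
        self.ordered.remove(oe)
        return True

    def commit(self):
        # Metadata only: no waiting on file data.
        return dict(self.extent_tree), self.disk_i_size


ino = ToyInode()
ino.file_write(0, 8192)
(oe,) = ino.writepage()
before = ino.commit()       # data IO still in flight
ino.io_complete(oe, 8192)
after = ino.commit()        # extent and i_size are now visible
```

The point of the toy is the ordering: a commit taken while data IO is still in flight sees neither the new extent nor the new i_size, so a crash at that moment cannot expose blocks that were never written.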
