On Tue, Dec 7, 2010 at 12:15 PM, Chris Mason <chris.mason@xxxxxxxxxx> wrote:
> Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
>> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.mason@xxxxxxxxxx> wrote:
>> > Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
>> >> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.mason@xxxxxxxxxx> wrote:
>> >> > Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
>> >> >> Hi,
>> >> >>
>> >> >> I think that the disk allocation size of each file becomes a monotone increase
>> >> >> when the file is made.
>> >> >> But, it sometimes return to 0.  Is it correct?
>> >> >
>> >> > Well, there's a window during the processing of delayed allocation where
>> >> > we don't have the bytes recorded as delalloc and we don't have the bytes
>> >> > recorded in the inode yet.  That's why they are showing up as zero.
>> >> >
>> >> > We don't call inode_add_bytes() until after we insert the extent, but we
>> >> > drop the delalloc byte count on the file before the IO is done.
>> >> >
>> >> > Fixing it will be a little tricky because all the extent accounting
>> >> > assumes the inode_add_bytes happens at extent insertion time.
>> >> >
>> >>
>> >> How does opening the inode with O_APPEND during this window know where
>> >> to write the bytes?  If it's a pointer/cursor to the EOF then that
>> >> size could be used during the window.  Is that right?
>> >
>> > This counter records the number of blocks allocated to the file, and
>> > reading it with ls -l or stat is somewhat racey by nature.  Most of the
>> > time its fine, btrfs just has a really big window where the results from
>> > ls -l seem wrong.
>> >
>>
>> I see.  Is it using per-cpu vars or something similar?
>

OK, so to make sure I fully understand, I'm going to write some
pseudocode based on your description.

> Our stat function returns the block count in the inode plus the number
> of bytes we have accounted as delayed allocation.
>

stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes

> As we do writes to the file, the delayed allocation count goes up and
> then eventually we decide we need to do some IO.
>
> Before we do the IO, we have to decide where on the disk to write the
> extents.

inode_a2 = inode_a1

inode_a1 and inode_a2 are the same inode, but inode_a2 has a different
list of extents and is not written yet (in the case of appending, most
of the extents will be the same in the two extent lists, but inode_a2
will have more extents for the newly appended data)

> Once that is decided, we decrement the count of delayed
> allocation bytes.
>
> This is when stat starts returning the wrong answer.
>

inode_a2.bytes += inode_a1_delayed_allocation_bytes
inode_a1_delayed_allocation_bytes -= inode_a1_delayed_allocation_bytes
stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes

Is it possible to have stat read from inode_a2 during this window?  So
it would be instead:

stat = inode_a2.bytes

> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata.  This is when stat starts returning the
> right answer again.
>

/* implicit when write completes */
inode_a1 = inode_a2
kfree(inode_a2)
stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes

> The whole setup sounds strange, but this is how btrfs implements the
> semantics from data=ordered.  We don't update the file to point to
> the new blocks until after the IO is done, so we never have to wait on
> the data IO before we can do a transaction commit.  It avoids all kinds
> of latencies with fsync and other problems.
>
> One easy solution is to just add another counter in the in-memory inode
> for the number of bytes in flight that aren't accounted for in other
> places.  But I'd rather not make the inode any bigger, so I'll have to
> think if we can solve this another way.
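
To make the window concrete, here is a minimal sketch in Python of the
accounting as I understand it from the description above.  It is a toy
model, not btrfs code: the class InodeCounters, its methods, and the
in_flight_bytes field are hypothetical names I made up for illustration.
It only shows that "inode bytes + delalloc bytes" drops to zero while
the IO is in flight, and that one extra in-memory counter (the "easy
solution" mentioned above) would keep the reported value steady.

```python
from dataclasses import dataclass


@dataclass
class InodeCounters:
    """Toy model of the byte counters discussed in this thread (not btrfs code)."""

    bytes: int = 0            # bytes recorded in the inode once the extent is inserted
    delalloc_bytes: int = 0   # bytes accounted as delayed allocation
    in_flight_bytes: int = 0  # hypothetical extra counter: allocated, IO not finished

    def stat_current(self) -> int:
        # What stat reports per the description: inode bytes plus delalloc bytes.
        return self.bytes + self.delalloc_bytes

    def stat_with_in_flight(self) -> int:
        # What stat could report if the in-flight bytes were also counted.
        return self.bytes + self.delalloc_bytes + self.in_flight_bytes

    def buffered_write(self, nbytes: int) -> None:
        # A write only grows the delayed allocation count.
        self.delalloc_bytes += nbytes

    def start_io(self) -> int:
        # Disk location decided: the delalloc count is dropped before the IO is done.
        nbytes = self.delalloc_bytes
        self.delalloc_bytes = 0
        self.in_flight_bytes += nbytes
        return nbytes

    def finish_io(self, nbytes: int) -> None:
        # IO done and extent inserted: only now are the bytes added to the inode.
        self.in_flight_bytes -= nbytes
        self.bytes += nbytes


def trace(write_size: int = 4096):
    """Return (step, stat_current, stat_with_in_flight) for one write cycle."""
    inode = InodeCounters()
    samples = []

    def sample(step: str) -> None:
        samples.append((step, inode.stat_current(), inode.stat_with_in_flight()))

    inode.buffered_write(write_size)
    sample("after write")
    nbytes = inode.start_io()
    sample("extents allocated, IO in flight")
    inode.finish_io(nbytes)
    sample("IO done, extent inserted")
    return samples


if __name__ == "__main__":
    for step, current, fixed in trace():
        print(f"{step}: current={current} with_in_flight={fixed}")
```

The middle sample is the window Tsutomu hit: the current formula gives 0
there, while the variant with the extra counter does not.  The cost is
exactly the one Chris points out, one more field in the in-memory inode.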
