Re: The value displayed by 'ls -s' command is strange.

On Tue, Dec 7, 2010 at 12:15 PM, Chris Mason <chris.mason@xxxxxxxxxx> wrote:
> Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
>> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.mason@xxxxxxxxxx> wrote:
>> > Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
>> >> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.mason@xxxxxxxxxx> wrote:
>> >> > Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
>> >> >> Hi,
>> >> >>
>> >> >> I think that the disk allocation size of each file should increase
>> >> >> monotonically while the file is being written.
>> >> >> But it sometimes returns to 0.  Is that correct?
>> >> >
>> >> > Well, there's a window during the processing of delayed allocation where
>> >> > we don't have the bytes recorded as delalloc and we don't have the bytes
>> >> > recorded in the inode yet.  That's why they are showing up as zero.
>> >> >
>> >> > We don't call inode_add_bytes() until after we insert the extent, but we
>> >> > drop the delalloc byte count on the file before the IO is done.
>> >> >
>> >> > Fixing it will be a little tricky because all the extent accounting
>> >> > assumes the inode_add_bytes happens at extent insertion time.
>> >> >
>> >>
>> >> How does opening the inode with O_APPEND during this window know where
>> >> to write the bytes?  If it's a pointer/cursor to the EOF then that
>> >> size could be used during the window.  Is that right?
>> >
>> > This counter records the number of blocks allocated to the file, and
>> > reading it with ls -l or stat is somewhat racy by nature.  Most of the
>> > time it's fine; btrfs just has a really big window where the results from
>> > ls -l seem wrong.
>> >
>>
>> I see.  Is it using per-cpu vars or something similar?
>

Ok, so to make sure I fully understand, I'm going to write some
pseudocode based on your description.

> Our stat function returns the block count in the inode plus the number
> of bytes we have accounted as delayed allocation.
>

stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes
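
For comparing that sum with what userspace actually sees, a small helper
like the one below might be useful.  It is only a sketch: allocated_bytes()
is a name I made up, and it assumes Linux, where st_blocks is reported in
512-byte units.  This is the same field that 'ls -s' displays.

```python
import os


def allocated_bytes(path):
    """Return the allocated size of `path` in bytes, as stat reports it.

    Hypothetical helper for illustration. Assumes Linux, where
    st_blocks counts 512-byte units regardless of the filesystem
    block size.
    """
    return os.stat(path).st_blocks * 512
```

Calling it in a loop while another process appends to the file should
show the value dropping back to 0 during the window, if I understand the
description correctly.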

> As we do writes to the file, the delayed allocation count goes up and
> then eventually we decide we need to do some IO.
>
> Before we do the IO, we have to decide where on the disk to write the
> extents.

inode_a2 = inode_a1

inode_a1 and inode_a2 are the same inode, but inode_a2 has a different
list of extents and is not written yet.  (In the case of appending, most
of the extents will be the same in the two extent lists, but inode_a2
will have more extents for the newly appended data.)

> Once that is decided, we decrement the count of delayed
> allocation bytes.
>
> This is when stat starts returning the wrong answer.
>

inode_a2.bytes += inode_a1_delayed_allocation_bytes
inode_a1_delayed_allocation_bytes -= inode_a1_delayed_allocation_bytes
stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes

Is it possible to have stat read from inode_a2 during this window?

So it would be instead:

stat = inode_a2.bytes

> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata.  This is when stat starts returning the
> right answer again.
>

/* implicit when write completes */
inode_a1 = inode_a2
kfree(inode_a2)
stat = inode_a1.bytes + inode_a1_delayed_allocation_bytes
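
To check my understanding of the whole sequence, here is the same thing
as a runnable toy model.  It is a sketch only: InodeModel, its fields and
trace_new_file() are hypothetical names, this is plain userspace Python
and not btrfs code, and it models a single write to a new file.

```python
class InodeModel:
    """Toy model of the accounting described in this thread."""

    def __init__(self):
        self.bytes = 0           # bytes recorded in the inode
        self.delalloc_bytes = 0  # bytes accounted as delayed allocation

    def stat_bytes(self):
        # stat reports inode bytes plus delayed allocation bytes.
        return self.bytes + self.delalloc_bytes

    def buffered_write(self, n):
        # A write only raises the delayed allocation count.
        self.delalloc_bytes += n

    def start_io(self):
        # The on-disk location is chosen and the delalloc count is
        # dropped before the IO is done.
        pending = self.delalloc_bytes
        self.delalloc_bytes = 0
        return pending

    def finish_io(self, pending):
        # IO is done: the extent is inserted and the bytes are
        # finally recorded in the inode.
        self.bytes += pending


def trace_new_file(n):
    """Return what stat would report after each of the three steps."""
    inode = InodeModel()
    seen = []
    inode.buffered_write(n)
    seen.append(inode.stat_bytes())   # after the write
    pending = inode.start_io()
    seen.append(inode.stat_bytes())   # during the window
    inode.finish_io(pending)
    seen.append(inode.stat_bytes())   # after IO completion
    return seen
```

In this model trace_new_file(4096) gives [4096, 0, 4096], which matches
the "returns to 0" behavior from the original report.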

> The whole setup sounds strange, but this is how btrfs implements the
> semantics from data=ordered.  We don't update the file to point to
> the new blocks until after the IO is done, so we never have to wait on
> the data IO before we can do a transaction commit.  It avoids all kinds
> of latencies with fsync and other problems.
>
> One easy solution is to just add another counter in the in-memory inode
> for the number of bytes in flight that aren't accounted for in other
> places.  But I'd rather not make the inode any bigger, so I'll have to
> think if we can solve this another way.
>
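
If I read the "another counter" idea correctly, it would look roughly
like this in a toy model.  Again this is only a sketch under my own
assumptions: InodeWithInflight, inflight_bytes and trace_with_inflight()
are hypothetical names, not btrfs fields or functions, and the extra
field is exactly the inode growth you said you would rather avoid.

```python
class InodeWithInflight:
    """Toy model with an extra counter for bytes in flight."""

    def __init__(self):
        self.bytes = 0           # bytes recorded in the inode
        self.delalloc_bytes = 0  # bytes accounted as delayed allocation
        self.inflight_bytes = 0  # hypothetical: bytes whose IO is pending

    def stat_bytes(self):
        # stat also counts bytes that are neither delalloc nor inserted.
        return self.bytes + self.delalloc_bytes + self.inflight_bytes

    def buffered_write(self, n):
        self.delalloc_bytes += n

    def start_io(self):
        # Move the bytes from delalloc to in-flight instead of
        # dropping them from every counter.
        pending = self.delalloc_bytes
        self.delalloc_bytes = 0
        self.inflight_bytes += pending
        return pending

    def finish_io(self, pending):
        # Extent inserted: move the bytes from in-flight to the inode.
        self.inflight_bytes -= pending
        self.bytes += pending


def trace_with_inflight(n):
    """Return what stat would report after each of the three steps."""
    inode = InodeWithInflight()
    seen = []
    inode.buffered_write(n)
    seen.append(inode.stat_bytes())
    pending = inode.start_io()
    seen.append(inode.stat_bytes())
    inode.finish_io(pending)
    seen.append(inode.stat_bytes())
    return seen
```

Here the reported value stays at n through all three steps, because the
bytes are always held by exactly one of the three counters.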