Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
> Filipe,
> Thanks for the explanation. Those reasons were not so obvious for me.
> 
> Would it make sense not to COW the block in case-1, if we are mounted
> with "notreelog"? Or, perhaps, to check that the block does not belong
> to a log tree?
> 

Hi Alex,

The crc rules are the most important, we have to make sure the block
isn't changed while it is in flight.  Also, think about something like
this:

transaction write block A, puts pointer to it in the btree, generation Y

<hard disk properly completes the IO>

transaction rewrites block A, same generation Y

<hard disk drops the IO on the floor and never does it>

Later on, we try to read block A again.  We find it has the correct crc
and the correct generation number, but the contents are actually wrong.

> The second case is more difficult. One problem is that
> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
> due to memory pressure (this is what I see happening), we complete the
> writeback, release the extent buffer, and pages are evicted from the
> page cache of btree_inode. After some time we read the block again
> (because we want to modify it in the same transaction), but its header
> is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
> this point it should be safe to avoid COW, we will re-COW.
> 
> Would it make sense to have some runtime-only mechanism to lock-out
> the write-back for an eb? I.e., if we know that eb is not under
> writeback, and writeback is locked out from starting, we can redirty
> the block without COW. Then we allow the writeback to start when it
> wants to.
> 
> In one of my test runs, btrfs had 6.4GB of metadata (before
> raid-induced overhead), but during a particular transaction total of
> 10GB of metadata (again, before raid-induced overhead) was written to
> disk. (Thisis  total of all ebs having
> header->generation==curr_transid, not only during commit of the
> transaction). This particular run was with "notreelog".
> 
> Machine had 8GB of RAM. Linux allows the btree_inode to grow its
> page-cache upto ~6.9GB (judging by btree_inode->i_mapping->nrpages).
> But even though the used amount of metadata is less than that, this
> re-COW'ing of already-COW'ed blocks seems to cause page-cache
> trashing...

Interesting.  We've addressed this in the past with changes to the
writepage(s) callback for the btree, basically skipping memory pressure
related writeback if there isn't that much dirty.  There is a lot of
room to improve those decisions, like preferring to write leaves over
nodes, especially full leaves that are not likely to change again.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Filesystem Development]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux