Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas <alex@xxxxxxxxxxxxxxxxx> wrote:
> Greetings,
> Looking at the code of should_cow_block(), I see:
>
> if (btrfs_header_generation(buf) == trans->transid &&
>    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> ...
> So if the extent buffer has been written to disk, and now is changed again
> in the same transaction, we insist on COW'ing it. Can anybody explain why
> COW is needed in this case? The transaction has not committed yet, so what
> is the danger of rewriting to the same location on disk? My understanding
> was that a tree block needs to be COW'ed at most once in the same
> transaction. But I see that this is not the case.

That logic is there, as far as I can see, for at least two reasons:

1) fsync/log trees. All extent buffers (tree blocks) of a log tree
have the same transaction id/generation, and you can have multiple
fsyncs (log transaction commits) per transaction, so you need to
ensure consistency. If we skipped the COW in the example below, you
would get an inconsistent log tree at log replay time when the fs is
mounted:

transaction N start

   fsync inode A start
   creates tree block X
   flush X to disk
   write a new superblock
   fsync inode A end

   fsync inode B start
   skip COW of X because its generation == current transaction id,
   and modify it in place
   flush X to disk

========== crash ===========

   write a new superblock
   fsync inode B end

transaction N commit
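
To make that decision concrete, here is a minimal userspace sketch of
the check and of the two fsyncs above (a toy model only, not kernel
code; struct toy_eb, toy_should_cow() and the values used are made-up
names purely for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for an extent buffer header. */
struct toy_eb {
        unsigned long long generation;  /* transaction that last COWed it */
        bool written;                   /* BTRFS_HEADER_FLAG_WRITTEN analogue */
};

/* COW unless the block was created in the current transaction AND has
 * not been handed to writeback yet. */
static bool toy_should_cow(const struct toy_eb *eb,
                           unsigned long long transid)
{
        return !(eb->generation == transid && !eb->written);
}

int main(void)
{
        unsigned long long transid = 100;       /* transaction N */
        struct toy_eb x = { .generation = 100, .written = false };

        /* fsync of inode A: X was created in this transaction and is
         * not on disk yet, so it can be modified in place. */
        printf("fsync A: COW X? %s\n",
               toy_should_cow(&x, transid) ? "yes" : "no");
        x.written = true;       /* X flushed for the first log commit */

        /* fsync of inode B: X is already on disk and part of the log
         * state the previous superblock refers to, so it must be COWed. */
        printf("fsync B: COW X? %s\n",
               toy_should_cow(&x, transid) ? "yes" : "no");
        return 0;
}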

2) The BTRFS_HEADER_FLAG_WRITTEN flag is set not when the block has
been written to disk but when we trigger writeback for it. So while
the writeback is ongoing we want to make sure the block's content
isn't concurrently modified (we don't keep the eb write locked during
the writeback, so that concurrent reads are still allowed).

All tree blocks that don't belong to a log tree are normally written
only at the end of a transaction commit. But often, due to memory
pressure for example, the VM can call the writepages() callback of the
btree inode to force dirty tree blocks to be written to disk before
the transaction commits.
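
Continuing the toy sketch above (again made-up names, not the real
btree writeback path), the point is simply that the flag goes up when
the IO is submitted, not when it completes:

/* Builds on struct toy_eb from the previous sketch. */
static void toy_start_writeback(struct toy_eb *eb)
{
        /* Mark the block before its pages are handed to the block layer.
         * Any modification racing with the (possibly long) IO then sees
         * written == true in toy_should_cow() and takes the COW path
         * instead of scribbling on the buffer being written out. */
        eb->written = true;
        /* the IO is submitted here and may still be in flight on return */
}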

>
> I am asking because I am doing some profiling of btrfs metadata work under
> heavy loads, and I see that sometimes btrfs COWs almost twice as many tree
> blocks as the total metadata size.
>
> Thanks,
> Alex.
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."