On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
> Filipe,
> Thanks for the explanation. Those reasons were not so obvious for me.
>
> Would it make sense not to COW the block in case-1, if we are mounted
> with "notreelog"? Or, perhaps, to check that the block does not belong
> to a log tree?
>

Hi Alex,

The crc rules are the most important: we have to make sure the block
isn't changed while it is in flight. Also, think about something like
this:

transaction writes block A, puts pointer to it in the btree, generation Y

<hard disk properly completes the IO>

transaction rewrites block A, same generation Y

<hard disk drops the IO on the floor and never does it>

Later on, we try to read block A again. We find it has the correct crc
and the correct generation number, but the contents are actually wrong.

> The second case is more difficult. One problem is that
> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
> due to memory pressure (this is what I see happening), we complete the
> writeback, release the extent buffer, and pages are evicted from the
> page cache of btree_inode. After some time we read the block again
> (because we want to modify it in the same transaction), but its header
> is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
> this point it should be safe to avoid COW, we will re-COW.
>
> Would it make sense to have some runtime-only mechanism to lock-out
> the write-back for an eb? I.e., if we know that eb is not under
> writeback, and writeback is locked out from starting, we can redirty
> the block without COW. Then we allow the writeback to start when it
> wants to.
>
> In one of my test runs, btrfs had 6.4GB of metadata (before
> raid-induced overhead), but during a particular transaction total of
> 10GB of metadata (again, before raid-induced overhead) was written to
> disk. (This is total of all ebs having
> header->generation==curr_transid, not only during commit of the
> transaction).
> This particular run was with "notreelog".
>
> Machine had 8GB of RAM. Linux allows the btree_inode to grow its
> page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
> But even though the used amount of metadata is less than that, this
> re-COW'ing of already-COW'ed blocks seems to cause page-cache
> thrashing...

Interesting. We've addressed this in the past with changes to the
writepage(s) callback for the btree, basically skipping memory pressure
related writeback if there isn't that much dirty metadata. There is a
lot of room to improve those decisions, like preferring to write leaves
over nodes, especially full leaves that are not likely to change again.

-chris
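
The lost-write scenario above (block A rewritten in place within the
same generation Y, with the second IO dropped) can be illustrated with
a toy model. This is only a sketch under stated assumptions, not btrfs
code: the Block, Disk, make_block and verify names are hypothetical,
zlib.crc32 stands in for the crc32c that btrfs uses, and the contrast
case assumes the COW target location previously held a block from an
older generation.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a metadata block: header fields plus payload."""
    generation: int
    payload: bytes
    crc: int


def make_block(generation: int, payload: bytes) -> Block:
    # The checksum covers the generation and the payload (toy layout).
    crc = zlib.crc32(generation.to_bytes(8, "little") + payload)
    return Block(generation, payload, crc)


def verify(block: Block, expected_generation: int) -> bool:
    """The two checks a reader can make: crc matches, generation matches the pointer."""
    data = block.generation.to_bytes(8, "little") + block.payload
    crc_ok = zlib.crc32(data) == block.crc
    return crc_ok and block.generation == expected_generation


class Disk:
    """Toy disk that can silently drop a write."""

    def __init__(self):
        self.blocks = {}

    def write(self, addr: str, block: Block, dropped: bool = False) -> None:
        if not dropped:
            self.blocks[addr] = block

    def read(self, addr: str) -> Block:
        return self.blocks[addr]


def rewrite_in_place_scenario():
    disk = Disk()
    gen = 7  # hypothetical value for "generation Y"
    disk.write("A", make_block(gen, b"old contents"))                 # IO completes
    disk.write("A", make_block(gen, b"new contents"), dropped=True)   # IO is lost
    on_disk = disk.read("A")
    # Both checks pass, yet the payload is the stale one.
    return verify(on_disk, gen), on_disk.payload


def cow_scenario():
    disk = Disk()
    gen = 7
    # Assumption: location B held a block from an older generation.
    disk.blocks["B"] = make_block(3, b"stale block")
    disk.write("A", make_block(gen, b"old contents"))
    # COW: the new version goes to B and the parent would point at B.
    disk.write("B", make_block(gen, b"new contents"), dropped=True)   # IO is lost
    # The stale block at B fails the generation check, so the loss is detectable.
    return verify(disk.read("B"), gen)
```

In this model, rewrite_in_place_scenario() reports a block that passes
both checks while still holding the old payload, whereas cow_scenario()
fails verification because the block found at the new location carries
the wrong generation.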
