On Tue, Mar 12, 2019 at 03:18:47PM +0200, Nikolay Borisov wrote: > If a an eb fails to be read for whatever reason - it's corrupted on disk > and parent transid/key validations fail or IO for eb pages fail then > this buffer must be removed from the buffer cache. Currently the code > calls free_extent_buffer if an error occurs. Unfortunately this doesn't > achieve the desired behavior since btrfs_find_create_tree_block returns > with eb->refs == 2. On the other hand free_extent_buffer will only > decrement the refs once leavin it added to the buffer cache radix tree. > This enables later code to look up the buffer from the cache and utilize > it potentially leading to a crash. > > The correct way to free the buffer is call free_extent_buffer_stale. > This function will correctly call atomic_dec explicitly for the buffer > and subsequently call release_extent_buffer which will decrement the > final reference thus correctly remove the invalid buffer from buffer > cache. This change affects only newly allocated buffers since they have > eb->refs == 2. Following the same pattern, should every use of btrfs_find_create_tree_block be followed by free_extent_buffer_stale in case of errors? There's almost the same sequence in readahead_tree_block as you update in reada_tree_block_flagged: btrfs_find_create_tree_block read_extent_buffer_pages free_extent_buffer so should be the _stale variant used there too? free_extent_buffer_stale is normally used in other contexts where the 'stale' part means that the eb is known to be disconnected. I think that btrfs_find_create_tree_block needs a dedicated freeing function that will make sure all invariants (refs == 2, no other structures connected) are met and then do effectively the same what _stale does now.
