On 13.03.19 г. 20:03 ч., David Sterba wrote: > On Tue, Mar 12, 2019 at 03:18:47PM +0200, Nikolay Borisov wrote: >> If a an eb fails to be read for whatever reason - it's corrupted on disk >> and parent transid/key validations fail or IO for eb pages fail then >> this buffer must be removed from the buffer cache. Currently the code >> calls free_extent_buffer if an error occurs. Unfortunately this doesn't >> achieve the desired behavior since btrfs_find_create_tree_block returns >> with eb->refs == 2. On the other hand free_extent_buffer will only >> decrement the refs once leavin it added to the buffer cache radix tree. >> This enables later code to look up the buffer from the cache and utilize >> it potentially leading to a crash. >> >> The correct way to free the buffer is call free_extent_buffer_stale. >> This function will correctly call atomic_dec explicitly for the buffer >> and subsequently call release_extent_buffer which will decrement the >> final reference thus correctly remove the invalid buffer from buffer >> cache. This change affects only newly allocated buffers since they have >> eb->refs == 2. > > Following the same pattern, should every use of > btrfs_find_create_tree_block be followed by free_extent_buffer_stale in > case of errors? > > There's almost the same sequence in readahead_tree_block as you update > in reada_tree_block_flagged: > > btrfs_find_create_tree_block > read_extent_buffer_pages > free_extent_buffer > > so should be the _stale variant used there too? You are right it should also be using the _stale variant. I will fix it. Despite my cleanup of the extent buffer freeing code I'm still not entirely happy with the special cases we have with the stale or unmapped logic. That's why we need all these special cases ;\ > > free_extent_buffer_stale is normally used in other contexts where the > 'stale' part means that the eb is known to be disconnected. > > I think that btrfs_find_create_tree_block needs a dedicated freeing > function that will make sure all invariants (refs == 2, no other > structures connected) are met and then do effectively the same what > _stale does now. Sounds good though I'm a bit reluctant to add yet another free_extent_buffer variant. IMO the logic of the extent buffer radix tree need to be revised which could lead to simplifications in the whole ref counting code. >
