On Thu, 23 Apr 2015 11:28:48 +0100, Filipe Manana wrote:
> There's a race between releasing extent buffers that are flagged as stale
> and recycling them that makes us it the following BUG_ON at
> btrfs_release_extent_buffer_page:
>
> BUG_ON(extent_buffer_under_io(eb))
>
> The BUG_ON is triggered because the extent buffer has the flag
> EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
> dirty by another concurrent task.
Awesome analysis!
> @@ -4768,6 +4768,25 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
> start >> PAGE_CACHE_SHIFT);
> if (eb && atomic_inc_not_zero(&eb->refs)) {
> rcu_read_unlock();
> + /*
> + * Lock our eb's refs_lock to avoid races with
> + * free_extent_buffer. When we get our eb it might be flagged
> + * with EXTENT_BUFFER_STALE and another task running
> + * free_extent_buffer might have seen that flag set,
> + * eb->refs == 2, that the buffer isn't under IO (dirty and
> + * writeback flags not set) and it's still in the tree (flag
> + * EXTENT_BUFFER_TREE_REF set), therefore being in the process
> + * of decrementing the extent buffer's reference count twice.
> + * So here we could race and increment the eb's reference count,
> + * clear its stale flag, mark it as dirty and drop our reference
> + * before the other task finishes executing free_extent_buffer,
> + * which would later result in an attempt to free an extent
> + * buffer that is dirty.
> + */
> + if (test_bit(EXTENT_BUFFER_STALE, &eb->bflags)) {
> + spin_lock(&eb->refs_lock);
> + spin_unlock(&eb->refs_lock);
> + }
> mark_extent_buffer_accessed(eb, NULL);
> return eb;
> }
After staring at this (and the Lovecraftian horrors of free_extent_buffer())
for over an hour and trying to understand how and why this could even remotely
work, I cannot help but think that this fix would shift the race to the much
smaller window between the test_bit and the first spin_lock.
Essentially you subtly phase-shifted all participants and make them avoid the
race most of the time, yet I cannot help but think it's still there (just much
smaller), and could strike again with different scheduling intervals.
Would this be accurate?
-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html