On Mon, Sep 30, 2019 at 11:25 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
>
> We've historically had reports of being unable to mount file systems
> because the tree log root couldn't be read. Usually this is the "parent
> transid failure", but could be any of the related errors, including
> "fsid mismatch" or "bad tree block", depending on which block got
> allocated.
>
> The modification of the individual log root items are serialized on the
> per-log root root_mutex. This means that any modification to the
> per-subvol log root_item is completely protected.
>
> However we update the root item in the log root tree outside of the log
> root tree log_mutex. We do this in order to allow multiple subvolumes
> to be updated in each log transaction.
>
> This is problematic however because when we are writing the log root
> tree out we update the super block with the _current_ log root node
> information. Since these two operations happen independently of each
> other, you can end up updating the log root tree in between writing out
> the dirty blocks and setting the super block to point at the current
> root.
>
> This means we'll point at the new root node that hasn't been written
> out, instead of the one we should be pointing at. Thus whatever garbage
> or old block we end up pointing at complains when we mount the file
> system later and try to replay the log.
>
> Fix this by copying the log's root item into a local root item copy.
> Then once we're safely under the log_root_tree->log_mutex we update the
> root item in the log_root_tree. This way we do not modify the
> log_root_tree while we're committing it, fixing the problem.
>
> cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> Reviewed-by: Chris Mason <clm@xxxxxx>
Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx>
Looks good to me, great catch!
Thanks.
> ---
> fs/btrfs/tree-log.c | 36 +++++++++++++++++++++++++++---------
> 1 file changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 7cac09a6f007..1d7f22951ef2 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2908,7 +2908,8 @@ static int walk_log_tree(struct btrfs_trans_handle *trans,
> * in the tree of log roots
> */
> static int update_log_root(struct btrfs_trans_handle *trans,
> - struct btrfs_root *log)
> + struct btrfs_root *log,
> + struct btrfs_root_item *root_item)
> {
> struct btrfs_fs_info *fs_info = log->fs_info;
> int ret;
> @@ -2916,10 +2917,10 @@ static int update_log_root(struct btrfs_trans_handle *trans,
> if (log->log_transid == 1) {
> /* insert root item on the first sync */
> ret = btrfs_insert_root(trans, fs_info->log_root_tree,
> - &log->root_key, &log->root_item);
> + &log->root_key, root_item);
> } else {
> ret = btrfs_update_root(trans, fs_info->log_root_tree,
> - &log->root_key, &log->root_item);
> + &log->root_key, root_item);
> }
> return ret;
> }
> @@ -3017,6 +3018,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> struct btrfs_fs_info *fs_info = root->fs_info;
> struct btrfs_root *log = root->log_root;
> struct btrfs_root *log_root_tree = fs_info->log_root_tree;
> + struct btrfs_root_item new_root_item;
> int log_transid = 0;
> struct btrfs_log_ctx root_log_ctx;
> struct blk_plug plug;
> @@ -3080,17 +3082,25 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> goto out;
> }
>
> + /*
> + * We _must_ update under the root->log_mutex in order to make sure we
> + * have a consistent view of the log root we are trying to commit at
> + * this moment.
> + *
> + * We _must_ copy this into a local copy, because we are not holding the
> + * log_root_tree->log_mutex yet. This is important because when we
> + * commit the log_root_tree we must have a consistent view of the
> + * log_root_tree when we update the super block to point at the
> + * log_root_tree bytenr. If we update the log_root_tree here we'll race
> + * with the commit and possibly point at the new block which we may not
> + * have written out.
> + */
> btrfs_set_root_node(&log->root_item, log->node);
> + memcpy(&new_root_item, &log->root_item, sizeof(new_root_item));
>
> root->log_transid++;
> log->log_transid = root->log_transid;
> root->log_start_pid = 0;
> - /*
> - * Update or create log root item under the root's log_mutex to prevent
> - * races with concurrent log syncs that can lead to failure to update
> - * log root item because it was not created yet.
> - */
> - ret = update_log_root(trans, log);
> /*
> * IO has been started, blocks of the log tree have WRITTEN flag set
> * in their headers. new modifications of the log will be written to
> @@ -3111,6 +3121,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> mutex_unlock(&log_root_tree->log_mutex);
>
> mutex_lock(&log_root_tree->log_mutex);
> +
> + /*
> + * Now we are safe to update the log_root_tree because we're under the
> + * log_mutex, and we're a current writer so we're holding the commit
> + * open until we drop the log_mutex.
> + */
> + ret = update_log_root(trans, log, &new_root_item);
> +
> if (atomic_dec_and_test(&log_root_tree->log_writers)) {
> /* atomic_dec_and_test implies a barrier */
> cond_wake_up_nomb(&log_root_tree->log_writer_wait);
> --
> 2.21.0
>
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”