On Thu, Jan 09, 2020 at 03:44:44PM +0100, David Sterba wrote:
> On Thu, Jan 09, 2020 at 01:02:10PM +0200, Nikolay Borisov wrote:
> > This log shows how an fs is being unmounted which causes device close
> > routine to be invoked. It sets bdev to NULL but doesn't reset fs_info.
> > Afterwards the fs_info itself is freed from btrfs_kill_super at the same
> > time the device is still anchored at fs_devices list. Subsequently a
> > mount is triggered which sets btrfs_fs_device::bdev to a valid value, yet
> > btrfs_fs_device::fs_info is still stale/freed. Before btrfs_fill_super
> > is called and re-initializes btrfs_fs_device::fs_info a concurrent device
> > scan is triggered, it finds the device with its ->bdev pointer set to
> > valid value and eventually calls btrfs_info_in_rcu in device_list_add
> > which causes the crash.
> >
> > Simply setting btrfs_fs_device::fs_info to NULL prevents the crash but
> > doesn't fix the race. In fact the race cannot be solved because device
> > scan is asynchronous in its nature so it makes no sense to try and
> > synchronize it with pending mounts.
>
> We've had bugs when mount and scan raced, I don't see why you think this
> cannot be fixed and synchronized in this case. At minimum I'd think that
> the device_list_mutex should be enough as it's the one thing that
> excludes mount and scan for the new and removed members of the device
> lists etc.
>
> > Fixes: cc1824fcd334 ("btrfs: reset device back to allocation state when removing")
>
> That's still it misc-next so I'd rather fold it into the patch than to
> have this split fix.
Now folded to "btrfs: reset device back to allocation state when
removing" and pushed.