Re: [PATCH] Btrfs: fix loading of orphan roots leading to BUG_ON

On Thu, Mar 3, 2016 at 6:29 AM, Qu Wenruo <quwenruo@xxxxxxxxxxxxxx> wrote:
>
>
>  wrote on 2016/03/02 15:49 +0000:
>>
>> From: Filipe Manana <fdmanana@xxxxxxxx>
>>
>> When looking for orphan roots during mount we can end up hitting a
>> BUG_ON() (at root-tree.c:btrfs_find_orphan_roots()) if a log tree is
>> replayed and qgroups are enabled. This is because after a log tree is
>> replayed, a transaction commit is made, which triggers qgroup extent
>> accounting which in turn does backref walking which ends up reading and
>> inserting all roots in the radix tree fs_info->fs_root_radix, including
>> orphan roots (deleted snapshots). So after the log tree is replayed, when
>> finding orphan roots we hit the BUG_ON with the following trace:
>>
>> [118209.182438] ------------[ cut here ]------------
>> [118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
>> [118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
>> [118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G        W 4.5.0-rc5-btrfs-next-24+ #1
>> [118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
>> [118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
>> [118209.186318] RIP: 0010:[<ffffffffa04237d7>]  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
>> [118209.186318] RSP: 0018:ffff8800af34faa8  EFLAGS: 00010246
>> [118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
>> [118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
>> [118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
>> [118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
>> [118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
>> [118209.186318] FS:  00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
>> [118209.186318] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
>> [118209.186318] Stack:
>> [118209.186318]  fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
>> [118209.186318]  fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
>> [118209.186318]  0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
>> [118209.186318] Call Trace:
>> [118209.186318]  [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
>> [118209.186318]  [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
>> [118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
>> [118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
>> [118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
>> [118209.186318]  [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
>> [118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
>> [118209.186318]  [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
>> [118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
>> [118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
>> [118209.186318]  [<ffffffff81195637>] do_mount+0x8a6/0x9e8
>> [118209.186318]  [<ffffffff8119598d>] SyS_mount+0x77/0x9f
>> [118209.186318]  [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
>> [118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b 4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
>> [118209.186318] RIP  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
>> [118209.186318]  RSP <ffff8800af34faa8>
>> [118209.230735] ---[ end trace 83938f987d85d477 ]---
>>
>> So fix this by not treating the error -EEXIST, returned when attempting
>> to insert a root already inserted by the backref walking code, as an
>> error.
>>
>> The following test case for xfstests reproduces the bug:
>>
>>    seq=`basename $0`
>>    seqres=$RESULT_DIR/$seq
>>    echo "QA output created by $seq"
>>    tmp=/tmp/$$
>>    status=1     # failure is the default!
>>    trap "_cleanup; exit \$status" 0 1 2 3 15
>>
>>    _cleanup()
>>    {
>>        _cleanup_flakey
>>        cd /
>>        rm -f $tmp.*
>>    }
>>
>>    # get standard environment, filters and checks
>>    . ./common/rc
>>    . ./common/filter
>>    . ./common/dmflakey
>>
>>    # real QA test starts here
>>    _supported_fs btrfs
>>    _supported_os Linux
>>    _require_scratch
>>    _require_dm_target flakey
>>    _require_metadata_journaling $SCRATCH_DEV
>>
>>    rm -f $seqres.full
>>
>>    _scratch_mkfs >>$seqres.full 2>&1
>>    _init_flakey
>>    _mount_flakey
>>
>>    _run_btrfs_util_prog quota enable $SCRATCH_MNT
>>
>>    # Create 2 directories with one file in one of them.
>>    # We use these just to trigger a transaction commit later, moving the
>>    # file from directory a to directory b and doing an fsync against
>>    # directory a.
>>    mkdir $SCRATCH_MNT/a
>>    mkdir $SCRATCH_MNT/b
>>    touch $SCRATCH_MNT/a/f
>>    sync
>>
>>    # Create our test file with 2 4K extents.
>>    $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
>>
>>    # Create a snapshot and delete it. This doesn't really delete the
>>    # snapshot immediately, it just makes it inaccessible and invisible to
>>    # user space. The snapshot is deleted later by a dedicated kernel thread
>>    # (cleaner kthread), which is woken up at the next transaction commit.
>>    # A root orphan item is inserted into the tree of tree roots, so that if
>>    # a power failure happens before the dedicated kernel thread does the
>>    # snapshot deletion, the next time the filesystem is mounted it resumes
>>    # the snapshot deletion.
>>    _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
>>    _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap
>>
>>    # Now overwrite half of the extents we wrote before. Because we made a
>>    # snapshot before, which isn't really deleted yet (since no transaction
>>    # commit happened after we did the snapshot delete request), the
>>    # non-overwritten extents get referenced twice, once by the default
>>    # subvolume and once by the snapshot.
>>    $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
>>
>>    # Now move file f from directory a to directory b and fsync directory a.
>>    # The fsync on the directory a triggers a transaction commit (because a
>>    # file was moved from it to another directory) and the file fsync leaves
>>    # a log tree with file extent items to replay.
>>    mv $SCRATCH_MNT/a/f $SCRATCH_MNT/b/f
>>    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
>>    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
>>
>>    echo "File digest before power failure:"
>>    md5sum $SCRATCH_MNT/foobar | _filter_scratch
>>
>>    # Now simulate a power failure and mount the filesystem to replay the
>>    # log tree.
>>    # After the log tree was replayed, we used to hit a BUG_ON() when
>>    # processing the root orphan item for the deleted snapshot. This is
>>    # because when processing an orphan root the code expected to be the
>>    # first code inserting the root into the fs_info->fs_root_radix radix
>>    # tree, while in reality it was the second caller attempting to do it -
>>    # the first caller was the transaction commit that took place after
>>    # replaying the log tree, when updating the qgroup counters.
>>    _flakey_drop_and_remount
>>
>>    echo "File digest after power failure:"
>>    # Must match what we got before the power failure.
>>    md5sum $SCRATCH_MNT/foobar | _filter_scratch
>>
>>    _unmount_flakey
>>    status=0
>>    exit
>>
>> Fixes: 2d9e97761087 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
>> Cc: stable@xxxxxxxxxxxxxxx  # 4.4+
>> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
>
>
> Reviewed-by: Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>
>
> Looks good, and the comment is clear enough.
>
> Thanks for your long effort to spot and fix corner cases like this.

Well, using qgroups, deleting snapshots and fsync'ing file data isn't
that much of a rare use case, is it? :P


>
> Thanks,
> Qu
>
>
>> ---
>>   fs/btrfs/root-tree.c | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
>> index a25f3b2..9fcd6df 100644
>> --- a/fs/btrfs/root-tree.c
>> +++ b/fs/btrfs/root-tree.c
>> @@ -310,8 +310,16 @@ int btrfs_find_orphan_roots(struct btrfs_root *tree_root)
>>                 set_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED, &root->state);
>>
>>                 err = btrfs_insert_fs_root(root->fs_info, root);
>> +               /*
>> +                * The root might have been inserted already, as before we look
>> +                * for orphan roots, log replay might have happened, which
>> +                * triggers a transaction commit and qgroup accounting, which
>> +                * in turn reads and inserts fs roots while doing backref
>> +                * walking.
>> +                */
>> +               if (err == -EEXIST)
>> +                       err = 0;
>>                 if (err) {
>> -                       BUG_ON(err == -EEXIST);
>>                         btrfs_free_fs_root(root);
>>                         break;
>>                 }
>>
>
>
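
For anyone reading the thread without the kernel tree at hand, here is a
rough user-space model of what the hunk above changes. It is only a sketch
under stated assumptions: toy_fs_info, toy_root, toy_insert_fs_root() and
toy_load_orphan_root() are made-up names standing in for
fs_info->fs_root_radix, struct btrfs_root, btrfs_insert_fs_root() and the
loop body of btrfs_find_orphan_roots(). A flat array replaces the radix
tree, and none of this is the actual kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only, not kernel APIs. */
#define TOY_MAX_ROOTS 16

struct toy_root {
	unsigned long long objectid;
};

/* Plays the role of the fs_info->fs_root_radix cache of loaded roots. */
struct toy_fs_info {
	struct toy_root *roots[TOY_MAX_ROOTS];
	int nr_roots;
};

/*
 * Models an insertion that refuses duplicates: returns -EEXIST if a root
 * with the same objectid is already cached, 0 on success.
 */
int toy_insert_fs_root(struct toy_fs_info *fs, struct toy_root *root)
{
	int i;

	assert(fs != NULL && root != NULL);
	for (i = 0; i < fs->nr_roots; i++) {
		if (fs->roots[i]->objectid == root->objectid)
			return -EEXIST;
	}
	if (fs->nr_roots == TOY_MAX_ROOTS)
		return -ENOSPC;
	fs->roots[fs->nr_roots++] = root;
	return 0;
}

/*
 * Models the patched orphan root path: an earlier caller (the backref
 * walk done for qgroup accounting, in the real scenario) may already have
 * inserted this root, so -EEXIST is treated as success instead of a bug.
 */
int toy_load_orphan_root(struct toy_fs_info *fs, struct toy_root *root)
{
	int err = toy_insert_fs_root(fs, root);

	if (err == -EEXIST)
		err = 0;
	return err;
}
```

Calling toy_insert_fs_root() once (playing the part of the backref walk)
and then toy_load_orphan_root() for the same objectid returns 0 and leaves
a single cached entry, which is the situation where the old BUG_ON() fired.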