Re: kernel BUG at linux-4.2.0/fs/btrfs/extent-tree.c:1833 on rebalance

On 2015-09-23 17:40, Stéphane Lesimple wrote:
On 2015-09-23 09:03, Qu Wenruo wrote:
Stéphane Lesimple wrote on 2015/09/22 16:31 +0200:
On 2015-09-22 10:51, Qu Wenruo wrote:
[92098.842261] Call Trace:
[92098.842277]  [<ffffffffc035a5d8>] ? read_extent_buffer+0xb8/0x110 [btrfs]
[92098.842304]  [<ffffffffc0396d00>] ? btrfs_find_all_roots+0x60/0x70 [btrfs]
[92098.842329]  [<ffffffffc039af3d>] btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs]

Would you please show the code for it?
This one seems to be another stupid bug I made when rewriting the
framework.
Maybe I forgot to reinit some variables, or I'm corrupting memory...

(gdb) list *(btrfs_qgroup_rescan_worker+0x28d)
0x97f6d is in btrfs_qgroup_rescan_worker (fs/btrfs/ctree.h:2760).
2755
2756    static inline void btrfs_disk_key_to_cpu(struct btrfs_key *cpu,
2757                                              struct btrfs_disk_key *disk)
2758    {
2759            cpu->offset = le64_to_cpu(disk->offset);
2760            cpu->type = disk->type;
2761            cpu->objectid = le64_to_cpu(disk->objectid);
2762    }
2763
2764    static inline void btrfs_cpu_key_to_disk(struct btrfs_disk_key *disk,
(gdb)


Does it make sense?
So it seems that the memory of the cpu key is being screwed up...

The code here is a pretty specific thin inline function, so what about
the other stack frames?
Like btrfs_qgroup_rescan_helper+0x12?

Thanks,
Qu
Oh, I forgot that you can just change the offset in
btrfs_qgroup_rescan_worker+0x28d to a smaller value.
Try +0x280 for example, which steps the address back 13 bytes of asm
and may jump out of the inline function range, giving you a good hint.

Or gdb may have a better mode for inline functions, but I don't know...

Actually, "list -" is our friend here (it shows 10 lines before the last
source output).
No, that's not the case.

"list -" will only show lines around the same source location.

What I need is the higher caller frames of the stack.
If we were debugging a running program, it would be easy enough to just
use the frame command.

But in this situation we don't have a live call stack, so I'd like to move
the +0x28d a few bytes backward, until we jump out of the inline
function call and see the meaningful code.
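(Side note: if the .ko you load into gdb was built with debug info, binutils'
addr2line may also unwind the inlined scopes for us, something like:

addr2line -f -i -e btrfs.ko 0x97f6d

where -i prints the whole inline chain up to btrfs_qgroup_rescan_worker.
I haven't double-checked the address handling for modules, though, so treat
it as a hint rather than a recipe.)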

Ah, you're right.
I had a hard time finding a value where I wouldn't end up in another inline
function or somewhere else in the kernel code entirely, but here it is:

(gdb) list *(btrfs_qgroup_rescan_worker+0x26e)
0x97f4e is in btrfs_qgroup_rescan_worker (fs/btrfs/qgroup.c:2237).
2232            memcpy(scratch_leaf, path->nodes[0], sizeof(*scratch_leaf));
2233            slot = path->slots[0];
2234            btrfs_release_path(path);
2235            mutex_unlock(&fs_info->qgroup_rescan_lock);
2236
2237            for (; slot < btrfs_header_nritems(scratch_leaf); ++slot) {
2238                    btrfs_item_key_to_cpu(scratch_leaf, &found, slot); <== here
2239                    if (found.type != BTRFS_EXTENT_ITEM_KEY &&
2240                        found.type != BTRFS_METADATA_ITEM_KEY)
2241                            continue;

The btrfs_item_key_to_cpu() inline function calls two other inline functions:

static inline void btrfs_item_key_to_cpu(struct extent_buffer *eb,
                                         struct btrfs_key *key, int nr)
{
         struct btrfs_disk_key disk_key;
         btrfs_item_key(eb, &disk_key, nr);
         btrfs_disk_key_to_cpu(key, &disk_key); <== this is 0x28d
}

btrfs_disk_key_to_cpu() is the inline referenced by +0x28d, and this is where
the GPF happens.
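For context, btrfs_item_key() is itself just a thin wrapper that reads the
on-disk key out of the extent buffer pages (roughly; I'm paraphrasing the
header rather than pasting it, so the exact body may differ slightly):

static inline void btrfs_item_key(struct extent_buffer *eb,
                                  struct btrfs_disk_key *disk_key, int nr)
{
        /* locate the item header in the leaf, then copy its key out */
        struct btrfs_item *item = btrfs_item_nr(nr);
        read_eb_member(eb, item, struct btrfs_item, key, disk_key);
}

read_eb_member() is a macro around read_extent_buffer(), which walks
eb->pages[], and that's presumably the read_extent_buffer+0xb8 frame in the
trace above. So a bogus scratch_leaf would get dereferenced somewhere along
that path.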

Thanks, now things are much clearer.
Not completely sure, but scratch_leaf seems to be invalid and to be causing the bug.
(found is in stack memory, so I don't think it's the cause.)

But this is less related to the qgroup rework, as that's existing code.

A quick glance already shows some dirty and maybe deadly hacks, like copying the whole extent buffer, which includes the pages and all kinds of locks.

I'm not 100% sure that's the problem, but I'll create a patch for you to test in the coming days.
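One possible direction (a rough sketch only, not the actual patch, and
assuming btrfs_clone_extent_buffer() is usable in that context): instead of
memcpy()ing the live extent buffer together with its pages, refcount and
locks, take a private clone of the leaf and free it when done:

        /* rough sketch only: the clone copies the page contents into a
         * private extent buffer instead of aliasing the live one */
        scratch_leaf = btrfs_clone_extent_buffer(path->nodes[0]);
        if (!scratch_leaf) {
                ret = -ENOMEM;
                goto out;       /* hypothetical error path */
        }
        ...
        free_extent_buffer(scratch_leaf);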



BTW, did you try the following patch?
https://patchwork.kernel.org/patch/7114321/
btrfs: qgroup: exit the rescan worker during umount

The problem seems somewhat related to the bug you encountered, so I'd
recommend giving it a try.

Not yet, but I've come across this bug too during my tests: starting a rescan
and umounting gets you a crash. I didn't mention it because I was sure it
was an already-known bug. Nice to see it has been fixed, though!
I'll certainly give it a try, but I'm not really sure it'll fix the specific
bug we're talking about.
However, the group of patches posted by Mark should fix the qgroup count
discrepancies as I understand it, right? It might be of interest to try them
all at once, for sure.

Yes, his patches should fix the qgroup count mismatch problem for subvolume removal.

If I read the code correctly, after the removal and a sync, the accounting numbers for the qgroup of the deleted subvolume should be:
rfer = 0 and excl = 0.
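(For example, something like this should show both numbers dropping to 0 once
the deletion is fully committed:

 # btrfs filesystem sync <mountpoint>
 # btrfs qgroup show <mountpoint>

then look at the row for the deleted subvolume's qgroupid. <mountpoint> is
just a placeholder, and the exact output columns depend on the btrfs-progs
version.)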

Thanks,
Qu
