On 5/2/2012 5:22 PM, David Sterba wrote:
> On Mon, Apr 30, 2012 at 03:01:04PM +0200, Marco L. Crociani wrote:
>> ./btrfs device delete missing /mnt/sda3
>> ERROR: error removing the device 'missing' - Input/output error
>>
>>
>> Apr 30 13:17:57 evo kernel: [ 108.866205] btrfs: allowing degraded mounts
>> Apr 30 13:17:57 evo kernel: [ 108.866214] btrfs: disk space caching is enabled
>> Apr 30 13:18:32 evo kernel: [ 143.274899] btrfs: relocating block
>> group 1401002393600 flags 17
>> Apr 30 13:19:25 evo kernel: [ 196.888248] btrfs csum failed ino 257
>> off 910946304 csum 432355644 private 175165154
>> Apr 30 13:19:25 evo kernel: [ 196.889900] btrfs csum failed ino 257
>> off 910946304 csum 432355644 private 175165154
>> Apr 30 13:19:25 evo kernel: [ 196.890429] btrfs csum failed ino 257
>> off 910946304 csum 432355644 private 175165154
>> Apr 30 13:19:25 evo kernel: [ 197.087419] btrfs csum failed ino 257
>> off 910946304 csum 432355644 private 175165154
>> Apr 30 13:19:25 evo kernel: [ 197.087681] btrfs csum failed ino 257
>> off 910946304 csum 432355644 private 175165154
>
> The failed checksums prevent the data from being moved off the device, so
> the removal fails with the above error.
>
>> ./btrfs inspect-internal inode-resolve -v 257 /mnt/sda3/
>> ioctl ret=-1, error: No such file or directory
>
> So it's not a visible file, possibly a deleted yet uncleaned snapshot or
> the space_cache (guessing from the inode number). But AFAICS the
> checksums are turned off for the free space inode so ...
>
>> ./btrfs scrub status /mnt/sda3/
>> scrub status for c87975a0-a575-405e-9890-d3f7f25bbd96
>> scrub started at Mon Apr 30 13:26:26 2012 and was aborted after 4367 seconds
>> total bytes scrubbed: 406.64GB with 2 errors
>> error details: csum=2
>> corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
>
> Shouldn't the csum errors be included under uncorrectable?
"uncorrectable errors" would have been set to 2 if no crash had happened.
>
>> Apr 30 14:37:24 evo kernel: [ 4875.275776] btrfs: checksum error at
>> logical 752871157760 on dev /dev/sda3, sector 873795352, root 259,
>> inode 1580389, offset 612610048, length 4096, links 1 (path:
> ^^^^^^^
>
> so the scrub catches different checksum errors than appeared during
> balance (inode 257).
>
>> Apr 30 14:37:24 evo kernel: [ 4875.275838] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
>> Apr 30 14:37:24 evo kernel: [ 4875.275848] IP: [<ffffffff811ae841>] bio_add_page+0x11/0x60
>> Apr 30 14:37:24 evo kernel: [ 4875.276022] RIP:
>> 0010:[<ffffffff811ae841>] [<ffffffff811ae841>] bio_add_page+0x11/0x60
>
> This looks like something disappeared from under scrub's hands.
>
>         BUG_ON(!page->page);
>         bio = bio_alloc(GFP_NOFS, 1);
>         if (!bio)
>                 return -EIO;
>         bio->bi_bdev = page->bdev;
>         bio->bi_sector = page->physical >> 9;
>         bio->bi_end_io = scrub_complete_bio_end_io;
>         bio->bi_private = &complete;
>
>         ret = bio_add_page(bio, page->page, PAGE_SIZE, 0);
>         if (PAGE_SIZE != ret) {
>                 bio_put(bio);
>                 return -EIO;
>         }
>
> Everything is initialized before use here, so the NULL must be hidden behind
> one of the pointers; my bet is on page->bdev->something. Thinking again about
> how things got here:
>
> * unsuccessful device remove 'missing', due to csum errors in a
> non-regular file
> * crashed scrub, after indirect access of a null pointer
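
The faulting address 0x90 in the oops fits that guess: reading a member
through a NULL struct pointer faults at the member's offset rather than at 0.
A minimal sketch of the arithmetic follows. The struct layout and the names
fake_bdev, field_of_interest and member_address are hypothetical and are not
the kernel's struct block_device; the 0x90 padding is chosen only to match
the number in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT the kernel's struct block_device. */
struct fake_bdev {
	char earlier_fields[0x90];	/* stand-in for whatever precedes the member */
	void *field_of_interest;	/* hypothetical member that the callee reads */
};

/*
 * Address the CPU would touch when reading base->field_of_interest.
 * Computed arithmetically; nothing is dereferenced here.
 */
static uintptr_t member_address(uintptr_t base)
{
	size_t off = offsetof(struct fake_bdev, field_of_interest);

	assert(base <= UINTPTR_MAX - off);	/* guard against wrap-around */
	return base + off;
}
```

With a base of 0 (a NULL pointer), member_address() yields 0x90, which is the
shape of address the oops reports. Which real member was being read is not
something this sketch can tell us.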
>
> Did I miss anything in the steps to reproduce it?
Right. bdev is a NULL pointer for missing devices. Scrub tries to repair
the checksum error by reading from the other mirrors, and one of those
mirrors is on the missing device, whose bdev is NULL.
I'll send a patch tomorrow to prevent the scrub crash in this situation.
Thanks!
From 28fa74661f7a0e209a826e212b40d667516f5d1f Mon Sep 17 00:00:00 2001
From: Stefan Behrens <sbehrens@xxxxxxxxxxxxxxxx>
Date: Wed, 2 May 2012 18:49:57 +0200
Subject: [PATCH] Btrfs: fix crash in scrub correction code when device is missing

When scrub tries to fix an I/O or checksum error and one of the devices
containing the mirror is missing, it crashes on bdev being a NULL pointer.

Signed-off-by: Stefan Behrens <sbehrens@xxxxxxxxxxxxxxxx>
---
fs/btrfs/scrub.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b679bf6..967bcf1 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -998,6 +998,8 @@ static int scrub_setup_recheck_block(struct scrub_dev *sdev,
page = sblock->pagev + page_index;
page->logical = logical;
page->physical = bbio->stripes[mirror_index].physical;
+ if (bbio->stripes[mirror_index].dev->missing)
+ continue;
page->bdev = bbio->stripes[mirror_index].dev->bdev;
page->mirror_num = mirror_index + 1;
page->page = alloc_page(GFP_NOFS);
--
1.7.10.1.362.g242cab3
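
For anyone following along outside the kernel tree, the effect of the guard
can be modelled in userspace roughly as below. This is only a sketch under
simplifying assumptions: model_device, model_stripe, model_page, the usable
flag and setup_recheck_pages() are made-up stand-ins, not the structures or
functions in fs/btrfs/scrub.c, and page allocation is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; NOT the kernel structures. */
struct model_device {
	bool missing;	/* true when the device is absent (degraded mount) */
	void *bdev;	/* NULL when the device is missing */
};

struct model_stripe {
	struct model_device *dev;
	unsigned long long physical;
};

struct model_page {
	void *bdev;
	unsigned long long physical;
	int mirror_num;
	bool usable;	/* hypothetical flag: set only for mirrors we may read */
};

/*
 * Prepare one recheck page per mirror, skipping mirrors that live on a
 * missing device so that a NULL bdev is never handed to the I/O path.
 * Returns the number of mirrors that can actually be read.
 */
static int setup_recheck_pages(const struct model_stripe *stripes,
			       int num_stripes, struct model_page *pages)
{
	int usable = 0;

	for (int i = 0; i < num_stripes; i++) {
		pages[i].bdev = NULL;
		pages[i].mirror_num = 0;
		pages[i].usable = false;
		pages[i].physical = stripes[i].physical;

		if (stripes[i].dev->missing)
			continue;	/* same idea as the two added lines */

		pages[i].bdev = stripes[i].dev->bdev;
		pages[i].mirror_num = i + 1;
		assert(pages[i].bdev != NULL);
		pages[i].usable = true;
		usable++;
	}
	return usable;
}
```

In this model a two-mirror stripe with one missing device ends up with one
usable page, and the page for the missing mirror keeps a NULL bdev that no
later step is supposed to touch.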