On Mon, Apr 30, 2012 at 03:01:04PM +0200, Marco L. Crociani wrote:
> ./btrfs device delete missing /mnt/sda3
> ERROR: error removing the device 'missing' - Input/output error
>
>
> Apr 30 13:17:57 evo kernel: [ 108.866205] btrfs: allowing degraded mounts
> Apr 30 13:17:57 evo kernel: [ 108.866214] btrfs: disk space caching is enabled
> Apr 30 13:18:32 evo kernel: [ 143.274899] btrfs: relocating block
> group 1401002393600 flags 17
> Apr 30 13:19:25 evo kernel: [ 196.888248] btrfs csum failed ino 257
> off 910946304 csum 432355644 private 175165154
> Apr 30 13:19:25 evo kernel: [ 196.889900] btrfs csum failed ino 257
> off 910946304 csum 432355644 private 175165154
> Apr 30 13:19:25 evo kernel: [ 196.890429] btrfs csum failed ino 257
> off 910946304 csum 432355644 private 175165154
> Apr 30 13:19:25 evo kernel: [ 197.087419] btrfs csum failed ino 257
> off 910946304 csum 432355644 private 175165154
> Apr 30 13:19:25 evo kernel: [ 197.087681] btrfs csum failed ino 257
> off 910946304 csum 432355644 private 175165154
The failed checksums prevent the data from being moved off the device, so
the removal fails with the above error.
> ./btrfs inspect-internal inode-resolve -v 257 /mnt/sda3/
> ioctl ret=-1, error: No such file or directory
So it's not a visible file; possibly a deleted but not yet cleaned-up
snapshot, or the space_cache (guessing from the inode number). But AFAICS
the checksums are turned off for the free space inode so ...
> ./btrfs scrub status /mnt/sda3/
> scrub status for c87975a0-a575-405e-9890-d3f7f25bbd96
> scrub started at Mon Apr 30 13:26:26 2012 and was aborted after 4367 seconds
> total bytes scrubbed: 406.64GB with 2 errors
> error details: csum=2
> corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
Shouldn't the csum errors be included under uncorrectable?
> Apr 30 14:37:24 evo kernel: [ 4875.275776] btrfs: checksum error at
> logical 752871157760 on dev /dev/sda3, sector 873795352, root 259,
> inode 1580389, offset 612610048, length 4096, links 1 (path:
        ^^^^^^^
So the scrub catches different checksum errors than those that appeared
during balance (inode 257).
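
To make the comparison concrete, here is a minimal sketch that pulls the fields out of the scrub warning quoted above (with the wrapped line rejoined). The helper names parse_scrub_warning() and sector_to_bytes() and the struct are made up for illustration; the only assumption about the kernel side is that the logged sector is counted in 512-byte units.

```c
#include <stdio.h>
#include <string.h>

/* Fields taken from a "btrfs: checksum error at ..." scrub warning. */
struct scrub_warning {
	unsigned long long logical;
	unsigned long long sector;
	unsigned long long root;
	unsigned long long inode;
};

/*
 * Hypothetical helper, not part of btrfs-progs: parse one rejoined log
 * line. Returns 0 on success, -1 if the line is not a scrub warning.
 */
static int parse_scrub_warning(const char *line, struct scrub_warning *w)
{
	const char *p = strstr(line, "logical ");
	char dev[64];

	if (!p)
		return -1;
	if (sscanf(p,
		   "logical %llu on dev %63[^,], sector %llu, root %llu, inode %llu",
		   &w->logical, dev, &w->sector, &w->root, &w->inode) != 5)
		return -1;
	return 0;
}

/* Assumption: sectors are 512 bytes, so the byte offset is sector << 9. */
static unsigned long long sector_to_bytes(unsigned long long sector)
{
	return sector << 9;
}
```

Fed the warning above, this gives inode 1580389 in root 259 and a byte offset on /dev/sda3 of 873795352 * 512 = 447383220224, so it is clearly not the same object as the inode 257 seen in the earlier csum failures.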
> Apr 30 14:37:24 evo kernel: [ 4875.275838] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
> Apr 30 14:37:24 evo kernel: [ 4875.275848] IP: [<ffffffff811ae841>] bio_add_page+0x11/0x60
> Apr 30 14:37:24 evo kernel: [ 4875.276022] RIP:
> 0010:[<ffffffff811ae841>] [<ffffffff811ae841>] bio_add_page+0x11/0x60
This looks like something disappeared from under scrub's hands:
	BUG_ON(!page->page);
	bio = bio_alloc(GFP_NOFS, 1);
	if (!bio)
		return -EIO;
	bio->bi_bdev = page->bdev;
	bio->bi_sector = page->physical >> 9;
	bio->bi_end_io = scrub_complete_bio_end_io;
	bio->bi_private = &complete;

	ret = bio_add_page(bio, page->page, PAGE_SIZE, 0);
	if (PAGE_SIZE != ret) {
		bio_put(bio);
		return -EIO;
	}
Everything is initialized before use here, so the NULL must be hidden
behind one of the pointers; my bet is on page->bdev->something. Thinking
again about how things got here:
* unsuccessful device remove 'missing', due to csum errors in a
  non-regular file
* crashed scrub, after an indirect access of a NULL pointer
Did I miss anything in the steps to reproduce it?
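
For reference, the fault address in the oops (0000000000000090) is small, which is what one would expect when a member is read through a NULL structure pointer: the address touched is just the member's offset. A minimal sketch of that arithmetic follows. Note that struct mock_bdev and its padding are invented for illustration and are NOT the real kernel layout; which structure and member actually sit at 0x90 would have to be checked against the running kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a kernel structure. The padding is chosen
 * only so that 'disk' lands at offset 0x90; this is not the real layout.
 */
struct mock_bdev {
	char pad[0x90];
	void *disk;
};

/* Address that would be touched when reading ->disk from 'base'. */
static uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct mock_bdev, disk);
}
```

With base == 0 (a NULL pointer), member_address() yields 0x90, matching the shape of the address reported at bio_add_page+0x11.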
david