Hi Qu, hi all,

> RO snapshot, I remember there is a RO snapshot bug, but seems fixed in 4.x?

Yes, that bug has already been fixed.

> For recovery, first just try cp -r <mnt>/* to grab what's still completely OK.
> Maybe recovery mount option can do some help in the process?

That's what I did now. I mounted with "recovery" and copied all of my
important data. But several folders/files couldn't be read - whenever I
touched them, the whole system stopped responding. Nothing in the logs,
nothing on the screen, everything just frozen. So I will have to take
these files out of my backup. Several files also produced "checksum verify
failed", "csum failed" and "no csum found" errors in the syslog.

> Then you may try "btrfs restore", which is the safest method, won't
> write any byte into the offline disks.

Yes, but I would need at least as much storage space as the original data
occupies - and I don't have that much free space anywhere else (or at
least not quickly available).
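So that the steps are in one place, the salvage sequence we are talking
about boils down to roughly the following. This is only a sketch - the
device and target paths are placeholders for my LUKS mappings and backup
disk, and --path-regex needs a btrfs-progs version that supports it:

    # mount read-only with the recovery option and copy out what still reads
    mount -o ro,recovery /dev/mapper/luks-disk1 /mnt/broken
    cp -a /mnt/broken/* /mnt/backup/salvage/
    umount /mnt/broken

    # offline salvage with "btrfs restore" - it never writes to the array.
    # -D is a dry run; --path-regex limits the restore to selected subtrees
    # when there is not enough free space to restore everything.
    btrfs restore -v -i -D /dev/mapper/luks-disk1 /mnt/backup/restore
    btrfs restore -v -i --path-regex '^/(|important(|/.*))$' \
        /dev/mapper/luks-disk1 /mnt/backup/restore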
> Lastly, you can try the btrfsck --repair, *WITH BINARY BACKUP OF YOUR DISKS*

I don't have a bit-for-bit copy of my disks, but all important data is
safe now. So I tried it, see below.

> BTW, if you decided to use btrfs --repair, please upload the full
> output, since we can use it to improve the b-tree recovery codes.

OK, see below.

> (Yeah, welcome to be a laboratory mice of real world b-tree recovery codes)

Haha, right. Since I have been testing the experimental RAID6 features of
btrfs for a while, I know what it means to be a laboratory mouse ;)

So back to btrfsck. I started it, and after a while the following showed
up in the syslog, again and again: https://paste.ee/p/BIs56
According to the internet this is a known but very rare problem with my
LSI 9211-8i controller: it happens when the PCIe generation autodetection
detects the card as a PCIe 3.0 card instead of 2.0 and heavy I/O is going
on. I have never ever hit this bug before, so it must be coincidence that
it shows up now... but it is not the root cause of this broken filesystem.
As a result the syslog filled with "blk_update_request: I/O error",
"FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE", "Add. Sense:
Power on, reset, or bus device reset occurred" and "Buffer I/O error"/
"lost async page write" messages.
The result of "btrfsck --repair" up to this point: https://paste.ee/p/nzzAo
Then btrfsck died: https://paste.ee/p/0Brku

Now I rebooted and forced the card to PCIe generation 2.0, so this bug
shouldn't happen again, and started "btrfsck --repair" once more. This
time it ran without controller problems; you can find the full output here
(huge, about 13 MB):
https://ssl-account.com/oc.tobby.eu/public.php?service=files&t=8b93f56a69ea04886e9bc2c8534b32f6

Result: one of the four folders in my root directory is completely gone
(about 8 TB). Two folders seem to be OK (about 1.4 TB). The last folder is
OK in terms of folder and subfolder structure, but nearly all subfolders
are empty (only 230 GB of the 3.1 TB are still there). So roughly 90% of
the data is gone now.

I will now destroy the filesystem, create a new btrfs RAID-6 and fetch the
data out of my backups. I hope my logs help a little bit to find the
cause. I didn't have the time to try to reproduce this broken filesystem
myself - did you try it with loop devices?
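If it helps, this is roughly the loop-device sequence I have in mind,
following the steps from my earlier mail quoted below. It is only a
sketch - no LUKS layer, and the sizes, mount point and test data are
placeholders:

    #!/bin/bash
    # Rough reproduction attempt on loop devices (no LUKS), following the
    # steps described in the quoted mail below.
    set -e
    MNT=/mnt/raid6-test
    mkdir -p "$MNT"

    devs=()
    for i in 1 2 3 4 5 6; do
        truncate -s 5G /tmp/raid6-disk$i.img
        devs+=("$(losetup -f --show /tmp/raid6-disk$i.img)")
    done

    # same mkfs options as on the real array
    mkfs.btrfs -f -m raid6 -d raid6 -L t-raid \
        -O extref,raid56,skinny-metadata,no-holes "${devs[@]}"
    btrfs device scan "${devs[@]}"

    # fill it while mounted with the original (zlib) options
    mount -o compress-force=zlib,space_cache,autodefrag "${devs[0]}" "$MNT"
    cp -a /usr/share "$MNT/data"          # any bulk test data will do
    umount "$MNT"

    # remount with lzo, take a read-only snapshot, defrag with -clzo while
    # some other I/O is running
    mount -o compress-force=lzo,space_cache "${devs[0]}" "$MNT"
    btrfs subvolume snapshot -r "$MNT" "$MNT/snap"
    dd if=/dev/urandom of="$MNT/background.io" bs=1M count=1024 &
    btrfs filesystem defragment -r -clzo "$MNT/data"
    wait

    # delete the snapshot, give the cleaner some time, then balance and scrub
    btrfs subvolume delete "$MNT/snap"
    sync; sleep 60
    btrfs balance start "$MNT"
    btrfs scrub start -Bd "$MNT"

    umount "$MNT"
    for d in "${devs[@]}"; do losetup -d "$d"; done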
Regards,
Tobias


2015-05-29 4:27 GMT+02:00 Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>:
>
>
> -------- Original Message --------
> Subject: Re: Uncorrectable errors on RAID6
> From: Tobias Holst <tobby@xxxxxxxx>
> To: Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>
> Date: 2015年05月29日 10:00
>
>> Thanks, Qu, sad news... :-(
>>
>> No, I also didn't defrag with older kernels. Maybe I did it a while
>> ago with 3.19.x, but there was a scrub afterwards and it showed no
>> error, so this shouldn't be the problem. The things described above
>> were all done with 4.0.3/4.0.4.
>>
>> Balances and scrubs all stop at ~1.5 TiB of ~13.3 TiB. Balance with an
>> error in the log, scrub just doesn't do anything according to dstat,
>> without any error, and still shows "running".
>>
>> The errors/problems started during the first balance but maybe this
>> only showed them and is not the cause.
>>
>> Here are detailed debug infos to (maybe?) recreate the problem. This is
>> exactly what happened here over some time. As I can only tell when it
>> definitively has been clean (scrub at the beginning of May) and when it
>> definitively was broken (now, end of May), there may be some more
>> steps necessary to reproduce, because several things happened in the
>> meantime:
>> - filesystem was created with "mkfs.btrfs -f -m raid6 -d raid6 -L
>> t-raid -O extref,raid56,skinny-metadata,no-holes" with 6
>> LUKS-encrypted HDDs on kernel 3.19
>
> LUKS...
> Even LUKS is much stabler than btrfs, and may not be related to the
> bug, your setup is quite complex anyway.
>
>> - mounted with options
>> "defaults,compress-force=zlib,space_cache,autodefrag"
>
> Normally I'd not recommend compress-force as btrfs can auto detect
> compress ratio.
> But such a complex setup, with such mount options on a LUKS base, should
> be quite a good playground to produce some of these bugs.
>
>> - copied all data onto it
>> - all data on the devices is now compressed with zlib
>> -> until now the filesystem is ok, scrub shows no errors
>
> autodefrag seems not related to this bug as you removed it from the
> mount options.
> It doesn't even have an effect, as you copy data from another place,
> without overwrites.
>
>> - now mount it with "defaults,compress-force=lzo,space_cache" instead
>> - use kernel 4.0.3/4.0.4
>> - create a r/o-snapshot
>
> RO snapshot, I remember there is a RO snapshot bug, but seems fixed in 4.x?
>
>> - defrag some data with "-clzo"
>> - have some (not much) I/O during the process
>> - this should approx. double the size of the defragged data because
>> your snapshot contains your data compressed with zlib and your volume
>> contains your data compressed with lzo
>> - delete the snapshot
>> - wait some time until the cleaning is complete, still some other I/O
>> during this
>> - this doesn't free as much data as the snapshot contained (?)
>> -> is this ok? Maybe here the problem already existed/started
>> - defrag the rest of all data on the devices with "-clzo", still some
>> other I/O during this
>> - now start a balance of the whole array
>> -> errors will spam the log and it's broken.
>>
>> I hope it is possible to reproduce the errors and find out exactly
>> when this happens. I'll do the same steps again, too, but maybe there
>> is someone else who could try it as well?
>
> I'll try it with a script, but maybe without LUKS to simplify the setup.
>
>> With some small loop-devices
>> just for testing this shouldn't take too long even if it sounds like
>> that ;-)
>>
>> Back to my actual data: Are there any tips on how to recover?
>
> For recovery, first just try cp -r <mnt>/* to grab what's still completely
> OK.
> Maybe recovery mount option can do some help in the process?
>
> Then you may try "btrfs restore", which is the safest method, won't
> write any byte into the offline disks.
>
> Lastly, you can try the btrfsck --repair, *WITH BINARY BACKUP OF YOUR DISKS*
>
> For best luck, it can make your filesystem completely clean at the cost
> of some files lost (maybe the file name lost, or part of the data lost,
> or nothing remaining).
> Some corrupted files can be partly recovered into the 'lost+found' dir
> of each subvolume.
> In the best case, the recovered fs can pass btrfsck without any error.
>
> But for your case, the salvaged data will be somewhat meaningless, as
> it works best for uncompressed data!
>
> And in the worst case, your filesystem will be corrupted even more.
> So consider twice before using btrfsck --repair.
>
> BTW, if you decided to use btrfs --repair, please upload the full
> output, since we can use it to improve the b-tree recovery codes.
> (Yeah, welcome to be a laboratory mice of real world b-tree recovery codes)
>
> Thanks,
> Qu
>
>> Mount
>> with "recover", copy over and see the log, which files seem to be
>> broken? Or some (dangerous) tricks on how to repair this broken file
>> system?
>> I do have a full backup, but it's very slow and may take weeks
>> (months?), if I have to recover everything.
>>
>> Regards,
>> Tobias
>>
>>
>>
>> 2015-05-29 2:36 GMT+02:00 Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>:
>>>
>>>
>>>
>>> -------- Original Message --------
>>> Subject: Re: Uncorrectable errors on RAID6
>>> From: Tobias Holst <tobby@xxxxxxxx>
>>> To: Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>
>>> Date: 2015年05月28日 21:13
>>>
>>>> Ah it's already done. You can find the error-log over here:
>>>> https://paste.ee/p/sxCKF
>>>>
>>>> In short there are several of these:
>>>> bytenr mismatch, want=6318462353408, have=56676169344768
>>>> checksum verify failed on 8955306033152 found 14EED112 wanted 6F1EB890
>>>> checksum verify failed on 8955306033152 found 14EED112 wanted 6F1EB890
>>>> checksum verify failed on 8955306033152 found 5B5F717A wanted C44CA54E
>>>> checksum verify failed on 8955306033152 found CF62F201 wanted E3B7021A
>>>> checksum verify failed on 8955306033152 found CF62F201 wanted E3B7021A
>>>>
>>>> and these:
>>>> ref mismatch on [13431504896 16384] extent item 1, found 0
>>>> Backref 13431504896 root 7 not referenced back 0x1202acc0
>>>> Incorrect global backref count on 13431504896 found 1 wanted 0
>>>> backpointer mismatch on [13431504896 16384]
>>>> owner ref check failed [13431504896 16384]
>>>>
>>>> and these:
>>>> ref mismatch on [1951739412480 524288] extent item 0, found 1
>>>> Backref 1951739412480 root 5 owner 27852 offset 644349952 num_refs 0
>>>> not found in extent tree
>>>> Incorrect local backref count on 1951739412480 root 5 owner 27852
>>>> offset 644349952 found 1 wanted 0 back 0x1a92aa20
>>>> backpointer mismatch on [1951739412480 524288]
>>>>
>>>> Any ideas? :)
>>>>
>>> The metadata is really corrupted...
>>>
>>> I'd recommend to salvage your data as soon as possible.
>>>
>>> For the reason, as you didn't run replace, it should at least not be
>>> the bug spotted by Zhao Lei.
>>>
>>> BTW, did you run defrag on older kernels?
>>> IIRC, old kernels had a bug with snapshot-aware defrag, so it was later
>>> disabled in newer kernels.
>>> Not sure if it's related.
>>>
>>> Balance may be related but I'm not familiar with balance on RAID5/6.
>>> So hard to say.
>>>
>>> Sorry for being unable to provide much help.
>>>
>>> But if you have enough time to find a stable method to reproduce the
>>> bug, best try it on a loop device, it would definitely help us to debug.
>>> >>> Thanks, >>> Qu >>> >>> >>>> Regards >>>> Tobias >>>> >>>> >>>> 2015-05-28 14:57 GMT+02:00 Tobias Holst <tobby@xxxxxxxx>: >>>>> >>>>> >>>>> Hi Qu, >>>>> >>>>> no, I didn't run a replace. But I ran a defrag with "-clzo" on all >>>>> files while there has been slightly I/O on the devices. Don't know if >>>>> this could cause corruptions, too? >>>>> >>>>> Later on I deleted a r/o-snapshot which should free a big amount of >>>>> storage space. It didn't free as much as it should so after a few days >>>>> I started a balance to free the space. During the balance the first >>>>> checksum errors happened and the whole balance process crashed: >>>>> >>>>> [19174.342882] BTRFS: dm-5 checksum verify failed on 6318462353408 >>>>> wanted 25D94CD6 found 8BA427D4 level 1 >>>>> [19174.365473] BTRFS: dm-5 checksum verify failed on 6318462353408 >>>>> wanted 25D94CD6 found 8BA427D4 level 1 >>>>> [19174.365651] BTRFS: dm-5 checksum verify failed on 6318462353408 >>>>> wanted 25D94CD6 found 8BA427D4 level 1 >>>>> [19174.366168] BTRFS: dm-5 checksum verify failed on 6318462353408 >>>>> wanted 25D94CD6 found 8BA427D4 level 1 >>>>> [19174.366250] BTRFS: dm-5 checksum verify failed on 6318462353408 >>>>> wanted 25D94CD6 found 8BA427D4 level 1 >>>>> [19174.366392] BTRFS: dm-5 checksum verify failed on 6318462353408 >>>>> wanted 25D94CD6 found 8BA427D4 level 1 >>>>> [19174.367313] ------------[ cut here ]------------ >>>>> [19174.367340] kernel BUG at >>>>> /home/kernel/COD/linux/fs/btrfs/relocation.c:242! >>>>> [19174.367384] invalid opcode: 0000 [#1] SMP >>>>> [19174.367418] Modules linked in: iosf_mbi kvm_intel kvm >>>>> crct10dif_pclmul ppdev dm_crypt crc32_pclmul ghash_clmulni_intel >>>>> aesni_intel aes_x86_64 lrw gf128mul glue_helper parport_pc ablk_helper >>>>> cryptd mac_hid 8250_fintek virtio_rng serio_raw i2c_piix4 pvpanic lp >>>>> parport btrfs xor raid6_pq cirrus syscopyarea sysfillrect sysimgblt >>>>> ttm mpt2sas drm_kms_helper raid_class scsi_transport_sas drm floppy >>>>> psmouse pata_acpi >>>>> [19174.367656] CPU: 1 PID: 4960 Comm: btrfs Not tainted >>>>> 4.0.4-040004-generic #201505171336 >>>>> [19174.367703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >>>>> BIOS Bochs 01/01/2011 >>>>> [19174.367752] task: ffff8804274e8000 ti: ffff880367b50000 task.ti: >>>>> ffff880367b50000 >>>>> [19174.367797] RIP: 0010:[<ffffffffc05ec4ba>] [<ffffffffc05ec4ba>] >>>>> backref_cache_cleanup+0xea/0x100 [btrfs] >>>>> [19174.367867] RSP: 0018:ffff880367b53bd8 EFLAGS: 00010202 >>>>> [19174.367905] RAX: ffff88008250d8f8 RBX: ffff88008250d820 RCX: >>>>> 0000000180200001 >>>>> [19174.367948] RDX: ffff88008250d8d8 RSI: ffff88008250d8e8 RDI: >>>>> 0000000040000000 >>>>> [19174.367992] RBP: ffff880367b53bf8 R08: ffff880418b77780 R09: >>>>> 0000000180200001 >>>>> [19174.368037] R10: ffffffffc05ec1d9 R11: 0000000000018bf8 R12: >>>>> 0000000000000001 >>>>> [19174.368081] R13: ffff88008250d8e8 R14: 00000000fffffffb R15: >>>>> ffff880367b53c28 >>>>> [19174.368125] FS: 00007f7fd6831c80(0000) GS:ffff88043fc40000(0000) >>>>> knlGS:0000000000000000 >>>>> [19174.368172] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [19174.368210] CR2: 00007f65f7564770 CR3: 00000003ac92f000 CR4: >>>>> 00000000001407e0 >>>>> [19174.368257] Stack: >>>>> [19174.368279] 00000000fffffffb ffff88008250d800 ffff88042b3d46e0 >>>>> ffff88006845f990 >>>>> [19174.368327] ffff880367b53c78 ffffffffc05f25eb ffff880367b53c78 >>>>> 0000000000000002 >>>>> [19174.368376] 00ff880429e4c670 a9000010d8fb7e00 0000000000000000 >>>>> 
0000000000000000 >>>>> [19174.368424] Call Trace: >>>>> [19174.368459] [<ffffffffc05f25eb>] relocate_block_group+0x2cb/0x510 >>>>> [btrfs] >>>>> [19174.368509] [<ffffffffc05f29e0>] >>>>> btrfs_relocate_block_group+0x1b0/0x2d0 [btrfs] >>>>> [19174.368562] [<ffffffffc05c6eab>] >>>>> btrfs_relocate_chunk.isra.75+0x4b/0xd0 [btrfs] >>>>> [19174.368615] [<ffffffffc05c82e8>] __btrfs_balance+0x348/0x460 >>>>> [btrfs] >>>>> [19174.368663] [<ffffffffc05c87b5>] btrfs_balance+0x3b5/0x5d0 [btrfs] >>>>> [19174.368710] [<ffffffffc05d5cac>] btrfs_ioctl_balance+0x1cc/0x530 >>>>> [btrfs] >>>>> [19174.368756] [<ffffffff811b52e0>] ? handle_mm_fault+0xb0/0x160 >>>>> [19174.368802] [<ffffffffc05d7c7e>] btrfs_ioctl+0x69e/0xb20 [btrfs] >>>>> [19174.368845] [<ffffffff8120f5b5>] do_vfs_ioctl+0x75/0x320 >>>>> [19174.368882] [<ffffffff8120f8f1>] SyS_ioctl+0x91/0xb0 >>>>> [19174.368923] [<ffffffff817f098d>] system_call_fastpath+0x16/0x1b >>>>> [19174.368962] Code: 3b 00 75 29 44 8b a3 00 01 00 00 45 85 e4 75 1b >>>>> 44 8b 9b 04 01 00 00 45 85 db 75 0d 48 83 c4 08 5b 41 5c 41 5d 5d c3 >>>>> 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 >>>>> 00 00 >>>>> [19174.369133] RIP [<ffffffffc05ec4ba>] >>>>> backref_cache_cleanup+0xea/0x100 [btrfs] >>>>> [19174.369186] RSP <ffff880367b53bd8> >>>>> [19174.369827] ------------[ cut here ]------------ >>>>> [19174.369827] kernel BUG at >>>>> /home/kernel/COD/linux/arch/x86/mm/pageattr.c:216! >>>>> [19174.369827] invalid opcode: 0000 [#2] SMP >>>>> [19174.369827] Modules linked in: iosf_mbi kvm_intel kvm >>>>> crct10dif_pclmul ppdev dm_crypt crc32_pclmul ghash_clmulni_intel >>>>> aesni_intel aes_x86_64 lrw gf128mul glue_helper parport_pc ablk_helper >>>>> cryptd mac_hid 8250_fintek virtio_rng serio_raw i2c_piix4 pvpanic lp >>>>> parport btrfs xor raid6_pq cirrus syscopyarea sysfillrect sysimgblt >>>>> ttm mpt2sas drm_kms_helper raid_class scsi_transport_sas drm floppy >>>>> psmouse pata_acpi >>>>> [19174.369827] CPU: 1 PID: 4960 Comm: btrfs Not tainted >>>>> 4.0.4-040004-generic #201505171336 >>>>> [19174.369827] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >>>>> BIOS Bochs 01/01/2011 >>>>> [19174.369827] task: ffff8804274e8000 ti: ffff880367b50000 task.ti: >>>>> ffff880367b50000 >>>>> [19174.369827] RIP: 0010:[<ffffffff8106875f>] [<ffffffff8106875f>] >>>>> cpa_flush_array+0x10f/0x120 >>>>> [19174.369827] RSP: 0018:ffff880367b52cf8 EFLAGS: 00010046 >>>>> [19174.369827] RAX: 0000000000000092 RBX: 0000000000000000 RCX: >>>>> 0000000000000005 >>>>> [19174.369827] RDX: 0000000000000001 RSI: 0000000000000200 RDI: >>>>> 0000000000000000 >>>>> [19174.369827] RBP: ffff880367b52d48 R08: ffff880411ef2000 R09: >>>>> 0000000000000001 >>>>> [19174.369827] R10: 0000000000000004 R11: ffffffff81adb6be R12: >>>>> 0000000000000200 >>>>> [19174.369827] R13: 0000000000000001 R14: 0000000000000005 R15: >>>>> 0000000000000000 >>>>> [19174.369827] FS: 00007f7fd6831c80(0000) GS:ffff88043fc40000(0000) >>>>> knlGS:0000000000000000 >>>>> [19174.369827] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [19174.369827] CR2: 00007f65f7564770 CR3: 00000003ac92f000 CR4: >>>>> 00000000001407e0 >>>>> [19174.369827] Stack: >>>>> [19174.369827] 0000000000000001 ffff880411ef2000 0000000000000001 >>>>> 0000000000000001 >>>>> [19174.369827] ffff880367b52d48 0000000000000000 0000000000000200 >>>>> 0000000000000000 >>>>> [19174.369827] 0000000000000004 0000000000000000 ffff880367b52de8 >>>>> ffffffff8106979c >>>>> [19174.369827] Call Trace: >>>>> [19174.369827] [<ffffffff8106979c>] 
>>>>> change_page_attr_set_clr+0x23c/0x2c0 >>>>> [19174.369827] [<ffffffff810699b0>] _set_pages_array+0xf0/0x140 >>>>> [19174.369827] [<ffffffff81069a13>] set_pages_array_wc+0x13/0x20 >>>>> [19174.369827] [<ffffffffc052d926>] ttm_set_pages_caching+0x46/0x80 >>>>> [ttm] >>>>> [19174.369827] [<ffffffffc052da24>] >>>>> ttm_alloc_new_pages.isra.6+0xc4/0x1a0 [ttm] >>>>> [19174.369827] [<ffffffffc052dc76>] >>>>> ttm_page_pool_fill_locked.isra.7.constprop.12+0x96/0x140 [ttm] >>>>> [19174.369827] [<ffffffffc052dd5a>] >>>>> ttm_page_pool_get_pages.isra.8.constprop.10+0x3a/0xe0 [ttm] >>>>> [19174.369827] [<ffffffffc052dea0>] >>>>> ttm_get_pages.constprop.11+0xa0/0x1f0 [ttm] >>>>> [19174.369827] [<ffffffffc052e07c>] ttm_pool_populate+0x8c/0xf0 [ttm] >>>>> [19174.369827] [<ffffffffc052a0f3>] ? ttm_mem_reg_ioremap+0x63/0xf0 >>>>> [ttm] >>>>> [19174.369827] [<ffffffffc056146e>] cirrus_ttm_tt_populate+0xe/0x10 >>>>> [cirrus] >>>>> [19174.369827] [<ffffffffc052a7ea>] ttm_bo_move_memcpy+0x5ea/0x650 >>>>> [ttm] >>>>> [19174.369827] [<ffffffffc05266ac>] ? ttm_tt_init+0x8c/0xb0 [ttm] >>>>> [19174.369827] [<ffffffff811c3aee>] ? __vmalloc_node+0x3e/0x40 >>>>> [19174.369827] [<ffffffffc0561418>] cirrus_bo_move+0x18/0x20 [cirrus] >>>>> [19174.369827] [<ffffffffc0527f5f>] ttm_bo_handle_move_mem+0x27f/0x6f0 >>>>> [ttm] >>>>> [19174.369827] [<ffffffffc0528f7c>] ttm_bo_move_buffer+0xdc/0xf0 [ttm] >>>>> [19174.369827] [<ffffffffc0529023>] ttm_bo_validate+0x93/0xb0 [ttm] >>>>> [19174.369827] [<ffffffffc0561c3f>] cirrus_bo_push_sysram+0x8f/0xe0 >>>>> [cirrus] >>>>> [19174.369827] [<ffffffffc055feb3>] >>>>> cirrus_crtc_do_set_base.isra.9.constprop.10+0x83/0x2b0 [cirrus] >>>>> [19174.369827] [<ffffffff811df534>] ? >>>>> kmem_cache_alloc_trace+0x1c4/0x210 >>>>> [19174.369827] [<ffffffffc056056f>] cirrus_crtc_mode_set+0x48f/0x4f0 >>>>> [cirrus] >>>>> [19174.369827] [<ffffffffc04c29de>] >>>>> drm_crtc_helper_set_mode+0x35e/0x5c0 [drm_kms_helper] >>>>> [19174.369827] [<ffffffffc04c35f2>] >>>>> drm_crtc_helper_set_config+0x6d2/0xad0 [drm_kms_helper] >>>>> [19174.369827] [<ffffffffc0560f9a>] ? cirrus_dirty_update+0xca/0x320 >>>>> [cirrus] >>>>> [19174.369827] [<ffffffff811df534>] ? >>>>> kmem_cache_alloc_trace+0x1c4/0x210 >>>>> [19174.369827] [<ffffffffc0406026>] >>>>> drm_mode_set_config_internal+0x66/0x110 [drm] >>>>> [19174.369827] [<ffffffffc04ceee2>] >>>>> drm_fb_helper_pan_display+0xa2/0xf0 [drm_kms_helper] >>>>> [19174.369827] [<ffffffff814382cd>] fb_pan_display+0xbd/0x170 >>>>> [19174.369827] [<ffffffff81432629>] bit_update_start+0x29/0x60 >>>>> [19174.369827] [<ffffffff81431ee2>] fbcon_switch+0x3b2/0x560 >>>>> [19174.369827] [<ffffffff814c22f9>] redraw_screen+0x179/0x220 >>>>> [19174.369827] [<ffffffff8143024a>] fbcon_blank+0x21a/0x2d0 >>>>> [19174.369827] [<ffffffff810d0aa2>] ? wake_up_klogd+0x32/0x40 >>>>> [19174.369827] [<ffffffff810d0cd8>] ? >>>>> console_unlock.part.19+0x228/0x2a0 >>>>> [19174.369827] [<ffffffff810e343c>] ? internal_add_timer+0x6c/0x90 >>>>> [19174.369827] [<ffffffff810e58d9>] ? mod_timer+0xf9/0x200 >>>>> [19174.369827] [<ffffffff814c2de0>] >>>>> do_unblank_screen.part.22+0xa0/0x180 >>>>> [19174.369827] [<ffffffff814c2f0c>] do_unblank_screen+0x4c/0x80 >>>>> [19174.369827] [<ffffffffc05ec4ba>] ? 
backref_cache_cleanup+0xea/0x100 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffff814c2f50>] unblank_screen+0x10/0x20 >>>>> [19174.369827] [<ffffffff813c3ccd>] bust_spinlocks+0x1d/0x40 >>>>> [19174.369827] [<ffffffff81019bd3>] oops_end+0x43/0x120 >>>>> [19174.369827] [<ffffffff8101a2f8>] die+0x58/0x90 >>>>> [19174.369827] [<ffffffff8101642d>] do_trap+0xcd/0x160 >>>>> [19174.369827] [<ffffffff810167e6>] do_error_trap+0xe6/0x170 >>>>> [19174.369827] [<ffffffffc05ec4ba>] ? backref_cache_cleanup+0xea/0x100 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffff817dce0f>] ? __slab_free+0xee/0x234 >>>>> [19174.369827] [<ffffffff817dce0f>] ? __slab_free+0xee/0x234 >>>>> [19174.369827] [<ffffffffc05baf0e>] ? clear_state_bit+0xae/0x170 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffffc05ba67a>] ? free_extent_state+0x6a/0xd0 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffff810172e0>] do_invalid_op+0x20/0x30 >>>>> [19174.369827] [<ffffffff817f24ee>] invalid_op+0x1e/0x30 >>>>> [19174.369827] [<ffffffffc05ec1d9>] ? >>>>> free_backref_node.isra.36+0x19/0x20 [btrfs] >>>>> [19174.369827] [<ffffffffc05ec4ba>] ? backref_cache_cleanup+0xea/0x100 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffffc05ec43c>] ? backref_cache_cleanup+0x6c/0x100 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffffc05f25eb>] relocate_block_group+0x2cb/0x510 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffffc05f29e0>] >>>>> btrfs_relocate_block_group+0x1b0/0x2d0 [btrfs] >>>>> [19174.369827] [<ffffffffc05c6eab>] >>>>> btrfs_relocate_chunk.isra.75+0x4b/0xd0 [btrfs] >>>>> [19174.369827] [<ffffffffc05c82e8>] __btrfs_balance+0x348/0x460 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffffc05c87b5>] btrfs_balance+0x3b5/0x5d0 [btrfs] >>>>> [19174.369827] [<ffffffffc05d5cac>] btrfs_ioctl_balance+0x1cc/0x530 >>>>> [btrfs] >>>>> [19174.369827] [<ffffffff811b52e0>] ? handle_mm_fault+0xb0/0x160 >>>>> [19174.369827] [<ffffffffc05d7c7e>] btrfs_ioctl+0x69e/0xb20 [btrfs] >>>>> [19174.369827] [<ffffffff8120f5b5>] do_vfs_ioctl+0x75/0x320 >>>>> [19174.369827] [<ffffffff8120f8f1>] SyS_ioctl+0x91/0xb0 >>>>> [19174.369827] [<ffffffff817f098d>] system_call_fastpath+0x16/0x1b >>>>> [19174.369827] Code: 4e 8b 2c 23 eb cd 66 0f 1f 44 00 00 48 83 c4 28 >>>>> 5b 41 5c 41 5d 41 5e 41 5f 5d c3 90 be 00 10 00 00 4c 89 ef e8 a3 ee >>>>> ff ff eb c7 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f >>>>> 44 00 >>>>> [19174.369827] RIP [<ffffffff8106875f>] cpa_flush_array+0x10f/0x120 >>>>> [19174.369827] RSP <ffff880367b52cf8> >>>>> [19174.369827] ---[ end trace 60adc437bd944044 ]--- >>>>> >>>>> After a reboot and a remount it always tried to resume the balance and >>>>> and then crashed again, so I had to be quick to do a "btrfs balance >>>>> cancel". Then I started the scrub and got these uncorrectable errors I >>>>> mentioned in the first mail. >>>>> >>>>> I just unmounted it and started a btrfsck. Will post the output when >>>>> it's >>>>> done. 
>>>>> It's already showing me several of these: >>>>> >>>>> checksum verify failed on 18523667709952 found C240FB11 wanted 1ED6A587 >>>>> checksum verify failed on 18523667709952 found C240FB11 wanted 1ED6A587 >>>>> checksum verify failed on 18523667709952 found 5EAB6BFE wanted BA48D648 >>>>> checksum verify failed on 18523667709952 found 8E19F60E wanted E3A34D18 >>>>> checksum verify failed on 18523667709952 found C240FB11 wanted 1ED6A587 >>>>> bytenr mismatch, want=18523667709952, have=10838194617263884761 >>>>> >>>>> >>>>> Thanks, >>>>> Tobias >>>>> >>>>> >>>>> >>>>> 2015-05-28 4:49 GMT+02:00 Qu Wenruo <quwenruo@xxxxxxxxxxxxxx>: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -------- Original Message -------- >>>>>> Subject: Uncorrectable errors on RAID6 >>>>>> From: Tobias Holst <tobby@xxxxxxxx> >>>>>> To: linux-btrfs@xxxxxxxxxxxxxxx <linux-btrfs@xxxxxxxxxxxxxxx> >>>>>> Date: 2015年05月28日 10:18 >>>>>> >>>>>>> Hi >>>>>>> >>>>>>> I am doing a scrub on my 6-drive btrfs RAID6. Last time it found zero >>>>>>> errors, but now I am getting this in my log: >>>>>>> >>>>>>> [ 6610.888020] BTRFS: checksum error at logical 478232346624 on dev >>>>>>> /dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2 >>>>>>> [ 6610.888025] BTRFS: checksum error at logical 478232346624 on dev >>>>>>> /dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2 >>>>>>> [ 6610.888029] BTRFS: bdev /dev/dm-2 errs: wr 0, rd 0, flush 0, >>>>>>> corrupt >>>>>>> 1, >>>>>>> gen 0 >>>>>>> [ 6611.271334] BTRFS: unable to fixup (regular) error at logical >>>>>>> 478232346624 on dev /dev/dm-2 >>>>>>> [ 6611.831370] BTRFS: checksum error at logical 478232346624 on dev >>>>>>> /dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2 >>>>>>> [ 6611.831373] BTRFS: checksum error at logical 478232346624 on dev >>>>>>> /dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2 >>>>>>> [ 6611.831375] BTRFS: bdev /dev/dm-2 errs: wr 0, rd 0, flush 0, >>>>>>> corrupt >>>>>>> 2, >>>>>>> gen 0 >>>>>>> [ 6612.396402] BTRFS: unable to fixup (regular) error at logical >>>>>>> 478232346624 on dev /dev/dm-2 >>>>>>> [ 6904.027456] BTRFS: checksum error at logical 478232346624 on dev >>>>>>> /dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2 >>>>>>> [ 6904.027460] BTRFS: checksum error at logical 478232346624 on dev >>>>>>> /dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2 >>>>>>> [ 6904.027463] BTRFS: bdev /dev/dm-2 errs: wr 0, rd 0, flush 0, >>>>>>> corrupt >>>>>>> 3, >>>>>>> gen 0 >>>>>>> >>>>>>> Looks like it is always the same sector. >>>>>>> >>>>>>> "btrfs balance status" shows me: >>>>>>> scrub status for a34ce68b-bb9f-49f0-91fe-21a924ef11ae >>>>>>> scrub started at Thu May 28 02:25:31 2015, running for >>>>>>> 6759 >>>>>>> seconds >>>>>>> total bytes scrubbed: 448.87GiB with 14 errors >>>>>>> error details: read=8 csum=6 >>>>>>> corrected errors: 3, uncorrectable errors: 11, unverified >>>>>>> errors: >>>>>>> 0 >>>>>>> >>>>>>> What does it mean and why are these erros uncorrectable even on a >>>>>>> RAID6? >>>>>>> Can I find out, which files are affected? >>>>>> >>>>>> >>>>>> >>>>>> If it's OK for you to put the fs offline, >>>>>> btrfsck is the best method to check what happens, although it may take >>>>>> a >>>>>> long time. >>>>>> >>>>>> There is a known bug that replace can cause checksum error, found by >>>>>> Zhao >>>>>> Lei. >>>>>> So did you run replace while there is still some other disk I/O >>>>>> happens? 
>>>>>> >>>>>> Thanks, >>>>>> Qu >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> system: Ubuntu 14.04.2 >>>>>>> kernel version 4.0.4 >>>>>>> btrfs-tools version: 4.0 >>>>>>> >>>>>>> Regards >>>>>>> Tobias >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe >>>>>>> linux-btrfs" >>>>>>> in >>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>> >>>>>> >>> > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html
