On 2018/12/30 8:48 AM, Tomáš Metelka wrote:
> Ok, I've got it:-(
>
> But just a few questions: I've tried (with btrfs-progs v4.19.1) to
> recover files through btrfs restore -s -m -S -v -i ... and following
> events occurred:
>
> 1) Just 1 "hard" error:
> ERROR: cannot map block logical 117058830336 length 1073741824: -2
> Error copying data for /mnt/...
> (file which absence really doesn't pain me:-))
This means one data extent can't be recovered due to a missing chunk mapping.
Not impossible for a heavily damaged fs, but nothing serious.
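For reference, a minimal sketch of what that lookup amounts to (not the
actual btrfs-progs code; the structs and names below are made up): restore
has to translate the logical address into a device offset through the chunk
mapping, and the -2 is just -ENOENT from that lookup.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical chunk-map entry: which logical byte range maps to which
 * physical device offset.  Real btrfs keeps this in the chunk tree. */
struct chunk {
	uint64_t logical;
	uint64_t length;
	uint64_t physical;
};

/* Translate a logical address to a physical offset.  If no chunk covers
 * the range (e.g. the chunk item was lost), fail with -ENOENT (-2),
 * which is what "cannot map block logical ... : -2" reports. */
static int map_logical(const struct chunk *chunks, int nr,
		       uint64_t logical, uint64_t *physical)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (logical >= chunks[i].logical &&
		    logical < chunks[i].logical + chunks[i].length) {
			*physical = chunks[i].physical +
				    (logical - chunks[i].logical);
			return 0;
		}
	}
	return -ENOENT;
}

int main(void)
{
	/* Made-up chunk map that does not cover the failing logical address. */
	const struct chunk chunks[] = {
		{ 13631488ULL, 1073741824ULL, 13631488ULL },
	};
	uint64_t phys;

	printf("lookup returned %d\n",
	       map_logical(chunks, 1, 117058830336ULL, &phys));
	return 0;
}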
>
> 2) For 24 files I got the "too many loops" warning (I mean this: "if
> (loops >= 0 && loops++ >= 1024) { ..."). I always answered yes, but
> I'm afraid these files are corrupted (at least 2 of them seem corrupted).
>
> How bad is this?
I'm not sure, but I don't think restore is robust enough for such a case.
It may just be a false alert.
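For what it's worth, the check you quoted is just a loop guard; here is a
self-contained paraphrase (not the exact btrfs-progs code; the prompt text
and helper are made up) of what answering yes actually does:

#include <stdio.h>

/* Paraphrase of the guard around restore's extent-walking loop: after 1024
 * iterations on one file it suspects it is cycling through damaged metadata
 * and asks whether to keep going.  Answering yes only resets the counter
 * and keeps walking; nothing gets repaired, so such files can still come
 * out truncated or corrupted. */
static int keep_going(const char *path)
{
	int c;

	printf("Looping a lot on %s, keep going? (y/N): ", path);
	c = getchar();
	return c == 'y' || c == 'Y';
}

int main(void)
{
	int loops = 0;
	int i;

	for (i = 0; i < 5000; i++) {	/* stand-in for the extent walk */
		if (loops >= 0 && loops++ >= 1024) {
			if (!keep_going("some/file"))
				break;
			loops = 0;
		}
		/* ... one extent would be copied here ... */
	}
	return 0;
}

So answering yes is harmless by itself, but it is no guarantee that the
resulting file is intact.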
> Does the error mentioned in #1 mean that it's the
> only file which is totally lost?
It's not even totally lost, as it's just one file extent; the rest of the file may be OK.
Thanks,
Qu
> I can live without those 24 + 1 files
> so if #1 and #2 were the only errors, then I could say the recovery
> was successful ... but I'm afraid things aren't that easy:-)
>
> Thanks
> M.
>
>
> Tomáš Metelka
> Business & IT Analyst
>
> Tel: +420 728 627 252
> Email: tomas.metelka@xxxxxxxxxxx
>
>
>
> On 24. 12. 18 15:19, Qu Wenruo wrote:
>>
>>
>> On 2018/12/24 9:52 PM, Tomáš Metelka wrote:
>>> On 24. 12. 18 14:02, Qu Wenruo wrote:
>>>> btrfs check --readonly output please.
>>>>
>>>> btrfs check --readonly is always the most reliable and detailed output
>>>> for any possible recovery.
>>>
>>> This is very weird because it prints only:
>>> ERROR: cannot open file system
>>
>> A new place to enhance ;)
>>
>>>
>>> I've tried also "btrfs check -r 75152310272" but it only says:
>>> parent transid verify failed on 75152310272 wanted 2488742 found 2488741
>>> parent transid verify failed on 75152310272 wanted 2488742 found 2488741
>>> Ignoring transid failure
>>> ERROR: cannot open file system
>>>
>>> I've tried that because:
>>> backup 3:
>>> backup_tree_root: 75152310272 gen: 2488741 level: 1
>>>
>>>> Also kernel message for the mount failure could help.
>>>
>>> Sorry, my fault, I should have started from this point:
>>>
>>> Dec 23 21:59:07 tisc5 kernel: [10319.442615] BTRFS: device fsid
>>> be557007-42c9-4079-be16-568997e94cd9 devid 1 transid 2488742 /dev/loop0
>>> Dec 23 22:00:49 tisc5 kernel: [10421.167028] BTRFS info (device loop0):
>>> disk space caching is enabled
>>> Dec 23 22:00:49 tisc5 kernel: [10421.167034] BTRFS info (device loop0):
>>> has skinny extents
>>> Dec 23 22:00:50 tisc5 kernel: [10421.807564] BTRFS critical (device
>>> loop0): corrupt node: root=1 block=75150311424 slot=245, invalid NULL
>>> node pointer
>> This explains the problem.
>>
>> Your root tree has one node pointer which is not correct.
>> A node pointer should never point to 0.
>>
>> This is pretty weird; it's a corruption pattern I have never seen before.
>>
>> Since your tree root got corrupted, there isn't much we can do except
>> try to use older tree roots.
>>
>> You could try all the backup roots, starting from the newest backup (with
>> the highest generation), and check each backup root bytenr using:
>> # btrfs check -r <backup root bytenr> <device>
>>
>> to see which one gets the fewest errors, but normally the chance is near 0.
>>
>>> Dec 23 22:00:50 tisc5 kernel: [10421.807653] BTRFS error (device loop0):
>>> failed to read block groups: -5
>>> Dec 23 22:00:50 tisc5 kernel: [10421.877001] BTRFS error (device loop0):
>>> open_ctree failed
>>>
>>>
>>> So I tried to do:
>>> 1) btrfs inspect-internal dump-super (with the snippet posted above)
>>> 2) btrfs inspect-internal dump-tree -b 75150311424
>>>
>>> And it showed (header + snippet for items 243-248):
>>> node 75150311424 level 1 items 249 free 244 generation 2488741 owner 2
>>> fs uuid be557007-42c9-4079-be16-568997e94cd9
>>> chunk uuid dbe69c7e-2d50-4001-af31-148c5475b48b
>>> ...
>>> key (14799519744 EXTENT_ITEM 4096) block 233423224832 (14247023) gen 2484894
>>> key (14811271168 EXTENT_ITEM 135168) block 656310272 (40058) gen 2488049
>>
>>
>>> key (1505328190277054464 UNKNOWN.4 366981796979539968) block 0 (0) gen 0
>>> key (0 UNKNOWN.0 1419267647995904) block 6468220747776 (394788864) gen 7786775707648
>>
>> Pretty obviously, these two slots are garbage.
>> Something corrupted the memory at runtime, and we don't have a runtime
>> check against such corruption yet.
>>
>> So IMHO the problem is that some kernel code, either btrfs or some other
>> part, corrupted the memory.
>> Btrfs then failed to detect it and wrote the block back to disk, and the
>> kernel only caught the problem later, when it read the tree block back
>> from disk.
>>
>> I could add such a check for nodes, but it would normally need
>> CONFIG_BTRFS_FS_CHECK_INTEGRITY, so it makes little sense for normal users.
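>>
>> A rough sketch of what such a check could look like (just an illustration,
>> not a patch; the real code would have to use the btrfs item accessors):
>>
>> #include <stdint.h>
>>
>> /* Each slot of an internal node is a (key, blockptr, generation) triple.
>>  * A blockptr of 0 can never be valid (that is the "invalid NULL node
>>  * pointer" the kernel complained about), and neither can a generation of
>>  * 0 or one newer than the superblock's, so a block like this could be
>>  * rejected before it is ever written back to disk. */
>> struct key_ptr {
>> 	uint64_t objectid;
>> 	uint8_t  type;
>> 	uint64_t offset;
>> 	uint64_t blockptr;
>> 	uint64_t generation;
>> };
>>
>> static int check_node_ptrs(const struct key_ptr *ptrs, int nritems,
>> 			   uint64_t sb_generation)
>> {
>> 	int i;
>>
>> 	for (i = 0; i < nritems; i++) {
>> 		if (ptrs[i].blockptr == 0)
>> 			return -1;	/* invalid NULL node pointer */
>> 		if (ptrs[i].generation == 0 ||
>> 		    ptrs[i].generation > sb_generation)
>> 			return -1;	/* impossible generation */
>> 	}
>> 	return 0;
>> }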
>>
>>> key (12884901888 EXTENT_ITEM 24576) block 816693248 (49847) gen 2484931
>>> key (14902849536 EXTENT_ITEM 131072) block 75135844352 (4585928) gen 2488739
>>>
>>>
>>> I looked at those numbers for quite a while (also in hex), trying to figure
>>> out what had happened (bit flips (it was on an SSD), byte shifts (I also
>>> suspected a bad CPU ... because it died 2 months after that)), and tried to
>>> guess "correct" values for those items ... but no idea:-(
>>
>> I'm not so sure about that; unless you're super lucky (or rather unlucky
>> in this case), it would normally get caught by the csum first.
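>>
>> To illustrate: a random bit flip on disk changes the block's crc32c, so the
>> read would fail the csum check long before any key is looked at.  A toy
>> demo (bitwise crc32c; illustration only, not the btrfs code path):
>>
>> #include <stdint.h>
>> #include <stdio.h>
>> #include <string.h>
>>
>> /* Bitwise crc32c (Castagnoli polynomial), for demonstration only. */
>> static uint32_t crc32c(uint32_t crc, const uint8_t *data, size_t len)
>> {
>> 	crc = ~crc;
>> 	while (len--) {
>> 		int i;
>>
>> 		crc ^= *data++;
>> 		for (i = 0; i < 8; i++)
>> 			crc = (crc >> 1) ^ (0x82F63B78U & -(crc & 1U));
>> 	}
>> 	return ~crc;
>> }
>>
>> int main(void)
>> {
>> 	uint8_t block[4096];
>> 	uint32_t before, after;
>>
>> 	memset(block, 0xAB, sizeof(block));
>> 	before = crc32c(0, block, sizeof(block));
>>
>> 	block[2048] ^= 0x01;		/* one flipped bit "on disk" */
>> 	after = crc32c(0, block, sizeof(block));
>>
>> 	printf("csum 0x%08x -> 0x%08x (%s)\n", before, after,
>> 	       before == after ? "flip undetected" : "flip detected");
>> 	return 0;
>> }
>>
>> Garbage keys that still pass the csum point at the block having been
>> corrupted in memory before the csum was computed at write time, which
>> matches the earlier guess that something corrupted the memory at runtime.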
>>
>>>
>>> So this is why I asked about that log_root and whether there is a
>>> chance to "log-replay things":-)
>>
>> For your case, definitely not related to log replay.
>>
>> Thanks,
>> Qu
>>
>>>
>>>
>>> Thanks
>>> M.
>>
