Re: Btrfs check reports errors, filesystem seems fine

On 2017-07-14 20:26, Filippe LeMarchand wrote:
So, my options are
a) Delete and re-create the subvolume
b) Try btrfs check --repair --mode original (if original mode is the default, it already didn't help)

In that case --repair won't help for now.

c) Do nothing and wait for further update

Further update plan includes:
c) Update btrfs check --repair to handle your case.
   This will take some time for us to test and other guys to review.

d) Create a special-purpose btrfs-corrupt-block patch for your image.
   This will fix your fs, but only your fs.
   Not a generic solution, but at least it should work.

For now, it's recommended to back up important data, in case both c) and d) fail.

Thanks,
Qu

In a letter from Friday, July 14, 2017 15:11:05 MSK user Qu Wenruo wrote:

On 2017-07-14 20:04, Filippe LeMarchand wrote:
Currently, a possible solution may be to delete the whole subvolume.
Can btrfs send (to an external drive) followed by btrfs receive back fix it? Or should I use simple cp/rsync?

You could try it if you have a backup.

Personally speaking, I'm not sure whether it will work or make things worse.
Such a hash and name mismatch is really rare; I don't know how kernel send
will handle it.

If you have a full backup, then you could try it.
It is my root subvolume (sensitive data is on other ones), thus it is expendable. Can btrfs check --repair damage other subvolumes?

Unfortunately, it may corrupt other subvolumes.
But judging from your fsck output, the chance of corruption is not that
high AFAIK.

I recommend backing up the other good subvolumes/snapshots using send and
receive, just in case.


Any idea about the reproducer? Or just random memory corruption?
No idea why and no idea when. This partition is about a year and a half old, and I ran btrfs check for the first time only about a month ago.
Also I ran memtest recently and it didn't find any errors.

Well, that's common.
I'll focus on checking your dump result to make a special-purpose
btrfs-corrupt-block to fix your situation if no other method works for you.

Thanks,
Qu


In a letter from Friday, July 14, 2017 14:28:58 MSK user Qu Wenruo wrote:

On 2017-07-14 18:12, Filippe LeMarchand wrote:
The first "rm" on deprecated.txt worked, but the file is still there. Neither the file nor its parent directory can be deleted:

$ sudo rm /usr/share/doc/packages/util-linux/deprecated.txt
rm: cannot remove '/usr/share/doc/packages/util-linux/deprecated.txt': No such file or directory

$ sudo rm -rf /usr/share/doc/packages/util-linux/
rm: cannot remove '/usr/share/doc/packages/util-linux/': Directory not empty

$ sudo ls -l /usr/share/doc/packages/util-linux/
ls: cannot access '/usr/share/doc/packages/util-linux/deprecated.txt': No such file or directory
total 0
-????????? ? ? ? ?            ? deprecated.txt

We have also seen similar behavior with a manually crafted image in our
environment.

Su Yue has sent patches to enhance error detection, along with a test case
for it, but repair is not supported yet.


Reinstalling the util-linux package gives me two copies of that file (two files are also present in the previous snapshot):

$ ls -l /usr/share/doc/packages/util-linux/
total 104
-rw-r--r-- 1 root root 18092 Jul 20  2016 COPYING
-rw-r--r-- 1 root root  1391 Jul 20  2016 COPYING.BSD-3
-rw-r--r-- 1 root root 26530 Jul 20  2016 COPYING.LGPLv2.1
-rw-r--r-- 1 root root  1824 Jul 20  2016 COPYING.UCB
-rw-r--r-- 1 root root   555 Jul 20  2016 README.licensing
-rw-r--r-- 1 root root  3257 Jul 20  2016 blkid.txt
-rw-r--r-- 1 root root  2264 Jul 20  2016 cal.txt
-rw-r--r-- 1 root root  1913 Jul 20  2016 col.txt
-rw-r--r-- 1 root root  2825 May  2 13:17 deprecated.txt
-rw-r--r-- 1 root root  2825 May  2 13:17 deprecated.txt
-rw-r--r-- 1 root root   992 Jul 20  2016 getopt.txt
-rw-r--r-- 1 root root  2437 Nov  2  2016 howto-debug.txt
-rw-r--r-- 1 root root   148 Jul 20  2016 hwclock.txt
-rw-r--r-- 1 root root  2617 Jul 20  2016 modems-with-agetty.txt
-rw-r--r-- 1 root root   522 Jul 20  2016 mount.txt
-rw-r--r-- 1 root root   448 Jul 20  2016 pg.txt

So, is this situation actually dangerous? And what can I do to gather more information for you?

The situation won't get worse. I'd recommend not taking any snapshots of
those subvolumes (4546 and 5134), to limit the corruption to those
subvolumes.

However, there is no easy way to fix it yet.

Currently, a possible solution may be to delete the whole subvolume.
If no further error happens, it may be fixed.

IIRC btrfs check --repair in original mode has a
DIR_ITEM/DIR_INDEX/INODE_REF repair function, but I'm not sure if it can
handle this well.
Btrfs check --repair *MAY* fix it, or it may make things worse.
If you have a full backup, then you could try it.
Otherwise, don't try it at all.

Another solution is a specific repair program just for your case.
We can modify btrfs-corrupt-block to just delete the corrupted DIR_ITEM
(the ".sxt" one) and the related DIR_INDEX/INODE_REF.
But I'll only choose this if you really need to fix it as soon as possible.

At least we have a solution for it.
I'm more concerned about how this happened.

Any idea about the reproducer? Or just random memory corruption?

Thanks,
Qu

In a letter from Friday, July 14, 2017 9:11:06 MSK user Qu Wenruo wrote:
Thanks for your dump.

The direct cause of the problem is now clear.

It's one corrupted DIR_ITEM causing the problem.
Furthermore, original mode btrfs check can't detect it; we will
fix that soon.

The corrupted DIR_ITEM is as follows:
	item 72 key (79177 DIR_ITEM 54846528) itemoff 12380 itemsize 88
		location key (4222342 INODE_ITEM 0) type FILE
		transid 170929 data_len 0 name_len 14
		name: deprecated.sxt
		location key (13590433 INODE_ITEM 0) type FILE
		transid 796448 data_len 0 name_len 14
		name: deprecated.txt

For dir inode 79177, the DIR_ITEM lists 2 child inodes, named
"deprecated.sxt" (ino=4222342) and "deprecated.txt" (ino=13590433).

But something goes wrong here:

1) Hash of "deprecated.sxt" doesn't match 54846528

2) The inode backref of inode 4222342 says its filename is "deprecated.txt".
This is also captured by the dump:
	item 40 key (4222342 INODE_REF 79177) itemoff 7189 itemsize 24
		inode ref index 417 namelen 14 name: deprecated.txt

3) DIR_INDEX also shows that the filename for inode 4222342 should be
"deprecated.txt"
	item 87 key (79177 DIR_INDEX 417) itemoff 11757 itemsize 44
		location key (4222342 INODE_ITEM 0) type FILE
		transid 170929 data_len 0 name_len 14
		name: deprecated.txt

So generally speaking, the DIR_ITEM is wrong and is causing the problem.
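To make the cross-reference argument concrete, here is a minimal Python sketch that puts the three names recorded for inode 4222342 side by side, using only the values from the dump above. The `Entry` tuple and the majority vote in `odd_one_out` are illustrative assumptions for this example; this is not how btrfs check decides which item is wrong.

```python
from collections import namedtuple

# Hypothetical record type for this example only.
Entry = namedtuple("Entry", "kind ino name")

# Values copied from the dump above for inode 4222342 in dir 79177.
entries = [
    Entry("DIR_ITEM", 4222342, "deprecated.sxt"),
    Entry("INODE_REF", 4222342, "deprecated.txt"),
    Entry("DIR_INDEX", 4222342, "deprecated.txt"),
]


def odd_one_out(entries):
    """Return the entries whose name disagrees with the majority name."""
    names = [e.name for e in entries]
    majority = max(set(names), key=names.count)
    return [e for e in entries if e.name != majority]


for e in odd_one_out(entries):
    print(f"{e.kind} for inode {e.ino} disagrees: {e.name}")
# prints: DIR_ITEM for inode 4222342 disagrees: deprecated.sxt
```

Two of the three items agree on "deprecated.txt", which is why the DIR_ITEM is the suspect here.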

But the root cause is still unknown.

What I can see is that the corrupted DIR_ITEM points to a very old inode,
whose mtime dates back to 2016-09-07,
while the good DIR_ITEM points to a newer inode, whose mtime is just
2017-05-02.

Even weirder, there should not be two child inodes with the same
filename ("deprecated.txt"; I assume the sxt one is caused by memory
bit corruption).
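For anyone who wants to see exactly how the two names differ at the byte level, a small Python sketch follows. The helper name `byte_diffs` is made up for this example.

```python
def byte_diffs(a, b):
    """Return (offset, byte_a, byte_b, xor) for each differing byte."""
    xa, xb = a.encode(), b.encode()
    return [(i, x, y, x ^ y) for i, (x, y) in enumerate(zip(xa, xb)) if x != y]


for i, x, y, xor in byte_diffs("deprecated.txt", "deprecated.sxt"):
    bits = bin(xor).count("1")
    print(f"offset {i}: {x:#04x} vs {y:#04x}, xor {xor:#04x}, {bits} bit(s) differ")
# prints: offset 11: 0x74 vs 0x73, xor 0x07, 3 bit(s) differ
```

The names differ in a single byte, at offset 11 ('t' is 0x74, 's' is 0x73), and those two bytes differ in their three lowest bits.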

So, any details about operations performed on util-linux/deprecated.txt will
help us locate the root cause in the kernel.

Thanks,
Qu


On 2017-07-12 21:11, Filippe LeMarchand wrote:
Done, files added to same GDrive folder with corresponding names.
If it matters, subvol 4546 is my root filesystem (r/w snapshot created with snapper rollback), and 5134 is its snapshot.

In a letter dated Wednesday, July 12, 2017 15:44:52 MSK user Qu Wenruo wrote:

On 2017-07-12 19:12, Filippe LeMarchand wrote:
Maybe something went wrong in grep, which skipped "(79177"?
Yes, my bad. This time I used the grep -E "\(79177| 79177" pattern; the file on GDrive is updated.

It looks much better, thanks.


And btrfs check --mode=lowmem gives this:

checking extents
ERROR: extent[1609877700608, 94208] referencer count mismatch (root: 260, owner: 61720, offset: 6742016) wanted: 2, have: 5
ERROR: extent[1630301675520, 39583744] referencer count mismatch (root: 260, owner: 5847554, offset: 0) wanted: 36, have: 114
ERROR: extent[1658646986752, 10551296] referencer count mismatch (root: 274, owner: 283675, offset: 0) wanted: 2, have: 5
ERROR: extent[1672239132672, 84381696] referencer count mismatch (root: 274, owner: 2521382, offset: 0) wanted: 21, have: 25
ERROR: errors found in extent allocation tree or chunk allocation

This looks very much like an exposed lowmem mode bug.
Feel free to ignore these errors from the extent tree; they are just false
alerts.

checking free space cache
checking fs roots
ERROR: root 4546 DIR_ITEM[79177 54846528] relative INODE_REF missing namelen 14 filename deprecated.sxt filetype 1

The error report is much better than original mode, and that's what I need.

Now I can filter out all the other noise, as we know exactly which tree and
which DIR_ITEM/INODE_REF is causing the problem.

Would you please update the dump result with "-t 4546" passed to
btrfs-debug-tree like:

# btrfs-debug-tree -t 4546 <device>| grep 79177

The only change is the added "-t 4546", which restricts the dump to subvolume 4546.
As before, all 3 grep results (2 "deprecated" and one 79177) need to be
updated.

And it seems that my previous assumption is still right for this case.
If it's caused by the kernel, your dump would definitely help us locate
the problem.

ERROR: root 4546 INODE REF[4222342 79177] and DIR_ITEM[79177 54846528] mismatch namelen 14 filename deprecated.txt filetype 1
ERROR: root 5134 DIR_ITEM[79177 54846528] relative INODE_REF missing namelen 14 filename deprecated.sxt filetype 1

Also for root 5134 please.

Thanks,
Qu

ERROR: errors found in fs roots
Checking filesystem on /dev/sda2
UUID: 12c84aa3-ce65-4390-807e-a72cc8a7445e
found 153429872640 bytes used, error(s) found
total csum bytes: 121991672
total tree bytes: 1940160512
total fs tree bytes: 1683767296
total extent tree bytes: 103841792
btree space waste bytes: 310722480
file data blocks allocated: 842455031808
      referenced 159286636544

In a letter from Wednesday, July 12, 2017 10:15:18 MSK user Qu Wenruo wrote:
Sorry for the late reply.

After investigating the dumps, I found the output is quite strange.

1) Mismatching output.
In "btrfs-debug-tree-grep-79177.txt" I found 79177 only as the offset of
INODE_REF items; 79177 as the objectid of DIR_ITEM/DIR_INDEX does not
appear at all.

Whereas "btrfs-debug-tree-grep-deprecated-txt.txt" does contain the expected
79177 DIR_ITEM/DIR_INDEX.

Maybe something went wrong in grep, which skipped "(79177"?

2) Mismatched hash
The main problem I found is that, for key (79177 DIR_ITEM 54846528), the
number 54846528 is the crc32c hash of the filename, and the key contains 2
items, one for "deprecated.txt" and one for "deprecated.sxt".

But we found that 54846528 only matches the hash of "deprecated.txt",
not "deprecated.sxt".

I think that's the main problem.
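The hash comparison can be tried outside of btrfs with a short Python sketch. It implements a plain table-driven CRC-32C and assumes that btrfs computes the name hash as a raw crc32c (no final inversion) seeded with ~1, i.e. 0xFFFFFFFE. That seed and the helper names `crc32c_raw` and `name_hash` are assumptions of this example, so treat the printed values as something to compare against the dump rather than as authoritative.

```python
def _make_table():
    """Build the lookup table for the reflected CRC-32C (Castagnoli) polynomial."""
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_raw(seed, data):
    """Raw CRC-32C update over data, with no initial or final bit inversion."""
    crc = seed & 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


def name_hash(name):
    """Hash a filename. ASSUMPTION: the seed is ~1 (0xFFFFFFFE)."""
    return crc32c_raw(0xFFFFFFFE, name.encode())


if __name__ == "__main__":
    for name in ("deprecated.txt", "deprecated.sxt"):
        print(name, name_hash(name))
```

If the seed assumption holds, the value printed for "deprecated.txt" can be compared with the key offset 54846528 from the dump. Whatever the seed, the two names cannot hash to the same value, because they differ only within a single byte and a 32-bit CRC always detects such a short difference.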

BTW, would you please try "btrfs check --mode=lowmem" to see if lowmem
mode reports a similar error (the output may differ)?

If lowmem mode also reports an error on this DIR_ITEM, I'm pretty sure
that's the problem.

However, it may take some time before we can fix it in repair mode.

Thanks,
Qu



On 2017-07-04 21:24, Filippe LeMarchand wrote:
Sure, here it is:
https://drive.google.com/drive/folders/0B1ax9Am81gx9YjJBVVA0LXRHeGc

In a letter dated Tuesday, July 4, 2017 16:16:36 MSK user Lu Fengqi wrote:
On Mon, Jul 03, 2017 at 08:34:52AM +0800, Qu Wenruo wrote:


At 07/01/2017 07:59 PM, Filippe LeMarchand wrote:
Hello everyone.

I have a btrfs root partition on an Intel 530 SSD, which mounts without errors and seems to work fine,
but `btrfs check` gives me the following output (and --repair doesn't remove the errors):

enabling repair mode
Checking filesystem on /dev/sda2
UUID: 12c84aa3-ce65-4390-807e-a72cc8a7445e
checking extents
Fixed 0 roots.
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref

This means that the dir whose inode number is 79177 has a child inode
pointer pointing to deprecated.sxt.

But it has no dir index and no corresponding inode ref, which breaks
the cross-reference rule of btrfs.
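When there are many such lines, it can help to pull the fields apart programmatically. The following is a minimal Python sketch that parses one "unresolved ref" line as it appears in the output above. The regular expression is written against the lines quoted in this thread only, and `parse_unresolved` is a made-up helper name; other btrfs-progs versions may format the line differently.

```python
import re

# Pattern written against the "unresolved ref" lines quoted in this thread.
PATTERN = re.compile(
    r"unresolved ref dir (?P<dir>\d+) index (?P<index>\d+) "
    r"namelen (?P<namelen>\d+) name (?P<name>\S+) "
    r"filetype (?P<filetype>\d+) errors (?P<errors>\w+), (?P<reasons>.+)"
)


def parse_unresolved(line):
    """Parse one 'unresolved ref' line into a dict, or return None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("dir", "index", "namelen", "filetype"):
        rec[key] = int(rec[key])
    # The 'errors' field is kept as a string; its numeric base is not assumed.
    rec["reasons"] = [r.strip() for r in rec["reasons"].split(",")]
    return rec


line = ("unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt "
        "filetype 1 errors 6, no dir index, no inode ref")
rec = parse_unresolved(line)
print(rec["dir"], rec["name"], rec["reasons"])
# prints: 79177 deprecated.sxt ['no dir index', 'no inode ref']
```

The listed reasons are exactly the two missing cross references described above.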

Would you please run the following command to dump needed info for us to
debug?

# btrfs-debug-tree /dev/sda2 | grep 79177 -C 10

and

# btrfs-debug-tree /dev/sda2 | grep deprecated.sxt -C 10

and

# btrfs-debug-tree /dev/sda2 | grep deprecated.txt -C 10


Considering the output has both .txt and .sxt, I think that's the problem.
But such a bit flip should be detected by the tree block csum.
I'm not sure what's wrong with it.

Thanks,
Qu

	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref
	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref
	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref
	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref
	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref
	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref
	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref
	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
	unresolved ref dir 79177 index 0 namelen 14 name deprecated.sxt filetype 1 errors 6, no dir index, no inode ref
	unresolved ref dir 79177 index 417 namelen 14 name deprecated.txt filetype 1 errors 1, no dir item
checking csums
checking root refs
found 23421812736 bytes used err is 0
total csum bytes: 21531608
total tree bytes: 776650752
total fs tree bytes: 711278592
total extent tree bytes: 36798464
btree space waste bytes: 116002036
file data blocks allocated: 850546470912
        referenced 27611987968

Is it dangerous and what should I do about it?

I also tried --clear-space-cache, but it just removes the line about space cache.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

I'm afraid that your mail may be rejected because the attachment size
exceeds the limit (100kB) of the btrfs mailing list. Could you
share the attachment via Google Drive?

Lastly, since Qu is short on time, I will assist you with this issue.
