Re: "Corrected" errors persist after scrubbing

Chris Murphy posted on Mon, 08 May 2017 13:26:16 -0600 as excerpted:

> On Sat, May 6, 2017 at 4:33 AM, Tom Hale <tom@xxxxxxx> wrote:
>> Below (and also attached because of formatting) is an example of `btrfs
>> scrub` incorrectly reporting that errors have been corrected.
>>
>> In this example, /dev/md127 is the device created by running:
>> mdadm --build /dev/md0 --level=faulty --raid-devices=1 /dev/loop0
>>
>> The filesystem is RAID1.
>>
>> # mdadm --grow /dev/md0 --layout=rp400
>> layout for /dev/md0 set to 12803
>> # btrfs scrub start -Bd /mnt/tmp
>> scrub device /dev/md127 (id 1) done
>>         scrub started at Fri May  5 19:23:54 2017 and finished after
>> 00:00:01
>>         total bytes scrubbed: 200.47MiB with 8 errors
>>         error details: read=8
>>         corrected errors: 8, uncorrectable errors: 0, unverified errors: 248
>> scrub device /dev/loop1 (id 2) done
>>         scrub started at Fri May  5 19:23:54 2017 and finished after
>> 00:00:01
>>         total bytes scrubbed: 200.47MiB with 0 errors
>> WARNING: errors detected during scrubbing, corrected
>> # ### But the errors haven't really been corrected, they're still there:
>> # mdadm --grow /dev/md0 --layout=clear # Stop producing additional errors
>> layout for /dev/md0 set to 31
>> # btrfs scrub start -Bd /mnt/tmp
>> scrub device /dev/md127 (id 1) done
>>         scrub started at Fri May  5 19:24:24 2017 and finished after
>> 00:00:00
>>         total bytes scrubbed: 200.47MiB with 8 errors
>>         error details: read=8
>>         corrected errors: 8, uncorrectable errors: 0, unverified errors: 248
>> scrub device /dev/loop1 (id 2) done
>>         scrub started at Fri May  5 19:24:24 2017 and finished after
>> 00:00:00
>>         total bytes scrubbed: 200.47MiB with 0 errors
>> WARNING: errors detected during scrubbing, corrected
>> #
> 
> 
> What are the complete kernel messages for the scrub event? This should
> show what problem Btrfs detects and how it fixes it, and what sectors
> it's fixing each time.
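For reference, one minimal way to capture exactly those messages around a scrub run (the mount point is taken from the quoted example; `dmesg --clear` needs root, adjust to taste):

```shell
# Sketch: isolate the kernel messages a single scrub generates.
# /mnt/tmp is the mount point from the quoted example.
dmesg --clear                    # empty the kernel ring buffer first (needs root)
btrfs scrub start -Bd /mnt/tmp   # foreground scrub with per-device stats
dmesg | grep -i btrfs            # only the btrfs lines logged during the scrub
```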

I'm also wondering what versions of the kernel and btrfs-progs are being
used here, for two reasons.
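Both are quick to check (the commands are standard, though the exact output format varies by distribution):

```shell
# Report the running kernel and the installed btrfs-progs version.
uname -r          # running kernel release
btrfs --version   # btrfs-progs version string
```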

First:

AFAIK newer code shouldn't report unverified errors at all.  These were
originally reported for blocks whose checksums lived in blocks that were
themselves bad.  IOW, the lower branches of the tree couldn't be checked
because the higher branches above them were still being repaired.

Back then, in order to fix such errors, one had to run multiple passes
manually, until there were no more unverified errors; each pass fixed
errors at one level so the levels below it could actually be checked on
the next pass.
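A rough sketch of that manual loop, for the curious.  The mount point is an assumption, and the grep pattern is keyed to the scrub summary format shown in the quoted output, which may differ between btrfs-progs versions:

```shell
#!/bin/sh
# Rerun scrub until a pass reports no unverified errors.
MNT=/mnt/tmp   # assumed mount point, from the quoted example
while :; do
    out=$(btrfs scrub start -Bd "$MNT" 2>&1)
    # Sum the "unverified errors: N" counts across all devices.
    n=$(printf '%s\n' "$out" \
        | grep -o 'unverified errors: [0-9]*' \
        | awk '{s += $3} END {print s + 0}')
    echo "unverified errors this pass: $n"
    [ "$n" -eq 0 ] && break
done
```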

I know because I had a pair of ssds where I deliberately kept an ssd
that was going bad in the btrfs raid1 pair, in order to see how things
worked over time.  So I got quite a bit of experience running and
rerunning scrubs until all the errors were corrected after multiple
passes!

But newer versions catch that problem and I believe actually use the
second copy for the verifications, so as long as there are no
uncorrectable errors, there should be no unverified errors either.
(Either that or they do multiple passes automatically, like I used to
do manually.  I'm not sure which, except that the former should be
simpler and faster, so I suspect that's what's done.)

So the fact that unverified errors are reported hints to me that the
versions in use may be old.  Either that, or it's a different mechanism
generating the unverified count, one I'm not familiar with, that doesn't
get corrected from the other copy or via multiple passes like the ones
I have experience with do.  But that's why I say "hint".

Second:

AFAIK there was a short period around kernel 4.10 and the early 4.11-rcs
where read errors were indeed not being corrected properly.  To my
knowledge this affected normal operation, not scrub, but perhaps certain
scrub cases were affected as well.

AFAIK this problem is entirely fixed in the 4.11 release, and presumably
in the 4.10 stable series, and I don't believe 4.8 and earlier were
affected at all.  I'm not sure about 4.9; I /think/ it predates the
regression, but some 4.9-stable releases /might/ be affected.  Whatever
the OP is running /might/ just fall in that gap.  It would take a dev,
or someone following those specific patches more closely than I did, to
know specifically what's affected and thus say for sure, if the OP is
running something in the 4.9 or 4.10 range, or an early 4.11-rc, but not
the final 4.11 release.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



