Re: Scrub aborts on newer kernels

Back to the original email...



On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead <twhitehead@xxxxxxxxx> wrote:
> Under the last several kernels versions (4.6 and I believe 4.4 and, 4.5) btrfs scrub aborts before completing.
>
> If I boot back into an older kernel (4.1 or 4.3, not sure about 4.2) then it runs to completion without any issues.
>
> Steps to reproduce:
>
> 1 - make a raid1 system
> 2 - run with only one disk for awhile to introduce inconsistency
> 3 - add the other disk back and run btrfs scrub
>
> The newer kernels will get part way through the scrub and then die.  For example, with 4.6
>
> # btrfs scrub status -dR /
> scrub status for 61267e7b-e8e3-43e1-99f3-40cb2b004a6a
> scrub device /dev/sda3 (id 1) history
>         scrub started at Thu May 26 10:59:31 2016 and was aborted after 00:02:23
>         data_extents_scrubbed: 256140
>         tree_extents_scrubbed: 35016
>         data_bytes_scrubbed: 14865694720
>         tree_bytes_scrubbed: 573702144
>         read_errors: 0
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 2032
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 16004874240
> scrub device /dev/sdb3 (id 2) history
>         scrub started at Thu May 26 10:59:31 2016 and was aborted after 00:02:35
>         data_extents_scrubbed: 256139
>         tree_extents_scrubbed: 35016
>         data_bytes_scrubbed: 14865690624
>         tree_bytes_scrubbed: 573702144
>         read_errors: 0
>         csum_errors: 205
>         verify_errors: 24
>         no_csum: 2032
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 229
>         last_physical: 15984951296

The no_csum count is not unusual on its own: things are often set nodatacow with chattr +C; for example, newer versions of systemd make that the default for systemd-journald logs.

But this 2nd device has both verify_errors and csum_errors, and together they add up exactly to corrected_errors (205 + 24 = 229) before the abort. I think that's odd, and it's a lot of errors.

Also odd is that the abort doesn't happen at exactly the same time on both
devices. Maybe that's explained by the corrections on the 2nd device taking
an extra 12 seconds? But correcting 229 4 KiB blocks shouldn't take anywhere
near 12 seconds, for any reason I can think of.
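Back-of-envelope, assuming the usual 4 KiB block size, the amount of data actually rewritten by those corrections is tiny:

```shell
# 229 corrected blocks at 4 KiB each -- total data rewritten
blocks=229
block_size=4096

echo $(( blocks * block_size ))   # prints 937984 -- under 1 MiB, nowhere near 12 seconds of I/O
```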



> The kernel logs show nothing other than the standard "no csum found for inode ..." and "parent transid verify failed ..." messages

Maybe also include btrfs check output for the volume, using btrfs-progs 4.4.1 or 4.5.3.

>
> Then booting back into 4.3 and rerunning the scrub.
>
> # btrfs scrub start -BdR /
> scrub device /dev/sda3 (id 1) done
>         scrub started at Thu May 26 11:43:00 2016 and finished after 00:56:25
>         data_extents_scrubbed: 6939254
>         tree_extents_scrubbed: 68269
>         data_bytes_scrubbed: 426809974784
>         tree_bytes_scrubbed: 1118519296
>         read_errors: 0
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 62895
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 482390048768
> scrub device /dev/sdb3 (id 2) done
>         scrub started at Thu May 26 11:43:00 2016 and finished after 00:58:41
>         data_extents_scrubbed: 6939240
>         tree_extents_scrubbed: 68118
>         data_bytes_scrubbed: 426809335808
>         tree_bytes_scrubbed: 1116045312
>         read_errors: 0
>         csum_errors: 1051510
>         verify_errors: 0
>         no_csum: 62767
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 1051510
>         last_physical: 482390048768
> WARNING: errors detected during scrubbing, corrected


OK, and now it's over one million corrections on a single device; the
other one isn't affected.
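For scale, assuming 4 KiB blocks, that correction count works out to a few GB of rewritten data, roughly 1% of everything scrubbed on that device:

```shell
# 1,051,510 corrected blocks on /dev/sdb3 in the 4.3 run, at 4 KiB each
corrected=1051510
block_size=4096
data_scrubbed=426809335808   # data_bytes_scrubbed from the report above

bytes_fixed=$(( corrected * block_size ))
echo "$bytes_fixed"                                 # prints 4306984960 -- about 4.3 GB rewritten
echo $(( bytes_fixed * 100 / data_scrubbed ))       # prints 1 -- roughly 1% of the data scrubbed
```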

I know btrfs dev stats are cumulative; I forget whether scrub stats are.
If they are, that's a bit confusing. But either way, lifetime or one
time, a million corrections is crazy unless this is intentional,
deliberately hammering on Btrfs's self-healing abilities. That's a good
test, but not good in-production behavior.

So I think there are two problems. First, why are there so many errors
in the first place? And second, why does fixing them cause an abort on
new kernels? You might have found a bug/regression that isn't being
caught in testing, if the test volumes don't have some unknown minimum
number of csum errors. See what I'm getting at?



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



