Back to the original email...

On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead <twhitehead@xxxxxxxxx> wrote:
> Under the last several kernel versions (4.6, and I believe 4.4 and 4.5) btrfs scrub aborts before completing.
>
> If I boot back into an older kernel (4.1 or 4.3, not sure about 4.2) then it runs to completion without any issues.
>
> Steps to reproduce:
>
> 1 - make a raid1 system
> 2 - run with only one disk for a while to introduce inconsistency
> 3 - add the other disk back and run btrfs scrub
>
> The newer kernels will get part way through the scrub and then die. For example, with 4.6
>
> # btrfs scrub status -dR /
> scrub status for 61267e7b-e8e3-43e1-99f3-40cb2b004a6a
> scrub device /dev/sda3 (id 1) history
>         scrub started at Thu May 26 10:59:31 2016 and was aborted after 00:02:23
>         data_extents_scrubbed: 256140
>         tree_extents_scrubbed: 35016
>         data_bytes_scrubbed: 14865694720
>         tree_bytes_scrubbed: 573702144
>         read_errors: 0
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 2032
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 16004874240
> scrub device /dev/sdb3 (id 2) history
>         scrub started at Thu May 26 10:59:31 2016 and was aborted after 00:02:35
>         data_extents_scrubbed: 256139
>         tree_extents_scrubbed: 35016
>         data_bytes_scrubbed: 14865690624
>         tree_bytes_scrubbed: 573702144
>         read_errors: 0
>         csum_errors: 205
>         verify_errors: 24
>         no_csum: 2032
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 229
>         last_physical: 15984951296

no_csum is not unusual, as there are often things set with xattr +C (nodatacow); for example, this is now the default with newer versions of systemd for systemd-journald logs. But this 2nd device has verify_errors and csum_errors, which together add up to the same value as corrected_errors (205 + 24 = 229), before the abort. I think that's odd. It's a lot of errors.

Also odd is that the abort doesn't happen at exactly the same time for both devices (00:02:23 vs 00:02:35); maybe that's explained by it taking 12 seconds for the corrections to happen on the 2nd device? But 229 4KiB blocks being corrected wouldn't take 12 seconds, at least not for any reason I can think of.

> The kernel logs show nothing other than the standard "no csum found for inode ..." and "parent transid verify failed ..." messages

Maybe include a btrfs check for the volume, using btrfs-progs 4.4.1 or 4.5.3.
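Something like this, run offline (e.g. from live media) since check wants the filesystem unmounted; /dev/sda3 is just taken from your scrub output, either device should do:

# btrfs check /dev/sda3

That's read-only by default, so it won't modify anything, and it would help separate the two questions: whether the metadata trees themselves are damaged, versus whether scrub is aborting on a filesystem that check considers consistent.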
> Then booting back into 4.3 and rerunning the scrub.
>
> # btrfs scrub start -BdR /
> scrub device /dev/sda3 (id 1) done
>         scrub started at Thu May 26 11:43:00 2016 and finished after 00:56:25
>         data_extents_scrubbed: 6939254
>         tree_extents_scrubbed: 68269
>         data_bytes_scrubbed: 426809974784
>         tree_bytes_scrubbed: 1118519296
>         read_errors: 0
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 62895
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 482390048768
> scrub device /dev/sdb3 (id 2) done
>         scrub started at Thu May 26 11:43:00 2016 and finished after 00:58:41
>         data_extents_scrubbed: 6939240
>         tree_extents_scrubbed: 68118
>         data_bytes_scrubbed: 426809335808
>         tree_bytes_scrubbed: 1116045312
>         read_errors: 0
>         csum_errors: 1051510
>         verify_errors: 0
>         no_csum: 62767
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 1051510
>         last_physical: 482390048768
> WARNING: errors detected during scrubbing, corrected

OK, and now it's over one million corrections for a single device, while the other one isn't affected. I know btrfs dev stats are cumulative; I forget if scrub stats are. If they are, that's a bit confusing. But in any case, lifetime or one-time, a million corrections is crazy unless this is intentional, i.e. trying to hammer on Btrfs's self-healing abilities. Good test. Not good in-production behavior, though.

So I think there are two problems. The first is why there are so many errors in the first place. The second is why fixing them causes an abort with new kernels. You might have found a bug/regression that isn't being caught by testing, if the test volumes don't have some unknown minimum number of csum errors. See what I'm getting at?
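For what it's worth, the dev stats side of the cumulative-vs-per-run question is easy to pin down; something like this (the mount point is just the / you scrubbed):

# btrfs device stats /
# btrfs device stats -z /

The first form prints the per-device lifetime error counters; the second prints them and then resets them to zero. Zero them, run another scrub, and see whether the counters grow only by what that one scrub reports.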
--
Chris Murphy