Re: strange i/o errors with btrfs on raid/lvm

On Fri, Sep 25, 2015 at 2:26 PM, Jogi Hofmüller <jogi@xxxxxx> wrote:

> That was right while the RAID was in degraded state and rebuilding.

On the guest:

Aug 28 05:17:01 vm kernel: [140683.741688] BTRFS info (device vdc):
disk space caching is enabled
Aug 28 05:17:13 vm kernel: [140695.575896] BTRFS warning (device vdc):
block group 13988003840 has wrong amount of free space
Aug 28 05:17:13 vm kernel: [140695.575901] BTRFS warning (device vdc):
failed to load free space cache for block group 13988003840, rebuild
it now
Aug 28 05:17:13 vm kernel: [140695.626035] BTRFS warning (device vdc):
block group 17209229312 has wrong amount of free space
Aug 28 05:17:13 vm kernel: [140695.626039] BTRFS warning (device vdc):
failed to load free space cache for block group 17209229312, rebuild
it now
Aug 28 05:17:13 vm kernel: [140695.683517] BTRFS warning (device vdc):
block group 20430454784 has wrong amount of free space
Aug 28 05:17:13 vm kernel: [140695.683521] BTRFS warning (device vdc):
failed to load free space cache for block group 20430454784, rebuild
it now
Aug 28 05:17:13 vm kernel: [140695.822818] BTRFS warning (device vdc):
block group 68211965952 has wrong amount of free space
Aug 28 05:17:13 vm kernel: [140695.822822] BTRFS warning (device vdc):
failed to load free space cache for block group 68211965952, rebuild
it now



On the host there are no messages corresponding to that time, but a
bit over an hour and a half later the SAS error messages start, along
with the first reported write error.

I see the rebuild event starting:

Aug 28 07:04:23 host mdadm[2751]: RebuildStarted event detected on md
device /dev/md/0

But there are still subsequent SAS errors, including hard resets of
the link, and additional read errors. This happens more than once...

And then:

Aug 28 17:06:49 host mdadm[2751]: RebuildFinished event detected on md
device /dev/md/0, component device  mismatches found: 2048 (on raid
level 10)
Aug 28 17:06:49 host mdadm[2751]: SpareActive event detected on md
device /dev/md/0, component device /dev/sdd1

and also a number of SMART warnings about seek errors on another
device:

Aug 28 17:35:55 host smartd[3146]: Device: /dev/sda [SAT], SMART Usage
Attribute: 7 Seek_Error_Rate changed from 180 to 179
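
If you want to keep an eye on that drive, smartmontools can dump the
full attribute table and run a self-test. Something like:

  # full SMART report, including Seek_Error_Rate (attribute 7)
  smartctl -x /dev/sda
  # queue a long self-test, then read the result once it's done
  smartctl -t long /dev/sda
  smartctl -l selftest /dev/sda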


So it sounds like more than one problem: two drives, or maybe even a
controller problem. I can't really tell, as there are a lot of
messages.
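
A per-device view might help narrow down drive vs. controller; Btrfs
keeps its own error counters per device, readable with (the mount
point is just an example):

  # read/write/flush/corruption/generation error counts per device
  btrfs device stats /mnt/point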

But 2048 mismatches found after a rebuild is a problem. It means
there is already some discrepancy in the mdadm raid10. And mdadm
raid1 (or 10) cannot resolve mismatches, because which block is
correct is ambiguous. So something is definitely going to get
corrupted. Btrfs can recover metadata from that if the metadata
profile is DUP, but data can't be recovered. Normally this results in
an explicit Btrfs message about a checksum mismatch it can't fix,
which still reports the path to the affected file. But I'm not
finding that.
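
If such a message had been logged on the guest it should be
greppable, and on the host the array can be re-checked to see whether
the mismatch count changes. Roughly (md0 standing in for however
/dev/md/0 shows up under /sys/block):

  # on the guest: any checksum failures Btrfs has logged
  dmesg | grep -i 'csum failed'
  # on the host: start a read-only consistency check of the array...
  echo check > /sys/block/md0/md/sync_action
  # ...and read the mismatch count once it finishes
  cat /sys/block/md0/md/mismatch_cnt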

Anyway, once the hardware errors are resolved, try doing a scrub on
the file system, maybe during a period of reduced usage, to make sure
data and metadata are OK.
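
Something like (mount point again just an example):

  btrfs scrub start /mnt/point
  # poll progress and the final error summary
  btrfs scrub status /mnt/point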

And then you could also manually reset the free space cache by
unmounting and then remounting with -o clear_cache. This is a
one-time thing; the option doesn't need to be carried over to
subsequent mounts.
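
Roughly, assuming the guest's /dev/vdc is mounted at /mnt/point (made
up for the example):

  umount /mnt/point
  mount -o clear_cache /dev/vdc /mnt/point
  # subsequent mounts can go back to the normal mount options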



-- 
Chris Murphy