On Tue, Feb 23, 2016 at 11:22:47PM +0000, Duncan wrote:
> Forgot to mention, tho you're probably already considering it, if this is
> the same raid5-backed btrfs you were complaining about being slow in the
> other thread,
No, that's another one :)
This one was remade from scratch after the filesystem on it got
corrupted.
5 x 4TB swraid5 + 64GB SSD
bcache
dmcrypt
btrfs
SMART is 100% for all 5 drives, and they passed an extensive test before
I built the new raid and filesystem on them.
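(For context, the kind of test I mean is a SMART long self-test plus a
full surface scan per drive, along the lines of:

    smartctl -t long /dev/sdX   # then review results with smartctl -a
    badblocks -sv /dev/sdX      # read-only surface scan

with /dev/sdX standing in for each of the five members; the exact
invocations here are illustrative.)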
> and considering redoing with bcache to an ssd added, as
> seems very likely, if it /is/ actually storage device or bus errors, that
> could be one reason the previous one was getting so slow... Maybe it
> wasn't btrfs after all.
Good thinking, although in this case, it's a different filesystem.
This filesystem is, however, behind a SATA port multiplier with a
2-meter cable to an external disk array.
As a result, bandwidth to it is going to be slow-ish, and the long cable
could be introducing I/O errors.
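(The negotiated link speed behind the PMP shows up at boot and is worth
a look, e.g.:

    dmesg | grep -i 'SATA link up'

which prints lines like "ata5: SATA link up 3.0 Gbps"; the port number
here is just an example.)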
On Tue, Feb 23, 2016 at 11:17:06PM +0000, Duncan wrote:
> I believe all formal documentation of what the error counters actually
> mean is developer-level -- "Trust the Source, Luke."
Haha, I know that one :)
Although, to be fair, I was more offering for someone to tell me what
they're supposed to mean, and then I'd update the wiki to capture that
info.
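(For anyone following along, the counters in question are the
per-device ones, e.g.:

    # btrfs device stats /mnt/array
    [/dev/mapper/crypt_array].write_io_errs    212
    [/dev/mapper/crypt_array].read_io_errs     0
    [/dev/mapper/crypt_array].flush_io_errs    0
    [/dev/mapper/crypt_array].corruption_errs  0
    [/dev/mapper/crypt_array].generation_errs  0

where the mountpoint, device name, and numbers are all illustrative.)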
> Yet another point supporting the "btrfs is still stabilizing, not yet
> fully stable" position, I suppose, as it could definitely be argued that
> those counters and their visibility, including display in the kernel log
> at mount time, are definitely intended to be consumed at the admin-user
> level, and that it follows that they should be documented at the admin-
> user level before the filesystem can properly be defined as fully stable.
Yes :) and I'm happy to help make this a reality, in the wiki at least.
> Write error counter increments should be accompanied by kernel log events
> telling you more -- what level of the device stack is returning the
> errors that propagate up to the filesystem level, for instance. Expected
> would be either bus level timeouts and resets, or storage device errors.
I agree, and I get 0 such errors here, which is why it's weird.
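(For reference, the kind of kernel log evidence I'd expect alongside
write error increments looks roughly like:

    ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    ata5.01: failed command: WRITE FPDMA QUEUED
    ata5.01: hard resetting link

with "ata5.01" being the PMP-port style numbering; these lines are from
memory and purely illustrative. I see nothing of the sort.)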
> If it's storage device errors, SMART data should show increasing raw
> value relocated sectors or the like (smartctl -A). If it's bus errors,
Correct, and they are all at 0.
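(i.e., a check along the lines of:

    for d in /dev/sd{a..e}; do
        smartctl -A "$d" | grep -Ei 'realloc|pending|uncorrect'
    done

shows 0 in the RAW_VALUE column for every drive; /dev/sd{a..e} are
placeholders for the five array members.)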
> it could be bad cabling (bad connections or bad shielding, or using
> SATA-150 certified cables for SATA-600 or some such), or, as I saw on an
Cabling is indeed a likely culprit; I'm just surprised that, if that's
the case, the SATA layer is showing me nothing (I'm running tail -f
/var/log/kern.log, and I'd usually see SATA or PMP errors there).
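(If the cable were flaky, I'd also expect SMART attribute 199,
UDMA_CRC_Error_Count, to climb, since it counts cable-induced CRC errors
and persists across reboots. Checking it is roughly:

    for d in /dev/sd{a..e}; do smartctl -A "$d" | grep -i crc; done

again with /dev/sd{a..e} as placeholders for the array members.)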
> old and failing mobo (when I pulled it there were bulging and some
> exploded capacitors) a few years ago, failing filter-capacitors on the
> mobo signalling paths. Bad power, including the possibility of an
> overloaded UPS that hit one guy I know, is notorious for both this sort
> of issue and memory problems, as well.
All true, but wouldn't all of these also be reported as actual disk
errors by the underlying driver involved?
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901