On 2016-03-18 05:17, Duncan wrote:
Pete posted on Thu, 17 Mar 2016 21:08:23 +0000 as excerpted:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
This one is available on ssds and spinning rust, and while it never
actually hit failure mode for me on an ssd I had that went bad, I watched
over some months as the raw reallocated sector count increased a bit at a
time. (The device was one of a pair with multiple btrfs raid1 on
parallel partitions on each, and the other device of the pair remains
perfectly healthy to this day, so I was able to use btrfs checksumming
and scrubs to keep the one that was going bad repaired based on the other
one, and was thus able to run it for quite some time after I would have
otherwise replaced it, simply continuing to use it out of curiosity and
to get some experience with how it and btrfs behaved when failing.)
In my case, it started at 253 cooked with 0 raw, then dropped to a
percentage (still 100 at first) as soon as the first sector was
reallocated (raw count of 1). It appears that your manufacturer treats
it as a percentage even at a raw count of 0.
What really surprised me was just how many spare sectors that ssd
apparently had. 512-byte sectors, so half a KiB each, and the raw count
was into the thousands of replaced sectors, so megabytes' worth, yet the
cooked count had only dropped to 85 or so by the time I got tired of
constantly scrubbing to keep it half working as more and more sectors
failed. And the threshold was 36, so despite thousands of replaced
sectors (megabytes of flash), I wasn't anywhere CLOSE to reported
failure here.
This actually makes sense, as SSDs have spare 'sectors' in
erase-block-size chunks, and most use a minimum 1 MiB erase block size,
with 4-8 MiB being normal for most consumer devices.
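
To put rough numbers on that (these are made-up figures for
illustration, not values from the smartctl output above), here's the
back-of-the-envelope math in Python:

  # Made-up example: a few thousand reallocated 512-byte sectors is only
  # a handful of MiB worth of flash.
  SECTOR_BYTES = 512
  raw_reallocated = 3000                  # hypothetical raw count

  used_mib = raw_reallocated * SECTOR_BYTES / 2**20
  print(f"~{used_mib:.1f} MiB worth of sectors remapped")   # ~1.5 MiB

  # Spare flash is provisioned in erase-block-sized chunks (commonly a
  # few MiB each), so the spare pool is far larger than the raw count
  # alone suggests.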
But the ssd was simply bad before its time, as it wasn't failing due to
write-cycle wear-out, but due to bad flash, plain and simple. With the
other device (and the one I replaced it with as well, I actually had
three of the same brand and size SSDs), there are still no replaced
sectors at all.
But apparently, when SSDs hit normal old age and start to go bad from
write-cycle failure, THAT is when those 128 MiB or so of replacement
sectors (as I calculated at one point from the percentage and raw
values, or was it 256 MiB, IDR for sure) start to be used. And on SSDs,
when that happens, sectors apparently fail and get replaced much faster
than I was seeing, so it's likely people will actually reach failure
mode on this attribute in that case.
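
For what it's worth, here's a sketch of that estimate, assuming the
cooked value really is a straight percentage of the spare pool remaining
(that's a guess; vendors don't document the mapping), and using purely
hypothetical numbers:

  # Estimate the total spare pool from how far the cooked value has
  # dropped, assuming it's a linear "percent of spares remaining".
  # Both inputs below are hypothetical.
  SECTOR_BYTES = 512
  cooked_now = 85             # normalized value after degradation
  raw_reallocated = 40_000    # hypothetical raw reallocated-sector count

  fraction_used = (100 - cooked_now) / 100             # 0.15
  spare_total_sectors = raw_reallocated / fraction_used
  spare_total_mib = spare_total_sectors * SECTOR_BYTES / 2**20
  print(f"~{spare_total_mib:.0f} MiB of spare sectors total")   # ~130 MiB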
I'd guess spinning rust has something less, maybe 64 MiB for multiple TB
of storage, instead of the 128 or 256 MiB I saw on my 256 GiB SSDs. That
would be because spinning rust's failure mode is typically different:
while a few sectors might die and be replaced over the life of the
device, typically it's not that many, and failure comes by some other
means, like mechanical failure (failing to spin up, or the read heads
drifting out of alignment tolerance with the cylinders).
7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 56166570022
Like the raw-read-error-rate attribute above, you're seeing minor
issues, as the raw number isn't 0. In this case the cooked value is
obviously dropping significantly as well, but it's still within
tolerance, so it's not failing yet. That worst cooked value of 60 is
getting fairly close to the threshold of 30, however, so this one's
definitely showing wear, just not failure... yet.
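
One caveat on that raw number: on Seagate drives (which this attribute
list looks like), Seek_Error_Rate's raw value is often reported to be a
packed field rather than a plain error count. I can't confirm that for
this particular model, but assuming that layout, a quick split looks
like this:

  # Hypothetical decode, assuming the commonly-reported Seagate packing:
  # upper 16 bits = seek errors, lower 32 bits = total seeks.
  import math

  raw = 56166570022
  errors = raw >> 32            # 13
  seeks  = raw & 0xFFFFFFFF     # 331,995,174

  print(errors, seeks, round(-10 * math.log10(errors / seeks)))   # 13 331995174 74

If that reading is right, 13 seek errors in ~332 million seeks works out
to roughly the current cooked value of 73 under the commonly cited
-10*log10(errors/seeks) interpretation, which would make the cooked
value a logarithmic rate rather than a countdown. Either way, the worst
of 60 against a threshold of 30 is still worth keeping an eye on.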
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 22098
Reasonable for a middle-aged drive, considering you obviously don't shut
it down often (a start-stop-count raw of 80-something). That's ~2.5
years of power-on.
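
Spelled out, since the raw value is in hours:

  # 22098 power-on hours in years:
  print(22098 / 24 / 365.25)   # ~2.52 years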
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
This one goes with spin-up time. Absolutely no problems here.
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 83
Matches start-stop-count. Good. =:^) Since you obviously don't spin
down except at power-off, this one isn't going to be a problem for you.
184 End-to-End_Error 0x0032 098 098 099 Old_age Always FAILING_NOW 2
I /think/ this one is a power-on self-test seek of the heads from one
side of the device to the other and back, covering both directions.
I believe you're correct about this, although I've never seen any
definitive answer anywhere.
Assuming I'm correct in the above guess, the combination of this failing
for you, plus the not-yet-failing but non-zero raw values for
raw-read-error-rate and seek-error-rate (with the latter's cooked value
significantly down even if not yet failing), is definitely concerning,
as all three values have to do with head seeking errors.
I'd definitely get your data onto something else as soon as possible,
tho since much of it is backups, you're not in too bad a shape even if
you lose them, as long as you don't lose the working copy at the same
time.
But with all three seek attributes indicating at least some issue and one
failing, at least get anything off it that is NOT backups ASAP.
And that very likely explains the slowdowns as well, as obviously, while
all sectors are still readable, it's having to retry multiple times on
some of them, and that WILL slow things down.
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 8590065669
Again, a non-zero raw value indicating command timeouts, probably due to
those bad seeks. It'll have to retry those commands, and that'll
definitely mean slowdowns.
Tho there's no threshold here, 99 worst-value cooked isn't horrible.
FWIW, on my spinning rust device this value actually shows a worst of
001 (with a current cooked value of 100, tho), and a threshold of zero.
But as I've experienced no problems with it, I'd guess that's an
aberration. I haven't the foggiest why/how/when it got that 001 worst.
Such an occurrence is actually not unusual when you have particularly
bad sectors on a 'desktop' rated HDD, as they will keep retrying for an
insanely long time to read the bad sector before giving up.
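
Coming back to that eye-popping raw value of 8590065669: some vendors
(Seagate in particular) are often said to pack attribute 188 into three
16-bit counters rather than one big number. That's an assumption, not
documentation, but split that way it looks like this:

  # Hypothetical split of the Command_Timeout raw into three 16-bit
  # counters (field meanings unknown); 8590065669 == 0x000200020005.
  raw = 8590065669
  fields = [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]
  print(fields)   # [2, 2, 5] -- a handful of timeouts, not 8.6 billion

If that's the right reading, it lines up much better with the 100/099
cooked values than a literal count of billions would.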
189 High_Fly_Writes 0x003a 095 095 000 Old_age Always - 5
Again, this demonstrates a bit of disk wobble or head slop. But with a
threshold of zero and a value and worst of 95, it doesn't seem to be too
bad.
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 287836
Interesting. My spinning rust has the exact same value and worst of 1,
threshold 0, and a relatively similar 237181 raw count.
But I don't really know what this counts unless it's actual seeks, and
mine seems in good health still, certainly far better than the cooked
value and worst of 1 might suggest.
As far as I understand it, this is an indicator of the number of times
the heads have been loaded and unloaded. This is tracked separately as
there are multiple reasons the heads might get parked without spinning
down the disk (most disks will park them if they've been idle, so that
they reduce the risk of a head crash, and many modern laptops will park
them if they detect that they're in free fall to protect the disk when
they impact whatever they fall onto). It's not unusual to see values
like that for similarly aged disks either though, so it's not too worrying.
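
As a rough illustration of why the cooked value bottoms out at 1 even on
a drive that still seems healthy: desktop drives are typically rated for
a few hundred thousand load/unload cycles, and the normalized value
seems to count down as that budget gets used. The 300,000 figure below
is a typical datasheet rating, not this drive's actual spec, and the
mapping clearly isn't strictly linear (Duncan's drive already reports 1
at a raw count of 237181), so treat this as a sketch only:

  # Guess at the normalized Load_Cycle_Count, assuming it tracks the
  # fraction of a typical 300,000-cycle load/unload rating remaining.
  RATED_CYCLES = 300_000
  raw = 287836
  remaining = max(1, round(100 * (1 - raw / RATED_CYCLES)))
  print(f"~{remaining}% of the rated load/unload budget left")   # ~4%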
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 281032595099550
OK, head flying hours explains it, drive is over 32 billion years old...
While my spinning rust has this attribute and the cooked values are
identical 100/253/0, the raw value is reported and formatted entirely
differently, as 21122 (89 19 0). I don't know what those extra values
are, but presumably your big long raw value reports the same fields mine
does, only packed into one big combined number.
Which would explain the apparent multi-billion years yours is reporting!
=:^) It's not a single value, it's multiple values somehow combined.
At least with my power-on hours of 23637, a head-flying hours of 21122
seems reasonable. (I only recently configured the BIOS to spin down that
drive after 15 minutes I think, because it's only backups and my media
partition which isn't mounted all the time anyway, so I might as well
leave it off instead of idle-spinning when I might not use it for days at
a time. So a difference of a couple thousand hours between power-on and
head-flying, on a base of 20K+ hours for both, makes sense given that I
only recently configured it to spin down.)
But given your ~22K power-on hours, even simply peeling off the first 5
digits of your raw value would be 28K head-flying, and that doesn't make
sense for only 22K power-on, so obviously they're using a rather more
complex formula than that.
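
Purely as a guess at how those values might be combined: if the 48-bit
raw is a packed field with the hour count in the low 32 bits (smartctl's
-v option exists for applying vendor-specific raw formats in cases like
this, though I don't know the right one for this drive), the split comes
out looking surprisingly sane. It could also, of course, just be
garbage.

  # Hypothetical decode of the Head_Flying_Hours raw, assuming hours in
  # the low 32 bits and some other counter in the high bits.  This
  # layout is a guess, not vendor documentation.
  raw = 281032595099550
  hours     = raw & 0xFFFFFFFF   # 20382
  high_bits = raw >> 32          # 65433 -- meaning unknown
  print(hours, high_bits)
  # 20382 head-flying hours against 22098 power-on hours would at least
  # be plausible; then again it could simply be bogus, as noted below.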
This one is tricky, as it's not very clearly defined in the SMART spec.
Most manufacturers just count the total time the head has been loaded.
There are, however, some who count the time the heads have been loaded
multiplied by the number of heads. Even allowing for that, this value
still appears to be incorrect: combined with the Power_On_Hours, it
implies well over 1024 heads, which is physically impossible on even a
5.25 inch disk using modern technology, even with multiple spindles.
this is so blatantly wrong should be a red flag regarding the disk
firmware or on-board electronics, which just reinforces what Duncan
already said about getting a new disk.