

At 05:42 PM 4/7/2007, david@xxxxxxx wrote:
On Sat, 7 Apr 2007, Ron wrote:

The reality is that all modern HDs are so good that it's actually quite rare for someone to suffer a data loss event. The consequences of such are so severe that the event stands out more than just the statistics would imply. For those using small numbers of HDs, HDs just work.

OTOH, for those of us doing work that involves DBMSs and relatively large numbers of HDs per system, both the math and the RW conditions of service require us to pay more attention to quality details.
Like many things, one can decide on one of multiple ways to "pay the piper".

a= The choice made by many, for instance in the studies mentioned, is to minimize initial acquisition cost and operating overhead and simply accept having to replace HDs more often.

b= For those in fields where this is not a reasonable option (financial services, health care, etc), or for those literally using 100s of HDs per system (where failures are statistically so likely that TLC is required), policies and procedures like those mentioned in this thread (paying close attention to environment and use factors, sector remap detecting, rotating HDs into and out of roles based on age, etc) are necessary.

Anyone who does some close variation of "b" directly above =will= see the benefits of using better HDs.

At least in my supposedly unqualified anecdotal 25 years of professional experience.

Ron, why is it that you assume that anyone who disagrees with you doesn't work in an environment where they care about the datacenter environment, and aren't in fields like financial services? and why do you think that we are just trying to save a few pennies? (the costs do factor in, but it's not a matter of pennies, it's a matter of tens of thousands of dollars)
I don't assume that. I didn't make any assumptions. I (rightfully IMHO) criticized everyone jumping on the "See, cheap =is= good!" bandwagon that the Google and CMU studies seem to have ignited w/o thinking critically about them. I've never mentioned or discussed specific financial amounts, so you're making an (erroneous) assumption when you think my concern is over people "trying to save a few pennies".

In fact, "saving pennies" is at the =bottom= of my priority list for the class of applications I've been discussing. I'm all for economical, but to paraphrase Einstein "Things should be as cheap as possible; but no cheaper."

My biggest concern is that something I've seen over and over again in my career will happen again: People tend to jump at the _slightest_ excuse to believe a story that will save them short term money and resist even _strong_ reasons to pay up front for quality. Even if paying more up front would lower their lifetime TCO.

The Google and CMU studies are =not= based on data drawn from businesses where the lesser consequences of an outage are losing $10Ks or $100K per minute... ...and where the greater consequences include the chance of loss of human life. Nor are they based on businesses that must rely exclusively on highly skilled and therefore expensive labor.

In the case of the CMU study, people are even extrapolating an economic conclusion the original author did not even make or intend! Is it any wonder I'm expressing concern regarding inappropriate extrapolation of those studies?

I actually work in the financial services field, and I do have a good datacenter environment that's well cared for.

while I don't personally maintain machines with hundreds of drives each, I do maintain hundreds of machines with a small number of drives in each, and a handful of machines with a few dozens of drives. (the database machines are maintained by others, I do see their failed drives however)

it's also true that my experience is only over the last 10 years, so I've only been working with a few generations of drives, but my experience is different from yours.

my experience is that until the drives get to be 5+ years old the failure rate seems to be about the same for the 'cheap' drives as for the 'good' drives. I won't say that they are exactly the same, but they are close enough that I don't believe that there is a significant difference.

in other words, these studies do seem to match my experience.
Fine. Let's pretend =You= get to build Citibank's or Humana's next mission critical production DBMS using exclusively HDs with 1 year warranties.
(never would be allowed ITRW)

Even if you RAID 6 them, I'll bet you anything that a system with 32+ HDs on it is likely enough to spend a high enough percentage of its time operating in degraded mode that you are likely to be looking for a job as a consequence of such a decision. ...and if you actually suffer data loss or, worse, data corruption, that's a Career Killing Move.
(and it should be given the likely consequences to the public of such a F* up).
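
A rough sketch of the arithmetic behind the degraded-mode concern, not a measurement. It assumes drive failures are independent and that every drive shares one annual failure rate (AFR). The function names, the AFR, and the rebuild time are hypothetical parameters for illustration; none of them come from this thread or from the Google or CMU studies.

```python
HOURS_PER_YEAR = 365 * 24


def p_any_failure(n_drives: int, afr: float) -> float:
    """Chance that at least one of n_drives fails within a year.

    Assumes independent failures and one shared annual failure rate.
    """
    return 1.0 - (1.0 - afr) ** n_drives


def degraded_fraction(n_drives: int, afr: float, rebuild_hours: float) -> float:
    """Rough expected fraction of a year spent rebuilding (degraded).

    Expected failures per year times hours per rebuild, over hours per year.
    Ignores overlapping rebuilds, so it is only sensible for small values.
    """
    return n_drives * afr * rebuild_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical inputs: 32 drives, two assumed AFRs, a 24-hour rebuild.
    for afr in (0.02, 0.08):
        print(
            f"AFR {afr:.0%}: "
            f"P(>=1 failure/yr) = {p_any_failure(32, afr):.2f}, "
            f"degraded fraction = {degraded_fraction(32, afr, 24.0):.4f}"
        )
```

Both quantities grow with drive count and with AFR, which is why the per-drive failure rate matters more as arrays get larger.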

this is why, when I recently had to create some large capacity arrays, I'm only ending up with machines with a few dozen drives in them instead of hundreds. I've got two machines with 6TB of disk, one with 8TB, one with 10TB, and one with 20TB. I'm building these systems for ~$1K/TB for the disk arrays. other departments who choose $bigname 'enterprise' disk arrays are routinely paying 50x that price.

I am very sure that they are not getting 50x the reliability; I'm sure that they aren't getting 2x the reliability.
...and I'm very sure they are being gouged mercilessly by vendors who are padding their profit margins exorbitantly at the customer's expense. HDs or memory from the likes of EMC, HP, IBM, or Sun have been overpriced for decades. Unfortunately, for every one of me who shops around for good vendors there are 20+ corporate buyers who keep on letting themselves get gouged. Gouging is not going to stop until the gouge prices are unacceptable to enough buyers.

Now if the issue of price difference is based on =I/O interface= (SAS vs SATA vs FC vs SCSI), that's a different, and orthogonal, issue. The simple fact is that optical interconnects are far more expensive than anything else and that SCSI electronics cost significantly more than anything except FC.
There's gouging here as well, but far more of the pricing is justified.

I believe that the biggest cause for data loss from people using the 'cheap' drives is due to the fact that one 'cheap' drive holds the capacity of 5 or so 'expensive' drives, and since people don't realize this they don't realize that the time to rebuild the failed drive onto a hot-spare is correspondingly longer.
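
The rebuild-time point can be put in numbers with a minimal sketch. It assumes the rebuild is limited by a fixed sustained write rate to the hot-spare, so time scales linearly with capacity. The capacities and the throughput below are hypothetical, chosen only to show the 5x ratio mentioned above.

```python
def rebuild_hours(capacity_gb: float, rebuild_mb_per_s: float) -> float:
    """Hours to rewrite a whole drive at a fixed sustained rate.

    Uses decimal units (1 GB = 1000 MB) and ignores controller overhead
    and competing production I/O, both of which lengthen real rebuilds.
    """
    seconds = capacity_gb * 1000.0 / rebuild_mb_per_s
    return seconds / 3600.0


if __name__ == "__main__":
    # Hypothetical drives: a small 'expensive' one and a 5x larger 'cheap' one,
    # both rebuilt at the same assumed 50 MB/s.
    small = rebuild_hours(capacity_gb=100, rebuild_mb_per_s=50)
    large = rebuild_hours(capacity_gb=500, rebuild_mb_per_s=50)
    print(f"small: {small:.2f} h, large: {large:.2f} h, ratio: {large / small:.1f}x")
```

At equal throughput the larger drive spends five times as long exposed to a second failure during the rebuild.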
Commodity HDs get 1 year warranties for the same reason enterprise HDs get 5+ year warranties: the vendor's confidence that they are not going to lose money honoring the warranty in question.

AFAIK, there is no correlation between capacity of HDs and failure rates or warranties on them.

Your point regarding using 2 cheaper systems in parallel instead of 1 gold plated system is in fact an expression of a basic Axiom of Systems Theory with regard to Single Points of Failure. Once components become cheap enough, it is almost always better to have redundancy rather than all one's eggs in 1 heavily protected basket.
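
The redundancy argument can be illustrated with the standard availability formula for parallel components. This sketch assumes the systems fail independently and that any single one can carry the full load. The availability figures are hypothetical and not taken from any vendor or study.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant systems, each up `single` of the time.

    The set is down only when every copy is down at once, which assumes
    independent failures (no shared power, network, or operator error).
    """
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical: one 'gold plated' system vs. two cheaper ones in parallel.
    gold_plated = parallel_availability(0.999, copies=1)
    two_cheap = parallel_availability(0.99, copies=2)
    print(f"one gold plated: {gold_plated:.6f}")
    print(f"two cheaper:     {two_cheap:.6f}")
```

Under these assumptions the pair of cheaper systems comes out ahead, but the independence assumption carries the result: a shared single point of failure removes most of the gain.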

Frankly, the only thing that made me feel combative is when someone claimed there's no difference between anecdotal evidence and a professional opinion or advice.
That's just so utterly unrealistic as to defy belief.
No one would ever get anything done if every business decision had to wait on properly designed and executed lab studies.

It's also insulting to everyone who puts in the time and effort to be a professional within a field rather than a lay person.

Whether there's a name for it or not, there's definitely an important distinction between each of anecdote, professional opinion, and study result.

Ron Peacetree
