On 2016-07-06 14:45, Chris Murphy wrote:
On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
<ahferroin7@xxxxxxxxx> wrote:
On 2016-07-06 12:43, Chris Murphy wrote:
So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.
Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.
I think it's statistically 0 people changing this from the default. It's
people with drives that have no SCT ERC support, used in raid1+, who
happen to stumble upon this very obscure workaround to avoid link
resets in the face of media defects. Rare.
Not as much as you think; once someone has this issue, they usually put
preventative measures in place on any system where it applies. I'd be
willing to bet that most sysadmins at big companies like Red Hat or
Oracle are setting this.
2. People using single disks which have lots of errors may or may not see an
apparent degradation of performance, but will likely have the life
expectancy of their device extended.
Well, they have link resets and their file system presumably
face-plants as a result of a pile of commands in the queue returning as
unsuccessful. So they have premature death of their system, rather
than it getting sluggish. This is a long-standing indicator on Windows
to just reinstall the OS and restore data from backups -> the user has
an opportunity to freshen up their user data backup, and the reinstallation
and restore from backup result in freshly written sectors, which is
how bad sectors get fixed. The marginally bad sectors get new writes
and now read fast (or fast enough), and the persistently bad sectors
result in the drive firmware remapping them to reserve sectors.
The main thing, in my opinion, is less the extension of drive life than
the fact that the user gets to use the system, albeit sluggishly, to make
a backup of their data rather than possibly losing it.
The extension of the drive's lifetime is a nice benefit, but not what my
point was here. For people in this particular case, it will almost
certainly only make things better (although at first it may make
performance worse).
3. Individuals who are not setting this but should be will, on average, be no
worse off than before, other than seeing a bigger performance hit on a disk
error.
4. People with single disks which are new will see no functional change
until the disk has an error.
I follow.
In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
percentage over that (probably something like 25%, which would give roughly
10 seconds for the normal 7-second ERC timer).
2. If the device is actually a SCSI device, keep the 30-second timer (IIRC
this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the
expected disk timeout to account for things like latency outside of the
disk).
Well if it's a non-redundant configuration, you'd want those long
recoveries permitted, rather than enable SCT ERC. The drive has the
ability to relocate sector data on a marginal (slow) read that's still
successful. But clearly many manufacturers tolerate slow reads that
don't result in immediate reallocation or overwrite or we wouldn't be
in this situation in the first place. I think this auto-reallocation
is thwarted by enabling SCT ERC: the drive just flat-out gives up and
reports a read error. So it is still data loss in the non-redundant
configuration and thus not an improvement.
I agree, but if it's only the kernel doing this, then we can't make
judgements based on userspace usage. Also, the first situation, while
not optimal, is still better than what happens now; at least there you
will get an I/O error in a reasonable amount of time (as opposed to
after a really long time, if ever).
Basically it's:
For SATA and USB drives:
if data redundant, then enable short SCT ERC time if supported, if not
supported then extend SCSI command timer to 200;
if data not redundant, then disable SCT ERC if supported, and extend
SCSI command timer to 200.
For SCSI (SAS most likely these days), keep things the same as now.
But that's only because this is a rare enough configuration now that I
don't know if we really know the problems there. It may be that their
error recovery in 7 seconds is massively better and more reliable than
that of consumer drives over 180 seconds.
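Put as a rough sketch (assuming smartctl is available and that whatever
applies this already knows whether the data on the drive is redundant;
70 here is 7 seconds in smartctl's 100 ms units, and 200 is the extended
command timer discussed above):

import subprocess

def configure(dev, redundant, sct_erc_supported):
    # Redundant data: fail fast (7 s) so the RAID layer can repair from a copy.
    # Non-redundant data: disable SCT ERC so the drive keeps trying.
    if sct_erc_supported:
        value = "70,70" if redundant else "0,0"
        subprocess.run(["smartctl", "-q", "silent",
                        "-l", f"scterc,{value}", dev], check=False)
    if not sct_erc_supported or not redundant:
        # Either way the command timer has to outlast the drive's own recovery.
        disk = dev.rsplit("/", 1)[-1]           # "/dev/sda" -> "sda"
        with open(f"/sys/block/{disk}/device/timeout", "w") as f:
            f.write("200")

configure("/dev/sda", redundant=True, sct_erc_supported=False)

The last branch is what keeps the command timer above the drive's
worst-case internal recovery, which is the whole point of the 180-200
second numbers in this thread.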
I don't see why you would think this is not common. If you count just
by systems, then it's absolutely outnumbered at least 100 to 1 by
regular ATA disks. If you look at individual disks though, the reverse
is true, because people who use SCSI drives tend to use _lots_ of disks
(think big data centers, NAS and SAN systems and such). OTOH, both are
probably vastly outnumbered by stuff that doesn't use either standard
for storage...
Separately, USB gets _really_ complicated if you want to cover
everything: USB drives may or may not present as non-rotational, may or
may not show up as SATA or SCSI bridges (there are some of the more
expensive flash drives that actually use SSD controllers plus USB-SAT
chips internally), and if they do show up as such, may or may not support
the required commands (most don't, but it's seemingly hit or miss which do).
I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.
OTOH, I would not be surprised if the stance there is 'you get no support
if you're not using enterprise drives', not because of the project itself,
but because it's ZFS. Part of their minimum recommended hardware
requirements is ECC RAM, so it wouldn't surprise me if enterprise storage
devices are there too.
http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "
"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "
They do not explicitly require enterprise drives, but they clearly
expect SCT ERC to be enabled with some sane value.
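For what it's worth, the per-boot script they're describing can be tiny;
a naive sketch (using the 0.1 second value from the quote, which is 1 in
smartctl's 100 ms units, and assuming every /dev/sd? device belongs to
the pool):

import glob, subprocess

# Run once per boot: set a 0.1 s SCT ERC time on every drive, per the
# OpenZFS wiki recommendation quoted above.
for dev in glob.glob("/dev/sd?"):
    subprocess.run(["smartctl", "-q", "silent", "-l", "scterc,1,1", dev],
                   check=False)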
At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries, whereas raid1, 5, and 6 need
to set things for very short recoveries.
Possibly mdadm and lvm tools do the same thing.
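As a sketch of what that could look like (Python here just to show the
generated rule; the file name is hypothetical, and the rule content is
plain udev RUN+= plumbing calling smartctl and poking the sysfs timeout):

RULE_SHORT = ('ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="{disk}", '
              'RUN+="/usr/sbin/smartctl -q silent -l scterc,70,70 /dev/%k"\n')

RULE_LONG = ('ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="{disk}", '
             'RUN+="/usr/sbin/smartctl -q silent -l scterc,0,0 /dev/%k", '
             'RUN+="/bin/sh -c \'echo 180 > /sys/block/%k/device/timeout\'"\n')

def write_rule(disk, redundant, path="/etc/udev/rules.d/90-storage-timeouts.rules"):
    # raid1/5/6 profiles get a short SCT ERC time; single/raid0 get ERC
    # disabled plus a long (180 s) command timer. Matching on the kernel
    # name is a simplification; a real tool would match on something
    # stable like the drive's serial number.
    rule = (RULE_SHORT if redundant else RULE_LONG).format(disk=disk)
    with open(path, "a") as f:
        f.write(rule)

write_rule("sda", redundant=True)

The rule re-applies on every add/change event, so it survives reboots and
hotplug, which is the point of putting it in udev rather than having mkfs
set it once.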
I"m pretty certain they don't create rules, or even try to check the drive
for SCT ERC support.
They don't. That's a suggested change in behavior. Sorry, I should have
said "should do the same thing" instead of "do the same thing".
The problem with doing this is that you can't be certain whether your
underlying device is actually a physical storage device, and thus you have
to check more than just the SCT ERC commands. Also, many people (myself
included) don't like tools modifying the persistent behavior of their
system in ways the tool itself is not intended to (and messing with block
layer settings falls into that category for a mkfs tool).
Yep, it's imperfect unless there's proper cross-communication between
layers. There are some such things, like hardware raid geometry, that
optionally poke through (when supported by hardware raid drivers) so
that things like mkfs.xfs can automatically provide the right sunit/swidth
for an optimized layout, which the device mapper already does
automatically. So it could be done; it's just a matter of how big a
problem it is to build that, versus just going with a new one-size-fits-all
default command timer?
The other problem, though, is that the existing mechanisms pass through
_read-only_ data, while this requires writable data to be passed through,
which potentially leads to all kinds of complicated issues.
If it were always 200 instead of 30, the only consequence would be when
there's a link problem that is not related to media errors. But what the
hell takes that long to report an explicit error? Even cable problems
generate UDMA errors pretty much instantly.
And that is more why I'd suggest changing the kernel default first, before
trying to use special heuristics or anything like that. The caveat is that
it would need to be for ATA disks only so as not to break SCSI (which works
fine right now) and USB (which has its own unique issues).