Re: SCSI error handling -- one error blocks the whole SCSI host

On Mon, May 27, 2013 at 11:41 PM, James Bottomley
<James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, 2013-05-27 at 16:39 +0200, Hannes Reinecke wrote:
>
>> - LLDDs typically won't return a command status even for a
>>   command which has been aborted via ABORT TASK TMF.
>>   So the midlayer probably will never get notified if
>>   the command got aborted via ABORT TASK.
>
> Well, that's true, but irrelevant.  If the HBA can't inform you of the
> status of the abort, then abort is useless as a first step in the
> traditional eh as well as in this method, so you just don't do that and
> proceed to resets.
>
> There's actually a school of thought that says even if the HBA *can*
> give you all the status you need, aborts are still pointless because
> it's sending in yet another state transition to an already failed state
> machine (because the device is timing out).  Therefore, since the chance
> of recovering the state machine with an abort is so tiny, you should
> start with the lowest reset anyway because that takes the state machine
> to a known state.

Most devices I know do not really abort the command in any normal sense
anyhow, not even on a reset. Disks (HDD and SSD) and SAN systems normally
treat an abort or a reset as a signal that no real reply is necessary; a
command that is already being actively handled continues on its path. The
abort only cancels commands that are still in the queue. If there really
was a problem and the disk is engaged in error recovery of its own, you
will get no response from it and it will seem dead (the abort itself may
time out).
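
To make the behaviour described above concrete, here is a minimal toy model
in Python. It is only a sketch of that description, not of any real device
firmware or kernel API: ToyDevice, State and their members are hypothetical
names invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    QUEUED = auto()     # accepted by the device, not yet started
    ACTIVE = auto()     # the device is working on it right now
    CANCELLED = auto()  # dropped from the queue, never executed
    SILENCED = auto()   # keeps running, but no reply will be sent


@dataclass
class ToyDevice:
    """Hypothetical model of a disk's command handling (illustration only)."""
    commands: dict = field(default_factory=dict)  # tag -> State

    def submit(self, tag, active=False):
        self.commands[tag] = State.ACTIVE if active else State.QUEUED

    def abort(self, tag):
        # Only queued commands are really cancelled. An active command
        # continues on its path; the device merely suppresses the reply.
        if self.commands[tag] is State.QUEUED:
            self.commands[tag] = State.CANCELLED
        elif self.commands[tag] is State.ACTIVE:
            self.commands[tag] = State.SILENCED
        return self.commands[tag]
```

In this model an abort never stops work already in progress, which is why
the initiator cannot assume the medium is untouched after a "successful"
abort.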

The one thing aborts and resets do help with is clearing any pending
commands from your HBA, so that your DMA buffers will no longer be touched
and you can forget the command and do your application-level recovery
(RAID, or lose data and panic).

Aborts and resets are also an important part of handling bad links, but at
least in SAS that is done internally in the HBA anyway.

This view of aborts also means that reducing timeouts for commands and
TMFs is mostly useless and sometimes a really bad idea. I prefer to let
the device carry on with its own error recovery and simply forget about
the command. I want the DMA out of the way, so I issue an abort, but
needing anything higher than that means the link is dead to me.
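
One possible reading of the policy in the paragraph above, written as a small
Python sketch. The names handle_timeout, Outcome and send_abort are
hypothetical and do not correspond to any kernel or HBA driver interface;
send_abort stands in for whatever issues the ABORT TASK and reports whether
the HBA confirmed it.

```python
from enum import Enum


class Outcome(Enum):
    FORGET_COMMAND = "forget the command, recover at application level"
    LINK_DEAD = "treat the link as dead"


def handle_timeout(send_abort):
    """Decide what to do after a command has timed out.

    send_abort is a hypothetical callable returning True if the HBA
    confirmed the abort, False if the abort failed or itself timed out.
    """
    if send_abort():
        # The HBA no longer owns the command, so the DMA buffers are safe
        # to reuse. The device is left alone to finish its own recovery.
        return Outcome.FORGET_COMMAND
    # No escalation to higher-level resets in this policy.
    return Outcome.LINK_DEAD
```

For example, handle_timeout(lambda: False) yields Outcome.LINK_DEAD rather
than walking up a ladder of resets.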

Baruch