RE: Recent comments about FCoE and iSCSI
Great comments. You are all certainly
aware that sockets are also undergoing transformation (asynchronous sockets),
but even with synchronous sockets, and some care not to break existing applications
that use them, a restructuring of the stack may enable (as
shown by the Intel and IBM-Haifa work) great increases in performance.
Software RDMA for the new class of
multicore engines is definitely an interesting proposition (on highly multithreaded
engines it should come with no cost associated with it - or almost no
cost).
I wish I knew more about the decrease
in latencies in the switch fabric (it would be interesting if somebody
could comment) as large Layer-2 fabrics have some inherent latency issues.
FCoE is asking us to forget all this,
go back and pay the hardware price for several more years, and ignore
IP-land, and nothing that I heard convinced me that we should do so.
"Nicholas A. Bellinger"
Please respond to
Zack Best <zbest28@xxxxxxxxx>,
ips@xxxxxxxx, nab@xxxxxxxxxxxxxxx, Mike Mazarick <mazarick@xxxxxxxxxxxxx>,
Eric Hall <ehall@xxxxxxxxx>
RE: Recent comments about FCoE
A quick comment in regards to the large abundance
of computing resources
available for initiator side software IP storage services. Also,
many thanks for posting this great thread. :)
As the progress of the DDP TWG continues onward and 2nd generation
hardware iWARP engines start to come online, the benefit of a hybrid
software implementation with host OS software network stack
modifications in kernel above TCP and SCTP starts to pose a question:
what real savings can hybrid iSER nodes achieve using software DDP? What
are the changes required to make high performance software DDP a
reality?
As osc-iwarp has found out, there is a significant CPU overhead
associated with sockets and software VERBS, but I think this can be
minimized with the right set of changes. Those changes mean moving
away from receive side sockets for software iSER mode. They
start to become attractive for new product designs as this will allow
RNIC hardware engines to scale further using a more sane (or less
painful, depending on who you ask; OFA uses Hybrid IB-VERBs) method than
traditional TOEs with specialty engines. Really taking advantage
of what the DDP and iWARP metadata is telling us about the framed
network transport can help in RDMA WRITE scenarios because the software
RNIC would already have STagged memory ready to go in the iSER case.
This matters especially when it comes to the API for the iSER stack: having
a single codebase with vendors writing hardware drivers beats re-inventing
the wheel with sockets. I believe the smart software RNICs of the
future will direct RDMA traffic directly into host OS SCSI memory
buffers, and like today use something similar to sendpage() for TX.
Multi-core microprocessor designs with large, intelligent shared
caches, and the CPU cache coherency and I/O interconnects that in the 90's
were only available in the Alpha EV67 and the highest of high end shared
memory supercomputers and clusters, are now starting to become the norm.
Pushing software iSER to the next level and beyond is surely not going
to happen with a 30 year old API (sockets). Also for the data center
story with a traditional tiered SAN architecture and software case, the
hybrid iWARP software stack on the initiator will not get a whole lot of
interest until it can show improved performance and an overhead that is
acceptable compared to traditional iSCSI today. For the 3rd generation IP
storage stacks, typical multiport 1G workloads are what will really drive
interest into areas where putting in a hardware RNIC will not be cost
feasible for some time.
But just as with traditional iSCSI, we can also scale software iSER down
towards platforms with more modest computing resources on
low-power, wireless devices. Even in the type of mobile devices that
storage services have been prototyped on today, the benefit of being
able to scale server side hardware RNICs more efficiently is not software
iSER's only benefit. On a side note, I think the transparency that
connection recovery in traditional iSCSI and iSER brings to internexus
multiplexing is important, as are end user requirements for configuration and
management scenarios. Using an active-active recovery mechanism that
is as close to completely transparent as possible (which ERL=2 is IMHO) is
I think what mobile IP storage services users need to be demanding from
Thanks for listening!
On Thu, 2007-04-26 at 21:16 -0400, Julian Satran wrote:
> Excellent comments. My take (if not obvious from the previous text) is
> that data centers will be very large, and that compute power (as evidenced
> by the multicore) and advances in stack implementation are bound to
> improve substantially the performance of the protocol stacks (see Intel
> and our work) and layer 3 switching.
> It is important also to point out that Ethernet has substantial
> latencies if only bridging is used, and replacement technologies (such
> as Rbridges or others) may take some time to appear.
> Zack Best <zbest28@xxxxxxxxx>
> 25/04/07 16:37
> RE: Recent comments about FCoE and iSCSI
> The real debate here is between two types of networks.
> The first is reliable at the link level and does not
> drop packets under congestion. The second is running
> a reliable transport protocol (i.e. TCP) over an
> unreliable link level network.
> I agree with the scaling argument. For sufficiently
> large networks, reliable link level doesn't work well
> because network component failures and chronically
> congested links are not handled well. For
> sufficiently small networks, reliable link level has
> some significant advantages in simplicity, low
> hardware cost, performance, and worst case latency.
> My personal view is that the vast majority of
> enterprise storage networks fall in the "sufficiently
> small" category. This view has to some extent been
> vindicated by the continuing success of Fibre Channel
> in this space and the inability of iSCSI to displace
> FC in any significant way for enterprise storage. Of
> course, this may or may not change in the future.
> Whether FC is simpler than iSCSI depends largely on
> your definition of simplicity. If one defines
> simplicity/complexity as the number of gates or lines
> of code to reduce the protocol to hardware or
> firmware, then my experience is that iSCSI is 2X to 3X
> the complexity of FC. This has implications in cost
> and reliability.
> Particularly problematic with iSCSI is the
> unpredictability of the performance. Performance is
> great with no packet drop. However even a small
> amount of congestion can cause a sudden large drop in
> performance. This can be difficult to predict as a
> network that is almost but not quite congested can run
> great, but a small incremental change of any sort can
> cause the performance to become suddenly unacceptable.
> For FC, or other protocols using link level flow
> control, the reduction in performance is much more
> graceful and incremental when the level of congestion
> is small and intermittent.
> A second major problem with iSCSI is the unbounded
> nature of worst case latency. When a storage network
> fails, it is desirable to detect the failure in a
> fraction of a second and transition to a backup
> network. TCP, when implemented to the standards, can
> take many seconds or minutes to determine that a
> network has failed and close the connection. RFC
> 2988, for instance, requires that the minimum
> retransmission timeout be one second. This means a single
> dropped packet may add one second to the latency of
> outstanding commands. This is a huge amount of time
> on a 10G link. No doubt this could be mitigated by
> drastically reducing the timeouts within TCP, but the
> market seems to be surprisingly resistant to tampering
> with accepted standards here.
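
To put rough numbers on the timeout point above, here is a minimal
Python sketch (an illustration, not something from the original
message). It assumes the one second minimum timeout mentioned above,
a timer that doubles after every expiry, and a 60 second cap on the
timer; the function names and default values are hypothetical and
chosen only for this example.

```python
def time_to_give_up(retries, min_rto=1.0, max_rto=60.0):
    """Total seconds spent waiting if `retries` consecutive timeouts expire.

    Assumes the timer starts at min_rto and doubles after every expiry,
    capped at max_rto (both defaults are assumptions for illustration).
    """
    total = 0.0
    rto = min_rto
    for _ in range(retries):
        total += rto
        rto = min(rto * 2, max_rto)
    return total


def bytes_idle(link_bits_per_second, seconds):
    """Bytes a link could have carried while the sender sat in a timeout."""
    return link_bits_per_second * seconds / 8


if __name__ == "__main__":
    # One lost packet: one full second of waiting on the affected connection.
    print(time_to_give_up(1))
    # Five expiries in a row before declaring the path dead.
    print(time_to_give_up(5))
    # Capacity of a 10 Gb/s link over that first second.
    print(bytes_idle(10e9, 1.0))
```

Under these assumptions a single expiry costs one second (about 1.25 GB
of capacity on a 10G link), and five expiries in a row already add up
to 31 seconds, which is far from the sub-second failover asked for
above.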
> Overall, the FC and FCP protocols have a lot in common
> with the Intel x86 instruction set architecture. They
> are overly complex, and rather poorly designed by
> modern standards. But they are good enough, and there
> is a huge amount of value add that has been built on
> top of them, and therefore little incentive to change.
> FCoE is an interesting idea because it preserves 90%
> of the existing value add of FC, unifies the physical
> link with Ethernet, and uses the reliable link method
> of packet delivery.
> There are two significant possibilities for iSCSI to
> displace FC (or FCoE) in enterprise storage networks.
> First is if the networks start to scale to large
> enough size that FC can't be made sufficiently
> reliable, and second if CPU compute cycles become
> sufficiently cheap that the iSCSI protocol can be run
> in host software with no negative performance impact.
> Barring either of these, it seems that iSCSI will have
> an uphill battle, and FCoE may have a place.
> -----Original Message-----
> From: Julian Satran [mailto:Julian_Satran@xxxxxxxxxx]
> Sent: Tuesday, April 24, 2007 3:10 PM
> To: ips@xxxxxxxx
> Subject: Recent comments about FCoE and iSCSI
> Dear All,
> The trade press is lately full of comments about the
> latest and greatest reincarnation of Fiber Channel
> over Ethernet.
> It made me try and summarize all the long and hot
> debates that preceded the advent of iSCSI.
> Although FCoE proponents make it look like no debate
> preceded iSCSI that was not so - FCoE was considered
> even then and was dropped as a dumb idea.
> Here is a summary (as far as I can remember) of the
> main arguments. They are not bad arguments even in
> retrospect and technically FCoE doesn't look better
> than it did then.
> Feel free to use this material in any form. I expect
> this group to seriously expand my arguments and make
> them public - in personal or collective form.
> And do not forget - it is a technical dispute -
> although we all must have some doubts about the way it
> is pursued.
> What a piece of nostalgia :-)
> Around 1997 when a team at IBM Research (Haifa and
> Almaden) started looking at connecting storage to
> servers using the "regular network" (the ubiquitous
> LAN) we considered many alternatives (another team
> even had a look at ATM - still a computer network
> candidate at the time). I won't take you through all of
> our rationale (and we went over some of it again at
> the end of 1999 with a team from Cisco before we
> convened the first IETF BOF in 2000 at Adelaide that
> resulted in iSCSI and all the rest) but the
> reasons we chose to drop Fiber Channel over raw
> Ethernet were multiple:
> Fiber Channel Protocol (SCSI over Fiber Channel Link)
> is "mildly" effective because:
> - it implements endpoints in a dedicated engine
> - it has no transport layer (recovery is done at the
>   application layer under the assumption that the error
>   rate will be very low)
> - the network is limited in physical span and logical
>   span (number of switches)
> - flow-control/congestion control is achieved with a
>   mechanism adequate for a limited span network
>   (credits). The packet loss rate is almost nil and that
>   allows FCP to avoid using a transport (end-to-end)
>   layer
> - FCP switches are simple (addresses are local and
>   the memory requirements can be limited through the
>   credit mechanism)
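
One way to see why credits suit a limited span network is to estimate
how many credits a sender needs to keep a link busy over a given
distance. The Python sketch below is only an illustration (not from
the original message): it assumes roughly 5 microseconds of one-way
propagation delay per kilometre of fiber, ignores switch processing
time and frame overhead, and uses a hypothetical function name and
example frame size.

```python
import math

# Assumed one-way propagation delay in fiber, in seconds per kilometre.
FIBER_DELAY_S_PER_KM = 5e-6


def credits_needed(distance_km, rate_bps, frame_bytes):
    """Credits required so the sender never stalls waiting for a credit.

    A credit only comes back after the frame has crossed the link and
    the acknowledgement has returned, so the sender must be allowed to
    have one round trip worth of frames outstanding.
    """
    round_trip_s = 2 * distance_km * FIBER_DELAY_S_PER_KM
    bits_in_flight = round_trip_s * rate_bps
    return math.ceil(bits_in_flight / (frame_bytes * 8))


if __name__ == "__main__":
    # 10 km at 1 Gb/s with 2048-byte frames (example values only).
    print(credits_needed(10, 1e9, 2048))
    # 100 km at 10 Gb/s with the same frame size.
    print(credits_needed(100, 10e9, 2048))
```

With these example values the requirement grows from a handful of
credits (and receive buffers) per port to several hundred as distance
and speed increase, which is the buffer memory the credit mechanism
was meant to keep small.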
> However:
> - FCP endpoints are inherently costlier than
>   simple NICs – the cost argument (initiators are more
>   expensive)
> - The credit mechanism is highly unstable for large
>   networks (check switch vendors planning docs for the
>   network diameter limits) – the scaling argument
> - The assumption of low losses due to errors might
>   radically change when moving from 1 to 10 Gb/s – the
>   scaling argument
> - Ethernet has no credit mechanism and any mechanism
>   with a similar effect increases the end point cost.
> - Building a transport layer in the protocol stack has
>   always been the preferred choice of the networking
>   community – the community argument
> - The "performance penalty" of a complete protocol stack
>   has always been overstated (and overrated). Advances
>   in protocol stack implementation and finer tuning of
>   the congestion control mechanisms make conventional
>   TCP/IP perform well even at 10 Gb/s and over.
>   Moreover the multicore processors that become dominant
>   on the computing scene have enough compute cycles
>   available to make any "offloading" possible as a mere
>   code restructuring exercise (see the stack reports
>   from Intel, IBM etc.)
> - Building on a complete stack makes available a wealth
>   of operational and management mechanisms built over
>   the years by the networking community (routing,
>   provisioning, security, service location etc.) – the
>   community argument
> - Higher level storage access over an IP network is
>   widely available and having both block and file served
>   over the same connection with the same support and
>   management structure is compelling – the community
>   argument
> - Highly efficient networks are easy to build over IP
>   with optimal (shortest path) routing while Layer 2
>   networks use bridging and are limited by the logical
>   tree structure that bridges must follow. The effort to
>   combine routers and bridges (rbridges) is promising to
>   change that but it will take some time to finalize
>   (and we don't know exactly how it will operate).
>   Until then the scale of Layer 2 networks is going to be
>   seriously limited – the scaling argument
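
The last point, about the tree structure that bridges must follow, can
be illustrated with a small Python sketch (not from the original
message). It builds a ring of six switches, derives a loop-free tree
with a plain breadth-first search as a stand-in for a real spanning
tree protocol (an assumption: real bridges elect the root and pick
links by bridge ID and path cost), and compares hop counts. All
function names are hypothetical.

```python
from collections import deque


def ring(n):
    """Adjacency lists for n switches connected in a ring."""
    adj = {i: [] for i in range(n)}
    for i in range(n):
        j = (i + 1) % n
        adj[i].append(j)
        adj[j].append(i)
    return adj


def bfs_parents(adj, root):
    """Breadth-first search; maps each reachable node to its parent."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def spanning_tree(adj, root):
    """Keep only the links of a BFS tree rooted at `root` (loop-free)."""
    tree = {node: [] for node in adj}
    for node, par in bfs_parents(adj, root).items():
        if par is not None:
            tree[node].append(par)
            tree[par].append(node)
    return tree


def hops(adj, src, dst):
    """Number of links on a shortest path from src to dst."""
    parent = bfs_parents(adj, src)
    count = 0
    node = dst
    while node != src:
        node = parent[node]
        count += 1
    return count


if __name__ == "__main__":
    full = ring(6)
    tree = spanning_tree(full, root=0)
    print(hops(full, 3, 4))  # shortest-path routing may use every link
    print(hops(tree, 3, 4))  # bridging must stay on the tree
```

In this toy ring, switches 3 and 4 are direct neighbours, yet the link
between them is the one left out of the tree, so bridged traffic
between them has to travel all the way around through the root.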
> As a side argument – a performance comparison made in
> 1998 showed SCSI over TCP (a predecessor of the later
> iSCSI) to perform better than FCP at 1 Gb/s for block
> sizes typical for OLTP (4-8KB). That was what
> convinced us to take the path that led to iSCSI – and
> we used plain vanilla x86 servers with plain-vanilla
> NICs and Linux (with similar measurements conducted on
> The networking and storage community acknowledged
> those arguments and developed iSCSI and the companion
> protocols for service discovery, boot etc.
> The community also acknowledged the need to support
> existing infrastructure and extend it in a reasonable
> fashion and developed two protocols: iFCP (to support
> hosts with FCP drivers and IP connections to connect
> to storage by a simple conversion from FCP to TCP
> packets) and FCIP to extend the reach of FCP through IP
> (connects FCP islands through TCP links). Both have
> been implemented and their foundation is solid.
> The current attempt at developing a "new-age" FCP over
> an Ethernet link is going against most of the
> arguments that have given us iSCSI etc.
> It ignores the networking layering practice, builds an
> application protocol directly above a link and thus
> limits scaling, mandates elements at the link layer
> and application layer that make applications more
> expensive, and leaves aside the whole "ecosystem" that
> accompanies TCP/IP (and not Ethernet).
> In some related effort (and at a point also when
> developing iSCSI) we also considered moving away from
> SCSI (like some non-standardized software popular in some
> circles did – e.g., NBP) but decided against it.
> SCSI is a mature and well understood access
> architecture for block storage and is implemented by
> many device vendors. Moving away from it would not
> have been justified at the time.
> Ips mailing list