Re: TCP performance on a lossy 1Gbps link
Hi - many thanks for the response; comments are inlined below.

Leslie Rhorer wrote:
>> I've got a dedicated 1000Mbps link between two sites with an rtt of 7ms, which seems to be dropping about 1 in 20000 packets (MTU of 1500 bytes). I've got identical boxes at either end of the link running 2.6.27 (e1000e 0.3.3.3-k6), and I've been trying to saturate the link with TCP transfers in spite of the packet loss.
>
> Why?
Because I'd like to use existing TCP applications over the link (e.g. rsync, mysql, HTTP, ssh, etc.) and get the highest possible throughput.
>> I can chuck UDP at near-linespeed over the link (/dev/zero + nc), which seems to almost saturate it at 920Mbps. However, TCP throughput of a single stream (/dev/zero + nc) averages about 150Mbps. Looking at the tcptrace time sequence graphs of a capture, the TCP window averages out at about 3MB - although after an initial exponential ramp-up, the moment the sender realises a packet is lost, the throughput appears to be clamped to only about ~5% of the available window. I assume this is the congestion control algorithm at the sender applying a congestion window.
>
> No, not really, per se. TCP sends packets until the Tx window is full. The Rx host receives the packets and assembles them in order. It sends an ACK pointing to the highest numbered packet in the successfully assembled stream, saving but ignoring any out-of-sequence packets. Thus, if the receiving host gets the first 12 and the last 6 out of 20 packets, it sends an ACK for packet #12, and then just waits. Having received an ACK for #12, the Tx host moves the start of the window to packet #13 and transmits the remaining packets up to the end of the window. It then sits and waits for an additional ACK. Since packets #13 and #14 never reached the Rx host, it also simply waits, keeping packets #15 through the end of the window, and both hosts sit idle. After an implementation-dependent wait period (usually about 2 seconds), the Tx host starts re-sending the entire window contents, which in this case starts with packet #13.
Right, I follow your example - but I thought that with SACK turned on (as it is by default), the Rx host will immediately ACK packets #14 through #20 as they arrive, repeatedly ACKing receipt up to the start of the hole at packet #13 - but with selective ACK blocks announcing that it has correctly received the subsequent packets. Once the Tx host sees three such duplicates, it can assume that packet #13 was lost and retransmit it - which surely takes only 1 round trip + 3 more packet intervals, rather than the 2 seconds of a plain old retransmit timeout? Even without SACK, doesn't Linux implement Fast Retransmit, causing the Tx host to retransmit packet #13 immediately after receiving 3 consecutive duplicate ACKs?
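For what it's worth, I've sanity-checked that these mechanisms are actually switched on at both ends - a trivial sketch (Python) that just reads the standard /proc/sys/net/ipv4 knobs:

    #!/usr/bin/env python
    # Dump the TCP loss-recovery knobs discussed above. The boolean
    # knobs read "1"/"0"; tcp_congestion_control reads a string.
    KNOBS = [
        "tcp_sack",                # selective acknowledgements (RFC 2018)
        "tcp_fack",                # forward acknowledgement
        "tcp_dsack",               # duplicate SACK (RFC 2883)
        "tcp_window_scaling",      # windows > 64KB (RFC 1323)
        "tcp_congestion_control",  # the algorithm that clamps cwnd on loss
    ]
    for knob in KNOBS:
        with open("/proc/sys/net/ipv4/" + knob) as f:
            print("%-24s = %s" % (knob, f.read().strip()))

On 2.6.27 all of the boolean knobs default to on, with cubic as the default congestion control.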
This entire process seems like it should be able to happen without enormous disruption: the window might briefly stall waiting for a retransmission, but it should not significantly hinder throughput. That said, with a 7ms round trip, a 1-in-20000 (0.005%) packet loss rate, and 1500-byte packets at 1Gbps line speed (roughly 83,000 packets/s), I guess that if the losses were evenly distributed you could lose about 4 packets every second - costing roughly 30ms of retransmit pauses per second if each loss stalls the stream for one RTT. According to tcptrace, however, packet loss is clumpy, causing only a few ~7ms pauses every second.
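In fact the random loss on its own largely predicts the throughput I'm seeing. Assuming the steady-state model from Mathis et al. applies - rate <= (MSS/RTT) * C/sqrt(p), with C ~ 1.22 - a back-of-envelope sketch in Python (the MSS of 1448 assumes TCP timestamps are on):

    #!/usr/bin/env python
    # Mathis/Semke/Mahdavi/Ott steady-state ceiling for a Reno-style
    # TCP under random loss: rate <= (MSS/RTT) * C/sqrt(p).
    import math

    mss = 1448                  # bytes: 1500 MTU - 40 TCP/IP - 12 timestamps
    rtt = 0.007                 # seconds
    p   = 1.0 / 20000           # per-packet loss probability
    C   = math.sqrt(3.0 / 2.0)  # ~1.22

    rate = (mss / rtt) * (C / math.sqrt(p))   # bytes/second
    print("ceiling ~ %.0f Mbit/s" % (rate * 8 / 1e6))

That prints a ceiling of roughly 290 Mbit/s - the same order as the ~150Mbps I measure, and nowhere near the ~940Mbps the pipe can carry.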
> If a re-transmit is required, then TCP does adjust the window size to accommodate what it presumes is congestion on the link. It also never starts out streaming at full bandwidth. It continually adjusts its window size upwards until it encounters what it interprets as congestion issues, or the maximum window size supported by the two hosts.
Right. I understand this as the congestion avoidance and slow start algorithms from RFC2581.
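Which, I think, also explains why recovery is so slow here: the bandwidth-delay product is large, and in congestion avoidance cwnd grows by only about one segment per RTT. A rough sketch of the post-loss recovery time, using plain Reno-style AIMD arithmetic (CUBIC, the 2.6.27 default, ramps back faster, but the shape of the problem is the same):

    #!/usr/bin/env python
    # After a fast retransmit, Reno halves cwnd and then regrows it by
    # ~1 MSS per RTT, so recovering half the BDP takes bdp/2 RTTs.
    link_bps = 1e9       # 1 Gbps
    rtt      = 0.007     # 7 ms
    mss      = 1448.0    # payload bytes per segment (timestamps assumed on)

    bdp  = link_bps / 8 * rtt / mss   # ~604 segments in flight at line rate
    rtts = bdp / 2                    # RTTs to regrow the halved cwnd
    print("BDP ~ %.0f segments" % bdp)
    print("recovery ~ %.0f RTTs = %.1f s" % (rtts, rtts * rtt))

So each loss event costs on the order of 2 seconds of climbing, and with a loss every quarter of a second cwnd never gets anywhere near the BDP.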
>> What else should I be doing to crank up the throughput and defeat the congestion control?
>
> Why would you be trying to do this?
To get the most throughput out of the link for TCP transfers between existing applications.
> It is true TCP works well with congested links, but not so well with links suffering random errors. You aren't going to be successful in breaking the TCP handshaking parameters without breaking TCP itself.
Right. I'm not trying to break the handshaking parameters - just adjust the extent to which the congestion window is reduced in the face of packet loss, admittedly at the risk of increasing packet loss when the link is genuinely saturated.
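Concretely, the least invasive knob I've found is swapping the congestion control algorithm, which 2.6.27 supports system-wide (echo htcp > /proc/sys/net/ipv4/tcp_congestion_control, after modprobe tcp_htcp) or per-socket via the TCP_CONGESTION socket option. A sketch of the latter - "htcp" is just my guess at a less timid algorithm; anything listed in tcp_available_congestion_control should work, and non-default choices may need CAP_NET_ADMIN:

    #!/usr/bin/env python
    # Select a congestion control algorithm on one socket rather than
    # system-wide. TCP_CONGESTION is 13 in linux/tcp.h; older Python
    # socket modules don't define the constant.
    import socket

    TCP_CONGESTION = 13

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b"htcp")
    print(s.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16).split(b"\0")[0])

That at least confines the experiment to one application's sockets instead of changing behaviour for the whole box.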
> TCP guarantees delivery of the packets to the application layer intact and in order. The behavior of TCP on a dirty link is an artifact of that requirement. If you want to deliver at full speed, use UDP, and have the application layer handle lost packets.
Surely implementing reliable data transfer at the application level ends up being effectively the same as re-implementing TCP (although I guess you could miss out the congestion control, or find some application-layer mechanism for reserving bandwidth for the stream).
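To illustrate: even the most minimal reliable-delivery scheme over UDP immediately regrows TCP's machinery - sequence numbers, ACKs, timeouts, retransmission. A toy stop-and-wait sender (Python; the address is a placeholder, and it assumes a receiver that echoes back the 4-byte sequence number of each datagram it gets):

    #!/usr/bin/env python
    # Toy stop-and-wait over UDP: the smallest possible "reliable UDP",
    # and already a miniature TCP minus windows and congestion control.
    import socket, struct

    DEST    = ("198.51.100.2", 9999)  # placeholder receiver address
    TIMEOUT = 0.1                     # seconds; cf. TCP's RTO

    def send_reliably(sock, chunks):
        for seq, payload in enumerate(chunks):
            pkt = struct.pack("!I", seq) + payload
            while True:
                sock.sendto(pkt, DEST)
                try:
                    ack, _ = sock.recvfrom(4)
                    if struct.unpack("!I", ack)[0] == seq:
                        break             # delivered; next chunk
                except socket.timeout:
                    pass                  # data or ACK lost: resend

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(TIMEOUT)
    send_reliably(s, [b"hello", b"world"])

And note it ships one datagram per round trip: to go faster you add a window of outstanding packets, and at that point you have re-derived TCP, minus only the congestion control.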
> If you did not write the application (or have a developer do it for you), and it does not support UDP transfers, then there is nothing you can do about it.
>> Could jumbo frames help?
>
> No. If anything, they may make it worse. Noisy links call for small frames.
I'm trying jumbo frames anyway - in the hope that if the loss is happening per-packet, the congestion window will at least climb back more quickly after it collapses on a loss (as implied by http://sd.wareonearth.com/~phil/jumbo.html). If the loss is happening per-bit, then jumbo frames will make the packet loss look bad enough that I stand a chance of getting the link itself fixed :)
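To put a number on the per-packet case: re-running the earlier Mathis back-of-envelope with a 9000-byte MTU, and the per-packet loss probability assumed unchanged:

    #!/usr/bin/env python
    # Same model as before, with jumbo frames (MSS = 9000 - 40 - 12).
    import math
    mss, rtt, p = 8948, 0.007, 1.0 / 20000
    rate = (mss / rtt) * (math.sqrt(3.0 / 2.0) / math.sqrt(p))
    print("ceiling ~ %.0f Mbit/s" % (rate * 8 / 1e6))

That prints roughly 1770 Mbit/s - above line rate, i.e. the loss-imposed ceiling would stop being the binding constraint.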
>> Is this a Bad Idea?
>
> Yes. RUDE_TCP notwithstanding, there are various other ways to guarantee data delivery than that used by TCP, and each method has its own strengths and drawbacks. No matter what transfer protocol is implemented, however, guaranteeing delivery of a stream segment requires the entire segment be assembled completely at the Rx host before moving on. Consequently, the transmitter must at some point halt after sending the entire segment, until it receives notification that the segment arrived intact. This places an upper limit on the overall transmission rate directly proportional to the size of the Rx buffer.
Yes, but I really don't think this is what is slowing my throughput down in this instance - instead, the congestion window is clamping the data rate at the sender. Looking at a tcptrace time sequence graph, I can see that only a small fraction of the available TCP window is ever used - so I can only conclude that the Tx host is holding back to honour the artificially reduced congestion window.
> This can be done at the application layer, or it can be done at some other layer, in this case TCP. Handling a link that is expected to be noisy is definitely best done through some protocol other than TCP, assuming such flexibility is available.
Presumably a rather perverse solution would be a proxy that splits a single TCP stream into multiple streams and reassembles them at the other end - pushing the problem into one of keeping large application-layer reassembly buffers at the receiver, with socket back-pressure keeping the component TCP streams sufficiently in step. Does anyone know whether such a thing has been written? Or do people simply write off single-stream TCP throughput over WANs like this?
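In case it helps the discussion, here is the shape of the thing I have in mind - a sketch of the sending half only (Python; the remote address is a placeholder, and the receiving proxy, not shown, reads all N sockets and reassembles the chunks by the sequence number each one carries):

    #!/usr/bin/env python
    # Stripe one logical stream (stdin) over N TCP connections, so a
    # loss event collapses only 1/N of the aggregate congestion window.
    import os, socket, struct

    REMOTE  = ("198.51.100.2", 9000)  # placeholder reassembling proxy
    NSTREAM = 8
    CHUNK   = 64 * 1024

    socks = [socket.create_connection(REMOTE) for _ in range(NSTREAM)]

    seq = 0
    while True:
        data = os.read(0, CHUNK)      # stdin; returns b"" at EOF
        if not data:
            break
        s = socks[seq % NSTREAM]      # naive round-robin
        s.sendall(struct.pack("!QI", seq, len(data)))  # 8B seq + 4B length
        s.sendall(data)               # blocking sendall() is the back-pressure
        seq += 1

    for s in socks:
        s.close()

A real version would use select() to feed whichever socket is writable rather than strict round-robin, and would bound the receiver's reassembly buffer - but even this naive striping should mean a single loss stalls only one eighth of the aggregate window.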
M.