| ======================= |
| TCP Network Performance |
| ======================= |
| |
| .. warning:: |
| Migrated from: |
| https://cwiki.apache.org/confluence/display/NUTTX/TCP+Network+Performance |
| |
| |
(Abstracted and extended from a discussion on the NuttX Google group)
| |
| Question |
| ======== |
| |
| For some unknown reason, I am seeing poor TCP network performance. |
| |
| Answer |
| ====== |
| |
First, let's talk about TCP send performance.
| |
| Source of Performance Bottlenecks |
| --------------------------------- |
| |
General TCP send performance is not determined by the TCP stack as much
as it is by the network device driver. Bad network performance is due
to time lost `BETWEEN` packet transfers. The packet transfers themselves
go at wire speed (ignoring Ethernet issues such as collisions,
back-off delays, inter-packet gaps (IPG), and so on). So if you want
to improve performance on a given network, you have to reduce the time
lost between transfers. There is no other way.
| |
The time between packets is limited primarily by the buffering
design of the network driver. If you want to improve performance,
then you must improve the buffering in the network driver.
You need to support many full-size (1500-byte) packet buffers.
You must be able to query the network stack for new data to send
and queue those transfers in packet buffers. In order to reach
peak performance, the network driver must have the next transfer
buffered and ready-to-go before the previous transfer finishes,
minimizing the gap between packet transfers.
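
The sketch below illustrates that buffering pattern. It is not taken
from any real NuttX driver; the names (``hw_start_dma()``), the buffer
count, and the buffer size are purely illustrative assumptions. The
point is that the driver keeps a small ring of full-size packet
buffers so that the TX-complete interrupt can start the next frame
immediately:

.. code-block:: c

   /* Illustrative sketch only -- not taken from any real NuttX driver.
    * It shows the idea of keeping a ring of full-size packet buffers so
    * that the next frame is ready-to-go before the current one completes.
    */

   #include <stdint.h>
   #include <stdbool.h>
   #include <string.h>

   #define NTXBUFFERS   8          /* Several buffers, not just one */
   #define PKTBUF_SIZE  1518       /* Room for a full Ethernet frame */

   struct txbuf_s
   {
     uint16_t len;                 /* Number of bytes to send */
     uint8_t  data[PKTBUF_SIZE];   /* Frame contents */
   };

   /* Hypothetical device-specific operation that starts one DMA transfer */

   extern void hw_start_dma(const uint8_t *data, uint16_t len);

   static struct txbuf_s g_txbuf[NTXBUFFERS];
   static volatile int g_txhead;   /* Buffer currently owned by the hardware */
   static volatile int g_txtail;   /* Next free buffer to fill */

   /* Called when the stack has a frame to send: copy it into a free
    * buffer so that it is already queued before the previous transfer
    * finishes.  (Kicking off the very first transfer when the ring is
    * idle is omitted for brevity.)
    */

   static bool driver_queue_frame(const uint8_t *frame, uint16_t len)
   {
     int next = (g_txtail + 1) % NTXBUFFERS;
     if (next == g_txhead)
       {
         return false;             /* All buffers busy */
       }

     memcpy(g_txbuf[g_txtail].data, frame, len);
     g_txbuf[g_txtail].len = len;
     g_txtail = next;
     return true;
   }

   /* Called from the TX-complete interrupt: start the next queued frame
    * immediately so that the gap between packets stays small.
    */

   static void driver_txdone_interrupt(void)
   {
     g_txhead = (g_txhead + 1) % NTXBUFFERS;
     if (g_txhead != g_txtail)
       {
         hw_start_dma(g_txbuf[g_txhead].data, g_txbuf[g_txhead].len);
       }
   }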
| |
Different network devices also expose more or less efficient
interfaces: the worst-performing can handle only one packet at a
time, while the best are able to retain linked lists of packet
buffers in memory and perform scatter-gather DMA over a sequence
of packets.
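
A conceptual picture of that "best case" interface is a chained DMA
descriptor. The layout below is illustrative only (the field names and
flag bits are assumptions, not taken from any datasheet); the important
property is that the hardware can walk the chain on its own, with no
software in the gap between frames:

.. code-block:: c

   /* Conceptual layout of a chained (scatter-gather) TX DMA descriptor,
    * similar in spirit to what higher-end Ethernet MACs provide.  Field
    * names and flag values are illustrative only.
    */

   #include <stdint.h>

   #define TXDESC_OWN  (1u << 31)   /* Descriptor owned by the DMA engine */
   #define TXDESC_LAST (1u << 30)   /* Last buffer of the frame */

   struct txdesc_s
   {
     uint32_t         status;       /* OWN/LAST and completion flags */
     uint32_t         buflen;       /* Length of this buffer */
     uint8_t         *buffer;       /* Pointer to the packet data */
     struct txdesc_s *next;         /* Next descriptor in the chain */
   };

   /* With a chain like this, the driver only fills in descriptors and
    * hands them to the hardware; the MAC then streams packet after
    * packet without any software in the inter-packet gap.
    */

   static void txchain_append(struct txdesc_s *desc, uint8_t *buf,
                              uint32_t len)
   {
     desc->buffer = buf;
     desc->buflen = len;
     desc->status = TXDESC_OWN | TXDESC_LAST;
   }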
| |
| In the NuttX TCP stack, you can also improve performance by |
| enabling TCP write buffering. But the driver is the real key. |
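
From the application's point of view, write buffering changes when
``send()`` returns rather than how it is called. The loop below is a
minimal, generic sketch (not NuttX-specific code); the behavior noted
in the comments assumes write buffering is controlled by
``CONFIG_NET_TCP_WRITE_BUFFERS``:

.. code-block:: c

   /* Minimal sender loop.  The application code is the same either way;
    * what changes is when send() returns:
    *
    *   - Without TCP write buffering, each send() blocks until the data
    *     has been ACKed by the remote peer.
    *   - With CONFIG_NET_TCP_WRITE_BUFFERS=y, send() returns as soon as
    *     the data has been copied into the stack's write buffers, so the
    *     next chunk can be prepared while the previous one is in flight.
    */

   #include <sys/types.h>
   #include <sys/socket.h>

   static int send_all(int sockfd, const char *buf, size_t len)
   {
     while (len > 0)
       {
         ssize_t nsent = send(sockfd, buf, len, 0);
         if (nsent < 0)
           {
             return -1;            /* errno describes the failure */
           }

         buf += nsent;
         len -= (size_t)nsent;
       }

     return 0;
   }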
| |
It would be good to have a real, in-depth analysis of the
network stack's performance to identify bottlenecks and
generate ideas for performance improvement. No one has
ever done that. If I were aware of any stack-related
performance issue, I would certainly address it.
| |
| RFC 1122 |
| -------- |
| |
There is one important feature missing from the NuttX TCP stack
that can help when there is no write buffering: Without write
buffering, ``send()`` will not return until the transfer has
been ACKed by the recipient. But under RFC 1122, the host
need not ACK each packet immediately; the host may wait
for up to 500 ms before ACKing. This combination can cause very
slow performance when small, non-buffered transfers are
made to an RFC 1122 client. However, an RFC 1122 host must
ACK at least every second packet, so sequences of
packets sent with write buffering enabled do not suffer from
this problem.
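
As a rough illustration with made-up numbers: if each unbuffered
``send()`` carries a single 100-byte segment and the peer delays its
ACK for the full 500 ms, the sender can complete at most two such
calls per second:

.. math::

   \text{throughput} \le \frac{\text{segment size}}{\text{ACK delay}} = \frac{100\ \text{bytes}}{0.5\ \text{s}} = 200\ \text{bytes/s}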
| |
`Update: RFC 1122 support was added to the NuttX TCP
stack with commit 66ef6d143a627738ad7f3ce1c065f9b1f3f303b0
in December of 2019. That, however, affects only
received-packet ACK behavior and has no impact on transmitted
packet performance; write buffering is still recommended.`
| |
| TCPBlaster |
| ---------- |
| |
| I created a new test application at ``apps/examples/tcpblaster`` to |
| measure TCP performance and collected some data for the |
| configuration that happens to be on my desk. The `tcpblaster` |
| test gives you the read and write transfer rates in ``Kb/sec`` |
| (I won't mention the numbers because I don't believe they |
would translate to any other setup and, hence, would be
| misleading). |
| |
| There is a nifty `TCP Throughput Tool <https://www.switch.ch/network/tools/tcp_throughput/>`_ |
| that gives some theoretical upper limits on performance. |
The tool needs to know the ``MSS`` (which is the Ethernet
packet size that you configured minus the 14-byte Ethernet
header and the IP and TCP headers), the round-trip time
(``RTT``) in milliseconds (which you can get by pinging the
target from the Linux host), and a loss constant (which
I left at the default). With these values, I can determine
that the throughput of the NuttX TCP stack is approximately
at the theoretical limit. You should not be able to do any
better than that (actually, it performs above the theoretical
limit, but I suppose that is why it is "theoretical").
| |
So, if you are unhappy with your network performance, then I
suggest that you run the `tcpblaster` test and use that data
(along with the ``RTT`` from ping) with the
`TCP Throughput Tool <https://www.switch.ch/network/tools/tcp_throughput/>`_.
If you are still unhappy with the performance, don't go
immediately pointing fingers at the stack (which everyone does).
Instead, you should focus on optimizing your network
configuration settings and reviewing the buffer handling
in the Ethernet driver for your MCU.
| |
| If you do discover any significant performance issues |
| with the stack I will of course gladly help you resolve |
| them. Or if you have ideas for improved performance, |
| I would also be happy to hear those. |
| |
| What about Receive Performance? |
| ------------------------------- |
| |
| All of the above discussion concerns `transmit performance`, |
| i.e., "How fast can we send data over the network?" The other |
side is receive performance. Receive performance is a very
| different thing. In this case it is the remote peer who is |
| in complete control of the rate at which packets appear on |
| the network and, hence, responsible for all of the raw bit |
| transfer rates. |
| |
However, we might also redefine performance as the number of
bytes that were `successfully` transferred. In order for the
bytes to be successfully transferred, they must be successfully
received and processed on the NuttX target. We fail in this
if the packet is `lost` or `dropped`. A packet is lost if
the network driver is not prepared to receive it when
it arrives. A packet is dropped by the network stack if it is
received but cannot be processed, either because there
is some logical issue with the packet (not the case here)
or because there is no space to buffer the newly received packet.
| |
If a TCP packet is lost or dropped, then the penalty is big:
The packet will not be ACKed, and the remote peer may send a
few more out-of-sequence packets, which will also be dropped.
| Eventually, the remote peer will time out and retransmit |
| the data from the point of the lost packet. |
| |
| There is logic in the TCP protocol to help manage these data |
| overruns. The TCP header includes a TCP `receive window` which |
| tells the remote peer how much data the receiver is able to |
| buffer. This value is sent in the ACK to each received |
packet. If well tuned, this receive window can help
prevent packets from being dropped due to a lack of
read-ahead storage. This is a little better: the remote
| peer will hold off sending data instead of timing out and |
| re-transmitting. But this is still a loss of performance; |
| the gap between the transfer of packets caused by the hold-off |
| will result in a reduced transfer rate. |
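
Conceptually (this is a sketch of the mechanism, not NuttX's actual
implementation, and it ignores window scaling), the window advertised
in each ACK simply reflects how much read-ahead buffer space the
receiver still has free:

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical helper: the value placed in the TCP header's window
    * field is bounded by the free read-ahead space.  When the
    * application falls behind, the free space shrinks, the advertised
    * window shrinks with it, and the peer holds off instead of sending
    * data that would only have to be dropped.
    */

   static uint16_t advertised_window(size_t free_readahead_bytes)
   {
     return (free_readahead_bytes > UINT16_MAX)
            ? UINT16_MAX
            : (uint16_t)free_readahead_bytes;
   }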
| |
| So the issues for good reception are buffering and processing |
time. Buffering again applies to handling within the driver,
but unlike the transmit case, this is not typically
the bottleneck. There is also a NuttX configuration
option that controls `read-ahead` buffering of TCP packets.
The buffering in the driver must be optimized to avoid lost
packets; the `read-ahead` buffering can be tuned to minimize
the number of packets dropped because we have no space to buffer them.
| |
But the key to receive performance is management of processing
delays. Small processing delays can occur in the network
driver or in the TCP stack, but the major source of
processing delay is the application that is the ultimate
consumer of the incoming data. Imagine, for example,
an FTP application that is receiving a file over a
TCP connection and writing the file into FLASH memory. The
primary bottleneck here will be the write to FLASH memory,
which is outside the control of the software.
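
One common way to soften this kind of bottleneck is to decouple the
slow FLASH writes from the socket reads so that ``recv()`` is called
again as quickly as possible. The sketch below only illustrates that
structure: ``enqueue_for_flash_write()`` is a hypothetical hand-off to
a lower-priority worker (assumed to copy the data before returning),
and, as the next paragraph explains, this only helps if the average
processing rate still keeps up with the link:

.. code-block:: c

   #include <sys/types.h>
   #include <sys/socket.h>

   #define CHUNK_SIZE 1460   /* Roughly one full TCP segment */

   /* Hypothetical: queues a copy of the data for a worker thread that
    * performs the slow FLASH write.
    */

   extern void enqueue_for_flash_write(const char *data, size_t len);

   static void receive_file(int sockfd)
   {
     static char chunk[CHUNK_SIZE];
     ssize_t nrecvd;

     /* Drain the socket promptly so that the stack's read-ahead buffers
      * are freed and the advertised receive window stays open.
      */

     while ((nrecvd = recv(sockfd, chunk, sizeof(chunk), 0)) > 0)
       {
         enqueue_for_flash_write(chunk, (size_t)nrecvd);
       }
   }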
| |
We obtain optimal receive performance when the processing
rate keeps up with the rate of the incoming packets.
If the processing data rate is even slightly slower
than the receive data rate, then there will be a
| growing `backlog` of buffered, incoming data to be |
| processed. If this backlog continues to grow then |
| eventually our ability to buffer data will be exhausted, |
| packets will be held off or dropped, and performance |
will deteriorate. In an environment where a high-end
remote peer is interacting with a low-end embedded
system, that remote peer can easily overrun the
embedded system because of the embedded system's limited
buffering space, much lower processing capability,
and slower storage peripherals.