This whitepaper looks at message transfer over HF Radio, and at how the ACP 142 protocol can use flow control and timers to achieve optimal performance. HF Radio can be an unreliable channel, so it is important that performance is optimized in the event of channel failures. Use of timers to deal with failures is considered in detail.


Core Technologies

This paper is focused on an open standards approach, and a set of protocols designed for HF Radio.

STANAG 5066

STANAG 5066 provides data link services over HF Radio and Modem. STANAG 5066 has two modes of data transfer:

  • ARQ provides reliable point to point data transfer.
  • Non-ARQ provides unreliable data transfer, both point to point and broadcast.

These two modes need different application protocols for optimized transfer.

Connection Oriented ACP 142

For ARQ, Isode's Connection Oriented ACP 142 is ideal. Because the STANAG 5066 layer gives reliable data transfer, the application level timers discussed in this paper are not needed, and so this protocol is not discussed further here. Isode's M-Switch enables this protocol to be used in conjunction with ACP 142, so that the best transfer protocol can be used.

ACP 142

ACP 142 is designed to operate over non-ARQ links, providing support for multicast messaging and EMCON (radio silence). Given the broadcast nature of HF radio, it is very sensible to use a multicast application to optimize message transfer over HF Radio. ACP 142 was designed to support STANAG 4406 messaging. Isode's M-Switch supports this, and also transfer of SMTP messages over ACP 142. This paper looks at ACP 142 and its operation over STANAG 5066.

Rate and Flow Control

A key issue for bulk transfer protocols such as messaging over a constrained link such as HF is to ensure that transfer occurs fast enough to fully utilize the link, without leading to loss of data or duplicate transmissions.

Mechanisms that do not work well over HF

Windowing is a common application flow control mechanism, and widely deployed in the TCP protocol. It works very badly over HF radio, because:

  • The mechanism copes well when packets are dropped due to network congestion (it slows down by closing the window, which is the right behavior), but copes badly when packets are dropped on the link (where slowing down is not the right behavior).
  • The very high latency of HF connections means that the window mechanism is very slow to adjust.
  • Buffering of packets waiting to go over the slow link causes issues.

This is discussed in detail in the paper [Performance Measurements of Applications using IP over HF Radio].

Another mechanism is to use rate control. This will not work well over HF because:

  • The very high latency makes it very hard to share rate control information in a timely manner.
  • The bursty nature of HF data transfer with variable gaps makes it hard to measure transfer rate.

So an alternative is needed to these standard end to end approaches.

STANAG 5066 Flow Control

STANAG 5066 provides flow control to the application, telling the application when it can send and receive. Because the STANAG 5066 layer transmits directly, it can use this flow control to ensure that there is always data to transmit (so the pipe is kept full) and prevent the application from sending too much data.

This approach is ideal, and is used by Isode's ACP 142 implementation when sending data over STANAG 5066.

Operation over UDP

ACP 142 can send data directly over STANAG 5066, using the UDOP (Unreliable Datagram Oriented Protocol) defined in STANAG 5066. This is the best approach and STANAG 4406 has this as the preferred option.

STANAG 4406 also allows ACP 142 operation over UDP over IP, which in turn can operate over STANAG 5066 using the IPClient protocol defined in STANAG 5066. Disadvantages of this approach are:

  • Higher protocol overhead.
  • An additional layer of queuing with IPClient.
  • No flow control mechanism.

However, there are sometimes requirements to support this stack for interoperability reasons. The first two disadvantages lead to some additional overhead, but not to major problems. The lack of flow control leads to major performance loss relative to direct operation over STANAG 5066.

To address this, Isode and Isode partner AeroMaritime have developed a flow control protocol for use with IPClient. The ACP 142 server will send UDP datagrams to IPClient. The ACP 142 server will also maintain a TCP connection to the IPClient process, which enables the IPClient to tell the ACP 142 server when it can send data. The IPClient will send data over the STANAG 5066 SIS protocol to the STANAG 5066 server. The SIS protocol will provide flow control information to IPClient. IPClient uses this flow control information to flow control the ACP 142 server. This architecture means that the transmission pipe can be kept full without overloading.
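The gating idea behind this architecture can be sketched as follows. This is a minimal illustration only, not the Isode/AeroMaritime protocol: the class name GatedSender, the flow_control callback, and the rule of releasing one datagram per "go" signal are all assumptions made for the example, and the TCP control connection is reduced to a plain method call.

```python
from collections import deque


class GatedSender:
    """Holds datagrams at the application until the lower layer signals it can accept more.

    Hypothetical sketch: the real control protocol and its messages are not shown here.
    """

    def __init__(self, send_datagram):
        self._send = send_datagram   # callable that hands one datagram to the lower layer
        self._pending = deque()      # datagrams still held by the application
        self._may_send = False       # last flow-control state reported from below

    def submit(self, datagram: bytes) -> None:
        """Queue a datagram and send it at once if the lower layer is ready."""
        self._pending.append(datagram)
        self._drain()

    def flow_control(self, may_send: bool) -> None:
        """Called when the (assumed) control channel reports a state change."""
        self._may_send = may_send
        self._drain()

    def _drain(self) -> None:
        # Assumption: one datagram is released per "go" signal, so the queue
        # below the application stays at zero or one packet.
        if self._may_send and self._pending:
            self._send(self._pending.popleft())
            self._may_send = False
```

The design point is that data waits in the application, where timers can account for it, rather than in a queue below the application whose length is not visible.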

Dealing with Errors

If there was no data loss, the protocols would be very straightforward. The rest of this paper looks at how ACP 142 deals with data loss, and the interaction of this approach with STANAG 5066.

Why Application Timers

ACP 142 operates by sending datagrams (either STANAG 5066 UDOP or UDP). Transfer of this data is inherently unreliable, and will be mapped to non-ARQ data in STANAG 5066, which means that any data sent may not reach the intended peer system or systems.

The ACP 142 application needs to deal with potential data loss. The approach is to use timers, which trigger retransmission after a configured time if expected data or acknowledgements have not been received. This ensures that messages are eventually delivered despite loss on the link.

Why 127.5 seconds is Important for HF Radio

To understand timers in the context of HF Radio, it is important to understand the characteristics of HF Radio data transmission. HF Radio provides Simplex communication, with one HF node transmitting at a time. Data transfer rates are from 75 to 9600 bits per second. A key characteristic is that there is a long turnaround time between nodes transmitting. This means that it is crucial to transmit for a reasonably long period of time. STANAG 5066 has a maximum transmit time of 127.5 seconds (just over two minutes). When optimizing for bulk transfer, such as email, it is generally desirable to use this full transmit window, as use of much shorter transmit times will significantly reduce overall throughput. This means that in a busy network, each node will transmit in turn for about two minutes, with a gap between each transmission which might be 10 or 20 seconds (or longer). In an n-node system, the delay between a system stopping transmitting and its next transmission can be as much as 2*(n-1) minutes.
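The arithmetic behind these figures can be sketched as below. The 127.5 second window and the 75 to 9600 bits per second range come from the text; the strict round-robin ordering, the 15 second default gap, and the function names are assumptions made for illustration, and protocol overhead is ignored.

```python
MAX_TX_SECONDS = 127.5  # STANAG 5066 maximum transmit time, as stated in the text


def bytes_per_window(bits_per_second: float, tx_seconds: float = MAX_TX_SECONDS) -> float:
    """Raw bytes that fit in one transmit window, ignoring all protocol overhead."""
    return bits_per_second * tx_seconds / 8


def worst_case_wait(nodes: int, tx_seconds: float = MAX_TX_SECONDS,
                    gap_seconds: float = 15.0) -> float:
    """Seconds from a node ending its transmission until it can transmit again.

    Assumes every other node uses a full window in turn (round robin) and that
    each turnaround costs gap_seconds.
    """
    if nodes < 2:
        raise ValueError("need at least two nodes")
    others = nodes - 1
    # One gap before each of the other nodes transmits, plus one before our own turn.
    return others * tx_seconds + nodes * gap_seconds
```

Under these assumptions, bytes_per_window(9600) is 153000.0 bytes and bytes_per_window(75) is 1195.3125 bytes, while worst_case_wait(4) is 442.5 seconds. The last figure is somewhat more than the 2*(n-1) = 6 minute approximation because it includes the turnaround gaps.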

Why Short Queues are Important

Control of missing data is done by timers at the application level. Of necessity for HF radio, data is queued at levels below the application. If these queues become long, the time to transmit data in these queues becomes longer and less predictable. This makes it hard to set application timers.

The STANAG 5066 queue is of particular importance. To fill the transmit buffer, a queue holding two minutes of data is needed. This seems a long queue, but is actually appropriate for HF Radio. The RC66 server from Isode's partner RapidM has queues of four minutes (two transmit buffers), which is reasonable. Some competing products have much longer queues, which make them almost impossible to tune effectively with ACP 142.

If IP is used, there will also be a queue at the IP level. With the AeroMaritime IPClient, using the Isode/AeroMaritime flow control protocol, the queue will typically be zero or occasionally one packet. This avoids the problems of long queues.

How ACP 142 Works

In order to understand the key timers, it is essential to understand how ACP 142 works. This section gives a very brief description. For further information, it is recommended to read the ACP 142 protocol specification.

ACP 142 transmits a block of data (file). The block is broken up into Data PDUs, with size appropriate for the datagram service (UDOP or UDP) that is used to transfer the data.

The first PDU sent is an "Address PDU". This specifies the length of the message, and the set of destinations for the message; ACP 142 is a multicast protocol. It also specifies if the transmission to a given destination is complete. This is important for subsequent uses of the Address PDU.

Then each Data PDU is sent. Under normal conditions, nothing is sent back during this transfer.

When a receiver receives the last Data PDU, it completes the data transfer and sends back an Ack.

When the sender receives an Ack, it marks the message as sent to that destination.

Then the sender sends an Address PDU, indicating that the message has been completely sent. The receiver then knows that the sender has received the Ack and can clear its information on the message.

It can be seen that this process is very efficient, in that very little data in addition to the message data is sent. It is also highly asynchronous, and works well in an HF environment with very long turnaround times. Where there is no data loss, the message is transferred onwards without any end to end handshaking. There is a handshake to ensure reliability, but this happens after the message transfer is completed.
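The loss-free exchange described above can be sketched as an ordered list of events. This is an illustrative model only: the function names, the dictionary fields, and the tuple layout are assumptions made for the example and do not reflect the PDU encodings in the ACP 142 specification.

```python
def segment(message: bytes, pdu_payload_size: int) -> list[bytes]:
    """Split a message into Data PDU payloads of at most pdu_payload_size bytes."""
    if pdu_payload_size <= 0:
        raise ValueError("pdu_payload_size must be positive")
    return [message[i:i + pdu_payload_size]
            for i in range(0, len(message), pdu_payload_size)]


def lossless_exchange(message: bytes, destinations: list[str],
                      pdu_payload_size: int) -> list[tuple]:
    """Return ordered (direction, pdu_type, detail) events for a transfer with no loss."""
    payloads = segment(message, pdu_payload_size)
    # Initial Address PDU: message length and destination list, nothing complete yet.
    events = [("sender->all", "Address",
               {"length": len(message), "destinations": destinations, "complete": []})]
    # Data PDUs flow one way; nothing comes back during the transfer.
    for seq, payload in enumerate(payloads, start=1):
        events.append(("sender->all", "Data",
                       {"seq": seq, "last": seq == len(payloads), "size": len(payload)}))
    # Each receiver sends an Ack once it has the last Data PDU.
    for dest in destinations:
        events.append((f"{dest}->sender", "Ack", {}))
    # Final Address PDU tells receivers that their Acks arrived.
    events.append(("sender->all", "Address",
                   {"destinations": destinations, "complete": list(destinations)}))
    return events
```

For a 2500 byte message to two destinations with 1000 byte payloads, the model gives one Address PDU, three Data PDUs, two Acks, and a closing Address PDU. Only the Acks travel against the direction of the data.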

This basic understanding of ACP 142 enables the core timers to be understood.

Transmission Errors from HF Radio

HF communication may be affected by a wide range of errors. For the purpose of this analysis, two types of error are important.

  1. Short or "burst" errors where transmission is interrupted for a short period. Data transfer over HF should be using Interleavers and Forward Error Correction to minimize the visibility of this sort of error to the application. However, there will be some short errors that will not be corrected, that will lead to small errors in the data transmission.
  2. Longer periods of error, when nothing is getting through. ACP 142 will often operate over a broadcast setup at fixed frequency where data is sent out, and sometimes it will get through and sometimes it will not. There may be extended periods of nothing getting through.

The timers need to deal with both kinds of error.

Choosing Timer Values

Choosing timers needs to make a trade-off between two factors:

  1. If you set the timer too short, it may be that the data which the timer is treating as lost is simply delayed. The timer will then lead to duplication of data, and inefficient use of the link.
  2. If the timer is set very long, then the risk of data duplication is very low, but the time to recover from data loss is longer.

Care needs to be taken in this choice. Considerations include:

  • The amount of data duplication that an early timer will cause.
  • The operational preference for timely delivery vs efficient channel utilization.

A detailed discussion of ACP 142 timers is provided in the Isode Deployment Note [ACP 142 Parameters for Radio and Satellite Networks]. The rest of this paper looks in detail at the four key timers.

Data Loss

Loss of Data PDUs is the most likely event, as they are larger (so more likely to be hit by burst errors) and there are more of them in a typical transfer. Handling of lost Data PDUs is done on the receiver side.

In normal operation, with a steady flow of Data PDUs, one or more may go missing. The receiver can send a NACK back to the sender, indicating which Data PDUs are missing. The NACK will cause the missing data to be sent again. Without any timer involvement, a NACK will be sent in one of two situations:

  1. When the last Data PDU of the message is received, a NACK is sent back to indicate the list of Data PDUs missing.
  2. When more than a configurable number of Data PDUs are missing, a NACK will be sent in any event.

This process will deal with common cases of data loss and the typical effect of a burst error leading to occasional Data PDU loss.
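The two non-timer NACK triggers can be sketched as a small receiver-side tracker. This is a simplified model under stated assumptions: the class name MissingTracker, the nack_threshold parameter name, and the rule of not re-reporting gaps already covered by a threshold NACK are illustrative choices, not taken from the ACP 142 specification.

```python
class MissingTracker:
    """Tracks received Data PDU sequence numbers for one message (hypothetical sketch)."""

    def __init__(self, total_pdus: int, nack_threshold: int):
        self.total = total_pdus          # number of Data PDUs in the message
        self.threshold = nack_threshold  # configurable limit on missing PDUs
        self.received = set()
        self.nacked = set()              # gaps already reported to the sender
        self.highest_seen = 0

    def missing(self) -> list[int]:
        """Sequence numbers up to the highest one seen that have not arrived."""
        return [n for n in range(1, self.highest_seen + 1) if n not in self.received]

    def on_data_pdu(self, seq: int):
        """Record a Data PDU; return ('ACK', []), ('NACK', missing) or None."""
        self.received.add(seq)
        self.highest_seen = max(self.highest_seen, seq)
        if len(self.received) == self.total:
            return ("ACK", [])           # everything has arrived
        gaps = self.missing()
        if seq == self.total:
            # Trigger 1: last Data PDU received while some are still missing.
            self.nacked.update(gaps)
            return ("NACK", gaps)
        new_gaps = [n for n in gaps if n not in self.nacked]
        if len(new_gaps) > self.threshold:
            # Trigger 2: more than the configured number of PDUs are missing.
            self.nacked.update(gaps)
            return ("NACK", gaps)
        return None
```

With six Data PDUs and a threshold of two, receiving PDUs 1 and then 5 leaves three gaps and produces a NACK for 2, 3 and 4 straight away, rather than waiting for the end of the message.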

There is also a receiver side timer to protect against data loss. This is done by estimating when the Last Data PDU will arrive. If the Last Data PDU has not arrived in this time, a NACK is sent, covering all of the missing Data PDUs. This time is calculated by looking at the greater of two numbers:

  1. A value calculated based on the rate of arrival of all the Data PDUs received. For a large message, this will generally give a reasonably accurate estimate of when the Last Data PDU will arrive.
  2. A configurable value, which is also the default value before any Data PDUs arrive. The reason for this is that Data PDUs will generally arrive in two minute bursts (in line with the HF transmit window) and then gaps which will often be several minutes. This means that the initial data rate will be high, and will lead to an unrealistically short timer value. This value needs to be set to allow for a transmit gap.

Once this final NACK has been sent, the receiver side control starts again, treating the set of missing PDUs as the "whole message". In the event of the NACK being lost, the Last Data PDU timer will cause it to be sent again.
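The "greater of two numbers" calculation can be sketched as follows. The function and parameter names are hypothetical, and the rate estimate used here (average interval between the first and latest arrivals) is one simple choice made for illustration; an implementation may estimate the arrival rate differently.

```python
def last_pdu_wait(first_arrival: float, latest_arrival: float,
                  pdus_received: int, pdus_expected: int,
                  configured_minimum: float) -> float:
    """Seconds to wait, from the latest arrival, before NACKing for the last Data PDU.

    Times are in seconds on any monotonic clock. configured_minimum is the
    configurable value, which should allow for a transmit gap.
    """
    intervals = pdus_received - 1
    elapsed = latest_arrival - first_arrival
    if intervals < 1 or elapsed <= 0:
        # No usable rate yet: fall back to the configured default.
        return configured_minimum
    remaining = pdus_expected - pdus_received
    # Rate-based estimate: remaining PDUs times the average observed interval.
    estimate = remaining * elapsed / intervals
    return max(estimate, configured_minimum)
```

During an initial burst the rate-based estimate is far too short, so the configured value wins. For example, with 11 PDUs received in 10 seconds, 21 expected, and a configured value of 300 seconds, the estimate is 10 seconds and the function returns 300. Over a long transfer that has already spanned transmit gaps, the estimate dominates.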

Final ACK Loss

At the end of a message, when the message has been received and transferred there is an Ack sent by the receiver and an Address PDU sent back by the sender. This is important to clear the message, and also to prevent sender side timers from being activated. If an Ack is not acknowledged, there is an "Ack Respond Timer" to deal with loss of either the Ack or the Address PDU. This timer will lead to the Ack being sent again.
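The behavior of this timer can be sketched as a small receiver-side state holder. The class and method names are hypothetical, and driving the timer by a periodic tick call is an assumption made to keep the example self-contained.

```python
class AckResender:
    """Receiver-side resend of an Ack until the sender confirms it (illustrative sketch)."""

    def __init__(self, ack_respond_timer: float, send_ack):
        self.timer = ack_respond_timer   # seconds to wait for the confirming Address PDU
        self.send_ack = send_ack         # callable that transmits one Ack
        self.deadline = None             # None means no Ack is outstanding

    def message_complete(self, now: float) -> None:
        """All Data PDUs have arrived: send the Ack and start the timer."""
        self.send_ack()
        self.deadline = now + self.timer

    def on_address_pdu(self, marks_us_complete: bool) -> None:
        """An Address PDU showing this destination as complete clears the state."""
        if marks_us_complete:
            self.deadline = None

    def tick(self, now: float) -> None:
        """Call periodically; resends the Ack if the timer has expired."""
        if self.deadline is not None and now >= self.deadline:
            self.send_ack()
            self.deadline = now + self.timer
```

The receiver cannot tell whether its Ack or the sender's Address PDU was lost, so the same action (sending the Ack again) covers both cases.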

Address PDU Loss

A particularly nasty data loss is when the initial Address PDU is lost. Because the Address PDU is small, it is less likely to be hit by burst errors. If it is lost, a receiver cannot tell if it should be handling the subsequent data PDUs. For a network with a large number of nodes, handling this receiver side would be inefficient, so this is handled by a sender side "Retransmit Timer", which is set when the last Data PDU is sent. If nothing has been received back when this timer goes off, an Address PDU is sent.

Total Message Loss

A final situation to guard against is total message loss. If there is an extended outage, then the sender will have sent all of the message and nothing will have been received. If nothing is received back from the receiver, total loss will be assumed and the full message retransmitted. This retransmission is based on two timers: the "Retransmit Timer" plus the "Retransmission Delay Time". Total message loss looks the same as Address PDU loss (nothing comes back). The second delay allows time for the Address PDU to be acknowledged, so the overhead of message retransmission is avoided when only the Address PDU is lost.

The Retransmit Timer has a configurable back off factor, so that full message retransmission will occur with longer and longer intervals.
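The interaction of the two sender-side timers and the back-off can be sketched as a schedule. The function and parameter names are hypothetical, and applying the back-off factor multiplicatively to the Retransmit Timer after each full retransmission is an assumption made for illustration.

```python
def retransmit_waits(retransmit_timer: float, retransmission_delay: float,
                     backoff_factor: float, cycles: int) -> list[tuple[float, float]]:
    """For each cycle, return (wait before Address PDU, wait before full retransmission).

    Both waits are measured from the moment the last Data PDU of that cycle was sent.
    """
    waits = []
    timer = retransmit_timer
    for _ in range(cycles):
        # Retransmit Timer expiry: resend the Address PDU. If there is still no
        # response after the further Retransmission Delay Time, resend the message.
        waits.append((timer, timer + retransmission_delay))
        timer *= backoff_factor  # assumed multiplicative back-off per cycle
    return waits
```

With a 600 second Retransmit Timer, a 300 second Retransmission Delay Time and a back-off factor of 2, three cycles give waits of (600, 900), (1200, 1500) and (2400, 2700) seconds. These values are examples only, not recommended settings.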

In the event of small amounts of data loss, the receiver side timers provide much more efficient handling than this timer. Therefore, it is important to set the sender side timers rather more conservatively than the receiver side timers, so that total message retransmission will not happen when only some data is dropped.

Conclusions

This paper has explained how ACP 142 and STANAG 5066 can be used to provide optimized reliable data transfer over HF Radio.