For many commercial and personal applications, Internet mail is sufficiently reliable to be trusted and treated as if it were 100% reliable. For some applications, such as aviation, military, and key government communications, this is not good enough.
This paper looks at what is needed to provide highly reliable message transport: reliably taking a message from its originator and delivering it to the recipient(s).
It examines what is needed to provide 'Medium' and 'High' grade message processing, X.400's role in providing 'system' reliability over and above what we refer to as 'component' reliability, and the message transport system's role in minimizing message delivery failures. Finally, it lists the capabilities that should be examined when considering components for a reliable messaging solution.
How Reliable is Reliable?
In practical experience, Internet email is pretty reliable, although most people can cite examples of messages getting lost. It is hard to measure loss, but we believe it is of the order of 1 message in 1,000 (0.1%). Some independent measurements put the number much higher than this. It is almost certainly getting worse, because of the wide deployment of anti-spam filters, which can sometimes remove "real" messages. This sort of reliability is good enough for much commercial and personal use: if a message is really important, it will get resent. Internet email is certainly more reliable than paper mail (postal services) or the average human receiving email.
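To make the scale of this concrete, the following minimal sketch works through the arithmetic. It assumes the 0.1% figure above and, as a simplification not made in this paper, that each loss is independent of the others. The function names are illustrative only.

```python
def expected_losses(messages_sent: int, loss_rate: float = 0.001) -> float:
    """Expected number of lost messages at the given per-message loss rate."""
    return messages_sent * loss_rate


def prob_at_least_one_loss(messages_sent: int, loss_rate: float = 0.001) -> float:
    """Probability that at least one message is lost.

    Assumes each message is lost independently with probability `loss_rate`.
    """
    return 1.0 - (1.0 - loss_rate) ** messages_sent
```

Under these assumptions, an originator sending 1,000 messages is more likely than not to lose at least one of them, which is tolerable for casual use but not for operational traffic.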
While Internet email is suitable for a wide range of situations, there are environments where it is not, as the UK Defence establishment's categorization of messaging types illustrates:
|Messaging Type||Feature Summary|
|High Grade: Designed to meet stringent requirements for integrity, non-repudiation and archiving.||"fire and forget"|
|Medium Grade: Designed to provide a general purpose service for informal non-operational messaging tasks.||"fire and watch"|
|Public Service: A public messaging service is controlled and managed by a commercial organization (e.g. Internet).||"fire and hope"|
'System' versus 'Component' Reliability
Another way to look at message handling is from the viewpoint of the service provider. Consider a typical Internet email service provider, offering a "carrier grade" service, and how it might be expected to handle the following problems:
- An organization to which the service provider is delivering mail does not have its servers available.
- An individual customer of the service provider does not read his email for an extended period.
A typical ISP would take the following attitude to these problems:
- It should keep trying to send the email to the organization where the server is not available. In the event of a longer delay, the ISP may send a warning to the message sender, and will eventually time out the message and return it to the sender.
- The ISP has completed its service obligation to the customer by delivering the message. It is the customer's choice whether to read a delivered message.
This is a reasonable approach given the ISP's role in normal message delivery. It could be characterized as a component-based approach, in that the ISP is primarily concerned with its narrow role in message routing rather than the success of the message delivery system as a whole.
Contrast this with an aviation service provider handling messages containing flight plans. The ISP approach would be quite unacceptable, because critical information might not reach the end recipient, which could have significant operational consequences. In these situations the aviation service provider would want to:
- Contact the receiving organization, to get the message through in some manner (if necessary by printing it out and faxing it).
- If a message has not been read, take action to ensure that it was dealt with.
The aviation service provider thus defines reliability at the system level rather than the component level.
Reliable Messaging and X.400
Most organizations seeking high reliability, of the nature described in our aviation example, choose to use X.400. X.400 was designed with this type of service in mind, and has a number of useful features and capabilities to help provide a reliable service. A technical analysis of why X.400 is good for high reliability messaging is provided in a companion white paper [Why X.400 Is Good for High Reliability Messaging].
Clearly the general Internet email service cannot provide this sort of reliability. However, it would be possible to use Internet messaging protocols to provide a more reliable system. Some of the features described here can be achieved with Internet messaging, and others rely on X.400.
What Can Go Wrong?
There are many ways in which a message can fail to get delivered from originator to recipient. These include:
- Software bugs causing internal message loss.
- Hardware failure (e.g., disk head crash or CPU failure).
- System inaccessibility, perhaps due to network failure.
- Recipient not at an appropriate terminal.
- Traffic congestion causing undue delay.
- Operator error, leading to message deletion or mis-routing.
- Malicious attack on systems causing message loss or diversion.
These problems cannot be entirely eliminated. Reliable message transport needs to work despite the risk of such failures.
Basic Principles of Multi-Provider Deployment
Messaging infrastructure is usually provided by more than one service provider. As this situation is more complex than the single-provider case, we focus here on multi-provider deployment. In this situation, a service provider has to do two basic things:
- Look after its own servers and clients. A system will be only as robust as its weakest link, and every participant is responsible for ensuring robust operation of its own components.
- Watch for failures in peer service providers. A service provider should assume that there will be external failures, and take steps to identify and compensate for them.
Approaches to achieving reliability are now considered.
Given that single components can fail, a basic strategy to provide reliability is to duplicate systems wherever possible, so that there is no single point of failure. Significant points where redundancy directly relates to messaging, and recommended approaches to these potential failure points, are:
|Potential Failure Point||Redundancy Solution|
|Site||If a complete site fails, there needs to be an off site hot standby (disaster recovery) system.|
|Message Switches||Message Switches can be configured to operate in parallel, to provide redundancy.|
|Message Routing||Message routing should be configured to take advantage of local and remote systems with redundant message switches, and to enable other redundant routing configurations.|
|Directory Servers||Directory servers holding shadow copies of data can provide redundancy for applications needing read-only access.|
|Disks||Servers that have unique data (Message Store, Message Switch, Master Directory Server) should store data on RAID (Redundant Array of Independent Disks) systems.|
|CPU||Servers that have unique data should use fail-over clustering to guard against risk of CPU or other computer system failure.|
In the event of message loss, it is important that operators can do something to reinstate the message. The mechanism to achieve this is provision of a message archive on each message switch, so that any message that has been transferred can be recovered. It is important to be able to search the message archive, based on information from the message audit logs.
It is vital that a service provider monitors servers. While there are many aspects to monitoring, from a reliability viewpoint there are four key issues:
- Up/down status of servers. If a server fails or is otherwise not working properly, it is critical that an operator is rapidly notified.
- Significant deviations (high or low) of normal operating parameters, such as message arrival rate; message delivery rate; operations per second; number of connections. Such changes are indicators of potential problems in local or remote systems which should be investigated.
- Delay in message transfer (Message Switch). If messages are not being transferred (or delivered to a Message Store), this is a problem. For some services, it is sufficient to warn operators on repeated failure or when the total delay exceeds a limit based on message priority. In other cases, operators should be warned of any message transfer or connection failure.
- Delay in messages being read by recipient (Message Store). If a message has not been read by its intended recipient, the operator may need to be informed.
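The deviation and delay checks above can be sketched as follows. This is a minimal illustration, not a description of any particular product: the three-standard-deviation tolerance and the per-priority delay limits are hypothetical values chosen for the example, and a real service would set them from its own operational requirements. The priority names follow the X.400 priority values.

```python
from statistics import mean, stdev

# Hypothetical limits: maximum acceptable queue delay, in seconds, per priority.
MAX_DELAY_SECONDS = {"urgent": 60, "normal": 600, "non-urgent": 3600}


def deviates(history: list, current: float, tolerance: float = 3.0) -> bool:
    """Return True if `current` is unusually high or low compared with `history`.

    `history` holds recent samples of one operating parameter, for example
    messages received per minute. A value more than `tolerance` standard
    deviations from the mean, in either direction, counts as a deviation.
    """
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > tolerance * sigma


def transfer_overdue(priority: str, queued_seconds: float) -> bool:
    """Return True if a queued message has waited longer than its priority allows."""
    return queued_seconds > MAX_DELAY_SECONDS[priority]
```

Checking for low values as well as high ones matters here: a message arrival rate that drops to near zero may indicate a failed peer just as clearly as a sudden flood.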
Monitoring actions can lead to local actions (e.g., restarting a server) or external functions (e.g., telephoning the operator of a remote system). It can also lead to using internal management actions provided by the messaging servers. These are considered in the next two sections.
Management Actions: Message Switch
When a message is delayed on a message switch because it cannot be transferred to a remote system (or transferred or delivered to a local system), and actions to correct access to the remote system are not possible or appropriate, a number of capabilities in the message switch are desirable:
|Force Alternate Route||When a remote Message Switch is not available, and the "normal" alternate routes do not work, this approach configures an alternate route which is used for an interim period.|
|Re-route||In some failure scenarios, a more complex routing change is needed than simply a new alternate route for one destination. This can be achieved by reconfiguring the routing, and then forcing queued messages to do a full routing calculation based on the new routing configuration (re-route).|
|Non-Deliver||Where onward delivery is not possible, the operator should be able to non-deliver a message, as a mechanism to alert the message originator to the problem.|
|Operator Forward||Where delivery to a recipient is not possible in a timely manner, the operator should be able to forward the message to an alternate recipient.|
|Operator View||The operator should be able to view the message, so that it can be delivered by an alternate mechanism (e.g., fax).|
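The first two actions in the table, Force Alternate Route and Re-route, can be sketched as below. This is a simplified model under stated assumptions: the `RoutingTable` class, its fields, and the host names used with it are hypothetical, and real message switches hold far richer routing configuration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RoutingTable:
    # destination -> ordered list of next hops (normal route first, then alternates)
    routes: dict
    # destination -> next hop forced by the operator for an interim period
    forced: dict = field(default_factory=dict)

    def next_hop(self, destination: str, available: set) -> Optional[str]:
        """Pick the next hop for a destination, or None if the message must stay queued."""
        forced_hop = self.forced.get(destination)
        if forced_hop is not None and forced_hop in available:
            return forced_hop
        for hop in self.routes.get(destination, []):
            if hop in available:
                return hop
        return None


def reroute(queued_destinations: list, table: RoutingTable, available: set) -> dict:
    """Redo the full routing calculation for every queued destination."""
    return {dest: table.next_hop(dest, available) for dest in queued_destinations}
```

In this sketch, forcing an alternate route is a single entry in `forced`, while a re-route corresponds to editing `routes` and then calling `reroute` so that messages already in the queue pick up the new configuration.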
A message switch capability that increases reliability, and is not related to delayed messages, is the ability to monitor non-delivered messages. When an error occurs, the message will be non-delivered. It is useful for the local operator to note this and review the non-delivered message. In some situations, it may be useful for the operator to forward this message to a local recipient, in order to meet the intentions of the originator.
Management Actions: Message Store
Where a message has been delivered to a mailbox, but not read by the intended recipient, action may need to be taken to get the message to another recipient.
|Auto-forward on timeout||The message store can support an auto-action to forward an unread message after a specified period of time. It can be forwarded either to the alternate recipient specified by the X.400 originator or to an alternate recipient specified by the recipient.|
|Operator forward||Where automatic forwarding is not appropriate, manual forwarding by the operator is also possible. Forwarding can be to one of the above addresses or to an operator-specified recipient.|
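The auto-forward decision can be sketched as follows. The `StoredMessage` record and its field names are hypothetical, and the choice to prefer the originator's alternate over the recipient's is an assumption made for this example; the text above allows either.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredMessage:
    message_id: str
    delivered_at: float  # delivery time, in seconds since the epoch
    read: bool = False
    originator_alternate: Optional[str] = None  # alternate named by the originator
    recipient_alternate: Optional[str] = None   # alternate named by the recipient


def is_overdue(msg: StoredMessage, now: float, timeout_seconds: float) -> bool:
    """True if the message is still unread after the timeout has elapsed."""
    return (not msg.read) and (now - msg.delivered_at) >= timeout_seconds


def auto_forward_target(msg: StoredMessage, now: float,
                        timeout_seconds: float) -> Optional[str]:
    """Return the address to auto-forward to, or None if no auto-forward applies.

    Assumption: the originator's alternate is tried before the recipient's.
    """
    if not is_overdue(msg, now, timeout_seconds):
        return None
    return msg.originator_alternate or msg.recipient_alternate
```

A message that is overdue but has no alternate recipient yields `None` from `auto_forward_target`; in that case `is_overdue` still reports it, and it is a candidate for operator forwarding.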
All of the techniques described will help build a reliable system. It is important to record and measure reliability. To do this, message switches and message stores should provide an audit log. The key point at which reliability should be measured is at the submitting MTA (Message Transfer Agent), where the message enters the message transfer system. To audit for reliability it is essential that the message transfer system requests positive delivery notification, even if this is not requested by the message originator. Reliability can then be measured by correlating delivery report audit log entries with the original messages.
Audit logs on the recipient's message store can be used to verify that all delivered messages are read by the recipients.
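The correlation of delivery reports with submitted messages can be sketched as below. It is a deliberately simplified model: it assumes the audit log has already been reduced to sets of message identifiers, and that each message has a single recipient. The function names are illustrative.

```python
def delivery_reliability(submitted: set, delivery_reports: set) -> float:
    """Fraction of submitted messages for which a positive delivery report was logged."""
    if not submitted:
        raise ValueError("no submitted messages to measure")
    confirmed = submitted & delivery_reports  # ignore reports for unknown messages
    return len(confirmed) / len(submitted)


def unconfirmed(submitted: set, delivery_reports: set) -> set:
    """Messages that were submitted but have no matching delivery report."""
    return submitted - delivery_reports
```

For example, four submitted messages with matching reports for three of them give a measured reliability of 0.75, and `unconfirmed` identifies the message that needs investigation.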
Tracking and Real Time Measurements
Audit log analysis can simply be used retrospectively to show reliability. It can also be used in real time, to detect messages which are sent out and for which no delivery reports come back in an appropriate time frame. Lack of a delivery report could be used to trigger manual intervention (e.g., to use message tracking capabilities and audit log information on servers on the message route to work out what happened to the message). It would generally be desirable to do this in order to determine the nature of the failure, and to review mechanisms to prevent this from happening again.
The error could also be used to automatically resend the message. A reliable MTA should be archiving copies of all traffic sent. A lost message may be retrieved from the archive, and resubmitted.
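The real-time check and the resend from archive can be sketched together. This is a minimal illustration under stated assumptions: submission times and delivery reports are taken to be available as in-memory structures, the archive is modelled as a dictionary, and `submit` stands for whatever hypothetical function hands a message back to the MTA.

```python
from typing import Callable


def overdue_messages(submitted_at: dict, reported: set,
                     now: float, deadline_seconds: float) -> list:
    """Message ids with no delivery report within the deadline, in sorted order."""
    return sorted(
        message_id
        for message_id, sent in submitted_at.items()
        if message_id not in reported and now - sent > deadline_seconds
    )


def resubmit_from_archive(message_ids: list, archive: dict,
                          submit: Callable[[bytes], None]) -> list:
    """Resubmit archived copies; return the ids that were not found in the archive."""
    missing = []
    for message_id in message_ids:
        content = archive.get(message_id)
        if content is None:
            missing.append(message_id)  # needs manual investigation
        else:
            submit(content)
    return missing
```

An automatic resend can result in the recipient getting the same message twice if only the delivery report was lost, so a deployment would need to decide whether that is acceptable for its traffic.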
Checking before you send
Sending a message to an invalid recipient, or to a recipient that cannot handle the message, is undesirable. It is therefore good practice to verify the recipient, and the recipient's message handling capabilities, before sending a message. X.400’s support for directory names as a part of the O/R name greatly facilitates this.
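A pre-send check of this kind can be sketched as follows. The directory is modelled here as a plain dictionary mapping a directory name to the set of content types the recipient can handle; the entry, the content type labels, and the result strings are all hypothetical, and a real deployment would query a directory server instead.

```python
# Hypothetical directory data: directory name -> content types the recipient accepts.
DIRECTORY = {
    "cn=Tower,o=Example Airport,c=GB": {"text", "flight-plan"},
}


def check_recipient(directory: dict, recipient_name: str, content_type: str) -> str:
    """Check a recipient before sending.

    Returns "ok", "unknown-recipient", or "unsupported-content".
    """
    capabilities = directory.get(recipient_name)
    if capabilities is None:
        return "unknown-recipient"
    if content_type not in capabilities:
        return "unsupported-content"
    return "ok"
```

Distinguishing the two failure cases lets the sending client either correct the address or convert the content before submission, rather than discovering the problem from a non-delivery report.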
A useful X.400 capability is the ability of the originator to specify an alternate recipient. In the event that there are delivery problems to the originally intended recipient, the message may be delivered to this alternate recipient, or (if permitted by the originator) to an alternate recipient assigned by the receiving management domain. This approach increases operational reliability by increasing the options to get a message to a viable recipient.
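The fallback order described above can be sketched as a small selection function. The parameter names are illustrative, and the set `deliverable` is a stand-in for whatever the receiving system knows about which recipients can currently accept delivery.

```python
from typing import Optional


def choose_recipient(intended: str, deliverable: set,
                     originator_alternate: Optional[str] = None,
                     domain_alternate: Optional[str] = None,
                     allow_domain_alternate: bool = False) -> Optional[str]:
    """Pick the recipient a message should be delivered to.

    Tries the intended recipient, then the originator's alternate, then
    (only if the originator permits it) the alternate assigned by the
    receiving management domain. Returns None if no option is viable.
    """
    if intended in deliverable:
        return intended
    if originator_alternate is not None and originator_alternate in deliverable:
        return originator_alternate
    if (allow_domain_alternate and domain_alternate is not None
            and domain_alternate in deliverable):
        return domain_alternate
    return None
```

A result of `None` corresponds to the case where the message has to be non-delivered or handled by an operator.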
End to End Security
End to end security is an important component of reliable messaging for two reasons:
- It enables the recipient to verify that the content of the message has not been changed (message integrity).
- It enables the recipient to securely verify the originator of the message.
The X.400 world uses two end to end security protocols to achieve these goals:
- Security features based on X.509 that are a part of the X.400 standards. AMHS uses these features.
- CMS (Cryptographic Message Syntax) is used by STANAG 4406, which is the military messaging specification.
Message confidentiality is often a part of end to end security. This will not help reliability, and may reduce it, as it prevents an operator from reading the message and delivering it by mechanisms other than X.400.
X.400 and X.500 allow clients and servers to use peer authentication based on X.509. This will not significantly improve reliability under normal operating conditions, but will help protect the system from malicious attack, and add security against various threats.
Isode Product Capabilities
This paper has described generic capabilities to provide reliable message transfer. Isode's X.400 products, M-Switch X.400 and M-Store X.400, support many of these features, and the remainder are planned for future releases. This is set out below:
|Site Failure||Off site hot standby (disaster recovery) system.|
|Redundant Configuration||The redundancy solutions described for Message Switches; Message Routing; Directory Servers; Disks; and CPU are all supported.|
Message archive is supported.
An operator tool to search the archive and forward messages is supported.
Message Switch and Message Store monitoring capabilities are supported.
Locally generated delivery reports can be monitored, and the rejected messages may be viewed and/or forwarded.
|Message Switch Management Actions||
Force Alternate Route; Re-Route; and Non-Deliver actions are supported.
Operator forward, Operator View and non-delivery report monitoring
|Message Store Management Actions||Auto-forward on timeout and operator forward are supported.|
The message audit database contains the necessary information for an Isode customer to perform these measurements.
Tools to audit reliability are planned.
|Real Time Measurements||The message audit database can be used to support real time monitoring.|
|Checking before you send||Directory capabilities for recipient verification and capability verification are supported.|
|Alternate recipient||Isode servers correctly support alternate recipient.|
|End to End Security||End to end security using military protocols is supported using partner products. End to end security using the AMHS standards as part of the Isode client API development kits is supported.|
|Peer Authentication||Peer authentication using X.400 P1 strong authentication is supported.|
This paper has explained the requirements for reliable message transport, and set out a number of techniques to provide a reliable message transport infrastructure. It also explains how Isode products can be used to achieve these goals.