This whitepaper describes how Isode X.400 servers can be deployed to support off site disaster recovery. It first looks at the features introduced in M-Store X.400 in R15.1, which are central to the X.400 disaster recovery approach. It then looks at how these features can be used in conjunction with other Isode disaster recovery capabilities to provide disaster recovery for a full X.400 deployment. This approach is appropriate for Aviation (AMHS) and Military (STANAG 4406) deployments.
Why Off Site Disaster Recovery?
All of the Isode servers are very robust, and can be configured with RAID disks and failover clustering so that they are resilient to hardware failure. Deployments can utilize redundant networking. This means that individual servers can be configured with very high availability. This will provide extremely robust operation in most scenarios.
There is a class of failure that even the most robust server cannot cope with. An extreme example is a bomb at the site which simply destroys everything. Another example is a storm breaking all communication links with a site. Solutions to deal with this sort of scenario are referred to as disaster recovery. Disaster recovery is important for many mission critical services.
The broad characteristic of a disaster recovery scenario is “site failure”, so in order to protect against this, it is necessary to have recovery servers at a remote site that can be used in the event of a disaster.
While it is sometimes possible to use local server clustering to provide disaster recovery, this requires a very fast link and restricts distance. Most deployments need an approach that does not constrain distance to the recovery site and does not require special networking.
M-Store X.400 Disaster Recovery
M-Store X.400 is Isode’s X.400 message store, which provides message submission, storage of sent and received messages, and message access.
Disaster Recovery Model
The basic model is that there is an operational M-Store X.400 server, and a second standby server. The standby server may be turned off (cold standby) or operational but not supporting any traffic (hot standby).
No attempt is made to keep the standby server up to date with stored messages while it is on standby. The goal of the standby server is to enable message submission, storage and reception in the unlikely event of the primary server failing.
In the event of a disaster, configuration changes will make the standby server live. Clients can then connect to the standby server, and send and receive messages. When the operational server is restored, the clients are switched back. Messages that have been stored on the standby server can be copied to the operational server, so that a full history of message traffic is kept on the operational server.
This is a simple and robust model. It is vital that the disaster recovery approach does not fail in the event that it is needed.
Configuration in Normal Operation
The primary message store and all of the message store users are configured exactly as they would be if there were no disaster recovery option. This means that there are no changes to a normal configuration, and it is straightforward to add a standby message store to an existing configuration.
The standby server is configured as a standard Isode message store. Setting it up is simply a matter of adding a new message store at the disaster recovery location.
X.400 users are associated with a single message store. In the Isode approach to mailbox management, user configuration is held in the directory and is relatively independent of the message store. The message store used is identified as part of the user configuration (i.e., the user points to the message store, rather than the message store pointing to the user). User configuration includes configuration of message store information, such as auto-actions. This configuration will be used by the standby server when failover occurs.
The operator action to invoke failover is very simple. The configuration of the primary message store is modified to point at the standby message store. When this change is made, the GUI will also modify the configuration of the standby message store, to reflect that it is acting as the primary message store.
Note that the configuration information is held in a directory. Whether the GUI can make this change will often depend on directory failover, discussed below.
This modified configuration will enable a number of things to happen:
- When an M-Switch server has a message to deliver for a user, the configuration change will redirect message delivery to the standby message store.
- When a client connects to the standby message store, the standby store will recognize from the modified configuration that the client is connecting to the correct store. It will be able to provide messages delivered for the client. If the client submits a message, it will store a copy of the submitted message and then submit the message through an M-Switch server.
Essentially, this simple configuration change switches service from the primary message store to the standby message store.
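The effect of this change can be illustrated with a minimal sketch. It is not Isode's implementation or directory schema: the `Directory` class, the entry name `store-primary` and the host names are all hypothetical, chosen only to show why repointing one message store entry redirects every user who points at it.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Toy stand-in for the directory holding messaging configuration."""
    # user name -> name of the message store entry the user points at
    user_store: dict = field(default_factory=dict)
    # message store entry name -> host currently serving that entry
    store_host: dict = field(default_factory=dict)

    def host_for_user(self, user: str) -> str:
        """Resolve a user to the host that currently serves their mailbox."""
        return self.store_host[self.user_store[user]]

    def fail_over(self, store: str, standby_host: str) -> None:
        """Point a message store entry at the standby host (one change)."""
        self.store_host[store] = standby_host


# Hypothetical names throughout; users are configured as in normal operation.
directory = Directory(
    user_store={"alice": "store-primary", "bob": "store-primary"},
    store_host={"store-primary": "ms.primary.example"},
)

before = directory.host_for_user("alice")
directory.fail_over("store-primary", "ms.dr.example")
after = directory.host_for_user("alice")

print(before, "->", after)
```

Because users reference the message store entry rather than a host, no per-user change is needed in this sketch; both delivery and client access resolve through the same entry.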
Sharing a Disaster Recovery Server
A standby message store is simply a “blank” message store that can be used in standby mode. Because of this, it is possible to use one standby server as the potential DR server for two or more operational nodes. The standby server can be used in the event that any one of the operational servers fails.
Full X.400 Disaster Recovery
This paper has so far explained how disaster recovery works for M-Store X.400. This is just one component of an X.400 system, and disaster recovery needs to look at the whole system. This section looks at how disaster recovery of all the components can work between two locations.
Components of an X.400 Deployment
The following components need to be considered:
- Message Store (M-Store X.400) where messages are stored for client access.
- Message Transfer Agent (M-Switch X.400) which handles message submission, delivery, switching and gateways.
- Directory (M-Vault). User, M-Store X.400 and M-Switch X.400 configuration is held in the directory.
- Messaging Clients.
M-Vault Disaster Recovery
Because configuration information for the messaging servers is held in the directory, providing directory disaster recovery is key. M-Vault provides this, and it is described in the whitepaper “M-Vault Failover and Disaster Recovery”. Note that the mirror servers normally provide read access, so M-Vault is active-active for read and active-passive for write.
A correctly functioning directory is a prerequisite for messaging, so M-Vault disaster recovery should be handled first.
M-Switch Redundant Deployment
M-Switch switches messages, and in general messages will be held for a few hundred milliseconds before being passed to the next hop. This means that there is little benefit to trying to back up messages on the fly.
A recommended simple active-active M-Switch configuration is shown above. There will be an M-Switch active at the primary site, which will deliver messages to the local M-Store X.400. An M-Switch will also be active at the DR site (or could be on cold standby). Any messages arriving at this M-Switch will be delivered to the primary message store, as controlled by the client configuration held in the directory. External systems can be configured to send to the primary M-Switch, with fallback to the secondary one.
If the primary M-Switch fails, messages will be sent to the DR M-Switch. Once the M-Store X.400 configuration has been changed to point at the DR M-Store X.400, the DR M-Switch will deliver messages to the DR message store.
This means that the M-Switch configuration does not need to be changed in the DR situation.
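The routing behaviour described above can be sketched as follows. This is an illustrative model only, assuming that a sender tries M-Switch servers in a fixed preference order and that the delivery target is looked up per recipient; the function names, server names and the dictionary standing in for the directory are all hypothetical.

```python
def pick_switch(preference, is_up):
    """Return the first reachable switch, in the sender's preference order."""
    for switch in preference:
        if is_up(switch):
            return switch
    raise RuntimeError("no M-Switch reachable")


def route(recipient, preference, is_up, store_for_user):
    """Return (switch used, message store targeted) for one message."""
    switch = pick_switch(preference, is_up)
    # The delivery target comes from the directory, not from the switch,
    # so either switch targets whichever store the user currently points at.
    return switch, store_for_user[recipient]


preference = ["mswitch-primary", "mswitch-dr"]  # hypothetical names
store_for_user = {"alice": "mstore-primary"}    # toy view of the directory

# Normal operation: both switches are reachable.
normal = route("alice", preference, lambda s: True, store_for_user)

# Primary site lost: only the DR switch answers.
only_dr_up = lambda s: s == "mswitch-dr"
during = route("alice", preference, only_dr_up, store_for_user)

# After the message store failover has been made in the directory.
store_for_user["alice"] = "mstore-dr"
after = route("alice", preference, only_dr_up, store_for_user)

print(normal, during, after)
```

In this sketch nothing in the switch preference list changes between the three cases; only the directory entry does, which mirrors the point that M-Switch configuration is left untouched.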
When an X.400 client sends a message, two approaches are possible:
- Direct submission to M-Switch using X.400 P3.
- Indirect submission using X.400 P7 to M-Store X.400, which will use X.400 P3 to submit the message to M-Switch. This has two benefits:
  - The client can use a single connection for message access and message submission.
  - M-Store X.400 will retain a copy of each submitted message for future reference.
An M-Switch instance can accept P3 submission from any client that it can correctly authenticate. In the active-active M-Switch configuration, either M-Switch can therefore accept messages for submission from a client or message store, and no configuration change is required for the M-Switch servers to support disaster recovery.
It can now be seen how the M-Store X.400 disaster recovery described earlier fits into this. In the event of the need to fail over to the disaster recovery site:
- M-Vault disaster recovery is performed first, so that directory changes can be made on the DR site.
- No M-Switch configuration changes are needed.
- M-Store X.400 failover to the DR site can be achieved as described earlier.
Thus, the whole server infrastructure can be switched over.
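The ordering of these steps can be expressed as a small runbook sketch. It is a hypothetical model, not an Isode tool: the `state` dictionary, its keys and the function names are assumptions made for illustration. The point it shows is that the message store step depends on the directory step having been completed.

```python
class FailoverError(Exception):
    """Raised when a step is attempted before its prerequisite."""


def fail_over_directory(state):
    """Step 1: make the directory writable at the DR site."""
    state["directory_at_dr"] = True


def fail_over_message_store(state):
    """Step 3: repoint the message store; needs a writable directory."""
    if not state["directory_at_dr"]:
        raise FailoverError("directory failover must be completed first")
    state["active_store"] = "dr"


def fail_over_site(state):
    """Run the steps in the order given in the text."""
    fail_over_directory(state)
    # Step 2: no M-Switch configuration change is needed, so nothing to do.
    fail_over_message_store(state)
    return state


# Attempting the message store step first is rejected.
try:
    fail_over_message_store({"directory_at_dr": False, "active_store": "primary"})
    out_of_order_rejected = False
except FailoverError:
    out_of_order_rejected = True

result = fail_over_site({"directory_at_dr": False, "active_store": "primary"})
print(out_of_order_rejected, result)
```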
Messaging clients also need to participate in the disaster recovery. With a client/server architecture, clients acting on behalf of a given user or role can operate at multiple locations. There are three basic approaches to client disaster recovery:
- The user will use a separate client at the DR location, configured to access the DR servers.
- The client will be configured to access both servers, and can switch between them either manually or automatically.
- Network configuration can be set up so that in the event of switching to the DR servers, the switch of servers is transparent to client configuration.
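The second approach, a client configured with both servers that switches automatically, can be sketched as below. This is a minimal illustration, not the behaviour of any particular X.400 client: the `connect` callable, the server names and the use of `OSError` to signal an unreachable server are assumptions.

```python
from typing import Callable, Sequence, Tuple


def connect_with_fallback(
    servers: Sequence[str], connect: Callable[[str], object]
) -> Tuple[str, object]:
    """Try each configured server in order and return the first that connects."""
    last_error = None
    for server in servers:
        try:
            return server, connect(server)
        except OSError as exc:  # covers refused, unreachable and timed-out
            last_error = exc
    raise ConnectionError(f"no message store reachable: {last_error}")


def fake_connect(server: str) -> str:
    """Stand-in for a real P7 bind; the primary site is simulated as down."""
    if server == "ms.primary.example":
        raise ConnectionRefusedError("primary site unreachable")
    return f"session:{server}"


# Hypothetical server names, listed in order of preference.
configured = ["ms.primary.example", "ms.dr.example"]
used, session = connect_with_fallback(configured, fake_connect)
print(used, session)
```

A manual switch corresponds to the operator reordering `configured`; the third approach moves the same decision into the network, so the client keeps a single server name.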
When there is a catastrophic failure, it is possible that a small number of messages will be “lost”. Examples are messages being switched at the time of the failure, and messages delivered to the failed message store but not yet accessed by the user. It would be very hard to ensure that absolutely no message loss occurred at the transition.
An approach to help deal with this is to track messages, and have the tracking system ensure that messages are delivered and acknowledged. This will enable missing messages to be flagged to operators, who can take appropriate action. Isode’s approach to this is described in the whitepaper “Using Message Acknowledgements for Tracking, Correlation and Fire & Forget”.
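The tracking idea can be sketched in a few lines. This is a generic illustration and does not describe Isode's tracking product: the `Tracker` class, the 60 second timeout and the message identifiers are hypothetical, and times are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Tracker:
    """Remembers submitted messages until an acknowledgement arrives."""
    timeout: float                                # seconds allowed for an ack
    pending: dict = field(default_factory=dict)   # message id -> submit time

    def submitted(self, msg_id: str, now: float) -> None:
        """Record that a message was submitted at time `now`."""
        self.pending[msg_id] = now

    def acknowledged(self, msg_id: str) -> None:
        """Stop tracking a message once it has been acknowledged."""
        self.pending.pop(msg_id, None)

    def overdue(self, now: float) -> list:
        """Messages still unacknowledged after the timeout, for operators."""
        return sorted(
            msg_id
            for msg_id, sent in self.pending.items()
            if now - sent > self.timeout
        )


tracker = Tracker(timeout=60.0)
tracker.submitted("msg-1", now=0.0)
tracker.submitted("msg-2", now=10.0)
tracker.acknowledged("msg-1")           # msg-2 is lost in the site failure

flagged_early = tracker.overdue(now=50.0)
flagged_late = tracker.overdue(now=100.0)
print(flagged_early, flagged_late)
```

Only the unacknowledged message is flagged, and only once its timeout has passed, which gives operators a short list to investigate after a failover.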
This paper has shown how Isode products can be used to provide an X.400 deployment with off site disaster recovery.