High Availability: Fail-Over Clustering and Off-Site Hot Standby
High availability is important for many applications of Isode's products. On this page we describe:
- Isode's Fail-Over Clustering model for high availability, and
- Disaster Recovery using off-site hot standby.
These approaches are common to all members of Isode's M-Vault, M-Switch, M-Box and M-Store product families. M-Link uses its own clustering model, which is described on the M-Link: Reliability page.
Fail-Over Clustering
It is always desirable to achieve high server availability. For example, if a Message Switch fails while it is transferring messages, those messages will remain "stuck" on the Message Switch until it is repaired. In some service environments, such a delay would be unacceptable. Fail-over clustering is designed to provide very high server availability for environments with this type of service requirement.
In a fail-over cluster, there are two computers (or occasionally several computers). One (the primary) provides the service in normal situations. A second (the fail-over computer) is present in order to run the service when the primary system fails. The primary system is monitored with active checks every few seconds to ensure that it is operating correctly. The system performing the monitoring may be either the fail-over computer or an independent system (called the cluster controller). If the primary system fails, or if components associated with it (such as network hardware) fail, the monitoring system will detect the failure and the fail-over system will take over operation of the service.
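The monitoring behaviour described above can be illustrated with a minimal Python sketch. It is not Isode's or any cluster manager's implementation: the class `FailoverMonitor`, the `check_primary` callback and the `max_failures` threshold are hypothetical names chosen for illustration, and the rule of requiring several consecutive failed checks before switching is an assumption, not something stated on this page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Hypothetical monitor: probes the primary and decides when to fail over."""

    check_primary: Callable[[], bool]  # returns True when the primary is healthy
    max_failures: int = 3              # assumed threshold, not an Isode setting
    failures: int = 0                  # consecutive failed checks seen so far
    active: str = "primary"            # node currently expected to run the service

    def poll(self) -> str:
        """Run one active check and return the node that should run the service."""
        if self.active == "failover":
            return self.active         # already failed over; no automatic failback
        if self.check_primary():
            self.failures = 0          # a successful check resets the counter
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.active = "failover"
        return self.active


def run_checks(monitor: FailoverMonitor, rounds: int) -> list[str]:
    """Poll a fixed number of times; a real monitor would wait a few seconds between polls."""
    return [monitor.poll() for _ in range(rounds)]
```

With a check sequence of healthy, failed, healthy, then three failures in a row, the sketch keeps the service on the primary for the first five polls and selects the fail-over node on the sixth. Counting consecutive failures is one way to avoid switching on a single missed check; an operating system cluster manager may apply different rules.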
A key element of the fail-over clustering approach is that both computers share a common file system. One approach is to provide this by using a dual-ported RAID (Redundant Array of Independent Disks), so that the disk subsystem is not dependent on any single disk drive. An alternative approach is to use a SAN (Storage Area Network).
Isode's fail-over clustering utilizes cluster support from the Operating System vendor. The primary fail-over functions are provided by this cluster support. Isode provides components to integrate with these cluster managers, to enable the cluster manager to monitor, stop and start Isode servers.
Off-Site Hot Standby
The approach to providing off-site hot standby is closely related to that used to provide fail-over clustering. In a fail-over clustering system, operations move to a standby system in the event of the primary system failing. Off-Site Hot Standby is provided by having the standby system on a disaster recovery site.
As well as a separate processor, an independent copy of all appropriate files is kept at the remote site. This is achieved by a process known as disk mirroring, or RAID 1. Mirrored storage of this kind is often provided as part of a SAN (Storage Area Network).
RAID 1 is used to keep data consistent between the off-site and primary copies. Additionally, it is likely that some form of hardware-based RAID will be used at one or both sites to deal with the risk of disk failure.
There are two basic approaches that can be used to provide the RAID 1 disk mirroring between the primary and disaster recovery sites:
- Software. Isode recommends iSCSI, which is supported by a number of products. It is also available as a part of most Unix operating systems, and so can be implemented using standard hardware. iSCSI allows the RAID 1 mirroring to occur over an IP network using standard network cards. It is a good solution for low and medium volume deployments.
- Hardware. There are a number of hardware solutions that can be used to provide RAID 1, such as Fibre Channel. This is appropriate for a high volume deployment.
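To make the idea of RAID 1 mirroring between two sites concrete, the following Python sketch models it in memory. It is a toy under stated assumptions, not an iSCSI or Fibre Channel implementation: `MirroredVolume`, the replica names `primary` and `offsite`, and the `fail` and `resync` helpers are all hypothetical, and real mirroring operates on disk blocks below the file system.

```python
class MirroredVolume:
    """Toy RAID 1 model: every write goes to both the primary and the off-site copy."""

    def __init__(self) -> None:
        # Each replica maps a block number to its contents (hypothetical in-memory "disks").
        self.replicas: dict[str, dict[int, bytes]] = {"primary": {}, "offsite": {}}
        self.online: set[str] = {"primary", "offsite"}

    def write(self, block: int, data: bytes) -> None:
        """Write the block to every online replica; fail if none is available."""
        if not self.online:
            raise IOError("no replica available")
        for name in self.online:
            self.replicas[name][block] = data

    def read(self, block: int) -> bytes:
        """Read from the first online replica; mirrored copies hold identical data."""
        for name in ("primary", "offsite"):
            if name in self.online:
                return self.replicas[name][block]
        raise IOError("no replica available")

    def fail(self, name: str) -> None:
        """Simulate losing a site or its disks."""
        self.online.discard(name)

    def resync(self, name: str) -> None:
        """Bring a replica back by copying all blocks from a surviving replica."""
        sources = self.online - {name}
        if not sources:
            raise IOError("no surviving replica to copy from")
        source = next(iter(sources))
        self.replicas[name] = dict(self.replicas[source])
        self.online.add(name)
```

Because each write reaches both copies while both are online, the off-site copy can serve the data if the primary site is lost. A replica that has been offline holds stale data, which is why the sketch copies from the surviving replica before putting it back into service.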
As with fail-over clustering, a heartbeat mechanism is used by the disaster recovery system to detect failure of the primary system. If this happens, the disaster recovery system will take over operation.
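A heartbeat check of this kind can be sketched as follows. This is an illustrative Python example only: `HeartbeatWatcher`, its `timeout` parameter and the decision to pass the current time in explicitly are assumptions made so the logic is easy to follow and test, and they do not describe how any particular cluster product implements heartbeats.

```python
from typing import Optional


class HeartbeatWatcher:
    """Hypothetical standby-side watcher: the primary is presumed failed once heartbeats stop."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout                    # seconds of silence tolerated (assumed value)
        self.last_seen: Optional[float] = None    # time of the most recent heartbeat

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the primary at time `now`."""
        self.last_seen = now

    def primary_failed(self, now: float) -> bool:
        """Return True when no heartbeat has arrived within the timeout window."""
        if self.last_seen is None:
            return False                          # never heard from the primary; do not take over blindly
        return now - self.last_seen > self.timeout
```

In a running system the `now` values would come from a monotonic clock such as `time.monotonic()`. A timeout alone cannot distinguish a failed primary from a failed link between the sites, so a deployment would normally combine it with other checks before the disaster recovery system takes over.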