Overview

This white paper looks at the deployment of XMPP in a distributed environment and at approaches to ensuring high reliability. It describes capabilities supported by Isode’s M-Link XMPP Server product, with particular attention to scenarios using constrained network links such as HF Radio, and to cross domain configurations. It also provides an overview of the new capabilities in M-Link 19.3.


Classic XMPP Clustering

XMPP Clustering is a mechanism where multiple XMPP servers provide service for a single domain. The XMPP standards support client access to such a configuration by defining a Domain Name System (DNS) mechanism, based on SRV records, that enables an XMPP client to choose between cluster nodes and to reconnect to another node if one fails. Clustering also provides load balancing.
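
The client-side node selection described above can be sketched as follows. This is a minimal illustration of SRV-style ordering (priority first, then weighted choice within a priority level), not M-Link or client library code. The `SrvRecord` type, the `order_srv_records` function and the host names are assumptions made for the example; a real client would obtain the records from a DNS lookup of `_xmpp-client._tcp` for its domain.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def order_srv_records(records, rng=None):
    """Return records in the order a client should try them."""
    if rng is None:
        rng = random.Random()
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        # Weighted selection without replacement within one priority level.
        while group:
            total = sum(r.weight for r in group)
            if total == 0:
                choice = rng.choice(group)
            else:
                pick = rng.uniform(0, total)
                running = 0
                choice = group[-1]
                for record in group:
                    running += record.weight
                    if pick <= running:
                        choice = record
                        break
            ordered.append(choice)
            group.remove(choice)
    return ordered


# Hypothetical SRV answer for a two-node cluster plus a backup node.
records = [
    SrvRecord(10, 50, 5222, "node1.example.com"),
    SrvRecord(10, 50, 5222, "node2.example.com"),
    SrvRecord(20, 0, 5222, "backup.example.com"),
]
for rec in order_srv_records(records):
    print(rec.target, rec.port)
```

A client works through the returned list in order, so a failed node is skipped in favour of the next candidate, and equal-priority nodes share the load across clients.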

Server-side clustering protocols are not standardized. M-Link provides an efficient clustering mechanism, which is designed to provide high reliability with wide area links between cluster nodes. This enables M-Link Clustering to provide service reliability both within a single site and across sites, so that a service clustered between sites is resilient against the failure of a whole site.

The XMPP standard envisions a fully connected Internet environment, where every XMPP server can connect to every other XMPP server. In this environment communications proceed: Client A -> Server 1 -> Server 2 -> Client B. XMPP clustering is the primary resilience mechanism needed for this environment. If Server 1 and Server 2 are both clustered, a wide range of network and server failures is addressed. Note that the XMPP model is that clients connect to servers, and that a cluster will ensure that messages are correctly routed to a given client.

XMPP Trunking

Isode’s XMPP solution targets environments where this full connectivity is not viable. The approach is described in the Isode white paper Providing XMPP Trunking with M-Link Peer Controls. Essentially, this allows communication chains that include more than two XMPP servers. M-Link uses a peer control mechanism to support traffic routing in this kind of setup. XMPP Trunking is needed to support a number of scenarios:

  • Organizational Boundary, where XMPP servers within an organization need to be separated from XMPP servers outside the organization using an XMPP server operating on the boundary.
  • Cross Domain, a variant of the previous scenario where different networks are connected across a secure boundary. This is a key Isode operational target described in the Isode white paper Isode's XMPP Cross Domain Solution.
  • Constrained Networks, as described in the Isode white paper Operating XMPP over HF Radio and Constrained Networks. Where a constrained network is used, it is generally desirable to have XMPP servers operating together close to each end of the link, typically using optimized protocols. These servers can then relay traffic on to servers on fast networks.

These XMPP Trunking scenarios introduce additional complexity. Classic XMPP Clustering is insufficient to address reliability considerations in these scenarios. The rest of this paper discusses additional techniques to provide reliability for these scenarios.

M-Link 19.3

M-Link is provided as a family of products with different capabilities built on a common base. M-Link User Server is a central M-Link product which supports end users and Multi-User Chat (MUC) rooms. M-Link User Server provides Classic XMPP Clustering. M-Link User Server version 17.0 is widely deployed.

M-Link is undergoing a significant refactor, with the various M-Link products being released incrementally on the new code base. M-Link 19.4, scheduled for release in 2023, will include all of the M-Link product family. M-Link 19.3 provides two M-Link products, each of which provides resilience capabilities relevant to this white paper:

  • M-Link Edge: provides boundary support for organization and cross domain boundaries.
  • M-Link MU Gateway: provides support for Mobile Units (MUs) operating over constrained networks.

M-Link Edge, M-Guard and Cross Domain

Isode’s approach to XMPP cross domain is to use an XML Guard, such as Isode’s M-Guard, on the domain boundary and to use M-Link Edge to connect between XMPP servers in the domain and the XML Guard. The core solution is described in the Isode white paper Isode's XMPP Cross Domain Solution. This section considers how to provide resilience and, in particular, how to protect against site and network failures.

The diagram above shows two domains, each containing multiple M-Link (or other XMPP) servers. Each of these servers may be clustered, in order to provide XMPP Server resilience. The domains are connected by cross domain components, each comprising an M-Guard with an M-Link Edge on each side.

The key resilience point is that there are two independent M-Guard/M-Link Edge nodes, which will typically be at independent sites. The whole setup will continue to work if one of these nodes fails by using link fall back to a different node.

An M-Link server in one domain will be configured to route to both M-Link Edges for servers in the other domain. The routing may give the two M-Link Edges equal priority or prefer one of them. This fall back mechanism deals with failure of an M-Link Edge or of the site running it.

The M-Link Edges can also communicate with each other, which deals with two additional failure scenarios:

  • Failure of M-Guard or of the peer M-Link Edge. In this scenario, the M-Link Edge will fail to connect to its peer M-Link Edge through the guard. It can be configured to fall back to another M-Link Edge in the same domain.
  • Network failures between M-Link Edge and domain M-Link Servers. If an M-Link Edge cannot route traffic directly to the target M-Link server in the domain, it can fall back to another M-Link Edge in the same domain, which may be able to route the traffic.
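
The fall back behaviour described in this section can be sketched as preference-ordered next-hop selection. This is an illustrative model only, not M-Link configuration or code: the `Route` type, the function names, the host names and the `is_reachable` callback are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Route:
    next_hop: str  # hypothetical host name of an M-Link Edge
    priority: int  # lower value is preferred


def candidate_hops(routes, is_reachable):
    """Yield next hops in preference order, skipping unreachable ones."""
    for route in sorted(routes, key=lambda r: r.priority):
        if is_reachable(route.next_hop):
            yield route.next_hop


def select_next_hop(routes, is_reachable):
    """Return the preferred reachable next hop, or None if all have failed."""
    return next(candidate_hops(routes, is_reachable), None)


# Routes from a server in one domain towards servers in the other domain.
routes_to_other_domain = [
    Route("edge-site1.domain-a.example", priority=1),
    Route("edge-site2.domain-a.example", priority=2),
]

# Simulate failure of the preferred M-Link Edge (or of its site).
down = {"edge-site1.domain-a.example"}
print(select_next_hop(routes_to_other_domain, lambda hop: hop not in down))
# prints "edge-site2.domain-a.example"
```

The same selection logic applies when an M-Link Edge falls back to another M-Link Edge in its own domain: the second Edge simply appears as a lower-preference route.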

Constrained Networks & Multiple Bearers

In fast networks, IP networking is typically used between a pair of servers, and IP routing is used to deal with different bearers and link failures. With constrained networks, this approach does not work so well with XMPP.

  • For high latency networks, such as Satcom, standard XMPP protocols incur too much latency, and an optimized protocol operating over TCP/IP gives significantly improved performance.
  • For HF Radio using STANAG 5066, performance is optimized by completely avoiding use of IP.
  • In general, for a poor link, resilience is helped by having the XMPP Server running close to the poor link.

These points are discussed in the Isode white paper Operating XMPP over HF Radio and Constrained Networks.

Commonly, constrained links are associated with Mobile Units (MUs) such as ships or planes. There will often be a choice of links (bearers) to be used, and choosing the best bearer is important for resilience. This is sometimes referred to as Bearer Of Opportunity (BOO).

M-Link MU Gateway allows multiple links to be configured for a peer with a specified order of preference. Consider an MU which has a preferred Satcom link with an option to use HF Radio when Satcom is not available. This is an important configuration when considering Satcom-denied operation. In normal operation Satcom will be used. If the Satcom link fails, operation will automatically “fall back” to HF and a link will be established. In this scenario, the Satcom link will continue to be monitored. When the Satcom link recovers, communication will “fall forward” to start using the Satcom link again. This provides fully automated operation that uses the best link available.
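
The fall back and fall forward behaviour can be sketched as a small selector over bearers listed in order of preference. This is a simplified model under stated assumptions, not the M-Link MU Gateway implementation: the `BearerSelector` class and the bearer names are hypothetical, and link monitoring is reduced to calls to `report`.

```python
class BearerSelector:
    """Choose the most preferred bearer that is currently available."""

    def __init__(self, bearers_in_preference_order):
        self.bearers = list(bearers_in_preference_order)
        self.available = {name: True for name in self.bearers}
        self.active = self._best()

    def _best(self):
        for name in self.bearers:
            if self.available[name]:
                return name
        return None

    def report(self, name, is_up):
        """Record a link monitoring result and return the transition, if any."""
        previous = self.active
        self.available[name] = is_up
        self.active = self._best()
        if self.active == previous:
            return None
        if self.active is None:
            return "link lost"
        if previous is None:
            return "link restored"
        less_preferred = self.bearers.index(self.active) > self.bearers.index(previous)
        return "fall back" if less_preferred else "fall forward"


selector = BearerSelector(["satcom", "hf"])
print(selector.active)                                    # satcom
print(selector.report("satcom", False), selector.active)  # fall back hf
print(selector.report("satcom", True), selector.active)   # fall forward satcom
```

Because the selector always re-evaluates from the top of the preference list, recovery of a preferred bearer moves traffic back to it without operator action.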

Connection Analysis and Tracing

When operating complex configurations with chains of XMPP servers, it can be difficult to diagnose issues when things are not working as expected. M-Link 19.3 has added a new session tracing facility as shown above, which is much more convenient than using tools such as Wireshark. It is possible to trace the data being sent and received over any session, including those using TLS. One or more sessions can be selected to monitor. It is also possible to opt to monitor new sessions, which can be helpful to diagnose issues with session startup.

Monitoring

M-Link 19.3 also adds new capabilities to monitor an operational service. There are capabilities in the Web interface to M-Link to show current operational status for the local server and for active sessions. Historical information can be shown graphically. This works by M-Link feeding metrics into a Prometheus time series database; the Prometheus data can then be displayed using a Grafana dashboard, as shown above. Prometheus and Grafana are widely used open source components, external to M-Link. This configuration gives a flexible approach to statistics display, appropriate to the deployment.
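
To illustrate the kind of data involved, the sketch below renders a few metrics in the text exposition format that Prometheus scrapes. It is an assumption-laden example, not a description of M-Link's actual output: the metric names, label names and values are invented, and the `render_metrics` helper is hypothetical.

```python
def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (kind, help_text, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        for labels, value in samples:
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            selector = f"{{{label_text}}}" if label_text else ""
            lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics; real metric names are defined by the product.
example = {
    "xmpp_stanzas_sent_total": (
        "counter",
        "Stanzas sent since server start.",
        [({"peer": "mu1.example"}, 1520)],
    ),
    "xmpp_link_unacked_stanzas": (
        "gauge",
        "Stanzas sent but not yet acknowledged.",
        [({"link": "satcom"}, 0), ({"link": "hf"}, 7)],
    ),
}
print(render_metrics(example), end="")
```

Prometheus stores each labelled sample as a time series, which is what allows a Grafana dashboard to plot, for example, queue depth per link over time.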

The available statistics cover metrics on messages and connections which are useful for any deployment. For constrained network deployments, additional metrics are available. Constrained links may be configured with XEP-0198 acknowledgements, so that M-Link can measure link latency and message queue buildup of unacknowledged messages.
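
The following sketch shows how acknowledgements of the XEP-0198 kind can yield both measurements. In XEP-0198 the receiver reports `h`, the total number of stanzas it has handled; the sender can compare this with what it has sent. The `AckTracker` class is a hypothetical illustration, not M-Link code, and the latency it reports is simply the time from sending a stanza to receiving the acknowledgement that covers it.

```python
from collections import deque


class AckTracker:
    """Track unacknowledged stanzas on one link using XEP-0198 'h' counters."""

    def __init__(self):
        self.acked = 0            # last 'h' value received from the peer
        self.pending = deque()    # send times of stanzas not yet acknowledged
        self.last_latency = None  # send-to-ack time of the latest acked stanza

    def stanza_sent(self, now):
        self.pending.append(now)

    def ack_received(self, h, now):
        """Process an acknowledgement: the peer has handled h stanzas in total."""
        newly_acked = (h - self.acked) % 2**32  # 'h' wraps at 2^32
        newly_acked = min(newly_acked, len(self.pending))
        for _ in range(newly_acked):
            self.last_latency = now - self.pending.popleft()
        self.acked = h

    @property
    def queue_depth(self):
        """Number of stanzas sent but not yet acknowledged."""
        return len(self.pending)


tracker = AckTracker()
tracker.stanza_sent(now=0.0)
tracker.stanza_sent(now=1.0)
tracker.stanza_sent(now=2.0)
tracker.ack_received(h=2, now=6.0)
print(tracker.queue_depth, tracker.last_latency)  # 1 5.0
```

A growing `queue_depth` indicates that a slow link is not keeping up with offered traffic, while `last_latency` gives a running view of how long delivery across the link is taking.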

Conclusions

This paper has shown how Classic XMPP Clustering can be augmented by techniques provided in M-Link 19.3 to provide high resilience for cross domain XMPP services and network resilience in multi-bearer environments.