Operational Monitoring and Control of Systems using Isode Servers
Summary: Isode server products (M-Vault, M-Switch, M-Store X.400 and M-Box) are deployed in a wide variety of situations, and usually there is a high service reliance placed on them. In some cases, a single server provides a complete standalone service. In other systems there are large numbers of Isode servers forming just one part of a complex system.
Isode’s approach to server design and management is that the products are building blocks, with maximum use of open standard protocols for interconnection. Management is almost entirely client/server, as discussed in the white paper Isode Management Architecture: Client/Server and Directory.
This combination of building blocks and client/server management means that operational management needs to be considered as part of the overall system design. This paper explains the approach Isode has taken and the options it provides for building an operational system.
Overall Operator Monitoring
In most deployments, Isode servers will run day in and day out without errors of any kind. When faults occur, they are often in response to failures of other (or external) components. For these two reasons, most sites choose a general purpose Management Station such as HP OpenView or BMC Patrol, which gives system operators one point from which to monitor a wide range of systems. Use of such a tool gives many advantages:
- One tool can be used to monitor many servers.
- Operator training costs are reduced.
- There is good flexibility in monitoring and notification options.
This top level approach is primarily used to ensure that the system is running correctly, and most of the time it will be operating smoothly and without error. In the event of a failure, the operator will switch to using a specialised tool, or contact an appropriate expert.
A Management Station will typically monitor:
- Isode Servers, and in particular:
  - Up/down status. Knowing that the server is up and running is a top priority.
  - General "health" parameters are important as a secondary check. If there are large decreases (or increases) of activity, this is likely to indicate that there is a problem. Example parameters:
    - Number of open connections.
    - Protocol or operation response time.
    - Message throughput.
    - Message latency (M-Switch).
  - Faults & Events (discussed later).
- Local system resources associated with the Isode servers, as problems may be due to the underlying system. In particular:
  - System up/down.
  - Disk space available.
  - Processor usage.
- Network resources. The status of routers, switches and other network components is important, as failures often affect applications.
- Other components and applications within the total system being monitored.
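The secondary "health" check described in the list above, where a large decrease or increase in activity suggests a problem, can be sketched as a comparison against a recent baseline. This is a minimal illustration only: the function names, the use of a simple mean as the baseline, and the threshold factor of 3 are all assumptions, not part of any Isode product or Management Station.

```python
from statistics import mean


def activity_change(history, current):
    """Return the ratio of the current reading to the mean of recent readings.

    `history` is a non-empty list of earlier readings of one parameter,
    for example message throughput or number of open connections.
    """
    baseline = mean(history)
    if baseline == 0:
        # No recent activity: any activity at all is an unbounded increase.
        return float("inf") if current > 0 else 1.0
    return current / baseline


def is_suspicious(history, current, factor=3.0):
    """Flag a reading that is `factor` times above or below the baseline.

    The factor is a hypothetical threshold; a real deployment would tune it
    per parameter in the Management Station.
    """
    ratio = activity_change(history, current)
    return ratio >= factor or ratio <= 1.0 / factor
```

For example, with recent throughput readings of 100, 110 and 90 messages per interval, a new reading of 20 or of 400 would be flagged, while a reading of 105 would not.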
It is usually straightforward to monitor network and local system components from any standard Management Station product. To monitor the up/down status and general server health parameters, Isode uses SNMP (Simple Network Management Protocol) and the Internet Standard MADMAN (Mail and Directory Management) MIBs (Management Information Bases), which were originally designed by Steve Kille (Isode CEO).
Use of SNMP is a good choice, as it is supported by most Management Stations. It is important that this basic monitoring is done by the Management Station polling the applications from time to time, as SNMP uses an unreliable data transport (User Datagram Protocol) and servers in severe difficulty should not be relied on to report errors. This is exactly the function provided by MADMAN.
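The polling model can be illustrated with a short sketch. Because the transport is unreliable, a single lost reply should not be read as an outage, so the sketch retries before declaring a server down. The `probe` callables are hypothetical stand-ins for whatever SNMP query a Management Station would issue; no real SNMP library is used here, and the retry count is an assumption.

```python
from typing import Callable, Dict, Optional


def poll_once(probe: Callable[[], Optional[dict]]) -> str:
    """Return 'up' if the probe answers, 'down' if it returns None or raises."""
    try:
        return "up" if probe() is not None else "down"
    except Exception:
        # A timeout or transport error is treated as no answer.
        return "down"


def poll_servers(probes: Dict[str, Callable[[], Optional[dict]]],
                 retries: int = 2) -> Dict[str, str]:
    """Poll every server, retrying before a server is reported as down.

    `probes` maps a server name to a hypothetical query function. The
    station asks; it never waits for the server to report its own failure.
    """
    status = {}
    for name, probe in probes.items():
        state = "down"
        for _ in range(retries + 1):
            state = poll_once(probe)
            if state == "up":
                break
        status[name] = state
    return status
```

The design point is that the Management Station draws its own conclusion from the absence of a reply, which works even when the server is too badly affected to send anything.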
Faults & Events
A central component of Isode's management architecture is the event subsystem. Isode has an extensible list of events, each associated with a "facility" which is a functional area of the product set. Each event also has a severity level, as set out in the following table.
| Severity Level | Description | Example | Operator Intervention Required | Administrator Intervention Required |
|---|---|---|---|---|
| Critical | A serious error has occurred, leading to total loss of service. | License file expired. | Requires immediate intervention. | As for operator. |
| Fatal | A serious error has occurred, which is likely to cause partial loss of service. | Running out of disk space. | Requires immediate inspection. | As for operator. |
| Error | An error has occurred, which may cause partial loss of service. The sub-system will usually recover from this without intervention. | Association Rejected to a remote MTA. | May require inspection. | May be appropriate to investigate repeated errors, or unusual error patterns. |
| Warning | Something unexpected has happened but is not causing a loss of service. | Protocol violation by remote system. | May be useful for operator to observe. | Administrator should perform non-urgent investigation. |
| Authfail | An authentication or authorisation failure. | LDAP Client authentication to server fails. | Operator may need to investigate. | Administrator should perform investigation of unusual warnings. |
| AuthOK | A successful authentication or authorisation. | LDAP Client authentication to server succeeds. | Not usually useful for operator. | May be useful additional information for administrator. |
| Notice | Informative logging, recording major stages of operational processing. | Called service: smtp-external. | May be useful in monitoring low volume systems. | As for operator. |
| Information | Informative logging, providing more detail than Notice level. | Record each routing option reviewed. | Not appropriate. | May be useful to provide additional logging detail. |
| Detail | Informative logging at a more detailed level. | Log each X.400 checkpoint. | Not appropriate. | May be useful to provide additional logging detail. |
| Success | Informative logging, at a level similar to Detail. | Complete content conversion calculation. | Not appropriate. | May be useful to provide additional logging detail. |
| PDU | A logging option to record specific types of PDU (Protocol Data Unit). | Record LDAP Add PDUs. | Not appropriate. | For use by experienced administrator. |
| Debug | Records information about progress and parameters within the program. | ckadr.c:360 normalised address OK (96). | Not appropriate. | Generally only useful when investigating complex problems in consultation with Isode support. |
When an event occurs, the Isode application will make a call to the event system. The Isode event system will be configured to send this event to zero or more event streams. Event streams are of several different types:
- File. The event is written out to a file, in a regular format. This file may be used directly or viewed remotely with Isode’s Event Viewer program.
- Protocol streams, which send events by protocol. Isode supports three:
  - Syslog (the standard Unix event protocol).
  - Windows Events (a standard interface for events on the Windows platform).
  - SNMP. Use of Traps to send alerts.
In a typical system, events at Authfail level and above will be recorded in log files, so that they are available for operator and administrator inspection. Critical, Fatal, and perhaps selected Error level events will be fed by protocol to a Management Station, so that the operator will be made aware of events that require urgent attention.
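The typical routing just described can be sketched as a function from severity to a list of event streams. This is an illustration of the idea only, not Isode's configuration format: it assumes that the order of rows in the severity table is a ranking from most to least severe, and the stream names and the `forward_errors` switch are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumption: ranked as in the severity table, higher value = more severe.
    DEBUG = 0
    PDU = 1
    SUCCESS = 2
    DETAIL = 3
    INFORMATION = 4
    NOTICE = 5
    AUTHOK = 6
    AUTHFAIL = 7
    WARNING = 8
    ERROR = 9
    FATAL = 10
    CRITICAL = 11


def streams_for(severity, forward_errors=False):
    """Return the (hypothetical) event streams an event is sent to.

    Authfail and above go to the log file. Critical and Fatal always go to
    the Management Station; Error goes there only if `forward_errors` is set.
    """
    streams = []
    if severity >= Severity.AUTHFAIL:
        streams.append("file")
    if severity >= Severity.FATAL or (
        forward_errors and severity == Severity.ERROR
    ):
        streams.append("management-station")
    return streams
```

Under these assumptions a Debug event is sent to zero streams, a Warning is logged to file only, and a Critical event is both logged and forwarded to the operator.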
Detailed Application Monitoring
Use of a general purpose Management Station is ideal for top level monitoring of the whole system, with a small amount of information on each server. Configuring a general purpose Management Station to deal with detailed management of a specific application would be a lot of work, and would produce a rather inadequate result. For this reason, Isode’s management architecture makes use of application specific tools for more detailed management.
Isode does this by having "Management Consoles" for each server product. These are M-Vault Console (for M-Vault), MConsole (for M-Switch), XMSConsole (for M-Store X.400) and a future product for M-Box. There are two main ways in which the Consoles are used:
- In a "head up" display, operating in a fixed configuration on a visible screen. The Consoles are all designed to operate in a "monitor mode" which will show the current status of key aspects of the servers being monitored (e.g., message queues for M-Switch and replication agreements for M-Vault). This display will enable changes from normal status to be noticed quickly and easily.
- For use by skilled operators and administrators to do advanced monitoring and preliminary problem diagnosis. The Consoles allow an operator to look in more detail at the server and to make operational changes (e.g., to delete a message from a queue).
Isode applications make use of directory based configuration, and have special tools for managing this configuration in the directory. In general, configuration will be separate from operational monitoring, although there is a clear interaction and in some situations configuration changes will be made to address operational problems.
Reporting & Statistics
Generation of reports is not usually an operational task, but often makes use of similar information. Isode’s approach to reports on operational information is to record information in an audit database. As well as supporting statistics, this information is used for operational services such as message tracking, archive searching, and quarantine management.
Supporting Service Level Agreements
Many operational systems operate according to service level agreements (SLAs). Establishing an approach to conformance to SLAs is a task for the system designer. It is important that Isode provides the right building blocks to enable this to happen. Some specific things that Isode provides:
- Reporting and Statistics. SLAs will generally require showing that targets have been achieved, and appropriate reports are key to achieving this.
- Where operator action is needed, the Management Station configuration will be key to meeting some types of agreement (e.g., to respond to all outages within 5 minutes).
- Where an SLA has actions dependent on a complex combination of conditions from various components, the Management Station is the natural place to manage such SLAs.
- In order to support complex SLAs, it is important that Isode products support appropriate underlying functions. We have worked to design event structure and SNMP polling to provide appropriate infrastructure. We are happy to extend our support to new events, where current coverage does not fit a required SLA.
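As an illustration of the kind of report an SLA might require, the sketch below computes how often an outage was responded to within a target time, using the five minute example above. The record format (pairs of detection and response timestamps) and the function name are assumptions for this example; they do not describe Isode's audit database or any Management Station.

```python
from datetime import datetime, timedelta


def sla_compliance(outages, limit=timedelta(minutes=5)):
    """Return the fraction of outages responded to within `limit`.

    `outages` is a list of (detected, responded) datetime pairs, a
    hypothetical record format. With no outages the target is trivially met.
    """
    if not outages:
        return 1.0
    met = sum(
        1 for detected, responded in outages
        if responded - detected <= limit
    )
    return met / len(outages)


# Example: one outage answered in 3 minutes, one in 7 minutes.
example = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 3)),
    (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 7)),
]
compliance = sla_compliance(example)
```

In this example only the first outage meets the five minute target, so `compliance` is 0.5.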
This paper has explained the operational management approach taken by Isode.