High Availability Architecture

02 May 2024

⚠ Note

The information in this article applies only to IAP versions that require RabbitMQ as a dependency. This includes IAP 2022.1.x and IAP 2023.1.x.

Itential Automation Platform (IAP) is designed with a focus on high availability (HA). This guide provides detailed component information and architectural sizing to help you design and deploy a highly available IAP. It also describes the option to set up a disaster recovery (DR) environment to mitigate the catastrophic loss of a data center.

A highly available deployment requires at least two (2) IAP instances. Additional IAP instances can be added based on your software license agreement.

IAP HA Component Overview

The Itential Automation Platform allows multiple IAP servers to run as a cluster when they are co-located in the same data center. Clustered IAP servers share the same MongoDB "pronghorn" database, which gives them common storage and allows Workflow Engine to run workflow tasks across all available IAP servers. They also share the same Redis and RabbitMQ servers for communication between nodes. Paired with a front-end load balancer, this provides horizontal scaling, and the redundant servers provide high availability in the event of a server failure.

When further assurance is needed, a second DR system can be created that stands ready to take over processing after a data center failure. This system is clustered with the primary at the MongoDB layer and requires some manual intervention to activate processing once the failure has been identified and diagnosed.

It is important to distinguish between "high availability" and "disaster recovery" as well as the difference between "active" and "standby" servers. The next section defines each of these.

Active vs Standby Servers

The Itential Automation Platform supports both Active and Standby Automation Platform servers. In most use cases, standby servers do not accept user traffic or work tasks from automations during normal business operations; however, they are ready to take over these duties should an active IAP server fail or need to be shut down.

Active: One or more IAP servers that are running, accepting user traffic, and working tasks. If these servers shut down, automations and user traffic are impacted.
Standby: One or more IAP servers that are running, but do not receive incoming user traffic from load balancers. They also do not actively work automation tasks. Standby servers typically have the processTasksOnStart property set to false within the properties.json file.
High Availability: Two or more IAP servers configured to share load simultaneously and thereby provide fault tolerance. Servers set up in this configuration share the same RabbitMQ, Redis, and MongoDB and must be within the same data center. If any of the servers goes down, job and task execution continues on the other servers due to the clustering and mirroring of RabbitMQ and Redis.
Disaster Recovery: Whereas high availability mitigates failures at the host and application layers, disaster recovery ensures business continuity in a broader context if an entire data center becomes unavailable. The relationship between systems in two different data centers is not as seamless as between systems configured as highly available; however, the secondary data center system is kept in a running state with IAP Task Worker, Operations ("Ops") Manager and Automation Catalog off. In the event of a failure, these three components must be turned on manually, and only after the non-DR site has been verified as non-functional.
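
The Standby definition above refers to the processTasksOnStart property in properties.json. The following is a minimal sketch of how that setting could be checked with Python. The source does not say where the property sits inside the file, so the helper searches the whole document; the `workerSettings` wrapper key, the function names, and the choice to treat a missing property as "active" are all assumptions made for illustration.

```python
import json
from typing import Any, Optional


def find_property(node: Any, name: str) -> Optional[Any]:
    """Return the first value stored under `name` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        if name in node:
            return node[name]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_property(child, name)
        if found is not None:
            return found
    return None


def server_role(properties: dict) -> str:
    """Report 'standby' only when processTasksOnStart is explicitly false.

    Assumption: a missing property is treated as 'active'.
    """
    value = find_property(properties, "processTasksOnStart")
    return "standby" if value is False else "active"


# Hypothetical fragment; the real properties.json layout may differ.
sample = json.loads('{"workerSettings": {"processTasksOnStart": false}}')
print(server_role(sample))
```

In practice the dictionary would come from `json.load` on the server's own properties.json rather than from an inline string.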

RabbitMQ

RabbitMQ is an industry-standard message broker. IAP uses RabbitMQ for communication between IAP and its applications, and to allow IAP to scale horizontally with additional IAP nodes. Single data center high availability is achieved through RabbitMQ clustering and message queue mirroring.
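
As an illustration of queue mirroring, the sketch below builds a `rabbitmqctl set_policy` command that mirrors classic queues to every node in a RabbitMQ 3.x cluster. The policy name `ha-all` and the catch-all pattern are assumptions, and the article does not state which policy an IAP deployment actually uses, so treat this as an example of the mechanism rather than the required configuration.

```python
import json
import shlex
from typing import List


def mirror_policy_command(policy_name: str, queue_pattern: str) -> List[str]:
    """Build a `rabbitmqctl set_policy` argument list that mirrors matching queues to all nodes."""
    definition = {"ha-mode": "all", "ha-sync-mode": "automatic"}
    return [
        "rabbitmqctl", "set_policy",
        "--apply-to", "queues",
        policy_name,
        queue_pattern,
        json.dumps(definition),
    ]


# "ha-all" is a hypothetical policy name; ".*" matches every queue name.
cmd = mirror_policy_command("ha-all", ".*")
print(shlex.join(cmd))
```

The function only assembles the command; running it against a cluster (for example with `subprocess.run(cmd, check=True)`) is left to the operator.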

Shared-Token Redis

The IAP HA architecture uses a group of Master/Slave Redis servers to provide a single location where IAP servers share the IAP tokens created on user login. Because all IAP servers share the same login tokens, users can switch seamlessly between clustered IAP servers without being redirected to the login page. The Master/Slave Redis setup provides high availability by using Redis Sentinel to monitor the Redis group and ensure that a Master is always available.
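
The effect of a shared token store can be sketched with a small in-memory model. The `TokenStore` and `IapServer` classes below are hypothetical stand-ins, not IAP or Redis APIs; they only show why a token issued by one server is accepted by another when both read the same store.

```python
import secrets
from typing import Dict, Optional


class TokenStore:
    """In-memory stand-in for the shared Redis token store (not a Redis client)."""

    def __init__(self) -> None:
        self._tokens: Dict[str, str] = {}

    def save(self, token: str, user: str) -> None:
        self._tokens[token] = user

    def lookup(self, token: str) -> Optional[str]:
        return self._tokens.get(token)


class IapServer:
    """Hypothetical model of an IAP server that issues and validates login tokens."""

    def __init__(self, name: str, store: TokenStore) -> None:
        self.name = name
        self.store = store

    def login(self, user: str) -> str:
        token = secrets.token_hex(16)
        self.store.save(token, user)
        return token

    def is_logged_in(self, token: str) -> bool:
        return self.store.lookup(token) is not None


shared = TokenStore()
iap1, iap2 = IapServer("iap-1", shared), IapServer("iap-2", shared)
token = iap1.login("alice")
print(iap2.is_logged_in(token))      # True: both servers read the same store

isolated = IapServer("iap-3", TokenStore())
print(isolated.is_logged_in(token))  # False: a separate store has never seen the token
```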

MongoDB

MongoDB is a NoSQL document database designed with scalability and flexibility in mind. Itential Automation Platform uses MongoDB as its main repository for the data used in its workflow automations and by many of its applications. In the HA architecture, MongoDB also provides a single repository of information for multiple IAP servers working together. MongoDB offers extensive scalability and high availability options. IAP uses MongoDB replica sets to store redundant copies of data in order to ensure availability in the event of a failure.

MongoDB uses an election system to select the primary member of a replica set. This ensures that only one MongoDB server acts as the primary (master) at any time, preventing split-brain issues. The primary is the only member that accepts writes, and by default it also serves all reads. Should the primary fail, the remaining members detect the failure through missed heartbeats and hold a new election to bring the replica set back online. The number of replica set members determines the amount of redundancy available.
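
The link between member count and redundancy follows from the election rule: a primary can only be elected by a majority of the voting members. The short sketch below works through that arithmetic; the function names are illustrative, and the member counts are examples rather than IAP requirements.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: more than half of the voting members."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """Members that can be lost while a majority can still be formed."""
    return voting_members - majority(voting_members)


for n in (1, 2, 3, 5):
    print(f"{n} member(s): majority {majority(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

A three-member set tolerates one failure and a five-member set tolerates two, whereas a two-member set tolerates none. This is why replica sets are normally sized with an odd number of voting members.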

Architecture Sizing

The following options constitute the supported architectures for an IAP solution. You should implement one of the following depending on the level of availability you require.

Small: No High Availability

HA-0: Bare minimum node count for each component. Shared-Token Redis is not required because there is only a single IAP node, although a Redis instance must still be installed locally on the IAP server for token storage.

Figure 1: Small Size HA

⚠ Small architecture is not recommended for Production usage as it provides no high availability. It could be used, however, for non-critical environments such as DEV/TEST, but frequent manual backups are strongly recommended.

Medium: High Availability

HA-3: Single data center containing multiple nodes of every component except Supporting Tools and Automation Gateway, which do not support clustering. Policy Engine (Supporting Tools) runs as an individual instance. Multiple Automation Gateway nodes can be added as load requirements dictate.

Figure 2: Medium Size HA

Large: High Availability & Disaster Recovery

HA-6: A highly available configuration within a single data center that closely matches the layout of the Medium: High Availability architecture. In addition, a matching standby data center is available to support disaster recovery scenarios. It is recommended to deploy the MongoDB Arbiter node in a third data center. All systems in the standby data center are running, but IAP Task Worker, Ops Manager and Automation Catalog remain off. This ensures that tasks do not run in the disaster recovery data center until it is manually turned on. During normal operations, the primary data center provides highly available service; in the unlikely event of a catastrophic outage, the secondary data center system can be manually activated to resume operations.

 The disaster recovery secondary data center is in standby status as defined above. Task Worker, Ops Manager and Automation Catalog should not be turned on until the primary system is no longer processing jobs.
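
The recommendation to place the MongoDB Arbiter in a third data center can be illustrated with a short majority check. The member counts per site below are hypothetical (the article does not list them); the sketch only shows that a tie-breaking vote outside both sites lets the replica set keep a voting majority when one site is lost.

```python
from typing import Dict


def survives_loss(voters_by_site: Dict[str, int], lost_site: str) -> bool:
    """Return True if the remaining voters still form a majority of the full replica set."""
    total = sum(voters_by_site.values())
    remaining = total - voters_by_site.get(lost_site, 0)
    return remaining >= total // 2 + 1


# Hypothetical layouts: two data-bearing voters per site, with and without an arbiter.
two_sites = {"primary": 2, "dr": 2}
with_arbiter = {"primary": 2, "dr": 2, "third": 1}

print(survives_loss(two_sites, "primary"))     # False: 2 of 4 voters is not a majority
print(survives_loss(with_arbiter, "primary"))  # True: 3 of 5 voters is a majority
```

This concerns only the database layer. Even when MongoDB can elect a primary in the DR site, Task Worker, Ops Manager and Automation Catalog still have to be turned on manually as described above.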

Figure 3: Large Size HA


IAP contains multiple "scheduler systems", and of these, Operations Manager and Automation Catalog are affected by a DR setup. Therefore, for any HA setup that includes disaster recovery, you must disable (pause) Task Worker and shut down the Operations Manager and Automation Catalog apps on the DR system.
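
The three sizing options described in this section can be summarized as a small lookup. The data class and the selection helper are illustrative names, not part of IAP; the sizes, labels and capabilities are taken from the descriptions above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Architecture:
    size: str
    label: str
    high_availability: bool
    disaster_recovery: bool


# Ordered from smallest to largest footprint.
ARCHITECTURES: List[Architecture] = [
    Architecture("Small", "HA-0", high_availability=False, disaster_recovery=False),
    Architecture("Medium", "HA-3", high_availability=True, disaster_recovery=False),
    Architecture("Large", "HA-6", high_availability=True, disaster_recovery=True),
]


def choose_architecture(needs_ha: bool, needs_dr: bool) -> Architecture:
    """Return the smallest listed architecture that meets the stated requirements."""
    for arch in ARCHITECTURES:
        ha_ok = arch.high_availability or not needs_ha
        dr_ok = arch.disaster_recovery or not needs_dr
        if ha_ok and dr_ok:
            return arch
    raise ValueError("no listed architecture satisfies the requirements")


print(choose_architecture(needs_ha=True, needs_dr=False).label)
print(choose_architecture(needs_ha=True, needs_dr=True).label)
```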
