High Availability Architecture
- Updated on 22 Aug 2023
Itential Automation Platform (IAP) has been designed with a focus on high availability (HA). This guide provides detailed component information and architectural sizing for designing and deploying a highly available IAP. It also describes an option to set up a disaster recovery environment to mitigate a catastrophic data center loss.
A highly available deployment requires at least two (2) IAP instances. Additional IAP instances can be added based on your software license agreement.
IAP HA Component Overview
The Itential Automation Platform has been designed to allow multiple IAP servers to be used in a clustered fashion when colocated in the same data center. These clustered IAP servers share the same MongoDB "pronghorn" database, which gives them common storage and allows the Workflow Engine to run workflow tasks across all available IAP servers. In addition, they share the same Redis and RabbitMQ servers for communication between them. Altogether this provides both horizontal scaling, when paired with a front-end load balancer, and high availability through redundant servers in the event of a server failure.
When further assurance is needed, a second disaster recovery system can be created which stands ready to take over processing upon a data center failure. This system is clustered at the MongoDB layer and requires some manual intervention to activate processing once the failure event has been identified and diagnosed.
It is important to distinguish between "high availability" and "disaster recovery", as well as between "active" and "standby" servers. The next section defines each of these.
Active vs Standby Servers
The Itential Automation Platform supports both Active and Standby Automation Platform servers. In most use cases, standby servers are those that do not actively accept user traffic or work tasks from automations during normal business operations; however, they are ready to take over these duties should an active IAP server fail or need to be shut down.
Active: One or more IAP servers running actively, accepting user traffic and working tasks. If these servers shut down, automations and user traffic are impacted.
Standby: One or more IAP servers that are running, but do not receive incoming user traffic from load balancers. They also do not actively work automation tasks. The WorkFlowEngine service configuration property activate:false is set for these IAP servers to disable working tasks on server startup. The load balancers should only send traffic to the standby server if the active IAP servers go offline.
High Availability: Two or more IAP servers configured to simultaneously provide load sharing and therefore fault tolerance. Servers set up in this configuration share the same RabbitMQ, Redis, and MongoDB and must be within the same data center. If any of the servers goes down, job and task execution continues on the other servers due to the clustering and mirroring of RabbitMQ and Redis.
Disaster Recovery: Whereas high availability mitigates failures at the host and application layers, disaster recovery ensures business continuity in a broader context if an entire data center becomes unavailable. The relationship between systems in two different data centers is not as seamless as between those configured as highly available; however, the secondary data center system is kept in a running state with the IAP task worker off. In the event of a failure, startup time is therefore reduced and job processing can be restored quickly. Only MongoDB is synchronized between data centers; neither RabbitMQ nor Redis is synchronized between data centers.
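The routing rule implied by the Active and Standby definitions above can be sketched in a few lines of Python. This is only an illustration of the decision a load balancer makes, not IAP or load balancer configuration; the IapServer type, the server names, and the eligible_backends function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IapServer:
    """Hypothetical view of one IAP server as seen by a load balancer."""
    name: str
    role: str      # "active" or "standby"
    healthy: bool  # result of the load balancer's health check


def eligible_backends(servers: List[IapServer]) -> List[str]:
    """Return the names of the servers that should receive user traffic."""
    active = [s.name for s in servers if s.role == "active" and s.healthy]
    if active:
        return active
    # No active server is reachable: fall back to healthy standby servers.
    return [s.name for s in servers if s.role == "standby" and s.healthy]


pool = [
    IapServer("iap-1", "active", healthy=True),
    IapServer("iap-2", "active", healthy=False),
    IapServer("iap-3", "standby", healthy=True),
]
print(eligible_backends(pool))  # ['iap-1']
```

As long as at least one active server passes its health check, the standby server receives no traffic; it is selected only when every active server is offline.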
RabbitMQ is an industry-standard message broker. IAP uses RabbitMQ for communication between IAP and its applications, and to allow IAP to scale horizontally with additional IAP nodes. Single data center high availability is achieved through RabbitMQ clustering and message queue mirroring.
The IAP HA architecture uses a group of Master/Slave Redis servers to provide a single location where IAP servers jointly store the IAP tokens created on user login. Because Redis is shared, all IAP servers see the same login tokens, so users can switch seamlessly between clustered IAP servers without being redirected to the login page. The Master/Slave Redis setup provides high availability by using Redis Sentinel to monitor the Redis group and ensure that a Master is always available.
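To illustrate why a shared token store lets users move between clustered servers, here is a toy model in Python. The SharedTokenStore class is an in-memory stand-in for the shared Redis group, not a Redis client, and IapNode, login, and authenticate are hypothetical names that do not correspond to the IAP API.

```python
import secrets
from typing import Dict, Optional


class SharedTokenStore:
    """In-memory stand-in for the shared Redis group (not a Redis client)."""

    def __init__(self) -> None:
        self._tokens: Dict[str, str] = {}

    def put(self, token: str, user: str) -> None:
        self._tokens[token] = user

    def get(self, token: str) -> Optional[str]:
        return self._tokens.get(token)


class IapNode:
    """Hypothetical IAP server that keeps login tokens in a token store."""

    def __init__(self, name: str, store: SharedTokenStore) -> None:
        self.name = name
        self.store = store

    def login(self, user: str) -> str:
        token = secrets.token_hex(16)
        self.store.put(token, user)
        return token

    def authenticate(self, token: str) -> Optional[str]:
        # Returns the user name, or None if the token is unknown to this node.
        return self.store.get(token)


shared = SharedTokenStore()
node_a = IapNode("iap-1", shared)
node_b = IapNode("iap-2", shared)

token = node_a.login("alice")
print(node_b.authenticate(token))  # alice

# A node with its own private store does not recognize the token.
isolated = IapNode("iap-3", SharedTokenStore())
print(isolated.authenticate(token))  # None
```

The second lookup shows what would happen without the shared store: the user would be treated as unauthenticated and redirected to the login page.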
MongoDB is a NoSQL document database designed with scalability and flexibility in mind. Itential Automation Platform uses MongoDB as its main repository for the data used in its workflow automations and in many of its applications. In the HA architecture, MongoDB also provides a single repository of information for multiple IAP servers working together. MongoDB offers extensive scalability and high availability options. The IAP application uses MongoDB's replica set functionality to store redundant copies of data and so ensure availability in the event of a failure.
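As a rough sketch of how a client addresses a replica set rather than a single server, the following Python builds a standard MongoDB connection string that lists several members and names the replica set. The host names and the replica set name rs0 are placeholders, and replica_set_uri is a hypothetical helper; only the "pronghorn" database name comes from this guide.

```python
from typing import List
from urllib.parse import urlencode


def replica_set_uri(hosts: List[str], database: str, replica_set: str) -> str:
    """Build a mongodb:// connection string that lists several members."""
    query = urlencode({"replicaSet": replica_set})
    return f"mongodb://{','.join(hosts)}/{database}?{query}"


uri = replica_set_uri(
    ["mongo-1.example.com:27017",   # placeholder host names
     "mongo-2.example.com:27017",
     "mongo-3.example.com:27017"],
    database="pronghorn",
    replica_set="rs0",              # placeholder replica set name
)
print(uri)
```

Listing more than one member means the client can still discover the current primary when one of the listed hosts is down.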
MongoDB uses an election-style system to select a primary MongoDB server. This ensures that only one member of the replica set acts as the primary (Master) at any time, preventing split-brain issues. The primary is the only member that accepts writes and, by default, is also the member that serves reads. Should the primary fail, the remaining members detect the missed heartbeats and hold a new election to bring the replica set back online. The number of replica set members determines the amount of redundancy available.
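The relationship between member count and redundancy follows from the majority rule used in elections: a primary can only be elected while a majority of voting members is reachable. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary in a set of this size."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """Members that can be lost while a primary can still be elected."""
    return voting_members - majority(voting_members)


for n in (1, 2, 3, 4, 5):
    print(f"members={n} majority={majority(n)} tolerated_failures={fault_tolerance(n)}")
```

A three-member set tolerates one failure and a five-member set tolerates two, while adding a fourth member to a three-member set does not increase the number of tolerated failures. This is why replica sets are normally sized with an odd number of voting members.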
The following options constitute the supported architectures for an IAP solution. You should implement one of the following depending on the level of availability you require.
Small: No High Availability
HA-0: Bare minimum node count for each component. A shared token Redis is not required because there is only a single IAP node; however, a Redis instance must still be installed locally on the IAP server for token storage.
Figure 1: Small Size HA
⚠ Note: This architecture is not recommended for Production usage as it provides no high availability. It could be used, however, for non-critical environments such as DEV/TEST, but frequent manual backups are strongly recommended.
Medium: High Availability
HA-3: Single data center containing multiples of all components, with the exception of Supporting Tools and Automation Gateway, which do not support clustering. Policy Engine (Supporting Tools) runs as a single instance. Multiple Automation Gateway nodes can be added, depending on load requirements.
Figure 2: Medium Size HA
Large: High Availability & Disaster Recovery
HA-6: Highly available configuration within a single data center that virtually matches the layout of the previous Medium: High Availability architecture size. In addition, a matching standby data center is available to support disaster recovery scenarios. It is recommended to deploy the MongoDB Arbiter node in a third data center. All systems in the standby data center are running, except that the IAP task worker remains off. This ensures that tasks are not run in the disaster recovery data center until a manual process to turn the task worker on is performed. During normal operations, the primary data center provides highly available service; in the unlikely event of a catastrophic outage, the secondary data center system can be manually activated to resume operations.
⚠ Note: The disaster recovery secondary data center is in standby status as defined above. The task worker should not be turned on until the primary system is no longer processing jobs.
Figure 3: Large Size HA
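The activation rule for the standby data center described above can be expressed as a simple guard. This is only a sketch of the decision logic, assuming that an operator performs the manual step; the function and parameter names are hypothetical and do not correspond to an IAP API.

```python
def may_start_dr_task_worker(primary_processing_jobs: bool,
                             operator_confirmed: bool) -> bool:
    """Decide whether the task worker in the standby data center may be turned on.

    The task worker stays off while the primary system is still processing
    jobs, and activation is a manual step, so an operator must confirm it.
    """
    return (not primary_processing_jobs) and operator_confirmed


# Normal operations: the primary is processing jobs, so the worker stays off.
print(may_start_dr_task_worker(primary_processing_jobs=True, operator_confirmed=True))   # False
# Primary outage confirmed by an operator: the worker may be turned on.
print(may_start_dr_task_worker(primary_processing_jobs=False, operator_confirmed=True))  # True
```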