High Availability Architecture
Updated on 29 Mar 2024
Itential Automation Platform (IAP) is designed with a focus on high availability (HA). This article provides detailed component information and architectural sizing guidance for designing and deploying a highly available IAP. It also describes an option to set up a disaster recovery environment to mitigate catastrophic data center loss.
IAP HA Component Overview
Itential Automation Platform allows multiple IAP servers to be clustered when they are co-located in the same data center. These clustered IAP servers share the same MongoDB "itential" database, which provides shared storage and lets Workflow Engine run workflow tasks across the available IAP servers. In addition, they share the same Redis servers for communication. Together, this provides horizontal scaling when paired with a front-end load balancer, and high availability through redundant servers in the event of a server failure.
Some databases (prior to 2022.1) may be named "pronghorn". Either name is supported, but the preferred name is "itential".
When further assurance is needed, a second disaster recovery system can be created that stands ready to take over processing after a data center failure. This system is clustered at the MongoDB layer and requires some manual intervention to activate processing once the failure event has been identified and diagnosed.
It is important to distinguish between "high availability" and "disaster recovery", and between "active" and "standby" servers. The next section defines each of these terms.
Active vs Standby Servers
Itential Automation Platform supports both Active and Standby servers. In most use cases, standby servers do not accept user traffic or work tasks from automations during normal business operations; however, they are ready to take over these duties if an active IAP server fails or needs to be shut down.
- Active: One or more IAP servers that are running, accepting user traffic, and working tasks. If these servers shut down, automations and user traffic are impacted.
- Standby: One or more IAP servers that are running but do not receive incoming user traffic from load balancers and do not actively work automation tasks. The WorkFlowEngine service configuration property activate:false is set on these IAP servers to disable working tasks on server startup. The load balancers should send traffic to the standby servers only if the active IAP servers go offline.
- High Availability: Two or more IAP servers configured to share load simultaneously and therefore provide fault tolerance. Servers set up in this configuration share the same Redis and MongoDB and must be within the same data center. If any of the servers go down, job and task execution continues on the other servers due to the clustering and mirroring of Redis.
- Disaster Recovery: Whereas high availability mitigates failures at the host and application layers, disaster recovery ensures business continuity in a broader context if an entire data center becomes unavailable. The relationship between systems in two different data centers is not as seamless as between systems configured as highly available; however, the secondary data center system is kept in a running state with the IAP Task Worker off. In the event of a failure, startup time is reduced so job processing can be restored quickly. Only MongoDB is synchronized between data centers; Redis is not.
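The routing rule implied by the Active and Standby definitions can be sketched in a few lines of Python. This is only an illustrative model, not IAP or load balancer code: the IapServer class, the pick_backends function, and the server names are hypothetical, and a real deployment would express the same rule in the load balancer's own configuration.

```python
from dataclasses import dataclass


@dataclass
class IapServer:
    name: str
    role: str            # "active" or "standby", as defined in the list above
    healthy: bool = True


def pick_backends(servers):
    """Return the servers that should receive user traffic.

    Healthy active servers always take the traffic; standby servers
    are used only when no active server is available.
    """
    active = [s for s in servers if s.role == "active" and s.healthy]
    if active:
        return active
    return [s for s in servers if s.role == "standby" and s.healthy]


# Hypothetical pool: two active servers and one standby server.
pool = [
    IapServer("iap-1", "active"),
    IapServer("iap-2", "active"),
    IapServer("iap-3", "standby"),
]

normal = [s.name for s in pick_backends(pool)]      # both active servers

pool[0].healthy = False
degraded = [s.name for s in pick_backends(pool)]    # remaining active server

pool[1].healthy = False
failover = [s.name for s in pick_backends(pool)]    # standby takes over

print(normal, degraded, failover)
```

Note that the sketch covers only traffic routing. Whether a standby server also starts working tasks is governed separately by its WorkFlowEngine configuration.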
Shared-Token Redis
The IAP HA architecture uses a group of Master/Slave Redis servers to provide a single location where IAP servers jointly store the IAP tokens created on user login. Because Redis is shared, all IAP servers share login tokens, and users can switch seamlessly between clustered IAP servers without being redirected to the login page. The Master/Slave Redis setup achieves high availability by using Redis Sentinel to monitor the Redis group and ensure that a Master is always available.
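The effect of a shared token store can be illustrated with a toy in-memory model. In this sketch a plain Python dict stands in for Redis, and the IapServer class, its methods, and the server names are hypothetical; it shows only why a shared store avoids a second login, not how IAP actually stores tokens.

```python
import secrets


class IapServer:
    def __init__(self, name, token_store):
        self.name = name
        self.tokens = token_store   # a dict standing in for the shared Redis

    def login(self, user):
        """Create a login token and record it in the token store."""
        token = secrets.token_hex(16)
        self.tokens[token] = user
        return token

    def handle_request(self, token):
        """Serve the request if the token is known, else ask for a login."""
        user = self.tokens.get(token)
        if user is None:
            return "redirect to login"
        return f"{self.name} served {user}"


# Two clustered servers backed by the same store.
shared_store = {}
iap_1 = IapServer("iap-1", shared_store)
iap_2 = IapServer("iap-2", shared_store)

token = iap_1.login("alice")
result_shared = iap_2.handle_request(token)

# A server with its own private store does not recognize the token.
iap_3 = IapServer("iap-3", {})
result_isolated = iap_3.handle_request(token)

print(result_shared)    # iap-2 served alice
print(result_isolated)  # redirect to login
```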
MongoDB
MongoDB is a NoSQL document database designed with scalability and flexibility in mind. Itential Automation Platform uses MongoDB as its main repository for the data used in its workflow automations and in many of its applications. In the HA architecture, MongoDB also provides a single repository of information for multiple IAP servers working together. MongoDB offers extensive scalability and high availability options. IAP uses MongoDB replica sets to store redundant copies of data and so ensure availability in the event of a failure.
MongoDB uses an election-style system to choose a primary MongoDB server. This ensures that only one MongoDB server is ever seen as the primary (Master), preventing split-brain issues. The primary is the only member that accepts writes and, by default, the only member that serves reads. Should the primary fail, the remaining members detect its missed heartbeats and hold another election to bring the replica set back online. The number of replica set members determines the amount of redundancy available.
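The link between member count and redundancy follows from the election rule: a primary can be elected only while a majority of the voting members is reachable. The short sketch below works through that arithmetic; the function names are illustrative and are not part of MongoDB or IAP.

```python
def election_majority(voting_members: int) -> int:
    """Votes needed to elect a primary: more than half of the voting members."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """Members that can be lost while a primary can still be elected."""
    return voting_members - election_majority(voting_members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: majority {election_majority(n)}, "
          f"tolerates {fault_tolerance(n)} failure(s)")
```

A three-member set tolerates one failure, and a fourth member adds no extra tolerance, which is why replica sets are usually sized with an odd number of voting members.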
Architecture Sizing
The following options are the supported architectures for an IAP solution. Implement the one that matches the level of availability you require.
Small: No High Availability
HA-0: Bare minimum node count for each component. The recommendation is to install Redis on a separate virtual machine (VM).
Figure 1: Small Size HA
⚠ Note: This architecture is not recommended for production use because it provides no high availability. It can be used for non-critical environments such as DEV/TEST, but frequent manual backups are strongly recommended.
Medium: High Availability
HA-3: A single data center containing multiple instances of every component except Supporting Tools and Automation Gateway, which do not support clustering. Additional Automation Gateway nodes can be added as load requires.
Figure 2: Medium Size HA
Large: High Availability & Disaster Recovery
HA-6: A highly available configuration within a single data center that closely matches the layout of the Medium: High Availability architecture. In addition, a matching standby data center is available to support disaster recovery scenarios. It is recommended to deploy the MongoDB Arbiter node in a third data center. All systems in the standby data center are running, except that the IAP Task Worker remains off. This ensures that tasks do not run in the disaster recovery data center until the Task Worker is turned on manually. During normal operations, the primary data center provides highly available service; in the unlikely event of a catastrophic outage, the secondary data center system can be manually activated to resume operations.
⚠ Note: The disaster recovery secondary data center is in standby status as defined above. The Task Worker should not be turned on until the primary system is no longer processing jobs.
Figure 3: Large Size HA
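The recommendation to place the MongoDB Arbiter in a third data center can be checked with the same majority rule that governs replica set elections. The sketch below is illustrative only: the member counts per data center are hypothetical (this article does not specify them), and the function name and site labels are assumptions made for the example.

```python
def can_elect_primary(members_by_site, failed_site):
    """Return True if the surviving voting members still form a majority."""
    total = sum(members_by_site.values())
    surviving = total - members_by_site.get(failed_site, 0)
    return surviving >= total // 2 + 1


# Hypothetical layout: two data-bearing members per data center,
# plus an arbiter in a third data center.
with_arbiter = {"primary_dc": 2, "standby_dc": 2, "third_dc": 1}

# The same layout without the third-site arbiter.
without_arbiter = {"primary_dc": 2, "standby_dc": 2}

print(can_elect_primary(with_arbiter, "primary_dc"))     # True: 3 of 5 votes remain
print(can_elect_primary(without_arbiter, "primary_dc"))  # False: 2 of 4 is not a majority
```

Under these assumed counts, the arbiter lets MongoDB elect a primary in the standby data center after the primary data center is lost. Turning on the IAP Task Worker there remains a separate, manual step.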