Active-standby architecture
The active-standby architecture is Itential’s most resilient validated deployment model. It is designed for organizations that operate automation at a scale where unplanned downtime carries measurable business risk, whether that risk is regulatory, contractual, or operational.
This architecture provides geographic redundancy across three data centers and is capable of surviving the complete loss of any single data center without operator intervention. We recommend deploying this architecture if you have strict uptime requirements and formal business continuity programs.
This architecture requires significant infrastructure investment but delivers high resilience. The sections below describe the component footprint, hardware requirements, failover behavior, and operational expectations in detail so that infrastructure, operations, and security teams can assess readiness and plan accordingly.
Architecture overview
An active-standby architecture (ASA) deploys Itential Platform, MongoDB, and Redis across three geographically distributed data centers with full redundancy. This architecture builds on two geographically redundant HA2 installations and uses a larger geographically redundant MongoDB replica set.
Itential Platform performs frequent reads and writes to the database, so low latency between the active Platform instances and MongoDB is critical. For this reason, all active components run in the same data center. MongoDB and Redis replicate data from the primary node in the active data center to the other data centers. All components require authentication.
The minimum active-standby deployment requires 17 virtual machines (VMs) distributed across three data centers. The table below summarizes the full server inventory.
All servers require solid-state storage (SSD or NVMe) capable of at least 20,000 IOPS and network connectivity of 10 Gbps or higher. For complete hardware specifications, see Server specifications.
In addition to the servers, the architecture requires the following:
- A Global Traffic Manager (GTM) load balancer for routing between data centers during failover
- Local Traffic Manager (LTM) load balancers in each active data center
- Network routing that permits inter-component traffic flows (see Network requirements)
Highly available Itential Platform
Itential Platform instances communicate with one another indirectly through Redis. To achieve high availability, add Itential Platform nodes and configure each one to access the same MongoDB and Redis instances. Each instance begins performing work as soon as it is added and configured.
Configure Itential Platform instances with the following:
- MongoDB connection strings that reference all members of the replica set
- Redis configurations that specify the list of all known Redis Sentinels and their Sentinel username and password (connections use Redis Sentinels rather than direct Redis connections)
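As a minimal sketch of the two requirements above, the following Python builds a MongoDB connection string that names every replica set member and a Sentinel settings structure that lists every known Sentinel. The hostnames, ports, replica set name, primary name, credentials, and the dictionary key names are all illustrative assumptions, not Itential property names.

```python
from urllib.parse import quote_plus

# Hypothetical hosts; list every member of your own replica set.
MONGO_MEMBERS = [
    "mongo1.dc1.example.com:27017",
    "mongo2.dc1.example.com:27017",
    "mongo1.dc2.example.com:27017",
    "mongo2.dc2.example.com:27017",
    "mongo1.dc3.example.com:27017",
]

# Hypothetical Sentinel endpoints, one entry per known Sentinel.
SENTINELS = [
    ("sentinel1.dc1.example.com", 26379),
    ("sentinel1.dc2.example.com", 26379),
    ("sentinel1.dc3.example.com", 26379),
]


def mongo_uri(members, database, replica_set, user, password):
    """Return a MongoDB URI whose seed list names every replica set member."""
    # Percent-encode credentials so reserved characters do not break the URI.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return (
        f"mongodb://{credentials}@{','.join(members)}"
        f"/{database}?replicaSet={replica_set}"
    )


def sentinel_settings(sentinels, primary_name, user, password):
    """Return an illustrative settings dict; the key names are assumptions."""
    return {
        "sentinels": [{"host": host, "port": port} for host, port in sentinels],
        "primary_name": primary_name,
        "sentinel_username": user,
        "sentinel_password": password,
    }


uri = mongo_uri(MONGO_MEMBERS, "itential", "rs0", "itential", "p@ss/word")
redis_settings = sentinel_settings(SENTINELS, "itentialprimary", "sentineluser", "change-me")
print(uri)
```

Listing every member in the seed list lets the driver discover the current primary even when some of the listed hosts are unreachable.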
Highly available databases
Both MongoDB and Redis use a primary/secondary replication model. When a primary node fails, the replica set initiates an election for a new primary. The replica set cannot accept reads and writes until the new primary is selected, typically within a few seconds. Once a new primary is identified, the Itential Platform resumes normal operation. Operators do not need to take action during elections.
MongoDB configuration
MongoDB clusters operate in a primary/secondary model where data written to the primary replicates to secondary nodes. To prevent split-brain scenarios during elections, the architecture requires an odd number of replica set members distributed across three data centers: 2 in the primary region, 2 in the secondary region, and 1 in a tertiary region. When any one region is lost, at least three of the five voting members remain, which is still a majority. The replica set configuration must also assign member priorities that influence voting, so that the primary MongoDB role shifts to the secondary region in the case of a disaster.
Configure Itential’s MongoDB cluster with the following requirements:
- All replica set members must be defined in the Itential Platform config
- Configure authentication between replica members using either a shared key or X.509 certificate
- Create an admin user with full access to perform any operation
- Create an “itential” user with least-privilege access to the Itential database only (configure Itential Platform to use this user account)
- Use the priority settings to influence voting so that members in the primary data center are preferred over members in the secondary data center
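The priority requirement can be illustrated with a replica set configuration document. The sketch below uses the field names of the MongoDB replica set configuration (members, host, priority, arbiterOnly), but the hostnames, the set name, the site mapping, and the specific priority values are assumptions chosen only to show the ordering. The tertiary member is modeled as an arbiter.

```python
# Illustrative replica set configuration; values are assumptions, not
# Itential-published settings. Higher priority is more preferred in MongoDB.
REPLICA_SET_CONFIG = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo1.dc1.example.com:27017", "priority": 5},
        {"_id": 1, "host": "mongo2.dc1.example.com:27017", "priority": 4},
        {"_id": 2, "host": "mongo1.dc2.example.com:27017", "priority": 3},
        {"_id": 3, "host": "mongo2.dc2.example.com:27017", "priority": 2},
        {"_id": 4, "host": "mongo1.dc3.example.com:27017",
         "arbiterOnly": True, "priority": 0},
    ],
}

# Hypothetical mapping of member _id to data center, used only by the check.
SITE = {0: "dc1", 1: "dc1", 2: "dc2", 3: "dc2", 4: "dc3"}


def check_config(config, site):
    """Return a list of problems found in the configuration (empty if none)."""
    problems = []
    members = config["members"]
    voters = [m for m in members if m.get("votes", 1) > 0]
    if len(voters) % 2 == 0:
        problems.append("voting member count is even")
    for member in members:
        if member.get("arbiterOnly") and member.get("priority", 1) != 0:
            problems.append(f"arbiter {member['host']} must have priority 0")
    dc1 = [m["priority"] for m in members if site[m["_id"]] == "dc1"]
    dc2 = [m["priority"] for m in members if site[m["_id"]] == "dc2"]
    if min(dc1) <= max(dc2):
        problems.append("every dc1 member should outrank every dc2 member")
    return problems


print(check_config(REPLICA_SET_CONFIG, SITE))
```

Running a check like this before applying a configuration change is one way to catch an even voter count or an inverted priority before it affects an election.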
Redis configuration
To avoid single points of failure, arrange Redis data-bearing nodes in pairs across both data centers as a replica set. Deploy Sentinels in all 3 data centers so that the outage of a single data center cannot reduce Sentinel availability below a majority. A majority of Sentinels must always be available for failover to work.
Configure Itential’s Redis replica sets with the following requirements:
- Define all Redis nodes in the Itential Platform configuration
- Configure authentication between replica members using user credentials in the Redis configuration file
- Create an admin user with full Redis access
- Create an “itential” user with least-privilege access required by the Itential Platform
- Create a replication user with least-privilege access for the replication process
- Include Redis Sentinel to monitor the Redis cluster
- Redis Sentinel may be collocated with Redis but is not required to be collocated
- Create an admin user for Redis Sentinel with full access to perform any Sentinel task
- Maintain low-latency connections between Redis nodes to prevent replication failures
- Configure Redis priority settings to make primary member elections deterministic. The ordering is the opposite of MongoDB: in Redis, the lowest value is most preferred.
Redis requires careful latency management. If latency between Redis nodes exceeds 10 ms, replication lag and failover issues can occur. Keep all Redis nodes within a single region or use high-bandwidth, low-latency interconnects between regions.
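The effect of Redis priority settings on failover can be sketched as follows. This is a simplified model of how Sentinel ranks replicas: a priority of 0 excludes a replica from promotion, a lower non-zero priority wins, then the replica that has processed more replication data, then the lexicographically smaller run ID. The node names, priority values, offsets, and run IDs are hypothetical, and the model ignores Sentinel's check that excludes replicas disconnected from the primary for too long.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    priority: int     # replica-priority from the Redis configuration
    repl_offset: int  # amount of replication data processed
    run_id: str


def promotion_candidate(replicas):
    """Return the replica a failover would prefer, or None if none is eligible."""
    # Priority 0 marks a replica as never eligible for promotion.
    eligible = [r for r in replicas if r.priority > 0]
    if not eligible:
        return None
    # Lower priority first, then higher offset, then smaller run ID.
    return min(eligible, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


replicas = [
    Replica("redis2.dc1", priority=10, repl_offset=1000, run_id="b"),
    Replica("redis1.dc2", priority=20, repl_offset=1000, run_id="a"),
    Replica("redis2.dc2", priority=0, repl_offset=2000, run_id="c"),
]
print(promotion_candidate(replicas).name)
```

Assigning distinct priorities to every replica avoids falling through to the offset and run ID tie-breakers, which is what makes the outcome deterministic.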
What happens during a data center outage?
The active-standby architecture is designed so that the loss of any single data center triggers automatic recovery without requiring operator action. The specific behavior depends on which data center is lost.
Loss of the primary data center
If Data Center 1 (active) becomes unavailable, the MongoDB replica set detects the loss of its two highest-priority members and holds an election. The two MongoDB nodes in the secondary data center carry sufficient priority to assume the primary role. Redis Sentinel, with members still available in Data Centers 2 and 3, similarly promotes the secondary Redis node to primary. The GTM load balancer then routes traffic to Data Center 2, where a standby set of Platform and Gateway nodes is running and ready to accept work. The Platform nodes in Data Center 2 resume processing once both MongoDB and Redis primaries are established in that data center.
Loss of the secondary data center
If Data Center 2 becomes unavailable, the voting members in Data Centers 1 and 3 together retain a majority in both MongoDB and Redis Sentinel. The existing primaries stay in the primary data center, which continues normal operation without disruption.
Loss of the tertiary data center
Data Center 3 hosts only a Redis Sentinel and a MongoDB arbiter. Losing it does not cause a primary election in either MongoDB or Redis because a majority of voting members (in Data Centers 1 and 2) remain available. No failover occurs and operations continue uninterrupted.
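The three outage scenarios above can be checked with a small model of the MongoDB replica set. The sketch assumes the 2/2/1 member layout described earlier, with the tertiary member acting as an arbiter, and uses hypothetical priority values. It is a simplification: a real election also considers how current each candidate's data is.

```python
# (data center, priority, data-bearing); priority values are assumptions.
MEMBERS = [
    ("dc1", 5, True),
    ("dc1", 4, True),
    ("dc2", 3, True),
    ("dc2", 2, True),
    ("dc3", 0, False),  # arbiter: votes but holds no data
]


def primary_site_after_outage(members, lost_dc):
    """Return the data center expected to host the primary, or None."""
    survivors = [m for m in members if m[0] != lost_dc]
    # An election needs a strict majority of all configured voting members.
    has_majority = len(survivors) > len(members) // 2
    candidates = [m for m in survivors if m[2] and m[1] > 0]
    if not has_majority or not candidates:
        return None
    # The highest-priority surviving data-bearing member is preferred.
    return max(candidates, key=lambda m: m[1])[0]


for dc in ("dc1", "dc2", "dc3"):
    print(dc, "lost ->", primary_site_after_outage(MEMBERS, dc))
```

In this model only the loss of the primary data center moves the primary role; losing either of the other two leaves it in place.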
Recovery time expectations
Automatic recovery is handled by MongoDB and Redis election mechanisms. Under normal network conditions, elections complete within a few seconds. During that window the platform may not be able to accept new work or commit data. No manual intervention is required. Once elections complete, the platform resumes automatically.
Define your own Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets based on your specific operational requirements and validate those targets during initial deployment testing. The architecture is designed to support RTOs measured in seconds for component-level failures and minutes for full data center failures, contingent on external load balancer configuration.
Backup and recovery
Replication provides infrastructure redundancy, not data protection. Backups are essential to guard against logical corruption and accidental deletion.
Geographic replication is not a substitute for backup. Replication propagates all writes, including unintended ones, to all members of the replica set. Implement and test a backup strategy appropriate to your recovery objectives.
MongoDB is the system of record for all platform data including workflows, jobs, inventory, and configuration. Itential recommends a regular backup schedule using MongoDB-native tooling (mongodump or MongoDB Ops Manager) with backups stored outside the primary replica set, preferably in a location not subject to the same failure scenarios as the three data centers. Consider point-in-time recovery using oplog tailing for low RPO requirements.
For more information, see Back up and restore MongoDB.
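As one possible starting point for a scheduled backup, the sketch below assembles a mongodump command line without running it. The --uri, --oplog, --gzip, and --archive options are standard mongodump options; the hosts, replica set name, backup directory, and file naming scheme are assumptions. The URI deliberately omits a database name, because --oplog applies only to full-instance dumps, and omits credentials, which should be supplied according to your own secrets handling practice.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def mongodump_command(uri, backup_dir, now=None):
    """Return the argument list for a compressed, oplog-consistent dump."""
    now = now or datetime.now(timezone.utc)
    # Timestamped archive name so successive backups do not overwrite each other.
    archive = PurePosixPath(backup_dir) / f"itential-{now:%Y%m%dT%H%M%SZ}.archive.gz"
    return [
        "mongodump",
        f"--uri={uri}",
        "--oplog",   # capture oplog entries written while the dump runs
        "--gzip",
        f"--archive={archive}",
    ]


command = mongodump_command(
    "mongodb://mongo1.dc1.example.com:27017,mongo1.dc2.example.com:27017/?replicaSet=rs0",
    "/backups/mongodb",
    now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(" ".join(command))
```

The resulting list can be passed to a scheduler or to subprocess.run once the target directory is confirmed to be outside the three data centers' shared failure domain.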
Redis functions as a transient message and job queue layer. Redis data is reconstructed by the platform on reconnection and does not require the same backup posture as MongoDB. However, if you use Redis persistence (AOF or RDB), ensure that persistence files are included in your broader backup strategy.
Test backup procedures on a scheduled basis, not only at initial deployment.
Operational overhead
The ongoing operational effort this architecture requires differs between normal operations and incident response.
Normal operations
Under steady-state conditions this architecture requires relatively little hands-on management. The platform, MongoDB, and Redis all handle routine internal events, such as replica lag recovery and leader election, without operator intervention. The primary operational responsibilities are:
- Monitoring cluster health across all three data centers (see Monitoring expectations section below)
- Applying operating system and component patches on a scheduled basis using rolling procedures that avoid simultaneous downtime of majority replica set members
- Periodically testing failover scenarios in non-production environments to validate that GTM routing and Sentinel configuration remain correct as the environment evolves
- Managing TLS certificates and credential rotation across all components on their respective expiration schedules
Required skills
Operating this architecture competently requires staff with working knowledge of MongoDB replica set administration, Redis Sentinel configuration, Linux system administration, and your load balancer and network infrastructure. Familiarity with your monitoring stack is also required. Itential Professional Services can assist with initial deployment and knowledge transfer.
What operators don’t need to do
Primary elections in MongoDB and Redis are fully automatic. Operators do not need to manually promote a replica to primary during a failover. The platform reconnects to new primaries automatically once elections complete.
Monitoring
Establish monitoring coverage for the following conditions at minimum. Without visibility into these signals, degraded states can go undetected until they compound into an outage.
MongoDB
Monitor replica set member health and replication lag across all five nodes. A lagging secondary that has not caught up with the primary is a risk factor in any subsequent failover. Alert on replication lag exceeding a threshold appropriate to your RPO. Monitor available disk space on all data-bearing nodes, particularly the /var/lib/mongo partition, which holds the data files.
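A replication lag alert can be derived from the output of the replSetGetStatus command, which reports a state and an optimeDate for each member. The following sketch works on a document of that shape; the member names, timestamps, and the 30-second threshold are illustrative assumptions, and the threshold should follow from your own RPO.

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Return {member name: lag behind the primary} for each secondary."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return None  # no primary right now, for example during an election
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def lagging_members(status, threshold):
    """Return the names of secondaries whose lag exceeds the threshold."""
    lag = replication_lag(status) or {}
    return sorted(name for name, delta in lag.items() if delta > threshold)


base = datetime(2024, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "mongo1.dc1:27017", "stateStr": "PRIMARY", "optimeDate": base},
        {"name": "mongo2.dc1:27017", "stateStr": "SECONDARY",
         "optimeDate": base - timedelta(seconds=1)},
        {"name": "mongo1.dc2:27017", "stateStr": "SECONDARY",
         "optimeDate": base - timedelta(seconds=45)},
        {"name": "mongo1.dc3:27017", "stateStr": "ARBITER"},
    ]
}
print(lagging_members(sample_status, timedelta(seconds=30)))
```

Arbiters are skipped because they hold no data and therefore have no meaningful replication position.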
Redis
Monitor Sentinel-reported primary/secondary topology to confirm the expected node holds the primary role at all times. Alert on any unexpected role change, which may indicate a silent failover occurred. Monitor replication lag between Redis nodes and connection counts from the platform.
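The unexpected role change alert described above reduces to comparing the primary address that Sentinel reports with the address you expect. The sketch below assumes the reported value has the shape returned by the SENTINEL get-master-addr-by-name command, a host and a port, and that Sentinel reports the same form of address (IP or hostname) that you configured as expected. The addresses are hypothetical.

```python
def primary_alert(expected, reported):
    """Return an alert message, or None when the expected node is primary."""
    if reported is None:
        return "no primary reported by Sentinel"
    # Sentinel replies with strings, so normalize the port before comparing.
    host, port = reported[0], int(reported[1])
    if (host, port) != expected:
        return (
            f"unexpected Redis primary {host}:{port}, "
            f"expected {expected[0]}:{expected[1]}"
        )
    return None


EXPECTED_PRIMARY = ("10.0.1.10", 6379)  # hypothetical address in Data Center 1
print(primary_alert(EXPECTED_PRIMARY, ["10.0.2.10", "6379"]))
```

Polling this check from each data center also shows whether all Sentinels agree on the current topology.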
Platform nodes
Monitor application logs for job processing errors and connectivity failures to MongoDB or Redis. Elevated error rates are early indicators of a dependency problem.
Infrastructure
Monitor network latency between data centers. The platform is sensitive to latency in its MongoDB and Redis connections. Elevated cross-datacenter latency is a leading indicator of replication problems.
Configure the workflow engine
The installation process sets the appropriate configurations in MongoDB and Redis. Within Itential Platform, the most important additional consideration is knowing what state the Workflow Engine is in at any given moment. In an ASA, the secondary data center (standby site) contains Itential Platform servers that you must configure to remain passive until a failover event occurs. This configuration prevents both data centers from processing workloads simultaneously.
There are two settings for controlling the state of the Workflow Engine, both found in either the configuration file or in corresponding environment variables. The properties file is typically found at /etc/itential/platform.properties.
Initialize both properties to false in the secondary data center so that it remains passive and does not start or process jobs or tasks. During a failover event, after both MongoDB and Redis have successfully completed their elections, set both properties to true to activate the secondary workflow engines and resume automations. You can change both properties at runtime with RESTful API calls: start the task worker first, then the job worker. Both APIs require a valid session token and must be run against each Itential Platform server individually; they cannot be executed through a load balancer.
After making these requests the secondary data center processes automations. At this point, disable both workers on the previously active data center so that only one site processes workloads.
Itential recommends setting both properties to false in both the active and secondary data centers. Setting all Itential Platform servers to disable jobs and tasks gives you a known state whenever instances stop and restart, making it a deliberate action to enable job and task processing even in the primary data center.
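The activation order described in this section can be sketched as a small driver that visits each standby server directly and enables the task worker before the job worker. The send callable is a hypothetical stand-in for an authenticated API request; this sketch does not assume any endpoint path or payload. It also assumes that both workers are started on one server before moving to the next, which the text above does not specify.

```python
def activate_standby(servers, send):
    """Call send(server, worker) per server, task worker before job worker."""
    calls = []
    for server in servers:
        # Each server is addressed individually, never through a load balancer.
        for worker in ("task_worker", "job_worker"):
            send(server, worker)
            calls.append((server, worker))
    return calls


# Hypothetical standby servers in the secondary data center.
STANDBY_SERVERS = ["platform1.dc2.example.com", "platform2.dc2.example.com"]

recorded = []
activate_standby(STANDBY_SERVERS, lambda server, worker: recorded.append((server, worker)))
for server, worker in recorded:
    print(server, worker)
```

Injecting the request function keeps the ordering logic testable without a running platform and makes it easy to log every call during a failover.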
Server specifications
For production environments, all Itential Platform components should be installed on their own individual servers to properly support High Availability (HA). Disk references to pronghorn (seen in older deployments) should be changed to itential.
Itential Platform server
MongoDB server
Redis server
IAG server
The following applies to a simple All-In-One implementation of Itential Automation Gateway (IAG). For more information about alternative IAG architectures, see Choose a deployment architecture.
Hardware requirements
Processor
Processor specification requirements:
- Second generation or better Intel Xeon Platinum 8000 series processors
- Third generation or better AMD EPYC 7000 series processors
Memory
Memory specification requirement:
- DDR5 DRAM 3200 MHz or higher
Storage
Storage performance requirements in IOPS (16 KiB):
- 20000+ IOPS
- Non-spinning media (SSD, NVMe)
Network
Network speed requirement:
- 10 Gbps or higher
In some cases, consider adding dedicated network interfaces that route specific traffic to specific external systems. This routing is configured at the OS level (custom interfaces and routes), and the system administrator must manage it. For example, NSO traffic can be separated from traffic destined for Redis and MongoDB.
Hypervisor/host OS settings
These settings are strongly recommended for high load applications of Itential Platform:
- CPU affinity settings or similar functionality to prevent CPU starvation
- Full memory reservation
- One physical CPU per VM is preferred
- Huge pages for memory support enabled (except MongoDB)
- Memory compression disabled
- Minimal CPU allocation settings for scheduler according to CPU clock
Example: Assuming an Itential Platform VM on a server capable of 2.5 GHz nominal speed:
Follow hypervisor recommendations when performing CPU reservations. In most cases the total of all CPU reservations for all VMs on a host cannot be more than 90% of the host capacity as 10% is reserved by the host itself.
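The reservation arithmetic can be worked through in a few lines. The sketch assumes that a full CPU reservation equals the number of vCPUs multiplied by the host's nominal clock speed, and that the host keeps 10% of its capacity for itself, as stated above. The 16 vCPU VM size and the 48-core host are hypothetical values chosen for illustration.

```python
def cpu_reservation_mhz(vcpus, nominal_ghz):
    """Return the full reservation for a VM in MHz."""
    return round(vcpus * nominal_ghz * 1000)


def fits_on_host(reservations_mhz, host_cores, nominal_ghz, host_share=0.10):
    """Return True if the reservations fit within the host's usable capacity."""
    capacity_mhz = host_cores * nominal_ghz * 1000
    usable_mhz = capacity_mhz * (1 - host_share)
    return sum(reservations_mhz) <= usable_mhz


vm = cpu_reservation_mhz(vcpus=16, nominal_ghz=2.5)
print(vm)                                      # MHz reserved for one VM
print(fits_on_host([vm, vm], 48, 2.5))         # two such VMs on a 48-core host
print(fits_on_host([vm, vm, vm], 48, 2.5))     # three such VMs on the same host
```

With these numbers a 48-core host offers 120,000 MHz, of which 108,000 MHz is usable, so two 40,000 MHz reservations fit and a third does not.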
MongoDB discourages the use of Transparent Huge Pages with versions 7 and below. This guidance changes in version 8, which encourages the use of Transparent Huge Pages.
Network requirements
In an environment where components are installed on more than one host, the following network traffic flows need to be allowed. All ports and networking specs are TCP protocol unless otherwise noted. Not all ports will need to be open for every supported architecture. Secure ports are only required when explicitly configured.
Required user accounts in dependencies
The validated designs are opinionated installations of Itential and its dependencies. The following user accounts are required by the dependencies.