Deploying high availability clusters

High Availability (HA) IAG deployments ensure continuous service availability by maintaining one active gateway server with multiple standby nodes ready to take over in case of failure. This guide covers the procedures for configuring both all-in-one HA clusters and HA clusters with distributed execution.

Prerequisites

Before configuring an HA cluster, ensure you have:

Multiple IAG server nodes installed and ready for configuration
Access to configure a shared database (etcd or Amazon DynamoDB)
Valid SSL certificates for each node
Administrative access to Gateway Manager

Configuring All-in-One HA Clusters

Step 1: Configure the shared database

All nodes in the HA cluster must connect to the same shared database to coordinate leadership and share cluster state information. Use either etcd or Amazon DynamoDB, depending on your infrastructure preferences and requirements.

etcd

Set up your etcd database following the etcd database configuration procedures
Ensure all gateway servers can connect to the etcd instance
Verify network connectivity and firewall rules allow etcd communication

Amazon DynamoDB

Set up your DynamoDB database following the Amazon DynamoDB table configuration procedures
Configure appropriate AWS credentials and permissions for all gateway servers
Verify network connectivity and AWS access from all nodes
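The connectivity checks above can be partly automated. The following is a minimal sketch, not an IAG tool: it only tests whether a TCP connection can be opened from a node to the database endpoint. The host names are hypothetical placeholders, port 2379 assumes etcd's default client port, and the DynamoDB host assumes the us-east-1 regional endpoint. A successful connection does not prove that credentials or permissions are valid.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical endpoints; replace with the ones your cluster uses.
    endpoints = [
        ("etcd.example.internal", 2379),            # assumed etcd client port
        ("dynamodb.us-east-1.amazonaws.com", 443),  # assumed AWS region
    ]
    for host, port in endpoints:
        print(f"{host}:{port} reachable={can_reach(host, port)}")
```

Run the script on every gateway server, since firewall rules can differ per node.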

Step 2: Enable high availability mode

Configure HA mode on all gateway servers in the cluster:

On each gateway server, edit the gateway configuration file
Set the GATEWAY_CONNECT_SERVER_HA_ENABLED configuration variable to true
Set the GATEWAY_APPLICATION_CLUSTER_ID configuration variable to your desired cluster identifier. All nodes must share the same cluster ID.
These settings must be applied to all controller nodes that will participate in the HA cluster
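Because these two settings must agree across every node, a small consistency check can catch mistakes before startup. The sketch below assumes you have already collected each node's settings into a dictionary by whatever means suits your environment; the node names, the dictionary layout, and the case-insensitive handling of "true" are illustrative assumptions, not part of the product.

```python
from typing import Dict, List

HA_ENABLED = "GATEWAY_CONNECT_SERVER_HA_ENABLED"
CLUSTER_ID = "GATEWAY_APPLICATION_CLUSTER_ID"


def check_ha_settings(nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of problems found; an empty list means the settings agree."""
    problems = []
    for name, settings in nodes.items():
        if settings.get(HA_ENABLED, "").strip().lower() != "true":
            problems.append(f"{name}: {HA_ENABLED} is not set to true")
        if not settings.get(CLUSTER_ID, "").strip():
            problems.append(f"{name}: {CLUSTER_ID} is missing")
    # Every node that sets a cluster ID must use the same value.
    ids = {s.get(CLUSTER_ID, "").strip() for s in nodes.values()} - {""}
    if len(ids) > 1:
        problems.append(f"cluster IDs differ across nodes: {sorted(ids)}")
    return problems


if __name__ == "__main__":
    # Hypothetical node names and cluster ID.
    cluster = {
        "gw-1": {HA_ENABLED: "true", CLUSTER_ID: "cluster_1"},
        "gw-2": {HA_ENABLED: "true", CLUSTER_ID: "cluster_1"},
    }
    print(check_ha_settings(cluster) or "settings are consistent")
```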

Step 3: Designate the primary node

Identify which gateway server should be the preferred active node:

On your designated primary gateway server, set the GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY configuration variable to true. This ensures the primary node will maintain the connection to Gateway Manager when it is online and available.
Leave this setting as false (or unset) on all other nodes in the cluster.
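The one-primary rule can be checked in the same way. This sketch takes a per-node settings dictionary (a hypothetical layout, with hypothetical node names) and reports which node is marked as primary, raising an error if more than one node sets the flag. Treating an unset value as false follows the guidance above; accepting "true" case-insensitively is an assumption.

```python
from typing import Dict, List, Optional

IS_PRIMARY = "GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY"


def find_primaries(nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the names of all nodes whose primary flag is set to true."""
    return sorted(
        name
        for name, settings in nodes.items()
        if settings.get(IS_PRIMARY, "false").strip().lower() == "true"
    )


def preferred_primary(nodes: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Return the single primary node, or None if no node is marked primary."""
    primaries = find_primaries(nodes)
    if len(primaries) > 1:
        raise ValueError(f"more than one node sets {IS_PRIMARY}=true: {primaries}")
    return primaries[0] if primaries else None


if __name__ == "__main__":
    # Hypothetical cluster: gw-1 is primary, the others leave the flag false or unset.
    cluster = {"gw-1": {IS_PRIMARY: "true"}, "gw-2": {}, "gw-3": {IS_PRIMARY: "false"}}
    print(preferred_primary(cluster))
```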

Step 4: Configure SSL certificates

Each gateway server in the cluster requires its own SSL certificate configuration:

On each gateway server, configure the following variables:
- GATEWAY_CONNECT_CERTIFICATE_FILE: Path to the SSL certificate file
- GATEWAY_CONNECT_PRIVATE_KEY_FILE: Path to the private key file
Each gateway server can use its own unique certificate and key pair, or all servers can share a common certificate.

Step 5: Register the gateway cluster with Gateway Manager

Follow the procedures for Creating gateway clusters to add your HA cluster to Gateway Manager.

Note

Configure all nodes in your HA deployment with the same GATEWAY_APPLICATION_CLUSTER_ID. The gateway cluster ID you provide when creating the gateway cluster in Gateway Manager must also match the GATEWAY_APPLICATION_CLUSTER_ID.

Configuring HA with Distributed Execution

For HA clusters that include runner nodes, follow the all-in-one HA configuration steps above for the gateway servers, then add runner nodes to the cluster.

Additional Steps for Runner Nodes

Connect runner nodes to shared database: Configure each runner node to connect to the same database (etcd or Amazon DynamoDB) used by the gateway servers
Configure cluster membership: Ensure runner nodes are configured with the same cluster ID as the gateway servers
Verify connectivity: Confirm runner nodes can communicate with all gateway servers in the cluster
Test execution delegation: Verify that any active core server can send execution requests to the runner nodes

For more detailed procedures, see Deploying distributed execution clusters.

Testing the HA Configuration

Verify Initial Setup

Check cluster status: Review logs on all nodes to confirm they recognize each other
Confirm leadership: Verify that only one node shows as active in the logs
Test connectivity: Ensure the active node maintains its connection to Gateway Manager

Test Failover Behavior

Monitor standby nodes: Check logs to confirm standby nodes recognize the current active node
Initiate controlled failover: Gracefully shut down the active node
Observe leadership election: Watch for a standby node to become active
Verify service continuity: Confirm the new active node connects to Gateway Manager
Test service execution: Run automation tasks to ensure functionality

Sample Failover Log Sequence

During a successful failover, you should observe log entries similar to:

Active node before shutdown:

active-gateway | 2024-12-23T18:03:59Z INF connected to gateway manager at my-itential-cloud-server-ip:443
active-gateway | 2024-12-23T18:04:22Z INF got signal for shutdown....terminated

Standby node taking over:

standby-gateway | 2024-12-23T18:04:19Z DBG this core node with Id of 'xxx' is not the active node
standby-gateway | 2024-12-23T18:04:24Z INF node xxx is elected as the leader. About to start gateway manager...
standby-gateway | 2024-12-23T18:04:24Z INF connected to gateway manager at my-itential-cloud-server
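To measure how long a failover took, you can extract the shutdown and election timestamps from the combined logs. The sketch below assumes the log lines have exactly the shape shown in the sample above (source name, a pipe, a UTC timestamp, a three-letter level, then the message) and matches on the message text from that sample. Real log output may differ, so treat the pattern and the matched phrases as assumptions to adjust.

```python
import re
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Assumed line shape: "<source> | <timestamp>Z <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(?P<source>\S+)\s*\|\s*(?P<ts>\S+)\s+(?P<level>[A-Z]{3})\s+(?P<msg>.*)$"
)

SAMPLE = [
    "active-gateway | 2024-12-23T18:03:59Z INF connected to gateway manager at my-itential-cloud-server-ip:443",
    "active-gateway | 2024-12-23T18:04:22Z INF got signal for shutdown....terminated",
    "standby-gateway | 2024-12-23T18:04:19Z DBG this core node with Id of 'xxx' is not the active node",
    "standby-gateway | 2024-12-23T18:04:24Z INF node xxx is elected as the leader. About to start gateway manager...",
    "standby-gateway | 2024-12-23T18:04:24Z INF connected to gateway manager at my-itential-cloud-server",
]


def parse_line(line: str) -> Optional[Tuple[str, datetime, str, str]]:
    """Split one log line into (source, timestamp, level, message), or None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    try:
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return m["source"], ts, m["level"], m["msg"]


def failover_gap_seconds(lines: Iterable[str]) -> Optional[float]:
    """Seconds between the first shutdown and the first election that follows it."""
    events = [p for p in map(parse_line, lines) if p is not None]
    shutdowns = [ts for _, ts, _, msg in events if "got signal for shutdown" in msg]
    if not shutdowns:
        return None
    shutdown = min(shutdowns)
    elections = [
        ts for _, ts, _, msg in events
        if "is elected as the leader" in msg and ts >= shutdown
    ]
    if not elections:
        return None
    return (min(elections) - shutdown).total_seconds()


if __name__ == "__main__":
    print(failover_gap_seconds(SAMPLE))  # prints 2.0 for the sample sequence
```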

Troubleshooting Common Issues

Nodes Not Recognizing Each Other

Verify all nodes connect to the same shared database (etcd or Amazon DynamoDB)
Check network connectivity between nodes
For etcd: Confirm etcd service is running and accessible
For Amazon DynamoDB: Verify AWS credentials and permissions are correctly configured

Multiple Active Nodes

Ensure only one node has GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY set to true
Check for network partitions affecting database communication
Verify database consistency and accessibility

Failover Not Occurring

Confirm GATEWAY_CONNECT_SERVER_HA_ENABLED is set to true on all nodes
Check shared database connectivity from standby nodes
Review firewall rules and network connectivity
For Amazon DynamoDB: Verify AWS permissions allow read/write access

SSL Certificate Issues

Verify certificate file paths are correct
Ensure gateway service has read access to certificate files
Check certificate validity and expiration dates
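The first two checks can be scripted. The sketch below verifies that the two certificate variables point to existing, readable files. It assumes the settings are available as a dictionary; passing os.environ works only if the variables are exported in the environment, which is an assumption about your setup. Run it as the same user that runs the gateway service, because read access is evaluated for the current user. It does not check expiration dates; a tool such as `openssl x509 -enddate -noout -in <certificate file>` can report those.

```python
import os
from typing import List, Mapping

CERT_VARS = (
    "GATEWAY_CONNECT_CERTIFICATE_FILE",
    "GATEWAY_CONNECT_PRIVATE_KEY_FILE",
)


def check_cert_paths(settings: Mapping[str, str]) -> List[str]:
    """Return a list of problems with the configured certificate and key paths."""
    problems = []
    for var in CERT_VARS:
        path = settings.get(var, "")
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isfile(path):
            problems.append(f"{var}: {path} does not exist or is not a file")
        elif not os.access(path, os.R_OK):
            problems.append(f"{var}: {path} is not readable by the current user")
    return problems


if __name__ == "__main__":
    # Assumes the variables are exported in the current environment.
    for problem in check_cert_paths(os.environ) or ["certificate paths look usable"]:
        print(problem)
```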

Post-Deployment Monitoring

After successfully deploying your HA cluster:

Set up monitoring: Implement monitoring for all cluster nodes
Review logs regularly: Monitor logs for any connectivity or leadership issues
Test failover periodically: Regularly test failover procedures to ensure reliability
Update documentation: Document your specific cluster configuration and any customizations
Plan maintenance: Develop procedures for updating and maintaining the HA cluster