Deploying high availability clusters
- Updated on 05 Jun 2025
High Availability (HA) IAG deployments ensure continuous service availability by maintaining one active gateway server with multiple standby nodes ready to take over in case of failure. This guide covers the procedures for configuring both all-in-one HA clusters and HA clusters with distributed execution.
Prerequisites
Before configuring an HA cluster, ensure you have:
- Multiple IAG server nodes installed and ready for configuration
- Access to configure a shared database (etcd or Amazon DynamoDB)
- Valid SSL certificates for each node
- Administrative access to Gateway Manager
Configuring All-in-One HA Clusters
Step 1: Configure the shared database
All nodes in the HA cluster must connect to the same shared database to coordinate leadership and share cluster state information. You can use either etcd or Amazon DynamoDB, depending on your infrastructure preferences and requirements.
etcd
- Set up your etcd database following the etcd database configuration procedures
- Ensure all gateway servers can connect to the etcd instance
- Verify network connectivity and firewall rules allow etcd communication
Amazon DynamoDB
- Set up your DynamoDB database following the Amazon DynamoDB table configuration procedures
- Configure appropriate AWS credentials and permissions for all gateway servers
- Verify network connectivity and AWS access from all nodes
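The connectivity checks above can be sketched as a basic TCP reachability probe run from each gateway server. This is a minimal sketch, not an official tool: the function name `can_reach` is hypothetical, and a successful TCP connection only shows that the network path and firewall rules are open. It does not validate etcd authentication or AWS credentials.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (hostnames are placeholders, not values from this guide):
#   can_reach("etcd.example.internal", 2379)            # 2379 is etcd's default client port
#   can_reach("dynamodb.us-east-1.amazonaws.com", 443)  # DynamoDB regional HTTPS endpoint
```

Running the same probe from every node in the cluster helps confirm that all of them can reach the same database endpoint before HA mode is enabled.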
Step 2: Enable high availability mode
Configure HA mode on all gateway servers in the cluster:
- On each gateway server, edit the gateway configuration file
- Set the GATEWAY_CONNECT_SERVER_HA_ENABLED configuration variable to true
- Set the GATEWAY_APPLICATION_CLUSTER_ID configuration variable to your desired cluster identifier. All nodes must share the same cluster ID.
- Apply these settings to all controller nodes that will participate in the HA cluster
Step 3: Designate the primary node
Identify which gateway server should be the preferred active node:
- On your designated primary gateway server, set the GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY configuration variable to true. This ensures the primary node will maintain the connection to Gateway Manager when it is online and available.
- Leave this setting as false (or unset) on all other nodes in the cluster.
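The rules from Steps 2 and 3 can be checked mechanically before starting the cluster. The sketch below assumes you have already collected each node's settings into a dictionary keyed by node name. How you collect them, the helper name `validate_ha_settings`, and the case-insensitive parsing of `true` are illustrative assumptions, not part of the product.

```python
from typing import Dict, List

HA_ENABLED = "GATEWAY_CONNECT_SERVER_HA_ENABLED"
IS_PRIMARY = "GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY"
CLUSTER_ID = "GATEWAY_APPLICATION_CLUSTER_ID"


def _is_true(value: str) -> bool:
    # Assumption: the setting is written as the literal string "true".
    return str(value).strip().lower() == "true"


def validate_ha_settings(nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    for name, settings in nodes.items():
        if not _is_true(settings.get(HA_ENABLED, "")):
            problems.append(f"{name}: {HA_ENABLED} is not true")
        if not settings.get(CLUSTER_ID):
            problems.append(f"{name}: {CLUSTER_ID} is not set")

    # Every node must share the same cluster ID.
    cluster_ids = {s[CLUSTER_ID] for s in nodes.values() if s.get(CLUSTER_ID)}
    if len(cluster_ids) > 1:
        problems.append(f"nodes disagree on {CLUSTER_ID}: {sorted(cluster_ids)}")

    # Only one node should be flagged as primary.
    primaries = [n for n, s in nodes.items() if _is_true(s.get(IS_PRIMARY, ""))]
    if len(primaries) > 1:
        problems.append(f"more than one primary node: {sorted(primaries)}")
    return problems
```

For example, passing two nodes that both set `GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY` to `true` returns a "more than one primary node" entry, while a consistent cluster returns an empty list.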
Step 4: Configure SSL certificates
Each gateway server in the cluster requires its own SSL certificate configuration:
- On each gateway server, configure the following variables:
  - GATEWAY_CONNECT_CERTIFICATE_FILE: Path to the SSL certificate file
  - GATEWAY_CONNECT_PRIVATE_KEY_FILE: Path to the private key file
- Each gateway server can use its own unique certificate and key pair or they can share a common certificate
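Before starting a gateway server, it can help to confirm that the two files named by these variables exist, are readable, and belong together. The following is a minimal sketch using Python's standard `ssl` module. It assumes PEM-encoded files and an unencrypted private key, neither of which this guide specifies, and the function name `check_cert_pair` is hypothetical.

```python
import os
import ssl
from typing import List


def check_cert_pair(cert_path: str, key_path: str) -> List[str]:
    """Return a list of problems found with a certificate/key pair (empty if none)."""
    problems: List[str] = []
    for label, path in (("certificate", cert_path), ("private key", key_path)):
        if not os.path.isfile(path):
            problems.append(f"{label} file not found: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"{label} file not readable: {path}")
    if problems:
        return problems

    try:
        # load_cert_chain raises ssl.SSLError if the files are not valid PEM
        # or if the private key does not match the certificate.
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except (ssl.SSLError, OSError) as exc:
        problems.append(f"certificate and key could not be loaded together: {exc}")
    return problems
```

Run the check as the same operating system user that runs the gateway service, so that the readability result reflects the permissions the service will actually have.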
Step 5: Register the gateway cluster with Gateway Manager
Follow the procedures for Creating gateway clusters to add your HA cluster to Gateway Manager.
Configure all nodes in your HA deployment with the same GATEWAY_APPLICATION_CLUSTER_ID. The gateway cluster ID you provide when creating the gateway cluster in Gateway Manager must also match the GATEWAY_APPLICATION_CLUSTER_ID.
Configuring HA with Distributed Execution
For HA clusters that include runner nodes, follow the all-in-one HA configuration steps above for the gateway servers, then add runner nodes to the cluster.
Additional Steps for Runner Nodes
- Connect runner nodes to shared database: Configure each runner node to connect to the same database (etcd or Amazon DynamoDB) used by the gateway servers
- Configure cluster membership: Ensure runner nodes are configured with the same cluster ID as the gateway servers
- Verify connectivity: Confirm runner nodes can communicate with all gateway servers in the cluster
- Test execution delegation: Verify that any active core server can send execution requests to the runner nodes
For more detailed procedures, see Deploying distributed execution clusters.
Testing the HA Configuration
Verify Initial Setup
- Check cluster status: Review logs on all nodes to confirm they recognize each other
- Confirm leadership: Verify that only one node shows as active in the logs
- Test connectivity: Ensure the active node maintains connection to Gateway Manager
Test Failover Behavior
- Monitor standby nodes: Check logs to confirm standby nodes recognize the current active node
- Initiate controlled failover: Gracefully shut down the active node
- Observe leadership election: Watch for a standby node to become active
- Verify service continuity: Confirm the new active node connects to Gateway Manager
- Test service execution: Run automation tasks to ensure functionality
Sample Failover Log Sequence
During a successful failover, you should observe log entries similar to:
Active node before shutdown:
active-gateway | 2024-12-23T18:03:59Z INF connected to gateway manager at my-itential-cloud-server-ip:443
active-gateway | 2024-12-23T18:04:22Z INF got signal for shutdown....terminated
Standby node taking over:
standby-gateway | 2024-12-23T18:04:19Z DBG this core node with Id of 'xxx' is not the active node
standby-gateway | 2024-12-23T18:04:24Z INF node xxx is elected as the leader. About to start gateway manager...
standby-gateway | 2024-12-23T18:04:24Z INF connected to gateway manager at my-itential-cloud-server
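To measure how long a failover took, the log sequence above can be scanned for the shutdown and election messages. The sketch below assumes the `source | timestamp LEVEL message` layout and the message wording shown in the sample. Actual log text may differ between versions, so treat the two marker strings as assumptions to adjust. The function name `failover_gap_seconds` is hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Iterable, Optional

# Layout assumed from the sample: "<source> | <timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<source>\S+)\s*\|\s*(?P<ts>\S+)\s+(?P<level>[A-Z]{3})\s+(?P<msg>.*)$")
SHUTDOWN_MARKER = "got signal for shutdown"    # wording taken from the sample log
ELECTED_MARKER = "is elected as the leader"    # wording taken from the sample log


def parse_timestamp(text: str) -> datetime:
    """Parse timestamps such as 2024-12-23T18:04:22Z as UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def failover_gap_seconds(lines: Iterable[str]) -> Optional[float]:
    """Seconds between the first shutdown and the next leader election, or None."""
    shutdowns, elections = [], []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        try:
            ts = parse_timestamp(match["ts"])
        except ValueError:
            continue  # skip lines whose timestamp is in another format
        if SHUTDOWN_MARKER in match["msg"]:
            shutdowns.append(ts)
        elif ELECTED_MARKER in match["msg"]:
            elections.append(ts)
    if not shutdowns:
        return None
    shutdown_at = min(shutdowns)
    later = [t for t in elections if t >= shutdown_at]
    return (min(later) - shutdown_at).total_seconds() if later else None
```

Applied to the sample lines above, the function returns 2.0, the gap between the shutdown at 18:04:22 and the election at 18:04:24. Because the timestamps come from different machines, clock skew between nodes affects the result.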
Troubleshooting Common Issues
Nodes Not Recognizing Each Other
- Verify all nodes connect to the same shared database (etcd or Amazon DynamoDB)
- Check network connectivity between nodes
- For etcd: Confirm etcd service is running and accessible
- For Amazon DynamoDB: Verify AWS credentials and permissions are correctly configured
Multiple Active Nodes
- Ensure only one node has GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY set to true
- Check for network partitions affecting database communication
- Verify database consistency and accessibility
Failover Not Occurring
- Confirm GATEWAY_CONNECT_SERVER_HA_ENABLED is set to true on all nodes
- Check shared database connectivity from standby nodes
- Review firewall rules and network connectivity
- For Amazon DynamoDB: Verify AWS permissions allow read/write access
SSL Certificate Issues
- Verify certificate file paths are correct
- Ensure gateway service has read access to certificate files
- Check certificate validity and expiration dates
Post-Deployment Monitoring
After successfully deploying your HA cluster:
- Set up monitoring: Implement monitoring for all cluster nodes
- Review logs regularly: Monitor logs for any connectivity or leadership issues
- Test failover periodically: Regularly test failover procedures to ensure reliability
- Update documentation: Document your specific cluster configuration and any customizations
- Plan maintenance: Develop procedures for updating and maintaining the HA cluster