Deploying high availability clusters
  • 05 Jun 2025

Article summary

High Availability (HA) IAG deployments ensure continuous service availability by maintaining one active gateway server with multiple standby nodes ready to take over in case of failure. This guide covers the procedures for configuring both all-in-one HA clusters and HA clusters with distributed execution.

Prerequisites

Before configuring an HA cluster, ensure you have:

  • Multiple IAG server nodes installed and ready for configuration
  • Access to configure a shared database (etcd or Amazon DynamoDB)
  • Valid SSL certificates for each node
  • Administrative access to Gateway Manager

Configuring All-in-One HA Clusters

Step 1: Configure the shared database

All nodes in the HA cluster must connect to the same shared database to coordinate leadership and share cluster state. You can use either etcd or Amazon DynamoDB, depending on your infrastructure and requirements.

etcd

  1. Set up your etcd database following the etcd database configuration procedures
  2. Ensure all gateway servers can connect to the etcd instance
  3. Verify network connectivity and firewall rules allow etcd communication
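As a rough illustration of the connectivity check in step 3, the sketch below attempts a plain TCP connection from a gateway server to the database endpoint. The function name `can_reach` and the host name in the comment are hypothetical; 2379 is etcd's default client port, so adjust it if your deployment uses a different one. A successful TCP connection shows only that the port is open through the firewall, not that etcd is healthy or that authentication will succeed.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (hypothetical host name, run on each gateway server):
#   can_reach("etcd.example.internal", 2379)
```

Run the same check from every gateway server, because firewall rules can differ per node.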

Amazon DynamoDB

  1. Set up your DynamoDB database following the Amazon DynamoDB table configuration procedures
  2. Configure appropriate AWS credentials and permissions for all gateway servers
  3. Verify network connectivity and AWS access from all nodes

Step 2: Enable high availability mode

Configure HA mode on all gateway servers in the cluster:

  1. On each gateway server, edit the gateway configuration file
  2. Set the GATEWAY_CONNECT_SERVER_HA_ENABLED configuration variable to true
  3. Set GATEWAY_APPLICATION_CLUSTER_ID to your desired cluster identifier. All nodes must share the same cluster ID.
  4. These settings must be applied to all controller nodes that will participate in the HA cluster
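One way to catch mismatched settings before starting the cluster is to compare the values collected from each node. The following is a minimal sketch, not part of the product: the function `check_ha_settings` and the way the settings are gathered into dictionaries are assumptions, and it assumes the boolean is written as the string `true`.

```python
def check_ha_settings(nodes: dict[str, dict[str, str]]) -> list[str]:
    """Return a list of problems found in per-node HA settings.

    `nodes` maps a node name to the configuration variables read from
    that node (how you collect them is up to you).
    """
    problems = []
    cluster_ids = set()
    for name, env in nodes.items():
        # Assumption: the value is the literal string "true" (any case).
        if env.get("GATEWAY_CONNECT_SERVER_HA_ENABLED", "").lower() != "true":
            problems.append(f"{name}: GATEWAY_CONNECT_SERVER_HA_ENABLED is not true")
        cluster_id = env.get("GATEWAY_APPLICATION_CLUSTER_ID", "")
        if not cluster_id:
            problems.append(f"{name}: GATEWAY_APPLICATION_CLUSTER_ID is not set")
        else:
            cluster_ids.add(cluster_id)
    if len(cluster_ids) > 1:
        problems.append(f"cluster IDs differ across nodes: {sorted(cluster_ids)}")
    return problems
```

An empty result means every node has HA enabled and all nodes report the same cluster ID.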

Step 3: Designate the primary node

Identify which gateway server should be the preferred active node:

  1. On your designated primary gateway server, set the GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY configuration variable to true. This ensures the primary node will maintain the connection to Gateway Manager when it is online and available.
  2. Leave this setting as false (or unset) on all other nodes in the cluster.
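The rule above (exactly one primary, all other nodes false or unset) can be checked with a short script. This is an illustrative sketch only: `find_primaries` is a hypothetical helper, the per-node dictionaries are assumed to hold the values you read from each node's configuration, and the boolean is assumed to be written as the string `true`.

```python
def find_primaries(nodes: dict[str, dict[str, str]]) -> list[str]:
    """Return the names of nodes whose IS_PRIMARY setting is true.

    An unset variable is treated as false, matching the step above.
    """
    return [
        name
        for name, env in nodes.items()
        if env.get("GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY", "false").lower() == "true"
    ]


# Example with hypothetical node names:
nodes = {
    "gateway-a": {"GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY": "true"},
    "gateway-b": {"GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY": "false"},
    "gateway-c": {},  # unset counts as false
}
primaries = find_primaries(nodes)
if len(primaries) != 1:
    print(f"expected exactly one primary, found {len(primaries)}: {primaries}")
```

If the list contains more than one name, correct the configuration before starting the nodes.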

Step 4: Configure SSL certificates

Each gateway server in the cluster requires its own SSL certificate configuration:

  • On each gateway server, configure the following variables:
    • GATEWAY_CONNECT_CERTIFICATE_FILE: Path to the SSL certificate file
    • GATEWAY_CONNECT_PRIVATE_KEY_FILE: Path to the private key file
  • Each gateway server can use its own certificate and key pair, or all servers can share a common certificate
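Before starting a node, it can help to confirm that both paths point to files the gateway service account can read. The sketch below is a hypothetical pre-flight check; `check_cert_paths` is not a product function, and it checks only existence and read permission, not certificate validity or expiration.

```python
import os

CERT_VARS = (
    "GATEWAY_CONNECT_CERTIFICATE_FILE",
    "GATEWAY_CONNECT_PRIVATE_KEY_FILE",
)


def check_cert_paths(env) -> list[str]:
    """Return a list of problems with the certificate and key paths in env."""
    problems = []
    for var in CERT_VARS:
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isfile(path):
            problems.append(f"{var}: {path} is not a file")
        elif not os.access(path, os.R_OK):
            # Read access is evaluated for the user running this script.
            problems.append(f"{var}: {path} is not readable")
    return problems
```

Run it as the same user that runs the gateway service, passing a mapping of the two variables (for example `os.environ`, if you set them as environment variables), so that the permission check reflects what the service will see.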

Step 5: Register the gateway cluster with Gateway Manager

Follow the procedures for Creating gateway clusters to add your HA cluster to Gateway Manager.

Note

Configure all nodes in your HA deployment with the same GATEWAY_APPLICATION_CLUSTER_ID. The gateway cluster ID you provide when creating the gateway cluster in Gateway Manager must also match the GATEWAY_APPLICATION_CLUSTER_ID.

Configuring HA with Distributed Execution

For HA clusters that include runner nodes, follow the all-in-one HA configuration steps above for the gateway servers, then add runner nodes to the cluster.

Additional Steps for Runner Nodes

  1. Connect runner nodes to shared database: Configure each runner node to connect to the same database (etcd or Amazon DynamoDB) used by the gateway servers
  2. Configure cluster membership: Ensure runner nodes are configured with the same cluster ID as the gateway servers
  3. Verify connectivity: Confirm runner nodes can communicate with all gateway servers in the cluster
  4. Test execution delegation: Verify that any active core server can send execution requests to the runner nodes

For more detailed procedures, see Deploying distributed execution clusters.

Testing the HA Configuration

Verify Initial Setup

  • Check cluster status: Review logs on all nodes to confirm they recognize each other
  • Confirm leadership: Verify that only one node shows as active in the logs
  • Test connectivity: Ensure the active node maintains connection to Gateway Manager

Test Failover Behavior

  • Monitor standby nodes: Check logs to confirm standby nodes recognize the current active node
  • Initiate controlled failover: Gracefully shut down the active node
  • Observe leadership election: Watch for a standby node to become active
  • Verify service continuity: Confirm the new active node connects to Gateway Manager
  • Test service execution: Run automation tasks to ensure functionality

Sample Failover Log Sequence

During a successful failover, you should observe log entries similar to:

Active node before shutdown:

active-gateway | 2024-12-23T18:03:59Z INF connected to gateway manager at my-itential-cloud-server-ip:443
active-gateway | 2024-12-23T18:04:22Z INF got signal for shutdown....terminated

Standby node taking over:

standby-gateway | 2024-12-23T18:04:19Z DBG this core node with Id of 'xxx' is not the active node
standby-gateway | 2024-12-23T18:04:24Z INF node xxx is elected as the leader. About to start gateway manager...
standby-gateway | 2024-12-23T18:04:24Z INF connected to gateway manager at my-itential-cloud-server
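To estimate how long a failover took, you can compare the shutdown timestamp on the old active node with the election timestamp on the new one. The sketch below parses lines in the format shown above; the regular expression, the function names, and the message substrings it matches are assumptions based only on this sample, so adjust them if your log output differs.

```python
import re
from datetime import datetime

# Matches "<source> | <timestamp> <LEVEL> <message>" as in the sample above.
LINE = re.compile(r"^(?P<node>\S+)\s*\|\s*(?P<ts>\S+)\s+(?P<level>[A-Z]{3})\s+(?P<msg>.*)$")


def parse(line: str):
    """Return (node, timestamp, level, message), or None if the line does not match."""
    m = LINE.match(line.strip())
    if not m:
        return None
    try:
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return m["node"], ts, m["level"], m["msg"]


def failover_gap_seconds(lines):
    """Seconds between the shutdown signal and the leader election, or None."""
    shutdown = elected = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        _, ts, _, msg = parsed
        if "got signal for shutdown" in msg:
            shutdown = ts
        elif "is elected as the leader" in msg:
            elected = ts
    if shutdown is None or elected is None:
        return None
    return (elected - shutdown).total_seconds()


sample = [
    "active-gateway | 2024-12-23T18:04:22Z INF got signal for shutdown....terminated",
    "standby-gateway | 2024-12-23T18:04:19Z DBG this core node with Id of 'xxx' is not the active node",
    "standby-gateway | 2024-12-23T18:04:24Z INF node xxx is elected as the leader. About to start gateway manager...",
]
print(failover_gap_seconds(sample))  # prints 2.0
```

The result is meaningful only if the clocks on both nodes are synchronized.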

Troubleshooting Common Issues

Nodes Not Recognizing Each Other

  • Verify all nodes connect to the same shared database (etcd or Amazon DynamoDB)
  • Check network connectivity between nodes
  • For etcd: Confirm etcd service is running and accessible
  • For Amazon DynamoDB: Verify AWS credentials and permissions are correctly configured

Multiple Active Nodes

  • Ensure only one node has GATEWAY_CONNECT_SERVER_HA_IS_PRIMARY set to true
  • Check for network partitions affecting database communication
  • Verify database consistency and accessibility

Failover Not Occurring

  • Confirm GATEWAY_CONNECT_SERVER_HA_ENABLED is set to true on all nodes
  • Check shared database connectivity from standby nodes
  • Review firewall rules and network connectivity
  • For Amazon DynamoDB: Verify AWS permissions allow read/write access

SSL Certificate Issues

  • Verify certificate file paths are correct
  • Ensure gateway service has read access to certificate files
  • Check certificate validity and expiration dates

Post-Deployment Monitoring

After successfully deploying your HA cluster:

  1. Set up monitoring: Implement monitoring for all cluster nodes
  2. Review logs regularly: Monitor logs for any connectivity or leadership issues
  3. Test failover periodically: Regularly test failover procedures to ensure reliability
  4. Update documentation: Document your specific cluster configuration and any customizations
  5. Plan maintenance: Develop procedures for updating and maintaining the HA cluster
