High Availability (HA) Configuration
Updated on 09 Jan 2025
Automation Services can be deployed in a High Availability (HA) configuration with one active node and multiple standby nodes. The active node is the primary node responsible for handling all requests. The standby nodes are kept in a "hot standby" state, meaning they are ready to take over at any time if the active node fails.
Simple Active/Standby Deployment
A simple active/standby deployment can be configured to resemble the following:
Figure 1
Only one node can be active at a time; the active node is the node that is connected to Itential Cloud. The gateway nodes communicate with each other through the etcd database to determine whether the active node is both running and able to connect to Itential Cloud. If the active node fails, a standby node takes over and becomes the active node.
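Because leader election depends on the shared etcd database, it can be worth confirming that every gateway node can reach the same etcd cluster before relying on fail-over. The following is a minimal sketch using the standard etcdctl client; the endpoint addresses are placeholders for your own etcd hosts.
# Show the members of the etcd cluster that all gateway nodes should share.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 member list
# Confirm that each etcd endpoint is healthy and reachable from this gateway node.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 endpoint health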
Active/Standby Core Node Deployment
An active/standby deployment with core nodes and distributed runner nodes can be configured to resemble the following:
Figure 2
This concept for a highly available deployment is separate from having execution-only "runner" nodes as described in the Distributed Service Execution Guide. Every core node in the cluster is capable of sending execution requests to one of the runner nodes because they all share the same etcd database.
How to Configure HA Node Cluster Deployments
To configure an active/standby deployment, the following steps are required:
- Ensure all nodes are connected to the same etcd database. More information on configuring an etcd database can be found here.
- Enable the active/standby mode in the gateway configuration file. This is done by setting the GATEWAY_COMMANDER_SERVER_HA_ENABLED configuration variable to true on all core nodes in the cluster.
- Set the GATEWAY_COMMANDER_SERVER_HA_IS_PRIMARY configuration variable to true on the active node. This ensures the active node will always maintain a connection to Itential Cloud when it is online.
- Add the cluster to Itential Cloud if it is not already configured. The entire cluster will be treated as a single gateway in the UI, as shown in Figure 3 below.
- Ensure that every active and standby node has a proper certificate key pair configured via the GATEWAY_COMMANDER_CERTIFICATE_FILE and GATEWAY_COMMANDER_PRIVATE_KEY_FILE configuration variables. Every core gateway can have its own key pair, as shown in Figure 4 below. A sketch of an example configuration is shown after this list.
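As a loose sketch, the variables above might be set as follows on an active node and a standby node. This assumes the gateway reads its configuration from environment variables or an env file; the certificate paths are placeholders, and whether a standby node should explicitly set GATEWAY_COMMANDER_SERVER_HA_IS_PRIMARY to false (rather than simply omit it) should be confirmed for your version.
# Active (primary) core node (example values only)
GATEWAY_COMMANDER_SERVER_HA_ENABLED=true
GATEWAY_COMMANDER_SERVER_HA_IS_PRIMARY=true
GATEWAY_COMMANDER_CERTIFICATE_FILE=/etc/gateway/certs/active-gateway.crt
GATEWAY_COMMANDER_PRIVATE_KEY_FILE=/etc/gateway/certs/active-gateway.key
# Standby core node (example values only)
GATEWAY_COMMANDER_SERVER_HA_ENABLED=true
GATEWAY_COMMANDER_SERVER_HA_IS_PRIMARY=false
GATEWAY_COMMANDER_CERTIFICATE_FILE=/etc/gateway/certs/standby-gateway.crt
GATEWAY_COMMANDER_PRIVATE_KEY_FILE=/etc/gateway/certs/standby-gateway.key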
And that's it! The cluster is now configured in active/standby mode. You can now test fail-over by stopping the active node and observing the standby node take over via the logs on both servers.
Figure 3
Figure 4
Fail-Over Log Example
An example subset of the logs from the active node and a standby node when the active node is shut down is shown below to demonstrate how fail-over works.
active-gateway | 2024-12-23T18:03:59Z INF connected to commander at my-itential-cloud-server-ip:443
standby-gateway | 2024-12-23T18:04:19Z DBG this core node with Id of '717d13b92073_0193f4b0-ac05-7b71-b14b-d418048bd729' is not the active node. The current active core node is '6fe81873bab5_0193f4b0-ab55-70f3-8aca-cd0b7203a7f9'
active-gateway | 2024-12-23T18:04:22Z INF got signal for shutdown....terminated
active-gateway | 2024-12-23T18:04:22Z INF received shutdown signal
active-gateway exited with code 0
standby-gateway | 2024-12-23T18:04:24Z INF node 717d13b92073_0193f4b0-ac05-7b71-b14b-d418048bd729 is elected as the leader. About to start commander connection...
standby-gateway | 2024-12-23T18:04:24Z INF creating connection to commander at 'my-itential-cloud-server.itential.io:443'
standby-gateway | 2024-12-23T18:04:24Z INF attempting to connect to wss://my-itential-cloud-server.itential.io:443/ws
standby-gateway | 2024-12-23T18:04:24Z INF connected to commander at my-itential-cloud-server-ip:443
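The fail-over shown above can be reproduced manually. As a sketch, assuming a docker-compose-style deployment with service names matching the log prefixes above (active-gateway and standby-gateway), commands like the following would stop the active node and follow the standby node's logs while it takes over:
# Stop the currently active core node to trigger fail-over.
docker compose stop active-gateway
# Watch the standby node's logs for the leader-election and commander-connection messages.
docker compose logs -f standby-gateway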