High Availability (HA) Configuration
Updated on 09 Jan 2025
Automation Services can be deployed in a High Availability (HA) configuration with one active node and multiple standby nodes. The active node is the primary node responsible for handling all requests. The standby nodes are kept in a "hot standby" state, meaning they are ready to take over at any time if the active node fails.
Simple Active/Standby Deployment
A simple active/standby deployment can be configured to resemble the following:
Figure 1
Only one node can be active at a time; the active node is the node that is connected to Itential Cloud. The gateway nodes communicate with each other through the etcd database to determine whether the active node is both running and able to connect to Itential Cloud. If the active node fails, a standby node takes over and becomes the active node.
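Because leader election depends on the shared etcd database, it can be worth confirming that every gateway node can reach the same etcd cluster before relying on fail-over. The following is a minimal sketch using the standard etcdctl client; the endpoint addresses are placeholders for your own etcd hosts.
# Show the members of the etcd cluster that all gateway nodes should share.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 member list
# Confirm that each etcd endpoint is healthy and reachable from this gateway node.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 endpoint health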
Active/Standby Core Node Deployment
An active/standby deployment with core nodes and distributed runner nodes can be configured to resemble the following:
Figure 2
This concept for a highly available deployment is separate from having execution-only "runner" nodes as described in the Distributed Service Execution Guide. Every core node in the cluster is capable of sending execution requests to one of the runner nodes because they all share the same etcd database.
How to Configure HA Node Cluster Deployments
To configure an active/standby deployment, the following steps are required:
- Ensure all nodes are connected to the same etcd database. More information on configuring an etcd database can be found here.
- Enable the active/standby mode in the gateway configuration file. This is done by setting the GATEWAY_COMMANDER_SERVER_HA_ENABLED configuration variable to true on all core nodes in the cluster.
- Set the GATEWAY_COMMANDER_SERVER_HA_IS_PRIMARY configuration variable to true on the active node. This ensures the active node will always maintain a connection to Itential Cloud when it is online.
- Add the cluster to Itential Cloud if it is not already configured. The entire cluster will be treated as a single gateway in the UI, as shown in Figure 3 below.
- Ensure that every active and standby node has a proper certificate key pair configured via the GATEWAY_COMMANDER_CERTIFICATE_FILE and GATEWAY_COMMANDER_PRIVATE_KEY_FILE configuration variables. Every core gateway can have its own key pair, as shown in Figure 4 below. A sketch of an example configuration is shown after this list.
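As a loose sketch, the variables above might be set as follows on an active node and a standby node. This assumes the gateway reads its configuration from environment variables or an env file; the certificate paths are placeholders, and whether a standby node should explicitly set GATEWAY_COMMANDER_SERVER_HA_IS_PRIMARY to false (rather than simply omit it) should be confirmed for your version.
# Active (primary) core node (example values only)
GATEWAY_COMMANDER_SERVER_HA_ENABLED=true
GATEWAY_COMMANDER_SERVER_HA_IS_PRIMARY=true
GATEWAY_COMMANDER_CERTIFICATE_FILE=/etc/gateway/certs/active-gateway.crt
GATEWAY_COMMANDER_PRIVATE_KEY_FILE=/etc/gateway/certs/active-gateway.key
# Standby core node (example values only)
GATEWAY_COMMANDER_SERVER_HA_ENABLED=true
GATEWAY_COMMANDER_SERVER_HA_IS_PRIMARY=false
GATEWAY_COMMANDER_CERTIFICATE_FILE=/etc/gateway/certs/standby-gateway.crt
GATEWAY_COMMANDER_PRIVATE_KEY_FILE=/etc/gateway/certs/standby-gateway.key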
And that's it! The cluster is now configured in active/standby mode. You can now test fail-over by stopping the active node and observing the standby node take over via the logs on both servers.
Figure 3
Figure 4
Fail-Over Log Example
An example subset of the logs from the active node and a standby node when the active node is shut down is shown below to demonstrate how fail-over works.
active-gateway | 2024-12-23T18:03:59Z INF connected to commander at my-itential-cloud-server-ip:443
standby-gateway | 2024-12-23T18:04:19Z DBG this core node with Id of '717d13b92073_0193f4b0-ac05-7b71-b14b-d418048bd729' is not the active node. The current active core node is '6fe81873bab5_0193f4b0-ab55-70f3-8aca-cd0b7203a7f9'
active-gateway | 2024-12-23T18:04:22Z INF got signal for shutdown....terminated
active-gateway | 2024-12-23T18:04:22Z INF received shutdown signal
active-gateway exited with code 0
standby-gateway | 2024-12-23T18:04:24Z INF node 717d13b92073_0193f4b0-ac05-7b71-b14b-d418048bd729 is elected as the leader. About to start commander connection...
standby-gateway | 2024-12-23T18:04:24Z INF creating connection to commander at 'my-itential-cloud-server.itential.io:443'
standby-gateway | 2024-12-23T18:04:24Z INF attempting to connect to wss://my-itential-cloud-server.itential.io:443/ws
standby-gateway | 2024-12-23T18:04:24Z INF connected to commander at my-itential-cloud-server-ip:443
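The fail-over shown above can be reproduced manually. As a sketch, assuming a docker-compose-style deployment with service names matching the log prefixes above (active-gateway and standby-gateway), commands like the following would stop the active node and follow the standby node's logs while it takes over:
# Stop the currently active core node to trigger fail-over.
docker compose stop active-gateway
# Watch the standby node's logs for the leader-election and commander-connection messages.
docker compose logs -f standby-gateway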