ctrl-exec - High Availability
Running redundant ctrl-exec instances with shared state
ctrl-exec is designed so that all persistent state lives on disk in known paths, and the ctrl-exec process itself holds no runtime state that cannot be reconstructed from those files. This property makes horizontal redundancy straightforward: any number of ctrl-exec instances sharing the same state files can serve requests interchangeably.
This document covers what state exists and where, approaches to replicating or sharing it, load balancing and failover patterns, and what HA does not protect against.
For installation and configuration reference, see REFERENCE.md. For CA and cert security guidance, see SECURITY.md and SECURITY-OPERATIONS.md.
What State Exists and Where
All ctrl-exec state is on the filesystem of the ctrl-exec host. There is no embedded database, no in-memory cluster state, and no daemon with persistent connections that must be preserved across restarts.
- /etc/ctrl-exec/ca.key: The CA private key. Root of trust for the entire deployment. This is the most sensitive file in the system: access to it allows issuing arbitrary agent certificates. Mode 0600, owned by root.
- /etc/ctrl-exec/ca.crt: The CA certificate. Distributed to agents at pairing time and used by both the ctrl-exec and all agents to verify peer certificates.
- /etc/ctrl-exec/ca.serial: The serial counter for cert issuance. Incremented on every signing operation (sign_csr in CA.pm). Must be consistent across all ctrl-exec instances: concurrent signing operations against different copies would produce duplicate serials.
- /etc/ctrl-exec/ctrl-exec.key: The ctrl-exec's own private key.
- /etc/ctrl-exec/ctrl-exec.crt: The ctrl-exec's TLS certificate, signed by the CA. Its serial number is the value agents store and compare on every /run, /ping, and /capabilities request. All ctrl-exec instances must present the same cert.
- /var/lib/ctrl-exec/agents/: The agent registry. One JSON file per paired agent (e.g. web-01.json). Contains hostname, IP, pairing timestamp, cert expiry, and serial tracking state. Read and written by pairing, renewal, rotation, and registry commands. Written atomically via rename.
- /var/lib/ctrl-exec/locks/: Concurrency lock files. One file per host--script pair, held via flock(2) for the duration of a dispatch. These are process-local to the ctrl-exec instance running the dispatch. They do not need to be shared across instances and should not be; see Active/Active below.
- /var/lib/ctrl-exec/runs/: Stored run results, written by ctrl-exec-api. Keyed by reqid, retained for 24 hours. Required only if GET /status/{reqid} is used. If result retrieval is not used, this directory does not affect correctness.
- /var/lib/ctrl-exec/pairing/: Pending pairing requests. Written when an agent submits a CSR and deleted on approval, denial, or stale expiry (10-minute timeout). Only required on whichever node is running pairing mode. Pairing mode should run on one node at a time.
- /var/lib/ctrl-exec/rotation.json: Cert rotation state (current serial, previous serial, rotation timestamp, overlap expiry, and per-agent serial tracking status). Written by rotate-cert and the internal check loop.
The paths that must be shared or replicated for active/passive or active/active operation are:
- /etc/ctrl-exec/: all CA and cert material
- /var/lib/ctrl-exec/agents/: agent registry
- /var/lib/ctrl-exec/rotation.json: rotation state
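Lock files are instance-local. Run results are also instance-local unless the API is load-balanced and GET /status/{reqid} must work on any node, in which case /var/lib/ctrl-exec/runs/ needs to be on a shared path.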
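A pre-flight check can confirm that a node actually holds the shared state listed above before it is put into service. The following is a minimal sketch, not part of ctrl-exec: the function name check_shared_state and its root-prefix argument are hypothetical, and it assumes every listed path already exists (rotation.json may be absent on a deployment that has never rotated).

```bash
# Sketch only: check_shared_state and its ROOT argument are hypothetical.
# ROOT is a path prefix so the check can run against a staging copy;
# pass "" on a real host so paths resolve to /etc and /var/lib.
check_shared_state() (
    root=${1:-}
    status=0
    for f in ca.key ca.crt ca.serial ctrl-exec.key ctrl-exec.crt; do
        if [ ! -f "$root/etc/ctrl-exec/$f" ]; then
            echo "missing: /etc/ctrl-exec/$f"
            status=1
        fi
    done
    if [ ! -d "$root/var/lib/ctrl-exec/agents" ]; then
        echo "missing: /var/lib/ctrl-exec/agents/"
        status=1
    fi
    if [ ! -f "$root/var/lib/ctrl-exec/rotation.json" ]; then
        echo "missing: /var/lib/ctrl-exec/rotation.json"
        status=1
    fi
    # find -perm 600 prints the path only when the mode is exactly 0600.
    key="$root/etc/ctrl-exec/ca.key"
    if [ -f "$key" ] && [ -z "$(find "$key" -prune -perm 600)" ]; then
        echo "ca.key mode is not 0600"
        status=1
    fi
    exit "$status"
)
```

Calling check_shared_state "" on a standby prints one line per problem and returns non-zero if anything is missing. It does not check file ownership or whether the content is current.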
Replication Approaches
Shared filesystem
The simplest approach for bare-metal or VM deployments is a shared
filesystem mounted at /etc/ctrl-exec and /var/lib/ctrl-exec on all
ctrl-exec hosts. Both NFS and DRBD (in primary/secondary or dual-primary
mode) work. All instances read and write the same files.
Considerations:
- Serial counter consistency: ca.serial is read and written on every cert signing. Under NFS, open-file locking is advisory and may not be respected across clients. Use DRBD with OCFS2 or GFS2 for cluster-safe locking if pairing operations run concurrently across nodes.
- Registry writes are atomic (rename), which is safe over NFS on the same subnet but not guaranteed over high-latency links.
- Lock files in /var/lib/ctrl-exec/locks/ should not be on the shared filesystem. Mount only the CA and registry paths; keep locks on local storage per instance.
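The last consideration can be checked mechanically by comparing the mount that holds the lock directory with the mount that holds the registry. This is a rough sketch with hypothetical function names. It only shows that the two directories are on different mounts, not that the lock directory is on local storage, and it assumes mount points contain no spaces.

```bash
# Sketch only: both function names are hypothetical.
# Print the mount point holding the given path (last column of `df -P`).
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Succeed only if the lock directory is on a different mount than the registry.
locks_on_separate_mount() {
    [ "$(mount_point_of "$1")" != "$(mount_point_of "$2")" ]
}

# Example call on a ctrl-exec host:
#   locks_on_separate_mount /var/lib/ctrl-exec/locks /var/lib/ctrl-exec/agents \
#       || echo "warning: locks share a mount with the registry" >&2
```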
Active/passive with rsync
For a cold-standby arrangement, rsync the state directories from the primary to the standby on a schedule or after each significant write:
# Run on primary after pairing or rotation events
rsync -az --delete /etc/ctrl-exec/ standby:/etc/ctrl-exec/
rsync -az --delete /var/lib/ctrl-exec/agents/ standby:/var/lib/ctrl-exec/agents/
rsync -az --delete /var/lib/ctrl-exec/rotation.json standby:/var/lib/ctrl-exec/rotation.json
RPO is the rsync interval. For low-frequency pairing environments (agents paired once and seldom changed), a 5-minute cron is sufficient. For fleets where pairing and rotation happen regularly, trigger rsync post-operation rather than on a schedule.
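To see whether the standby has drifted between syncs, each side can compute a digest of a state directory and the two values can be compared. The sketch below is an illustration rather than a ctrl-exec feature: state_digest is a hypothetical name, and it uses the POSIX cksum utility (a CRC), which is adequate for spotting accidental divergence but is not a cryptographic integrity check.

```bash
# Sketch only: state_digest is a hypothetical helper.
# Print one checksum line covering every file name and file content under DIR.
state_digest() (
    cd "$1" || exit 1
    # Checksum each file, sort so the result does not depend on find's
    # traversal order, then checksum the sorted listing itself.
    find . -type f -exec cksum {} + | LC_ALL=C sort | cksum
)
```

Running state_digest /var/lib/ctrl-exec/agents on the primary and on the standby should print identical lines when the registry is in sync. A mismatch means at least one file was added, removed, or changed since the last rsync.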
Transfer the CA key over an encrypted, host-authenticated channel only:
scp with known_hosts verification, not StrictHostKeyChecking=no.
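The same requirement applies when rsync carries the CA material, since rsync runs over ssh. A minimal sketch follows. StrictHostKeyChecking=yes and BatchMode=yes are standard OpenSSH client options; the function name and the RSYNC override variable are hypothetical and exist only so the command can be dry-run.

```bash
# Sketch only: sync_ca_material and the RSYNC variable are hypothetical.
# Setting RSYNC=echo prints the command instead of running it.
RSYNC=${RSYNC:-rsync}

# Push /etc/ctrl-exec/ to the named standby. ssh refuses to connect unless the
# standby's host key is already in known_hosts and matches, and never prompts.
sync_ca_material() {
    "$RSYNC" -az --delete \
        -e "ssh -o StrictHostKeyChecking=yes -o BatchMode=yes" \
        /etc/ctrl-exec/ "$1:/etc/ctrl-exec/"
}
```

With this in place, sync_ca_material standby fails rather than silently trusting a new or changed host key, so the standby's key has to be added to known_hosts out of band first.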
Object storage for the registry
The agent registry (/var/lib/ctrl-exec/agents/) is a directory of small
JSON files. In cloud environments, it can be stored in object storage (S3,
GCS, Azure Blob) and synced to local disk on each instance at startup and
after write operations. This is suitable when the fleet is managed from
ephemeral ctrl-exec instances (e.g. autoscaling groups) and a shared NFS
mount is inconvenient.
The CA material (/etc/ctrl-exec/) should not be in object storage — the
CA key must remain in a secrets manager or encrypted block volume with
audited access controls, not in a general-purpose object bucket.
Load Balancing
Port 7443 carries mTLS connections for /run, /ping, and /capabilities.
Each connection is self-contained: the agent authenticates the connecting
cert against the CA, verifies the ctrl-exec serial, processes the request,
and closes the connection. There is no session state that must be pinned to
a specific ctrl-exec instance.
Any TCP/L4 load balancer works for port 7443:
- HAProxy: L4 or L7 TCP proxy. Configure a backend pool of ctrl-exec hosts with health checks on port 7443. mTLS passthrough (L4 mode) requires no cert configuration on the load balancer.
- keepalived: Virtual IP failover using VRRP. The active ctrl-exec holds the VIP; on failure the VIP moves to the standby. Agents connect to the VIP address and are unaware of the failover. Suitable for two-node active/passive.
- DNS round-robin: Multiple A records for the ctrl-exec hostname. Agents resolve the name on each request. No dedicated load balancer required. Failover depends on DNS TTL and client retry behaviour; not suitable where sub-minute failover is required.
Port 7444 (pairing) and port 7445 (API) do not need to be load-balanced
in normal operation. Pairing mode runs on one node at a time. The API can
be load-balanced but result storage in /var/lib/ctrl-exec/runs/ must
be on a shared path if GET /status/{reqid} is expected to work regardless
of which node handled the original request.
Active/Passive Failover
In an active/passive setup, one ctrl-exec instance handles all traffic; the standby holds a replicated copy of all state and takes over when the primary fails.
Promotion procedure:
- Confirm the primary is unreachable (avoid split-brain — do not promote the standby while the primary may still be serving).
- Ensure the standby has a current copy of the state directories. If using rsync replication, trigger a final sync if the primary is still accessible, or accept the lag from the last scheduled sync.
- On the standby, start the ctrl-exec services:

      systemctl start ctrl-exec-api

- Move the virtual IP or update DNS to point at the standby.
Agents reconnect transparently on their next request. There is no re-pairing required. The standby presents the same ctrl-exec cert (same serial) as the primary — agents see no difference.
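The first step of the promotion procedure, confirming the primary is unreachable, can be guarded by requiring several consecutive failed probes before promoting. The helper below is a sketch with a hypothetical name. The probe command is supplied by the operator, and a probe that fails from the standby does not prove the primary is unreachable for agents, so this reduces the split-brain risk without eliminating it.

```bash
# Sketch only: primary_is_down is a hypothetical helper.
# Usage: primary_is_down ATTEMPTS DELAY_SECONDS PROBE_COMMAND [ARGS...]
# Succeeds only if the probe fails on every one of ATTEMPTS tries.
primary_is_down() (
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            exit 1    # the primary answered: do not promote
        fi
        i=$((i + 1))
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    exit 0
)

# Example, assuming a host named "primary" and that ping is a meaningful probe:
#   if primary_is_down 3 5 ping -c 1 primary; then
#       systemctl start ctrl-exec-api
#   fi
```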
If the standby was behind in registry state (new agents paired on the
primary after the last sync), those agents will be unknown to the newly
promoted node. They will still connect successfully on port 7443 (mTLS
trust is CA-based, not registry-based) but will not appear in
list-agents until the registry entry is recovered or the agent is
re-paired.
Active/Active
Multiple ctrl-exec instances serving port 7443 simultaneously is
supported for run and ping operations. All instances present the same
cert (same serial), share the same registry, and agents accept connections
from any of them.
- Concurrency locking: Lock files in /var/lib/ctrl-exec/locks/ are per-instance. An active/active setup does not provide cross-instance concurrency locks: two instances can dispatch the same script to the same agent at the same time. If concurrency control matters, either keep lock files on a shared filesystem with cluster-safe locking, or route all requests for a given agent through the same instance (consistent hashing at the load balancer).
- Pairing mode: Pairing mode should only run on one node at a time. The pairing queue in /var/lib/ctrl-exec/pairing/ is not designed for concurrent write access from multiple instances. Run pairing interactively on a designated node, or use approve and deny commands on the same node that accepted the request.
- Cert rotation: rotate-cert should be run on one node. It writes rotation.json, broadcasts the new serial to all agents, and updates per-agent status in the registry. Running it simultaneously from two nodes would produce a race on rotation.json and ca.serial. Schedule rotation as a maintenance operation on a designated node.
- Registry writes: Agent registry files are written atomically via rename. Concurrent writes from multiple instances to different agent files are safe. Concurrent writes to the same agent file (e.g. two renewals for the same agent) are last-write-wins, which is operationally harmless since the content converges.
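The write-then-rename pattern behind the registry's atomic writes looks roughly like the following. This illustrates the general technique and is not ctrl-exec's own code; atomic_write is a hypothetical helper. The temporary file is created in the target's directory so that mv is a rename within one filesystem, and a reader therefore sees either the old file or the new one, never a partial write.

```bash
# Sketch only: atomic_write is a hypothetical helper.
# Usage: some_command | atomic_write /path/to/target
atomic_write() (
    target=$1
    dir=$(dirname "$target")
    # Same directory as the target, so the final mv is a rename, not a copy.
    tmp=$(mktemp "$dir/.tmp.XXXXXX") || exit 1
    if cat > "$tmp" && mv -f "$tmp" "$target"; then
        exit 0
    fi
    rm -f "$tmp"    # leave no partial file behind on failure
    exit 1
)
```

Two instances writing the same agent file this way each replace it whole, which is the last-write-wins behaviour described above. Note that mktemp creates the file with mode 0600, so a real implementation would set whatever permissions the registry expects before the rename.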
Cert Rotation in an HA Setup
ctrl-exec cert rotation updates the serial stored on every agent. In an HA setup, all instances must present the new cert immediately after rotation — an instance still presenting the old cert will be rejected by agents that have already updated their stored serial.
Rotation procedure for HA:
- Run ctrl-exec rotate-cert on one designated node. This generates the new cert, writes it to /etc/ctrl-exec/ctrl-exec.crt and /etc/ctrl-exec/ctrl-exec.key, and broadcasts the new serial to all agents via update-ctrl-exec-serial.
- Sync the updated /etc/ctrl-exec/ to all other ctrl-exec instances immediately. All instances must reload their cert before any agent completes its serial update. In practice the broadcast takes seconds to minutes depending on fleet size; sync should complete before that window closes.
- Restart all ctrl-exec instances:

      systemctl restart ctrl-exec-api

  ctrl-exec-api reads its cert at startup. There is no live cert reload; a restart is required.
The update-ctrl-exec-serial script on each agent writes the new serial
and sends SIGHUP to the agent process. After SIGHUP, the agent will reject
connections from any ctrl-exec presenting the old serial. The overlap
window (cert_overlap_days, default 30 days) is the time allowed for
agents that were unreachable during the broadcast to reconnect and receive
the update — it is not a grace period for the ctrl-exec instances themselves.
All ctrl-exec instances must be updated before the first agent processes
its serial update.
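After the sync, it is worth confirming that every instance holds the same cert serial. The sketch below assumes the operator has already collected one "instance serial" pair per line, and serials_consistent is a hypothetical helper that flags any divergence. It checks the cert file on disk only. Because ctrl-exec-api reads its cert at startup, a matching file does not show that the running process has been restarted.

```bash
# Sketch only: serials_consistent is a hypothetical helper.
# Each instance's serial can be read with the standard OpenSSL command
#   openssl x509 -noout -serial -in /etc/ctrl-exec/ctrl-exec.crt
# which prints a line of the form serial=<hex>.
#
# Reads "instance serial" pairs on stdin. Succeeds if at most one distinct
# serial is seen; otherwise lists which instances hold which serial and fails.
serials_consistent() {
    awk '
        NF >= 2 { count[$2]++; hosts[$2] = hosts[$2] " " $1 }
        END {
            n = 0
            for (s in count) n++
            if (n > 1) {
                for (s in count) print "serial " s ":" hosts[s]
                exit 1
            }
        }
    '
}
```

For example, feeding it the two lines "ctrl-a 0A" and "ctrl-b 0B" (hypothetical instance names) reports both serials and returns non-zero, which indicates that one instance was missed by the sync.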
What HA Does Not Solve
- CA key compromise: An attacker with the CA key can issue valid agent certificates regardless of how many ctrl-exec instances exist. The CA is the single root of trust for the deployment. HA increases availability; it does not limit the blast radius of a CA key compromise. All instances share the same CA, so a compromise affects all of them equally. See SECURITY-OPERATIONS.md for the CA compromise recovery procedure.
- Cert serial consistency: All instances must present the same ctrl-exec cert. Divergence (one instance presenting an old cert) causes agents to reject that instance after a rotation. The replication and reload procedure must be treated as an atomic operation across the fleet.
- Pairing queue coordination: Pending pairing requests in /var/lib/ctrl-exec/pairing/ are not replicated in a standard rsync setup. A request submitted to one node's pairing mode cannot be approved on another. Run pairing on a single designated node.
- Agent cert revocation propagation: The revocation list on each agent (/etc/ctrl-exec-agent/revoked-serials) must be updated via a ctrl-exec run to each agent individually. HA on the ctrl-exec side does not change this: revocation state lives on the agents, not on the ctrl-exec. A ctrl-exec failover does not affect which certs agents will accept or reject.
- Split-brain: If two ctrl-exec instances both believe they are primary and both run rotate-cert simultaneously, the results are undefined. Use VRRP, distributed locking, or operational discipline to ensure rotation runs on exactly one node at a time.
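Where the state directories are on a shared filesystem, one lightweight form of the locking mentioned under split-brain is a lock directory that only one node can create. The sketch below uses a hypothetical helper, run_exclusive, and assumes the shared filesystem makes mkdir atomic across nodes, which should be verified for the filesystem in use. It is not a substitute for VRRP or a real distributed lock, and a node that crashes while holding the lock leaves it behind for manual cleanup.

```bash
# Sketch only: run_exclusive is a hypothetical helper.
# Usage: run_exclusive LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created; removes LOCKDIR afterwards.
run_exclusive() (
    lockdir=$1
    shift
    # mkdir fails if the directory already exists, so at most one caller wins.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "lock held: $lockdir" >&2
        exit 75    # distinct status: the command was never started
    fi
    status=0
    "$@" || status=$?
    rmdir "$lockdir"
    exit "$status"
)

# Example, with a hypothetical lock path on the shared mount:
#   run_exclusive /etc/ctrl-exec/.rotate.lock ctrl-exec rotate-cert
```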