ctrl-exec - High Availability
Running redundant dispatcher instances with shared state
ctrl-exec is designed so that all persistent state lives on disk in known paths, and the dispatcher process itself holds no runtime state that cannot be reconstructed from those files. This property makes horizontal redundancy straightforward: any number of dispatcher instances sharing the same state files can serve requests interchangeably.
This document covers what state exists and where, approaches to replicating or sharing it, load balancing and failover patterns, and what HA does not protect against.
For installation and configuration reference, see REFERENCE.md. For CA and cert security guidance, see SECURITY.md and SECURITY-OPERATIONS.md.
What State Exists and Where
All ctrl-exec state is on the filesystem of the dispatcher host. There is no embedded database, no in-memory cluster state, and no daemon with persistent connections that must be preserved across restarts.
- /etc/ctrl-exec/ca.key: The CA private key. Root of trust for the entire deployment. This is the most sensitive file in the system: access to it allows issuing arbitrary agent certificates. Mode 0600, owned by root.
- /etc/ctrl-exec/ca.crt: The CA certificate. Distributed to agents at pairing time and used by both the dispatcher and all agents to verify peer certificates.
- /etc/ctrl-exec/ca.serial: The serial counter for cert issuance. Incremented on every signing operation (sign_csr in CA.pm). Must be consistent across all dispatcher instances: concurrent signing operations against different copies would produce duplicate serials.
- /etc/ctrl-exec/dispatcher.key: The dispatcher's own private key.
- /etc/ctrl-exec/dispatcher.crt: The dispatcher's TLS certificate, signed by the CA. Its serial number is the value agents store as a key in their trusted-dispatcher map and check on every /run, /ping, /result, and /capabilities request. All instances of the same dispatcher must present the same cert. (As of 0.9.0 an agent's map may hold the serials of several distinct dispatchers; see Native Multi-Dispatcher below. Within one dispatcher's HA set, every instance still presents that dispatcher's single shared cert.)
- /var/lib/ctrl-exec/agents/: The agent registry. One JSON file per paired agent (e.g. web-01.json). Contains hostname, IP, pairing timestamp, cert expiry, and serial tracking state. Read and written by pairing, renewal, rotation, and registry commands. Written atomically via rename.
- /var/lib/ctrl-exec/locks/: Concurrency lock files. One file per host--script pair, held via flock(2) for the duration of a dispatch. These are process-local to the dispatcher instance running the dispatch. They do not need to be shared across instances and should not be; see Active/active below.
- /var/lib/ctrl-exec/runs/: Stored run results, written by ctrl-exec-api. Keyed by reqid, retained for 24 hours. Required only if GET /status/{reqid} is used. If result retrieval is not used, this directory does not affect correctness.
- /var/lib/ctrl-exec/pairing/: Pending pairing requests. Written when an agent submits a CSR and deleted on approval, denial, or stale expiry (10-minute timeout). Only required on whichever node is running pairing mode. Pairing mode should run on one node at a time.
- /var/lib/ctrl-exec/rotation.json: Cert rotation state: current serial, previous serial, rotation timestamp, overlap expiry, and per-agent serial tracking status. Written by rotate-cert and the internal check loop.
The paths that must be shared or replicated for active/passive or active/active operation are:
- /etc/ctrl-exec/: all CA and cert material
- /var/lib/ctrl-exec/agents/: agent registry
- /var/lib/ctrl-exec/rotation.json: rotation state
Lock files and run results are instance-local concerns.
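As a pre-flight check before adding an instance, a node can verify that the shared state paths listed above are present and that the CA key has the documented mode. The following is a minimal sketch, not part of ctrl-exec: the check_state function and the CTRL_EXEC_ETC and CTRL_EXEC_VAR override variables are hypothetical names, and stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch only: check_state and the two override variables are hypothetical
# helpers for illustration, not part of ctrl-exec.
CTRL_EXEC_ETC="${CTRL_EXEC_ETC:-/etc/ctrl-exec}"
CTRL_EXEC_VAR="${CTRL_EXEC_VAR:-/var/lib/ctrl-exec}"

check_state() {
    rc=0
    # Paths that must be shared or replicated across instances.
    # Whether rotation.json exists before the first rotation is not stated
    # in this document; drop it from the list if it does not.
    for p in \
        "$CTRL_EXEC_ETC/ca.key" \
        "$CTRL_EXEC_ETC/ca.crt" \
        "$CTRL_EXEC_ETC/ca.serial" \
        "$CTRL_EXEC_ETC/dispatcher.key" \
        "$CTRL_EXEC_ETC/dispatcher.crt" \
        "$CTRL_EXEC_VAR/agents" \
        "$CTRL_EXEC_VAR/rotation.json"
    do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            rc=1
        fi
    done
    # The CA key is documented as mode 0600; stat -c is GNU-specific.
    if [ -e "$CTRL_EXEC_ETC/ca.key" ]; then
        mode=$(stat -c '%a' "$CTRL_EXEC_ETC/ca.key")
        if [ "$mode" != "600" ]; then
            echo "ca.key has mode $mode, expected 600" >&2
            rc=1
        fi
    fi
    return "$rc"
}
```

Calling check_state on a standby after replication returns non-zero and names each problem on stderr if anything is missing or the key mode has drifted.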
Replication Approaches
Shared filesystem
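The simplest approach for bare-metal or VM deployments is shared storage
mounted at /etc/ctrl-exec and /var/lib/ctrl-exec on the dispatcher hosts.
NFS lets all instances mount the same files at once. DRBD in
primary/secondary mode suits active/passive, since only the primary can
mount the volume. DRBD in dual-primary mode lets all instances read and
write the same files, but requires a cluster filesystem such as OCFS2 or GFS2.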
Considerations:
- Serial counter consistency: ca.serial is read and written on every cert signing. Under NFS, open-file locking is advisory and may not be respected across clients. Use DRBD with OCFS2 or GFS2 for cluster-safe locking if pairing operations run concurrently across nodes.
- Registry writes are atomic (rename), which is safe over NFS on the same subnet but not guaranteed over high-latency links.
- Lock files in /var/lib/ctrl-exec/locks/ should not be on the shared filesystem. Mount only the CA and registry paths; keep locks on local storage per instance.
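The atomic-rename pattern referred to above can be illustrated in shell. This is a generic sketch of the technique, not ctrl-exec's own implementation, and the atomic_write function name is hypothetical. The temporary file is created in the destination directory so that the final mv is a rename within one filesystem.

```shell
#!/bin/sh
# Sketch of write-then-rename. atomic_write is a hypothetical helper,
# not a ctrl-exec command.
atomic_write() {
    dest=$1
    dir=$(dirname "$dest")
    # Create the temp file next to the destination so mv is a
    # same-filesystem rename rather than a copy.
    tmp=$(mktemp "$dir/.tmp.XXXXXX") || return 1
    # Content comes from stdin; on failure remove the partial temp file.
    if ! cat > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    mv -f "$tmp" "$dest"
}
```

With this pattern a reader on the same filesystem sees either the previous file or the complete new one. How strictly that holds across NFS clients depends on the server and on client caching, which is why the caveat about high-latency links applies.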
Active/passive with rsync
For a cold-standby arrangement, rsync the state directories from the primary to the standby on a schedule or after each significant write:
# Run on primary after pairing or rotation events
rsync -az --delete /etc/ctrl-exec/ standby:/etc/ctrl-exec/
rsync -az --delete /var/lib/ctrl-exec/agents/ standby:/var/lib/ctrl-exec/agents/
rsync -az --delete /var/lib/ctrl-exec/rotation.json standby:/var/lib/ctrl-exec/rotation.json
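The recovery point objective (RPO) equals the rsync interval: anything written on the primary since the last sync is missing on the standby after a failover. For low-frequency pairing environments (agents paired once and seldom changed), a 5-minute cron is sufficient. For fleets where pairing and rotation happen regularly, trigger rsync post-operation rather than on a schedule.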
Transfer the CA key over an encrypted, host-authenticated channel only:
for example scp or rsync over SSH with known_hosts verification, never with StrictHostKeyChecking=no.
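For the post-operation trigger described above, the three rsync commands can be wrapped in one function that stops at the first failure. This is a minimal sketch, assuming a standby reachable over SSH. The sync_state function, the STANDBY variable, and the RSYNC override (handy for dry runs) are hypothetical names.

```shell
#!/bin/sh
# Sketch only: sync_state, STANDBY and RSYNC are illustrative names,
# not part of ctrl-exec.
STANDBY="${STANDBY:-standby}"
RSYNC="${RSYNC:-rsync}"

sync_state() {
    # Directories keep a trailing slash so rsync copies their contents.
    for path in /etc/ctrl-exec/ /var/lib/ctrl-exec/agents/; do
        $RSYNC -az --delete "$path" "$STANDBY:$path" || return 1
    done
    # A single file needs no --delete.
    $RSYNC -az /var/lib/ctrl-exec/rotation.json \
        "$STANDBY:/var/lib/ctrl-exec/rotation.json" || return 1
}
```

Setting RSYNC="echo rsync" before calling sync_state prints the commands instead of running them, which makes it easy to review what would be transferred.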
Object storage for the registry
The agent registry (/var/lib/ctrl-exec/agents/) is a directory of small
JSON files. In cloud environments, it can be stored in object storage (S3,
GCS, Azure Blob) and synced to local disk on each instance at startup and
after write operations. This is suitable when the fleet is managed from
ephemeral dispatcher instances (e.g. autoscaling groups) and a shared NFS
mount is inconvenient.
The CA material (/etc/ctrl-exec/) should not be in object storage — the
CA key must remain in a secrets manager or encrypted block volume with
audited access controls, not in a general-purpose object bucket.
Load Balancing
Port 7443 carries mTLS connections for /run, /ping, /result, and
/capabilities. Each connection is self-contained: the agent authenticates
the connecting cert against the CA, checks the connecting serial is a key in
its trusted-dispatcher map, processes the request, and closes the connection.
There is no session state that must be pinned to a specific dispatcher
instance.
Any TCP/L4 load balancer works for port 7443:
- HAProxy: L4 or L7 TCP proxy. Configure a backend pool of dispatcher hosts with health checks on port 7443. mTLS passthrough (L4 mode) requires no cert configuration on the load balancer.
- keepalived: Virtual IP failover using VRRP. The active dispatcher holds the VIP; on failure the VIP moves to the standby. Agents connect to the VIP address and are unaware of the failover. Suitable for two-node active/passive.
- DNS round-robin: Multiple A records for the dispatcher hostname. Agents resolve the name on each request. No dedicated load balancer required. Failover depends on DNS TTL and client retry behaviour; not suitable where sub-minute failover is required.
Port 7444 (pairing) and port 7445 (API) do not need to be load-balanced
in normal operation. Pairing mode runs on one node at a time. The API can
be load-balanced but result storage in /var/lib/ctrl-exec/runs/ must
be on a shared path if GET /status/{reqid} is expected to work regardless
of which node handled the original request. (On the agent side, async results
are partitioned by owning dispatcher under
/var/lib/ctrl-exec-agent/runs/<dispatcher-id>/<reqid>.json, and
GET /result/<reqid> is owner-gated — see DEVELOPER.md. That is a per-agent
concern and is unaffected by dispatcher-side load balancing.)
Active/Passive Failover
In an active/passive setup, one dispatcher instance handles all traffic; the standby holds a replicated copy of all state and takes over when the primary fails.
Promotion procedure:
- Confirm the primary is unreachable. To avoid split-brain, do not promote the standby while the primary may still be serving.
- Ensure the standby has a current copy of the state directories. If using rsync replication, trigger a final sync if the primary is still accessible, or accept the lag from the last scheduled sync.
- On the standby, start the dispatcher services:
    systemctl start ctrl-exec-api
- Move the virtual IP or update DNS to point at the standby.
Agents reconnect transparently on their next request. There is no
re-pairing required. The standby presents the same dispatcher cert (same
serial, same dispatcher_id) as the primary — agents see no difference, and
the serial already in their trusted-dispatcher map continues to match.
If the standby was behind in registry state (new agents paired on the
primary after the last sync), those agents will be unknown to the newly
promoted node. They will still connect successfully on port 7443 (mTLS
trust is CA-based, not registry-based) but will not appear in
list-agents until the registry entry is recovered or the agent is
re-paired.
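The promotion procedure above can be guarded by a small script that refuses to start the standby while the primary still answers. This is a sketch under stated assumptions: PRIMARY_HOST, primary_is_up, and promote_standby are hypothetical names, and the ping-based check is only one possible reachability test. Deployments with proper fencing should rely on that instead.

```shell
#!/bin/sh
# Sketch only: PRIMARY_HOST, primary_is_up and promote_standby are
# hypothetical names, not ctrl-exec commands.
PRIMARY_HOST="${PRIMARY_HOST:-dispatcher-primary}"

primary_is_up() {
    # Conservative check: any ICMP reply counts as "up", even if the
    # dispatcher service itself has stopped. -W is the Linux iputils timeout.
    ping -c 1 -W 2 "$PRIMARY_HOST" >/dev/null 2>&1
}

promote_standby() {
    if primary_is_up; then
        echo "primary $PRIMARY_HOST still answers; refusing to promote" >&2
        return 1
    fi
    systemctl start ctrl-exec-api || return 1
    echo "dispatcher started; now move the VIP or update DNS"
}
```

The check errs on the side of not promoting: a primary whose host is up but whose service has died blocks promotion until an operator confirms the state and overrides the check by hand.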
Active/Active
Multiple dispatcher instances serving port 7443 simultaneously is
supported for run and ping operations. All instances present the same
cert (same serial), share the same registry, and agents accept connections
from any of them.
This active/active model is several instances of one dispatcher sharing a
single identity (one cert, one serial, one dispatcher_id) for redundancy.
It is distinct from native multi-dispatcher (0.9.0), where several distinct
dispatchers — each with its own identity, cert, and dispatcher_id — appear
as separate keys in an agent's trusted-dispatcher map. The two compose: a
single dispatcher in such a map may itself be run active/active. See Native
Multi-Dispatcher below.
- Concurrency locking: Lock files in /var/lib/ctrl-exec/locks/ are per-instance. An active/active setup does not provide cross-instance concurrency locks: two instances can dispatch the same script to the same agent at the same time. If concurrency control matters, either keep lock files on a shared filesystem with cluster-safe locking, or route all requests for a given agent through the same instance (consistent hashing at the load balancer).
- Pairing mode: Pairing mode should only run on one node at a time. The pairing queue in /var/lib/ctrl-exec/pairing/ is not designed for concurrent write access from multiple instances. Run pairing interactively on a designated node, or use approve and deny commands on the same node that accepted the request.
- Cert rotation: rotate-cert should be run on one node. It writes rotation.json, broadcasts the new serial to all agents, and updates per-agent status in the registry. Running it simultaneously from two nodes would produce a race on rotation.json and ca.serial. Schedule rotation as a maintenance operation on a designated node.
- Registry writes: Agent registry files are written atomically via rename. Concurrent writes from multiple instances to different agent files are safe. Concurrent writes to the same agent file (e.g. two renewals for the same agent) are last-write-wins, which is operationally harmless since the content converges.
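The behaviour of a per-pair lock, as mentioned under concurrency locking above, can be sketched with the flock(1) utility from util-linux. This is illustrative only: with_dispatch_lock and LOCK_DIR are hypothetical names, the exact host--script file naming is an assumption, and ctrl-exec takes its own locks internally. The sketch only provides cross-instance exclusion if LOCK_DIR sits on storage with cluster-safe locking.

```shell
#!/bin/sh
# Sketch only: with_dispatch_lock and LOCK_DIR are hypothetical names.
# Requires flock(1) from util-linux.
LOCK_DIR="${LOCK_DIR:-/var/lib/ctrl-exec/locks}"

with_dispatch_lock() {
    host=$1
    script=$2
    shift 2
    # One lock file per host/script pair; script names containing "/"
    # would need escaping first.
    lockfile="$LOCK_DIR/${host}--${script}"
    # -n: fail at once instead of waiting if another process holds the lock.
    # The lock is released when the wrapped command exits.
    flock -n "$lockfile" "$@"
}
```

A call takes the form with_dispatch_lock web-01 backup followed by the dispatch command. A second call for the same pair fails while the first is still running, and calls for other pairs are unaffected.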
Native Multi-Dispatcher
Native multi-dispatcher (0.9.0) lets a single agent serve more than one
distinct dispatcher. Each dispatcher presents its own cert, chaining to a
shared CA root, and each appears as its own line in the agent's
trusted-dispatcher map at /var/lib/ctrl-exec-agent/ctrl-exec-dispatchers — one
line per dispatcher, <hex-serial> <dispatcher-id>. The map lives in the
agent-writable state directory (not under /etc/ctrl-exec-agent, which holds
only secrets), so the agent can rewrite it in place during rotation. It
replaces the old single ctrl-exec-serial file and is reloaded on SIGHUP.
/run, /ping, /result, and /capabilities are accepted only if the
connecting cert's serial is a key in the map. The matched dispatcher's
identity (its dispatcher_id) then drives permission and attribution: agent
logs carry DISPATCHER=<id>, auth hooks receive ENVEXEC_DISPATCHER and
ENVEXEC_DISPATCHER_SERIAL, and the async result store is partitioned by
owner under /var/lib/ctrl-exec-agent/runs/<dispatcher-id>/<reqid>.json so a
run is returned only to the dispatcher that submitted it (others get 404).
Each dispatcher has a stable dispatcher_id, set via dispatcher_id in
ctrl-exec.conf and defaulting to the dispatcher's hostname. It is delivered
to the agent at pairing time. Pairing appends to the map rather than
replacing it, so pairing a second dispatcher to an agent leaves the first in
place.
This is the supported mechanism for running multiple operators of differing trust classes against one host. Running separate agent instances per operator remains possible as optional defence-in-depth, but is no longer the way to achieve multi-operator separation. Independent per-dispatcher CAs with SNI are out of scope; all dispatchers in an agent's map chain to one shared CA root.
- Relation to the HA models above: Active/passive and active/active concern redundant instances of one dispatcher sharing a single identity. Native multi-dispatcher concerns several distinct dispatcher identities trusted by one agent. They are orthogonal and may be combined: any dispatcher listed in an agent's map can itself be deployed active/active behind a load balancer or VIP.
Cert Rotation in an HA Setup
Dispatcher cert rotation changes the serial agents must trust. In an HA setup
all instances present the same cert, so they rotate together — they share one
cert, one serial, and one dispatcher_id. As of 0.9.0 rotation propagates to
each reachable agent automatically over the run channel, with no re-pairing,
via add-then-remove against the trusted-dispatcher map (see below).
- Seamless rotation (0.9.0): Rotation updates each agent's trusted-dispatcher map at /var/lib/ctrl-exec-agent/ctrl-exec-dispatchers automatically. The serial broadcast (broadcast_serial → built-in /rotate-serial) adds the new serial to the map against the dispatcher's stable dispatcher_id, while the old serial stays trusted through the overlap window, so the live cert (the old one before the instances restart, the new one after) is accepted throughout. After the overlap window the dispatcher broadcasts removal of the old serial (retire_previous_serial, logged as serial-retire), so the retired cert stops being trusted. Because dispatcher_id is stable across rotation, trust and attribution carry over, and the agent can rewrite its own map because it lives in the agent-writable state directory. Only an agent that was offline during the broadcast misses the new serial and, once the overlap window expires, is marked serial-stale and needs re-pairing.
Rotation procedure for HA:
- Run ced rotate-cert on one designated node. This generates the new cert, writes it to /etc/ctrl-exec/dispatcher.crt and /etc/ctrl-exec/dispatcher.key, and broadcasts the new serial to all agents via their built-in /rotate-serial operation.
- Sync the updated /etc/ctrl-exec/ to all other dispatcher instances. The old serial stays trusted through the overlap window, so an instance still presenting the old cert continues to be accepted while the sync and restarts roll through. There is no narrow window in which an unsynced instance is rejected. Complete the sync well before the overlap window expires.
- Restart all dispatcher instances:
    systemctl restart ctrl-exec-api
  ctrl-exec-api reads its cert at startup. There is no live cert reload, so a restart is required.
Each agent's built-in /rotate-serial handler adds the new serial to the
trusted-dispatcher map and sends SIGHUP to the agent process; the agent reloads
its map on SIGHUP and immediately trusts the rotated cert, with no re-pairing.
The overlap window (cert_overlap_days, default 30 days) is the time the old
serial remains trusted: it lets the dispatcher instances roll from the old cert
to the new at their own pace, and it is also the window in which agents that
were unreachable during the broadcast can reconnect and receive the update. An
agent still offline when the window expires is marked serial-stale and needs
re-pairing.
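After the sync step it can be useful to confirm that every instance holds the same dispatcher cert before restarting services. The following is a minimal sketch that compares SHA-256 digests of local copies. cert_digest and certs_match are hypothetical helpers, sha256sum assumes GNU coreutils, and fetching each instance's /etc/ctrl-exec/dispatcher.crt into a scratch directory (for example with scp) is left to the operator.

```shell
#!/bin/sh
# Sketch only: cert_digest and certs_match are hypothetical helpers,
# not ctrl-exec commands.
cert_digest() {
    # Print only the hex digest of the given file.
    sha256sum "$1" | cut -d' ' -f1
}

certs_match() {
    # Usage: certs_match REFERENCE_CERT OTHER_CERT...
    # Returns 0 only if every other file is byte-identical to the reference.
    ref=$(cert_digest "$1") || return 2
    [ -n "$ref" ] || return 2
    shift
    for f in "$@"; do
        if [ "$(cert_digest "$f")" != "$ref" ]; then
            echo "mismatch: $f" >&2
            return 1
        fi
    done
    return 0
}
```

Run on the designated rotation node, certs_match /etc/ctrl-exec/dispatcher.crt followed by the fetched copies reports the first instance that has not yet received the new cert.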
What HA Does Not Solve
- CA key compromise: An attacker with the CA key can issue valid agent certificates regardless of how many dispatcher instances exist. The CA is the single root of trust for the deployment. HA increases availability; it does not limit the blast radius of a CA key compromise. All instances share the same CA, so a compromise affects all of them equally. See SECURITY-OPERATIONS.md for the CA compromise recovery procedure.
- Cert serial consistency: All instances must converge on the new dispatcher cert during a rotation. The overlap window keeps the old serial trusted while instances roll over, so an instance briefly presenting the old cert is still accepted and divergence is not immediately fatal. But an instance left on the old cert past the overlap window, after the old serial has been retired, is rejected by every agent. The replication and reload procedure must complete well within the window.
- Pairing queue coordination: Pending pairing requests in /var/lib/ctrl-exec/pairing/ are not replicated in a standard rsync setup. A request submitted to one node's pairing mode cannot be approved on another. Run pairing on a single designated node.
- Agent cert revocation propagation: The revocation list on each agent (/etc/ctrl-exec-agent/revoked-serials) must be updated via a ced run to each agent individually. HA on the dispatcher side does not change this: revocation state lives on the agents, not on the dispatcher. A dispatcher failover does not affect which certs agents will accept or reject.
- Split-brain: If two dispatcher instances both believe they are primary and both run rotate-cert simultaneously, the results are undefined. Use VRRP, distributed locking, or operational discipline to ensure rotation runs on exactly one node at a time.