ctrl-exec - High Availability

ctrl-exec is designed so that all persistent state lives on disk in known paths, and the ctrl-exec process itself holds no runtime state that cannot be reconstructed from those files. This property makes horizontal redundancy straightforward: any number of ctrl-exec instances sharing the same state files can serve requests interchangeably.

This document covers what state exists and where, approaches to replicating or sharing it, load balancing and failover patterns, and what HA does not protect against.

For installation and configuration reference, see REFERENCE.md. For CA and cert security guidance, see SECURITY.md and SECURITY-OPERATIONS.md.

What State Exists and Where

All ctrl-exec state is on the filesystem of the ctrl-exec host. There is no embedded database, no in-memory cluster state, and no daemon with persistent connections that must be preserved across restarts.

/etc/ctrl-exec/ca.key
The CA private key. Root of trust for the entire deployment. This is the most sensitive file in the system — access to it allows issuing arbitrary agent certificates. Mode 0600, owned by root.
/etc/ctrl-exec/ca.crt
The CA certificate. Distributed to agents at pairing time and used by both the ctrl-exec and all agents to verify peer certificates.
/etc/ctrl-exec/ca.serial
The serial counter for cert issuance. Incremented on every signing operation (sign_csr in CA.pm). Must be consistent across all ctrl-exec instances — concurrent signing operations against different copies would produce duplicate serials.
/etc/ctrl-exec/ctrl-exec.key
The ctrl-exec's own private key.
/etc/ctrl-exec/ctrl-exec.crt
The ctrl-exec's TLS certificate, signed by the CA. Its serial number is the value agents store and compare on every /run, /ping, and /capabilities request. All ctrl-exec instances must present the same cert.
/var/lib/ctrl-exec/agents/
The agent registry. One JSON file per paired agent (e.g. web-01.json). Contains hostname, IP, pairing timestamp, cert expiry, and serial tracking state. Read and written by pairing, renewal, rotation, and registry commands. Written atomically via rename.
/var/lib/ctrl-exec/locks/
Concurrency lock files. One file per host--script pair, held via flock(2) for the duration of a dispatch. These are process-local to the ctrl-exec instance running the dispatch. They do not need to be shared across instances and should not be — see Active/active below.
/var/lib/ctrl-exec/runs/
Stored run results, written by ctrl-exec-api. Keyed by reqid, retained for 24 hours. Required only if GET /status/{reqid} is used. If result retrieval is not used, this directory does not affect correctness.
/var/lib/ctrl-exec/pairing/
Pending pairing requests. Written when an agent submits a CSR and deleted on approval, denial, or stale expiry (10-minute timeout). Only required on whichever node is running pairing mode. Pairing mode should run on one node at a time.
/var/lib/ctrl-exec/rotation.json
Cert rotation state: current serial, previous serial, rotation timestamp, overlap expiry, and per-agent serial tracking status. Written by rotate-cert and the internal check loop.

The paths that must be shared or replicated for active/passive or active/active operation are:

  • /etc/ctrl-exec/ — all CA and cert material
  • /var/lib/ctrl-exec/agents/ — agent registry
  • /var/lib/ctrl-exec/rotation.json — rotation state

Lock files and run results are instance-local concerns.

Replication Approaches

Shared filesystem

The simplest approach for bare-metal or VM deployments is a shared filesystem mounted at /etc/ctrl-exec and /var/lib/ctrl-exec on all ctrl-exec hosts. Both NFS and DRBD (in primary/secondary or dual-primary mode) work. All instances read and write the same files.

Considerations:

  • Serial counter consistency: ca.serial is read and written on every cert signing. Under NFS, open-file locking is advisory and may not be respected across clients. Use DRBD with OCFS2 or GFS2 for cluster-safe locking if pairing operations run concurrently across nodes.
  • Registry writes are atomic (rename), which is safe over NFS on the same subnet but not guaranteed over high-latency links.
  • Lock files in /var/lib/ctrl-exec/locks/ should not be on the shared filesystem. Mount only the CA and registry paths; keep locks on local storage per instance.

Active/passive with rsync

For a cold-standby arrangement, rsync the state directories from the primary to the standby on a schedule or after each significant write:

# Run on primary after pairing or rotation events
rsync -az --delete /etc/ctrl-exec/ standby:/etc/ctrl-exec/
rsync -az --delete /var/lib/ctrl-exec/agents/ standby:/var/lib/ctrl-exec/agents/
rsync -az --delete /var/lib/ctrl-exec/rotation.json standby:/var/lib/ctrl-exec/rotation.json

RPO is the rsync interval. For low-frequency pairing environments (agents paired once and seldom changed), a 5-minute cron is sufficient. For fleets where pairing and rotation happen regularly, trigger rsync post-operation rather than on a schedule.

Transfer the CA key over an encrypted, host-authenticated channel only: scp with known_hosts verification, not StrictHostKeyChecking=no.

Object storage for the registry

The agent registry (/var/lib/ctrl-exec/agents/) is a directory of small JSON files. In cloud environments, it can be stored in object storage (S3, GCS, Azure Blob) and synced to local disk on each instance at startup and after write operations. This is suitable when the fleet is managed from ephemeral ctrl-exec instances (e.g. autoscaling groups) and a shared NFS mount is inconvenient.

The CA material (/etc/ctrl-exec/) should not be in object storage — the CA key must remain in a secrets manager or encrypted block volume with audited access controls, not in a general-purpose object bucket.

Load Balancing

Port 7443 carries mTLS connections for /run, /ping, and /capabilities. Each connection is self-contained: the agent authenticates the connecting cert against the CA, verifies the ctrl-exec serial, processes the request, and closes the connection. There is no session state that must be pinned to a specific ctrl-exec instance.

Any TCP/L4 load balancer works for port 7443:

HAProxy
L4 or L7 TCP proxy. Configure a backend pool of ctrl-exec hosts with health checks on port 7443. mTLS passthrough (L4 mode) requires no cert configuration on the load balancer.
keepalived
Virtual IP failover using VRRP. The active ctrl-exec holds the VIP; on failure the VIP moves to the standby. Agents connect to the VIP address and are unaware of the failover. Suitable for two-node active/passive.
DNS round-robin
Multiple A records for the ctrl-exec hostname. Agents resolve the name on each request. No dedicated load balancer required. Failover depends on DNS TTL and client retry behaviour; not suitable where sub-minute failover is required.

Port 7444 (pairing) and port 7445 (API) do not need to be load-balanced in normal operation. Pairing mode runs on one node at a time. The API can be load-balanced but result storage in /var/lib/ctrl-exec/runs/ must be on a shared path if GET /status/{reqid} is expected to work regardless of which node handled the original request.

Active/Passive Failover

In an active/passive setup, one ctrl-exec instance handles all traffic; the standby holds a replicated copy of all state and takes over when the primary fails.

Promotion procedure:

  1. Confirm the primary is unreachable (avoid split-brain — do not promote the standby while the primary may still be serving).
  2. Ensure the standby has a current copy of the state directories. If using rsync replication, trigger a final sync if the primary is still accessible, or accept the lag from the last scheduled sync.
  3. On the standby, start the ctrl-exec services:

    bash systemctl start ctrl-exec-api

  4. Move the virtual IP or update DNS to point at the standby.

Agents reconnect transparently on their next request. There is no re-pairing required. The standby presents the same ctrl-exec cert (same serial) as the primary — agents see no difference.

If the standby was behind in registry state (new agents paired on the primary after the last sync), those agents will be unknown to the newly promoted node. They will still connect successfully on port 7443 (mTLS trust is CA-based, not registry-based) but will not appear in list-agents until the registry entry is recovered or the agent is re-paired.

Active/Active

Multiple ctrl-exec instances serving port 7443 simultaneously is supported for run and ping operations. All instances present the same cert (same serial), share the same registry, and agents accept connections from any of them.

Concurrency locking
Lock files in /var/lib/ctrl-exec/locks/ are per-instance. An active/active setup does not provide cross-instance concurrency locks — two instances can dispatch the same script to the same agent at the same time. If concurrency control matters, either keep lock files on a shared filesystem with cluster-safe locking, or route all requests for a given agent through the same instance (consistent hashing at the load balancer).
Pairing mode
Pairing mode should only run on one node at a time. The pairing queue in /var/lib/ctrl-exec/pairing/ is not designed for concurrent write access from multiple instances. Run pairing interactively on a designated node, or use approve and deny commands on the same node that accepted the request.
Cert rotation
rotate-cert should be run on one node. It writes rotation.json, broadcasts the new serial to all agents, and updates per-agent status in the registry. Running it simultaneously from two nodes would produce a race on rotation.json and ca.serial. Schedule rotation as a maintenance operation on a designated node.
Registry writes
Agent registry files are written atomically via rename. Concurrent writes from multiple instances to different agent files are safe. Concurrent writes to the same agent file (e.g. two renewals for the same agent) are last-write-wins — operationally harmless since the content converges.

Cert Rotation in an HA Setup

ctrl-exec cert rotation updates the serial stored on every agent. In an HA setup, all instances must present the new cert immediately after rotation — an instance still presenting the old cert will be rejected by agents that have already updated their stored serial.

Rotation procedure for HA:

  1. Run ctrl-exec rotate-cert on one designated node. This generates the new cert, writes it to /etc/ctrl-exec/ctrl-exec.crt and /etc/ctrl-exec/ctrl-exec.key, and broadcasts the new serial to all agents via update-ctrl-exec-serial.
  2. Sync the updated /etc/ctrl-exec/ to all other ctrl-exec instances immediately. All instances must reload their cert before any agent completes its serial update. In practice the broadcast takes seconds to minutes depending on fleet size; sync should complete before that window closes.
  3. Reload or restart all ctrl-exec instances:

    bash systemctl restart ctrl-exec-api

    ctrl-exec-api reads its cert at startup. There is no live cert reload — a restart is required.

The update-ctrl-exec-serial script on each agent writes the new serial and sends SIGHUP to the agent process. After SIGHUP, the agent will reject connections from any ctrl-exec presenting the old serial. The overlap window (cert_overlap_days, default 30 days) is the time allowed for agents that were unreachable during the broadcast to reconnect and receive the update — it is not a grace period for the ctrl-exec instances themselves. All ctrl-exec instances must be updated before the first agent processes its serial update.

What HA Does Not Solve

CA key compromise
An attacker with the CA key can issue valid agent certificates regardless of how many ctrl-exec instances exist. The CA is the single root of trust for the deployment. HA increases availability; it does not limit the blast radius of a CA key compromise. All instances share the same CA, so a compromise affects all of them equally. See SECURITY-OPERATIONS.md for the CA compromise recovery procedure.
Cert serial consistency
All instances must present the same ctrl-exec cert. Divergence — one instance presenting an old cert — causes agents to reject that instance after a rotation. The replication and reload procedure must be treated as an atomic operation across the fleet.
Pairing queue coordination
Pending pairing requests in /var/lib/ctrl-exec/pairing/ are not replicated in a standard rsync setup. A request submitted to one node's pairing mode cannot be approved on another. Run pairing on a single designated node.
Agent cert revocation propagation
The revocation list on each agent (/etc/ctrl-exec-agent/revoked-serials) must be updated via a ctrl-exec run to each agent individually. HA on the ctrl-exec side does not change this — revocation state lives on the agents, not on the ctrl-exec. A ctrl-exec failover does not affect which certs agents will accept or reject.
Split-brain
If two ctrl-exec instances both believe they are primary and both run rotate-cert simultaneously, the results are undefined. Use VRRP, distributed locking, or operational discipline to ensure rotation runs on exactly one node at a time.