Agent Failover

Aigon Pro feature. Auto-failover and the manual “Failover now” button require @aigon/pro. The event vocabulary (agent.token_exhausted, agent.failover_switched) works in OSS, but policy: switch degrades to policy: notify without Pro loaded.

Why this exists

Long-running autonomous features regularly hit a provider’s rolling quota window before the work is done. Without failover, a hit means the agent stops, the feature stalls, and you manually restart it on a different agent when the window resets (often hours later).

Aigon’s failover chain lets the work continue automatically. When the supervisor detects that an agent has hit its quota, it picks the next agent in your configured priority list, kills the stalled session, and spawns a replacement in the same worktree — passing a handoff prompt so the replacement knows where things left off.

How it works

The supervisor runs a sweep every 30 seconds. For each active feature agent slot it checks three inputs, in this order:

Live tmux pane: the supervisor captures the agent’s tmux pane each sweep and matches its content against the agent’s quota strings (usage limit, quota exceeded, rate limit, etc.). This is the primary path. Most CLIs print a quota message and stay at their REPL prompt instead of exiting, so they never trigger an exit-code path. Live-pane scanning catches them anyway.
Stderr patterns + exit code: for the rare CLIs that do exit with code 1 on quota, the post-mortem agent-status file is matched against the same patterns.
Telemetry over-limit: the session’s billable-token count exceeds the configured tokenLimits.perSessionBillableTokens threshold.

When any of these fires and the agent slot has not yet been flagged as exhausted, the supervisor writes an agent.token_exhausted event with a source field (live_pane_pattern, stderr_pattern, or telemetry_limit), flips the slot’s status to needs_attention, and records the tokenExhausted flag on the snapshot.
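The three checks and their ordering can be sketched as follows. This is a minimal illustration, not Aigon's implementation: the `SlotInputs` type, the `detect_exhaustion` name, the pattern list, and the case-insensitive substring matching are all assumptions made for the example. Only the three `source` values come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical quota strings; the real per-agent lists are defined by Aigon.
QUOTA_PATTERNS = ("usage limit", "quota exceeded", "rate limit")


@dataclass
class SlotInputs:
    """Hypothetical bundle of what one sweep sees for one agent slot."""
    pane_text: str = ""                # captured tmux pane content
    exit_code: Optional[int] = None    # None while the CLI is still running
    status_text: str = ""              # post-mortem agent-status file content
    billable_tokens: int = 0           # telemetry count for this session
    token_limit: Optional[int] = None  # None means no limit configured


def matches_quota(text: str, patterns: Sequence[str] = QUOTA_PATTERNS) -> bool:
    # Assumption: case-insensitive substring match.
    lowered = text.lower()
    return any(pattern in lowered for pattern in patterns)


def detect_exhaustion(slot: SlotInputs) -> Optional[str]:
    """Return the event `source` for the first check that fires, else None."""
    if matches_quota(slot.pane_text):
        return "live_pane_pattern"
    if slot.exit_code == 1 and matches_quota(slot.status_text):
        return "stderr_pattern"
    if slot.token_limit is not None and slot.billable_tokens > slot.token_limit:
        return "telemetry_limit"
    return None
```

Checking the live pane first mirrors the ordering described above: a CLI that prints a quota message and stays at its prompt is caught without waiting for an exit code.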

Auto-failover then fires only when both conditions hold:

agentFailover.policy is switch, AND
the feature is running in autonomous (autopilot) mode.

Manual / Drive / Fleet runs never auto-swap: the operator is at the wheel and clicks the “Failover now →” button when ready.

When both auto-failover conditions are met, Pro’s exhaustion handler:

Calls chooseNextAgent(chain, currentAgentId, [currentAgentId]) to pick the replacement.
Kills the current tmux session.
Spawns a new session for the replacement agent in the same worktree, with a handoff prompt that includes the last reachable commit and the slot history.
Writes an agent.failover_switched event.
Clears the tokenExhausted flag so the sweep does not re-fire on the same exhaustion.

The chain is forward-only — once an agent is exhausted it is not retried later in the same run.
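A forward-only selection of this kind might look like the sketch below. It is a hypothetical Python rendering of the behaviour described for chooseNextAgent, not the Pro source: treating the third argument as an exclusion list and returning `None` for an agent that is not in the chain are both assumptions.

```python
from typing import Optional, Sequence


def choose_next_agent(
    chain: Sequence[str],
    current: str,
    excluded: Sequence[str] = (),
) -> Optional[str]:
    """Return the first agent after `current` in `chain` that is not excluded."""
    if current not in chain:
        # Assumption: an agent outside the chain has no successor.
        return None
    start = chain.index(current) + 1
    for agent in chain[start:]:
        if agent not in excluded:
            return agent
    # Walked off the end of the chain: nothing left to switch to.
    return None
```

With the chain `["cc", "cx", "gg"]`, calling it for `cc` yields `cx`, for `cx` yields `gg`, and for `gg` yields `None`. Because the walk only ever moves forward from the current position, earlier agents are never revisited.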

Configuration

Add an agentFailover block to your repo’s .aigon/config.json:


{
  "agentFailover": {
    "policy": "switch",
    "chain": ["cc", "cx", "gg"],
    "tokenLimits": {
      "perSessionBillableTokens": 500000
    }
  }
}

| Field | Type | Description |
| --- | --- | --- |
| `policy` | `"notify"` \| `"switch"` \| `"pause"` | `notify` logs and notifies but does not switch. `switch` triggers auto-failover (Pro). `pause` marks the agent slot paused for manual action. |
| `chain` | `string[]` | Ordered priority list of agent IDs. The supervisor walks forward from the current agent’s position. |
| `tokenLimits.perSessionBillableTokens` | `number` | Soft limit (tokens) above which telemetry-based exhaustion fires. Defaults to unlimited if unset. |
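To sanity-check a config before starting a long run, a small pre-flight script along these lines could help. It is a sketch and not Aigon's own loader: the `load_failover_config` helper is hypothetical, the `notify` default for a missing `policy` is an assumption, and rejecting unknown policies is a choice made for the example. The unlimited default for the token limit follows the table above.

```python
import json
import math

ALLOWED_POLICIES = {"notify", "switch", "pause"}


def load_failover_config(raw_json: str) -> dict:
    """Parse the agentFailover block and fill in defaults."""
    block = json.loads(raw_json).get("agentFailover", {})
    # Assumption: a missing policy is treated as "notify".
    policy = block.get("policy", "notify")
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unknown agentFailover.policy: {policy!r}")
    limits = block.get("tokenLimits", {})
    return {
        "policy": policy,
        "chain": list(block.get("chain", [])),
        # Unset limit means telemetry-based exhaustion never fires.
        "token_limit": limits.get("perSessionBillableTokens", math.inf),
    }
```

Passing the contents of .aigon/config.json returns a flat dictionary, with `token_limit` equal to `math.inf` when no limit is configured.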

Per-agent model expectations

You can set a per-agent model override in .aigon/agents/<id>/config.json. When the failover handler spawns a replacement, it resolves the replacement agent’s CLI config independently — each agent uses its own model preference.

Auto-failover (autonomous runs only)

The full auto path, step by step:

A feature is started in autonomous mode (e.g. via aigon feature-autonomous-start) on cc with chain [cc, cx, gg] and policy: switch.
cc hits its quota. The CLI prints “usage limit” and returns to its REPL prompt; it does not exit.
The supervisor sweep (every 30 s) reads the live tmux pane, matches the quota string, and flags the slot as exhausted.
agent.token_exhausted is appended to events.jsonl (source: live_pane_pattern). The slot’s status flips to needs_attention.
Because the feature is autonomous and the policy is switch, the Pro exhaustion handler fires. It selects cx (the next agent in the chain after cc).
The cc tmux session is killed.
A new session is spawned for cx in the same worktree, with a handoff prompt referencing the last commit.
agent.failover_switched is appended.
tokenExhausted is cleared from the snapshot — the sweep will not re-fire until the next real exhaustion.

If cx later exhausts, the process repeats with gg. When gg exhausts with no next agent, agent.token_exhausted is recorded but no switch fires.

Manual runs never auto-swap. A feature started with aigon feature-start (Drive / Fleet / solo) records the exhaustion event, flips the card to “Token limit hit”, and waits for you to click the failover button. The autonomy gate is intentional — the operator is at the wheel.
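The walkthrough above, including the autonomy gate, can be simulated in a few lines. This is an illustrative model only: `should_auto_failover` and `simulate_run` are hypothetical names, and the simulation assumes every agent in the chain eventually exhausts. The event names match the vocabulary used in this page.

```python
from typing import List, Sequence, Tuple


def should_auto_failover(policy: str, autonomous: bool) -> bool:
    """Both conditions must hold for an automatic switch."""
    return policy == "switch" and autonomous


def simulate_run(
    chain: Sequence[str], start: str, policy: str, autonomous: bool
) -> List[Tuple[str, str]]:
    """List the events emitted as each agent in turn hits its quota."""
    events: List[Tuple[str, str]] = []
    current = start
    while True:
        events.append(("agent.token_exhausted", current))
        if not should_auto_failover(policy, autonomous):
            break  # manual run: wait for the operator
        position = chain.index(current)
        if position + 1 >= len(chain):
            break  # chain exhausted: recorded, but no switch
        successor = chain[position + 1]
        events.append(("agent.failover_switched", f"{current} -> {successor}"))
        current = successor
    return events
```

For an autonomous run with `policy: switch` on the chain `["cc", "cx", "gg"]`, the result is three exhaustion events and two switches. The same call with `autonomous=False` stops after the first exhaustion event.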

Manual failover

When a slot has been flagged exhausted, its card shows:

a yellow ⚠ Token limit hit badge in place of “Implementing”
a yellow “Failover now → <next-agent>” button as the slot’s primary action

Both appear when all of the following hold:

entityType is feature
The agent slot has tokenExhausted set (i.e. exhaustion was recorded server-side)
The failover chain has a candidate remaining

When the chain is fully exhausted, the button appears disabled with the tooltip “No agents left in failover chain”. When the slot is healthy (no exhaustion recorded), the button is hidden entirely.
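The three button states described here (hidden, enabled, disabled) reduce to a small decision function. The sketch below is a hypothetical model of that logic, not the dashboard's code; the function name and the returned dictionary shape are assumptions made for illustration.

```python
from typing import Optional


def failover_button_state(
    entity_type: str, token_exhausted: bool, next_agent: Optional[str]
) -> dict:
    """Decide how the failover button should render for one slot."""
    if entity_type != "feature" or not token_exhausted:
        # Healthy slot or unsupported entity: no button at all.
        return {"visible": False}
    if next_agent is None:
        # Exhaustion recorded, but nothing left in the chain.
        return {
            "visible": True,
            "enabled": False,
            "tooltip": "No agents left in failover chain",
        }
    return {
        "visible": True,
        "enabled": True,
        "label": f"Failover now → {next_agent}",
    }
```

Here `next_agent` stands for the remaining chain candidate, or `None` when the chain is exhausted.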

Clicking the button sends a POST /api/feature-failover request. The server validates that tokenExhausted is set and that a chain candidate exists before executing the switch. On success, a green toast confirms “Switched cc → cx”; on failure (e.g. 409 chain exhausted), a red toast shows the server’s error.

Gate: The button only appears after exhaustion has been recorded server-side. You cannot trigger a pre-emptive switch before the supervisor has flagged the slot.

Verifying it ran

Dashboard slot card — when exhaustion is recorded, the card visibly changes:

The status badge flips from green “Implementing” to yellow ⚠ Token limit hit
A yellow “Failover now → <next-agent>” button appears as the slot’s primary action
After a swap, the agent label updates to <new-agent> (was <old-agent>) so the history is legible

Activity feed — open the feature detail panel. The timeline shows:

🟡 cc hit token limit · live_pane_pattern · 14:02 for each agent.token_exhausted event
🟢 Failover · cc → cx · last commit a1b2c3d · 14:02 for each agent.failover_switched event

events.jsonl — events are stored at .aigon/workflows/features/<id>/events.jsonl. To grep for both types:


grep -E '"agent\.token_exhausted"|"agent\.failover_switched"' \
  .aigon/workflows/features/<id>/events.jsonl | jq .
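Where jq is not available, the same filter can be written in Python. The sketch below avoids assuming which field holds the event name: it keeps any event whose top-level string values include one of the two names. Unlike the grep above, it does not look inside nested objects. The helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

FAILOVER_EVENTS = {"agent.token_exhausted", "agent.failover_switched"}


def failover_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed events that mention either failover event name."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if not isinstance(event, dict):
            continue
        # Match on any top-level string value rather than a fixed key.
        if any(isinstance(v, str) and v in FAILOVER_EVENTS for v in event.values()):
            yield event


def read_failover_events(path: str) -> list:
    """Read an events.jsonl file and return only the failover events."""
    text = Path(path).read_text(encoding="utf-8")
    return list(failover_events(text.splitlines()))
```

Call `read_failover_events` with the path to the feature's events.jsonl to get the matching events as dictionaries, in file order.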

Limitations

Forward-only chain. An exhausted agent is not retried later in the same run, even if its quota window resets.
Feature entities only. The supervisor’s failover branch runs for entityType === 'feature'; research entities are out of scope.
One global chain. There is a single chain per repo config, not per-agent chains.
Pre-exhaustion switching is not supported. The manual button and auto-switch both require the tokenExhausted flag to be set first.
Model override edge case. If a slot has a modelOverride (e.g. opus), the replacement session uses its own agent’s config, not the slot’s override. Track separately if this causes problems in practice.

Pro tier note

policy: switch and the “Failover now” dashboard button require @aigon/pro.
The event vocabulary (agent.token_exhausted, agent.failover_switched) and their projector branches are part of the OSS event schema and work without Pro — events.jsonl will still record exhaustion events.
Without Pro, policy: switch logs a one-line warning at supervisor startup (agentFailover.policy=switch requires aigon-pro; falling back to notify) and behaves as notify.
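The fallback rule can be expressed as a single function. This is a sketch of the behaviour described above rather than the supervisor's code: `effective_policy` is a hypothetical name, and returning the warning alongside the policy is a choice made for the example. The warning text is the one quoted above.

```python
from typing import Optional, Tuple

FALLBACK_WARNING = (
    "agentFailover.policy=switch requires aigon-pro; falling back to notify"
)


def effective_policy(configured: str, pro_loaded: bool) -> Tuple[str, Optional[str]]:
    """Return the policy actually applied and a warning to log, if any."""
    if configured == "switch" and not pro_loaded:
        return "notify", FALLBACK_WARNING
    return configured, None
```

Only `switch` is affected: `notify` and `pause` pass through unchanged whether or not Pro is loaded.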