Observation
Provider Session Death Is Silent and Undetected
Context
On 2026-04-25, two Codex seats died from different causes within the same work window:
1. Codex CLI: returned 401 Unauthorized with token_revoked errors. The session became a zombie: the process was alive and the terminal had content, but every command failed. Chad typed "go" and "check feed and go" multiple times; all returned errors.
2. Codex Desktop: displayed "You've reached your workspace credit limit." The seat died mid-work on HWD-0373 and HWD-0425.
Both seats appeared alive in fleet telemetry (Layer 1 filesystem, Layer 2 process). Neither emitted a death signal to the feed. Work items stayed IN_PROGRESS with no one working them.
Observation
When a provider session hits a usage limit, credit limit, or auth token revocation, it dies silently. The fleet has no mechanism to detect this. Chad becomes the sole detector — he must manually notice stalled work, diagnose the cause, and restart.
The failure mode is worse than a crash. A crash removes the process from telemetry. A zombie session persists in telemetry, actively masking the failure. Fleet telemetry Layer 1 (filesystem mtime) still shows recent files. Layer 2 (process check) still shows a running process. Only Layer 3 (tmux pane capture) or Layer 4 (screenshot) would reveal the error — but no automated check runs those layers.
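The distinction between a crash and a zombie can be written as a small decision function. This is only an illustrative sketch: the function names, the boolean inputs, and the verdict labels ("crashed", "zombie", "stale", "healthy") are assumptions and not part of the existing fleet telemetry.

```python
from typing import Optional


def layers_1_2_pass(recent_mtime: bool, process_alive: bool) -> bool:
    """What the automated telemetry sees today: Layer 1 (mtime) and Layer 2 (process)."""
    return recent_mtime and process_alive


def seat_verdict(recent_mtime: bool, process_alive: bool,
                 pane_error: Optional[str]) -> str:
    """Combine Layers 1-2 with a Layer 3 result (error found in the pane, or None)."""
    if not process_alive:
        return "crashed"   # visible to Layer 2, so telemetry notices it
    if pane_error is not None:
        return "zombie"    # process alive, but the pane shows a provider error
    return "healthy" if recent_mtime else "stale"
```

For the seats described above, layers_1_2_pass(True, True) is True while seat_verdict(True, True, "token_revoked") returns "zombie": the failure is only visible once a Layer 3 result is part of the decision.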
Compounding damage: restarting Codex kills ALL Codex sessions, destroying warm context across every seat. A single zombie infects the entire Codex fleet through the restart path.
Failure chain:
provider limit/revocation
→ session zombie (alive but non-functional)
→ fleet telemetry reports "healthy" (Layers 1-2 pass)
→ work items stuck IN_PROGRESS, no feed signal
→ Chad manually detects (minutes to hours of latency)
→ Chad restarts Codex (kills ALL sessions)
→ warm context destroyed fleet-wide
→ seats must re-bootstrap from cold
Root causes
Three distinct death triggers observed or anticipated:
| Trigger | Signal in terminal | Recoverable without restart? |
|---|---|---|
| Token revocation | 401 Unauthorized, token_revoked | No — requires new auth |
| Workspace credit limit | "You've reached your workspace credit limit" | No — requires credit/plan change |
| Usage rate limit | "usage limit", "rate limit" | Sometimes — may self-resolve after cooldown |
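The table can be turned into a lookup that maps captured terminal text to a trigger. The sketch below uses only the signal strings listed in the table; the trigger labels and the function name are hypothetical, and real provider messages may be worded differently.

```python
import re
from typing import Optional

# Patterns come from the "Signal in terminal" column; labels are illustrative names.
TRIGGER_PATTERNS = [
    ("token_revocation", re.compile(r"401 Unauthorized|token_revoked", re.IGNORECASE)),
    ("workspace_credit_limit", re.compile(r"workspace credit limit", re.IGNORECASE)),
    ("usage_rate_limit", re.compile(r"usage limit|rate limit", re.IGNORECASE)),
]

# Per the table, only the rate-limit case may clear on its own after a cooldown.
MAY_SELF_RESOLVE = {"usage_rate_limit"}


def classify_pane_text(text: str) -> Optional[str]:
    """Return the first matching trigger label, or None if no known signal appears."""
    for trigger, pattern in TRIGGER_PATTERNS:
        if pattern.search(text):
            return trigger
    return None
```

Plain substring matching of this kind can produce false positives, for example when a seat is legitimately editing code or notes that mention "rate limit", so a match is best treated as a prompt for a closer look rather than as proof of death.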
Ameliorations
A. Watchdog cron (implementable now)
A diagnostic script that checks Codex session health by scanning tmux pane output for error patterns. On detection:
- Writes a dead-seat report to ~/Desktop/fleet/dead-seats/ with seat name, timestamp, error type, and last known work item.
- Posts DEAD state to feed, transitioning items to BLOCKED-on-seat-death.
- Alerts Chad via Slack/iMessage.
This is a Layer 3 telemetry check (tmux pane capture) automated on a cron. Script: ~/Desktop/fleet/scripts/codex-health-check.
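A minimal sketch of what such a check might look like follows. It is not the contents of codex-health-check: the function names, the report fields, the error-type labels, and the example pane target are all assumptions. The only external call is tmux capture-pane with its -p (print to stdout), -t (target pane), and -S (start line) options. Posting to the feed and alerting Chad are left out because their interfaces are not described here.

```python
import subprocess
from datetime import datetime, timezone
from typing import Optional

# Substrings quoted in this document, lowercased; labels are illustrative.
ERROR_MARKERS = {
    "token_revoked": ("401 unauthorized", "token_revoked"),
    "credit_limit": ("workspace credit limit",),
    "usage_limit": ("usage limit", "rate limit"),
}


def capture_pane(target: str, history_lines: int = 200) -> str:
    """Layer 3 check: return recent text from a tmux pane (target such as 'codex:0.0')."""
    result = subprocess.run(
        ["tmux", "capture-pane", "-p", "-t", target, "-S", f"-{history_lines}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def detect_error(pane_text: str) -> Optional[str]:
    """Return the first error type whose marker appears in the pane text, else None."""
    lowered = pane_text.lower()
    for error_type, markers in ERROR_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return error_type
    return None


def build_report(seat: str, error_type: str, work_item: Optional[str],
                 now: Optional[datetime] = None) -> dict:
    """Assemble the dead-seat report fields listed above (field names are assumed)."""
    now = now or datetime.now(timezone.utc)
    return {
        "seat": seat,
        "timestamp": now.isoformat(),
        "error_type": error_type,
        "last_work_item": work_item,
        "state": "DEAD",
    }
```

A cron entry would call capture_pane for each seat, pass the text to detect_error, and write build_report's result into the dead-seats directory when an error is found. Because scrollback can retain an old error after a seat has recovered, limiting history_lines to the most recent output is one way to reduce stale matches.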
B. Pre-death context dump (requires investigation)
Before a seat fully dies, dump warm context to ~/Desktop/fleet/dead-seats/: current ticket ID, branch name, what work is complete, what remains. This would reduce re-bootstrap cost from "start from scratch" to "resume from checkpoint." Feasibility depends on whether the dying session can execute a final command before becoming fully non-functional.
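One possible shape for such a checkpoint is sketched below, assuming a JSON file per seat. The SeatCheckpoint type, its field names, and the file naming are hypothetical; they simply mirror the four items listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List


@dataclass
class SeatCheckpoint:
    seat: str
    ticket_id: str
    branch: str
    completed: List[str] = field(default_factory=list)  # work already done
    remaining: List[str] = field(default_factory=list)  # work still outstanding


def dump_checkpoint(cp: SeatCheckpoint) -> str:
    """Serialize a checkpoint to JSON text."""
    return json.dumps(asdict(cp), indent=2)


def load_checkpoint(text: str) -> SeatCheckpoint:
    """Rebuild a checkpoint from JSON text when a seat re-bootstraps."""
    return SeatCheckpoint(**json.loads(text))


def write_checkpoint(cp: SeatCheckpoint, directory: Path) -> Path:
    """Write the checkpoint to <directory>/<seat>-checkpoint.json and return the path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{cp.seat}-checkpoint.json"
    path.write_text(dump_checkpoint(cp))
    return path
```

Since it is not yet known whether a dying session can run a final command, one option to evaluate is writing the checkpoint periodically while the seat is still healthy, so the most recent file is available regardless of how the session ends.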
C-E. Not currently feasible
Credit monitoring, session isolation (restart one without killing all), and usage tracking are not implementable given current Codex platform constraints. These would require provider-side changes.
Impact
- Detection latency: minutes to hours (Chad is sole detector)
- Blast radius: fleet-wide (restart kills all Codex sessions)
- Work loss: warm context across every Codex seat
- Frequency: at least twice on 2026-04-25; likely more frequent than detected (by definition, silent failures are undercounted)