Observation

Provider Session Death Is Silent and Undetected

ID: ORA-2026-0073
Date: 2026-04-25
Status: draft
Maturity: M1
Source: docs/entries/observations/ORA-2026-0073_provider-session-death-silent-and-undetected.md

Tags: fleet-ops, session-health, heartbeat


Context

On 2026-04-25, two Codex seats died from different causes within the same work window:

1. Codex CLI: returned 401 Unauthorized with token_revoked errors. The session became a zombie: process alive, terminal had content, but every command failed. Chad typed "go" and "check feed and go" multiple times; all returned errors.
2. Codex Desktop: displayed "You've reached your workspace credit limit." The seat died mid-work on HWD-0373 and HWD-0425.

Both seats appeared alive in fleet telemetry (Layer 1 filesystem, Layer 2 process). Neither emitted a death signal to the feed. Work items stayed IN_PROGRESS with no one working them.

Observation

When a provider session hits a usage limit, credit limit, or auth token revocation, it dies silently. The fleet has no mechanism to detect this. Chad becomes the sole detector — he must manually notice stalled work, diagnose the cause, and restart.

The failure mode is worse than a crash. A crash removes the process from telemetry. A zombie session persists in telemetry, actively masking the failure. Fleet telemetry Layer 1 (filesystem mtime) still shows recent files. Layer 2 (process check) still shows a running process. Only Layer 3 (tmux pane capture) or Layer 4 (screenshot) would reveal the error — but no automated check runs those layers.

Compounding damage: restarting Codex kills ALL Codex sessions, destroying warm context across every seat. A single zombie infects the entire Codex fleet through the restart path.

Failure chain:

provider limit/revocation
  → session zombie (alive but non-functional)
  → fleet telemetry reports "healthy" (Layers 1-2 pass)
  → work items stuck IN_PROGRESS, no feed signal
  → Chad manually detects (minutes to hours of latency)
  → Chad restarts Codex (kills ALL sessions)
  → warm context destroyed fleet-wide
  → seats must re-bootstrap from cold

Root causes

Three distinct death triggers, two observed on 2026-04-25 (token revocation, workspace credit limit) and one anticipated (usage rate limit):

Trigger	Signal in terminal	Recoverable without restart?
Token revocation	`401 Unauthorized`, `token_revoked`	No — requires new auth
Workspace credit limit	"You've reached your workspace credit limit"	No — requires credit/plan change
Usage rate limit	"usage limit", "rate limit"	Sometimes — may self-resolve after cooldown
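
The table above can be read as a small classification rule set. The following is a minimal sketch, assuming the terminal strings listed in the table are stable enough to match on; the names `DeathTrigger` and `classify_pane_text` and the trigger labels are hypothetical and do not refer to existing fleet code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeathTrigger:
    name: str          # hypothetical label, not an existing feed state
    recoverable: bool  # True only if the seat may come back without a restart


# Patterns come from the "Signal in terminal" column of the table.
# Order matters: the first matching rule wins.
_RULES = [
    (re.compile(r"401 Unauthorized|token_revoked", re.IGNORECASE),
     DeathTrigger("token_revocation", recoverable=False)),
    (re.compile(r"workspace credit limit", re.IGNORECASE),
     DeathTrigger("workspace_credit_limit", recoverable=False)),
    (re.compile(r"usage limit|rate limit", re.IGNORECASE),
     DeathTrigger("usage_rate_limit", recoverable=True)),
]


def classify_pane_text(text: str) -> Optional[DeathTrigger]:
    """Return the first trigger whose pattern appears in text, else None."""
    for pattern, trigger in _RULES:
        if pattern.search(text):
            return trigger
    return None
```

Plain substring matching like this can misfire when a seat is legitimately working on code or logs that mention "rate limit", so a real check would probably restrict the scan to the last few lines of output. The `recoverable` flag mirrors the third column: a rate-limited seat might be given a cooldown before it is declared dead.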

Ameliorations

A. Watchdog cron (implementable now)

A diagnostic script checks Codex session health by scanning tmux pane output for error patterns. On detection, the script:

Writes a dead-seat report to ~/Desktop/fleet/dead-seats/ with seat name, timestamp, error type, and last known work item.
Posts DEAD state to feed, transitioning items to BLOCKED-on-seat-death.
Alerts Chad via Slack/iMessage.

This is a Layer 3 telemetry check (tmux pane capture) automated on a cron. Script: ~/Desktop/fleet/scripts/codex-health-check.
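
The capture and report steps could look roughly like the sketch below. It is not the contents of codex-health-check. It assumes each seat runs in a tmux pane addressable by a target string, and that a JSON file per detection is an acceptable report format; the function names, the file naming scheme, and the JSON keys are all hypothetical. Posting to the feed and alerting Chad are left out because their interfaces are not described here.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def capture_pane(target: str, lines: int = 200) -> str:
    """Return roughly the last `lines` lines of a tmux pane (Layer 3)."""
    result = subprocess.run(
        # -p prints to stdout, -t selects the pane, -S sets the start line
        # (negative values reach back into scrollback history).
        ["tmux", "capture-pane", "-p", "-t", target, "-S", f"-{lines}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def write_dead_seat_report(report_dir: Path, seat: str, error_type: str,
                           work_item: Optional[str]) -> Path:
    """Write one JSON report with the four fields named above."""
    report_dir.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc)
    report = {
        "seat": seat,
        "timestamp": now.isoformat(),
        "error_type": error_type,      # label produced by the pattern scan
        "last_work_item": work_item,   # None if it could not be determined
    }
    # Hypothetical naming scheme: <seat>-<UTC timestamp>.json
    path = report_dir / f"{seat}-{now.strftime('%Y%m%dT%H%M%SZ')}.json"
    path.write_text(json.dumps(report, indent=2))
    return path
```

A cron entry would call `capture_pane` for each seat, scan the returned text for the known error strings, and call `write_dead_seat_report` with `Path.home() / "Desktop/fleet/dead-seats"` on a match. Because cron would re-detect the same zombie on every run, a real script would likely also need to skip seats that already have an open report.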

B. Pre-death context dump (requires investigation)

Before a seat fully dies, dump warm context to ~/Desktop/fleet/dead-seats/: current ticket ID, branch name, what work is complete, what remains. This would reduce re-bootstrap cost from "start from scratch" to "resume from checkpoint." Feasibility depends on whether the dying session can execute a final command before becoming fully non-functional.
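
To make the checkpoint idea concrete, here is a minimal sketch of what such a dump might contain and how it could be written safely. The `SeatCheckpoint` fields follow the four items listed above plus a seat name; the class, the function names, and the one-file-per-seat layout are assumptions for illustration. Whether a dying session can still run this is exactly the open question, so the sketch only shows the write and read path.

```python
import json
import os
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List


@dataclass
class SeatCheckpoint:
    seat: str
    ticket_id: str
    branch: str
    completed: List[str] = field(default_factory=list)  # work already done
    remaining: List[str] = field(default_factory=list)  # work still open


def dump_checkpoint(checkpoint: SeatCheckpoint, dump_dir: Path) -> Path:
    """Write the checkpoint as JSON, replacing any earlier one for the seat."""
    dump_dir.mkdir(parents=True, exist_ok=True)
    final_path = dump_dir / f"{checkpoint.seat}-checkpoint.json"
    tmp_path = final_path.with_suffix(".json.tmp")
    # Write to a temp file first, then rename, so a session that dies
    # mid-write cannot leave a truncated checkpoint behind.
    tmp_path.write_text(json.dumps(asdict(checkpoint), indent=2))
    os.replace(tmp_path, final_path)
    return final_path


def load_checkpoint(path: Path) -> SeatCheckpoint:
    """Read a checkpoint back when re-bootstrapping a seat."""
    return SeatCheckpoint(**json.loads(path.read_text()))
```

The temp-file-then-rename step matters here more than usual: the writer is by assumption close to failure, and a half-written checkpoint would be worse than a stale one.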

C-E. Not currently feasible

Credit monitoring, session isolation (restart one without killing all), and usage tracking are not implementable given current Codex platform constraints. These would require provider-side changes.

Impact

Detection latency: minutes to hours (Chad is sole detector)
Blast radius: fleet-wide (restart kills all Codex sessions)
Work loss: warm context across every Codex seat
Frequency: at least twice on 2026-04-25; likely more frequent than detected (by definition, silent failures are undercounted)