Doctrine

ORA-2026-0097: Cross-provider verification gate for physical-world judgment tasks

ID: ORA-2026-0097
Date: 2026-04-28
Status: active
Maturity: unknown
Source: docs/entries/doctrines/ORA-2026-0097_cross-provider-gate-for-physical-world-judgment.md

Tags: chromatic, physical-world, verification

Doctrine

Any fleet task or pipeline stage that makes a physical-world judgment — interpreting what exists in reality from ambiguous signals — MUST use at least two different LLM providers before treating the result as decided. Same-provider consensus is correlated error, not independent verification.

Gate trigger categories

The gate fires when the task involves any of these judgment types:

Category	Examples in codebase	Current state
Photo classification / visual perception	`bt_photo_*` scripts, `visionConsensus` primitive	COMPLIANT: `visionConsensus` runs a second provider via `comparisonProvider` by design
Physical-world identity resolution	`resolve` primitive in `gmail-financial-pipeline` (vendor/contact disambiguation from email text)	NOT COMPLIANT: `resolve` uses a single provider via `runStructuredDecision`
Receipt/document reading	`financial_bill_cover_manual_packet_ocr_index.rb`, receipt page-image reading	PARTIAL: OCR is mechanical, but interpreting handwritten notes or ambiguous amounts is judgment
Ambiguous perception from images	Any `vision` call where the image content is the evidence (not just formatting)	PARTIAL: the `vision` primitive still permits single-provider calls
Speaker/entity attribution from transcripts	GT review desk attribution, `journal-extract` speaker identification	NOT COMPLIANT: single-provider extraction

Gate does NOT fire for

Structured text extraction where the answer is deterministic (parsing a clearly printed invoice number)
Classification against a closed vocabulary where the categories are unambiguous (mapping a known vendor name to a cost code)
Mechanical parallel reads of code, documents, or data where the content is unambiguous
Summarization where fidelity can be verified by reference to source text
Rules-based detection (e.g., drift_detector.ts — deterministic, no LLM)

Existing infrastructure

The codebase already has the infrastructure for cross-provider judgment:

LlmProvider type: "openai" | "anthropic" | "google" (_shared/llm/types.ts)
visionConsensus primitive (_shared/llm/primitives.ts:795): takes comparisonProvider and comparisonModel, runs both providers, returns agree / disagree / needs_review
DEFAULT_MODELS map in bt_photo_consensus_smoke.ts: anthropic→claude-haiku-4-5, openai→gpt-4o-mini, google→gemini-2.5-flash

What's missing: an equivalent resolveConsensus wrapper around the resolve primitive that mirrors the visionConsensus pattern.
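
A minimal sketch of what such a wrapper might look like. Only `LlmProvider`, `chosen_id`, and the agree / disagree / needs_review verdicts come from this entry. The `ResolveFn`, `ResolveResult`, and `ConsensusResult` shapes, the `confidence` field, the `minConfidence` threshold, and the rule for when `needs_review` is returned are assumptions made for illustration and are not taken from `_shared/llm/primitives.ts`.

```typescript
// Provider union as described for _shared/llm/types.ts.
type LlmProvider = "openai" | "anthropic" | "google";

// Hypothetical result shape; the real resolve output may differ.
interface ResolveResult {
  chosen_id: string | null; // null = provider declined to pick a candidate
  confidence: number;       // assumed to lie in [0, 1]
}

type Verdict = "agree" | "disagree" | "needs_review";

interface ConsensusResult {
  verdict: Verdict;
  primary: ResolveResult;
  comparison: ResolveResult;
}

// Hypothetical stand-in for the resolve primitive, parameterized by provider.
type ResolveFn = (provider: LlmProvider, input: string) => Promise<ResolveResult>;

// Pure comparison step. The verdict semantics here are assumed:
// a missing pick or a low-confidence match goes to human review.
function compareResolutions(
  primary: ResolveResult,
  comparison: ResolveResult,
  minConfidence = 0.5, // assumed threshold, not from the codebase
): Verdict {
  if (primary.chosen_id === null || comparison.chosen_id === null) return "needs_review";
  if (primary.chosen_id !== comparison.chosen_id) return "disagree";
  if (primary.confidence < minConfidence || comparison.confidence < minConfidence) {
    return "needs_review";
  }
  return "agree";
}

async function resolveConsensus(
  resolve: ResolveFn,
  input: string,
  provider: LlmProvider,
  comparisonProvider: LlmProvider,
): Promise<ConsensusResult> {
  // The gate is only meaningful if the two calls hit different providers.
  if (provider === comparisonProvider) {
    throw new Error("resolveConsensus requires two different providers");
  }
  const [primary, comparison] = await Promise.all([
    resolve(provider, input),
    resolve(comparisonProvider, input),
  ]);
  return { verdict: compareResolutions(primary, comparison), primary, comparison };
}
```

Keeping the comparison in a pure function makes the verdict logic testable without any provider calls. Rejecting identical providers at the call site means a misconfigured caller fails loudly instead of silently producing same-provider consensus.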

Implementation path

1. No new infrastructure is needed for vision tasks: visionConsensus already enforces the gate.
2. The resolve primitive needs a consensus wrapper: add resolveConsensus that mirrors visionConsensus. It calls resolve with two different providers, compares chosen_id, and returns agree / disagree / needs_review.
3. gmail-financial-pipeline contact resolution (line 1221) should use resolveConsensus instead of resolve when the candidate set is ambiguous (>1 candidate with confidence > 0.3).
4. GT attribution extraction should use cross-provider verification for speaker identity and project attribution, the two highest-error-rate judgment calls in the pipeline.
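
The ambiguity condition in step 3 can be sketched as a small predicate. The `Candidate` shape and the function names are hypothetical. The 0.3 threshold and the "more than one candidate" rule are taken from the step itself.

```typescript
// Hypothetical candidate shape for contact resolution.
interface Candidate {
  id: string;
  confidence: number;
}

// Threshold from step 3: candidates above this confidence count as plausible.
const AMBIGUITY_THRESHOLD = 0.3;

// A candidate set is ambiguous when more than one candidate is plausible.
function isAmbiguous(candidates: Candidate[]): boolean {
  const plausible = candidates.filter((c) => c.confidence > AMBIGUITY_THRESHOLD);
  return plausible.length > 1;
}

// Decide which primitive the pipeline should call for this candidate set.
function chooseResolver(candidates: Candidate[]): "resolve" | "resolveConsensus" {
  return isAmbiguous(candidates) ? "resolveConsensus" : "resolve";
}
```

Routing only ambiguous sets through the consensus path keeps the second provider call off the unambiguous cases, which the gate does not cover anyway.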

Shepherd check

Every shepherd tick that audits IN_PROGRESS work should check:

For any ticket whose subject contains "classify", "identify", "resolve identity", "photo", "receipt read", "attribution", or "perception": does the implementation plan or code use at least 2 different LlmProvider values?

If single-provider: flag as CROSS_PROVIDER_GATE_VIOLATION in the tick output.
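
The check above could be sketched as follows. The `Ticket` shape, the function names, and the case-insensitive substring matching are assumptions. The keyword list and the `CROSS_PROVIDER_GATE_VIOLATION` flag come from this section.

```typescript
type LlmProvider = "openai" | "anthropic" | "google";

// Hypothetical ticket shape; `providers` lists the LlmProvider values
// found in the ticket's implementation plan or code.
interface Ticket {
  id: string;
  subject: string;
  providers: LlmProvider[];
}

// Subject keywords that trigger the gate, as listed in the shepherd check.
const GATE_KEYWORDS = [
  "classify",
  "identify",
  "resolve identity",
  "photo",
  "receipt read",
  "attribution",
  "perception",
];

// Assumed matching rule: case-insensitive substring match on the subject.
function subjectTriggersGate(subject: string): boolean {
  const lowered = subject.toLowerCase();
  return GATE_KEYWORDS.some((keyword) => lowered.includes(keyword));
}

// Returns the violation flag, or null when the ticket passes or is out of scope.
function checkTicket(ticket: Ticket): "CROSS_PROVIDER_GATE_VIOLATION" | null {
  if (!subjectTriggersGate(ticket.subject)) return null;
  const distinctProviders = new Set(ticket.providers);
  return distinctProviders.size >= 2 ? null : "CROSS_PROVIDER_GATE_VIOLATION";
}
```

Counting distinct providers with a `Set` means two calls to the same provider still count as single-provider, which matches the doctrine's point that same-provider consensus is not independent verification.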

Rationale

Yang et al. 2026 demonstrated that 2 diverse agents match the judgment quality of 16 homogeneous ones. The fleet's own bt_photo_consensus_smoke.ts confirms this — it was built specifically because single-provider photo classification produced confident-but-wrong results. The visionConsensus primitive exists because this lesson was already learned for vision. This doctrine generalizes it to all physical-world judgment.

Same-provider scaling amplifies the provider's training biases. Cross-provider disagreement surfaces genuine ambiguity — it is a signal that the input needs human review, not a failure to be averaged away.