Doctrine
ORA-2026-0097: Cross-provider verification gate for physical-world judgment tasks
Doctrine
Any fleet task or pipeline stage that makes a physical-world judgment — interpreting what exists in reality from ambiguous signals — MUST use at least two different LLM providers before treating the result as decided. Same-provider consensus is correlated error, not independent verification.
Gate trigger categories
The gate fires when the task involves any of these judgment types:
| Category | Examples in codebase | Current state |
|---|---|---|
| Photo classification / visual perception | bt_photo_* scripts, visionConsensus primitive | COMPLIANT — visionConsensus already uses comparisonProvider cross-provider by design |
| Physical-world identity resolution | resolve primitive in gmail-financial-pipeline (vendor/contact disambiguation from email text) | NOT COMPLIANT — resolve uses single provider via runStructuredDecision |
| Receipt/document reading | financial_bill_cover_manual_packet_ocr_index.rb, receipt page-image reading | PARTIAL — OCR is mechanical but interpretation of handwritten notes or ambiguous amounts is judgment |
| Ambiguous perception from images | Any vision call where the image content is the evidence (not just formatting) | PARTIAL — vision primitive supports single-provider calls |
| Speaker/entity attribution from transcripts | GT review desk attribution, journal-extract speaker identification | NOT COMPLIANT — single-provider extraction |
Gate does NOT fire for
- Structured text extraction where the answer is deterministic (parsing a clearly printed invoice number)
- Classification against a closed vocabulary where the categories are unambiguous (mapping a known vendor name to a cost code)
- Mechanical parallel reads of code, documents, or data where the content is unambiguous
- Summarization where fidelity can be verified by reference to source text
- Rules-based detection (e.g., drift_detector.ts, which is deterministic and uses no LLM)
Existing infrastructure
The codebase already has the infrastructure for cross-provider judgment:
- LlmProvider type: "openai" | "anthropic" | "google" (_shared/llm/types.ts)
- visionConsensus primitive (_shared/llm/primitives.ts:795): takes comparisonProvider and comparisonModel, runs both providers, returns agree / disagree / needs_review
- DEFAULT_MODELS map in bt_photo_consensus_smoke.ts: anthropic→claude-haiku-4-5, openai→gpt-4o-mini, google→gemini-2.5-flash
What's missing: an equivalent resolveConsensus wrapper around the resolve primitive that mirrors the visionConsensus pattern.
Implementation path
1. No new infrastructure is needed for vision tasks: visionConsensus already enforces the gate.
2. The resolve primitive needs a consensus wrapper. Add resolveConsensus, mirroring visionConsensus: call resolve with two different providers, compare chosen_id, and return agree / disagree / needs_review.
3. gmail-financial-pipeline contact resolution (line 1221) should use resolveConsensus instead of resolve when the candidate set is ambiguous (>1 candidate with confidence > 0.3).
4. GT attribution extraction should use cross-provider verification for speaker identity and project attribution, the two highest-error-rate judgment calls in the pipeline.
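The wrapper and the ambiguity test described in the implementation path can be sketched roughly as follows. This is a minimal illustration, not the real primitive: the actual signature of resolve is not shown here, so `ResolveFn`, `ResolveResult`, `Candidate`, `isAmbiguous`, and `compareChoices` are hypothetical names, and the rule that a missing chosen_id maps to needs_review is an assumption.

```typescript
// Sketch under stated assumptions: only LlmProvider, chosen_id, and the three
// verdict strings come from the doctrine. Every other name is hypothetical.
type LlmProvider = "openai" | "anthropic" | "google";
type ConsensusVerdict = "agree" | "disagree" | "needs_review";

interface Candidate {
  id: string;
  confidence: number;
}

interface ResolveResult {
  chosen_id: string | null; // null: the provider declined to pick (assumed shape)
}

// Assumed stand-in for the resolve primitive, injected so the sketch is self-contained.
type ResolveFn = (provider: LlmProvider, input: string) => Promise<ResolveResult>;

// The candidate set is ambiguous when more than one candidate has confidence > 0.3.
function isAmbiguous(candidates: Candidate[], threshold = 0.3): boolean {
  return candidates.filter((c) => c.confidence > threshold).length > 1;
}

// Compare the chosen_id returned by each provider.
// Assumption: if either provider declines to choose, route to needs_review.
function compareChoices(primary: ResolveResult, comparison: ResolveResult): ConsensusVerdict {
  if (primary.chosen_id === null || comparison.chosen_id === null) {
    return "needs_review";
  }
  return primary.chosen_id === comparison.chosen_id ? "agree" : "disagree";
}

// Run resolve once per provider and report whether the two providers agree.
async function resolveConsensus(
  resolve: ResolveFn,
  input: string,
  provider: LlmProvider,
  comparisonProvider: LlmProvider,
): Promise<{ verdict: ConsensusVerdict; primary: ResolveResult; comparison: ResolveResult }> {
  if (provider === comparisonProvider) {
    // Same-provider consensus is correlated error, so refuse it outright.
    throw new Error("resolveConsensus requires two different providers");
  }
  const [primary, comparison] = await Promise.all([
    resolve(provider, input),
    resolve(comparisonProvider, input),
  ]);
  return { verdict: compareChoices(primary, comparison), primary, comparison };
}
```

A caller in the contact-resolution path might check `isAmbiguous(candidates)` first and fall back to a plain single-provider resolve when it returns false, so the second provider is only paid for when the gate actually fires.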
Shepherd check
Every shepherd tick that audits IN_PROGRESS work should check:
For any ticket whose subject contains "classify", "identify", "resolve identity", "photo", "receipt read", "attribution", or "perception": does the implementation plan or code use at least 2 different LlmProvider values?
If single-provider: flag as CROSS_PROVIDER_GATE_VIOLATION in the tick output.
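As an illustration, the check above could be expressed as a small pure function. The `Ticket` shape, the `providers` field (the providers named in the plan or code), and the output format after the flag name are assumptions made for this sketch. Only the keyword list and the CROSS_PROVIDER_GATE_VIOLATION flag come from the doctrine.

```typescript
type LlmProvider = "openai" | "anthropic" | "google";

// Hypothetical ticket shape; the real shepherd data model is not shown here.
interface Ticket {
  id: string;
  subject: string;
  providers: LlmProvider[]; // providers found in the implementation plan or code
}

// Subject keywords that trigger the gate, taken from the shepherd check.
const GATE_KEYWORDS = [
  "classify",
  "identify",
  "resolve identity",
  "photo",
  "receipt read",
  "attribution",
  "perception",
];

// Case-insensitive substring match against the ticket subject.
function triggersGate(subject: string): boolean {
  const lowered = subject.toLowerCase();
  return GATE_KEYWORDS.some((keyword) => lowered.includes(keyword));
}

// Returns a violation line for the tick output, or null when the ticket passes
// or the gate does not apply.
function checkTicket(ticket: Ticket): string | null {
  if (!triggersGate(ticket.subject)) {
    return null;
  }
  const distinctProviders = new Set(ticket.providers);
  return distinctProviders.size >= 2
    ? null
    : `CROSS_PROVIDER_GATE_VIOLATION: ${ticket.id}`;
}
```

Counting distinct providers rather than total calls matters here: two calls to the same provider still count as one, which is the case the doctrine is meant to catch.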
Rationale
Yang et al. 2026 demonstrated that 2 diverse agents match the judgment quality of 16 homogeneous ones. The fleet's own bt_photo_consensus_smoke.ts confirms this — it was built specifically because single-provider photo classification produced confident-but-wrong results. The visionConsensus primitive exists because this lesson was already learned for vision. This doctrine generalizes it to all physical-world judgment.
Same-provider scaling amplifies the provider's training biases. Cross-provider disagreement surfaces genuine ambiguity — it is a signal that the input needs human review, not a failure to be averaged away.