Doctrine

ORA-2026-0097: Cross-provider verification gate for physical-world judgment tasks

Tags: chromatic, physical-world, verification

Doctrine

Any fleet task or pipeline stage that makes a physical-world judgment — interpreting what exists in reality from ambiguous signals — MUST use at least two different LLM providers before treating the result as decided. Same-provider consensus is correlated error, not independent verification.

Gate trigger categories

The gate fires when the task involves any of these judgment types:

| Category | Examples in codebase | Current state |
| --- | --- | --- |
| Photo classification / visual perception | bt_photo_* scripts, visionConsensus primitive | COMPLIANT: visionConsensus already uses comparisonProvider cross-provider by design |
| Physical-world identity resolution | resolve primitive in gmail-financial-pipeline (vendor/contact disambiguation from email text) | NOT COMPLIANT: resolve uses single provider via runStructuredDecision |
| Receipt/document reading | financial_bill_cover_manual_packet_ocr_index.rb, receipt page-image reading | PARTIAL: OCR is mechanical, but interpretation of handwritten notes or ambiguous amounts is judgment |
| Ambiguous perception from images | Any vision call where the image content is the evidence (not just formatting) | PARTIAL: vision primitive supports single-provider calls |
| Speaker/entity attribution from transcripts | GT review desk attribution, journal-extract speaker identification | NOT COMPLIANT: single-provider extraction |

Gate does NOT fire for

  • Structured text extraction where the answer is deterministic (parsing a clearly printed invoice number)
  • Classification against a closed vocabulary where the categories are unambiguous (mapping a known vendor name to a cost code)
  • Mechanical parallel reads of code, documents, or data where the content is unambiguous
  • Summarization where fidelity can be verified by reference to source text
  • Rules-based detection (e.g., drift_detector.ts — deterministic, no LLM)

Existing infrastructure

The codebase already has the infrastructure for cross-provider judgment:

  • LlmProvider type: "openai" | "anthropic" | "google" (_shared/llm/types.ts)
  • visionConsensus primitive (_shared/llm/primitives.ts:795): takes comparisonProvider and comparisonModel, runs both providers, returns agree / disagree / needs_review
  • DEFAULT_MODELS map in bt_photo_consensus_smoke.ts: anthropic→claude-haiku-4-5, openai→gpt-4o-mini, google→gemini-2.5-flash

What's missing: an equivalent resolveConsensus wrapper around the resolve primitive that mirrors the visionConsensus pattern.
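A minimal sketch of what such a wrapper could look like. Only the LlmProvider union, the chosen_id field, and the agree / disagree / needs_review outcomes come from this doctrine. The ResolveResult shape, the ResolveFn signature, and the rule that a null chosen_id maps to needs_review are assumptions for illustration, not the actual signatures in _shared/llm/primitives.ts.

```typescript
// Provider union as described for _shared/llm/types.ts.
type LlmProvider = "openai" | "anthropic" | "google";

// Hypothetical result of one resolve call; the real shape may differ.
interface ResolveResult {
  chosen_id: string | null; // null when the provider declines to choose
  confidence: number; // assumed to lie in [0, 1]
}

type ConsensusStatus = "agree" | "disagree" | "needs_review";

interface ResolveConsensusResult {
  status: ConsensusStatus;
  chosen_id: string | null; // set only when both providers agree
  primary: ResolveResult;
  comparison: ResolveResult;
}

// Hypothetical stand-in for the existing resolve primitive.
type ResolveFn = (input: string, provider: LlmProvider) => Promise<ResolveResult>;

// Pure comparison step: only a shared, non-null chosen_id counts as agreement.
function compareResolutions(
  primary: ResolveResult,
  comparison: ResolveResult,
): ResolveConsensusResult {
  let status: ConsensusStatus;
  if (primary.chosen_id === null || comparison.chosen_id === null) {
    status = "needs_review"; // assumption: an abstention goes to a human
  } else if (primary.chosen_id === comparison.chosen_id) {
    status = "agree";
  } else {
    status = "disagree";
  }
  return {
    status,
    chosen_id: status === "agree" ? primary.chosen_id : null,
    primary,
    comparison,
  };
}

// Runs the same input through two different providers and compares chosen_id.
async function resolveConsensus(
  resolve: ResolveFn,
  input: string,
  provider: LlmProvider,
  comparisonProvider: LlmProvider,
): Promise<ResolveConsensusResult> {
  if (provider === comparisonProvider) {
    // Same-provider consensus is correlated error, so refuse it outright.
    throw new Error("resolveConsensus requires two different providers");
  }
  const [primary, comparison] = await Promise.all([
    resolve(input, provider),
    resolve(input, comparisonProvider),
  ]);
  return compareResolutions(primary, comparison);
}
```

Keeping the comparison in a pure function separate from the provider calls makes the agreement rule testable without network access, and rejecting identical providers at the call site puts the gate in code, not in convention.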

Implementation path

1. No new infrastructure needed for vision tasks: visionConsensus already enforces the gate.
2. The resolve primitive needs a consensus wrapper: add resolveConsensus that mirrors visionConsensus (call resolve with two different providers, compare chosen_id, return agree / disagree / needs_review).
3. gmail-financial-pipeline contact resolution (line 1221) should use resolveConsensus instead of resolve when the candidate set is ambiguous (>1 candidate with confidence > 0.3).
4. GT attribution extraction should use cross-provider verification for speaker identity and project attribution, the two highest-error-rate judgment calls in the pipeline.
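The ambiguity trigger in step 3 can be sketched as a small predicate. The 0.3 threshold and the "more than one candidate" rule come from the text above. The Candidate shape and the function names are hypothetical, and the comparison is read as strictly greater than 0.3.

```typescript
// Hypothetical candidate shape; the pipeline's real type may differ.
interface Candidate {
  id: string;
  confidence: number; // assumed to lie in [0, 1]
}

// Threshold taken from the doctrine text: confidence > 0.3.
const AMBIGUITY_THRESHOLD = 0.3;

// A candidate set is ambiguous when more than one candidate clears the threshold.
function isAmbiguous(
  candidates: Candidate[],
  threshold: number = AMBIGUITY_THRESHOLD,
): boolean {
  const plausible = candidates.filter((c) => c.confidence > threshold);
  return plausible.length > 1;
}

// Decide which primitive a call site should use for this candidate set.
function resolutionRoute(candidates: Candidate[]): "resolveConsensus" | "resolve" {
  return isAmbiguous(candidates) ? "resolveConsensus" : "resolve";
}
```

Under this reading, a set with confidences 0.8 and 0.35 is routed to resolveConsensus, while 0.8 and 0.3 stays on the single-provider resolve path because 0.3 does not exceed the threshold.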

Shepherd check

Every shepherd tick that audits IN_PROGRESS work should check:

For any ticket whose subject contains "classify", "identify", "resolve identity", "photo", "receipt read", "attribution", or "perception": does the implementation plan or code use at least 2 different LlmProvider values?

If single-provider: flag as CROSS_PROVIDER_GATE_VIOLATION in the tick output.
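The check above could be mechanized roughly as follows. The keyword list and the CROSS_PROVIDER_GATE_VIOLATION flag are taken from this section. The TicketAudit shape (a subject plus the providers found in the plan or code) and the case-insensitive substring match are assumptions about how a shepherd tick would gather and compare its inputs.

```typescript
type LlmProvider = "openai" | "anthropic" | "google";

// Subject keywords that trigger the gate, as listed in the shepherd check.
const GATE_KEYWORDS: readonly string[] = [
  "classify",
  "identify",
  "resolve identity",
  "photo",
  "receipt read",
  "attribution",
  "perception",
];

// Hypothetical audit input: how providers are discovered is out of scope here.
interface TicketAudit {
  subject: string;
  providers: LlmProvider[]; // providers found in the implementation plan or code
}

type GateFlag = "CROSS_PROVIDER_GATE_VIOLATION" | null;

// Returns the violation flag when a gated ticket uses fewer than 2 distinct providers.
function checkCrossProviderGate(ticket: TicketAudit): GateFlag {
  const subject = ticket.subject.toLowerCase();
  const gated = GATE_KEYWORDS.some((keyword) => subject.includes(keyword));
  if (!gated) {
    return null; // the gate does not apply to this ticket
  }
  const distinctProviders = new Set(ticket.providers);
  return distinctProviders.size >= 2 ? null : "CROSS_PROVIDER_GATE_VIOLATION";
}
```

Counting distinct providers, not provider entries, matters here: two calls to the same provider still count as one and are flagged.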

Rationale

Yang et al. 2026 demonstrated that 2 diverse agents match the judgment quality of 16 homogeneous ones. The fleet's own bt_photo_consensus_smoke.ts confirms this — it was built specifically because single-provider photo classification produced confident-but-wrong results. The visionConsensus primitive exists because this lesson was already learned for vision. This doctrine generalizes it to all physical-world judgment.

Same-provider scaling amplifies the provider's training biases. Cross-provider disagreement surfaces genuine ambiguity — it is a signal that the input needs human review, not a failure to be averaged away.