Caio Pizzol

Every AI model has different blind spots

I run every code review through at least two AI models. Not because one is better than the other. Because they're wrong about different things.

Claude catches UX problems. Codex catches architectural risks. Qwen catches transaction logic neither of them saw. They all sound confident. They're all partially blind.

The interesting part isn't which model is "best." It's what happens when you compare their outputs side by side.

The experiment

I built a small tool called Conclave that runs the same prompt through multiple AI CLIs in parallel and collects structured results. The core is a shell script. Nothing fancy.

The workflow: take a git diff, send it to Codex and Claude Opus simultaneously, get back two independent reviews. Sometimes I add a third model - Qwen, Gemini, whatever is available. The results come back as JSON. I compare them.
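The fan-out itself is tiny. Here's a minimal Python sketch of the same shape - Conclave's actual core is a shell script, and the `echo` commands below are stand-ins for real CLI invocations, whose names and flags differ per tool:

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Stand-in commands: each just echoes a canned JSON review.
# Substitute whatever review commands your installed CLIs actually accept.
COMMANDS = {
    "codex": ["echo", '{"findings": ["schema validity concerns"]}'],
    "claude": ["echo", '{"findings": ["visual test has no assertions"]}'],
}

def run_one(name, cmd):
    """Run one CLI, parse its stdout as JSON, tag the result with the model name."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    result = json.loads(out)
    result["model"] = name
    return result

def run_all(commands):
    """Send the same prompt to every CLI in parallel and collect all results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_one, name, cmd) for name, cmd in commands.items()]
        return [f.result() for f in futures]
```

The thread pool is enough here because the work is waiting on subprocesses, not computing anything; each model's latency overlaps instead of stacking.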

I've been doing this on real pull requests for months. The patterns are consistent.

What different models see

Here's a real example. A teammate submitted a fix for a tracked-changes bug in our document editor. The diff touched ProseMirror transaction logic, text selection handling, and visual test helpers. Medium-sized PR, maybe 200 lines changed.

Claude Opus flagged:

  • Block-level deletion tracking was silently dropped
  • TextSelection.near could land the cursor inside deleted content
  • A visual test had no assertions - it would pass no matter what
  • Zero-width node guards were missing

Codex flagged:

  • If the first run was wrapped in an inline node, paragraph defaults would stop updating
  • Missing integration test for the actual user-reported scenario (select-all + font formatting)
  • Schema validity concerns around run nesting

Qwen flagged:

  • Incomplete node reassignment - mark reuse across ranges wasn't handled
  • The removed positionAdjusted flag broke deletion span logic
  • Caret jumping on delete in specific sequences

Three models. Almost zero overlap in what they found. Claude worried about user-facing consequences. Codex worried about architectural fragility. Qwen worried about lower-level transaction mechanics.

None of them were wrong. All of them were incomplete.

Why this happens

Different training data produces different knowledge bases. This isn't surprising, but the practical impact is bigger than most people expect.

Codex is trained heavily on GitHub repositories. It's strong on ProseMirror internals, position mapping, mark semantics. It tends to ask "what could break in edge cases?"

Claude is trained on a broader mix of internet text. It's better at assessing practical user impact - "does this actually affect someone?" It catches test quality problems that Codex ignores.

Qwen, trained on a different corpus entirely, sometimes flags logic errors that both Claude and Codex miss. On this specific PR, Qwen was the only model that caught the root cause - the fix addressed the symptom but not the underlying setMark behavior that triggered the bug.

This isn't a ranking. It's a map of blind spots. Each model has confident opinions about what it knows and silence about what it doesn't.

The consensus rule

After running hundreds of these comparisons, one pattern holds:

When two or more models independently flag the same issue, it's almost always a real problem. The false positive rate on consensus findings is very low. Different training data arriving at the same conclusion is a strong signal.

When only one model flags something, treat it with skepticism. It might be a genuine catch the others missed. It might be a hallucination. It might be the model overfitting to patterns in its training data that don't apply here.

In practice, I tag every finding with which model raised it. Consensus items go straight to "must address." Solo findings get a quick manual check before I spend time on them.

This simple filter cuts the noise in half without losing real bugs.
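The filter reduces to counting distinct model tags per issue. A sketch, assuming findings have already been normalized so that the same problem gets the same key across models (real model outputs phrase things differently and need that normalization step first):

```python
from collections import defaultdict

def triage(findings):
    """Split findings into consensus vs. solo by how many models flagged each.

    `findings` is a list of (model, issue) pairs. Issues are assumed to be
    normalized: the same underlying problem must map to the same key.
    """
    models_per_issue = defaultdict(set)
    for model, issue in findings:
        models_per_issue[issue].add(model)

    consensus = [i for i, m in models_per_issue.items() if len(m) >= 2]
    solo = [i for i, m in models_per_issue.items() if len(m) == 1]
    return consensus, solo
```

Consensus items go straight to the "must address" pile; solo items get the manual gut check described above.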

The practical workflow

This isn't theory. Here's how it actually works on a day-to-day basis:

  1. PR comes in. I run the diff through Codex and Opus in parallel - three passes each (correctness, developer experience, test coverage). Six reviews total.

  2. I group findings by concern area. Anything flagged by both models gets marked as high confidence. Solo findings get noted but weighted lower.

  3. For consensus findings, I investigate directly. Sometimes that means adding debug logging and reproducing in the dev app. Sometimes the finding is obvious enough to act on immediately.

  4. Solo findings get a 30-second gut check. If it sounds plausible, I investigate. If it sounds like the model is pattern-matching on something irrelevant, I skip it.

  5. PR comments reference which model found what. "Both Codex and Opus flagged this" carries more weight with the team than "the AI said so."
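Steps 2 and 5 are just bookkeeping over model tags. A hypothetical sketch of that grouping and rendering - the field names are illustrative, not Conclave's actual output schema:

```python
from collections import defaultdict

def report(findings):
    """Group tagged findings by concern area and render attribution lines.

    `findings` is a list of {"area", "issue", "model"} dicts. Issues flagged
    by two or more models are marked high confidence; the rest are solo.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for f in findings:
        grouped[f["area"]][f["issue"]].add(f["model"])

    lines = []
    for area, issues in grouped.items():
        lines.append(f"## {area}")
        for issue, models in issues.items():
            names = sorted(models)
            if len(names) >= 2:
                tag = "high confidence: " + " and ".join(names)
            else:
                tag = f"solo: {names[0]}"
            lines.append(f"- [{tag}] {issue}")
    return "\n".join(lines)
```

The attribution line is the part the team actually reads: two model names on one finding settles most "is this worth fixing?" debates before they start.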

The whole process adds maybe 5 minutes to a review. The parallel execution means I'm not waiting for sequential model responses. The structured output means I'm comparing, not reading two walls of text.

What this changes

Single-model code review feels like having one very smart colleague who never admits uncertainty. Multi-model review feels like a panel where disagreements reveal what none of them fully understand.

The consensus items are the most valuable output. Not because any individual model is reliable - but because independent agreement from different knowledge bases is a fundamentally different kind of signal than a single confident answer.

The models will get better. The blind spots will shift. But the principle stays the same: if you're relying on one perspective, you're missing things. Karpathy calls this an LLM council - a panel of models answering the same question and weighing in on the result. The intuition holds whether you're evaluating vibes or reviewing code: run the same question through models trained on different data and see where the answers converge. That's where the real issues are.