Caio Pizzol

Every AI model has different blind spots

I run every code review through at least two AI models. Not because one is better than the other. Because they're wrong about different things.

Claude catches UX problems. Codex catches architectural risks. Qwen catches transaction logic neither of them saw. They all sound confident. They're all partially blind.

The interesting part isn't which model is "best." It's what happens when you compare their outputs side by side.

The experiment

I built a small tool called Conclave that runs the same prompt through multiple AI CLIs in parallel and collects structured results. The core is a shell script. Nothing fancy.

The workflow: take a git diff, send it to Codex and Claude Opus simultaneously, get back two independent reviews. Sometimes I add a third model - Qwen, Gemini, whatever is available. The results come back as JSON. I compare them.
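The fan-out itself is tiny. Here's a minimal Python sketch of the same shape - Conclave's actual core is a shell script, and the `echo` commands below are stand-ins for real CLI invocations, whose names and flags differ per tool:

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Stand-in commands: each just echoes a canned JSON review.
# Substitute whatever review commands your installed CLIs actually accept.
COMMANDS = {
    "codex": ["echo", '{"findings": ["schema validity concerns"]}'],
    "claude": ["echo", '{"findings": ["visual test has no assertions"]}'],
}

def run_one(name, cmd):
    """Run one CLI, parse its stdout as JSON, tag the result with the model name."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    result = json.loads(out)
    result["model"] = name
    return result

def run_all(commands):
    """Send the same prompt to every CLI in parallel and collect all results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_one, name, cmd) for name, cmd in commands.items()]
        return [f.result() for f in futures]
```

The thread pool is enough here because the work is waiting on subprocesses, not computing anything; each model's latency overlaps instead of stacking.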

I've been doing this on real pull requests for months. The patterns are consistent.

What different models see

Here's a real example. A teammate submitted a fix for a tracked-changes bug in our document editor. The diff touched ProseMirror transaction logic, text selection handling, and visual test helpers. Medium-sized PR, maybe 200 lines changed.

Claude Opus flagged:

  • Block-level deletion tracking was silently dropped
  • TextSelection.near could land the cursor inside deleted content
  • A visual test had no assertions - it would pass no matter what
  • Zero-width node guards were missing

Codex flagged:

  • If the first run was wrapped in an inline node, paragraph defaults would stop updating
  • Missing integration test for the actual user-reported scenario (select-all + font formatting)
  • Schema validity concerns around run nesting

Qwen flagged:

  • Incomplete node reassignment - mark reuse across ranges wasn't handled
  • The removed positionAdjusted flag broke deletion span logic
  • Caret jumping on delete in specific sequences

Three models. Almost zero overlap in what they found. Claude worried about user-facing consequences. Codex worried about architectural fragility. Qwen worried about lower-level transaction mechanics.

None of them were wrong. All of them were incomplete.

Why this happens

Different training data produces different knowledge bases. This isn't surprising, but the practical impact is bigger than most people expect.

Codex is trained heavily on GitHub repositories. It's strong on ProseMirror internals, position mapping, mark semantics. It tends to ask "what could break in edge cases?"

Claude is trained on a broader mix of internet text. It's better at assessing practical user impact - "does this actually affect someone?" It catches test quality problems that Codex ignores.

Qwen, trained on a different corpus entirely, sometimes flags logic errors that both Claude and Codex miss. On this specific PR, Qwen was the only model that caught the root cause - the fix addressed the symptom but not the underlying setMark behavior that triggered the bug.

This isn't a ranking. It's a map of blind spots. Each model has confident opinions about what it knows and silence about what it doesn't.

The consensus rule

After running hundreds of these comparisons, one pattern holds:

When two or more models independently flag the same issue, it's almost always a real problem. The false positive rate on consensus findings is very low. Different training data arriving at the same conclusion is a strong signal.

When only one model flags something, treat it with skepticism. It might be a genuine catch the others missed. It might be a hallucination. It might be the model overfitting to patterns in its training data that don't apply here.

In practice, I tag every finding with which model raised it. Consensus items go straight to "must address." Solo findings get a quick manual check before I spend time on them.

This simple filter cuts the noise in half without losing real bugs.
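The filter reduces to counting distinct model tags per issue. A sketch, assuming findings have already been normalized so that the same problem gets the same key across models (real model outputs phrase things differently and need that normalization step first):

```python
from collections import defaultdict

def triage(findings):
    """Split findings into consensus vs. solo by how many models flagged each.

    `findings` is a list of (model, issue) pairs. Issues are assumed to be
    normalized: the same underlying problem must map to the same key.
    """
    models_per_issue = defaultdict(set)
    for model, issue in findings:
        models_per_issue[issue].add(model)

    consensus = [i for i, m in models_per_issue.items() if len(m) >= 2]
    solo = [i for i, m in models_per_issue.items() if len(m) == 1]
    return consensus, solo
```

Consensus items go straight to the "must address" pile; solo items get the manual gut check described above.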

The practical workflow

This isn't theory. Here's how it actually works on a day-to-day basis:

  1. PR comes in. I run the diff through Codex and Opus in parallel - three passes each (correctness, developer experience, test coverage). Six reviews total.

  2. I group findings by concern area. Anything flagged by both models gets marked as high confidence. Solo findings get noted but weighted lower.

  3. For consensus findings, I investigate directly. Sometimes that means adding debug logging and reproducing in the dev app. Sometimes the finding is obvious enough to act on immediately.

  4. Solo findings get a 30-second gut check. If it sounds plausible, I investigate. If it sounds like the model is pattern-matching on something irrelevant, I skip it.

  5. PR comments reference which model found what. "Both Codex and Opus flagged this" carries more weight with the team than "the AI said so."
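Steps 2 and 5 are just bookkeeping over model tags. A hypothetical sketch of that grouping and rendering - the field names are illustrative, not Conclave's actual output schema:

```python
from collections import defaultdict

def report(findings):
    """Group tagged findings by concern area and render attribution lines.

    `findings` is a list of {"area", "issue", "model"} dicts. Issues flagged
    by two or more models are marked high confidence; the rest are solo.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for f in findings:
        grouped[f["area"]][f["issue"]].add(f["model"])

    lines = []
    for area, issues in grouped.items():
        lines.append(f"## {area}")
        for issue, models in issues.items():
            names = sorted(models)
            if len(names) >= 2:
                tag = "high confidence: " + " and ".join(names)
            else:
                tag = f"solo: {names[0]}"
            lines.append(f"- [{tag}] {issue}")
    return "\n".join(lines)
```

The attribution line is the part the team actually reads: two model names on one finding settles most "is this worth fixing?" debates before they start.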

The whole process adds maybe 5 minutes to a review. The parallel execution means I'm not waiting for sequential model responses. The structured output means I'm comparing, not reading two walls of text.

What this changes

Single-model code review feels like having one very smart colleague who never admits uncertainty. Multi-model review feels like a panel where disagreements reveal what none of them fully understand.

The consensus items are the most valuable output. Not because any individual model is reliable - but because independent agreement from different knowledge bases is a fundamentally different kind of signal than a single confident answer.

The models will get better. The blind spots will shift. But the principle stays the same: if you're relying on one perspective, you're missing things. Karpathy calls this an LLM council - a panel of models answering the same question and weighing in on the result. The intuition holds whether you're evaluating vibes or reviewing code: run the same question through models trained on different data and see where the answers converge. That's where the real issues are.