LLMs Don't Reason, They Remember
LLMs don't reason, they remember what reasoning looked like. That's powerful within known domains, brittle outside them. Knowing where that boundary lies—and building systems that extend it—is the real skill now.
I spend a lot of time thinking about current model capabilities and how they'll likely improve in the short to medium term. This matters when considering how best to leverage AI for real-world business tasks—both to ensure a successful project now and to build something that isn't obsolete within months of release.
For the last year, nearly all capability gains have centred on reasoning improvements: the ability of these models to "think" through a problem rather than blurting out an answer in one shot. Early on, I made the mistake of treating this reasoning as a form of System 2 processing, the slow, deliberate thinking from Kahneman's Thinking, Fast and Slow.
That was wrong. And the error has important ramifications for how we factor in model improvements over the next year, and for the opportunities those improvements present.
This week, ARC Prize released their end-of-year review of what I consider the most rigorous benchmark for general intelligence available. Their findings crystallised something I've been circling for months. They call it Knowledge-Reasoning Coupling. I'm going to call it something more direct: LLM reasoning is System 1, not System 2.
Understanding this changes how you build with AI.
What LLMs Actually Do When They 'Think'
When humans reason through a novel problem, we do System 2 work: we search the solution space, hit dead ends, backtrack, verify against reality. It's slow and effortful because we're actually exploring.
When GPT-5.1 or Gemini 3 Pro "reasons," it's doing something different, but not nothing. The chain of thought you see at inference isn't live exploration: the weights are frozen, and what they hold is a compression of search that happened somewhere else. More on where shortly.
The result is powerful. These models can generalise within their training distribution—combining patterns in novel ways, interpolating between examples, applying heuristics they've seen work before. This is why they excel at coding, math, and analysis: domains with massive training coverage where learned patterns transfer.
But this generalisation is bounded. The model hasn't learned to search. It's learned what successful searches found. It can recombine and extend patterns—but only patterns that exist in its weights. When a problem falls outside that coverage, the generalisation breaks down.
It's the difference between a musician sight-reading an unseen piece (System 2) and improvising within a style they've mastered (powerful, but bounded System 1). The base model—even a frontier one—is doing the latter. Superhuman within its domain. Brittle outside it.
Products like Gemini 3 Deep Think add a layer on top: running multiple sessions in parallel, selecting the best path, iterating toward solutions. That's weak System 2—but it's happening in the harness, not the model. The weights aren't searching. The wrapper is.
Where the System 2 Actually Happened
If the model isn't doing System 2 at inference, where did the reasoning come from?
It happened during training.
The training harness is System 2—genuine search across millions of paths, backtracking from dead ends, reward signals shaping which routes survive. That's real exploration. But once training stops, the weights freeze. What remains is a compression of that search into instant recall.
The model doesn't learn to search. It learns what successful searches found. System 2 in, System 1 out.
The manifesto calls AI "pattern-matching without intent." Now I understand why: there's no intent because there's no active search. Just retrieval of learned paths.
The Evidence
The ARC-AGI benchmark was designed to test exactly this. Each puzzle is unique—unlike anything in the training data. If these models possessed a general reasoning engine, they should apply it equally well to coding problems (familiar) and ARC puzzles (novel).
They don't.
On coding and math, Claude Opus 4.5 and Gemini 3 Pro are superhuman. They've seen the movie before. On ARC-AGI, the raw models collapse. The bounded generalisation fails, and you get confident, verbose hallucinations.
The ARC analysis found something telling: Gemini 3 Deep Think's reasoning traces included correct ARC colour mappings—Green is 3, Magenta is 6—despite the verification harness never mentioning them. The model had memorised ARC-specific patterns from training data. Even on a benchmark designed to be novel, it was pattern-matching from memory, not reasoning from first principles.
The ARC analysis names this the Knowledge-Reasoning Coupling problem:
"Current AI reasoning performance is tied to model knowledge... Human reasoning capability is not bound to knowledge."
In humans, logic is decoupled from trivia. You can reason through a puzzle you've never seen. In LLMs, reasoning and knowledge are fused. The model can only "reason" through paths that exist in its weights.
This is why AI capability is so jagged—godlike on known domains, sub-human on novel ones.
Where the Thinking Actually Happens Now
So if the model isn't doing System 2, and the training-time search already happened, where does the System 2 come from at inference time?
It's in the harness.
When you use ChatGPT Pro or Gemini 3 Deep Think, you're not just talking to a model. You're interacting with a System 2 wrapper—an inference-time search loop that forces the model to generate multiple options, verify them (often with code execution or a separate critic), and backtrack when paths fail.
The ARC team calls 2025 the "Year of the Refinement Loop." These loops—explore, verify, iterate—are System 2 by another name. The model provides superhuman pattern-matching. The harness provides search, verification, and feedback. Together, they solve problems neither could alone.
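To make that concrete, here's a minimal sketch of what a refinement loop can look like at the application layer. Everything in it is an assumption for illustration: `llm` stands in for whichever model client you call, `verify` is whatever external check your domain allows (unit tests, schema validation, a separate critic), and none of it is any provider's actual API.

```python
# A minimal refinement-loop sketch: generate candidates, verify them against an
# external check, feed failures back in, and escalate if nothing survives.
# Names and signatures are illustrative placeholders.

def refinement_loop(task: str, llm, verify, n_candidates: int = 4, max_rounds: int = 3):
    feedback = ""
    for _ in range(max_rounds):
        # Explore: the model supplies pattern-matched candidates (System 1).
        prompt = f"{task}\n\nFeedback from the last round:\n{feedback}" if feedback else task
        candidates = [llm(prompt) for _ in range(n_candidates)]

        # Verify: the harness checks each candidate against something external (System 2).
        results = [(c, verify(c)) for c in candidates]       # verify returns (ok, message)
        passing = [c for c, (ok, _) in results if ok]
        if passing:
            return passing[0]  # A candidate survived verification; stop searching.

        # Backtrack: turn the failures into feedback for the next round.
        feedback = "\n".join(msg for _, (ok, msg) in results if not ok)

    return None  # Nothing passed; escalate to a human.
```

Note where the work sits: the model call is cheap pattern-matching, while the candidate count, the verifier, the stopping rule, and the escalation path all live in your code.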
Critically, the ARC analysis confirms you can build this yourself:
"You can add refinement loops at the application layer to meaningfully improve task reliability instead of relying solely on provider reasoning systems."
The intelligence isn't in the model. It's in the system. I'll unpack what this means for how you architect AI solutions in a follow-up.
The Prediction Lens
Understanding this gives you a framework for predicting AI capabilities:
What improves with scale and training (System 1 depth):
- Broader domain knowledge
- Faster pattern recognition
- More sophisticated "intuition" on familiar problems
- Better reasoning traces to recite
What requires architectural innovation (actual System 2):
- Novel problem-solving outside training distribution
- Verification against external reality
- Backtracking and search on genuinely new ground
- Generalisation to unseen domains
Waiting for the next model won't give you System 2. It'll give you deeper, broader System 1—more paths memorised, more domains covered, faster recall. If your task needs actual reasoning on novel problems, you need to build the harness now.
The Decision Lens
The ARC analysis offers a diagnostic I'll be using in projects. Task domains with these two characteristics are now reliably automatable—no new science needed:
- Sufficient knowledge coverage in the foundation model
- Verifiable feedback signal the system can check against
Both present? Automate with confidence. The model's System 1 covers the domain, and you can build a harness to verify outputs.
One missing? You're in experimental territory. Either the model lacks the patterns (knowledge gap) or you can't verify its outputs (feedback gap). Proceed carefully.
Neither? Don't wait for better models. The gap is architectural, not one of model capability.
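If it helps to see the diagnostic as a procedure, here's the same two-question triage as a sketch. The criteria come from the ARC analysis; the function, names, and labels are my own illustrative framing.

```python
# The two ARC criteria as a triage function. Labels and naming are illustrative only.

def triage(knowledge_coverage: bool, verifiable_feedback: bool) -> str:
    if knowledge_coverage and verifiable_feedback:
        return "automate"       # System 1 covers the domain; a harness can check the output.
    if knowledge_coverage or verifiable_feedback:
        return "experiment"     # One gap: missing patterns, or outputs you can't verify.
    return "rearchitect"        # Neither: don't wait for a better model.

# e.g. invoice data extraction with downstream schema and totals checks:
assert triage(knowledge_coverage=True, verifiable_feedback=True) == "automate"
```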
Building for the Alien
The manifesto describes the intelligence-native stack: purpose, business foundation, intelligence infrastructure, work transformation. But you can't build that stack without understanding the alien you're building for.
LLMs are not junior humans who'll get smarter. They're a fundamentally different kind of intelligence—System 1 at inhuman scale, without System 2 unless you provide it.
Design accordingly:
- Your processes become the reasoning. The model generates; your architecture verifies, selects, and iterates. The intelligence is in the loop, not the node.
- Your feedback loops become the learning. Models don't learn at inference time. Your system does, if you instrument it (see the sketch after this list).
- Your judgment becomes the taste. Pattern-matching without intent means generation without evaluation. Humans provide the "is this actually what we want?" that models can't.
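For the feedback-loop point, here's a hedged sketch of what instrumenting the system can look like: log every generation alongside its verification result and the human verdict, then promote accepted outputs into few-shot exemplars for future prompts. The file, schema, and helper names are assumptions, not a prescribed design.

```python
# Illustrative instrumentation: the weights never change, but the system accumulates
# a record of what worked, which the next prompt can draw on.
import json
import time
from pathlib import Path

LOG = Path("outcomes.jsonl")  # hypothetical store; swap for your own infrastructure

def record(task: str, output: str, verified: bool, human_accepted: bool) -> None:
    """Log one generation with its verification result and the human verdict."""
    entry = {"ts": time.time(), "task": task, "output": output,
             "verified": verified, "human_accepted": human_accepted}
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def accepted_examples(limit: int = 3) -> list[dict]:
    """Outputs that passed both checks become few-shot exemplars for future prompts."""
    if not LOG.exists():
        return []
    entries = [json.loads(line) for line in LOG.read_text().splitlines() if line.strip()]
    good = [e for e in entries if e["verified"] and e["human_accepted"]]
    return good[-limit:]
```

The model stays frozen; the system around it gets better with every logged outcome.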
The companies winning with AI aren't waiting for models that think. They're building systems that think around models that don't.
The Shift
We're one year into the "reasoning era," and the reasoning isn't where we thought it was.
The breakthrough isn't inside the model. It's in understanding what the model actually is—superhuman intuition, not deliberate thought—and architecting systems that provide what it lacks.
The near-term future doesn't belong to the smartest model. It belongs to the best system—the one that knows when to let the model use its muscle memory, and when to provide the search it cannot do itself.
That's the intelligence shift: not just better AI, but better architecture for the AI we have right now.
Part of The Intelligence Shift