← back to blog
What 'Reasoning' Actually Means in LLMs


2026-01-20 · 8 min read · ai, llm, engineering, reasoning

OpenAI released o1. Google added 'thinking' to Gemini. Anthropic gave Claude extended thinking. But what does reasoning mean for a system that generates one token at a time? Chain of thought, test-time compute, and the honest assessment.


OpenAI released o1. Then o3. Google released Gemini 2.5 with "thinking." Anthropic released Claude with extended thinking. Everyone started talking about AI that "reasons."

But what does reasoning mean for a system that generates one token at a time?

The Baseline: LLMs Don't Reason

Standard LLMs (GPT-4o, Claude Sonnet, Gemini Flash) don't reason. They predict the next token. If the pattern in the training data suggests that a particular token follows, they produce it.

Ask GPT-4o: "How many R's are in strawberry?"

It might say "2." The correct answer is 3. The model isn't counting. It's predicting what a correct-looking answer to "how many R's" questions looks like. And "2" appears in training data as a common answer to letter-counting questions.
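The contrast is easy to see: counting is a deterministic operation, and a one-line toy sketch shows what the model is *not* doing when it answers.

```python
# Deterministic letter counting -- the operation the model is NOT performing.
# Python counts characters; an LLM predicts a plausible-looking answer.
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # → 3
```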

Chain of Thought: Faking It Until It Works

In 2022, researchers at Google discovered something surprising. If you ask an LLM to "think step by step" before answering, its accuracy on math and logic problems improves dramatically.

Without chain of thought:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 balls. How many balls does he have now?
A: 11

With chain of thought:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 balls. How many balls does he have now?
A: Roger started with 5 balls. He bought 2 cans. Each can has 3 balls.
   So he bought 2 x 3 = 6 balls. Total: 5 + 6 = 11 balls.

Same answer. But for harder problems, the chain-of-thought version gets the right answer where the direct version fails. Why?

Because each intermediate step becomes part of the context for the next step. When the model generates "2 x 3 = 6," that token is now in the context window. The next prediction ("5 + 6 = ...") is conditioned on seeing "6" in the previous step. The model is essentially writing itself a scratchpad.

This is not reasoning. It's autoregressive computation — using the model's own output as working memory. But the practical effect is similar to reasoning for many problems.

The limitation: chain-of-thought only works if the problem can be decomposed into steps the model already knows how to do. "2 x 3" is something the model has seen millions of times. "847 x 293" is not. Chain of thought helps with decomposition, not with computations the model can't do at the individual step level.
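A toy sketch makes the decomposition idea concrete: break "847 x 293" into partial products, each a small operation of the kind the model has seen constantly. This is illustrative only, not how an LLM actually computes.

```python
# Toy sketch of chain-of-thought decomposition: split a hard multiplication
# into partial products, so every step is a small, familiar operation.
def multiply_stepwise(a: int, b: int) -> tuple[int, list[str]]:
    steps = []
    total = 0
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10**place        # e.g. 847*3, 847*90, 847*200
        steps.append(f"{a} x {digit * 10**place} = {partial}")
        total += partial
    steps.append(f"sum of partials = {total}")
    return total, steps

result, chain = multiply_stepwise(847, 293)
print(result)  # → 248171
```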

Test-Time Compute: o1's Actual Innovation

OpenAI's o1 (and later o3) introduced something genuinely different: test-time compute scaling.

Standard LLMs allocate the same amount of compute to every response. Whether you ask "What's 2+2?" or "Prove Fermat's Last Theorem," the model runs the same number of forward passes per token.

o1 changes this. It allocates more compute to harder problems.

How It Works

o1 generates "thinking tokens" — internal reasoning steps that the user doesn't see. Before producing the visible answer, the model generates hundreds or thousands of tokens of internal chain-of-thought reasoning.

User: What is the probability that in a group of 23 people,
      at least two share a birthday?

[Internal thinking - not shown to user]
Step 1: This is the birthday problem. I need P(at least one match).
Step 2: Easier to compute P(no match) and subtract from 1.
Step 3: P(no match) = 365/365 x 364/365 x 363/365 x ... x 343/365
Step 4: That's 365! / (342! x 365^23)
Step 5: Let me compute this...
Step 6: P(no match) ≈ 0.4927
Step 7: P(at least one match) = 1 - 0.4927 = 0.5073

[Visible answer]
The probability is approximately 50.7%.
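The arithmetic in that thinking trace is easy to verify directly:

```python
# Direct computation of the birthday problem sketched in the trace above.
def p_shared_birthday(n: int) -> float:
    """P(at least two of n people share a birthday), assuming a 365-day year."""
    p_no_match = 1.0
    for i in range(n):
        p_no_match *= (365 - i) / 365
    return 1.0 - p_no_match

print(round(p_shared_birthday(23), 4))  # → 0.5073
```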

Those internal thinking tokens consume compute (and money). A question that takes GPT-4o 200 tokens of output might take o1 2,000 tokens of thinking + 200 tokens of visible output. 10x the compute for the same visible response.

Why Thinking Tokens Improve Accuracy

Two mechanisms:

1. More intermediate steps. Each thinking token provides a new piece of context for the next prediction. The model can break a hard problem into many small steps, each of which it can do reliably. The chain of steps gets the right answer even when a single prediction would get it wrong.

2. Self-correction. In the thinking tokens, the model can check its own work. "Wait, that doesn't seem right. Let me recalculate." It generates a wrong intermediate answer, notices the error (because the error pattern is something it's seen in training data), and corrects it. This doesn't happen in standard LLMs because they commit to each token immediately.
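The self-correction loop can be sketched as propose, check, retry. Here `noisy_add` is a stand-in for a model's fallible intermediate step, and the re-derivation check plays the role of "wait, that doesn't seem right" — everything in this sketch is illustrative.

```python
import random

def noisy_add(a: int, b: int, rng: random.Random) -> int:
    """A fallible step: usually right, occasionally off by one."""
    guess = a + b
    if rng.random() < 0.3:
        guess += rng.choice([-1, 1])
    return guess

def add_with_self_check(a: int, b: int, attempts: int = 5, seed: int = 0) -> int:
    """Propose an answer, verify it by re-deriving one step, retry on failure."""
    rng = random.Random(seed)
    for _ in range(attempts):
        guess = noisy_add(a, b, rng)
        if guess - a == b:          # check the work before committing
            return guess
    return guess                    # budget spent; commit anyway

print(add_with_self_check(5, 6))   # verifies its own arithmetic before answering
```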

The Tradeoff: Latency and Cost

| Model | Time to answer a math problem | Cost per problem |
|---|---|---|
| GPT-4o | 1-2 seconds | ~$0.01 |
| o1 | 10-30 seconds | ~$0.10-0.50 |
| o3 | 30-120 seconds | ~$0.50-5.00 |

o3 on a hard math problem might think for two minutes and cost $5. For that problem, it might achieve 90% accuracy where GPT-4o gets 40%.

The question is not "which model is better." It's "does this problem justify 100x the cost and 60x the latency?"

For most problems, no. Scheduling an appointment doesn't need reasoning. Summarizing an email doesn't need reasoning. Answering "what's the weather" doesn't need reasoning.

For hard math, complex code, scientific analysis, legal reasoning, medical diagnosis — problems where getting it wrong has consequences — test-time compute is the right tradeoff.
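That tradeoff can be captured in a few lines of routing logic. The names and thresholds here are illustrative, not any vendor's actual API:

```python
# Hedged routing sketch: send a request to a reasoning model only when the
# task needs it, errors are costly, and the caller can tolerate the latency.
def pick_model(task: str, error_cost: str, latency_budget_s: float) -> str:
    needs_reasoning = task in {"math", "complex_code", "legal", "medical", "science"}
    if needs_reasoning and error_cost == "high" and latency_budget_s >= 10:
        return "reasoning-model"   # e.g. o1 / extended thinking
    return "standard-model"        # e.g. GPT-4o / Sonnet / Flash

print(pick_model("math", "high", 30.0))      # → reasoning-model
print(pick_model("scheduling", "low", 1.0))  # → standard-model
```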

How Reasoning Models Are Trained

Standard LLMs are trained with next-token prediction on text. o1 and similar models add a second training phase.

Phase 1: Pre-training. Same as standard LLMs. Predict the next token on trillions of tokens of text. This gives the model language understanding, world knowledge, and pattern recognition.

Phase 2: Reinforcement learning on reasoning. The model is given problems with verifiable answers (math, code, logic puzzles). It generates reasoning chains. Chains that reach the correct answer are rewarded. Chains that reach wrong answers are penalized. Over thousands of iterations, the model learns reasoning strategies that are more likely to reach correct answers.
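The reward signal in Phase 2 can be sketched in miniature: chains are scored only on whether their final answer matches a verifiable ground truth, with no credit for plausible-looking intermediate steps. This is a toy illustration of the principle, not the actual training setup.

```python
# Toy sketch of an outcome-based reward: 1.0 for a correct final answer,
# 0.0 otherwise. The reasoning chain itself is never graded directly.
def reward(chain_final_answer: str, ground_truth: str) -> float:
    return 1.0 if chain_final_answer.strip() == ground_truth.strip() else 0.0

# Two candidate reasoning chains for "5 + 2*3":
chains = {"5 + 6 = 11": "11", "5 + 5 = 10": "10"}
scores = {chain: reward(ans, "11") for chain, ans in chains.items()}
print(scores)  # the correct chain earns reward; the wrong one earns none
```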

This is different from RLHF (Reinforcement Learning from Human Feedback), which trains models to be helpful, harmless, and honest. Reasoning RL trains models to solve hard problems by developing effective thinking strategies.

The result: the model doesn't just generate plausible-looking reasoning. It generates reasoning that has been optimized (via RL) to actually reach correct answers. The chains of thought are not performative — they're functional.

Claude's Extended Thinking

Anthropic's approach with Claude is similar in concept but different in implementation.

Claude's extended thinking:

  • Generates internal thinking tokens (like o1)
  • The thinking is visible to developers (via API) but can be hidden from users
  • The model explicitly labels when it's thinking vs when it's answering
  • Thinking tokens are priced differently from output tokens

The key difference: Claude's thinking tends to be more interpretable. You can read the thinking tokens and understand the reasoning chain. o1's internal reasoning was initially more opaque (OpenAI didn't show the full chain).

Gemini's Thinking Mode

Google's Gemini 2.5 Pro and Flash include a "thinking" mode that works similarly:

  • Internal reasoning tokens before the visible response
  • Configurable thinking budget (you can set how many thinking tokens the model is allowed to use)
  • Integrated with Gemini's 1M token context window

The thinking budget is a practical feature. You can say "think for up to 5,000 tokens on this problem" — setting an upper bound on cost and latency. For simple questions, the model might use 100 thinking tokens. For hard ones, it uses the full budget.
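The budget behaves like a cap: thinking scales with difficulty up to a hard ceiling. A generic sketch (parameter names here are illustrative, not the actual Gemini API surface):

```python
# Generic sketch of a thinking-token budget: scale effort with difficulty,
# but never exceed the caller's cap on cost and latency.
def thinking_tokens_used(difficulty: float, budget: int) -> int:
    desired = int(difficulty * 10_000)   # harder problems want more thinking
    return min(desired, budget)

print(thinking_tokens_used(0.01, 5_000))  # easy question: 100 tokens
print(thinking_tokens_used(0.90, 5_000))  # hard question: capped at 5,000
```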

When to Use Reasoning Models vs Standard Models

| Use Case | Model Choice | Why |
|---|---|---|
| Chat, support, scheduling | Standard (GPT-4o, Sonnet, Flash) | No reasoning needed. Speed and cost matter. |
| Simple Q&A from documents | Standard + RAG | Retrieval, not reasoning. |
| Math, data analysis | Reasoning (o1, Claude thinking) | Accuracy on computation. |
| Code generation (complex) | Reasoning | Multi-step planning, edge case handling. |
| Medical triage | Reasoning | Getting it wrong has consequences. Audit trail matters. |
| Legal analysis | Reasoning | Complex multi-factor analysis. Interpretability matters. |
| Voice AI (real-time) | Standard (fast models) | Latency is critical. Can't wait 10 seconds for a response. |
| Creative writing | Standard (high temperature) | Creativity, not correctness. |
| Translation | Standard | Pattern matching, not reasoning. |

The Honest Assessment

"Reasoning" in LLMs is not human reasoning. It's a specific technique — generating more tokens of intermediate work before producing an answer — that improves accuracy on problems requiring multi-step logic.

It works because:

  1. More intermediate steps = more context for each subsequent prediction
  2. RL training optimizes the thinking strategy for correctness
  3. Self-correction becomes possible when the model can see its own prior work

It falls short of human reasoning because:

  1. The underlying mechanism is still next-token prediction
  2. Each individual step can still be wrong
  3. The model can't do computations that aren't learnable from patterns (truly novel mathematical proofs, for example)

This is post 7 of the AI Engineering Explained series.

Next post: Tool calling — when LLMs stop talking and start doing. How function calling works, how the model decides when to call a tool, and why this changes everything about what LLMs can do.