
Hallucination, Context Windows, and Why ChatGPT Forgets Your Name
LLMs don't make mistakes — they do exactly what they're designed to do. The problem is that "most probable" and "correct" aren't the same thing. Here's why the mechanism fails and what to do about it.
Last post: how LLMs work. Tokenization, embedding, attention, next-token prediction. The mechanism.
This post: why that mechanism fails. And why the failures are more interesting than the successes.
Hallucination Is Not a Bug
The first thing to understand: LLMs don't "make mistakes." They do exactly what they're designed to do — predict the most probable next token. The problem is that "most probable" and "correct" are not the same thing.
Ask an LLM: "What papers has Dr. Arun Sharma published on cardiac stem cells?"
The model has learned patterns like:
- Questions about researchers -> produce researcher-sounding answers
- Cardiac stem cell research -> associated terms: regeneration, myocardial, differentiation
- Academic papers -> format: "Title (Year), published in Journal"
So it produces: "Dr. Arun Sharma published 'Myocardial Regeneration via Stem Cell Differentiation' (2019) in the Journal of Cardiac Research."
This paper doesn't exist. Dr. Sharma may not exist. The journal might not exist. But every token in that response was the most probable token given the preceding context. The model is doing its job perfectly. It's generating text that looks like a correct answer.
Hallucination is the name we give to the gap between "text that looks correct" and "text that is correct." The model has no mechanism to distinguish between them. It has no fact database. It has no verification step. It produces plausible text, always.
Three Types of Hallucination
Factual hallucination: The model states something that's wrong. Wrong date, wrong name, wrong number. "Einstein was born in 1880" (actual: 1879). This happens because the model is predicting tokens that fit the pattern, not looking up the fact.
Fabrication: The model invents something that doesn't exist. A fake research paper. A fake person. A fake company. This happens when the query asks about something not well-represented in training data, so the model fills the gap with plausible-sounding text.
Logical hallucination: The model follows a reasoning chain that looks valid but reaches a wrong conclusion. "If A implies B, and B implies C, then A implies C" — correct structure, but the model might state A implies B when it doesn't. The reasoning pattern is right. The premises are wrong.
Why You Can't "Fix" Hallucination
This is why every production LLM application needs external verification:
- RAG (retrieval-augmented generation) grounds responses in retrieved documents
- Tool calling lets the model look up facts instead of generating them
- Output validation checks claims against databases
- Human review catches what automated checks miss
The model is the generator. Everything else is the validator.
Context Windows: The Working Memory Constraint
An LLM's context window is how many tokens it can "see" at once. Think of it as working memory.
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens (~96,000 words) |
| Claude Sonnet/Opus | 200K tokens (~150,000 words) |
| Gemini 1.5 Pro | 1M tokens (~750,000 words) |
| GPT-4o-mini | 128K tokens |
| Claude Haiku | 200K tokens |
These numbers look large. But they have hidden constraints.
Cost Scales with Context
Every token in the context window gets processed on every forward pass. More context = more computation = more cost.
Sending a 100,000-token context to GPT-4o costs ~$0.25 for input alone. If the conversation runs for 50 exchanges, each adding to the context, you're reprocessing the entire history on every turn. The cost adds up fast.
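The reprocessing cost compounds quickly. Here's a back-of-the-envelope sketch: it assumes the GPT-4o input rate from the pricing table below ($2.50 per 1M tokens) and an illustrative ~2,000 tokens added per exchange.

```python
# Back-of-the-envelope cost of resending a growing conversation history.
# Assumes GPT-4o input pricing ($2.50 per 1M tokens) and ~2,000 tokens
# added per exchange -- both illustrative numbers, not a real workload.

INPUT_COST_PER_TOKEN = 2.50 / 1_000_000
TOKENS_PER_EXCHANGE = 2_000

def conversation_input_cost(num_exchanges: int) -> float:
    """Total input cost when each turn resends the full history."""
    total = 0.0
    context = 0
    for _ in range(num_exchanges):
        context += TOKENS_PER_EXCHANGE           # history grows every turn
        total += context * INPUT_COST_PER_TOKEN  # whole history is reprocessed
    return total

print(f"50 exchanges: ${conversation_input_cost(50):.2f}")
```

Under these assumptions, a 50-exchange conversation costs about $6.38 in input tokens alone — versus $0.25 if you only paid for the final context once. That multiplier is the whole motivation for the techniques below.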
This is why production systems use techniques to keep context small:
- Summarization (compress old conversation into a few sentences)
- Sliding window (only send the last N exchanges)
- External state (store facts outside context, inject as needed)
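The sliding-window technique is the simplest of the three. A minimal sketch, where the message format and function names are illustrative rather than any particular SDK's API:

```python
# Sliding-window context management: send only the system prompt plus
# the last N messages. The message dict format mirrors common chat APIs,
# but the names here are illustrative.

def build_context(system_prompt: str, history: list[dict], window: int) -> list[dict]:
    """Keep the system prompt plus the most recent `window` messages."""
    return [{"role": "system", "content": system_prompt}] + history[-window:]

history = [{"role": "user", "content": f"message {i}"} for i in range(1, 51)]
context = build_context("You are an assistant.", history, window=6)

print(len(context))           # 7: system prompt + last 6 messages
print(context[1]["content"])  # "message 45" -- everything earlier was dropped
```

Note the trade-off: the system prompt survives every turn, but anything said before the window (like the user's name in message 1) silently disappears — which is exactly the failure mode discussed below.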
Attention Degradation: The "Lost in the Middle" Problem
Here's the non-obvious constraint: LLMs don't pay equal attention to all parts of the context.
Research from Stanford (2023) found that LLMs consistently perform worse on information placed in the middle of the context window. Information at the beginning (primacy) and end (recency) gets disproportionate attention.
If you put a critical fact at position 50,000 in a 100,000-token context, the model is measurably less likely to use it than if the same fact were at position 1,000 or position 99,000.
This is the "Lost in the Middle" effect. It's why:
- System prompts go at the beginning (high attention)
- User's most recent message goes at the end (high attention)
- Old conversation history in the middle gets partially ignored
And it's why your chatbot forgets your name after a long conversation — your name was mentioned at the beginning, but as the context grows, the middle (where your name now sits) gets less attention.
Why "Bigger Context" Doesn't Solve Everything
Gemini offers 1 million tokens of context. You might think: just put everything in context, problem solved.
Three issues:
1. Cost. Processing 1M tokens on every exchange is expensive. At Gemini's pricing, that's several dollars per conversation.
2. Latency. More tokens = slower time-to-first-token. For real-time applications (voice AI, chat), this latency increase is noticeable.
3. Attention degradation still applies. A 1M context window doesn't mean the model perfectly recalls information at position 500,000. It means the "Lost in the Middle" zone is larger.
Why ChatGPT "Forgets" Your Name
You start a conversation. You say "My name is Priya." Twenty messages later, you reference something from the beginning. ChatGPT doesn't know your name anymore.
Here's what actually happened:
- Message 1: Context = system prompt + "My name is Priya." Your name is near the end. High attention. Model knows your name.
- Message 10: Context = system prompt + messages 1-10. Your name is still in context, but it's migrated toward the middle. Attention is lower. Model probably still knows your name.
- Message 30: Context = system prompt + messages 1-30. The context is getting long. Your name is deep in the middle. Attention is low. If the model needs to reference your name, it might miss it.
- Message 50: The context approaches the window limit. The application starts truncating — dropping the oldest messages to fit. Your name was in message 1. Message 1 gets dropped. Your name is no longer in context at all.
There are two separate problems:
- Attention degradation (model sees your name but doesn't attend to it)
- Truncation (your name is removed from context entirely)
How Production Systems Fix This
Approach 1: Entity extraction + injection. After the user provides their name, a background process extracts it and stores it in a session object. On every subsequent turn, the session object is injected into the system prompt (at the beginning — high attention zone):
```
System: You are an assistant. The user's name is Priya.
They are calling about their IVF appointment scheduled for March 22.
```
The name is always at the beginning. Always high attention. Never truncated.
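A minimal sketch of this approach. The regex-based extraction and the `session` dict are deliberately naive stand-ins — a production system would use an LLM or NER model for extraction and a real session store:

```python
# Approach 1 sketched: extract entities once, re-inject them into the
# system prompt on every turn. All names and the naive regex are
# illustrative, not a specific product's API.
import re

session: dict[str, str] = {}

def extract_entities(message: str) -> None:
    """Naive extraction: look for 'my name is <Name>'."""
    match = re.search(r"my name is (\w+)", message, re.IGNORECASE)
    if match:
        session["name"] = match.group(1)

def build_system_prompt(base: str) -> str:
    """Inject known facts at the start of the context (high-attention zone)."""
    facts = ". ".join(f"The user's {k} is {v}" for k, v in session.items())
    return f"{base} {facts}." if facts else base

extract_entities("Hi, my name is Priya.")
print(build_system_prompt("You are an assistant."))
# -> You are an assistant. The user's name is Priya.
```

Because the fact is rewritten into the system prompt on every turn, it never migrates into the low-attention middle and never falls off the end under truncation.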
Approach 2: Summary-based context management. Instead of sending the full message history, periodically summarize old messages:
```
[Summary of messages 1-20: User introduced herself as Priya.
She asked about IVF costs and was quoted Rs 1.5-2.5L.
She expressed concern about success rates.]
[Full messages 21-30: ...]
```
The summary preserves key facts while reducing token count by 80-90%.
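The compaction logic itself is simple; the hard part is the summarization call. In this sketch, `summarize` is a stub where a real system would ask an LLM to compress the old messages:

```python
# Approach 2 sketched: once history exceeds a threshold, collapse the
# oldest messages into a single summary entry. `summarize` is a stub;
# a real system would call an LLM here. Names are illustrative.

def summarize(messages: list[str]) -> str:
    # Placeholder: a real implementation would LLM-compress these.
    return f"[Summary of {len(messages)} earlier messages]"

def compact_history(history: list[str], keep_recent: int = 10) -> list[str]:
    """Summarize everything except the most recent messages."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

history = [f"message {i}" for i in range(1, 31)]
compacted = compact_history(history)
print(len(compacted))  # 11: one summary + the last 10 messages
```

One design choice worth noting: summarization is lossy, so production systems usually trigger it on a token threshold rather than a message count, and keep extracted facts (names, dates, commitments) out of the summary in a structured store instead.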
Approach 3: Memory systems. Some applications (like ChatGPT's memory feature) extract facts from conversation and store them in a persistent database. These facts are injected into future conversations, not just future messages. This is how the "remember my name across sessions" feature works.
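A toy version of such a memory store. JSON-on-disk stands in for whatever database a real memory feature uses; the function names and file path are illustrative:

```python
# Approach 3 sketched: a persistent fact store that survives across
# sessions. JSON-on-disk is a stand-in for a real database; all names
# here are illustrative.
import json
from pathlib import Path

MEMORY_FILE = Path("user_memory.json")

def remember(user_id: str, key: str, value: str) -> None:
    """Persist a fact so future sessions can inject it into context."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory.setdefault(user_id, {})[key] = value
    MEMORY_FILE.write_text(json.dumps(memory))

def recall(user_id: str) -> dict:
    """Load stored facts for a user (empty dict if none)."""
    if not MEMORY_FILE.exists():
        return {}
    return json.loads(MEMORY_FILE.read_text()).get(user_id, {})

remember("user-42", "name", "Priya")  # stored during session 1
print(recall("user-42"))              # available in session 2
```

The recalled facts would then be injected into the system prompt exactly as in Approach 1 — the two techniques differ only in whether the store outlives the session.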
The Cost Equation: Why Tokens Are Money
Understanding the cost model changes how you build.
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |
| Gemini Flash | $0.075 | $0.30 |
For voice AI specifically: A 10-minute call might generate 3,000-5,000 output tokens (the AI's responses, transcribed). At GPT-4o rates, that's $0.03-0.05 per call for the LLM alone. At scale (10,000 calls/day), that's $300-500/day. Switching to GPT-4o-mini for routine calls drops this to roughly $18-30/day. Model selection matters.
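The arithmetic above, made explicit. The per-1M-token rates come from the pricing table; the call volume and token counts are the post's illustrative numbers:

```python
# Reproducing the voice-AI cost arithmetic. Output rates per 1M tokens
# come from the pricing table above; call volume and token counts are
# illustrative.

def daily_cost(output_tokens_per_call: int, calls_per_day: int,
               rate_per_1m: float) -> float:
    """Daily LLM output cost in dollars."""
    return output_tokens_per_call * calls_per_day * rate_per_1m / 1_000_000

print(daily_cost(5_000, 10_000, 10.00))  # GPT-4o at $10.00/1M  -> 500.0
print(daily_cost(5_000, 10_000, 0.60))   # GPT-4o-mini at $0.60/1M -> 30.0
```

Note this counts output tokens only; in a real voice pipeline, input tokens (the growing conversation context) and the speech-to-text/text-to-speech stages add their own costs on top.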
RAG: Giving LLMs Memory They Don't Have
The LLM only knows what was in its training data (knowledge cutoff) and what's in the current context window. It doesn't know your company's internal documents, your patient records, or yesterday's news.
RAG (Retrieval-Augmented Generation) fixes this by adding a retrieval step before generation:
User asks a question
-> Retrieve relevant documents from a database
-> Inject retrieved documents into the LLM's context
-> LLM generates a response grounded in the retrieved documents
The LLM isn't "searching the internet" or "looking things up." The retrieval system finds relevant documents and physically inserts them into the prompt. The LLM then generates a response as if it had always known that information.
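A minimal sketch of that pipeline. Retrieval here is a naive word-overlap score standing in for embedding similarity, and the "generation" step just prints the prompt that would be sent to an LLM; document contents and function names are illustrative:

```python
# Minimal RAG sketch: retrieve relevant documents, inject them into the
# prompt. Word-overlap ranking is a stand-in for embedding search; the
# documents and names are illustrative.

DOCS = [
    "IVF treatment at our clinic costs Rs 1.5-2.5L per cycle.",
    "Clinic hours are 9am to 6pm, Monday through Saturday.",
    "Success rates vary by age; under 35 averages 40-50% per cycle.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Physically insert the retrieved documents into the LLM's context."""
    context = "\n".join(retrieve(query, DOCS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does IVF cost?"))
```

The key structural point survives even in this toy: the LLM never searches anything. The prompt it receives already contains the facts, which is why its answer can be grounded.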
This is how every production system that needs current or private information works:
- Customer support bots (retrieve from knowledge base)
- Healthcare agents (retrieve from patient records)
- Legal assistants (retrieve from case law)
- Code assistants (retrieve from codebase)
RAG is a deep topic. It gets its own dedicated posts later in this series. For now, the key insight: LLMs don't know anything that isn't in their context window. RAG puts things in their context window.
Fine-Tuning vs RAG: When to Use Which
Two ways to give an LLM knowledge it doesn't have:
Fine-tuning: Train the model on your specific data. The model's weights change. It permanently "learns" your domain.
- Use when: You need the model to adopt a specific style, tone, or response format. Or when your domain has specialized vocabulary the base model handles poorly.
- Don't use when: You need the model to know specific facts (it'll still hallucinate them). Or when your information changes frequently (retraining is expensive).
RAG: Retrieve information at inference time and inject it into the prompt.
- Use when: You need accurate, verifiable answers grounded in specific documents. Or when your information changes frequently.
- Don't use when: Your knowledge base is too large to retrieve from effectively. Or when the model needs to reason deeply across your entire knowledge base simultaneously.
Most production systems use RAG for factual grounding and fine-tuning (or system prompts) for behavior/style. They're complementary, not competitive.
Next post: What "reasoning" actually means in LLMs. Chain of thought, o1/o3, test-time compute, and why some models "think" before answering.