How LLMs Actually Work — What Happens When You Hit Send


2026-01-07 · 9 min read · ai · llm · engineering

You type a prompt. Two seconds later, a coherent response appears. What happened? Not thinking. Not understanding. Something far stranger — tokenization, attention, and next-token prediction at extraordinary scale.


You type "Explain quantum computing to a 10-year-old." You hit send. Two seconds later, a coherent, well-structured explanation appears.

What happened in those two seconds?

Not thinking. Not understanding. Not reasoning in any way you'd recognize. Something far stranger and more interesting.

Step 1: Your Words Become Numbers

The model doesn't read English. It reads numbers.

Your sentence gets broken into tokens — chunks of text that the model treats as atomic units. A token is not a word. It's a piece of a word, or sometimes multiple words.

"Explain quantum computing to a 10-year-old"

Becomes something like:

["Explain", " quantum", " computing", " to", " a", " 10", "-year", "-old"]

Each token maps to a number — its position in the model's vocabulary. GPT-4's vocabulary has roughly 100,000 tokens. Claude's is similar. The vocabulary was built by analyzing billions of pages of text and finding the most efficient way to break language into chunks.

Common words get their own token: "the" is one token. Rare words get split: "quantum" might be one token, but "bioluminescence" might become "bio" + "lumin" + "escence" — three tokens.

This is called tokenization. The tool that does it is a tokenizer (BPE — Byte Pair Encoding — is the most common algorithm). It runs before the neural network sees anything.
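The merge-learning idea behind BPE can be sketched in a few lines. This is a toy illustration of the training-time merge step, not a real tokenizer — the function name and tiny corpus are made up for this post:

```python
from collections import Counter

def toy_bpe_merges(corpus_words, num_merges):
    """Learn BPE-style merge rules: repeatedly fuse the most frequent
    adjacent pair of symbols. A toy sketch, not a production tokenizer."""
    # Start with each word as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = toy_bpe_merges(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # frequent pairs like ('l', 'o') then ('lo', 'w') get fused first
```

Run the merges enough times over enough text and frequent words collapse into single tokens while rare words stay split — exactly the "the" vs. "bioluminescence" behavior above.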

Step 2: Numbers Become Positions in Space

Each token number maps to a vector — a list of numbers that represents that token's position in a high-dimensional space.

GPT-4 uses vectors with roughly 12,288 dimensions. That means each token becomes a list of 12,288 numbers.

"quantum" -> [0.23, -0.87, 0.45, 0.12, ..., -0.33]  (12,288 numbers)
"computing" -> [0.19, -0.91, 0.38, 0.15, ..., -0.28]  (12,288 numbers)

This is called an embedding. The key insight: words with similar meanings end up near each other in this space. "King" and "queen" are close. "King" and "refrigerator" are far apart. "Doctor" and "physician" are almost the same point.

These embeddings aren't hand-designed. They emerge from training on billions of text examples. The model learned that "doctor" and "physician" appear in the same contexts, so their vectors converged.

The positions also encode relationships. The classic example: the vector from "king" to "queen" is similar to the vector from "man" to "woman." The model learned gender as a direction in space.

This is not metaphorical. It's literal linear algebra. And it's happening in 12,288 dimensions simultaneously.
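"Near" and "far" here mean cosine similarity: the angle between two vectors. A minimal sketch with made-up 4-dimensional embeddings (real ones have thousands of dimensions, and these particular numbers are invented for illustration):

```python
import math

def cosine_similarity(u, v):
    """Angle-based closeness between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up toy embeddings; real models learn these during training.
doctor       = [0.90, 0.80, 0.10, 0.00]
physician    = [0.85, 0.82, 0.12, 0.05]
refrigerator = [0.00, 0.10, 0.90, 0.80]

print(cosine_similarity(doctor, physician))     # close to 1.0
print(cosine_similarity(doctor, refrigerator))  # much smaller
```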

Step 3: Attention — Which Words Matter to Which

Here's where the magic happens.

Your prompt is now a sequence of vectors. The model needs to figure out: for each token, which other tokens are relevant to understanding it?

In the sentence "The cat sat on the mat because it was tired," what does "it" refer to? The cat. Not the mat. How does the model know? Attention.

The attention mechanism computes a score between every pair of tokens. "It" and "cat" get a high score. "It" and "mat" get a low score. These scores determine how much each token influences the understanding of every other token.

The math:

For each token, compute three vectors:
  Q (Query):  "What am I looking for?"
  K (Key):    "What do I contain?"
  V (Value):  "What information do I provide?"

Attention score = softmax(Q * K^T / sqrt(d))
Output = Attention score * V

Each token broadcasts a query: "What should I pay attention to?" Every other token responds with a key: "Here's what I'm about." The dot product between query and key determines the attention score. High score = high attention. The output is a weighted sum of values, weighted by attention scores.
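The math above translates almost directly into code. A minimal sketch with tiny hand-picked vectors — in a real model, Q, K, and V are produced by learned projection matrices, but here they are simply given:

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over a toy sequence.
    Q, K, V are lists of vectors, one per token."""
    d = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key: Q . K^T / sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Output = attention-weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Three tokens, 2-d vectors; token 0's query aligns best with token 2's key,
# so token 0's output leans toward token 2's value.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.0, 1.0], [1.0, 0.0], [2.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
print(attention(Q, K, V))
```

Each output row is a weighted blend of value vectors — that blending is the "which words matter to which" step.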

This happens not once but many times in parallel. GPT-4 reportedly uses 96 attention heads per layer, all running simultaneously, each looking at different relationships. One head might track syntactic relationships (subject-verb agreement). Another might track semantic relationships (what "it" refers to). Another might track positional patterns.

And this entire attention block repeats across 128 layers in GPT-4. Each layer builds more abstract representations. Early layers might learn word relationships. Middle layers learn phrase and clause relationships. Later layers learn document-level patterns, tone, and style.

128 layers, 96 heads per layer. That's 12,288 parallel attention computations per token.

Step 4: Prediction — Always the Next Token

After all 128 layers (each combining attention with a feed-forward block), the model produces a probability distribution over its entire vocabulary (~100,000 tokens). This distribution answers one question:

Given everything before this point, what is the most likely next token?

"Explain quantum computing to a 10-year-old"

Next token probabilities:
  "." -> 12%
  "\n" -> 15%
  "Quantum" -> 8%
  "Sure" -> 6%
  "Imagine" -> 4%
  ...

The model doesn't plan a full response. It doesn't outline. It doesn't draft and revise. It generates one token at a time, each time running the entire forward pass (all 128 layers, all attention heads) to determine the next most probable token.

After generating "Sure," the entire sequence becomes:

"Explain quantum computing to a 10-year-old. Sure"

And the model runs again to predict the next token after "Sure." And again. And again. Token by token, the response builds.

This is called autoregressive generation. Each token is conditioned on all previous tokens — both the original prompt and all tokens the model has generated so far.

The response you see streaming in character by character? That's literally how the model works. It's not buffering a pre-generated response. It's computing each token in real-time, one after another.
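The generate-append-repeat loop is simple enough to sketch. Here the model's forward pass is faked with a small lookup table (the table and function names are invented for this post), but the loop structure is the real thing:

```python
def fake_next_token_probs(tokens):
    """Stand-in for the model's forward pass: given everything so far,
    return a probability distribution over a toy vocabulary."""
    table = {
        ("Explain",): {"quantum": 0.7, "the": 0.3},
        ("Explain", "quantum"): {"computing": 0.8, "physics": 0.2},
        ("Explain", "quantum", "computing"): {"<eos>": 0.9, "simply": 0.1},
    }
    return table.get(tuple(tokens), {"<eos>": 1.0})

def generate(prompt_tokens, max_tokens=10):
    """Autoregressive generation: predict, append, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = fake_next_token_probs(tokens)   # full forward pass each step
        next_token = max(probs, key=probs.get)  # greedy: pick the most probable
        if next_token == "<eos>":               # end-of-sequence: stop
            break
        tokens.append(next_token)               # feed the output back in
    return tokens

print(generate(["Explain"]))  # ['Explain', 'quantum', 'computing']
```

Note that every iteration calls the "model" on the full sequence so far — that is why output length directly drives latency and cost.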

Step 5: Sampling — Why Responses Aren't Deterministic

The model produces a probability distribution. "Sure" has 6% probability, "Imagine" has 4%. How does it choose?

Temperature controls this.

At temperature 0 (greedy decoding), the model always picks the highest-probability token. The response is deterministic — same input, same output, every time.

At temperature 1, the model samples from the full distribution according to probabilities. Higher-probability tokens are more likely but not guaranteed. This introduces randomness. Same input, different output each time.

Higher temperature = more creative (and more unpredictable). Lower temperature = more focused (and more repetitive).

Top-p (nucleus sampling) is another common control. Instead of considering all 100,000 tokens, it only considers the smallest set of tokens whose cumulative probability exceeds p. At top-p 0.9, the model considers enough tokens to cover 90% of the probability mass and ignores the remaining 10%.

This prevents the model from occasionally picking extremely improbable tokens (which produce nonsense) while still allowing creative variation among the likely candidates.
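Both knobs fit in one small function. A sketch over a `{token: probability}` dict, not any particular provider's API:

```python
import random

def sample_next_token(probs, temperature=1.0, top_p=1.0, rng=random):
    """Temperature + nucleus (top-p) sampling over a {token: prob} dict."""
    if temperature == 0:  # greedy decoding: always the argmax
        return max(probs, key=probs.get)

    # Temperature reshapes the distribution: p^(1/T) is equivalent to
    # dividing logits by T. T < 1 sharpens, T > 1 flattens.
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    scaled = {t: p / total for t, p in scaled.items()}

    # Nucleus: keep the smallest top-ranked set covering top_p of the mass.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        mass += p
        if mass >= top_p:
            break

    # Draw one token, weighted by its (renormalized) probability.
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"Sure": 0.06, "Imagine": 0.04, "\n": 0.15, ".": 0.12, "Quantum": 0.08}
print(sample_next_token(probs, temperature=0.7, top_p=0.9))
```

At temperature 0 it always returns "\n" (the argmax here); at higher temperatures, lower-probability candidates like "Imagine" start winning some of the time.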

The Scale That Makes It Work

Everything I just described is a standard transformer architecture. The underlying math has been public since 2017 ("Attention Is All You Need" by Vaswani et al.). What makes modern LLMs different is scale.

GPT-4 (estimated):

  • Parameters: ~1.8 trillion (across a mixture-of-experts architecture)
  • Training data: ~13 trillion tokens
  • Training compute: ~$100 million worth of GPU time
  • Layers: 128
  • Attention heads per layer: 96
  • Embedding dimension: 12,288

Claude: Anthropic doesn't publish exact numbers, but the scale is likely a similar order of magnitude.

The parameters are the learned weights — the numbers that determine how attention is computed, how embeddings are transformed, what probability distribution is produced. Training adjusts these weights by showing the model trillions of text examples and nudging the weights to better predict the next token.

That's it. The entire training objective is: predict the next token better. From this single objective, language understanding, reasoning, coding, translation, summarization, and creative writing all emerge.

What This Means Practically

Why LLMs are fast at short responses and slow at long ones: Each token requires a full forward pass through the model. A 500-token response requires 500 forward passes. Time scales linearly with output length.

Why LLMs cost what they cost: You pay per token. Input tokens are cheaper (one forward pass for the whole prompt, thanks to parallelization). Output tokens are expensive (one forward pass per token, sequential).

Why LLMs are bad at math: Next-token prediction is pattern matching. "What is 847 x 293?" doesn't have a pattern in the training data for every possible multiplication. The model tries to predict the answer token by token, sometimes getting intermediate digits wrong. It's predicting what an answer "looks like," not computing it.

Why LLMs hallucinate: The model produces the most probable next token. If the training data contains conflicting information, or if the question is about something not well-represented in training data, the most probable token might be wrong. The model has no mechanism to say "I don't know" — it always produces a probability distribution, and always picks from it.

Why context window matters: The attention mechanism computes scores between every pair of tokens. That's O(n^2) — quadratic scaling. Doubling the context window quadruples the compute. This is why context windows have limits, and why longer contexts are more expensive.
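The quadratic blow-up is easy to verify with arithmetic:

```python
def attention_pairs(n):
    """Number of query-key score computations for a context of n tokens."""
    return n * n

for n in (1_000, 2_000, 4_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>12,} pair scores")

# Doubling the context always quadruples the work, regardless of n.
print(attention_pairs(2_000) / attention_pairs(1_000))  # 4.0
```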

Why system prompts work: The system prompt is just tokens prepended to the conversation. The attention mechanism treats them the same as user tokens. The model predicts the next token conditioned on everything — system prompt + conversation history + current message. System prompt instructions work because the model learned during training that instructions at the beginning of text tend to govern what follows.

What an LLM is NOT

It's not a database. It doesn't "store" information and "retrieve" it. It learned patterns from training data and generates responses that match those patterns.

It's not a search engine. It doesn't look things up. It predicts what a good answer looks like based on the patterns it learned.

It's not thinking. It's not reasoning in the way you reason. It's producing the next most probable token, conditioned on everything before it. When the output looks like reasoning, it's because the training data contained examples of reasoning, and the model learned to produce text that follows similar patterns.

Next post: Why LLMs get things wrong. Hallucination isn't a bug — it's the mechanism. And understanding it changes how you build with them.