Definition

Context Window

The maximum amount of text — measured in tokens — that an LLM can consider at once when generating a response.

The Full Definition

The context window is the maximum size of input a language model can process in a single call, measured in tokens (roughly ¾ of a word in English). It includes both the prompt you provide and the response the model generates. Modern models range from 8K tokens (around 6,000 words) to 1M+ tokens (around 750,000 words) of context. Within that window, the model can reason across everything it sees; beyond it, content must be summarized, retrieved, or chunked.
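
As a rough sketch of how that budget works, the check below estimates whether a prompt plus a reserved response allowance fits in a given window, using the ¾-words-per-token rule of thumb from above. The function name and numbers are illustrative assumptions; exact counts depend on the model's tokenizer.

```python
# Rough context-window budget check.
# Assumes ~0.75 English words per token (the rule of thumb above);
# real counts vary by tokenizer and language.

def fits_in_window(prompt_words: int, max_response_tokens: int,
                   window_tokens: int) -> bool:
    est_prompt_tokens = prompt_words / 0.75  # words -> approximate tokens
    # The prompt and the generated response share the same window.
    return est_prompt_tokens + max_response_tokens <= window_tokens

# A ~200,000-word document plus a 1,000-token answer needs ~268K tokens:
print(fits_in_window(200_000, 1_000, 128_000))    # False: overflows a 128K window
print(fits_in_window(200_000, 1_000, 1_000_000))  # True: fits in a 1M window
```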

Why It Matters

Context window size is a fundamental architectural constraint. It dictates whether you can stuff a contract into the prompt directly or whether you need retrieval-augmented generation (RAG). Bigger context windows are powerful but expensive: cost scales with every token you send, so long-context calls cost dramatically more per query, and quality can degrade in the middle of very long contexts ("lost in the middle").

How This Shows Up in Practice

A team tried to drop a 400-page corporate document into a 1M-token model and ask questions of it. It worked — but each query cost $4 and took 30 seconds. Switching to RAG over chunks of the same document brought cost to $0.02 and latency to under a second, with equal quality.
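
A hedged back-of-envelope version of that comparison: the prices and sizes below are illustrative assumptions (real pricing varies by provider and model), but the arithmetic shows why resending the whole document on every query costs so much more than retrieving a few chunks.

```python
# Back-of-envelope cost comparison: full document in context vs. RAG.
# All prices and sizes are illustrative assumptions, not real quotes.

PRICE_PER_INPUT_TOKEN = 15 / 1_000_000  # hypothetical $15 per 1M input tokens

doc_tokens = 270_000   # ~400 pages of dense text (rough assumption)
chunk_tokens = 500     # size of one retrieved chunk (assumption)
chunks_per_query = 3   # top-k chunks retrieved per question (assumption)

full_context_cost = doc_tokens * PRICE_PER_INPUT_TOKEN
rag_cost = chunks_per_query * chunk_tokens * PRICE_PER_INPUT_TOKEN

print(f"Full context: ${full_context_cost:.2f} per query")  # $4.05 per query
print(f"RAG:          ${rag_cost:.4f} per query")            # $0.0225 per query
```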

Common Questions

Is a bigger context window always better?

No. Bigger windows increase cost roughly linearly with size, can suffer from "lost in the middle" effects, and rarely outperform a well-designed RAG system on retrieval-style tasks. Match context size to actual need.

What's a token?

A token is the unit a model reads, roughly ¾ of a word in English. "Tokenization" is the preprocessing step that splits text into tokens before the model processes it.
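
To make that concrete, here is a minimal sketch using OpenAI's tiktoken library (an assumption; any tokenizer works, and cl100k_base is just one common encoding):

```python
# Count tokens vs. words for a short sentence.
# Assumes `pip install tiktoken`; the right encoding depends on the model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The context window limits how much text a model can read at once."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")

# Encoding is lossless: decoding the tokens returns the original text.
assert enc.decode(tokens) == text
```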

Want to put this to work?

A free process audit maps where the context window and the rest of the modern AI stack actually move the needle in your business.

Survey My Business