Definition

Transformer

The attention-based neural network architecture that powers every modern LLM, most image models, and much of today's state-of-the-art AI.

The Full Definition

The transformer is a neural network architecture introduced in 2017 ("Attention Is All You Need") that uses self-attention — a mechanism for letting every token in a sequence look at every other token — to process language. Unlike its predecessors (RNNs, LSTMs), transformers process all tokens in parallel, scale efficiently with compute and data, and capture long-range relationships in text far better. Every major LLM (GPT, Claude, Llama, Gemini) is a transformer, as are modern image models (Vision Transformers) and many other state-of-the-art systems.
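The self-attention step described above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration, not how production models implement it; all the shapes, weight names, and sizes here are assumptions for the example.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention sketch.

    x: (seq_len, d_model) token vectors; w_q/w_k/w_v: (d_model, d_head).
    """
    q = x @ w_q                                   # queries
    k = x @ w_k                                   # keys
    v = x @ w_v                                   # values
    # Every token scores every other token: a (seq_len, seq_len) matrix.
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    # Softmax each row so the weights for one token sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all tokens' value vectors.
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.standard_normal((seq_len, d_model))
out = self_attention(x,
                     rng.standard_normal((d_model, d_head)),
                     rng.standard_normal((d_model, d_head)),
                     rng.standard_normal((d_model, d_head)))
print(out.shape)  # (4, 8)
```

Note that nothing in the computation depends on token order or on processing tokens one at a time, which is why transformers parallelize so well compared with RNNs.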

Why It Matters

You don't need to implement a transformer to use one — but understanding the architecture clarifies why context windows have the cost profile they do (attention is O(n²) in sequence length), why models can be parallelized so well, and why scale has driven the AI capability curve.
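That O(n²) cost profile is easy to see with back-of-envelope arithmetic: the attention score matrix has one entry per pair of tokens, so doubling the context length quadruples the matrix. A tiny sketch (the token counts are illustrative):

```python
# One attention score matrix has seq_len * seq_len entries,
# so cost grows quadratically with context length.
for seq_len in (1_000, 2_000, 4_000, 8_000):
    scores = seq_len * seq_len
    print(f"{seq_len:>6} tokens -> {scores:>12,} score entries")
```

Going from 1,000 to 8,000 tokens is an 8x longer context but a 64x larger score matrix, which is why long context windows carry a disproportionate compute (and pricing) cost.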

How This Shows Up in Practice

Most teams interact with transformers through APIs (OpenAI, Anthropic) or open-weight model loaders (Hugging Face). Understanding the architecture matters when you need to make decisions about model size, context length, or fine-tuning approach — all of which trade against the transformer's computational cost profile.

Common Questions

Are transformers being replaced?

Not yet. Alternatives like Mamba and state-space models are promising for very long contexts, but transformers remain the dominant production architecture and continue improving.

Do I need to understand transformers to build with AI?

Not in detail. But understanding the basics — attention, tokens, the cost profile — helps in making good architectural decisions about how to use models in production.

Want to put this to work?

A free process audit maps where transformers, and the rest of the modern AI stack, actually move the needle in your business.

Survey My Business