Part I · Foundations

A vector, a dot product.

Before "attention" is anything, it is a way of asking how similar two vectors are. That operation — the dot product — is the smallest piece of the Transformer. Get it in your bones and the rest is bookkeeping.

Two vectors. Say the model has learned to represent the word "cat" as a four-number sketch:

Figure 1 · a single 4-vector
Fig 1 — each cell is one dimension. Nothing about any single number means anything on its own. Meaning lives in the pattern.
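In code, such a sketch is just a short list of numbers. The values below are made up for illustration — a real model learns them during training:

```python
# A hypothetical 4-number sketch for "cat". These values are invented;
# a trained model would have learned its own.
cat = [0.9, -0.2, 0.4, 0.1]

# No single cell means anything on its own — the pattern across all
# four dimensions is the representation.
print(len(cat))  # 4
```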

The dot product of two such vectors is one number. It is bigger when the vectors point the same way, zero when they are perpendicular, and negative when they disagree. That is all attention knows how to ask: "are these two things aligned?"
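The three cases — aligned, perpendicular, opposed — are easy to check directly. A minimal sketch, with hand-picked 2-vectors chosen to make each case obvious:

```python
def dot(a, b):
    """Dot product: multiply matching dimensions, sum the results."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0]       # a fixed query
same = [2.0, 0.0]    # points the same way as q
perp = [0.0, 3.0]    # perpendicular to q
opp = [-1.0, 0.0]    # points the opposite way

print(dot(q, same))  # 2.0  — aligned: positive
print(dot(q, perp))  # 0.0  — orthogonal: zero
print(dot(q, opp))   # -1.0 — opposed: negative
```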

Drag the second vector

Below, the query is fixed. Drag the tip of the key and watch the dot product update. Notice: same direction wins, opposite direction loses, orthogonal gives zero.

Figure 2 · interactive dot product
Fig 2 — the dot product is q · k = |q||k|cos θ. When you drag the key to angle 0°, cos θ = 1 and the score is maximal.
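The identity in the caption can be verified numerically. Here the key sits at 45° to the query, so both sides of q · k = |q||k|cos θ should agree:

```python
import math

q = [2.0, 0.0]
k = [1.0, 1.0]  # 45 degrees away from q

dot = sum(a * b for a, b in zip(q, k))
norm_q = math.sqrt(sum(a * a for a in q))
norm_k = math.sqrt(sum(b * b for b in k))

# Recover the angle from the dot product, then check the identity holds.
theta = math.acos(dot / (norm_q * norm_k))
assert abs(dot - norm_q * norm_k * math.cos(theta)) < 1e-9

print(round(math.degrees(theta)))  # 45
```

Dragging the key toward 0° drives cos θ up to 1 and the score to its maximum, |q||k|; dragging it to 90° sends the score to zero.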

Why this matters

For a six-token sequence, attention scores every token against every token by taking all thirty-six (6 × 6) of these dot products. That is the entire first half of the attention equation. Everything that comes later — scaling, softmax, the weighted sum — exists to turn this raw similarity into a usable answer.
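The all-pairs scoring is a nested loop of the same dot product. A sketch with six random stand-in vectors (the values are placeholders for learned queries and keys):

```python
import random

random.seed(0)
n, d = 6, 4  # six tokens, four dimensions each

# Hypothetical query and key vectors — random stand-ins for learned values.
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Every query against every key: 6 x 6 = 36 dot products.
scores = [[sum(qi * ki for qi, ki in zip(q, k)) for k in K] for q in Q]

print(len(scores), len(scores[0]))  # 6 6
```

Scaling, softmax, and the weighted sum all operate on this 6 × 6 grid of raw similarities.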

One sentence to remember — the dot product measures alignment. Attention is built on nothing else.