A vector, a dot product.
Before "attention" is anything, it is a way of asking how similar two vectors are. That operation — the dot product — is the smallest piece of the Transformer. Get it in your bones and the rest is bookkeeping.
Two vectors. Say the model has learned to represent the word cat as a four-number sketch, a vector of four coordinates.
The dot product of two such vectors is one number. It is bigger when the vectors point the same way, zero when they are perpendicular, and negative when they disagree. That is all attention knows how to ask: "are these two things aligned?"
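To make the three cases concrete, here is a minimal sketch in NumPy. The four-number vectors are made up for illustration; they are not values any model actually learned.

```python
import numpy as np

# Hypothetical four-number sketches; real learned values would differ.
query = np.array([0.9, 0.1, -0.3, 0.4])
key_aligned = np.array([0.8, 0.2, -0.2, 0.5])    # points roughly the same way
key_opposed = -key_aligned                        # points the opposite way
key_orthogonal = np.array([-0.1, 0.9, 0.0, 0.0]) # perpendicular to the query

print(query @ key_aligned)     # positive: the vectors agree
print(query @ key_opposed)     # negative: they disagree
print(query @ key_orthogonal)  # zero: they are perpendicular
```

The `@` operator is NumPy's dot product: multiply the vectors entry by entry and sum, giving one number per pair.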
Drag the second vector
Below, the query is fixed. Drag the tip of the key and watch the dot product update. Notice: same direction wins, opposite direction loses, orthogonal gives zero.
Why this matters
Attention scores six tokens against six tokens by taking thirty-six of these dot products (every query against every key). That is the entire first half of the attention equation. Everything that comes later — scaling, softmax, the weighted sum — exists to turn this raw similarity into a usable answer.
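All thirty-six dot products fall out of a single matrix multiplication. A sketch with random stand-in vectors (six queries, six keys, four numbers each; nothing here is a trained weight):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))  # six query vectors, four numbers each (stand-ins)
K = rng.normal(size=(6, 4))  # six key vectors

# 6x6 matrix of raw scores: entry [i, j] is query i dotted with key j.
scores = Q @ K.T
print(scores.shape)  # (6, 6)
```

That `Q @ K.T` is the raw-similarity half of the attention equation; scaling and softmax then operate on this matrix, row by row.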