Eight heads, eight readings.
A single attention head can only express one relationship at a time. Multi-head attention runs h of them in parallel, each with its own learned Q, K, V projections — so the model can, in one layer, track syntax and semantics and position at once.
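In code, the whole mechanism is h independent projection triples applied to the same input. Below is a minimal numpy sketch; the sizes (n = 6 tokens, d_model = 64, h = 8, d_k = 8) and the random weights are illustrative stand-ins, not values from this page's demo.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One head: project, score, row-normalize, mix values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each [n, d_k]
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # [n, n]
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # [n, d_k]

n, d_model, h = 6, 64, 8
d_k = d_model // h                              # 8 dims per head

rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))               # six token embeddings

# h independent Q/K/V triples -> h independent readings of X.
heads = [
    attention_head(X,
                   rng.normal(size=(d_model, d_k)),
                   rng.normal(size=(d_model, d_k)),
                   rng.normal(size=(d_model, d_k)))
    for _ in range(h)
]
print(len(heads), heads[0].shape)               # 8 heads, each (6, 8)
```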
Below are eight heads, all attending over the same six tokens: "the cat sat on the mat". Each head's projection matrices were drawn from a different random initialization, so each produces a distinct pattern. In a trained model these patterns specialize: one head learns subject–verb, another learns nearest-adjective, another copies positional information.
Eight heads at once
Hover any head to read its row-normalized weights. Click to lock focus.
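For readers without the interactive demo handy, here is a rough way to generate patterns like these: eight random projection pairs over the same six tokens, each yielding a different row-normalized weight matrix. The seed, the Gaussian projections, and the stand-in embeddings are all assumptions; the demo's actual weights aren't specified.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
n, d_model, d_k, h = len(tokens), 64, 8, 8

rng = np.random.default_rng(42)                 # arbitrary seed
X = rng.normal(size=(n, d_model))               # stand-in embeddings

for i in range(h):
    Wq = rng.normal(size=(d_model, d_k))        # fresh draws per head
    Wk = rng.normal(size=(d_model, d_k))
    Q, K = X @ Wq, X @ Wk
    W = softmax(Q @ K.T / np.sqrt(d_k))         # [6, 6], rows sum to 1
    top = tokens[int(W[tokens.index("cat")].argmax())]
    print(f"head {i}: 'cat' attends most to '{top}'")
```

Because every projection pair warps the same embeddings differently, each head's weight matrix peaks somewhere else, which is exactly the variety the eight maps above show.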
Then what?
After each head produces its own [n, d_k] output, the eight results are concatenated along the feature axis (giving [n, h·d_k] = [n, d_model]) and passed through a final linear projection W_O. The model's next layer sees a single tensor, but encoded inside it are eight different views of the same sentence.
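The recombination step itself is one concatenation and one matrix multiply. A sketch with stand-in head outputs (shapes as above; W_O is random here, where a trained model would learn it):

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, d_k = 6, 8, 8
d_model = h * d_k

# Stand-ins for the eight per-head outputs, each [n, d_k].
head_outputs = [rng.normal(size=(n, d_k)) for _ in range(h)]

# Concatenate along the feature axis: [n, h*d_k] = [n, d_model].
concat = np.concatenate(head_outputs, axis=-1)
assert concat.shape == (n, d_model)

# Final linear projection W_O mixes the eight views into one tensor.
W_O = rng.normal(size=(d_model, d_model))
output = concat @ W_O                           # [n, d_model]
print(output.shape)                             # (6, 64)
```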