Part III · Multi-head and beyond

Eight heads, eight readings.

A single attention head can express only one relationship at a time. Multi-head attention runs h such heads in parallel, each with its own learned Q, K, and V projections, so that in one layer the model can track syntax, semantics, and position at once.
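
Concretely, one head is just scaled dot-product attention behind its own three projections. Here is a minimal NumPy sketch; the names (one_head, W_q, W_k, W_v) and the shapes are illustrative assumptions, with d_k the per-head dimension:

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the row max so the exponentials cannot overflow.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, W_q, W_k, W_v):
    """One attention head. X: [n, d_model]; W_q, W_k, W_v: [d_model, d_k]."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each [n, d_k]
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # [n, n]: query-key similarity
    weights = softmax(scores, axis=-1)           # row-normalized: each row sums to 1
    return weights @ V, weights                  # [n, d_k] output and the [n, n] pattern
```

Multi-head attention simply calls this h times with different projections, which is exactly what the demo below does.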

Below are eight heads, all attending over the same six tokens: "the cat sat on the mat". Each head's projection matrices were randomly sampled, so each head produces a distinct pattern (a sketch reproducing this setup follows the figure). In a trained model these patterns specialize: one head tracks subject–verb links, another the nearest adjective, another copies positional information.

Eight heads at once


Fig 5 · multi-head attention grid. Each 6×6 block is one head; rows are queries (who is asking), columns are keys (who is being looked at); each row is normalized to sum to 1, and darker = more attention.
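
The figure can be approximated in a few lines. A minimal sketch, continuing from one_head above; the embeddings X are random stand-ins (a trained model would use learned token embeddings), and the toy sizes d_model = 16, d_k = 2, h = 8 are assumptions chosen so that h·d_k = d_model:

```python
# Toy version of Fig 5: eight randomly initialized heads over six tokens.
# X is a random stand-in for token embeddings; only the patterns matter here.
rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
n, d_model, d_k, h = len(tokens), 16, 2, 8       # h * d_k == d_model

X = rng.normal(size=(n, d_model))
outputs, patterns = [], []
for _ in range(h):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    out, weights = one_head(X, W_q, W_k, W_v)    # weights: [6, 6], rows sum to 1
    outputs.append(out)
    patterns.append(weights)                     # one 6×6 grid per head, as in Fig 5
```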

Then what?

After each head produces its own [n, d_k] output, the eight results are concatenated along the feature axis (giving [n, h·d_k] = [n, d_model]) and passed through a final linear projection W_O. The model's next layer sees a single tensor, but encoded inside it are eight different views of the same sentence.
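
Continuing the sketch above, this step is two lines; W_O is random here, standing in for a learned [h·d_k, d_model] matrix:

```python
# Concatenate the h per-head outputs ([n, d_k] each) into [n, h*d_k],
# then mix them back to model width with the output projection W_O.
W_O = rng.normal(size=(h * d_k, d_model))        # learned in a real model
combined = np.concatenate(outputs, axis=-1)      # [6, 16] = [n, d_model]
final = combined @ W_O                           # one tensor, eight views inside
```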

One sentence to remember: one head = one relationship. Eight heads = eight relationships, discovered in parallel, at roughly the cost of one full-width head, since each head works in a smaller d_k = d_model/h slice.