What just happened

Each row of the output is a convex combination of the value rows, weighted by how much that row's query "asked" of each key. A row that attends uniformly just averages all the values; a row that attends sharply copies a single value row. Multi-head attention runs h copies of this computation in parallel, each with its own learned Q, K, V projections, giving the model h different ways to relate tokens.
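
A minimal sketch of the convex-combination view, using made-up value rows rather than the example's numbers: a uniform weight row returns the mean of V's rows, while a one-hot row returns exactly one of them.

```python
import numpy as np

# Illustrative values (not the worked example's numbers): 3 value rows of width 4.
V = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])

uniform = np.array([1/3, 1/3, 1/3])   # a row that attends uniformly
sharp   = np.array([0., 1., 0.])      # a row that attends entirely to key 1

print(uniform @ V)  # the mean of all three value rows
print(sharp @ V)    # an exact copy of value row 1
```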

Try it yourself

Reproduce the numbers above with NumPy. The shapes are small enough that you can print every intermediate.
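
The exact matrices from the worked example aren't reproduced here, so this sketch draws small random Q, K, V instead; swap in the example's numbers to match its output. The variable names and shapes (n = 3, d_k = d_v = 4) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 3, 4, 4          # tiny shapes so every intermediate fits on screen

Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

scores = Q @ K.T / np.sqrt(d_k)                  # [n, n]: each query's similarity to each key
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: each row sums to 1

out = weights @ V                                # each output row is a convex combination of V's rows

for name, x in [("scores", scores), ("weights", weights), ("out", out)]:
    print(name, x.shape)
    print(x.round(3))
```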

Next up: we'll stack h = 8 heads, each with d_k = 64, and see how concatenation followed by the output projection W_O gives us back an [n, d_model] tensor, the same shape we started with, now enriched by eight distinct attention views.
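
As a shape-only preview of that step (random matrices standing in for learned weights, and n = 10 chosen arbitrarily), here's how the eight heads concatenate and project back to d_model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
d_k = d_model // h                        # 64 per head

X = rng.standard_normal((n, d_model))
W_O = rng.standard_normal((h * d_k, d_model))

heads = []
for _ in range(h):
    W_q = rng.standard_normal((d_model, d_k))
    W_k = rng.standard_normal((d_model, d_k))
    W_v = rng.standard_normal((d_model, d_k))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads.append(weights @ V)             # [n, d_k] per head

concat = np.concatenate(heads, axis=-1)   # [n, h * d_k] = [n, 512]
out = concat @ W_O                        # back to [n, d_model]
print(out.shape)                          # (10, 512)
```

Note that setting d_k = d_model / h keeps the total cost of the eight heads comparable to one full-width head.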