Part III · Multi-head and beyond

The Transformer block, wired up.

We have every ingredient. Now we assemble. A Transformer block is a piece of circuitry with one attention sub-layer, one feed-forward sub-layer, two residuals, and two layer-norms — no more, no less.

Click any wire in the diagram. The right-hand panel explains what flows through it, what shape it has, and what the math does at that point.

The block

Figure 7 · encoder-layer data-flow. The only operation that mixes across tokens is multi-head attention; the FFN applies position-wise; both residuals skip exactly one sub-layer.

Start with the input: every wire carries a tensor of shape [n, d_model], that is, n tokens, each represented by a d_model-wide vector. That shape is preserved through the entire block, which is why you can stack them.
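That shape invariance is easy to check. Here is a minimal sketch, assuming PyTorch and using the built-in nn.TransformerEncoderLayer as a stand-in for one block; the sizes n = 5, d_model = 512, h = 8 are illustrative assumptions, not values fixed by the text:

```python
# Quick shape check: PyTorch's nn.TransformerEncoderLayer stands in for one
# block. The sizes below are illustrative, not prescribed by the article.
import torch
import torch.nn as nn

n, d_model, h = 5, 512, 8
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=h,
                                   dim_feedforward=2048, batch_first=True)

x = torch.randn(1, n, d_model)   # one sequence of n token vectors

out = x
for _ in range(3):               # pass through the same block three times
    out = block(out)             # [1, n, d_model] in, [1, n, d_model] out

print(out.shape)                 # torch.Size([1, 5, 512]): shape preserved
```

Because the output shape matches the input shape, stacking is just a loop; the second block never needs to know it is second.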

The equation

Written out, one encoder layer is:

z = LayerNorm(x + MultiHead(x, x, x))
y = LayerNorm(z + FFN(z))

The residuals let gradients flow unimpeded past every sub-layer; layer-norm keeps activations in a well-conditioned range; multi-head attention mixes information across tokens; the FFN (two linear layers with a ReLU between them) does the position-wise computation. Everything else in a full model — decoder, cross-attention, masking — is a variation on these four parts.
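In code, those two lines map almost one-to-one onto a small module. A minimal sketch, assuming PyTorch and borrowing nn.MultiheadAttention for the multi-head part; dropout is omitted, and d_model = 512, h = 8, d_ff = 2048 are illustrative choices rather than values from the text:

```python
# Minimal encoder block: an illustration of the two equations above,
# not a production implementation (no dropout, no masking).
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, h: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(          # two linear layers with a ReLU between
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # z = LayerNorm(x + MultiHead(x, x, x))  -- mixes across tokens
        z = self.norm1(x + self.attn(x, x, x)[0])
        # y = LayerNorm(z + FFN(z))              -- position-wise, mixes across features
        return self.norm2(z + self.ffn(z))

block = EncoderBlock(d_model=512, h=8, d_ff=2048)
x = torch.randn(1, 10, 512)        # [batch, n, d_model]
print(block(x).shape)              # torch.Size([1, 10, 512]): shape unchanged
```

The forward method is literally the two equations: norm1 wraps the token-mixing sub-layer, norm2 wraps the position-wise one, and each residual skips exactly one of them.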

One sentence to remember — one block mixes across tokens once and across features once, then hands the result to the next block unchanged in shape.

What you've built

From a single dot product you now have: similarity → softmax → scaled dot-product attention → multi-head attention → positional encoding → the full block. Every Transformer in use today — the ones writing code, translating languages, folding proteins — is a stack of these blocks with variations on what's masked and what's cross-attended. The math is no harder than what you just read.