Transformer Forward Pass
Layer l, timestep t=4 ("cat") · d_model=512, d_head=64, n_heads=8
Legend: 1D Activation · 2D Weight · Residual · Nonlinear · Per-Head
Layer l-1 outputs
t=1 "I" [512]
t=2 "am" [512]
t=3 "a" [512]
t=4 "cat" [512]
Multi-Head Attention
I · am · a · cat ← K, V use all t ≤ 4
Query (t=4): x₄ [512] × Wq [512×512] → Q [512]
Keys: X [4×512] × Wk [512×512] → K [4×512]
Values: X [4×512] × Wv [512×512] → V [4×512]
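The three projections above can be sketched in NumPy as follows. This is a minimal illustration of the shapes only: the random values, the seed, and the names `W_q`, `W_k`, `W_v` are assumptions, since the figure gives dimensions and not weights, and no bias terms are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, t = 512, 4

# Hypothetical stand-ins: the figure specifies shapes, not values.
X = rng.standard_normal((t, d_model))  # layer l-1 outputs for "I am a cat"
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

x4 = X[-1]    # current token "cat", shape (512,)
Q = x4 @ W_q  # only the current token issues a query: (512,)
K = X @ W_k   # keys for every position t <= 4: (4, 512)
V = X @ W_v   # values for every position t <= 4: (4, 512)

print(Q.shape, K.shape, V.shape)
```

Only the last row of `X` is projected into a query, while keys and values are computed for the whole history, which matches the "K, V use all t ≤ 4" note.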
split → 8 × 64
8 Independent Heads
concat → 512
Output Projection (linear): H [512] × Wo [512×512] → out [512]
current token queries history
↓
8 independent summaries (heads 1 to 8)
each: softmax → weighted avg
↓
mix heads linearly → residual
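The flow above (split into heads, softmax, weighted average, concat, linear mix) might look like the sketch below. The 1/sqrt(d_head) scaling is the conventional choice and is an assumption here, as the figure does not show it; the function name `attend_last_token` and the random inputs are hypothetical. No causal mask is needed because the single query sits at t=4 and K, V contain only t ≤ 4.

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over the last axis.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_last_token(Q, K, V, W_o, n_heads=8):
    """Q: (d_model,), K and V: (t, d_model), W_o: (d_model, d_model)."""
    t, d_model = K.shape
    d_head = d_model // n_heads
    q = Q.reshape(n_heads, d_head)                        # split: (8, 64)
    k = K.reshape(t, n_heads, d_head).transpose(1, 0, 2)  # (8, t, 64)
    v = V.reshape(t, n_heads, d_head).transpose(1, 0, 2)  # (8, t, 64)
    # Assumed scaled dot-product scores, one row per head.
    scores = np.einsum("hd,htd->ht", q, k) / np.sqrt(d_head)
    weights = softmax(scores)                     # each head's row sums to 1
    heads = np.einsum("ht,htd->hd", weights, v)   # weighted average: (8, 64)
    H = heads.reshape(d_model)                    # concat: (512,)
    return H @ W_o, weights                       # mix heads linearly

rng = np.random.default_rng(0)
Q = rng.standard_normal(512)
K = rng.standard_normal((4, 512))
V = rng.standard_normal((4, 512))
W_o = rng.standard_normal((512, 512)) / np.sqrt(512)

out, weights = attend_last_token(Q, K, V, W_o)
print(out.shape, weights.shape)
```

Each of the 8 rows of `weights` is a separate distribution over the 4 positions, which is what makes the heads independent summaries until `W_o` mixes them.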
Loss Computation
Unembedding: x_L [512] × W_Eᵀ [512×50257] → z [50257]
Cross-Entropy: z [50257] → softmax → −log p[y] → ℒ (scalar)
512 → unembed → 50257 → CE → ℒ
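A minimal sketch of the loss path, assuming the unembedding reuses a transposed embedding matrix `W_E` of shape [50257×512] as the W_Eᵀ label suggests. The random values and the 0.02 scale are placeholders; the label index 9338 is the one shown in the figure. The softmax is folded into a log-sum-exp for numerical stability.

```python
import numpy as np

def cross_entropy_from_logits(z, y):
    # -log softmax(z)[y], computed via a stable log-sum-exp.
    z = z - z.max()
    log_p = z - np.log(np.exp(z).sum())
    return -log_p[y]

rng = np.random.default_rng(0)
d_model, vocab = 512, 50257

# Hypothetical values; float32 keeps W_E near 100 MB.
x_L = rng.standard_normal(d_model, dtype=np.float32)
W_E = rng.standard_normal((vocab, d_model), dtype=np.float32) * 0.02

z = x_L @ W_E.T            # logits over the vocabulary: (50257,)
y = 9338                   # label index from the figure
loss = cross_entropy_from_logits(z, y)
pred = int(np.argmax(z))   # index of the predicted token

print(z.shape, float(loss), pred)
```

With uniform logits the loss equals log(50257), which is a handy sanity check for an untrained model.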
Input: x_{l-1}^{(4)}, ∂ℒ/∂x_{l-1} [512]
LN₁ → to attn → drop → + attn
LN₂ → to mlp → drop → + mlp
Output: x_l^{(4)}, ∂ℒ/∂x_l [512]
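The residual path above reads as a pre-LayerNorm block: normalize, run the sublayer, add the result back to the stream. The sketch below assumes that reading, treats `drop` as dropout (identity at inference), and omits the learnable LayerNorm gain and bias. `attn_fn` and `mlp_fn` are hypothetical callables standing in for the two sublayers; a real attention sublayer would also need the normalized history, which is left out here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (no learned gain/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, attn_fn, mlp_fn):
    # LN1 -> attn -> (dropout, identity at inference) -> add to stream
    x = x + attn_fn(layer_norm(x))
    # LN2 -> mlp -> (dropout, identity at inference) -> add to stream
    x = x + mlp_fn(layer_norm(x))
    return x

rng = np.random.default_rng(0)
x_in = rng.standard_normal(512)
W = rng.standard_normal((512, 512)) / np.sqrt(512)  # placeholder weights

x_out = residual_block(
    x_in,
    attn_fn=lambda h: h @ W,          # stand-in for the attention sublayer
    mlp_fn=lambda h: np.tanh(h @ W),  # stand-in for the MLP sublayer
)
print(x_out.shape)
```

Because each sublayer only adds to the stream, the gradient ∂ℒ/∂x_l reaches x_{l-1} through an identity path as well as through the two sublayers.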
predicted: "that" [argmax(z)]
label: "sat" [y=9338]
MLP
Up (4× expand): x [512] × W↑ [512×2048] → h [2048]
Activation: h [2048] → GELU → h' [2048]
Down (contract): h' [2048] × W↓ [2048×512] → out [512]
512 → expand → 2048 → GELU → 2048 → contract → 512
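The expand, GELU, contract pipeline can be sketched as below. The tanh approximation of GELU is one common variant and is an assumption, since the figure does not say which form is used; biases are omitted because none are drawn. `W_up` and `W_down` are placeholder names for W↑ and W↓ with random values.

```python
import numpy as np

def gelu(h):
    # Tanh approximation of GELU (assumed variant).
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def mlp(x, W_up, W_down):
    h = x @ W_up           # expand: 512 -> 2048
    h_act = gelu(h)        # elementwise nonlinearity, shape unchanged
    return h_act @ W_down  # contract: 2048 -> 512

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
W_up = rng.standard_normal((512, 2048)) / np.sqrt(512)
W_down = rng.standard_normal((2048, 512)) / np.sqrt(2048)

out = mlp(x, W_up, W_down)
print(out.shape)
```

The output has the same width as the input, so it can be added directly back to the residual stream.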