The paper that changed everything: “Attention Is All You Need” (2017). No RNNs, no convolutions. Just attention. Now powers GPT, BERT, and nearly every language model.
High-level structure
Original transformer: encoder-decoder
- Encoder: processes the input and builds contextual representations
- Decoder: generates the output, attending to the encoder's representations
BERT uses encoder only. GPT uses decoder only.
Full architecture tour: Transformer Animation
Encoder block
Each encoder layer:
- Multi-head self-attention
- Add & normalize (residual + layer norm)
- Feed-forward network
- Add & normalize
```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        # assumes a MultiHeadAttention module is defined elsewhere
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm style: normalize before each sublayer, then add the residual
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x
```
Stack N of these (usually 6-96 depending on model size).
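Stacking can be sketched with PyTorch's built-in encoder classes; `norm_first=True` gives the pre-norm style used above (the sizes here are illustrative, not prescribed by the paper):

```python
import torch
import torch.nn as nn

# Sketch: stack N=6 pre-norm encoder layers using PyTorch's built-ins.
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    activation="gelu", norm_first=True, batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = encoder(x)
print(out.shape)              # torch.Size([2, 10, 512])
```

Note that the stack preserves the `(batch, seq_len, d_model)` shape, which is what lets layers be stacked to any depth.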
Decoder block
Same as encoder, plus:
- Masked self-attention: can attend only to the current and earlier positions
- Cross-attention: Attend to encoder outputs
```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        self.cross_attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(...)  # same Linear-GELU-Linear shape as the encoder's FFN
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, mask):
        x = x + self.self_attention(self.norm1(x), mask=mask)
        x = x + self.cross_attention(self.norm2(x), encoder_output)
        x = x + self.ffn(self.norm3(x))
        return x
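The `mask` used for masked self-attention is typically a causal (look-ahead) mask. A minimal sketch, assuming boolean semantics where `True` means "blocked":

```python
import torch

# Sketch: causal mask -- position i may attend to positions 0..i only.
seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Applying it to raw attention scores before softmax:
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(mask, float("-inf"))
weights = scores.softmax(dim=-1)
# Future positions get weight 0; each row still sums to 1.
```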
Feed-forward network
Simple but crucial. Processes each position independently:
$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$
Typically d_ff = 4 × d_model. This is where most parameters live.
Recent theory: FFN acts as key-value memory. First layer selects patterns, second retrieves information.
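The "most parameters" claim is easy to check with back-of-envelope arithmetic, using GPT-2-small-like sizes as an assumed example (weights only; biases add a small extra term):

```python
# Per-layer parameter count, assuming d_model=768 and d_ff=4*d_model=3072.
d_model, d_ff = 768, 3072
ffn_params = d_model * d_ff + d_ff * d_model   # W1 + W2
attn_params = 4 * d_model * d_model            # Q, K, V, and output projections
print(ffn_params, attn_params)                 # 4718592 2359296
# The FFN holds twice the parameters of the attention sublayer.
```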
Attention patterns
Multi-head attention lets the model learn different relationship types:
- Head 1: syntactic dependencies
- Head 2: coreference
- Head 3: positional patterns
- etc.
Different heads specialize automatically during training.
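Mechanically, the heads come from splitting `d_model` into per-head subspaces; a sketch with illustrative sizes:

```python
import torch

# Sketch: split d_model=512 across num_heads=8, giving 64 dims per head.
batch, seq_len, d_model, num_heads = 2, 10, 512, 8
head_dim = d_model // num_heads

x = torch.randn(batch, seq_len, d_model)
# Reshape to (batch, num_heads, seq_len, head_dim): each head attends
# within its own 64-dim subspace, which is what allows specialization.
heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 10, 64])
```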
Residual connections
Every sublayer has a skip connection:
$$\text{output} = \text{sublayer}(x) + x$$
Critical for:
- Training deep networks (gradient flow)
- Preserving information
- Allowing layers to learn “refinements”
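The gradient-flow point can be seen directly: with $y = \text{sublayer}(x) + x$, the derivative includes an identity term, so gradients reach early layers even when the sublayer's own gradient vanishes. A toy autograd demo:

```python
import torch

# Sketch: the skip connection adds an identity path for gradients.
x = torch.ones(3, requires_grad=True)
sublayer = lambda t: 0.0 * t        # degenerate sublayer with zero gradient
y = (sublayer(x) + x).sum()
y.backward()
print(x.grad)                       # tensor([1., 1., 1.]) -- identity path survives
```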
The full picture
```
Input Embeddings + Positional Encoding
            ↓
┌──────────────────────────┐
│ Encoder Block ×N         │
│ - Self-attention         │
│ - FFN                    │
└──────────────────────────┘
            ↓
     Encoder Output
            ↓
┌──────────────────────────┐
│ Decoder Block ×N         │
│ - Masked self-attention  │
│ - Cross-attention        │
│ - FFN                    │
└──────────────────────────┘
            ↓
    Linear + Softmax
            ↓
     Output Tokens
```
Variants
- Encoder-only (BERT): classification, understanding tasks
- Decoder-only (GPT): generation, language modeling
- Encoder-decoder (T5): translation, summarization
Modern trend: decoder-only models scale best and are the most versatile.