Neural networks learn by adjusting weights to reduce error. Gradient descent tells you which direction to adjust. It’s the engine behind all of deep learning.
The basic idea
You have a loss function L(w) that measures how wrong your predictions are.
Goal: find weights w that minimize L.
Gradient ∇L tells you which direction increases L fastest. Go opposite direction to decrease L.
$$w_{new} = w_{old} - \alpha \nabla L(w)$$
Where α is the learning rate (the step size).
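Here's that update rule in plain Python on a toy 1-D quadratic loss (my example, not from the original):

```python
# Gradient descent on a toy 1-D loss: L(w) = (w - 3)^2
# The gradient is dL/dw = 2 * (w - 3), so the minimum is at w = 3.

def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = 0.0        # initial weight
alpha = 0.1    # learning rate

for step in range(100):
    w = w - alpha * gradient(w)   # w_new = w_old - alpha * dL/dw

print(round(w, 4))  # converges toward 3.0
```

Each step multiplies the distance to the minimum by (1 - 2α), so with α = 0.1 the error shrinks by 20% per step.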
Why it works
The loss surface is like a landscape. You're standing at some point, and the gradient points uphill.
Moving in the opposite direction means moving downhill, toward lower loss.
Repeat until you reach a minimum.
Learning rate matters
Too small: Takes forever to converge. Might get stuck.
Too large: Overshoots minimum. Loss oscillates or diverges.
Just right: Converges smoothly.
# Typical ranges
lr = 1e-3 # common starting point
lr = 1e-4 # smaller, more stable
lr = 3e-4 # often works for Adam
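You can see all three regimes on a 1-D quadratic, where the update has a closed form (this sketch and its specific learning rates are illustrative, not from the original):

```python
# Effect of learning rate on gradient descent for L(w) = w^2.
# The gradient is 2w, so one update is w <- w - lr * 2w = (1 - 2*lr) * w.
# |1 - 2*lr| < 1 means convergence; |1 - 2*lr| > 1 means divergence.

def run(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

small = run(0.001)   # too small: barely moves after 50 steps
good = run(0.1)      # just right: converges smoothly toward 0
large = run(1.1)     # too large: |1 - 2.2| = 1.2 > 1, so it diverges

print(abs(small), abs(good), abs(large))
```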
Types of gradient descent
Batch (full) gradient descent: Compute gradient on entire dataset. Accurate but slow.
for epoch in range(epochs):
    gradient = compute_gradient(all_data)
    weights -= lr * gradient
Stochastic gradient descent (SGD): Compute gradient on single sample. Fast but noisy.
for sample in dataset:
    gradient = compute_gradient(sample)
    weights -= lr * gradient
Mini-batch gradient descent: Compute gradient on small batch. Best of both worlds.
for batch in dataloader:  # batch_size = 32, 64, 128...
    gradient = compute_gradient(batch)
    weights -= lr * gradient
This is what everyone uses in practice.
Momentum
SGD is noisy and tends to oscillate. Momentum smooths it out.
Keep running average of gradients: $$v_t = \beta v_{t-1} + \nabla L$$ $$w = w - \alpha v_t$$
Like a ball rolling downhill - it builds up speed in consistent directions.
v = 0
for batch in dataloader:
    gradient = compute_gradient(batch)
    v = beta * v + gradient
    weights -= lr * v
Adam optimizer
Adaptive learning rate per parameter. Most popular optimizer.
Combines momentum with adaptive scaling:
- Parameters with large gradients: smaller effective learning rate
- Parameters with small gradients: larger effective learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(batch)
    loss.backward()
    optimizer.step()
Adam usually “just works” but sometimes SGD+momentum generalizes better.
Learning rate scheduling
Learning rate should often decrease during training.
Step decay: Reduce by factor every N epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
Cosine annealing: Smooth decrease following cosine curve
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
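Both schedules are simple closed-form curves; in PyTorch you call `scheduler.step()` once per epoch after `optimizer.step()`. Here's a dependency-free sketch of the math they implement (the helper names are mine):

```python
import math

def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    # Step decay: multiply the learning rate by gamma every step_size epochs
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr, epoch, T_max=100, eta_min=0.0):
    # Cosine annealing: smooth decay from base_lr to eta_min over T_max epochs
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max))

print(step_lr(1e-3, 0))      # 0.001
print(step_lr(1e-3, 30))     # 0.0001 (one decay applied)
print(cosine_lr(1e-3, 0))    # 0.001
print(cosine_lr(1e-3, 100))  # ~0.0
```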
Warmup: Start small, increase, then decrease
# Linear warmup for first 1000 steps
# Then decay
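One way to implement warmup-then-decay (a minimal sketch; the step counts and the choice of linear decay are illustrative assumptions):

```python
def warmup_then_decay(step, base_lr=3e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup from 0 to base_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1 - progress)

# Learning rate at each training step
lrs = [warmup_then_decay(s) for s in range(10000)]
```

The same shape can be plugged into PyTorch via `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier relative to the base learning rate.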
Local minima and saddle points
The loss surface isn't a simple bowl. It has:
- Local minima (not globally optimal)
- Saddle points (minimum in some directions, maximum in others)
This sounds scary, but in high dimensions most "bad" critical points are saddle points rather than local minima, and the noise from mini-batch gradients helps escape them.
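A toy illustration (my example, not from the original): f(x, y) = x² − y² has a saddle at the origin. The gradient there is exactly zero, so noiseless gradient descent started at the saddle never moves, while a little gradient noise escapes along the y direction where the surface curves downward.

```python
import random

def grad(x, y):
    # f(x, y) = x**2 - y**2: curves up along x, down along y
    return 2 * x, -2 * y

# Plain gradient descent starting exactly at the saddle: gradient is zero, stuck
x, y = 0.0, 0.0
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - 0.1 * gx, y - 0.1 * gy
stuck = (x, y)  # still (0.0, 0.0)

# Add small gradient noise (mimicking mini-batch noise): escapes along y
random.seed(0)
x, y = 0.0, 0.0
for _ in range(100):
    gx, gy = grad(x, y)
    x = x - 0.1 * (gx + random.gauss(0, 0.01))
    y = y - 0.1 * (gy + random.gauss(0, 0.01))
# |y| grows geometrically once perturbed; |x| stays near zero
```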
Practical tips
- Start with Adam, lr=1e-3 or 3e-4
- Use learning rate warmup for large models
- Monitor loss curves - should decrease smoothly
- If loss explodes, reduce learning rate
- Try SGD+momentum for final fine-tuning