Sigmoid was the standard. Then ReLU came along and made deep learning actually work. Such a simple function but it changed everything.
What is ReLU?
Rectified Linear Unit. Just:
$$f(x) = \max(0, x)$$
- Negative inputs → 0
- Positive inputs → unchanged
```python
def relu(x):
    return max(0, x)

# or vectorized with numpy
import numpy as np

def relu(x):
    return np.maximum(0, x)
```
That’s it. Why is this good?
The sigmoid problem
Before ReLU, networks used sigmoid: $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Looks nice but has issues.
Vanishing gradients
Sigmoid's derivative maxes out at 0.25. Backprop multiplies the layers' gradients together, so ten layers give at best 0.25^10 ≈ 0.00000095. The signal dies.
Deep networks couldn’t train. Gradients vanished before reaching early layers.
Saturation
Very negative or very positive inputs have gradient ≈ 0. Neurons get “stuck” and stop learning.
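Both problems are easy to see in a few lines of numpy. This is a standalone sketch, not from any framework:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 -- the best case, at x = 0
print(0.25 ** 10)          # ~9.5e-07 -- best-case gradient after 10 layers
print(sigmoid_grad(10.0))  # ~4.5e-05 -- a saturated neuron is effectively stuck
```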
Computationally expensive
Computing an exponential is slow compared to a single max.
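A rough micro-benchmark makes the point; the exact ratio depends on your machine and numpy build:

```python
import timeit

import numpy as np

x = np.random.randn(1_000_000)

relu_t = timeit.timeit(lambda: np.maximum(0, x), number=100)
sig_t = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)

print(f"relu:    {relu_t:.3f}s")
print(f"sigmoid: {sig_t:.3f}s")  # usually several times slower than relu
```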
ReLU solves these
Gradient is 1 for positive inputs
No matter how deep the network, the gradient for a positive activation passes through unchanged.
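A quick autograd sanity check (plain PyTorch, no actual network) shows the gradient surviving 50 stacked ReLUs:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x
for _ in range(50):        # 50 stacked ReLUs
    y = torch.relu(y)
y.backward()
print(x.grad)              # tensor(1.) -- arrives intact, no shrinking
```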
No saturation on positive side
Large positive values don’t squash gradient.
Stupid fast
Just a comparison. No exponential.
ReLU’s problem - dying neurons
Negative side has gradient 0. If neuron always outputs negative, it stops learning entirely.
“Dead ReLU” - neuron that never activates.
Happens when:
- Learning rate too high
- Bad initialization
- Unlucky input distribution
Badly tuned networks can end up with 20-40% of their neurons dead.
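Checking for this is cheap. A sketch under assumed names: `pre_acts` stands in for one layer's pre-activation outputs on a batch, and the -3 shift is contrived to manufacture dead units:

```python
import numpy as np

pre_acts = np.random.randn(1024, 256) - 3.0  # contrived shift to kill some units

# a unit is "dead" on this batch if it never goes positive for any input
active_ever = (pre_acts > 0).any(axis=0)
dead_frac = 1.0 - active_ever.mean()
print(f"dead neurons: {dead_frac:.0%}")      # ~25% with this shift
```

A batch is only an approximation; a truly dead neuron never fires on the whole dataset.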
Leaky ReLU
Small slope for negative values:
$$f(x) = \begin{cases} x & x > 0 \\ 0.01x & x \leq 0 \end{cases}$$
```python
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
```
Dead neurons can recover. Still fast.
Other variants
PReLU (Parametric ReLU)
Like Leaky, but alpha is a learned parameter (typically one per channel).
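In PyTorch that's one line; the 64 here is just an assumed channel count:

```python
import torch.nn as nn

# one learnable alpha per channel (64 channels assumed for illustration)
prelu = nn.PReLU(num_parameters=64, init=0.25)
```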
ELU (Exponential Linear Unit)
$$f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$
Smooth, pushes mean activations toward zero.
GELU (Gaussian Error Linear Unit)
$$f(x) = x \cdot \Phi(x)$$
where Φ is the CDF of the standard normal. Used in BERT and GPT.
Swish/SiLU
$$f(x) = x \cdot \sigma(x)$$
Google found it through automated search. Works well.
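Rough numpy versions of the three variants, just to make the formulas concrete (scipy supplies erf, which numpy lacks):

```python
import numpy as np
from scipy.special import erf

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def gelu(x):
    # exact form; frameworks often ship a faster tanh approximation
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def swish(x):
    return x / (1.0 + np.exp(-x))  # same as x * sigmoid(x)
```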
Which one to use?
For most cases: ReLU or Leaky ReLU
For transformers: GELU
For very deep networks: Check if dead neurons are a problem, switch to Leaky if so
Don’t overthink it. Difference is usually small.
Code examples
PyTorch:
```python
import torch.nn as nn

# In Sequential
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10),
)

# Other activations
nn.LeakyReLU(0.01)
nn.ELU()
nn.GELU()
```
TensorFlow:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(10),
])
```