Neural networks output raw numbers called “logits.” These could be anything: -500, 2.3, 47. For classification, we need probabilities between 0 and 1 that sum to 1. Softmax does exactly that conversion.

The formula

Given a vector z of logits:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

In plain English: take e^(each value), then divide by the total of all those exponentials. Now everything is positive and sums to 1!

import numpy as np

def softmax(z):
    exp_z = np.exp(z)           # exponentiate each logit
    return exp_z / exp_z.sum()  # normalize so the outputs sum to 1

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)  # [0.659, 0.242, 0.099]

Interactive demo: Softmax Animation - drag the logits around and watch the probabilities change.

What softmax actually does

Let’s build intuition:

  • Bigger logit → higher probability (the exponential amplifies differences)
  • Negative logit → small probability (but never zero!)
  • Ranking preserved - if logit A > logit B, then prob(A) > prob(B)
  • All outputs in [0, 1] and they always sum to 1

Think of it as a “soft” version of argmax. Instead of picking one winner, it gives every option a share of the probability, with bigger logits getting exponentially bigger shares.
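
A quick numeric check of these properties, reusing the softmax function defined above:

import numpy as np

z = np.array([3.0, 1.0, -2.0])
p = softmax(z)                       # softmax as defined above

print(p)                             # ~[0.876, 0.118, 0.006] - the negative logit still gets a sliver
print(p.sum())                       # 1.0
print(np.argmax(z) == np.argmax(p))  # True - ranking preserved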

Why use exponential?

We need a function that:

  1. Makes all numbers positive
  2. Preserves which one is biggest
  3. Is smooth and differentiable

The exponential nails all three. Plus, paired with cross-entropy loss, it makes the gradient remarkably clean (more on that below).
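
To see why a plain “divide by the raw sum” doesn’t work, compare it against the exponential version; a small illustrative sketch with arbitrary logits:

import numpy as np

z = np.array([2.0, -1.0, 0.5])

naive = z / z.sum()                  # [ 1.333, -0.667,  0.333] - a negative "probability", not valid
probs = np.exp(z) / np.exp(z).sum()  # [ 0.786,  0.039,  0.175] - positive, sums to 1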

Watch out: numerical stability!

Here’s a trap:

logits = [1000, 1001, 1002]
np.exp(logits)  # [inf, inf, inf] - overflow!

The fix is simple - subtract the max first:

def stable_softmax(z):
    z = z - np.max(z)  # now max is 0
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

This doesn’t change the result (the subtraction cancels out), but exp() never sees a number > 0. No overflow.
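
A quick check, assuming both softmax and stable_softmax from above are in scope:

big_logits = np.array([1000.0, 1001.0, 1002.0])

softmax(big_logits)         # [nan, nan, nan] - inf / inf from the overflow
stable_softmax(big_logits)  # [0.090, 0.245, 0.665] - same as softmax([0, 1, 2])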

Good news: PyTorch and TensorFlow handle this automatically.

Temperature: controlling confidence

Want sharper or softer probabilities? Add temperature:

$$\text{softmax}(z_i, T) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

Temperature effects:

  • T = 1 - normal softmax
  • T < 1 - sharper, more confident (winner takes more)
  • T > 1 - softer, more uniform (spread out)
  • T → 0 - approaches argmax (one-hot)
  • T → ∞ - approaches uniform

def softmax_with_temp(z, temperature=1.0):
    z = z / temperature
    return stable_softmax(z)
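
For example, reusing softmax_with_temp on the same logits as before:

logits = np.array([2.0, 1.0, 0.1])

softmax_with_temp(logits, temperature=1.0)  # [0.659, 0.242, 0.099] - unchanged
softmax_with_temp(logits, temperature=0.5)  # [0.864, 0.117, 0.019] - sharper
softmax_with_temp(logits, temperature=5.0)  # [0.400, 0.327, 0.273] - flatter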

You’ll see temperature in:

  • Knowledge distillation - soft labels from teacher model
  • Text generation - controlling randomness (higher T = more creative); see the sampling sketch after this list
  • Attention - sometimes used to sharpen focus
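
Here is a minimal sketch of temperature sampling for text generation; the vocabulary, logits, and the sample_token helper are made up purely for illustration:

import numpy as np

def sample_token(logits, temperature=1.0):
    # Turn logits into probabilities at the given temperature, then sample one index
    probs = softmax_with_temp(np.array(logits), temperature)
    return np.random.choice(len(probs), p=probs)

vocab = ["the", "cat", "sat", "flew"]   # hypothetical vocabulary
logits = [3.0, 2.5, 1.0, -1.0]          # hypothetical model outputs for the next token

print(vocab[sample_token(logits, temperature=0.2)])  # almost always "the"
print(vocab[sample_token(logits, temperature=2.0)])  # noticeably more varied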

Softmax vs other activations

  • Softmax - multiclass classification output
  • Sigmoid - binary classification or multi-label
  • ReLU - hidden layers

Softmax is for the output layer when you have mutually exclusive classes.

With cross-entropy loss

Almost always used together:

$$L = -\sum_i y_i \log(\text{softmax}(z_i))$$

Differentiate with respect to the logits and everything collapses: $$\frac{\partial L}{\partial z_i} = \text{softmax}(z_i) - y_i$$

Beautiful gradient. Just predicted minus actual.
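
You can check this with autograd; a small sketch with arbitrary logits and a one-hot target:

import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 1.0, 0.1], requires_grad=True)
y = torch.tensor([0.0, 1.0, 0.0])               # one-hot target: class 1

loss = -(y * F.log_softmax(z, dim=-1)).sum()    # cross-entropy written out by hand
loss.backward()

print(z.grad)                                   # matches softmax(z) - y
print(torch.softmax(z, dim=-1) - y)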

In code, use the combined function:

# PyTorch - these are equivalent; CrossEntropyLoss applies log_softmax + NLLLoss internally
import torch.nn as nn
import torch.nn.functional as F

loss1 = nn.CrossEntropyLoss()(logits, targets)
loss2 = nn.NLLLoss()(F.log_softmax(logits, dim=-1), targets)

Log softmax

We often want the log of the softmax probabilities:

$$\log\text{softmax}(z_i) = z_i - \log\sum_j e^{z_j}$$

Computing it this way is more numerically stable than taking log(softmax(z)).

# Bad - softmax can underflow to 0, so log(0) = -inf
log_probs = np.log(softmax(z))

# Good - PyTorch's fused version (here z is a torch tensor)
log_probs = F.log_softmax(z, dim=-1)
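
In plain numpy, the identity above looks roughly like this (log_softmax_np is just an illustrative helper using the same max-shift trick):

import numpy as np

def log_softmax_np(z):
    z = z - np.max(z)                   # shift for stability
    return z - np.log(np.exp(z).sum())  # z_i - log(sum_j e^{z_j})

z = np.array([2.0, 1.0, 0.1])
log_softmax_np(z)       # [-0.417, -1.417, -2.317]
np.log(softmax(z))      # same values here, but can hit -inf for extreme logits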

Softmax in attention

The attention mechanism applies softmax to the similarity scores:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This converts each row of similarity scores into attention weights that sum to 1.
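
A minimal single-head sketch in numpy, with random Q, K, V just to show the shapes (no batching, no masking):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity scores, shape (queries, keys)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)   # row-wise stable softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 4))   # 3 values

scaled_dot_product_attention(Q, K, V).shape   # (2, 4)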

Common confusions

“Softmax regression” = logistic regression for multiclass. The model is linear; softmax just converts the scores to probabilities.

Independent vs mutually exclusive (see the sketch after this list):

  • Mutually exclusive classes (cat OR dog) → softmax
  • Independent labels (has_cat AND has_dog) → sigmoid per label
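
To make the distinction concrete, with hypothetical logits for three classes/labels:

import torch

logits = torch.tensor([1.2, -0.3, 2.0])   # hypothetical scores

torch.softmax(logits, dim=-1)   # [0.290, 0.065, 0.645] - mutually exclusive, sums to 1
torch.sigmoid(logits)           # [0.769, 0.426, 0.881] - independent, each in (0, 1)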

Now you get softmax! Drop a star on ML Animations and share this with your study group!