How uncertain is a distribution? Entropy measures that. High entropy = hard to predict. Low entropy = predictable.
Definition
$$H(X) = -\sum_x P(x) \log P(x)$$
For continuous distributions, the analogue is differential entropy: $$h(X) = -\int p(x) \log p(x)\, dx$$
Intuition
Consider distribution over weather: {sunny, rainy, cloudy, snowy}
High entropy: Each equally likely (25% each). Very unpredictable. $$H = -4 \times 0.25 \log_2(0.25) = 2 \text{ bits}$$
Low entropy: Sunny 90%, others 3.33% each. Pretty predictable. $$H \approx 0.63 \text{ bits}$$
Zero entropy: Sunny 100%. No uncertainty. $$H = 0$$
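The three cases above can be checked numerically; scipy's entropy function with base=2 returns bits:

```python
import numpy as np
from scipy.stats import entropy

H_uniform = entropy([0.25, 0.25, 0.25, 0.25], base=2)   # 2.0 bits
H_skewed = entropy([0.9, 0.1/3, 0.1/3, 0.1/3], base=2)  # ~0.63 bits
H_certain = entropy([1.0, 0.0, 0.0, 0.0], base=2)       # 0.0 bits
print(H_uniform, H_skewed, H_certain)
```

scipy treats the 0 log 0 terms as 0, so the deterministic case comes out exactly zero.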
Properties
- Non-negative: H(X) ≥ 0
- Maximum for uniform: Among distributions over k outcomes, the uniform one has the highest entropy, log k
- Additive for independent: H(X, Y) = H(X) + H(Y)
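The additivity property can be sanity-checked with a small sketch (the two distributions are made up for illustration): when X and Y are independent, the joint distribution is the outer product of the marginals, and the joint entropy equals the sum.

```python
import numpy as np
from scipy.stats import entropy

p_x = np.array([0.5, 0.5])          # a fair coin
p_y = np.array([0.25, 0.25, 0.5])   # a skewed three-way choice
joint = np.outer(p_x, p_y)          # independence => joint factorizes

H_joint = entropy(joint.ravel(), base=2)
H_sum = entropy(p_x, base=2) + entropy(p_y, base=2)
print(H_joint, H_sum)  # both 2.5 bits
```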
Computing entropy
import numpy as np
from scipy.stats import entropy

# From a probability distribution
probs = [0.25, 0.25, 0.25, 0.25]
H = entropy(probs, base=2)  # in bits

# Manual calculation
def compute_entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]  # 0 log 0 is taken as 0, so drop zeros
    return -np.sum(probs * np.log2(probs))
In machine learning
Decision trees: Split to maximize information gain (reduce entropy).
$$\text{IG}(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$$
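As a sketch of the formula, here is information gain for one candidate binary split; the labels and the split itself are invented for illustration:

```python
import numpy as np
from scipy.stats import entropy

def label_entropy(labels):
    # Entropy of the empirical label distribution, in bits
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=2)  # scipy normalizes the counts

S = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # parent node labels: H(S) = 1 bit
left, right = S[:4], S[4:]               # candidate split A

ig = label_entropy(S) \
    - (len(left) / len(S)) * label_entropy(left) \
    - (len(right) / len(S)) * label_entropy(right)
print(ig)  # ~0.19 bits gained by this split
```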
Maximum entropy models: Among all distributions satisfying the constraints, choose the one with highest entropy (the least-committal choice).
Language models: Perplexity = 2^H, with H in bits. Measures how “surprised” the model is by the text.
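A minimal illustration of the perplexity relation, using a toy next-token distribution:

```python
import numpy as np
from scipy.stats import entropy

# A uniform distribution over 4 possible next tokens
probs = [0.25, 0.25, 0.25, 0.25]
H = entropy(probs, base=2)
perplexity = 2 ** H
print(perplexity)  # 4.0: as "surprised" as a fair 4-way guess
```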
Cross-entropy
Compare true distribution P with model Q:
$$H(P, Q) = -\sum_x P(x) \log Q(x)$$
Always H(P, Q) ≥ H(P). Equality when Q = P.
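A quick numerical check of that inequality, with a hypothetical true distribution P and model Q:

```python
import numpy as np

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) log2 Q(x); assumes q > 0 wherever p > 0
    p, q = np.array(p), np.array(q)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = [0.5, 0.5]  # true distribution
q = [0.9, 0.1]  # overconfident model

h_pp = cross_entropy(p, p)  # equals H(P) = 1.0 bit
h_pq = cross_entropy(p, q)  # strictly larger, since Q != P
print(h_pp, h_pq)
```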
This is the loss function for classification!
KL divergence
Difference between distributions:
$$D_{KL}(P || Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Not symmetric! D(P||Q) ≠ D(Q||P)
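The asymmetry is easy to demonstrate: when scipy's entropy function is given two distributions, it computes the KL divergence between them.

```python
from scipy.stats import entropy

p = [0.5, 0.5]
q = [0.9, 0.1]

d_pq = entropy(p, q, base=2)  # D(P || Q)
d_qp = entropy(q, p, base=2)  # D(Q || P)
print(d_pq, d_qp)  # two different numbers
```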
Used in:
- VAE loss
- Information bottleneck
- Bayesian inference
Mutual information
How much knowing X tells you about Y:
$$I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$$
Symmetric. Zero exactly when X and Y are independent.
Used for feature selection, representation learning.
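Mutual information can be computed from a joint distribution via the equivalent identity I(X; Y) = H(X) + H(Y) − H(X, Y); the 2×2 joint table below is made up for illustration:

```python
import numpy as np
from scipy.stats import entropy

# Joint distribution P(X, Y) as a 2x2 table (rows: X, columns: Y)
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)  # marginal of X
p_y = joint.sum(axis=0)  # marginal of Y

# I(X; Y) = H(X) + H(Y) - H(X, Y)
mi = entropy(p_x, base=2) + entropy(p_y, base=2) - entropy(joint.ravel(), base=2)
print(mi)  # positive: knowing X tells you something about Y
```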