How uncertain is a distribution? Entropy measures that. High entropy = hard to predict. Low entropy = predictable.
What is entropy, really?
Think of entropy as a measure of “surprise.” If you flip a fair coin, you’re genuinely uncertain about the outcome - that’s high entropy. But if you have a coin that lands heads 99% of the time, you’re rarely surprised - that’s low entropy.
The formula looks scary but the idea is simple: we’re averaging how surprised we’d be across all possible outcomes.
$$H(X) = -\sum_x P(x) \log P(x)$$
For continuous distributions, the sum becomes an integral over the probability density and the quantity is called differential entropy:
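$$h(X) = -\int p(x) \log p(x)\, dx$$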
Try the interactive demo: Entropy Animation - slide the probabilities around and watch entropy change in real time.
Building intuition
Let’s make this concrete with a weather example. Imagine predicting tomorrow’s weather from {sunny, rainy, cloudy, snowy}.
High entropy (2 bits): Each outcome has 25% chance. You really have no idea what’s coming. Maximum confusion!
$$H = -4 \times 0.25 \log_2(0.25) = 2 \text{ bits}$$
Low entropy (~0.63 bits): Sunny 90%, the other three 3.33% each. You can confidently guess “sunny” and be right most of the time.
$$H = -0.9 \log_2(0.9) - 3 \times 0.0333 \log_2(0.0333) \approx 0.63 \text{ bits}$$
Zero entropy: Sunny 100%. No uncertainty at all - you always know exactly what will happen.
$$H = 0$$
The “bits” unit tells you how many yes/no questions you’d need to identify the outcome. With 4 equally likely options, you need 2 questions. With one dominant option, you barely need to ask anything.
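If you want to check these numbers yourself, here’s a quick sketch with scipy.stats.entropy (the full Python walkthrough comes a little later):

```python
from scipy.stats import entropy

# The three weather forecasts over {sunny, rainy, cloudy, snowy}
uniform   = [0.25, 0.25, 0.25, 0.25]      # maximum confusion
confident = [0.9, 0.1/3, 0.1/3, 0.1/3]    # mostly sunny
certain   = [1.0, 0.0, 0.0, 0.0]          # always sunny

for name, p in [("uniform", uniform), ("confident", confident), ("certain", certain)]:
    print(name, entropy(p, base=2))       # 2.0, ~0.63, 0.0 bits
```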
Properties
- Non-negative: H(X) ≥ 0 (you can’t have negative uncertainty)
- Maximum for uniform: when all k outcomes are equally likely, uncertainty is highest (H = log k)
- Additive for independent events: H(X, Y) = H(X) + H(Y), since knowing about one doesn’t help with the other
Computing entropy in Python
```python
import numpy as np
from scipy.stats import entropy

# From a probability distribution
probs = [0.25, 0.25, 0.25, 0.25]
H = entropy(probs, base=2)  # in bits

# Manual calculation
def compute_entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]  # avoid log(0)
    return -np.sum(probs * np.log2(probs))
```
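As a quick numerical sanity check on the properties listed above, a small sketch using the same scipy helper:

```python
import numpy as np
from scipy.stats import entropy

# Maximum for uniform: H is largest when all k outcomes are equally likely
k = 8
print(entropy([1/k] * k, base=2), np.log2(k))       # both 3.0 bits

# Additive for independent events: H(X, Y) = H(X) + H(Y)
p_x = np.array([0.7, 0.3])
p_y = np.array([0.5, 0.25, 0.25])
joint = np.outer(p_x, p_y).ravel()                  # joint distribution when X, Y are independent
print(entropy(joint, base=2))                       # H(X, Y)
print(entropy(p_x, base=2) + entropy(p_y, base=2))  # H(X) + H(Y) -- same value
```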
Why ML engineers care about entropy
Decision trees: When building a decision tree, we want splits that reduce uncertainty the most. Information gain measures how much a split decreases entropy:
$$\text{IG}(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$$
A good split creates child nodes with lower entropy (purer groups).
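To make the formula concrete, here’s a minimal sketch for a two-way split; `information_gain` and the toy labels are illustrative, not from any particular library:

```python
import numpy as np
from scipy.stats import entropy

def information_gain(parent, left, right):
    """IG(S, A) = H(S) - sum over children of |S_v|/|S| * H(S_v), for a two-way split."""
    def H(labels):
        _, counts = np.unique(labels, return_counts=True)
        return entropy(counts, base=2)   # scipy normalizes counts into probabilities
    n = len(parent)
    weighted = len(left) / n * H(left) + len(right) / n * H(right)
    return H(parent) - weighted

# Parent node: 5 positives, 5 negatives -> H(S) = 1 bit
parent = [1] * 5 + [0] * 5
left   = [1, 1, 1, 1, 0]   # mostly positive child
right  = [0, 0, 0, 0, 1]   # mostly negative child
print(information_gain(parent, left, right))   # ~0.28 bits of uncertainty removed
```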
Maximum entropy models: When you only have partial information about a distribution, the safest bet is the distribution with the highest entropy that is still consistent with what you know. It makes the fewest extra assumptions.
Language models: Perplexity = 2^H tells us how “surprised” a model is by text. Lower perplexity = better model. A perplexity of 100 means the model is as confused as if choosing uniformly from 100 words at each step.
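As a toy illustration (the per-token probabilities below are made up, not from a real model), perplexity is just the average per-token cross-entropy exponentiated back:

```python
import numpy as np

# Hypothetical probabilities a language model assigned to the tokens that actually occurred
token_probs = np.array([0.2, 0.05, 0.5, 0.1])

H_bits = -np.mean(np.log2(token_probs))   # average cross-entropy in bits per token
perplexity = 2 ** H_bits
print(H_bits, perplexity)                 # ~2.74 bits/token -> perplexity ~6.7

# A model choosing uniformly among 100 words has perplexity exactly 100
print(2 ** (-np.log2(1 / 100)))           # 100.0
```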
Cross-entropy: comparing distributions
What if we have a true distribution P but we’re modeling it with Q?
$$H(P, Q) = -\sum_x P(x) \log Q(x)$$
Cross-entropy is always ≥ H(P), with equality only when Q = P perfectly.
This is why cross-entropy is THE loss function for classification! We’re measuring how well our predicted probabilities Q match the true labels P.
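Here’s a minimal sketch for a single example with a one-hot label, in plain NumPy rather than any framework’s loss API:

```python
import numpy as np

P = np.array([0.0, 1.0, 0.0])   # true label (one-hot): the correct class is class 1
Q = np.array([0.1, 0.7, 0.2])   # model's predicted probabilities

loss = -np.sum(P * np.log(Q))   # only the true-class term survives: -log(0.7)
print(loss)                     # ~0.357 nats; lower means Q matches P better

# A (nearly) perfect prediction drives the loss toward H(P) = 0
Q_perfect = np.array([1e-12, 1.0 - 2e-12, 1e-12])   # tiny epsilons avoid log(0)
print(-np.sum(P * np.log(Q_perfect)))               # ~0.0
```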
KL divergence: the gap between distributions
How different are P and Q?
$$D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Important: KL divergence is NOT symmetric! D(P|Q) ≠ D(Q|P). The direction matters.
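You can see the asymmetry directly with scipy.stats.entropy, which returns the KL divergence when you pass a second distribution:

```python
import numpy as np
from scipy.stats import entropy

P = np.array([0.5, 0.4, 0.1])
Q = np.array([0.8, 0.1, 0.1])

print(entropy(P, Q))   # D_KL(P || Q), in nats by default
print(entropy(Q, P))   # D_KL(Q || P) -- a different number
```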
You’ll see KL divergence in:
- VAE loss (regularizing the latent space)
- Information bottleneck
- Bayesian inference (comparing posteriors)
Mutual information: shared knowledge
How much does knowing X tell you about Y?
$$I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$$
This one IS symmetric. It’s zero when X and Y are independent (knowing one tells you nothing about the other).
Used for feature selection (which features tell us most about the target?) and representation learning.
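Here’s a small sketch that computes I(X; Y) from a made-up joint probability table, using the equivalent identity I(X; Y) = H(X) + H(Y) - H(X, Y):

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical joint distribution P(X, Y) for two binary variables
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])

p_x = joint.sum(axis=1)   # marginal P(X)
p_y = joint.sum(axis=0)   # marginal P(Y)

I = entropy(p_x, base=2) + entropy(p_y, base=2) - entropy(joint.ravel(), base=2)
print(I)                  # > 0 bits: X and Y share information

# If X and Y were independent, mutual information would be zero
indep = np.outer(p_x, p_y)
print(entropy(p_x, base=2) + entropy(p_y, base=2) - entropy(indep.ravel(), base=2))  # ~0.0
```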
Information theory clicked? Star the ML Animations repo and share this with other curious minds!