How uncertain is a distribution? Entropy measures that. High entropy = hard to predict. Low entropy = predictable.

Definition

$$H(X) = -\sum_x P(x) \log P(x)$$

For continuous distributions, the sum becomes an integral, giving the differential entropy: $$h(X) = -\int p(x) \log p(x) \, dx$$

Intuition

Consider a distribution over weather: {sunny, rainy, cloudy, snowy}

High entropy: Each equally likely (25% each). Very unpredictable. $$H = -4 \times 0.25 \log_2(0.25) = 2 \text{ bits}$$

Low entropy: Sunny 90%, others 3.33% each. Pretty predictable. $$H \approx 0.63 \text{ bits}$$

Zero entropy: Sunny 100%. No uncertainty. $$H = 0$$
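A quick sketch that reproduces the three cases above (assuming the remaining 10% in the low-entropy case is split evenly among the other three outcomes):

```python
import numpy as np

def H(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

uniform = [0.25, 0.25, 0.25, 0.25]       # high entropy: maximally unpredictable
skewed  = [0.9, 0.1/3, 0.1/3, 0.1/3]     # low entropy: mostly sunny
certain = [1.0, 0.0, 0.0, 0.0]           # zero entropy: no uncertainty

print(H(uniform))           # 2.0
print(round(H(skewed), 2))  # 0.63
print(H(certain))           # 0.0
```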

Properties

  1. Non-negative: H(X) ≥ 0

  2. Maximum for uniform: Among distributions over k outcomes, uniform has highest entropy (log k)

  3. Additive for independent: H(X, Y) = H(X) + H(Y)
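The second and third properties are easy to check numerically; a quick sketch with arbitrary example distributions (a fair coin and a made-up three-outcome distribution):

```python
import numpy as np

def H(probs):
    """Shannon entropy in bits."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

px = np.array([0.5, 0.5])        # fair coin
py = np.array([0.7, 0.2, 0.1])   # arbitrary three-outcome distribution

# For independent X, Y the joint is the outer product of the marginals
joint = np.outer(px, py).ravel()

print(H(joint), H(px) + H(py))   # equal: H(X, Y) = H(X) + H(Y)
print(H(py) <= np.log2(3))       # True: uniform bound H <= log k
```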

Computing entropy

import numpy as np
from scipy.stats import entropy

# From probability distribution
probs = [0.25, 0.25, 0.25, 0.25]
H = entropy(probs, base=2)  # in bits

# Manual calculation
def compute_entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]  # avoid log(0)
    return -np.sum(probs * np.log2(probs))
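As a sanity check, the SciPy call and the manual function should agree; here on a distribution whose entropy is exactly 1.75 bits (definitions repeated so the snippet is self-contained):

```python
import numpy as np
from scipy.stats import entropy

def compute_entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]  # avoid log(0)
    return -np.sum(probs * np.log2(probs))

p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p, base=2))     # 1.75
print(compute_entropy(p))     # 1.75
```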

In machine learning

Decision trees: Split to maximize information gain (reduce entropy).

$$\text{IG}(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$$
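A sketch of information gain for a single split; the class counts below are invented for illustration (a parent node with 9 positive and 5 negative examples, split into two children by some hypothetical attribute):

```python
import numpy as np

def H(probs):
    """Shannon entropy in bits."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def entropy_from_counts(counts):
    counts = np.asarray(counts, dtype=float)
    return H(counts / counts.sum())

parent = [9, 5]                  # class counts at the parent node
children = [[6, 1], [3, 4]]      # class counts after a hypothetical split

n = sum(sum(c) for c in children)
ig = entropy_from_counts(parent) - sum(
    (sum(c) / n) * entropy_from_counts(c) for c in children
)
print(round(ig, 3))              # 0.152
```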

Maximum entropy models: Among all distributions satisfying the given constraints, choose the one with highest entropy. It encodes the constraints while assuming nothing extra.

Language models: Perplexity = 2^H, where H is the model's average per-token cross-entropy in bits. It measures how "surprised" the model is: perplexity 128 means the model is, on average, as uncertain as if it were choosing uniformly among 128 tokens.

Cross-entropy

Compare true distribution P with model Q:

$$H(P, Q) = -\sum_x P(x) \log Q(x)$$

Always H(P, Q) ≥ H(P). Equality when Q = P.

This is the loss function for classification!
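A minimal sketch of cross-entropy in bits, checking the inequality above on made-up distributions:

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

print(cross_entropy(p, q))      # > H(P): the model pays for its mismatch
print(cross_entropy(p, p))      # equals H(P): perfect model
```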

KL divergence

Difference between distributions:

$$D_{KL}(P || Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

Not symmetric! D(P||Q) ≠ D(Q||P)

Used in:

  • VAE loss
  • Information bottleneck
  • Bayesian inference
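The asymmetry is easy to check numerically; a sketch with made-up distributions:

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl(p, q), kl(q, p))   # two different values: KL is not symmetric
print(kl(p, p))             # 0.0: zero for identical distributions
```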

Mutual information

How much knowing X tells you about Y:

$$I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$$

Symmetric. Zero if and only if X and Y are independent.

Used for feature selection, representation learning.
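Mutual information can be computed from a joint table via the equivalent identity I(X; Y) = H(X) + H(Y) - H(X, Y); the joint distribution below is made up for illustration:

```python
import numpy as np

def H(probs):
    """Shannon entropy in bits; accepts a joint table or a marginal vector."""
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution P(X, Y): rows are X, columns are Y
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])

px = joint.sum(axis=1)   # marginal of X
py = joint.sum(axis=0)   # marginal of Y

mi = H(px) + H(py) - H(joint)
print(round(mi, 3))      # 0.256: knowing X reduces uncertainty about Y

# Independent case: joint is the outer product of the marginals, so I = 0
indep = np.outer(px, py)
print(H(px) + H(py) - H(indep))   # ~0 (up to float error)
```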