Two numbers capture the essence of a distribution: expected value (center) and variance (spread).
Expected value
The “average” value: a sum of outcomes, each weighted by its probability.
Discrete: $$E[X] = \sum_x x \cdot P(X=x)$$
Continuous: $$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx$$
Example
Fair die: $$E[X] = 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + \dots + 6\left(\tfrac{1}{6}\right) = \frac{21}{6} = 3.5$$
You never actually roll 3.5, but it’s the long-run average.
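A quick NumPy check (not from the original notes) of both views: the weighted sum over outcomes, and the long-run average of simulated rolls.

```python
import numpy as np

# Expected value as a probability-weighted sum of outcomes
outcomes = np.arange(1, 7)
probs = np.full(6, 1 / 6)
ev = np.sum(outcomes * probs)  # 3.5

# The long-run average of simulated rolls approaches the same value
rolls = np.random.default_rng(0).integers(1, 7, size=100_000)
print(ev, rolls.mean())
```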
Properties
Linearity: $$E[aX + b] = aE[X] + b$$ $$E[X + Y] = E[X] + E[Y]$$
Always true, even if X and Y are dependent.
For independent X, Y: $$E[XY] = E[X] \cdot E[Y]$$
Only for independent variables!
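Both properties can be checked by simulation. A sketch (the distributions chosen here are arbitrary, not from the text): linearity holds for any X, while the product rule is verified with an X and Y drawn independently.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
X = rng.normal(2.0, 1.0, n)    # arbitrary example distribution
Y = rng.exponential(3.0, n)    # drawn independently of X

# Linearity: E[aX + b] = a E[X] + b (holds exactly for sample means too)
a, b = 2.0, 5.0
print(np.mean(a * X + b), a * np.mean(X) + b)

# Product rule: E[XY] ≈ E[X] E[Y], only because X and Y are independent
print(np.mean(X * Y), np.mean(X) * np.mean(Y))
```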
Variance
How spread out the distribution is.
$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$
Standard deviation: $\sigma = \sqrt{\text{Var}(X)}$
```python
import numpy as np

data = [1, 2, 3, 4, 5]
mean = np.mean(data)
var = np.var(data)  # population variance; use np.var(data, ddof=1) for sample variance
std = np.std(data)
```
Variance properties
Scaling: $$\text{Var}(aX) = a^2 \text{Var}(X)$$
Shift: $$\text{Var}(X + b) = \text{Var}(X)$$
Sum (independent): $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$
If not independent, need covariance: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$$
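These properties can also be verified numerically. A sketch with independent X and Y (example distributions are my choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
X = rng.normal(0, 2, n)
Y = rng.uniform(-1, 1, n)  # independent of X
a, b = 3.0, 10.0

print(np.var(a * X), a**2 * np.var(X))       # scaling: equal
print(np.var(X + b), np.var(X))              # shift: equal
print(np.var(X + Y), np.var(X) + np.var(Y))  # sum: approx equal (independent)
```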
Covariance
Measure of joint variability:
$$\text{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$
- Positive: X and Y move together
- Negative: X up means Y down
- Zero: no linear relationship
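A small simulation (not from the original notes) showing the sign of the covariance for constructed positive and negative relationships:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=100_000)
noise = rng.normal(size=100_000)

pos = x + 0.5 * noise    # moves with x  -> positive covariance
neg = -x + 0.5 * noise   # moves against x -> negative covariance

# np.cov returns a 2x2 matrix; [0, 1] is Cov(first, second)
print(np.cov(x, pos)[0, 1], np.cov(x, neg)[0, 1])
```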
Common distributions
| Distribution | E[X] | Var(X) |
|---|---|---|
| Bernoulli(p) | p | p(1-p) |
| Binomial(n,p) | np | np(1-p) |
| Poisson(λ) | λ | λ |
| Normal(μ,σ²) | μ | σ² |
| Uniform(a,b) | (a+b)/2 | (b-a)²/12 |
| Exponential(λ) | 1/λ | 1/λ² |
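Two rows of the table can be spot-checked by sampling (a sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Poisson(λ): mean and variance are both λ
lam = 4.0
s = rng.poisson(lam, n)
print(s.mean(), s.var())  # both ≈ 4

# Uniform(a, b): mean (a+b)/2, variance (b-a)²/12
a, b = 2.0, 8.0
u = rng.uniform(a, b, n)
print(u.mean(), u.var())  # ≈ 5 and ≈ 3
```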
In machine learning
Loss functions: Minimize expected loss: $$\min_\theta E_{(x,y)}[\text{Loss}(f_\theta(x), y)]$$
Bias-variance tradeoff (for expected squared error): $$E[\text{error}^2] = \text{Bias}^2 + \text{Variance} + \text{Noise}$$
Monte Carlo estimation: $$E[f(X)] \approx \frac{1}{n}\sum_{i=1}^n f(x_i)$$
Sample mean converges to expected value.
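A minimal Monte Carlo sketch (my example, not from the text): estimate $E[X^2]$ for $X \sim \mathcal{N}(0,1)$, whose true value is $\text{Var}(X) + E[X]^2 = 1$.

```python
import numpy as np

# Monte Carlo estimate of E[f(X)] with f(x) = x², X ~ N(0, 1)
rng = np.random.default_rng(3)
samples = rng.normal(0, 1, 200_000)
estimate = np.mean(samples**2)
print(estimate)  # ≈ 1
```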
Law of Large Numbers
Sample mean → expected value as n → ∞
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \to E[X]$$
Why training on more data helps: better estimate of true expected loss.
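The convergence can be watched directly. A sketch using the die example from earlier: the running mean wanders at first, then settles near 3.5.

```python
import numpy as np

# Running sample mean of die rolls converges to E[X] = 3.5
rng = np.random.default_rng(5)
rolls = rng.integers(1, 7, size=100_000)
running_mean = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)
print(running_mean[99], running_mean[-1])  # the later estimate is far closer to 3.5
```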
Central Limit Theorem
The standardized mean of many independent, identically distributed random variables → Normal distribution
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to \mathcal{N}(0,1)$$
Why Normal shows up everywhere.
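A sketch of the CLT in action (my example): even though Exponential(1) is heavily skewed, standardized means of 200 draws behave like $\mathcal{N}(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma, n = 1.0, 1.0, 200  # Exponential(1) has mean 1 and std 1

# 50,000 sample means, each over n draws, then standardized
means = rng.exponential(1.0, size=(50_000, n)).mean(axis=1)
z = (means - mu) / (sigma / np.sqrt(n))
print(z.mean(), z.std())  # ≈ 0 and ≈ 1, as N(0, 1) predicts
```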
Computing from data
```python
import numpy as np

# X, Y: 1-D arrays of paired observations
mean = np.mean(X)
var = np.var(X, ddof=1)  # ddof=1 for the unbiased sample variance

# 2x2 covariance matrix; off-diagonal entries are Cov(X, Y)
cov_matrix = np.cov(X, Y)

# Correlation (covariance normalized to [-1, 1])
corr = np.corrcoef(X, Y)
```
Moment generating functions
More advanced: the MGF encodes all moments of a distribution.
$$M_X(t) = E[e^{tX}]$$
Derivatives at t=0 give moments:
- $M_X'(0) = E[X]$
- $M_X''(0) = E[X^2]$
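A worked example (my addition, not from the text), for $X \sim \text{Bernoulli}(p)$:

$$M_X(t) = E[e^{tX}] = (1-p) + p e^t$$
$$M_X'(t) = p e^t \implies M_X'(0) = p = E[X]$$
$$M_X''(0) = p = E[X^2]$$

so $\text{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)$, matching the table above.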