Learn the value of actions without knowing environment dynamics. Q-learning: simple, powerful, the foundation of DQN and modern deep RL.
The Q-function
Q(s, a) = expected return starting from state s, taking action a, then acting optimally.
If we know Q*, we know optimal policy: $$\pi^(s) = \arg\max_a Q^(s, a)$$
Just pick the action with highest Q-value.
The update rule
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a’} Q(s’, a’) - Q(s, a) \right]$$
Move Q toward observed return.
$r + \gamma \max_{a’} Q(s’, a’)$ is the target: immediate reward plus best future.
$\alpha$ is learning rate.
Watch Q-values converge: Q-Learning Animation
Why it works
This is temporal difference learning. Instead of waiting for episode end:
- Observe one transition
- Update Q using estimated future value
- Repeat
Off-policy: learns optimal Q regardless of what policy generated the data.
Algorithm
def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
Q = defaultdict(lambda: np.zeros(env.action_space.n))
for episode in range(num_episodes):
state = env.reset()
done = False
while not done:
# Epsilon-greedy action selection
if np.random.random() < epsilon:
action = env.action_space.sample()
else:
action = np.argmax(Q[state])
next_state, reward, done, _ = env.step(action)
# Q-learning update
best_next = np.max(Q[next_state]) if not done else 0
td_target = reward + gamma * best_next
Q[state][action] += alpha * (td_target - Q[state][action])
state = next_state
return Q
Key components
Epsilon-greedy
Balance exploration and exploitation:
- With probability ε: random action (explore)
- With probability 1-ε: best action (exploit)
Common: start ε high, decay over time.
Learning rate
Too high: unstable, oscillates Too low: learns slowly
Often decayed: $\alpha_t = 1/(1 + t)$
Discount factor
γ close to 1: values long-term rewards γ close to 0: myopic, immediate rewards only
Example: Cliff Walking
S . . . . . . . . G
C C C C C C C C C C
S = start, G = goal, C = cliff (big negative reward)
Q-learning finds optimal path: hug the cliff for shortest route.
SARSA (on-policy) learns safer path: stay away from cliff.
Deep Q-Networks (DQN)
Q-table doesn’t scale. Use neural network:
$$Q(s, a; \theta) \approx Q^*(s, a)$$
Challenges:
- Correlated samples → experience replay
- Moving target → target network
class DQN(nn.Module):
def __init__(self, state_dim, action_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, 128),
nn.ReLU(),
nn.Linear(128, 128),
nn.ReLU(),
nn.Linear(128, action_dim)
)
def forward(self, state):
return self.net(state) # Q-values for all actions
Limitations
- Overestimates Q-values (max over noise)
- Struggles with large/continuous action spaces
- Sample inefficient
Fixes: Double DQN, Dueling DQN, Rainbow…