Q-learning and Deep Q-Networks (DQN) represent foundational and advanced techniques in reinforcement learning that enable agents to learn optimal policies through interaction with an environment.
Q-learning is a model-free, value-based algorithm that updates action-value estimates (Q-values) iteratively to guide decision-making.
DQN extends Q-learning by integrating deep neural networks to approximate Q-values in complex, high-dimensional state spaces, enabling reinforcement learning applications in environments with large or continuous inputs such as images.
Q-learning is an off-policy reinforcement learning algorithm focused on learning the value of taking a given action in a given state, expressed as the Q-function Q(s, a).
1. The main objective is to find the optimal action-value function Q*(s, a), which gives the maximum expected cumulative discounted reward achievable from each state-action pair.
2. It updates Q-values iteratively using the Bellman update rule:

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

Where:
- s and a are the current state and the action taken,
- r is the reward received after taking action a in state s,
- s' is the resulting next state, with a' ranging over the actions available there,
- α is the learning rate, and
- γ is the discount factor.
Q-learning balances exploration and exploitation through action-selection strategies such as ε-greedy, which takes a random action with probability ε and the current best (greedy) action otherwise.
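As a concrete illustration, here is a minimal tabular Q-learning sketch. It assumes a Gymnasium-style environment with small discrete state and action spaces; the environment name, function name, and hyperparameter values are illustrative choices, not part of the algorithm's definition.

```python
import numpy as np
import gymnasium as gym  # assumed setup; any Gym-style env with discrete spaces works

def tabular_q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Classical Q-learning with an epsilon-greedy behaviour policy (illustrative sketch)."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))  # tabular Q-value estimates

    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q

# Example usage (FrozenLake has small discrete state and action spaces):
# Q = tabular_q_learning(gym.make("FrozenLake-v1"))
```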
Limitations of Classical Q-learning
1. It relies on a discrete, tabular representation of the state-action space, which is impractical for large or continuous spaces.
2. It requires storing and updating a Q-value for every state-action pair, leading to scalability challenges; for raw inputs such as images, the number of possible states is far too large for the table to even be enumerated.
DQN overcomes Q-learning’s limitations by using deep neural networks as function approximators to estimate the Q-value function.
1. The neural network takes raw state inputs (e.g., images) and outputs Q-values for all possible actions.
2. Enables RL in environments with high-dimensional inputs such as Atari games or robotics.
Key innovations in DQN include:
1. Experience replay: transitions are stored in a replay buffer and sampled in random mini-batches, which breaks the correlation between consecutive samples and reuses past experience.
2. Target network: a periodically updated copy of the online network provides the Q-value targets, which stabilizes training.
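The sketch below shows what these two components might look like, assuming PyTorch and a vector-valued state. The class names QNetwork and ReplayBuffer are hypothetical, and for image inputs the linear layers would typically be replaced by convolutional ones.

```python
import random
from collections import deque, namedtuple

import torch
import torch.nn as nn

Transition = namedtuple("Transition", "state action reward next_state done")

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (..., n_actions)

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled uniformly at random."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int) -> list[Transition]:
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```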
The DQN Algorithm Steps
Listed below are the foundational procedures that guide Deep Q-Network training. These steps describe how experiences are collected, stored, sampled, and used to refine the network's predictions; a training-loop sketch follows the list.
1. Initialize online Q-network and target network with random weights.
2. Observe the current state s.
3. Select an action a with an ε-greedy policy over the online network's Q-values for s.
4. Execute the action in the environment and observe the reward r and the next state s'.
5. Store the transition (s, a, r, s') in the replay buffer.
6. Sample mini-batch of transitions from replay buffer.
7. Compute target Q-values with the target network and update the online network via gradient descent.
8. Periodically update the target network with online network weights.
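Putting the steps together, a minimal training-loop sketch might look like the following. It reuses the hypothetical QNetwork and ReplayBuffer classes from the previous sketch, again assumes a Gymnasium-style environment with a vector observation and discrete actions, and uses illustrative hyperparameters; it is an outline of the procedure, not a tuned implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F
import gymnasium as gym  # assumed environment library

# QNetwork and ReplayBuffer are the hypothetical classes sketched above.

def train_dqn(env, episodes=500, gamma=0.99, lr=1e-3, batch_size=64,
              epsilon=0.1, target_sync_every=1000):
    state_dim = env.observation_space.shape[0]
    n_actions = env.action_space.n

    online_net = QNetwork(state_dim, n_actions)        # step 1: online and target networks
    target_net = QNetwork(state_dim, n_actions)
    target_net.load_state_dict(online_net.state_dict())
    optimizer = torch.optim.Adam(online_net.parameters(), lr=lr)
    buffer = ReplayBuffer()
    step_count = 0

    for _ in range(episodes):
        state, _ = env.reset()                          # step 2: observe the current state
        done = False
        while not done:
            # step 3: epsilon-greedy action selection from the online network
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q_values = online_net(torch.as_tensor(state, dtype=torch.float32))
                    action = int(q_values.argmax())

            # steps 4-5: act, observe, and store the transition
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.push(state, action, reward, next_state, terminated)
            state = next_state
            step_count += 1

            if len(buffer) < batch_size:
                continue

            # step 6: sample a mini-batch of transitions from the replay buffer
            batch = buffer.sample(batch_size)
            states = torch.as_tensor(np.array([t.state for t in batch]), dtype=torch.float32)
            actions = torch.as_tensor([t.action for t in batch]).unsqueeze(1)
            rewards = torch.as_tensor([t.reward for t in batch], dtype=torch.float32)
            next_states = torch.as_tensor(np.array([t.next_state for t in batch]), dtype=torch.float32)
            dones = torch.as_tensor([t.done for t in batch], dtype=torch.float32)

            # step 7: targets from the target network, gradient step on the online network
            q_pred = online_net(states).gather(1, actions).squeeze(1)
            with torch.no_grad():
                q_next = target_net(next_states).max(dim=1).values
                q_target = rewards + gamma * q_next * (1.0 - dones)
            loss = F.smooth_l1_loss(q_pred, q_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # step 8: periodically sync the target network with the online weights
            if step_count % target_sync_every == 0:
                target_net.load_state_dict(online_net.state_dict())

    return online_net

# Example usage:
# net = train_dqn(gym.make("CartPole-v1"))
```

Note that the bootstrap term is masked with the termination flag, so transitions into terminal states contribute only their immediate reward to the target.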
Achievements of DQN
1. Successfully applied in challenging domains like Atari 2600 games, achieving human-level performance.
2. Capable of handling raw pixel inputs and learning directly from high-dimensional data.
3. Paved the way for numerous extensions and improvements in deep reinforcement learning.