Deep Reinforcement Learning with Double Q-learning

Abstract: The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.

Synopsis

Overview

  • Keywords: Deep Reinforcement Learning, Double Q-learning, Q-learning, Atari 2600, Overestimation
  • Objective: Investigate whether overestimation in Q-learning occurs in practice and harms performance, and propose a remedy by adapting Double Q-learning to large-scale function approximation.
  • Hypothesis: Overestimations in Q-learning negatively affect performance, and these can be mitigated using Double Q-learning techniques.

Background

  • Preliminary Theories:

    • Q-learning: A reinforcement learning algorithm that learns action values (estimates of expected cumulative reward) and derives a policy by acting greedily with respect to them.
    • Overestimation Bias: A systematic tendency for estimated action values to exceed the true values, which arises when the same values are used both to select and to evaluate the maximizing action, and which can lead to suboptimal policies.
    • Double Q-learning: An extension of Q-learning that maintains two value functions to decouple action selection from action evaluation, reducing overestimation bias (see the tabular sketch at the end of this Background section).
  • Prior Research:

    • 1993: Thrun and Schwartz identified issues with function approximation in reinforcement learning, highlighting the risks of overestimation.
    • 2010: Van Hasselt introduced Double Q-learning to address overestimation in tabular settings.
    • 2015: Mnih et al. developed the Deep Q-Network (DQN), which combines Q-learning with a deep convolutional network and achieved strong performance on Atari games, but remained susceptible to overestimation.
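
To make the selection/evaluation distinction concrete, below is a minimal tabular sketch of the standard Q-learning update next to the Double Q-learning update of van Hasselt (2010). The state and action space sizes, learning rate `alpha`, and discount `gamma` are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4   # illustrative sizes, not from the paper
alpha, gamma = 0.1, 0.99      # illustrative hyperparameters

# Standard Q-learning: the same table both selects and evaluates the
# next action, which is the source of the overestimation bias.
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Double Q-learning (van Hasselt, 2010): two tables; one selects the
# argmax action, the other evaluates it, decoupling selection from evaluation.
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))

def double_q_learning_update(s, a, r, s_next):
    if rng.random() < 0.5:
        a_star = QA[s_next].argmax()             # select with QA
        target = r + gamma * QB[s_next, a_star]  # evaluate with QB
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        a_star = QB[s_next].argmax()             # select with QB
        target = r + gamma * QA[s_next, a_star]  # evaluate with QA
        QB[s, a] += alpha * (target - QB[s, a])
```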

Methodology

  • Key Ideas:

    • Double DQN: A modified version of DQN that implements Double Q-learning principles: the online network selects the greedy action and the target network evaluates it (see the sketch at the end of this Methodology section).
    • Experience Replay: A technique where past experiences are stored and sampled to improve learning stability and efficiency.
    • Target Network: A separate network that stabilizes learning by providing consistent target values during updates.
  • Experiments:

    • Evaluated on the Atari 2600 games using the Arcade Learning Environment.
    • Compared performance metrics between DQN and Double DQN across multiple games, measuring the impact of overestimation on policy quality.
    • Compared the networks' value estimates with the actual discounted returns of their policies during training to quantify overestimation.
  • Implications: The design of Double DQN allows for more accurate value estimates and improved policy learning without significantly increasing computational complexity.
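
As a rough illustration of the Key Ideas above, the sketch below contrasts the DQN target, in which the target network both selects and evaluates the greedy next action, with the Double DQN target, in which the online network selects and the target network evaluates. The `next_q_online` and `next_q_target` arrays stand in for network outputs; batch shapes and hyperparameters are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def dqn_target(rewards, next_q_target, gamma=0.99, dones=None):
    """Standard DQN target: r + gamma * max_a Q_target(s', a).
    The target network both selects and evaluates the greedy action."""
    dones = np.zeros_like(rewards) if dones is None else dones
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

def double_dqn_target(rewards, next_q_online, next_q_target, gamma=0.99, dones=None):
    """Double DQN target: r + gamma * Q_target(s', argmax_a Q_online(s', a)).
    The online network selects the greedy action; the target network evaluates it."""
    dones = np.zeros_like(rewards) if dones is None else dones
    greedy_actions = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(rewards)), greedy_actions]
    return rewards + gamma * (1.0 - dones) * evaluated

# Illustrative usage with random value estimates for a batch of 5 transitions
# and 6 discrete actions (shapes are assumptions only).
rng = np.random.default_rng(0)
rewards = rng.normal(size=5)
next_q_online = rng.normal(size=(5, 6))
next_q_target = rng.normal(size=(5, 6))
print(dqn_target(rewards, next_q_target))
print(double_dqn_target(rewards, next_q_online, next_q_target))
```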

Findings

  • Outcomes:

    • DQN exhibited substantial overestimations in action values, negatively impacting performance in various Atari games.
    • Double DQN effectively reduced these overestimations, leading to higher scores and more stable learning on several games and better overall performance.
    • The results indicated that overestimations were not just artifacts of specific games but a general issue in Q-learning implementations.
  • Significance: This research demonstrated that addressing overestimation biases in reinforcement learning algorithms can lead to significant improvements in policy performance, challenging previous assumptions about the robustness of Q-learning.

  • Future Work: Suggested exploring further refinements to Double DQN, investigating alternative architectures, and applying the findings to more complex environments beyond Atari.

  • Potential Impact: Advancements in reinforcement learning algorithms like Double DQN could enhance the development of AI systems in real-world applications, such as robotics and autonomous systems, where decision-making under uncertainty is critical.

Notes

Meta

Published: 2015-09-22

Updated: 2025-08-27

URL: https://arxiv.org/abs/1509.06461v3

Authors: Hado van Hasselt, Arthur Guez, David Silver

Citations: 6453

H Index: 130

Categories: cs.LG

Model: gpt-4o-mini