Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Abstract: We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
Synopsis
Overview
- Keywords: Evolution Strategies, Reinforcement Learning, Black Box Optimization, MuJoCo, Atari Games
- Objective: Investigate the efficacy of Evolution Strategies (ES) as a scalable alternative to traditional reinforcement learning methods.
- Hypothesis: ES can outperform or match the performance of existing reinforcement learning algorithms while demonstrating superior scalability and robustness.
- Innovation: Introduction of a novel communication strategy using common random numbers to enhance scalability across numerous parallel workers.
Background
- Preliminary Theories:
- Markov Decision Processes (MDP): A mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker.
- Policy Gradient Methods: Techniques in reinforcement learning that optimize the policy directly by adjusting the parameters based on the gradient of expected returns.
- Black Box Optimization: A method that treats the optimization problem as a black box, focusing on input-output relationships without needing gradient information.
- Natural Evolution Strategies (NES): A class of ES that maintains a parameterized search distribution over policy parameters and improves it by stochastic gradient ascent on the expected return.
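As a rough illustration of the black box and NES ideas above, the following Python sketch estimates a search gradient from Gaussian parameter perturbations and uses it for gradient ascent. The toy objective `toy_return`, the function names, and all hyperparameter values are assumptions made for this example and are not taken from the paper. The mean-return baseline is a common variance-reduction choice, not a detail drawn from the source.

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n=200, rng=None):
    """Estimate the gradient of E[f(theta + sigma * eps)] with eps ~ N(0, I).

    Only evaluations of f are used; no derivative of f is required.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n, theta.size))
    returns = np.array([f(theta + sigma * e) for e in eps])
    # Assumption of this sketch: subtract the mean return as a baseline
    # to reduce the variance of the estimate.
    returns = returns - returns.mean()
    return eps.T @ returns / (n * sigma)

def toy_return(theta):
    # Hypothetical black-box objective, maximized at theta = [1, -2].
    target = np.array([1.0, -2.0])
    return -np.sum((theta - target) ** 2)

def run_es(steps=300, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(steps):
        # Gradient ascent on the smoothed objective.
        theta = theta + lr * es_gradient(toy_return, theta, rng=rng)
    return theta
```

Calling `run_es()` should move the parameters from the origin toward the maximizer of the toy objective, even though the optimizer never sees a derivative of `toy_return`.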
 
- Prior Research:
- 2015: Deep Q-Networks (DQN) demonstrated significant advancements in playing Atari games using reinforcement learning.
- 2015: Trust Region Policy Optimization (TRPO) introduced a more stable policy update mechanism by limiting how far each update can move the policy, addressing the step-size sensitivity of policy gradient methods.
- 2016: Asynchronous Advantage Actor-Critic (A3C) improved learning speed in complex environments by training many actor-learners in parallel.
- 2017: Research on ES began to highlight its potential in reinforcement learning contexts, particularly in environments with sparse rewards.
 
Methodology
- Key Ideas:
- Parallelization: ES is designed to operate on complete episodes, requiring minimal communication between workers, thus allowing for effective scaling.
- Common Random Numbers: Workers share the random seeds used to generate perturbations, so each worker can reconstruct every other worker's perturbation locally and only scalar returns need to be communicated.
- Virtual Batch Normalization: A technique to stabilize training by normalizing activations with statistics computed from a fixed reference batch chosen at the start of training, enhancing the reliability of ES.
- Parameter Perturbation: Exploration is driven by perturbing policy parameters rather than actions, allowing for more robust learning in high-dimensional spaces.
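The common random numbers idea can be sketched in a single process as follows. Each simulated worker keeps its own copy of the parameters, evaluates one perturbed copy, and shares only a scalar return; every worker then regenerates all perturbations from the known seeds and applies the same update. The `Worker` class, the seed schedule, the toy objective, and the constants are hypothetical and chosen for illustration only, so this is a sketch of the principle and not the paper's implementation.

```python
import numpy as np

SIGMA, LR = 0.1, 0.05  # assumed noise scale and learning rate

def toy_return(theta):
    # Hypothetical stand-in for the return of one full episode.
    return -float(np.sum((theta - 3.0) ** 2))

def perturbation(seed, dim):
    # Any worker can regenerate the same noise vector from its seed.
    return np.random.default_rng(seed).standard_normal(dim)

class Worker:
    def __init__(self, dim):
        self.theta = np.zeros(dim)  # local copy of the policy parameters

    def evaluate(self, seed):
        eps = perturbation(seed, self.theta.size)
        return toy_return(self.theta + SIGMA * eps)  # a single scalar

    def update(self, seeds, returns):
        returns = np.asarray(returns)
        advantages = returns - returns.mean()
        grad = np.zeros_like(self.theta)
        for seed, adv in zip(seeds, advantages):
            # Rebuild each worker's noise locally instead of receiving it.
            grad += adv * perturbation(seed, self.theta.size)
        self.theta += LR * grad / (len(seeds) * SIGMA)

def train(n_workers=8, dim=5, iterations=200):
    workers = [Worker(dim) for _ in range(n_workers)]
    for it in range(iterations):
        # Seed schedule known to every worker in advance.
        seeds = [it * n_workers + i for i in range(n_workers)]
        # The only values exchanged between workers are scalar returns.
        returns = [w.evaluate(s) for w, s in zip(workers, seeds)]
        for w in workers:
            w.update(seeds, returns)
    return workers
```

Because every worker applies the same arithmetic to the same seeds and returns, the parameter copies stay identical without any parameter vector or gradient being transmitted. Exploration here also happens in parameter space, since the noise is added to `theta` and not to individual actions.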
 
- Experiments:
- MuJoCo Tasks: Evaluated ES on continuous control tasks, comparing performance against TRPO and measuring sample complexity.
- Atari Games: Tested ES on 51 Atari games, comparing final performance against A3C after roughly one hour of highly parallel training.
- Scaling Experiments: Measured the time taken to solve complex tasks with varying numbers of CPU cores, showing roughly linear speedup as cores were added.
 
- Implications: The design of ES allows for efficient use of computational resources, making it suitable for environments where traditional reinforcement learning methods struggle. 
Findings
- Outcomes:
- ES achieved competitive performance on MuJoCo tasks, often reaching performance levels similar to TRPO's in less wall-clock time.
- In Atari games, ES matched or exceeded A3C performance in 23 out of 51 games, although it used more data to do so.
- ES exhibited superior exploration capabilities, discovering diverse behaviors in environments where TRPO struggled.
- The robustness of ES was demonstrated through consistent performance across various hyperparameter settings.
 
- Significance: This research challenges the notion that black box optimization methods are inferior to gradient-based methods in reinforcement learning, showcasing their potential in complex environments. 
- Future Work: Suggested exploration of ES in meta-learning scenarios and integration with low-precision neural network architectures to leverage its gradient-free nature. 
- Potential Impact: Pursuing these avenues could lead to breakthroughs in solving long-horizon tasks and improving the efficiency of learning algorithms in diverse applications. 
