"Parallel Advantage Actor-Critic" Implementation

This repo contains an MXNet implementation of a variant of the A3C algorithm from Asynchronous Methods for Deep Reinforcement Learning.

Trajectories are obtained from multiple environments in a single process, batched together, and used to update the model with a single forward and backward pass.
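The rollout loop can be sketched roughly as follows. This is an illustrative, self-contained sketch, not the repo's actual code: `ToyEnv`, `collect_batch`, and the `policy` callable are hypothetical stand-ins for the real environments and model.

```python
import numpy as np

class ToyEnv:
    """Minimal stand-in environment with a gym-like reset/step interface."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        return self.rng.normal(size=4)

    def step(self, action):
        obs = self.rng.normal(size=4)
        reward = float(action)  # dummy reward for illustration
        done = False
        return obs, reward, done

def collect_batch(envs, policy, t_max):
    """Step all environments in lockstep for t_max steps in one process.

    Returns observation, action, and reward arrays with a leading
    (t_max, num_envs) shape, ready to feed to a single forward/backward
    pass of the model.
    """
    obs = [env.reset() for env in envs]
    all_obs, all_actions, all_rewards = [], [], []
    for _ in range(t_max):
        # One batched forward pass chooses actions for every environment.
        actions = policy(np.stack(obs))
        results = [env.step(a) for env, a in zip(envs, actions)]
        all_obs.append(np.stack(obs))
        all_actions.append(np.asarray(actions))
        all_rewards.append(np.array([r for _, r, _ in results]))
        obs = [o for o, _, _ in results]
    return np.stack(all_obs), np.stack(all_actions), np.stack(all_rewards)
```

Because every environment advances in the same process, the observations can be stacked into one batch, so the policy network runs once per step rather than once per environment.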

Generalized Advantage Estimation is used to estimate the advantage function.
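For background, GAE computes advantages as an exponentially weighted sum of TD residuals. A minimal sketch for a single trajectory (the function name and array layout are illustrative, not the repo's API):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: shape (T,); values: shape (T + 1,), where the extra entry is
    the bootstrap value estimate for the state after the last step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Accumulate the discounted sum of residuals backwards in time.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With `lam=1` this reduces to the Monte Carlo advantage (discounted return minus value estimate); with `lam=0` it reduces to the one-step TD residual.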

Please see the accompanying tutorial for additional background.

Authors: Sean Welleck (@wellecks), Reed Lee (@loofahcus)


Atari Pong

The model can be trained on various OpenAI gym environments, but was primarily tested on PongDeterministic-v3. To train on this environment with default parameters (16 environments), use:

python train.py

Training a model to achieve a score of 20 takes roughly an hour on a MacBook Pro.

Other environments

Note that other environments may require additional tuning or architecture adjustments. Use python train.py -h to see the command-line arguments. For instance, to train on CartPole-v0, performing updates every 50 steps, use:

python train.py --env-type CartPole-v0 --t-max 50