This repo contains an MXNet implementation of a variant of the A3C algorithm from "Asynchronous Methods for Deep Reinforcement Learning".
Trajectories are obtained from multiple environments in a single process, batched together, and used to update the model with a single forward and backward pass.
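The batching idea can be sketched in plain Python. This is only an illustration, not the code in this repo: ToyEnv, policy, and collect are hypothetical names, the environment is a trivial stand-in for a gym environment, and policy stands in for the model's single batched forward pass.

```python
class ToyEnv:
    """Hypothetical stand-in for a gym environment; the state is a step counter."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [float(self.t)]

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_length
        return [float(self.t)], reward, done


def policy(batch_obs):
    """Placeholder for one batched forward pass: returns one action per row."""
    return [int(obs[0]) % 2 for obs in batch_obs]


def collect(envs, t_max):
    """Step every environment in lockstep for t_max steps, in a single process.

    Each entry of the returned lists has one row per environment, so a whole
    time step can be fed to the model as one batch.
    """
    obs = [env.reset() for env in envs]
    traj = {"obs": [], "actions": [], "rewards": [], "dones": []}
    for _ in range(t_max):
        actions = policy(obs)  # one call covers all environments
        results = [env.step(a) for env, a in zip(envs, actions)]
        traj["obs"].append(obs)
        traj["actions"].append(actions)
        traj["rewards"].append([r for _, r, _ in results])
        traj["dones"].append([d for _, _, d in results])
        # Restart any environment whose episode just ended.
        obs = [env.reset() if d else o for env, (o, _, d) in zip(envs, results)]
    return traj
```

Calling collect([ToyEnv() for _ in range(16)], t_max) yields t_max time steps of 16 rows each, which is the shape a single forward and backward pass would consume.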
Generalized Advantage Estimation is used to estimate the advantage function.
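As a rough illustration of Generalized Advantage Estimation, the sketch below computes advantages for one trajectory segment with no terminal state inside it. The function name gae_advantages and the default values of gamma and lam are assumptions chosen for the example, not necessarily what this repo uses.

```python
def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a single trajectory segment.

    rewards[t] and values[t] are the reward and value estimate at step t;
    bootstrap_value is the value estimate for the state after the last step.
    gamma (discount) and lam (GAE parameter) defaults are illustrative only.
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    # Work backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With lam=0 each advantage is just the one-step TD residual, while with lam=1 it becomes the discounted return (plus the bootstrapped value) minus the value estimate. Intermediate values trade bias against variance.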
Please see the accompanying tutorial for additional background.
Authors: Sean Welleck (@wellecks), Reed Lee (@loofahcus)
The model can be trained on various OpenAI Gym environments, but was primarily tested on PongDeterministic-v3. To train on this environment with default parameters (16 environments), use:
python train.py
Training a model to achieve a score of 20 takes roughly an hour on a MacBook Pro.
Note that other environments may require additional tuning or architecture adjustments. Use python train.py -h to see the command-line arguments. For instance, to train on CartPole-v0, performing updates every 50 steps, use:
python train.py --env-type CartPole-v0 --t-max 50
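For readers unfamiliar with how such flags are typically wired up, here is a minimal argparse sketch covering only the two options used in the command above. It is not the parser from train.py: build_parser is a hypothetical name, the --t-max default is a placeholder, and the real script likely defines more arguments (run python train.py -h for the actual list).

```python
import argparse


def build_parser():
    """Hypothetical parser for the two flags shown in this README."""
    parser = argparse.ArgumentParser(description="Train a batched A3C variant.")
    parser.add_argument(
        "--env-type",
        type=str,
        default="PongDeterministic-v3",  # the environment the README says was tested
        help="name of the gym environment to train on",
    )
    parser.add_argument(
        "--t-max",
        type=int,
        default=50,  # placeholder default; the repo's actual default may differ
        help="number of environment steps between model updates",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse converts dashes to underscores in attribute names.
    print(args.env_type, args.t_max)
```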