This repo contains an MXNet implementation of a variant of the A3C algorithm from Asynchronous Methods for Deep Reinforcement Learning.
Trajectories are obtained from multiple environments in a single process, batched together, and used to update the model with a single forward and backward pass.
Generalized Advantage Estimation is used to estimate the advantage function.
Please see the accompanying tutorial for additional background.
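The advantage estimator described above can be sketched as follows. This is a minimal, framework-agnostic illustration of Generalized Advantage Estimation (not the repo's actual code); the function name and signature are hypothetical, and `values` is assumed to include a bootstrap value for the state after the final step:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=1.0):
    """Sketch of Generalized Advantage Estimation.

    rewards: shape (T,)     rewards collected along one trajectory
    values:  shape (T + 1,) value estimates, including the bootstrap
                            value for the state after the final step
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With lam=1 this reduces to the usual discounted-return advantage; smaller lam trades variance for bias.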
The model can be trained on various OpenAI Gym environments, but was primarily tested on
PongDeterministic-v3. To train on this environment with default parameters (16 environments), use:

python train.py
Training a model to achieve a score of 20 takes roughly an hour on a MacBook Pro.
Note that other environments may require additional tuning or architecture adjustments. Use
python train.py -h to see the command-line arguments. For instance, to train on
CartPole-v0, performing updates every 50 steps, use:
python train.py --env-type CartPole-v0 --t-max 50