| # CheckPoint |
| |
| --- |
| |
| SINGA checkpoints model parameters onto disk periodically according to user |
| configured frequency. By checkpointing model parameters, we can |
| |
| 1. resume the training from the last checkpointing. For example, if |
| the program crashes before finishing all training steps, we can continue |
| the training using checkpoint files. |
| |
| 2. use them to initialize a similar model. For example, the |
| parameters from training a RBM model can be used to initialize |
| a [deep auto-encoder](rbm.html) model. |
| |
| ## Configuration |
| |
| Checkpointing is controlled by two configuration fields: |
| |
| * `checkpoint_after`, start checkpointing after this number of training steps, |
| * `checkpoint_freq`, frequency of doing checkpointing. |
| |
| For example, |
| |
| # job.conf |
| checkpoint_after: 100 |
| checkpoint_frequency: 300 |
| ... |
| |
| Checkpointing files are located at *WORKSPACE/checkpoint/stepSTEP-workerWORKERID*. |
| *WORKSPACE* is configured in |
| |
| cluster { |
| workspace: |
| } |
| |
| For the above configuration, after training for 700 steps, there would be |
| two checkpointing files, |
| |
| step400-worker0 |
| step700-worker0 |
| |
| ## Application - resuming training |
| |
| We can resume the training from the last checkpoint (i.e., step 700) by, |
| |
| ./bin/singa-run.sh -conf JOB_CONF -resume |
| |
| There is no change to the job configuration. |
| |
| ## Application - model initialization |
| |
| We can also use the checkpointing file from step 400 to initialize |
| a new model by configuring the new job as, |
| |
| # job.conf |
| checkpoint : "WORKSPACE/checkpoint/step400-worker0" |
| ... |
| |
| If there are multiple checkpointing files for the same snapshot due to model |
| partitioning, all the checkpointing files should be added, |
| |
| # job.conf |
| checkpoint : "WORKSPACE/checkpoint/step400-worker0" |
| checkpoint : "WORKSPACE/checkpoint/step400-worker1" |
| ... |
| |
| The training command is the same as starting a new job, |
| |
| ./bin/singa-run.sh -conf JOB_CONF |