# Updater

---

Every server in SINGA has an [Updater](../api/classsinga_1_1Updater.html)
instance that updates parameters based on gradients.
In this page, the *Basic user guide* describes the configuration of an updater.
The *Advanced user guide* presents details on how to implement a new updater and a new
learning rate changing method.

## Basic user guide

There are many different parameter updating protocols (i.e., subclasses of
`Updater`). They share some configuration fields like

* `type`, an integer for identifying an updater;
* `learning_rate`, configuration for the
  [LRGenerator](../api/classsinga_1_1LRGenerator.html) which controls the learning rate;
* `weight_decay`, the coefficient for [L2 regularization](http://deeplearning.net/tutorial/gettingstarted.html#regularization);
* [momentum](http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/).

If you are not familiar with the above terms, you can find their meanings in
[this page provided by Karpathy](http://cs231n.github.io/neural-networks-3/#update).

### Configuration of built-in updater classes

#### Updater

The base `Updater` implements the [vanilla SGD algorithm](http://cs231n.github.io/neural-networks-3/#sgd).
Its configuration type is `kSGD`.
Users need to configure at least the `learning_rate` field.
`momentum` and `weight_decay` are optional fields.

    updater {
      type: kSGD
      momentum: float
      weight_decay: float
      learning_rate {
        ...
      }
    }

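For example, a filled-in configuration could look like the following; the numeric
values are illustrative only, not recommended defaults:

    # illustrative values
    updater {
      type: kSGD
      momentum: 0.9
      weight_decay: 0.0005
      learning_rate {
        type: kFixed      # see the kFixed method below
        base_lr: 0.01
      }
    }
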
#### AdaGradUpdater

It inherits the base `Updater` to implement the
[AdaGrad](http://www.magicbroom.info/Papers/DuchiHaSi10.pdf) algorithm.
Its type is `kAdaGrad`.
`AdaGradUpdater` is configured similarly to `Updater`, except
that `momentum` is not used.

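Based on the description above, a configuration sketch (showing only the fields
discussed in this guide) is:

    updater {
      type: kAdaGrad
      weight_decay: float   # optional
      learning_rate {
        ...
      }
    }
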
#### NesterovUpdater

It inherits the base `Updater` to implement the
[Nesterov](http://arxiv.org/pdf/1212.0901v2.pdf) (section 3.5) updating protocol.
Its type is `kNesterov`.
`learning_rate` and `momentum` must be configured. `weight_decay` is an
optional configuration field.

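Following that description, a configuration sketch is:

    updater {
      type: kNesterov
      momentum: float        # required
      weight_decay: float    # optional
      learning_rate {
        ...
      }
    }
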
#### RMSPropUpdater

It inherits the base `Updater` to implement the
[RMSProp algorithm](http://cs231n.github.io/neural-networks-3/#sgd) proposed by
[Hinton](http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf) (slide 29).
Its type is `kRMSProp`.

    updater {
      type: kRMSProp
      rmsprop_conf {
        rho: float # [0,1]
      }
    }

#### AdaDeltaUpdater

It inherits the base `Updater` to implement the
[AdaDelta](http://arxiv.org/abs/1212.5701) updating algorithm.
Its type is `kAdaDelta`.

    updater {
      type: kAdaDelta
      adadelta_conf {
        rho: float # [0,1]
      }
    }

#### Adam

It inherits the base `Updater` to implement the
[Adam](http://arxiv.org/pdf/1412.6980.pdf) updating algorithm.
Its type is `kAdam`.
`beta1` and `beta2` are floats in (0, 1), generally close to 1.

    updater {
      type: kAdam
      adam_conf {
        beta1: float # (0,1)
        beta2: float # (0,1)
      }
    }

#### AdaMax

It inherits the base `Updater` to implement the
[AdaMax](http://arxiv.org/pdf/1412.6980.pdf) updating algorithm.
Its type is `kAdamMax`.
`beta1` and `beta2` are floats in (0, 1), generally close to 1.

    updater {
      type: kAdamMax
      adammax_conf {
        beta1: float # (0,1)
        beta2: float # (0,1)
      }
    }

### Configuration of learning rate

The `learning_rate` field is configured as,

    learning_rate {
      type: ChangeMethod
      base_lr: float  # base/initial learning rate
      ...             # fields specific to the chosen changing method
    }

The common fields include `type` and `base_lr`. SINGA provides the following
`ChangeMethod`s.

#### kFixed

The `base_lr` is used for all steps.

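Its configuration therefore needs only the common fields, e.g.,

    learning_rate {
      type: kFixed
      base_lr: float
    }
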
#### kLinear

The updater should be configured like

    learning_rate {
      type: kLinear
      base_lr: float
      linear_conf {
        freq: int
        final_lr: float
      }
    }

Linear interpolation is used to change the learning rate,

    lr = (1 - step / freq) * base_lr + (step / freq) * final_lr

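For instance, with `base_lr: 0.1`, `final_lr: 0.01` and `freq: 1000` (illustrative
values), the learning rate at step 500 would be (1 - 0.5) * 0.1 + 0.5 * 0.01 = 0.055.
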
#### kExponential

The updater should be configured like

    learning_rate {
      type: kExponential
      base_lr: float
      exponential_conf {
        freq: int
      }
    }

The learning rate for `step` is

    lr = base_lr / 2^(step / freq)

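For instance, with `base_lr: 0.1` and `freq: 1000` (illustrative values), the
learning rate at step 3000 would be 0.1 / 2^3 = 0.0125.
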
#### kInverseT

The updater should be configured like

    learning_rate {
      type: kInverseT
      base_lr: float
      inverset_conf {
        final_lr: float
      }
    }

The learning rate for `step` is

    lr = base_lr / (1 + step / final_lr)

#### kInverse

The updater should be configured like

    learning_rate {
      type: kInverse
      base_lr: float
      inverse_conf {
        gamma: float
        pow: float
      }
    }

The learning rate for `step` is

    lr = base_lr * (1 + gamma * step)^(-pow)

#### kStep

The updater should be configured like

    learning_rate {
      type: kStep
      base_lr: float
      step_conf {
        change_freq: int
        gamma: float
      }
    }

The learning rate for `step` is

    lr = base_lr * gamma^(step / change_freq)

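For instance, with `base_lr: 0.1`, `gamma: 0.5` and `change_freq: 1000` (illustrative
values), the learning rate at step 2000 would be 0.1 * 0.5^2 = 0.025.
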
#### kFixedStep

The updater should be configured like

    learning_rate {
      type: kFixedStep
      fixedstep_conf {
        step: int
        step_lr: float

        step: int
        step_lr: float

        ...
      }
    }

Denote the i-th pair as (step[i], step_lr[i]), with the step values given in
increasing order. The learning rate for `step` is then

    step_lr[k]

where step[k] is the largest configured step that does not exceed `step`.

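As an illustration (the step values and learning rates below are hypothetical),
the configuration

    learning_rate {
      type: kFixedStep
      fixedstep_conf {
        step: 0
        step_lr: 0.01

        step: 10000
        step_lr: 0.001
      }
    }

uses 0.01 for steps 0 through 9999 and 0.001 from step 10000 onwards.
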
## Advanced user guide

### Implementing a new Updater subclass

The base Updater class has one virtual function,

    class Updater {
     public:
      virtual void Update(int step, Param* param, float grad_scale = 1.0f) = 0;

     protected:
      UpdaterProto proto_;
      LRGenerator lr_gen_;
    };

It updates the values of the `param` based on its gradients. The `step`
argument is for deciding the learning rate, which may change over time
(i.e., with `step`). `grad_scale` scales the original gradient values. This function is
called by a server once it has received all gradients for the same `Param` object.

To implement a new Updater subclass, users must override the `Update` function.

    class FooUpdater : public Updater {
      void Update(int step, Param* param, float grad_scale = 1.0f) override;
    };

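As a rough sketch of what an `Update` body might look like, the fragment below
applies a plain SGD-style step using the members declared in the base class
above. The `Param` accessor names (`mutable_cpu_data()`, `cpu_grad()`, `size()`)
and the way the `c` field is used are assumptions made purely for illustration;
consult the actual `Param` and proto definitions for the real interfaces.

    // Sketch only: the Param accessors and the use of the c field are hypothetical.
    void FooUpdater::Update(int step, Param* param, float grad_scale) {
      float lr = lr_gen_.Get(step);                         // learning rate for this step
      float c = proto_.GetExtension(fooupdater_conf).c();   // read the new field
      float* data = param->mutable_cpu_data();              // hypothetical accessor
      const float* grad = param->cpu_grad();                // hypothetical accessor
      for (int i = 0; i < param->size(); ++i)               // hypothetical size()
        data[i] -= lr * (grad_scale * grad[i] + c * data[i]);
    }
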
Configuration of this new updater can be declared similarly to that of a new
layer,

    # in user.proto
    message FooUpdaterProto {
      optional int32 c = 1;
    }

    extend UpdaterProto {
      optional FooUpdaterProto fooupdater_conf = 101;
    }

The new updater should be registered in the
[main function](programming-guide.html),

    driver.RegisterUpdater<FooUpdater>("FooUpdater");

Users can then configure the job as

    # in job.conf
    updater {
      user_type: "FooUpdater"  # must match the string identifier used for registration
      fooupdater_conf {
        c: 20
      }
    }

### Implementing a new LRGenerator subclass

The base `LRGenerator` declares one virtual function,

    virtual float Get(int step);

To implement a subclass, e.g., `FooLRGen`, users should declare it like

    class FooLRGen : public LRGenerator {
     public:
      float Get(int step) override;
    };

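A possible body for `Get` is sketched below; it simply halves the base learning
rate every 100 steps. The protected `proto_` member and its `base_lr()` accessor
are assumptions made for illustration.

    // Sketch only: assumes the LRGenProto configuration is kept in a protected
    // member proto_ with a base_lr() accessor.
    #include <cmath>

    float FooLRGen::Get(int step) {
      // halve the base learning rate every 100 steps
      return proto_.base_lr() / std::pow(2.0f, step / 100);
    }
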
Configuration of `FooLRGen` can be defined using a protocol message,

    # in user.proto
    message FooLRProto {
      ...
    }

    extend LRGenProto {
      optional FooLRProto foolr_conf = 101;
    }

The configuration is then like,

    learning_rate {
      user_type: "FooLR"  # must match the string identifier used for registration
      base_lr: float
      foolr_conf {
        ...
      }
    }

Users have to register this subclass in the main function,

    driver.RegisterLRGenerator<FooLRGen, std::string>("FooLR");