# RNN in SINGA

---

Recurrent neural networks (RNN) are widely used for modelling sequential data,
e.g., natural language sentences. In this page, we describe how to implement an
RNN application (or model) using SINGA's built-in RNN layers. We will
use the [char-rnn model](https://github.com/karpathy/char-rnn) as an example,
which trains over sentences or source code, with each character as an input
unit. In particular, we will train an RNN using GRU over
[Linux kernel source code](http://cs.stanford.edu/people/karpathy/char-rnn/).
After training, we expect to generate meaningful code from the model, like the
one shown by [Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
There is also a [vanilla RNN example](rnn.html) for language modelling using
user-defined RNN layers, which is different from using the built-in RNN layers
described in this page.

```
/*
 * If this error is set, we will need anything right after that BSD.
 */
static void action_new_function(struct s_stat_info *wb)
{
  unsigned long flags;
  int lel_idx_bit = e->edd, *sys & ~((unsigned long) *FIRST_COMPAT);
  buf[0] = 0xFFFFFFFF & (bit << 4);
  min(inc, slist->bytes);
  printk(KERN_WARNING "Memory allocated %02x/%02x, "
    "original MLL instead\n"),
    min(min(multi_run - s->len, max) * num_data_in),
    frame_pos, sz + first_seg);
  div_u64_w(val, inb_p);
  spin_unlock(&disk->queue_lock);
  mutex_unlock(&s->sock->mutex);
  mutex_unlock(&func->mutex);
  return disassemble(info->pending_bh);
}
```

## User configuration

Compared with the configuration of other models, e.g., feed-forward models, the
major differences are:

1. the training algorithm should be changed to BPTT (back-propagation through time).
2. the layers and their connections should be configured differently.

The train-one-batch algorithm can simply be configured as,

    train_one_batch {
      alg: kBPTT
    }

Next, we introduce the configuration of the neural net.

<img src="../_static/images/char-rnn.png" style="width: 550px"/>
<p><strong> Fig.1 - Illustration of the structure of the Char-RNN model</strong></p>

Fig.1 illustrates the net structure of the char-rnn model. The input layer
buffers all training data (the Linux kernel code is about 6MB). For each
iteration, it reads `unroll_len + 1` (`unroll_len` is configured by users)
successive characters, e.g., "int a;", and passes the first `unroll_len`
characters to the `OneHotLayer`s (one character per layer). Each `OneHotLayer`
converts its character into a one-hot vector representation. The input layer
passes the last `unroll_len` characters as labels to the `RNNLabelLayer` (the
label of the i-th character is the (i+1)-th character, i.e., the objective is
to predict the next character). Each `GRULayer` receives a one-hot vector and
the hidden feature vector from the preceding `GRULayer`. After some feature
transformation, its own feature vector is passed to an inner-product layer and
to the succeeding `GRULayer`. The i-th `SoftmaxLossLayer` measures the
cross-entropy loss for predicting the i-th character. According to Karpathy,
another stack of `GRULayer`s could be added on top of the first stack, which
improves the performance if there is enough training data. The layer
configuration is similar to that of other models, e.g., feed-forward models.
The major difference is in the connection configuration.

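For concreteness, the structure in Fig.1 could be expressed roughly as follows.
This is only a sketch: the type names for the one-hot, label and loss layers
(`kOneHot`, `kRNNLabel`, `kSoftmaxLoss`), the layer names and the `srclayer`
values are assumptions for illustration (the `srclayer` field follows the
wording used later in this page); refer to the shipped `examples/char-rnn`
configuration for the exact fields.

    net {
      unroll_len: 50
      layer {
        name: "input"
        type: kCharRNNInput
        unroll_len: 1
      }
      layer {
        name: "onehot"
        type: kOneHot        // assumed type name for OneHotLayer
        srclayer: "input"
      }
      layer {
        name: "label"
        type: kRNNLabel      // assumed type name for RNNLabelLayer
        srclayer: "input"
      }
      layer {
        name: "gru"
        type: kGRU
        srclayer: "onehot"
      }
      layer {
        name: "ip"
        type: kInnerProduct
        srclayer: "gru"
      }
      layer {
        name: "loss"
        type: kSoftmaxLoss   // assumed type name for SoftmaxLossLayer
        srclayer: "ip"
        srclayer: "label"
      }
    }

Layers without an explicit `unroll_len` inherit the net-level value, as
described in the next subsection.
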
### Unrolling length

To model long-range dependencies, the recurrent layers need to be unrolled many
times, denoted as `unroll_len` (e.g., 50). According to our unified neural net
representation, the neural net should have configurations for `unroll_len`
recurrent layers. It is tedious to let users configure these layers manually.
Hence, SINGA provides an `unroll_len` field in each layer's configuration. For
example, to unroll the `GRULayer`, users just configure it as,

    layer {
      type: kGRU
      unroll_len: 50
    }

Not only the `GRULayer` but also other layers, such as `InnerProductLayer` and
`SoftmaxLossLayer`, are unrolled. To simplify the configuration, SINGA
provides an `unroll_len` field in the net configuration, which sets the
`unroll_len` of each layer configuration if the `unroll_len` is not configured
explicitly for that layer. For instance, SINGA would set the `unroll_len` of
the `GRULayer` to 50 implicitly for the following configuration.

    net {
      unroll_len: 50
      layer {
        type: kCharRNNInput
        unroll_len: 1  // configure it explicitly
      }
      layer {
        type: kGRU
        // no configuration for unroll_len
      }
    }

### ConnectionType

<img src="http://karpathy.github.io/assets/rnn/diags.jpeg" style="width: 550px"/>
<p><strong> Fig.2 - Different RNN structures from [Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)</strong></p>

There can be many types of connections between layers in RNN models, as shown
by Karpathy in Fig.2. For each `srclayer`, there is a connection type
configured for it, as illustrated in the sketch after the following list.
Taking the i-th `srclayer` as an example, if its connection type is,

* kOneToOne, then each unrolled layer is connected with one unrolled layer from the i-th `srclayer`.
* kOneToALL, then each unrolled layer is connected with all unrolled layers from the i-th `srclayer`.

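As a rough illustration, the connection type could be set per source layer in
the layer configuration along the following lines. The field name
`connection_type`, its per-`srclayer` placement and the layer names are
assumptions based on the description above, not copied from the SINGA proto
definition.

    layer {
      name: "gru"
      type: kGRU
      srclayer: "onehot"
      // assumed field; one entry per srclayer, in the same order
      connection_type: kOneToOne
    }
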
## Implementation

### Neural net configuration preprocessing

The user-configured neural net is preprocessed to unroll the recurrent layers,
i.e., duplicating the configuration of the `GRULayer`s, renaming each layer
with its unrolling index, and re-configuring the `srclayer` field. After
preprocessing, each layer's name is changed to
`<unrolling_index>#<user_configured_name>`. Consequently, the (unrolled) neural
net configuration passed to the NeuralNet class includes all layers and their
connections. The NeuralNet class creates and sets up each layer in the same way
as for other models. For example, after partitioning, each layer's name is
changed to `<layer_name>@<partition_index>`. One difference is that there is
some special code for sharing Param data and grad Blobs among layers unrolled
from the same original layer.

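For example, assuming a `GRULayer` named `gru` reading from a one-hot layer
named `onehot`, unrolling with `unroll_len: 3` would produce layers named
roughly as follows. The layer names here are hypothetical; only the
`<unrolling_index>#<user_configured_name>` pattern is fixed by the
preprocessing.

    // user configuration (hypothetical layer names)
    layer { name: "gru" type: kGRU srclayer: "onehot" }

    // after unrolling with unroll_len = 3
    layer { name: "0#gru" type: kGRU srclayer: "0#onehot" }
    layer { name: "1#gru" type: kGRU srclayer: "1#onehot" srclayer: "0#gru" }
    layer { name: "2#gru" type: kGRU srclayer: "2#onehot" srclayer: "1#gru" }
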
Users can visualize the neural net structure using the Python script
`tool/graph.py` and the files in *WORKSPACE/visualization/*. For example, after
the training program has started,

    python tool/graph.py examples/char-rnn/visualization/train_net.json

The generated image for `unroll_len=5` is shown in Fig.3.

<img src="../_static/images/char-rnn-net.jpg" style="width: 550px"/>
<p><strong> Fig.3 - Net structure generated by SINGA</strong></p>

### BPTTWorker

The BPTT (back-propagation through time) algorithm is typically used to compute
gradients of the objective loss w.r.t. parameters for RNN models. It forward
propagates through all unrolled layers (i.e., time points) to compute the
features of each layer, and backward propagates to compute the gradients of the
parameters. It is the same as the BP algorithm for feed-forward models if the
recurrent layers are unrolled an infinite number of times. In practice, due to
the memory constraint, truncated BPTT is widely used. It unrolls the recurrent
layers a fixed (truncated) number of times (controlled by `unroll_len`). In
SINGA, a BPTTWorker is provided to run the truncated BPTT algorithm for each
mini-batch (i.e., iteration). The pseudo code is

```
BPTTWorker::Forward(phase, net) {
  for each layer in net
    // fetch param values from servers only for the 0-th unrolled layer;
    // the other unrolled layers share the same Param data
    if layer.unroll_index() == 0
      Get(layer.params());
    srclayers = layer.srclayers();
    // during testing, add the 0-th unrolled layer as an extra source layer,
    // which dynamically forms a recurrent connection (see below)
    if phase & kTest
      srclayers.push_back(net->GetContextLayer(layer));
    layer.ComputeFeature(phase, srclayers);
}

BPTTWorker::Backward(phase, net) {
  for each layer in reverse(net.layers())
    layer.ComputeGradient(layer.srclayers());
    // the 0-th unrolled layer is reached last; by then the gradients from all
    // unrolled layers have been accumulated and can be sent to the servers
    if layer.unroll_index() == 0
      Update(layer.params());
}
```

The testing phase is processed specially because it may sample a long sequence
of data (e.g., a piece of Linux kernel code), which would require many unrolled
layers (e.g., more than 1000 characters/layers), and we cannot unroll the
recurrent layers that many times due to the memory constraint. Instead, the
special line in the pseudo code above adds the 0-th unrolled layer as one of
its own source layers. Consequently, it dynamically adds a recurrent connection
to the recurrent layer (e.g., `GRULayer`), so we can sample from the model for
an arbitrary number of steps. Taking the char-rnn model as an example, the test
job can be configured as,

    test_steps: 10000
    train_one_batch {
      alg: kBPTT
    }
    net {
      // do not set the unroll_len
      layer {
        // do not set the unroll_len
      }
      …
    }

The instructions for [running tests](test.html) are the same as for
feed-forward models.