<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements. See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership. The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License. You may obtain a copy of the License at -->
<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied. See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->
# Survey of Existing Interfaces and Implementations
Commonly used deep learning libraries with good RNN/LSTM support include [Theano](http://deeplearning.net/software/theano/library/scan.html) and its wrappers [Lasagne](http://lasagne.readthedocs.org/en/latest/modules/layers/recurrent.html) and [Keras](http://keras.io/layers/recurrent/); [CNTK](https://cntk.codeplex.com/); [TensorFlow](https://www.tensorflow.org/tutorials/sequences/recurrent); and various implementations in Torch, such as [karpathy/char-rnn](https://github.com/karpathy/char-rnn) (a well-known character-level language model tutorial) and [Element-Research/rnn](https://github.com/Element-Research/rnn).
In this document, we present a comparative analysis of the approaches taken by these libraries.
## Theano
In Theano, RNN support comes via its [scan operator](http://deeplearning.net/software/theano/library/scan.html),
which allows construction of a loop where the number of iterations is specified
as a runtime value of a symbolic variable.
You can find an official example of an LSTM implementation with scan
[here](http://deeplearning.net/tutorial/lstm.html).
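For concreteness, here is a minimal sketch (not the official tutorial) of a vanilla RNN driven by `theano.scan`; the sizes are arbitrary, and the number of iterations follows the runtime length of the symbolic input:
```python
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX
X = T.matrix('X')                                         # (sequence_length, feature_dim)
W = theano.shared(np.random.randn(3, 4).astype(floatX))   # input-to-hidden weights
U = theano.shared(np.random.randn(4, 4).astype(floatX))   # hidden-to-hidden weights
h0 = T.zeros((4,), dtype=floatX)                          # initial hidden state

def step(x_t, h_tm1):
    # called once per row of X; the iteration count is determined at runtime
    return T.tanh(T.dot(x_t, W) + T.dot(h_tm1, U))

hs, _ = theano.scan(fn=step, sequences=X, outputs_info=h0)
rnn = theano.function([X], hs)                            # hs: (sequence_length, 4)
print(rnn(np.random.randn(7, 3).astype(floatX)).shape)    # (7, 4)
```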
### Implementation
I'm not very familiar with the Theano internals,
but it seems from [theano/scan_module/scan_op.py#execute](https://github.com/Theano/Theano/blob/master/theano/scan_module/scan_op.py#L1225)
that the scan operator is implemented with a loop in Python
that performs one iteration at a time:
```python
fn = self.fn.fn
while (i < n_steps) and cond:
    # ...
    fn()
```
The `grad` function in Theano constructs a symbolic graph for computing gradients. So the `grad` for the scan operator is actually implemented by [constructing another scan operator](https://github.com/Theano/Theano/blob/master/theano/scan_module/scan_op.py#L2527):
```python
local_op = Scan(inner_gfn_ins, inner_gfn_outs, info)
outputs = local_op(*outer_inputs)
```
The [performance guide](http://deeplearning.net/software/theano/library/scan.html#optimizing-scan-s-performance) for Theano's scan operator suggests minimizing the use of scan. This is likely because the loop is executed in Python, which makes each iteration relatively slow (due to the context switch into Python and the overhead of Python itself). Moreover, because no unrolling is performed, the graph optimizer cannot see the big picture.
If I understand correctly, when multiple RNN/LSTM layers are stacked, instead of a single loop in which each iteration computes one step of the whole feedforward stack, the computation runs a separate scan loop for each layer, one after the other. If all of the intermediate values are stored anyway to support computing the gradients, this is fine; otherwise, a single loop could be more memory efficient.
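To make the difference concrete, here is a hedged sketch of the two structures for a two-layer vanilla RNN; the step functions and sizes are illustrative and not taken from any particular wrapper:
```python
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX
X = T.matrix('X')                                          # (sequence_length, input_dim)
W1 = theano.shared(np.random.randn(3, 4).astype(floatX))
U1 = theano.shared(np.random.randn(4, 4).astype(floatX))
W2 = theano.shared(np.random.randn(4, 5).astype(floatX))
U2 = theano.shared(np.random.randn(5, 5).astype(floatX))
h1_0 = T.zeros((4,), dtype=floatX)
h2_0 = T.zeros((5,), dtype=floatX)

step1 = lambda x_t, h: T.tanh(T.dot(x_t, W1) + T.dot(h, U1))
step2 = lambda x_t, h: T.tanh(T.dot(x_t, W2) + T.dot(h, U2))

# (a) one scan per layer -- what naively stacking recurrent layers produces
h1_seq, _ = theano.scan(step1, sequences=X, outputs_info=h1_0)
h2_seq, _ = theano.scan(step2, sequences=h1_seq, outputs_info=h2_0)

# (b) a single loop that advances both layers at every time step
def fused_step(x_t, h1_tm1, h2_tm1):
    h1_t = step1(x_t, h1_tm1)
    return h1_t, step2(h1_t, h2_tm1)

(h1_f, h2_f), _ = theano.scan(fused_step, sequences=X, outputs_info=[h1_0, h2_0])
```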
### Lasagne
The documentation for RNN in Lasagne can be found [here](http://lasagne.readthedocs.org/en/latest/modules/layers/recurrent.html). In Lasagne, a recurrent layer is just like a standard layer, except that the input shape is expected to be `(batch_size, sequence_length, feature_dimension)`. The output shape is then `(batch_size, sequence_length, output_dimension)`.
Both `batch_size` and `sequence_length` can be specified as `None` and inferred from the data. Alternatively, when memory is sufficient and the (maximum) sequence length is known beforehand, you can set `unroll_scan` to `True`. Lasagne will then unroll the graph explicitly instead of using the Theano `scan` operator. Explicit unrolling is implemented in [utils.py#unroll_scan](https://github.com/Lasagne/Lasagne/blob/master/lasagne/utils.py#L340).
The recurrent layer also accepts a `mask_input`, to support variable-length sequences (e.g., when sequences within a mini-batch have different lengths). The mask has the shape `(batch_size, sequence_length)`.
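A minimal sketch of this interface might look as follows (layer sizes are arbitrary):
```python
import lasagne

feature_dim, num_units = 32, 64               # illustrative sizes

# batch_size and sequence_length left as None, to be inferred from the data
l_in   = lasagne.layers.InputLayer(shape=(None, None, feature_dim))
l_mask = lasagne.layers.InputLayer(shape=(None, None))      # 0/1 mask per time step
l_lstm = lasagne.layers.LSTMLayer(l_in, num_units=num_units,
                                  mask_input=l_mask,
                                  unroll_scan=False)         # False -> use theano.scan
# output shape: (batch_size, sequence_length, num_units)
```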
### Keras
The documentation for RNN in Keras can be found [here](http://keras.io/layers/recurrent/). The interface in Keras is similar to the interface in Lasagne. The input is expected to be of shape `(batch_size, sequence_length, feature_dimension)`, and the output shape (if `return_sequences` is `True`) is `(batch_size, sequence_length, output_dimension)`.
Keras currently supports both a Theano and a TensorFlow back end. RNN for the Theano back end is [implemented with the scan operator](https://github.com/fchollet/keras/blob/master/keras/backend/theano_backend.py#L432). For TensorFlow, it seems to be [implemented via explicit unrolling](https://github.com/fchollet/keras/blob/master/keras/backend/tensorflow_backend.py#L396). The documentation says that for the TensorFlow back end, the sequence length must be specified beforehand, and masking currently does not work (because `tf.reduce_any` is not functioning yet).
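A minimal sketch of the Keras interface (sizes are illustrative, and exact import paths may vary across Keras versions):
```python
from keras.models import Sequential
from keras.layers import LSTM

seq_len, feature_dim = 10, 32                  # illustrative sizes
model = Sequential()
model.add(LSTM(64, input_shape=(seq_len, feature_dim), return_sequences=True))
# with return_sequences=True the output has shape (batch_size, seq_len, 64)
model.compile(loss='mse', optimizer='rmsprop')
```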
## Torch
[karpathy/char-rnn](https://github.com/karpathy/char-rnn) is implemented by [explicitly unrolling](https://github.com/karpathy/char-rnn/blob/master/model/RNN.lua#L15). In contrast, [Element-Research/rnn](https://github.com/Element-Research/rnn) runs the sequence iteration in Lua. It has a very modular design (sketched conceptually after the list below):
* The basic RNN/LSTM modules run only *one* time step per call to `forward` (and accumulate/store the information needed to support the backward computation, if required). You get fine-grained control when using this API directly.
* A collection of `Sequencer`s is defined to model common scenarios, such as forward sequencing, bidirectional sequences, attention models, etc.
* There are other utility modules, such as masking to support variable-length sequences.
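The following is a conceptual sketch of that design in Python (not actual Torch/Lua code, and all names are illustrative): a cell that advances exactly one time step per `forward` call, plus a `Sequencer`-style wrapper that iterates it over a whole sequence.
```python
import numpy as np

class StepRNNCell:
    """Advances exactly one time step per forward() call."""
    def __init__(self, in_dim, hid_dim):
        self.W = np.random.randn(in_dim, hid_dim) * 0.1
        self.U = np.random.randn(hid_dim, hid_dim) * 0.1
        self.history = []                         # state saved for a later backward pass

    def forward(self, x_t, h_prev):
        h_t = np.tanh(x_t @ self.W + h_prev @ self.U)
        self.history.append((x_t, h_prev, h_t))   # keep what gradients would need
        return h_t

class Sequencer:
    """Wraps a step module and applies it across a full sequence."""
    def __init__(self, cell, hid_dim):
        self.cell, self.hid_dim = cell, hid_dim

    def forward(self, xs):                        # xs: (seq_len, in_dim)
        h, outputs = np.zeros(self.hid_dim), []
        for x_t in xs:                            # the sequence loop runs in Python here
            h = self.cell.forward(x_t, h)
            outputs.append(h)
        return np.stack(outputs)                  # (seq_len, hid_dim)

seq = Sequencer(StepRNNCell(in_dim=3, hid_dim=4), hid_dim=4)
print(seq.forward(np.random.randn(7, 3)).shape)   # (7, 4)
```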
## CNTK
CNTK looks quite different from other common deep learning libraries. I don't understand it very well. I will talk with Yu to get more details.
It seems that the basic data types are matrices (although there is also a `TensorView` utility class). Mini-batch sequence data is packed into a matrix whose number of rows is `feature_dimension` and whose number of columns is `sequence_length * batch_size` (see Figure 2.9 on page 50 of the [CNTKBook](http://research.microsoft.com/pubs/226641/CNTKBook-20151201.pdf)).
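As I read that layout, the columns for time step `t` are the contiguous block `[t * batch_size, (t + 1) * batch_size)`, which is what `ColumnSlice` pulls out in the code further below. A NumPy sketch of the packing (sizes are illustrative):
```python
import numpy as np

feature_dim, seq_len, batch_size = 3, 5, 2
# one (feature_dim, seq_len) matrix per sequence in the mini-batch
seqs = [np.random.randn(feature_dim, seq_len) for _ in range(batch_size)]

# pack: column block t holds the t-th frame of every sequence in the batch
packed = np.concatenate([np.stack([s[:, t] for s in seqs], axis=1)
                         for t in range(seq_len)], axis=1)
assert packed.shape == (feature_dim, seq_len * batch_size)

def column_slice(m, start_col, num_col):          # mimics Matrix::ColumnSlice
    return m[:, start_col:start_col + num_col]

t = 2
step_t = column_slice(packed, t * batch_size, batch_size)   # (feature_dim, batch_size)
```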
Recurrent networks are first-class citizens in CNTK. Section 5.2.1.8 of the CNTKBook gives an example of a customized computation node. The node needs to explicitly define both a standard forward function and a forward function that takes a time index; the latter is used for RNN evaluation:
```cpp
virtual void EvaluateThisNode()
{
    EvaluateThisNodeS(FunctionValues(), Inputs(0)->FunctionValues(),
                      Inputs(1)->FunctionValues());
}

virtual void EvaluateThisNode(const size_t timeIdxInSeq)
{
    // Slice out the columns belonging to this time step for every
    // sample in the mini-batch.
    Matrix<ElemType> sliceInput1Value = Inputs(1)->FunctionValues().ColumnSlice(
        timeIdxInSeq * m_samplesInRecurrentStep, m_samplesInRecurrentStep);
    Matrix<ElemType> sliceOutputValue = m_functionValues.ColumnSlice(
        timeIdxInSeq * m_samplesInRecurrentStep, m_samplesInRecurrentStep);
    EvaluateThisNodeS(sliceOutputValue, Inputs(0)->FunctionValues(), sliceInput1Value);
}
```
The function `ColumnSlice(start_col, num_col)` extracts the packed data for that time index, as described above (here `m_samplesInRecurrentStep` must be the mini-batch size).
The low-level API for recurrent connections seems to be a *delay node*, but I'm not sure how to use this low-level API. The [example of a PTB language model](https://cntk.codeplex.com/SourceControl/latest#Examples/Text/PennTreebank/Config/rnn.config) uses a very high-level API (simply setting `recurrentLayer = 1` in the config).
## TensorFlow
The [current example of RNNLM](https://www.tensorflow.org/tutorials/sequences/recurrent#recurrent-neural-networks) in TensorFlow uses explicit unrolling for a predefined number of time steps. The whitepaper mentions that a more advanced control-flow API (similar to Theano's `scan`) is planned.
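For reference, explicit unrolling in the graph-building style of that tutorial looks roughly like the sketch below (TF 1.x-era APIs; sizes and names are illustrative, and this is not the tutorial's actual code):
```python
import tensorflow as tf

batch_size, num_steps, feature_dim, hidden = 16, 20, 32, 64
inputs = tf.placeholder(tf.float32, [batch_size, num_steps, feature_dim])

cell = tf.nn.rnn_cell.BasicLSTMCell(hidden)
state = cell.zero_state(batch_size, tf.float32)

outputs = []
with tf.variable_scope("rnn"):
    for t in range(num_steps):                    # the loop body is baked into the graph
        if t > 0:
            tf.get_variable_scope().reuse_variables()
        output, state = cell(inputs[:, t, :], state)
        outputs.append(output)
```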
## Next Steps
* [MXNet System Overview](http://mxnet.io/architecture/overview.html)