id: graph
title: Computational Graph

SINGA supports buffering operations into a computational graph. Using the computational graph, SINGA can schedule the execution of operations as well as memory allocation and release, which makes training more efficient while using less memory.

Features

The computational graph provides three main features, namely (i) computational graph construction, (ii) lazy allocation, and (iii) automatic recycling. Details are as follows:

  • Computational graph construction: Construct a computational graph based on the mathematical or deep learning operations, and then run the graph to accomplish the training task. The computational graph also includes operations like communicator.synch and communicator.fusedSynch for distributed training.
  • Lazy allocation: When blocks are created, devices do not allocate memory for them immediately; memory is allocated only when a block is used by an operation for the first time.
  • Automatic recycling: While a graph is running in an iteration, intermediate tensors that will not be used again by the remaining operations are deallocated automatically.

Design

Computational graph construction

  • Use delayed execution to run the forward propagation and backward propagation once without actually executing the operations. All operations, together with the tensors read or written by each operation, are buffered.
  • Calculate the dependencies between all the operations to decide the order of execution (directed cyclic graphs are supported); see the sketch after this list.
  • Execute all the operations in the computed order to update all the parameters.
  • The system analyzes the same graph only once. If new operations are added, the computational graph is re-analyzed.
  • A Module class is provided so that users can use this feature more conveniently.
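
To make the buffering and scheduling idea concrete, here is a minimal, self-contained Python sketch. It is not SINGA's actual scheduler; the Op and GraphSketch names are made up for illustration. Operations are recorded together with the blocks they read and write, a dependency edge is added from each block's last writer to every later operation that touches it, and the operations are then executed in BFS (topological) order.

from collections import defaultdict, deque

class Op:
    def __init__(self, fn, reads, writes):
        self.fn = fn                  # callable that performs the real computation
        self.reads = set(reads)       # block ids read by this operation
        self.writes = set(writes)     # block ids written by this operation

class GraphSketch:
    def __init__(self):
        self.ops = []

    def buffer(self, fn, reads, writes):
        # Delayed execution: record the operation instead of running it now.
        self.ops.append(Op(fn, reads, writes))

    def run(self):
        n = len(self.ops)
        out_edges = defaultdict(list)
        indegree = [0] * n
        last_writer = {}
        # Only read-after-write dependencies are tracked here, for brevity.
        for i, op in enumerate(self.ops):
            for blk in op.reads | op.writes:
                if blk in last_writer:
                    out_edges[last_writer[blk]].append(i)
                    indegree[i] += 1
            for blk in op.writes:
                last_writer[blk] = i
        ready = deque(i for i in range(n) if indegree[i] == 0)
        while ready:
            i = ready.popleft()
            self.ops[i].fn()          # the real computation happens here
            for j in out_edges[i]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)

g = GraphSketch()
g.buffer(lambda: print("conv"),   reads=["x", "W"],  writes=["y"])
g.buffer(lambda: print("relu"),   reads=["y"],       writes=["y2"])
g.buffer(lambda: print("update"), reads=["y2", "W"], writes=["W"])
g.run()   # executes conv, relu, update in dependency order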

Lazy allocation

  • When a device needs to create a new block, only the device itself is passed to that block, instead of allocating a piece of memory from the memory pool and passing the pointer to the block.
  • When the block is accessed for the first time, the device corresponding to the block allocates memory for it and then performs the access (see the sketch below).
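
Below is a minimal Python sketch of the lazy-allocation idea. The LazyBlock and DeviceSketch names are hypothetical and not part of the SINGA API: creating a block only records the owning device, and memory is allocated on first access.

class DeviceSketch:
    def malloc(self, size):
        # Stand-in for allocating `size` bytes from the device memory pool.
        return bytearray(size)

class LazyBlock:
    def __init__(self, device, size):
        self.device = device   # only the device is recorded at creation time
        self.size = size
        self.data = None       # no memory is allocated yet

    def mutable_data(self):
        # The owning device allocates memory on the first access.
        if self.data is None:
            self.data = self.device.malloc(self.size)
        return self.data

dev = DeviceSketch()
blk = LazyBlock(dev, 1024)   # creating the block does not allocate memory
buf = blk.mutable_data()     # the first access triggers the allocation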

Automatic recycling

  • When calculating the dependencies between operations during graph construction, the reference count of each tensor can also be calculated.
  • When an operation is completed, the reference count of every tensor that the operation used is decreased.
  • If a tensor's reference count reaches zero, the tensor won't be accessed by later operations, so its memory can be recycled (see the sketch after this list).
  • The program tracks the usage of each block. If a block is also used on the Python side, it will not be recycled, which is convenient for debugging on the Python side.
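
The following Python sketch illustrates the reference-counting idea with hypothetical names (it is not SINGA's implementation): the number of future reads of each block is computed up front and decremented as operations complete, and a block whose count reaches zero is handed back to the memory pool.

from collections import Counter

def run_with_recycling(ops, free):
    # ops: list of (fn, read_blocks, write_blocks) tuples in execution order.
    # free: callback that returns a block to the device memory pool.
    ref_count = Counter()
    for _, reads, _ in ops:
        for blk in reads:
            ref_count[blk] += 1   # number of future reads of each block

    for fn, reads, _ in ops:
        fn()
        for blk in reads:
            ref_count[blk] -= 1
            if ref_count[blk] == 0:
                # No later operation reads this block, so it can be recycled.
                # SINGA additionally skips blocks still referenced on the Python side.
                free(blk)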

How to use

  • An example of training a CNN:

class CNN(module.Module):

    def __init__(self, optimizer):
        super(CNN, self).__init__()

        self.conv1 = autograd.Conv2d(1, 20, 5, padding=0)
        self.conv2 = autograd.Conv2d(20, 50, 5, padding=0)
        self.linear1 = autograd.Linear(4 * 4 * 50, 500)
        self.linear2 = autograd.Linear(500, 10)
        self.pooling1 = autograd.MaxPool2d(2, 2, padding=0)
        self.pooling2 = autograd.MaxPool2d(2, 2, padding=0)

        self.optimizer = optimizer

    def forward(self, x):
        y = self.conv1(x)
        y = autograd.relu(y)
        y = self.pooling1(y)
        y = self.conv2(y)
        y = autograd.relu(y)
        y = self.pooling2(y)
        y = autograd.flatten(y)
        y = self.linear1(y)
        y = autograd.relu(y)
        y = self.linear2(y)
        return y

    def loss(self, x, ty):
        return autograd.softmax_cross_entropy(x, ty)

    def optim(self, loss):
        self.optimizer.backward_and_update(loss)

# initialize other objects
# ......
model = CNN(sgd)
model.train()
model.on_device(dev)
model.graph(graph, sequential)

# Train
for b in range(num_train_batch):
    # Generate the batch data for this iteration
    # ......

    # Copy the batch data into the input tensors
    tx.copy_from_numpy(x)
    ty.copy_from_numpy(y)

    # Train the model
    out = model(tx)
    loss = model.loss(out, ty)
    model.optim(loss)
  • Some settings in module.py (illustrated in the snippet after this list):
    • training: whether the neural network defined in the class is being trained or used for evaluation.
    • graph_mode: whether the model class defined by users is trained using the computational graph or not.
    • sequential: whether to execute the operations in the graph serially or in BFS order.
  • More examples:
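
As a quick illustration of how these settings come together, the snippet below reuses the CNN model, the sgd optimizer, and the device dev from the example above; the flag values here are only an example.

model = CNN(sgd)
model.train()                 # training mode
model.on_device(dev)
model.graph(True, False)      # graph_mode = True, sequential = False (BFS order)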

Evaluation

Single node

  • Experiment settings
  • Notations
    • s: second
    • it: iteration
    • Mem: peak memory usage of a single GPU
    • Throughput: number of images processed per second
    • Time: total time
    • Speed: iterations per second
    • Reduction: the memory usage reduction rate compared with the dev branch
    • Speedup: the speedup ratio compared with the dev branch
  • Result

Multi processes

  • Experiment settings
  • Notations: the same as above
  • Result

Conclusion

  • The computational graph does not affect training time or memory usage when the graph is disabled (i.e., it is backward compatible).
  • Enabling the computational graph significantly reduces memory usage and training time.

Include operations in graph

For new operations to be included in the computational graph, they should be submitted to the device. The Device class on the C++ side adds these operations to the computational graph, and the scheduler schedules them automatically.

Requirements

When submitting operations, the following requirements apply.

  • The function that the operation executes, as well as the data blocks that the operation reads and writes, need to be passed in.

  • For the operation's function: all variables used in the lambda expression need to be captured according to the following rules.

    • Capture by value: if the variable is a local variable or will be released immediately (e.g. intermediate tensors). If not captured by value, these variables will already have been destroyed when the buffered operation runs, because buffering only defers the real computation.

    • Capture by reference: if the variable is recorded on the Python side or is a global variable (e.g. the parameter W and the ConvHandle in the Conv2d class).

    • mutable: the lambda expression should have the mutable keyword if a variable captured by value is modified inside the expression.

Example

  • Python side: _Conv2d records x, W, b and handle in the class.
class _Conv2d(Operation):

    def __init__(self, handle, odd_padding=(0, 0, 0, 0)):
        super(_Conv2d, self).__init__()
        self.handle = handle  # record handle
        self.odd_padding = odd_padding
        if self.odd_padding != (0, 0, 0, 0):
            self.re_new_handle = True

    def forward(self, x, W, b=None):
        # other code
        # ......

        if training:
            if self.handle.bias_term:
                self.inputs = (x, W, b) # record x, W, b
            else:
                self.inputs = (x, W)

        # other code
        # ......

        if (type(self.handle) != singa.ConvHandle):
            return singa.GpuConvForward(x, W, b, self.handle)
        else:
            return singa.CpuConvForward(x, W, b, self.handle)

    def backward(self, dy):
        if (type(self.handle) != singa.ConvHandle):
            dx = singa.GpuConvBackwardx(dy, self.inputs[1], self.inputs[0],
                                        self.handle)
            dW = singa.GpuConvBackwardW(dy, self.inputs[0], self.inputs[1],
                                        self.handle)
            db = singa.GpuConvBackwardb(
                dy, self.inputs[2],
                self.handle) if self.handle.bias_term else None
        else:
            dx = singa.CpuConvBackwardx(dy, self.inputs[1], self.inputs[0],
                                        self.handle)
            dW = singa.CpuConvBackwardW(dy, self.inputs[0], self.inputs[1],
                                        self.handle)
            db = singa.CpuConvBackwardb(
                dy, self.inputs[2],
                self.handle) if self.handle.bias_term else None
        if self.odd_padding != (0, 0, 0, 0):
            dx = utils.handle_odd_pad_bwd(dx, self.odd_padding)

        if db:
            return dx, dW, db

        else:
            return dx, dW
  • C++ side: GpuConvBackwardx submits the lambda expression, together with the blocks it reads and writes, to the device.

Tensor GpuConvBackwardx(const Tensor &dy, const Tensor &W, const Tensor &x,
                        const CudnnConvHandle &cch) {
  CHECK_EQ(dy.device()->lang(), kCuda);

  Tensor dx;
  dx.ResetLike(x);

  dy.device()->Exec(
      /*
       * dx is a local variable, so it is captured by value
       * dy is an intermediate tensor not recorded on the Python side, so it is captured by value
       * W is recorded on the Python side, so it is captured by reference
       * cch is recorded on the Python side, so it is captured by reference
       */
      [dx, dy, &W, &cch](Context *ctx) mutable {
        Block *wblock = W.block(), *dyblock = dy.block(), *dxblock = dx.block();
        float alpha = 1.f, beta = 0.f;
        cudnnConvolutionBackwardData(
            ctx->cudnn_handle, &alpha, cch.filter_desc, wblock->data(),
            cch.y_desc, dyblock->data(), cch.conv_desc, cch.bp_data_alg,
            cch.workspace.block()->mutable_data(),
            cch.workspace_count * sizeof(float), &beta, cch.x_desc,
            dxblock->mutable_data());
      },
      {dy.block(), W.block()}, {dx.block(), cch.workspace.block()});
      /* the lambda expression reads the blocks of tensors dy and W
       * and writes the blocks of tensor dx and cch.workspace
       */

  return dx;
}

Future features

  • [ ] Graph substitution: replace a subgraph of the input computational graph with another subgraph that is functionally equivalent to the original one.
  • [ ] Support recalculation and swapping variables out of the GPU to reduce memory usage.
  • [ ] Perform operations in the graph in DFS order.
  • [ ] Perform operations in parallel on a single GPU.