SINGA supports buffering operations and computational graph. By using computational graph, SINGA can schedule the execution of operations and the memory allocation and release which makes training more efficient while using less memory.
There are three main features of computational graph, namely the construction of the computational graph, lazy allocation, automatic recycling and synchronization pipeline. Details as follows:
Computational graph construction
: Construct a computational graph based on the user-defined neural network or expressions and then run the graph to accomplish the training task. The computational graph also includes operations like synch and fused synch in the communicator.Lazy allocation
: When blocks need to be allocated, devices won't allocate memory for them immediately. Only when an operation uses this block for the first time, memory allocation will be performed.Automatic recycling
: Automatically deallocate the intermediate tensors which won't be used again in the following operations when we are running the graph in an iteration.Synchronization pipeline
: In previous synchronization operations, buffers were used to synchronize multiple tensors at once. But the communicator needs to collect all the tensors before copying them into the buffer. Synchronization pipeline can copy tensors to the buffer separately, which reduces the time for synchronous operations.
class CNN(module.Module): def __init__(self, optimizer): super(CNN, self).__init__() self.conv1 = autograd.Conv2d(1, 20, 5, padding=0) self.conv2 = autograd.Conv2d(20, 50, 5, padding=0) self.linear1 = autograd.Linear(4 * 4 * 50, 500) self.linear2 = autograd.Linear(500, 10) self.pooling1 = autograd.MaxPool2d(2, 2, padding=0) self.pooling2 = autograd.MaxPool2d(2, 2, padding=0) self.optimizer = optimizer def forward(self, x): y = self.conv1(x) y = autograd.relu(y) y = self.pooling1(y) y = self.conv2(y) y = autograd.relu(y) y = self.pooling2(y) y = autograd.flatten(y) y = self.linear1(y) y = autograd.relu(y) y = self.linear2(y) return y def loss(self, x, ty): return autograd.softmax_cross_entropy(x, ty) def optim(self, loss): self.optimizer.backward_and_update(loss) # initialization other objects # ...... model = CNN(sgd) model.train() model.on_device(dev) model.graph(graph, sequential) # Train for b in range(num_train_batch): # Generate the patch data in this iteration # ...... # Copy the patch data into input tensors tx.copy_from_numpy(x) ty.copy_from_numpy(y) # Train the model out = model(tx) loss = model.loss(out, ty) model.optim(loss)
trainng
: whether to train the neural network defined in the class or for evaluationgraph_mode
: the model class defined by users can be trained using computational graph or not.sequential
: execute operations in graph serially or in the order of BFS.s
:secondit
: iterationMem
:peak memory usage of single GPUThroughout
:number of pictures processed per secondTime
:total timeSpeed
:iterations per secondReduction
:the memory usage reduction rate compared with dev branchSeepdup
: speedup ratio compared with dev branchFor new operations, if they need to included in the computational graph, they should be submitted to the device. Device class in CPP will add these operations in the computational graph and scheduler will schedule them automatically.
When submitting operations, there are some requirements.
Need to pass in the function that the operation executes and the data blocks that the operation reads and writes
For the function of the operation: All variables used in lambda expressions need to be captured according to the following rules.
capture by value
: If the variable is a local variable or will be immediately released(e.g. intermediate tensors). If not captured by value, these variables will be destroyed after buffering. Buffering is just a way to defer real calculations.
capture by reference
:If the variable is recorded on the python side or a global variable(e.g. The parameter W and ConvHand in the Conv2d class).
mutable
: The lambda expression should have mutable tag if a variable captured by value is modified in an expression
class _Conv2d(Operation): def __init__(self, handle, odd_padding=(0, 0, 0, 0)): super(_Conv2d, self).__init__() self.handle = handle # record handle self.odd_padding = odd_padding if self.odd_padding != (0, 0, 0, 0): self.re_new_handle = True def forward(self, x, W, b=None): # other code # ...... if training: if self.handle.bias_term: self.inputs = (x, W, b) # record x, W, b else: self.inputs = (x, W) # other code # ...... if (type(self.handle) != singa.ConvHandle): return singa.GpuConvForward(x, W, b, self.handle) else: return singa.CpuConvForward(x, W, b, self.handle) def backward(self, dy): if (type(self.handle) != singa.ConvHandle): dx = singa.GpuConvBackwardx(dy, self.inputs[1], self.inputs[0], self.handle) dW = singa.GpuConvBackwardW(dy, self.inputs[0], self.inputs[1], self.handle) db = singa.GpuConvBackwardb( dy, self.inputs[2], self.handle) if self.handle.bias_term else None else: dx = singa.CpuConvBackwardx(dy, self.inputs[1], self.inputs[0], self.handle) dW = singa.CpuConvBackwardW(dy, self.inputs[0], self.inputs[1], self.handle) db = singa.CpuConvBackwardb( dy, self.inputs[2], self.handle) if self.handle.bias_term else None if self.odd_padding != (0, 0, 0, 0): dx = utils.handle_odd_pad_bwd(dx, self.odd_padding) if db: return dx, dW, db else: return dx, dW
Tensor GpuConvBackwardx(const Tensor &dy, const Tensor &W, const Tensor &x, const CudnnConvHandle &cch) { CHECK_EQ(dy.device()->lang(), kCuda); Tensor dx; dx.ResetLike(x); dy.device()->Exec( /* * dx is a local variable so it's captured by value * dy is an intermediate tensor and isn't recorded on the python side * W is an intermediate tensor but it's recorded on the python side * chh is a variable and it's recorded on the python side */ [dx, dy, &W, &cch](Context *ctx) mutable { Block *wblock = W.block(), *dyblock = dy.block(), *dxblock = dx.block(); float alpha = 1.f, beta = 0.f; cudnnConvolutionBackwardData( ctx->cudnn_handle, &alpha, cch.filter_desc, wblock->data(), cch.y_desc, dyblock->data(), cch.conv_desc, cch.bp_data_alg, cch.workspace.block()->mutable_data(), cch.workspace_count * sizeof(float), &beta, cch.x_desc, dxblock->mutable_data()); }, {dy.block(), W.block()}, {dx.block(), cch.workspace.block()}); /* the lambda expression reads the blocks of tensor dy and w * and writes the blocks of tensor dx and chh.workspace */ return dx; }