reference-guide-caffe2dml.md - systemds - Git at Google

 ---
 layout: global
 title: Beginner's Guide for Caffe2DML users
 description: Beginner's Guide for Caffe2DML users
 ---
 <!--
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to you under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 {% endcomment %}
 -->

 * This will become a table of contents (this text will be scraped).
 {:toc}

 <br/>


 # Layers supported in Caffe2DML

 Caffe2DML to be as compatible with [the Caffe specification](http://caffe.berkeleyvision.org/tutorial/layers.html) as possible.
 The main differences are given below along with the usage guide that mirrors the Caffe specification.

 ## Vision Layers

 ### Convolution Layer

 Invokes [nn/layers/conv2d_builtin.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/conv2d_builtin.dml)
 or [nn/layers/conv2d_depthwise.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/conv2d_depthwise.dml) layer.

 **Required Parameters:**

 - num_output: the number of filters
 - kernel_size (or kernel_h and kernel_w): specifies height and width of each filter

 **Optional Parameters:**

 - bias_term (default true): specifies whether to learn and apply a set of additive biases to the filter outputs
 - pad (or pad_h and pad_w) (default 0): specifies the number of pixels to (implicitly) add to each side of the input
 - stride (or stride_h and stride_w) (default 1): specifies the intervals at which to apply the filters to the input
 - group (g) (default 1): If g > 1, we restrict the connectivity of each filter to a subset of the input.
 Specifically, the input and output channels are separated into g groups,
 and the ith output group channels will be only connected to the ith input group channels.
 Note: we only support depthwise convolution, hence `g` should be divisible by number of channels

 **Parameters that are ignored:**

 - weight_filler: We use the heuristic by He et al., which limits the magnification of inputs/gradients
 during forward/backward passes by scaling unit-Gaussian weights by a factor of sqrt(2/n),
 under the assumption of relu neurons.
 - bias_filler: We use `constant bias_filler` with `value:0`

 **Sample Usage:**
 ```
 layer {
     name: "conv1"
     type: "Convolution"
     bottom: "data"
     top: "conv1"
     # learning rate and decay multipliers for the filters
     param { lr_mult: 1 decay_mult: 1 }
     # learning rate and decay multipliers for the biases
     param { lr_mult: 2 decay_mult: 0 }
     convolution_param {
       num_output: 96     # learn 96 filters
       kernel_size: 11    # each filter is 11x11
       stride: 4          # step 4 pixels between each filter application
       weight_filler {
         type: "xavier" # initialize the filters from a Gaussian
       }
       bias_filler {
         type: "constant" # initialize the biases to zero (0)
         value: 0
       }
     }
   }
  ```

 ### Pooling Layer

 Invokes [nn/layers/max_pool2d_builtin.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/max_pool2d_builtin.dml) layer.

 **Required Parameters:**

 - kernel_size (or kernel_h and kernel_w): specifies height and width of each filter

 **Optional Parameters:**
 - pool (default MAX): the pooling method. Currently, we only support MAX and AVE, not STOCHASTIC.
 - pad (or pad_h and pad_w) (default 0): specifies the number of pixels to (implicitly) add to each side of the input
 - stride (or stride_h and stride_w) (default 1): specifies the intervals at which to apply the filters to the input

 **Sample Usage:**
 ```
 layer {
   name: "pool1"
   type: "Pooling"
   bottom: "conv1"
   top: "pool1"
   pooling_param {
     pool: MAX
     kernel_size: 3 # pool over a 3x3 region
     stride: 2      # step two pixels (in the bottom blob) between pooling regions
   }
 }
 ```


 ### Upsampling Layer

 Invokes [nn/layers/upsample2d.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/upsample2d.dml) layer.

 **Required Parameters:**

 - size_h and size_w: specifies the upsampling factor for rows and columns.

 **Sample Usage:**
 ```
 layer {
   name: "upsample1"
   type: "Upsample"
   bottom: "pool1"
   top: "upsample1"
   upsample_param  {
     size_h = 2
     size_w = 2
   }
 }
 ```

 ### Padding Layer

 Invokes [nn/layers/zero_pad2d.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/zero_pad2d.dml) layer.

 **Optional Parameters:**

 - top_pad: Padding for top side (default: 0).
 - bottom_pad: Padding for bottom side (default: 0).
 - left_pad: Padding for left side (default: 0).
 - right_pad: Padding for right side (default: 0).
 - right_pad: Padding for right side (default: 0).
 - pad_value: value to use for padding (default: 0). Only zero padding supported for now.

 **Sample Usage:**
 ```
 layer {
   name: "padding1"
   type: "Padding"
   bottom: "pool1"
   top: "padding1"
   padding_param  {
     top_pad = 1
     bottom_pad = 1
     left_pad = 1
     right_pad = 1
     pad_value = 0
   }
 }
 ```


 ### Deconvolution Layer

 Invokes [nn/layers/conv2d_transpose.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/conv2d_transpose.dml)
 or [nn/layers/conv2d_transpose_depthwise.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/conv2d_transpose_depthwise.dml) layer.

 **Required Parameters:**

 - num_output: the number of filters
 - kernel_size (or kernel_h and kernel_w): specifies height and width of each filter

 **Optional Parameters:**

 - bias_term (default true): specifies whether to learn and apply a set of additive biases to the filter outputs
 - pad (or pad_h and pad_w) (default 0): specifies the number of pixels to (implicitly) add to each side of the input
 - stride (or stride_h and stride_w) (default 1): specifies the intervals at which to apply the filters to the input
 - group (g) (default 1): If g > 1, we restrict the connectivity of each filter to a subset of the input.
 Specifically, the input and output channels are separated into g groups,
 and the ith output group channels will be only connected to the ith input group channels.
 Note: we only support depthwise convolution, hence `g` should be divisible by number of channels

 **Parameters that are ignored:**

 - weight_filler: We use the heuristic by He et al., which limits the magnification of inputs/gradients
 during forward/backward passes by scaling unit-Gaussian weights by a factor of sqrt(2/n),
 under the assumption of relu neurons.
 - bias_filler: We use `constant bias_filler` with `value:0`

 **Sample Usage:**
 ```
 layer {
   name: "upconv_d5c_u4a"
   type: "Deconvolution"
   bottom: "u5d"
   top: "u4a"
   param {
     lr_mult: 0.0
     decay_mult: 0.0
   }
   convolution_param {
     num_output: 190
     bias_term: false
     pad: 1
     kernel_size: 4
     group: 190
     stride: 2
     weight_filler {
       type: "bilinear"
     }
   }
 }
 ```

 ## Recurrent Layers

 ### RNN Layer

 In a simple RNN, the output of the previous timestep is fed back in as an additional input at the current timestep.

 Invokes [nn/layers/rnn.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/rnn.dml) layer.

 **Required Parameters:**

 - num_output: number of output
 - return_sequences: Whether to return output at all timesteps, or just for the final timestep.

 **Sample Usage:**
 ```
 layer {
         top: "rnn_1"
         recurrent_param {
                 return_sequences: false
                 num_output: 32
         }
         type: "RNN"
         name: "rnn_1"
         bottom: "rnn_1_input"
 }
 ```

 ### LSTM Layer

 In an LSTM, an internal cell state is maintained, additive
 interactions operate over the cell state at each timestep, and
 some amount of this cell state is exposed as output at each
 timestep.  Additionally, the output of the previous timestep is fed
 back in as an additional input at the current timestep.

 Invokes [nn/layers/lstm.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/lstm.dml) layer.

 **Required Parameters:**

 - num_output: number of output
 - return_sequences: Whether to return output at all timesteps, or just for the final timestep.

 **Sample Usage:**
 ```
 layer {
         top: "lstm_1"
         recurrent_param {
                 return_sequences: false
                 num_output: 32
         }
         type: "LSTM"
         name: "lstm_1"
         bottom: "lstm_1_input"
 }
 ```

 ## Common Layers

 ### Inner Product / Fully Connected Layer

 Invokes [nn/layers/affine.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/affine.dml) layer.

 **Required Parameters:**

 - num_output: the number of filters

 **Parameters that are ignored:**
 - weight_filler (default type: 'constant' value: 0): We use the heuristic by He et al., which limits the magnification
 of inputs/gradients during forward/backward passes by scaling unit-Gaussian weights by a factor of sqrt(2/n), under the
 assumption of relu neurons.
 - bias_filler (default type: 'constant' value: 0): We use the default type and value.
 - bias_term (default true): specifies whether to learn and apply a set of additive biases to the filter outputs. We use `bias_term=true`.

 **Sample Usage:**
 ```
 layer {
   name: "fc8"
   type: "InnerProduct"
   # learning rate and decay multipliers for the weights
   param { lr_mult: 1 decay_mult: 1 }
   # learning rate and decay multipliers for the biases
   param { lr_mult: 2 decay_mult: 0 }
   inner_product_param {
     num_output: 1000
     weight_filler {
       type: "xavier"
     }
     bias_filler {
       type: "constant"
       value: 0
     }
   }
   bottom: "fc7"
   top: "fc8"
 }
 ```

 ### Dropout Layer

 Invokes [nn/layers/dropout.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/dropout.dml) layer.

 **Optional Parameters:**

 - dropout_ratio(default = 0.5): dropout ratio

 **Sample Usage:**
 ```
 layer {
   name: "drop1"
   type: "Dropout"
   bottom: "relu3"
   top: "drop1"
   dropout_param {
     dropout_ratio: 0.5
   }
 }
 ```

 ## Normalization Layers

 ### BatchNorm Layer

 This is used in combination with Scale layer.

 Invokes [nn/layers/batch_norm2d.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/batch_norm2d.dml) layer.

 **Optional Parameters:**
 - moving_average_fraction (default = .999): Momentum value for moving averages. Typical values are in the range of [0.9, 0.999].
 - eps (default = 1e-5): Smoothing term to avoid divide by zero errors. Typical values are in the range of [1e-5, 1e-3].

 **Parameters that are ignored:**
 - use_global_stats: If false, normalization is performed over the current mini-batch
 and global statistics are accumulated (but not yet used) by a moving average.
 If true, those accumulated mean and variance values are used for the normalization.
 By default, it is set to false when the network is in the training phase and true when the network is in the testing phase.

 **Sample Usage:**
 ```
 layer {
 	bottom: "conv1"
 	top: "conv1"
 	name: "bn_conv1"
 	type: "BatchNorm"
 	batch_norm_param {
 		use_global_stats: true
 	}
 }
 layer {
 	bottom: "conv1"
 	top: "conv1"
 	name: "scale_conv1"
 	type: "Scale"
 	scale_param {
 		bias_term: true
 	}
 }
 ```

 ## Activation / Neuron Layers

 In general, activation / Neuron layers are element-wise operators, taking one bottom blob and producing one top blob of the same size.
 In the layers below, we will ignore the input and out sizes as they are identical.

 ### ReLU / Rectified-Linear Layer

 Invokes [nn/layers/relu.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/relu.dml) layer.

 **Parameters that are ignored:**
 - negative_slope (default 0): specifies whether to leak the negative part by multiplying it with the slope value rather than setting it to 0.

 **Sample Usage:**
 ```
 layer {
   name: "relu1"
   type: "ReLU"
   bottom: "conv1"
   top: "conv1"
 }
 ```

 ### TanH Layer

 Invokes [nn/layers/tanh.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/tanh.dml) layer.

 **Sample Usage:**
 ```
 layer {
   name: "tanh1"
   type: "TanH"
   bottom: "conv1"
   top: "conv1"
 }
 ```

 ### Sigmoid Layer

 Invokes [nn/layers/sigmoid.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/sigmoid.dml) layer.

 **Sample Usage:**
 ```
 layer {
   name: "sigmoid1"
   type: "Sigmoid"
   bottom: "conv1"
   top: "conv1"
 }
 ```


 ### Threshold Layer

 Computes `X > threshold`

 **Parameters that are ignored:**
 - threshold (default: 0):Strictly positive values

 **Sample Usage:**
 ```
 layer {
   name: "threshold1"
   type: "Threshold"
   bottom: "conv1"
   top: "conv1"
 }
 ```

 ## Utility Layers

 ### Flatten Layer

 The Flatten layer is a utility layer that flattens an input of shape n * c * h * w to a simple vector output of shape n * (c*h*w).


 **Sample Usage:**
 ```
 layer {
         name: "flatten_1"
         type: "Flatten"
         bottom: "max_pooling2d_2"
         top: "flatten_1"
 }
 ```

 ### Eltwise Layer

 Element-wise operations such as product or sum between two blobs.

 **Parameters that are ignored:**
 - operation(default: SUM): element-wise operation. only SUM supported for now.
 - table_prod_grad(default: true): Whether to use an asymptotically slower (for >2 inputs) but stabler method
 of computing the gradient for the PROD operation. (No effect for SUM op.)

 **Sample Usage:**
 ```
 layer {
 	bottom: "res2a_branch1"
 	bottom: "res2a_branch2c"
 	top: "res2a"
 	name: "res2a"
 	type: "Eltwise"
 }
 ```

 ### Concat Layer

 **Inputs:**
 - `n_i * c_i * h * w` for each input blob i from 1 to K.

 **Outputs:**
 - out: Outputs, of shape
   - if axis = 0: `(n_1 + n_2 + ... + n_K) * c_1 * h * w`, and all input `c_i` should be the same.
   - if axis = 1: `n_1 * (c_1 + c_2 + ... + c_K) * h * w`, and all input `n_i` should be the same.

 **Optional Parameters:**
 - axis (default: 1): The axis along which to concatenate.

 **Sample Usage:**
 ```
 layer {
   name: "concat_d5cc_u5a-b"
   type: "Concat"
   bottom: "u5a"
   bottom: "d5c"
   top: "u5b"
 }
 ```

 ### Softmax Layer

 Invokes [nn/layers/softmax.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/softmax.dml) layer.

 Computes the forward pass for a softmax classifier.  The inputs
 are interpreted as unnormalized, log-probabilities for each of
 N examples, and the softmax function transforms them to normalized
 probabilities.

 This can be interpreted as a generalization of the sigmoid
 function to multiple classes.

 `probs_ij = e^scores_ij / sum(e^scores_i)`

 **Parameters that are ignored:**
 - axis (default: 1): The axis along which to perform the softmax.

 **Sample Usage:**
 ```
 layer {
   name: "sm"
   type: "Softmax"
   bottom: "score"
   top: "sm"
 }
 ```

 ## Loss Layers

 Loss drives learning by comparing an output to a target and assigning cost to minimize.
 The loss itself is computed by the forward pass and the gradient w.r.t. to the loss is computed by the backward pass.

 ### Softmax with Loss Layer

 The softmax loss layer computes the multinomial logistic loss of the softmax of its inputs.
 It’s conceptually identical to a softmax layer followed by a multinomial logistic loss layer, but provides a more numerically stable gradient.

 Invokes [nn/layers/softmax.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/softmax.dml)
 and [nn/layers/cross_entropy_loss.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/cross_entropy_loss.dml)
 for classification problems.

 For image segmentation problems, invokes [nn/layers/softmax2d_loss.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/softmax2d_loss.dml) layer.

 **Sample Usage:**
 ```
 layer {
   name: "loss"
   type: "SoftmaxWithLoss"
   bottom: "ip2"
   bottom: "label"
   top: "loss"
 }
 ```

 ### Euclidean layer

 The Euclidean loss layer computes the sum of squares of differences of its two inputs.

 Invokes [nn/layers/l2_loss.dml](https://github.com/apache/systemml/blob/master/scripts/nn/layers/l2_loss.dml) layer.

 **Sample Usage:**
 ```
 layer {
   name: "loss"
   type: "EuclideanLoss"
   bottom: "ip2"
   bottom: "label"
   top: "loss"
 }
 ```


 # Frequently asked questions

 #### What is the purpose of Caffe2DML API ?

 Most deep learning experts are more likely to be familiar with the Caffe's specification
 rather than DML language. For these users, the Caffe2DML API reduces the learning curve to using SystemDS.
 Instead of requiring the users to write a DML script for training, fine-tuning and testing the model,
 Caffe2DML takes as an input a network and solver specified in the Caffe specification
 and automatically generates the corresponding DML.

 #### With Caffe2DML, does SystemDS now require Caffe to be installed ?

 Absolutely not. We only support Caffe's API for convenience of the user as stated above.
 Since the Caffe's API is specified in the protobuf format, we are able to generate the java parser files
 and donot require Caffe to be installed. This is also true for Tensorboard feature of Caffe2DML.

 ```
 Dml.g4      ---> antlr  ---> DmlLexer.java, DmlListener.java, DmlParser.java ---> parse foo.dml
 caffe.proto ---> protoc ---> target/generated-sources/caffe/Caffe.java       ---> parse caffe_network.proto, caffe_solver.proto
 ```

 Again, the SystemDS engine doesnot invoke (or depend on) Caffe for any of its runtime operators.
 Since the grammar files for the respective APIs (i.e. `caffe.proto`) are used by SystemDS,
 we include their licenses in our jar files.

 #### How can I speedup the training with Caffe2DML ?

 - Enable native BLAS to improve the performance of CP convolution and matrix multiplication operators.
 If you are using OpenBLAS, please ensure that it was built with `USE_OPENMP` flag turned on.
 For more detail see http://apache.github.io/systemml/native-backend

 ```python
 caffe2dmlObject.setConfigProperty("sysml.native.blas", "auto")
 ```

 - Turn on the experimental codegen feature. This should help reduce unnecessary allocation cost after every binary operation.

 ```python
 caffe2dmlObject.setConfigProperty("sysml.codegen.enabled", "true").setConfigProperty("sysml.codegen.plancache", "true")
 ```

 - Tuned the [Garbage Collector](http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning).

 - Enable GPU support (described below).

 #### How to enable GPU support in Caffe2DML ?

 To be consistent with other mllearn algorithms, we recommend that you use following method instead of setting
 the `solver_mode` in solver file.

 ```python
 # The below method tells SystemDS optimizer to use a GPU-enabled instruction if the operands fit in the GPU memory
 caffe2dmlObject.setGPU(True)
 # The below method tells SystemDS optimizer to always use a GPU-enabled instruction irrespective of the memory requirement
 caffe2dmlObject.setForceGPU(True)
 ```

 #### What is lr_policy in the solver specification ?

 The parameter `lr_policy` specifies the learning rate decay policy. Caffe2DML supports following policies:
 - `fixed`: always return `base_lr`.
 - `step`: return `base_lr * gamma ^ (floor(iter / step))`
 - `exp`: return `base_lr * gamma ^ iter`
 - `inv`: return `base_lr * (1 + gamma * iter) ^ (- power)`
 - `poly`: the effective learning rate follows a polynomial decay, to be zero by the max_iter. return `base_lr (1 - iter/max_iter) ^ (power)`
 - `sigmoid`: the effective learning rate follows a sigmod decay return b`ase_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))`


 The parameters `base_lr` and  `lr_policy` are required and other parameters are optional:
 ```
 lr_policy: "step" # learning rate policy: drop the learning rate in "steps"
                   # by a factor of gamma every stepsize iterations (required)
 base_lr: 0.01     # begin training at a learning rate of 0.01 (required)
 gamma: 0.95       # drop the learning rate by the given factor (optional, default value: 0.95)
 stepsize: 100000  # drop the learning rate every 100K iterations (optional, default value: 100000)
 power: 0.75       # (optional, default value: 0.75)
 ```

 #### How do I regularize weight matrices in the neural network ?

 The user can specify the type of regularization using the parameter `regularization_type` in the solver file.
 The valid values are `L2` (default) and `L1`.
 Caffe2DML then invokes the backward function of the layers `nn/layers/l2_reg.dml` and `nn/layers/l1_reg.dml` respectively.
 The regularation strength is set using the property `weight_decay` in the solver file:
 ```
 regularization_type: "L2"
 weight_decay: 5e-4
 ```

 Like learning rate, you can customize the regularation strength of a given layer by specifying the property `decay_mult` in the network file:
 ```
 param { lr_mult: 1 decay_mult: 1 }
 ```

 #### How to set batch size ?

 Batch size is set in `data_param` of the Data layer:

 ```
 layer {
   name: "mnist"
   type: "Data"
   top: "data"
   top: "label"
   data_param {
     source: "mnist_train"
     batch_size: 64
     backend: LMDB
   }
 }
 ```

 #### How to set maximum number of iterations for training ?

 The maximum number of iterations can be set in the solver specification

 ```bash
 # The maximum number of iterations
 max_iter: 2000
 ```

 #### How to set the size of the validation dataset ?

 The size of the validation dataset is determined by the parameters `test_iter` and the batch size. For example: If the batch size is 64 and
 `test_iter` is 10, then the validation size is 640. This setting generates following DML code internally:

 ```python
 num_images = nrow(y_full)
 BATCH_SIZE = 64
 num_validation = 10 * BATCH_SIZE
 X = X_full[(num_validation+1):num_images,]; y = y_full[(num_validation+1):num_images,]
 X_val = X_full[1:num_validation,]; y_val = y_full[1:num_validation,]
 num_images = nrow(y)
 ```

 #### How to monitor loss via command-line ?

 To monitor loss, please set following parameters in the solver specification

 ```
 # Display training loss and accuracy every 100 iterations
 display: 100
 # Carry out validation every 500 training iterations and display validation loss and accuracy.
 test_iter: 10
 test_interval: 500
 ```

 #### How to pass a single jpeg image to Caffe2DML for prediction ?

 To convert a jpeg into NumPy matrix, you can use the [pillow package](https://pillow.readthedocs.io/) and
 SystemDS's  `convertImageToNumPyArr` utility function. The below pyspark code demonstrates the usage:

 ```python
 from PIL import Image
 import systemml as sml
 from systemml.mllearn import Caffe2DML
 img_shape = (3, 224, 224)
 input_image = sml.convertImageToNumPyArr(Image.open(img_file_path), img_shape=img_shape)
 resnet = Caffe2DML(sqlCtx, solver='ResNet_50_solver.proto', weights='ResNet_50_pretrained_weights', input_shape=img_shape)
 resnet.predict(input_image)
 ```

 #### How to prepare a directory of jpeg images for training with Caffe2DML ?

 The below pyspark code assumes that the input dataset has 2 labels `cat` and `dogs` and the filename has these labels as prefix.
 We iterate through the directory and convert each jpeg image into pyspark.ml.linalg.Vector using pyspark.
 These vectors are stored as DataFrame and randomized using Spark SQL's `orderBy(rand())` function.
 The DataFrame is then saved in parquet format to reduce the cost of preprocessing for repeated training.

 ```python
 from systemml.mllearn import Caffe2DML
 from pyspark.sql import SQLContext
 import numpy as np
 import urllib, os, scipy.ndimage
 from pyspark.ml.linalg import Vectors
 from pyspark import StorageLevel
 import systemml as sml
 from pyspark.sql.functions import rand
 # ImageNet specific parameters
 img_shape = (3, 224, 224)
 train_dir = '/home/biuser/dogs_vs_cats/train'
 def getLabelFeatures(filename):
 	from PIL import Image
 	vec = Vectors.dense(sml.convertImageToNumPyArr(Image.open(os.path.join(train_dir, filename)), img_shape=img_shape)[0,:])
 	if filename.lower().startswith('cat'):
 		return (1, vec)
 	elif filename.lower().startswith('dog'):
 		return (2, vec)
 	else:
 		raise ValueError('Expected the filename to start with either cat or dog')
 list_jpeg_files = os.listdir(train_dir)
 # 10 files per partition
 train_df = sc.parallelize(list_jpeg_files, int(len(list_jpeg_files)/10)).map(lambda filename : getLabelFeatures(filename)).toDF(['label', 'features']).orderBy(rand())
 # Optional: but helps seperates conversion-related from training
 # Alternatively, this dataframe can be passed directly to `caffe2dml_model.fit(train_df)`
 train_df.write.parquet('kaggle-cats-dogs.parquet')
 ```

 An alternative way to load images into a PySpark DataFrame for prediction, is to use MLLib's LabeledPoint class:

 ```python
 list_jpeg_files = os.listdir(train_dir)
 train_df = sc.parallelize(list_jpeg_files, int(len(list_jpeg_files)/10)).map(lambda filename : LabeledPoint(0, sml.convertImageToNumPyArr(Image.open(os.path.join(train_dir, filename)), img_shape=img_shape)[0,:])).toDF().select('features')
 # Note: convertVectorColumnsToML has an additional serialization cost
 train_df = MLUtils.convertVectorColumnsToML(train_df)
 ```


 #### Can I use Caffe2DML via Scala ?

 Though we recommend using Caffe2DML via its Python interfaces, it is possible to use it by creating an object of the class
 `org.apache.sysml.api.dl.Caffe2DML`. It is important to note that Caffe2DML's scala API is packaged in `systemml-*-extra.jar`.

 #### How can I get summary information of my network ?


 ```python
 lenet.summary()
 ```

 Output:

 ```
 +-----+---------------+--------------+------------+---------+-----------+---------+
 | Name|           Type|        Output|      Weight|     Bias|        Top|   Bottom|
 +-----+---------------+--------------+------------+---------+-----------+---------+
 |mnist|           Data| (, 1, 28, 28)|            |         |mnist,mnist|         |
 |conv1|    Convolution|(, 32, 28, 28)|   [32 X 25]| [32 X 1]|      conv1|    mnist|
 |relu1|           ReLU|(, 32, 28, 28)|            |         |      relu1|    conv1|
 |pool1|        Pooling|(, 32, 14, 14)|            |         |      pool1|    relu1|
 |conv2|    Convolution|(, 64, 14, 14)|  [64 X 800]| [64 X 1]|      conv2|    pool1|
 |relu2|           ReLU|(, 64, 14, 14)|            |         |      relu2|    conv2|
 |pool2|        Pooling|  (, 64, 7, 7)|            |         |      pool2|    relu2|
 |  ip1|   InnerProduct| (, 512, 1, 1)|[3136 X 512]|[1 X 512]|        ip1|    pool2|
 |relu3|           ReLU| (, 512, 1, 1)|            |         |      relu3|      ip1|
 |drop1|        Dropout| (, 512, 1, 1)|            |         |      drop1|    relu3|
 |  ip2|   InnerProduct|  (, 10, 1, 1)|  [512 X 10]| [1 X 10]|        ip2|    drop1|
 | loss|SoftmaxWithLoss|  (, 10, 1, 1)|            |         |       loss|ip2,mnist|
 +-----+---------------+--------------+------------+---------+-----------+---------+
 ```

 #### How can I view the script generated by Caffe2DML ?

 To view the generated DML script (and additional debugging information), please set the `debug` parameter to True.

 ```python
 lenet.set(debug=True)
 ```

 Output:
 ```
 001|debug = TRUE
 002|source("nn/layers/softmax.dml") as softmax
 003|source("nn/layers/cross_entropy_loss.dml") as cross_entropy_loss
 004|source("nn/layers/conv2d_builtin.dml") as conv2d_builtin
 005|source("nn/layers/relu.dml") as relu
 006|source("nn/layers/max_pool2d_builtin.dml") as max_pool2d_builtin
 007|source("nn/layers/affine.dml") as affine
 008|source("nn/layers/dropout.dml") as dropout
 009|source("nn/optim/sgd_momentum.dml") as sgd_momentum
 010|source("nn/layers/l2_reg.dml") as l2_reg
 011|X_full_path = ifdef($X, " ")
 012|X_full = read(X_full_path)
 013|y_full_path = ifdef($y, " ")
 014|y_full = read(y_full_path)
 015|num_images = nrow(y_full)
 016|# Convert to one-hot encoding (Assumption: 1-based labels)
 017|y_full = table(seq(1,num_images,1), y_full, num_images, 10)
 018|weights = ifdef($weights, " ")
 019|# Initialize the layers and solvers
 020|X_full = X_full * 0.00390625
 021|BATCH_SIZE = 64
 022|[conv1_weight,conv1_bias] = conv2d_builtin::init(32,1,5,5)
 023|[conv2_weight,conv2_bias] = conv2d_builtin::init(64,32,5,5)
 024|[ip1_weight,ip1_bias] = affine::init(3136,512)
 025|[ip2_weight,ip2_bias] = affine::init(512,10)
 026|conv1_weight_v = sgd_momentum::init(conv1_weight)
 027|conv1_bias_v = sgd_momentum::init(conv1_bias)
 028|conv2_weight_v = sgd_momentum::init(conv2_weight)
 029|conv2_bias_v = sgd_momentum::init(conv2_bias)
 030|ip1_weight_v = sgd_momentum::init(ip1_weight)
 031|ip1_bias_v = sgd_momentum::init(ip1_bias)
 032|ip2_weight_v = sgd_momentum::init(ip2_weight)
 033|ip2_bias_v = sgd_momentum::init(ip2_bias)
 034|num_validation = 10 * BATCH_SIZE
 035|# Sanity check to ensure that validation set is not too large
 036|if(num_validation > ceil(0.3 * num_images)) {
 037|    max_test_iter = floor(ceil(0.3 * num_images) / BATCH_SIZE)
 038|    stop("Too large validation size. Please reduce test_iter to " + max_test_iter)
 039|}
 040|X = X_full[(num_validation+1):num_images,]; y = y_full[(num_validation+1):num_images,]; X_val = X_full[1:num_validation,]; y_val = y_full[1:num_validation,]; num_images = nrow(y)
 041|num_iters_per_epoch = ceil(num_images / BATCH_SIZE)
 042|max_epochs = ceil(2000 / num_iters_per_epoch)
 043|iter = 0
 044|lr = 0.01
 045|for(e in 1:max_epochs) {
 046|    for(i in 1:num_iters_per_epoch) {
 047|            beg = ((i-1) * BATCH_SIZE) %% num_images + 1; end = min(beg + BATCH_SIZE - 1, num_images); Xb = X[beg:end,]; yb = y[beg:end,];
 048|            iter = iter + 1
 049|            # Perform forward pass
 050|            [out3,ignoreHout_3,ignoreWout_3] = conv2d_builtin::forward(Xb,conv1_weight,conv1_bias,1,28,28,5,5,1,1,2,2)
 051|            out4 = relu::forward(out3)
 052|            [out5,ignoreHout_5,ignoreWout_5] = max_pool2d_builtin::forward(out4,32,28,28,2,2,2,2,0,0)
 053|            [out6,ignoreHout_6,ignoreWout_6] = conv2d_builtin::forward(out5,conv2_weight,conv2_bias,32,14,14,5,5,1,1,2,2)
 054|            out7 = relu::forward(out6)
 055|            [out8,ignoreHout_8,ignoreWout_8] = max_pool2d_builtin::forward(out7,64,14,14,2,2,2,2,0,0)
 056|            out9 = affine::forward(out8,ip1_weight,ip1_bias)
 057|            out10 = relu::forward(out9)
 058|            [out11,mask11] = dropout::forward(out10,0.5,-1)
 059|            out12 = affine::forward(out11,ip2_weight,ip2_bias)
 060|            out13 = softmax::forward(out12)
 061|            # Perform backward pass
 062|            dProbs = cross_entropy_loss::backward(out13,yb); dOut13 = softmax::backward(dProbs,out12); dOut13_12 = dOut13; dOut13_2 = dOut13;
 063|            [dOut12,ip2_dWeight,ip2_dBias] = affine::backward(dOut13_12,out11,ip2_weight,ip2_bias); dOut12_11 = dOut12;
 064|            dOut11 = dropout::backward(dOut12_11,out10,0.5,mask11); dOut11_10 = dOut11;
 065|            dOut10 = relu::backward(dOut11_10,out9); dOut10_9 = dOut10;
 066|            [dOut9,ip1_dWeight,ip1_dBias] = affine::backward(dOut10_9,out8,ip1_weight,ip1_bias); dOut9_8 = dOut9;
 067|            dOut8 = max_pool2d_builtin::backward(dOut9_8,7,7,out7,64,14,14,2,2,2,2,0,0); dOut8_7 = dOut8;
 068|            dOut7 = relu::backward(dOut8_7,out6); dOut7_6 = dOut7;
 069|            [dOut6,conv2_dWeight,conv2_dBias] = conv2d_builtin::backward(dOut7_6,14,14,out5,conv2_weight,conv2_bias,32,14,14,5,5,1,1,2,2); dOut6_5 = dOut6;
 070|            dOut5 = max_pool2d_builtin::backward(dOut6_5,14,14,out4,32,28,28,2,2,2,2,0,0); dOut5_4 = dOut5;
 071|            dOut4 = relu::backward(dOut5_4,out3); dOut4_3 = dOut4;
 072|            [dOut3,conv1_dWeight,conv1_dBias] = conv2d_builtin::backward(dOut4_3,28,28,Xb,conv1_weight,conv1_bias,1,28,28,5,5,1,1,2,2); dOut3_2 = dOut3;
 073|            # Update the parameters
 074|            conv1_dWeight_reg = l2_reg::backward(conv1_weight, 5.000000237487257E-4)
 075|            conv1_dWeight = conv1_dWeight + conv1_dWeight_reg
 076|            [conv1_weight,conv1_weight_v] = sgd_momentum::update(conv1_weight,conv1_dWeight,(lr * 1.0),0.8999999761581421,conv1_weight_v)
 077|            [conv1_bias,conv1_bias_v] = sgd_momentum::update(conv1_bias,conv1_dBias,(lr * 2.0),0.8999999761581421,conv1_bias_v)
 078|            conv2_dWeight_reg = l2_reg::backward(conv2_weight, 5.000000237487257E-4)
 079|            conv2_dWeight = conv2_dWeight + conv2_dWeight_reg
 080|            [conv2_weight,conv2_weight_v] = sgd_momentum::update(conv2_weight,conv2_dWeight,(lr * 1.0),0.8999999761581421,conv2_weight_v)
 081|            [conv2_bias,conv2_bias_v] = sgd_momentum::update(conv2_bias,conv2_dBias,(lr * 2.0),0.8999999761581421,conv2_bias_v)
 082|            ip1_dWeight_reg = l2_reg::backward(ip1_weight, 5.000000237487257E-4)
 083|            ip1_dWeight = ip1_dWeight + ip1_dWeight_reg
 084|            [ip1_weight,ip1_weight_v] = sgd_momentum::update(ip1_weight,ip1_dWeight,(lr * 1.0),0.8999999761581421,ip1_weight_v)
 085|            [ip1_bias,ip1_bias_v] = sgd_momentum::update(ip1_bias,ip1_dBias,(lr * 2.0),0.8999999761581421,ip1_bias_v)
 086|            ip2_dWeight_reg = l2_reg::backward(ip2_weight, 5.000000237487257E-4)
 087|            ip2_dWeight = ip2_dWeight + ip2_dWeight_reg
 088|            [ip2_weight,ip2_weight_v] = sgd_momentum::update(ip2_weight,ip2_dWeight,(lr * 1.0),0.8999999761581421,ip2_weight_v)
 089|            [ip2_bias,ip2_bias_v] = sgd_momentum::update(ip2_bias,ip2_dBias,(lr * 2.0),0.8999999761581421,ip2_bias_v)
 090|            # Compute training loss & accuracy
 091|            if(iter  %% 100 == 0) {
 092|                    loss = 0
 093|                    accuracy = 0
 094|                    tmp_loss = cross_entropy_loss::forward(out13,yb)
 095|                    loss = loss + tmp_loss
 096|                    true_yb = rowIndexMax(yb)
 097|                    predicted_yb = rowIndexMax(out13)
 098|                    accuracy = mean(predicted_yb == true_yb)*100
 099|                    training_loss = loss
 100|                    training_accuracy = accuracy
 101|                    print("Iter:" + iter + ", training loss:" + training_loss + ", training accuracy:" + training_accuracy)
 102|                    if(debug) {
 103|                            num_rows_error_measures = min(10, ncol(yb))
 104|                            error_measures = matrix(0, rows=num_rows_error_measures, cols=5)
 105|                            for(class_i in 1:num_rows_error_measures) {
 106|                                    tp = sum( (true_yb == predicted_yb) * (true_yb == class_i) )
 107|                                    tp_plus_fp = sum( (predicted_yb == class_i) )
 108|                                    tp_plus_fn = sum( (true_yb == class_i) )
 109|                                    precision = tp / tp_plus_fp
 110|                                    recall = tp / tp_plus_fn
 111|                                    f1Score = 2*precision*recall / (precision+recall)
 112|                                    error_measures[class_i,1] = class_i
 113|                                    error_measures[class_i,2] = precision
 114|                                    error_measures[class_i,3] = recall
 115|                                    error_measures[class_i,4] = f1Score
 116|                                    error_measures[class_i,5] = tp_plus_fn
 117|                            }
 118|                            print("class    \tprecision\trecall  \tf1-score\tnum_true_labels\n" + toString(error_measures, decimal=7, sep="\t"))
 119|                    }
 120|            }
 121|            # Compute validation loss & accuracy
 122|            if(iter  %% 500 == 0) {
 123|                    loss = 0
 124|                    accuracy = 0
 125|                    validation_loss = 0
 126|                    validation_accuracy = 0
 127|                    for(iVal in 1:num_iters_per_epoch) {
 128|                            beg = ((iVal-1) * BATCH_SIZE) %% num_validation + 1; end = min(beg + BATCH_SIZE - 1, num_validation); Xb = X_val[beg:end,]; yb = y_val[beg:end,];
 129|                            # Perform forward pass
 130|                            [out3,ignoreHout_3,ignoreWout_3] = conv2d_builtin::forward(Xb,conv1_weight,conv1_bias,1,28,28,5,5,1,1,2,2)
 131|                            out4 = relu::forward(out3)
 132|                            [out5,ignoreHout_5,ignoreWout_5] = max_pool2d_builtin::forward(out4,32,28,28,2,2,2,2,0,0)
 133|                            [out6,ignoreHout_6,ignoreWout_6] = conv2d_builtin::forward(out5,conv2_weight,conv2_bias,32,14,14,5,5,1,1,2,2)
 134|                            out7 = relu::forward(out6)
 135|                            [out8,ignoreHout_8,ignoreWout_8] = max_pool2d_builtin::forward(out7,64,14,14,2,2,2,2,0,0)
 136|                            out9 = affine::forward(out8,ip1_weight,ip1_bias)
 137|                            out10 = relu::forward(out9)
 138|                            [out11,mask11] = dropout::forward(out10,0.5,-1)
 139|                            out12 = affine::forward(out11,ip2_weight,ip2_bias)
 140|                            out13 = softmax::forward(out12)
 141|                            tmp_loss = cross_entropy_loss::forward(out13,yb)
 142|                            loss = loss + tmp_loss
 143|                            true_yb = rowIndexMax(yb)
 144|                            predicted_yb = rowIndexMax(out13)
 145|                            accuracy = mean(predicted_yb == true_yb)*100
 146|                            validation_loss = validation_loss + loss
 147|                            validation_accuracy = validation_accuracy + accuracy
 148|                    }
 149|                    validation_accuracy = validation_accuracy / num_iters_per_epoch
 150|                    print("Iter:" + iter + ", validation loss:" + validation_loss + ", validation accuracy:" + validation_accuracy)
 151|            }
 152|    }
 153|    # Learning rate
 154|    lr = (0.009999999776482582 * 0.949999988079071^e)
 155|}

 Iter:100, training loss:0.24014199350958168, training accuracy:87.5
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       3.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 3.0000000       0.8888889       0.8888889       0.8888889       9.0000000
 4.0000000       0.7500000       0.7500000       0.7500000       4.0000000
 5.0000000       0.7500000       1.0000000       0.8571429       3.0000000
 6.0000000       0.8333333       1.0000000       0.9090909       5.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 8.0000000       0.8571429       0.7500000       0.8000000       8.0000000
 9.0000000       1.0000000       0.5714286       0.7272727       7.0000000
 10.0000000      0.7272727       0.8888889       0.8000000       9.0000000

 Iter:200, training loss:0.09555593867171894, training accuracy:98.4375
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       10.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       3.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       9.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 6.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 7.0000000       1.0000000       0.6666667       0.8000000       3.0000000
 8.0000000       1.0000000       1.0000000       1.0000000       9.0000000
 9.0000000       0.8571429       1.0000000       0.9230769       6.0000000
 10.0000000      1.0000000       1.0000000       1.0000000       3.0000000

 Iter:300, training loss:0.058686794512570216, training accuracy:98.4375
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       9.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 6.0000000       1.0000000       0.8750000       0.9333333       8.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 8.0000000       1.0000000       1.0000000       1.0000000       2.0000000
 9.0000000       0.8888889       1.0000000       0.9411765       8.0000000
 10.0000000      1.0000000       1.0000000       1.0000000       8.0000000

 Iter:400, training loss:0.08742103541529415, training accuracy:96.875
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 2.0000000       0.8000000       1.0000000       0.8888889       8.0000000
 3.0000000       1.0000000       0.8333333       0.9090909       6.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 6.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 8.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 9.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 10.0000000      1.0000000       0.9230769       0.9600000       13.0000000

 Iter:500, training loss:0.05873836245880005, training accuracy:98.4375
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       3.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       9.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 6.0000000       1.0000000       0.8571429       0.9230769       7.0000000
 7.0000000       0.8571429       1.0000000       0.9230769       6.0000000
 8.0000000       1.0000000       1.0000000       1.0000000       9.0000000
 9.0000000       1.0000000       1.0000000       1.0000000       10.0000000
 10.0000000      1.0000000       1.0000000       1.0000000       5.0000000

 Iter:500, validation loss:260.1580978627665, validation accuracy:96.43954918032787
 Iter:600, training loss:0.07584116043829209, training accuracy:98.4375
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 6.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 8.0000000       1.0000000       0.9230769       0.9600000       13.0000000
 9.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 10.0000000      0.8333333       1.0000000       0.9090909       5.0000000

 Iter:700, training loss:0.07973166944626336, training accuracy:98.4375
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 6.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       10.0000000
 8.0000000       0.8000000       1.0000000       0.8888889       4.0000000
 9.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 10.0000000      1.0000000       0.9166667       0.9565217       12.0000000

 Iter:800, training loss:0.0063778595034221855, training accuracy:100.0
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       9.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 6.0000000       1.0000000       1.0000000       1.0000000       9.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 8.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 9.0000000       1.0000000       1.0000000       1.0000000       2.0000000
 10.0000000      1.0000000       1.0000000       1.0000000       6.0000000

 Iter:900, training loss:0.019673112167879484, training accuracy:100.0
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       3.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       3.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 6.0000000       1.0000000       1.0000000       1.0000000       10.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 8.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 9.0000000       1.0000000       1.0000000       1.0000000       12.0000000
 10.0000000      1.0000000       1.0000000       1.0000000       7.0000000

 Iter:1000, training loss:0.06137978002508307, training accuracy:96.875
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       8.0000000
 4.0000000       0.8333333       0.8333333       0.8333333       6.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       5.0000000
 6.0000000       1.0000000       1.0000000       1.0000000       10.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       3.0000000
 8.0000000       0.8888889       0.8888889       0.8888889       9.0000000
 9.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 10.0000000      1.0000000       1.0000000       1.0000000       4.0000000

 Iter:1000, validation loss:238.62301345198944, validation accuracy:97.02868852459017
 Iter:1100, training loss:0.023325103696013115, training accuracy:100.0
 class           precision       recall          f1-score        num_true_labels
 1.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 2.0000000       1.0000000       1.0000000       1.0000000       10.0000000
 3.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 4.0000000       1.0000000       1.0000000       1.0000000       4.0000000
 5.0000000       1.0000000       1.0000000       1.0000000       2.0000000
 6.0000000       1.0000000       1.0000000       1.0000000       10.0000000
 7.0000000       1.0000000       1.0000000       1.0000000       7.0000000
 8.0000000       1.0000000       1.0000000       1.0000000       6.0000000
 9.0000000       1.0000000       1.0000000       1.0000000       9.0000000
 10.0000000      1.0000000       1.0000000       1.0000000       6.0000000
 ...
 ```

 #### Design document of Caffe2DML

 Caffe2DML is designed to fit well into the mllearn framework. Hence, the key methods that were to be implemented are:
 - `getTrainingScript` for the `Estimator` class.
 - `getPredictionScript` for the `Model` class.

 These methods should be the starting point of any developer to understand the DML generated for training and prediction respectively.

 To simplify the DML generation in `getTrainingScript` and `getPredictionScript method`, we use DMLGenerator interface.
 This interface generates DML string for common operations such as loops (such as if, for, while) as well as built-in functions (read, write), etc.
 Also, this interface helps in "code reading" of the Caffe2DML class.

 Here is an analogy for SystemDS developers to think of various moving components of Caffe2DML:
 - Like `Dml.g4` in the `org.apache.sysml.parser.dml` package, `caffe.proto` in the `src/main/proto/caffe` directory
 is used to generate classes to parse the input files.

 ```
 Dml.g4      ---> antlr  ---> DmlLexer.java, DmlListener.java, DmlParser.java
 caffe.proto ---> protoc ---> target/generated-sources/caffe/Caffe.java
 ```

 - Just like the classes generated by Dml.g4 are used to parse input DML file,
 the `target/generated-sources/caffe/Caffe.java` class is used to parse the input caffe network/deploy prototxt and solver files.

 - You can think of `.caffemodel` file as DML file with matrix values encoded in it (please see below example).
 So it is possible to read `.caffemodel` file with the `Caffe.java` class. This is done in Utils.scala's `readCaffeNet` method.

 ```
 X = matrix("1.2 3.5 0.999 7.123", rows=2, cols=2)
 ...
 ```

 - Just like we convert the AST generated by antlr into our DMLProgram representation, we convert
 caffe's abstraction into the below given mapping classes for layer, solver and learning rate.
 These mapping classes maps the corresponding Caffe abstraction to the SystemDS-NN library.
 This greatly simplifies adding new layers into Caffe2DML:
 ```
 trait CaffeLayer {
   // Any layer that wants to reuse SystemDS-NN has to override following methods that help in generating the DML for the given layer:
   def sourceFileName:String;
   def init(dmlScript:StringBuilder):Unit;
   def forward(dmlScript:StringBuilder, isPrediction:Boolean):Unit;
   def backward(dmlScript:StringBuilder, outSuffix:String):Unit;
   ...
 }
 trait CaffeSolver {
   def sourceFileName:String;
   def update(dmlScript:StringBuilder, layer:CaffeLayer):Unit;
   def init(dmlScript:StringBuilder, layer:CaffeLayer):Unit;
 }
 ```

 To simplify the traversal of the network, we created a Network interface:
 ```
 trait Network {
   def getLayers(): List[String]
   def getCaffeLayer(layerName:String):CaffeLayer
   def getBottomLayers(layerName:String): Set[String]
   def getTopLayers(layerName:String): Set[String]
   def getLayerID(layerName:String): Int
 }
 ```

 One of the key design restriction of Caffe2DML is that every layer is identified uniquely by its name.
 This restriction simplifies the code significantly.
 To shield from network files that violates this restriction, Caffe2DML performs rewrites in CaffeNetwork class (search for condition 1-5 in Caffe2DML class).

 Like Caffe, Caffe2DML also expects the layers to be in sorted order.