|  | # MXNet System Architecture | 
|  |  | 
|  |  | 
|  |  | 
|  | This figure shows the major modules and components of the MXNet system and their interaction. The modules are: | 
|  |  | 
|  | - Runtime Dependency Engine: Schedules and executes the | 
|  | operations according to their read/write dependency. | 
|  | - Storage Allocator: Efficiently allocates and recycles memory blocks | 
|  | on host (CPU) and devices (GPUs). | 
- Resource Manager: Manages global resources, such as the random number generator
and temporary space.
- NDArray: Dynamic, asynchronous n-dimensional arrays,
which provide flexible imperative programming in MXNet.
|  | - Symbolic Execution: Static symbolic graph executor, | 
|  | which provides efficient symbolic graph execution and optimization. | 
|  | - Operator: Operators that define static forward and gradient | 
|  | calculation (backprop). | 
|  | - SimpleOp: Operators that extend NDArray operators and symbolic operators | 
|  | in a unified fashion. | 
|  | - Symbol Construction: Symbolic construction, which provides a way to construct | 
|  | a computation graph (net configuration). | 
|  | - KVStore: Key-value store interface for efficient parameter synchronization. | 
- Data Loading (IO): Efficient distributed data loading and augmentation.
|  |  | 
|  | # MXNet System Components | 
|  |  | 
|  | ## Execution Engine | 
|  |  | 
|  | You can use MXNet's engine not only for deep learning, | 
|  | but for any domain-specific problem. | 
|  | It's designed to solve a general problem: | 
|  | execute a bunch of functions following their dependencies. | 
|  | Execution of any two functions with dependencies should be serialized. | 
|  | To boost performance, functions with no dependencies *can* be executed in parallel. | 
|  | For a general discussion of this topic, | 
|  | see our [notes on the dependency engine](note_engine.md). | 
|  |  | 
|  | ### Interface | 
|  |  | 
|  | The following API is the core interface for the execution engine: | 
|  |  | 
|  | ```c++ | 
virtual void PushSync(Fn exec_fun, Context exec_ctx,
                      std::vector<VarHandle> const& const_vars,
                      std::vector<VarHandle> const& mutate_vars) = 0;
|  | ``` | 
|  | This API allows you to push a function (`exec_fun`), | 
|  | along with its context information and dependencies, to the engine. | 
|  | `exec_ctx` is the context information in which the `exec_fun` should be executed, | 
|  | `const_vars` denotes the variables that the function reads from, | 
|  | and `mutate_vars` are the variables to be modified. | 
|  | The engine provides the following guarantee: | 
|  |  | 
|  | >*The execution of any two functions | 
|  | that modify a common variable | 
|  | is serialized in their push order.* | 
|  |  | 
|  | ### Function | 
|  |  | 
|  | The function type of the engine is: | 
|  |  | 
|  | ```c++ | 
|  | using Fn = std::function<void(RunContext)>; | 
|  | ``` | 
|  | `RunContext` contains runtime information, which is determined by the engine: | 
|  |  | 
|  | ```c++ | 
struct RunContext {
  // stream pointer which could be safely cast to
  // cudaStream_t* type
  void *stream;
};
|  | ``` | 
|  | Alternatively, you could use `mxnet::engine::DAGEngine::Fn`, which has the same type definition. | 
|  |  | 
All of the functions are executed by the engine's internal threads.
In such a model, it's usually not a good idea to push *blocking* functions
to the engine (usually for dealing with I/O tasks like disk, web service, UI, etc.),
because they will occupy the execution thread and reduce the total throughput.
For such cases, we provide another *asynchronous* function type:
|  |  | 
|  | ```c++ | 
|  | using Callback = std::function<void()>; | 
|  | using AsyncFn = std::function<void(RunContext, Callback)>; | 
|  | ``` | 
|  | In the `AsyncFn` function, you can pass the heavy part to your own threads | 
|  | and safely exit the body of the function. | 
|  | The engine doesn't consider the function finished | 
|  | until the `Callback` function is called. | 
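
For illustration, here is a minimal sketch (not taken from the MXNet source) that uses the type aliases above and hands the heavy work to a separate `std::thread`:

```c++
// Sketch: delegate blocking work to your own thread and notify the engine
// through the callback once it is done.
AsyncFn async_fn = [](RunContext rctx, Callback on_complete) {
  std::thread([on_complete]() {
    // ... perform the blocking I/O or other long-running work here ...
    on_complete();  // only now does the engine consider the function finished
  }).detach();
};
```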
|  |  | 
|  | ### Context | 
|  |  | 
You can specify the `Context` in which the function should be executed.
This usually includes whether the function should be run on a CPU or a GPU,
and if you specify a GPU, which GPU to use.
|  | `Context` is different from `RunContext`. | 
|  | `Context` contains device type (GPU/CPU) and device id, | 
|  | while `RunContext` contains information that can be decided only during runtime, | 
|  | for example, on which stream the function should be executed. | 
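
As a hedged sketch, assuming the usual `Context::CPU()`/`Context::GPU(dev_id)` helpers are available, specifying a context looks like this:

```c++
Context cpu_ctx = Context::CPU();   // run on the host CPU
Context gpu_ctx = Context::GPU(0);  // run on GPU 0; the device id selects which GPU
```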
|  |  | 
|  | ### VarHandle | 
|  |  | 
|  | `VarHandle` is used to specify the dependencies of functions. | 
|  | The MXNet engine is designed to be decoupled from other MXNet modules. | 
|  | So `VarHandle` is like an engine-provided token you use | 
|  | to represent the external resources the functions can use or modify. | 
|  | It's designed to be lightweight, so creating, | 
|  | deleting, or copying a variable incurs little overhead. | 
|  | Upon pushing the functions, you need to specify the variables | 
|  | that will be used (immutable) in the `const_vars` vector, | 
|  | and the variables that will be modified (mutable) in the `mutate_vars` vector. | 
|  | The engine uses one rule for resolving the dependencies among functions: | 
|  |  | 
|  | >*The execution of any two functions when one of them modifies at least one common variable is serialized in their push order.* | 
|  |  | 
|  | For example, if `Fn1` and `Fn2` both mutate `V2` then `Fn2` | 
|  | is guaranteed to be executed after `Fn1` | 
|  | if `Fn2` is pushed after `Fn1`. | 
|  | On the other hand, if `Fn1` and `Fn2` both use `V2`, | 
|  | their actual execution order could be random. | 
|  |  | 
|  | This design allows the engine to schedule *state-mutating* operations in a manner | 
|  | that minimizes calls to allocate new memory. | 
|  | For example, the weight update function in DNN | 
|  | can now use the `+=` operator | 
|  | to update the weights in place, | 
|  | rather than generating a new weight array each time. | 
|  |  | 
|  | To create a variable, use the `NewVar()` API. | 
|  | To delete a variable, use the `PushDelete` API. | 
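
The following sketch shows the rule in action. The names are illustrative, and the `engine` pointer is assumed to be available (for example via `Engine::Get()`):

```c++
// Two pushes that mutate the same variable are serialized in push order;
// pushes that only read it may run in parallel with each other.
VarHandle weight = engine->NewVar();
engine->PushSync([](RunContext rctx) {
    /* initialize the weights */
  }, Context::CPU(), {}, {weight});  // mutates `weight`
engine->PushSync([](RunContext rctx) {
    /* weight += gradient, an in-place update */
  }, Context::CPU(), {}, {weight});  // also mutates `weight`, so it runs after the first push
```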
|  |  | 
|  | ### Push and Wait | 
|  |  | 
|  | *All `Push` APIs are asynchronous.* The API call returns immediately | 
|  | regardless of whether the pushed `Fn` is finished or not. | 
|  | This allows the engine to start computing at the same time | 
|  | as the user thread is pushing functions. | 
|  | `Push` APIs are not thread-safe. | 
|  | To be specific, only one thread should make engine API calls at a time. | 
|  |  | 
|  | If you want to wait for a specific `Fn` to finish, | 
|  | include a callback function in the closure, | 
|  | and call the function at the end of your `Fn`. | 
|  |  | 
|  | If you want to wait for all `Fn`s | 
|  | that involve (use or mutate) a certain variable to finish, | 
|  | use the `WaitForVar(var)` API. | 
|  |  | 
|  | If you want to wait for all pushed `Fn`s to finish, | 
|  | use the `WaitForAll()` API. | 
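
A short sketch, continuing the illustrative names above:

```c++
engine->WaitForVar(weight);  // block until every Fn that uses or mutates `weight` is done
engine->WaitForAll();        // block until everything pushed so far is done
```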
|  |  | 
|  | ### Save Object Creation Cost | 
|  |  | 
In some cases, you need to push the same kind of function to the engine repeatedly over a long period of time.
If the computation of these functions is light,
the overhead of copying lambdas and creating use/mutate variable lists becomes relatively high.
|  | We provide an API to create an `OprHandle` beforehand: | 
|  |  | 
|  | ```c++ | 
virtual OprHandle NewOperator(AsyncFn fn,
                              std::vector<VarHandle> const& const_vars,
                              std::vector<VarHandle> const& mutate_vars) = 0;
|  | ``` | 
You can then keep pushing the `OprHandle` without repeatedly creating it:
|  |  | 
|  | ```c++ | 
|  | virtual void Push(OprHandle op, Context exec_ctx) = 0; | 
|  | ``` | 
|  | To delete it, call the `DeleteOperator(OprHandle op)` API. | 
|  | Ensure that the operator has finished computing before calling this API. | 
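
A hedged usage sketch of these APIs (variable and loop names are illustrative):

```c++
// Create the operator once, push it many times, then clean it up.
OprHandle op = engine->NewOperator(async_fn, {data_var}, {weight_var});
for (int i = 0; i < num_iters; ++i) {
  engine->Push(op, Context::GPU(0));
}
engine->WaitForAll();        // make sure all pushed operations have finished
engine->DeleteOperator(op);  // safe to delete only after completion
```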
|  |  | 
|  |  | 
|  | ### API Reference | 
|  |  | 
|  | ```eval_rst | 
|  | .. doxygenclass:: mxnet::Engine | 
|  | :members: | 
|  | ``` | 
|  |  | 
|  | ## Operators in MXNet | 
|  |  | 
|  | In MXNet, an operator is a class that contains both actual computation logic | 
|  | and auxiliary information that can aid the system in performing optimizations, | 
|  | like in-place updates and auto-derivatives. | 
|  | To understand the remainder of the document, | 
|  | we recommend that you familiarize yourself with the `mshadow` library, | 
|  | because all operators compute on the tensor-like structure `mshadow::TBlob` | 
|  | provided by the system during runtime. | 
|  |  | 
|  | MXNet's operator interface allows you to: | 
|  |  | 
|  | * Reduce memory allocation cost by specifying in-place updates. | 
|  | * Hide some internal arguments from Python to make it cleaner. | 
|  | * Define the relationships among input tensors and output tensors, | 
|  | which allows the system to perform shape checking for you. | 
|  | * Acquire additional temporary spaces from the system | 
|  | to perform computation (e.g., calling `cudnn` routines). | 
|  |  | 
|  | ### Operator Interface | 
|  |  | 
|  | `Forward` is the core operator interface: | 
|  |  | 
|  | ```c++ | 
virtual void Forward(const OpContext &ctx,
                     const std::vector<TBlob> &in_data,
                     const std::vector<OpReqType> &req,
                     const std::vector<TBlob> &out_data,
                     const std::vector<TBlob> &aux_states) = 0;
|  | ``` | 
|  | The `OpContext` structure is: | 
|  |  | 
|  | ```c++ | 
struct OpContext {
  int is_train;
  RunContext run_ctx;
  std::vector<Resource> requested;
};
|  | ``` | 
|  | It describes whether the operator is in the train or test phase, | 
|  | which device the operator should be run on (in `run_ctx`), | 
|  | and requested resources (covered in the following sections). | 
|  |  | 
|  | - `in_data` and `out_data` represent the input and output tensors, respectively. | 
|  | All of the tensor spaces have been allocated by the system. | 
- `req` denotes how the computation results are written into the `out_data`.
In other words, `req.size() == out_data.size()` and `req[i]`
corresponds to the write type of `out_data[i]`.
|  |  | 
|  | - The `OpReqType` is defined as: | 
|  |  | 
|  | ```c++ | 
enum OpReqType {
  kNullOp,
  kWriteTo,
  kWriteInplace,
  kAddTo
};
|  | ``` | 
Normally, the types of all `out_data` should be `kWriteTo`,
meaning that the provided `out_data` tensor is a *raw* memory block,
so the operator should write results directly into it.
In some cases, for example when calculating the `gradient` tensor,
it is better to accumulate the result
rather than directly overwrite the tensor contents,
so that no extra space needs to be allocated each time.
In such a case, the corresponding `req` type is set to `kAddTo`,
indicating that a `+=` should be performed (see the sketch after this list).
|  |  | 
- `aux_states` is intentionally designed for auxiliary tensors used to help computation. Currently, it is unused.
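
To make the `req` handling concrete, here is a minimal sketch of a hypothetical CPU-only element-wise copy operator. It is not taken from the MXNet source, and it assumes the `mshadow_op` expression mappers discussed later in this document (an `identity` mapper is assumed here):

```c++
class ElementWiseCopyOp : public Operator {
 public:
  void Forward(const OpContext &ctx,
               const std::vector<TBlob> &in_data,
               const std::vector<OpReqType> &req,
               const std::vector<TBlob> &out_data,
               const std::vector<TBlob> &aux_states) override {
    using namespace mshadow;
    using namespace mshadow::expr;
    Stream<cpu> *s = ctx.run_ctx.get_stream<cpu>();
    Tensor<cpu, 2> in = in_data[0].FlatTo2D<cpu, real_t>(s);
    Tensor<cpu, 2> out = out_data[0].FlatTo2D<cpu, real_t>(s);
    if (req[0] == kNullOp) return;          // nothing requested for this output
    if (req[0] == kAddTo) {
      out += F<mshadow_op::identity>(in);   // accumulate into the existing contents
    } else {
      out = F<mshadow_op::identity>(in);    // kWriteTo / kWriteInplace: overwrite the block
    }
  }
};
```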
|  |  | 
|  | Aside from the `Forward` operator, you could optionally implement the `Backward` interface: | 
|  |  | 
|  | ```c++ | 
virtual void Backward(const OpContext &ctx,
                      const std::vector<TBlob> &out_grad,
                      const std::vector<TBlob> &in_data,
                      const std::vector<TBlob> &out_data,
                      const std::vector<OpReqType> &req,
                      const std::vector<TBlob> &in_grad,
                      const std::vector<TBlob> &aux_states);
|  | ``` | 
|  | This interface follows the same design principle as the `Forward` interface, | 
|  | except that `out_grad`, `in_data`, and `out_data` are given, | 
|  | and the operator computes `in_grad` as the results. | 
The naming strategy is similar to Torch's convention,
and can be summarized in the following figure:
|  |  | 
|  | [input/output semantics figure] | 
|  |  | 
|  | Some operators might not require all of the following: | 
|  | `out_grad`, `in_data` and `out_data`. | 
|  | You can specify these dependencies with the `DeclareBackwardDependency` interface in `OperatorProperty`. | 
|  |  | 
|  | ### Operator Property | 
|  |  | 
|  | One convolution might have several implementations, | 
|  | and you might want to switch among them to achieve the best performance. | 
|  | Therefore, we separate the operator *semantic* interfaces | 
|  | from the implementation interface (`Operator` class) | 
|  | into the `OperatorProperty` class. | 
|  | The `OperatorProperty` interface consists of: | 
|  |  | 
|  | * **InferShape:** | 
|  |  | 
|  | ```c++ | 
virtual bool InferShape(std::vector<TShape> *in_shape,
                        std::vector<TShape> *out_shape,
                        std::vector<TShape> *aux_shape) const = 0;
|  | ``` | 
|  |  | 
This interface has two purposes:
* Tell the system the size of each input and output tensor,
so it can allocate space for them before the `Forward` and `Backward` calls.
* Perform a size check to make sure that there isn't an obvious error before running.
The shapes in `in_shape` are set by the system
(from the `out_shape` of the previous operators).
The function returns `false` when there is not enough information
to infer the shapes, and throws an error when the shapes are inconsistent.
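
As a hedged sketch, an `InferShape` for a hypothetical one-input, one-output element-wise operator (whose output mirrors the input shape) might look like this:

```c++
bool ElementWiseOpProperty::InferShape(std::vector<TShape> *in_shape,
                                       std::vector<TShape> *out_shape,
                                       std::vector<TShape> *aux_shape) const {
  CHECK_EQ(in_shape->size(), 1);                 // expect exactly one input
  if ((*in_shape)[0].ndim() == 0) return false;  // input shape is not known yet
  out_shape->clear();
  out_shape->push_back((*in_shape)[0]);          // output has the same shape as the input
  return true;
}
```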
|  |  | 
* **Request Resources:** Operations like `cudnnConvolutionForward` need a work space for computation.
If the system manages the work space, it can then perform optimizations,
such as reusing the space across operators.
MXNet defines two interfaces to achieve this:
|  |  | 
|  | ```c++ | 
virtual std::vector<ResourceRequest> ForwardResource(
    const std::vector<TShape> &in_shape) const;
virtual std::vector<ResourceRequest> BackwardResource(
    const std::vector<TShape> &in_shape) const;
|  | ``` | 
|  | The `ResourceRequest` structure (in `resource.h`) currently contains only a type flag: | 
|  |  | 
|  | ```c++ | 
struct ResourceRequest {
  enum Type {
    kRandom,  // get a mshadow::Random<xpu> object
    kTempSpace,  // request temporary space
  };
  Type type;
};
|  | ``` | 
|  | If `ForwardResource` and `BackwardResource` return non-empty arrays, | 
|  | the system offers the corresponding resources through the `ctx` parameter | 
|  | in the `Forward` and `Backward` interface of `Operator`. | 
|  | Basically, to access those resources, simply write: | 
|  |  | 
|  | ```c++ | 
|  | auto tmp_space_res = ctx.requested[kTempSpace].get_space(some_shape, some_stream); | 
|  | auto rand_res = ctx.requested[kRandom].get_random(some_stream); | 
|  | ``` | 
|  | For an example, see `src/operator/cudnn_convolution-inl.h`. | 
|  |  | 
|  | * **Backward dependency:** Let's look at two different operator signatures | 
|  | (we name all of the arguments for demonstration purposes): | 
|  |  | 
|  | ```c++ | 
|  | void FullyConnectedForward(TBlob weight, TBlob in_data, TBlob out_data); | 
|  | void FullyConnectedBackward(TBlob weight, TBlob in_data, TBlob out_grad, TBlob in_grad); | 
|  |  | 
|  | void PoolingForward(TBlob in_data, TBlob out_data); | 
|  | void PoolingBackward(TBlob in_data, TBlob out_data, TBlob out_grad, TBlob in_grad); | 
|  | ``` | 
|  | Note that `out_data` in `FullyConnectedForward` | 
|  | is not used by `FullyConnectedBackward`, | 
|  | while `PoolingBackward` requires all of the arguments of `PoolingForward`. | 
|  | Therefore, for `FullyConnectedForward`, | 
|  | the `out_data` tensor once consumed could be safely freed | 
|  | because the backward function will not need it. | 
|  | This provides a chance for the system to collect some tensors | 
|  | as garbage as soon as possible. | 
|  | To specify this situation, we provide an interface: | 
|  |  | 
|  | ```c++ | 
virtual std::vector<int> DeclareBackwardDependency(
    const std::vector<int> &out_grad,
    const std::vector<int> &in_data,
    const std::vector<int> &out_data) const;
|  | ``` | 
|  | The `int` element of the argument vector is an ID | 
|  | to distinguish different arrays. | 
|  | Let's see how this interface specifies different dependencies | 
|  | for `FullyConnected` and `Pooling`: | 
|  |  | 
|  | ```c++ | 
std::vector<int> FullyConnectedProperty::DeclareBackwardDependency(
    const std::vector<int> &out_grad,
    const std::vector<int> &in_data,
    const std::vector<int> &out_data) const {
  return {out_grad[0], in_data[0]};  // NOTE: out_data[0] is NOT included
}
std::vector<int> PoolingProperty::DeclareBackwardDependency(
    const std::vector<int> &out_grad,
    const std::vector<int> &in_data,
    const std::vector<int> &out_data) const {
  return {out_grad[0], in_data[0], out_data[0]};
}
|  | ``` | 
|  |  | 
* **In-place Option:** To further save the cost of memory allocation,
you can use in-place updates.
They are appropriate for element-wise operations
when the input tensor and output tensor have the same shape.
You specify an in-place update with the following interface:
|  |  | 
|  | ```c++ | 
std::vector<std::pair<int, void*>> ElewiseOpProperty::ForwardInplaceOption(
    const std::vector<int> &in_data,
    const std::vector<void*> &out_data) const {
  return { {in_data[0], out_data[0]} };
}
std::vector<std::pair<int, void*>> ElewiseOpProperty::BackwardInplaceOption(
    const std::vector<int> &out_grad,
    const std::vector<int> &in_data,
    const std::vector<int> &out_data,
    const std::vector<void*> &in_grad) const {
  return { {out_grad[0], in_grad[0]} };
}
|  | ``` | 
|  | This tells the system that the `in_data[0]` and `out_data[0]` tensors could share the same memory spaces during `Forward`, and so do `out_grad[0]` and `in_grad[0]` during `Backward`. | 
|  |  | 
|  | >**Important:** Even if you use the preceding specification, it's *not* guaranteed that the input and output tensors will share the same space. In fact, this is only a suggestion for the system, which makes the final decision. However, in either case, the decision is completely transparent to you, so the actual `Forward` and `Backward` implementation does not need to consider that. | 
|  |  | 
* **Expose Operator to Python:** Because of the restrictions of C++, you need to implement the following interfaces:
|  |  | 
|  | ```c++ | 
// initialize the property class from a list of key-value string pairs
|  | virtual void Init(const vector<pair<string, string>> &kwargs) = 0; | 
|  | // return the parameters in a key-value string map | 
|  | virtual map<string, string> GetParams() const = 0; | 
|  | // return the name of arguments (for generating signature in python) | 
|  | virtual vector<string> ListArguments() const; | 
|  | // return the name of output values | 
|  | virtual vector<string> ListOutputs() const; | 
|  | // return the name of auxiliary states | 
|  | virtual vector<string> ListAuxiliaryStates() const; | 
|  | // return the number of output values | 
|  | virtual int NumOutputs() const; | 
|  | // return the number of visible outputs | 
|  | virtual int NumVisibleOutputs() const; | 
|  | ``` | 
|  |  | 
|  | ### Create an Operator from the Operator Property | 
|  |  | 
|  | `OperatorProperty` includes all *semantic* attributes of an operation. It's also responsible for creating the `Operator` pointer for actual computation. | 
|  |  | 
|  | #### Create Operator | 
|  | Implement the following interface in `OperatorProperty`: | 
|  |  | 
|  | ```c++ | 
|  | virtual Operator* CreateOperator(Context ctx) const = 0; | 
|  | ``` | 
|  | For example: | 
|  |  | 
|  | ```c++ | 
class ConvolutionOp {
 public:
  void Forward( ... ) { ... }
  void Backward( ... ) { ... }
};
class ConvolutionOpProperty : public OperatorProperty {
 public:
  Operator* CreateOperator(Context ctx) const {
    return new ConvolutionOp;
  }
};
|  | ``` | 
|  |  | 
|  | #### Parametrize Operator | 
|  | When implementing a convolution operator, you need to know the kernel size, | 
|  | the stride size, padding size, and so on. | 
|  | These parameters should be passed to the operator | 
|  | before any `Forward` or `Backward` interface is called. | 
|  | To do so, you could define a `ConvolutionParam` structure, as follows: | 
|  |  | 
|  | ```c++ | 
|  | #include <dmlc/parameter.h> | 
struct ConvolutionParam : public dmlc::Parameter<ConvolutionParam> {
  TShape kernel, stride, pad;
  uint32_t num_filter, num_group, workspace;
  bool no_bias;
};
|  | ``` | 
|  | Put it in `ConvolutionOpProperty`, and pass it to the operator class during construction: | 
|  |  | 
|  | ```c++ | 
class ConvolutionOp {
 public:
  ConvolutionOp(ConvolutionParam p): param_(p) {}
  void Forward( ... ) { ... }
  void Backward( ... ) { ... }
 private:
  ConvolutionParam param_;
};
class ConvolutionOpProperty : public OperatorProperty {
 public:
  void Init(const vector<pair<string, string>>& kwargs) {
    // initialize param_ using kwargs
  }
  Operator* CreateOperator(Context ctx) const {
    return new ConvolutionOp(param_);
  }
 private:
  ConvolutionParam param_;
};
|  | ``` | 
|  |  | 
|  | #### Register the Operator Property Class and the Parameter Class to MXNet | 
|  | Use the following macros to register the parameter structure and the operator property class to MXNet: | 
|  |  | 
|  | ```c++ | 
|  | DMLC_REGISTER_PARAMETER(ConvolutionParam); | 
|  | MXNET_REGISTER_OP_PROPERTY(Convolution, ConvolutionOpProperty); | 
|  | ``` | 
The first argument is the name string; the second is the property class name.
|  |  | 
|  | ### Interface Summary | 
|  |  | 
|  | We've almost covered the entire interface required to define a new operator. Let's do a recap: | 
|  |  | 
|  | * Use the `Operator` interface to write your computation logic (`Forward` and `Backward`). | 
|  | * Use the `OperatorProperty` interface to: | 
|  | - Pass the parameter to the operator class (you can use the `Init` interface). | 
|  | - Create an operator using the `CreateOperator` interface. | 
|  | - Correctly implement the operator description interface, such as the names of arguments, etc. | 
|  | - Correctly implement the `InferShape` interface to set the output tensor shape. | 
|  | - [Optional] If additional resources are needed, check `ForwardResource` and `BackwardResource`. | 
|  | - [Optional] If `Backward` doesn't need all of the input and output of `Forward`, check `DeclareBackwardDependency`. | 
|  | - [Optional] If in-place update is supported, check `ForwardInplaceOption` and `BackwardInplaceOption`. | 
|  | * Register the `OperatorProperty` class and the parameter class. | 
|  |  | 
|  | ## Unifying the NDArray Operator and Symbolic Operator | 
NDArray operations are similar to symbolic operations,
except that sometimes you can't write in place to the operands
without a complete dependency graph.
However, the logic underlying NDArray and symbolic operations is almost identical.
*SimpleOp*, a new unified operator API,
unifies the different invoking processes
and returns to the fundamental elements of operators.
Because most mathematical operators attend to one or two operands,
and more operands make dependency-related optimization useful,
the unified operator is specifically designed for unary and binary operations.
|  |  | 
|  | Consider the elements of an operation. | 
|  | Ideally, you need only functions and derivatives | 
|  | to describe an operation. | 
|  | Let's restrict that to the space of unary and binary operations. | 
|  | How do we classify all operations to maximize the possibility | 
|  | of in-place write optimization? | 
Note that you can separate functions by the number of operands.
Derivatives are a bit more complex.
To construct a dependency graph, you need to know whether the output value,
the input data, or neither is needed alongside the head gradient.
Gradient functions in the unified API are differentiated
by the types of operands they take for calculation.
|  |  | 
|  | Before you learn more about the SimpleOp interface, | 
|  | we recommend that you review the | 
|  | [mshadow library guide](https://github.com/dmlc/mshadow/tree/master/guide) | 
because calculations will be done in the `mshadow::TBlob` structure.
|  |  | 
|  | In the following example, we'll create an operator | 
|  | functioning as a smooth l1 loss, | 
|  | which is a mixture of l1 loss and l2 loss. The loss itself can be written as: | 
|  |  | 
|  | ``` | 
|  | loss = outside_weight .* f(inside_weight .* (data - label)) | 
|  | grad = outside_weight .* inside_weight .* f'(inside_weight .* (data - label)) | 
|  | ``` | 
`.*` stands for element-wise multiplication, and `f` and `f'` are the smooth l1 loss function and its derivative,
which we assume are available in `mshadow` for now.
|  | At first glance, it's impossible to implement | 
|  | this particular loss as a unary or binary operator. | 
|  | But we have automatic differentiation in symbolic execution. | 
|  | That simplifies the loss to `f` and `f'` directly. | 
|  | This loss is no more complex than a `sin` or an `abs` function, | 
|  | and can certainly be implemented as a unary operator. | 
|  |  | 
|  | ## SimpleOp: The Unified Operator API | 
|  | ### Define Shapes | 
|  | The `mshadow` library requires explicit memory allocation. | 
|  | As a consequence, all data shapes | 
|  | must be provided before any calculation occurs. | 
|  | Before we proceed with defining functions and gradient, | 
|  | let's check input data shape consistency and provide output shape. | 
|  |  | 
|  | ```cpp | 
typedef TShape (*UnaryShapeFunction)(const TShape& src,
                                     const EnvArguments& env);
typedef TShape (*BinaryShapeFunction)(const TShape& lhs,
                                      const TShape& rhs,
                                      const EnvArguments& env);
|  | ``` | 
|  | You can use `mshadow::TShape` to check input data shape and designate output data shape. | 
|  | If you don't define this function, the default output shape is the same as the input shape. | 
In the case of a binary operator, `lhs` and `rhs` are checked to have the same shape by default.
|  |  | 
|  | You can also use shape functions to check if any additional arguments and resources are present. | 
|  | Refer to the additional usages of `EnvArguments` to accomplish this. | 
|  |  | 
Before we start on our smooth l1 loss example, we define an `XPU` macro that resolves to `cpu` or `gpu` in the header
`smooth_l1_unary-inl.h`, so that we can reuse the same code in `smooth_l1_unary.cc` and
`smooth_l1_unary.cu`.
|  |  | 
|  | ```cpp | 
|  | #include <mxnet/operator_util.h> | 
|  | #if defined(__CUDACC__) | 
|  | #define XPU gpu | 
|  | #else | 
|  | #define XPU cpu | 
|  | #endif | 
|  | ``` | 
|  | In our smooth l1 loss example, it's okay to use the default behavior whereby the output has the same shape as the source. | 
|  | Written explicitly, it is: | 
|  |  | 
|  | ```cpp | 
inline TShape SmoothL1Shape_(const TShape& src,
                             const EnvArguments& env) {
  return TShape(src);
}
|  | ``` | 
|  |  | 
|  | ### Define Functions | 
|  | Create a unary or binary function with one output: `mshadow::TBlob`. | 
|  |  | 
|  | ```cpp | 
typedef void (*UnaryFunction)(const TBlob& src,
                              const EnvArguments& env,
                              TBlob* ret,
                              OpReqType req,
                              RunContext ctx);
typedef void (*BinaryFunction)(const TBlob& lhs,
                               const TBlob& rhs,
                               const EnvArguments& env,
                               TBlob* ret,
                               OpReqType req,
                               RunContext ctx);
|  | ``` | 
|  | * Functions are differentiated by the types of input arguments. | 
|  | * `RunContext ctx` contains information needed during runtime for execution. | 
|  |  | 
|  | ```cpp | 
struct RunContext {
  void *stream;  // the stream of the device, can be NULL or Stream<gpu>* in GPU mode
  template<typename xpu> inline mshadow::Stream<xpu>* get_stream();  // get mshadow stream from Context
};
|  | ``` | 
`mshadow::Stream<xpu> *s = ctx.get_stream<xpu>();` is an example of obtaining a stream from `ctx`.
|  | * `OpReqType req` denotes how computation results are written into `ret`. | 
|  |  | 
|  | ```cpp | 
enum OpReqType {
  kNullOp,  // no operation, do not write anything
  kWriteTo,  // write gradient to provided space
  kWriteInplace,  // perform an in-place write
  kAddTo  // add to the provided space
};
|  | ``` | 
|  | A macro is defined in `operator_util.h` for a simplified use of `OpReqType`. | 
|  | `ASSIGN_DISPATCH(out, req, exp)` checks `req` and performs an assignment. | 
|  |  | 
|  | In our smooth l1 loss example, we use `UnaryFunction` to define the function of this operator. | 
|  |  | 
|  | ```cpp | 
template<typename xpu>
void SmoothL1Forward_(const TBlob& src,
                      const EnvArguments& env,
                      TBlob *ret,
                      OpReqType req,
                      RunContext ctx) {
  using namespace mshadow;
  using namespace mshadow::expr;
  mshadow::Stream<xpu> *s = ctx.get_stream<xpu>();
  real_t sigma2 = env.scalar * env.scalar;
  MSHADOW_TYPE_SWITCH(ret->type_flag_, DType, {
    mshadow::Tensor<xpu, 2, DType> out = ret->get<xpu, 2, DType>(s);
    mshadow::Tensor<xpu, 2, DType> in = src.get<xpu, 2, DType>(s);
    ASSIGN_DISPATCH(out, req,
                    F<mshadow_op::smooth_l1_loss>(in, ScalarExp<DType>(sigma2)));
  });
}
|  | ``` | 
|  | After obtaining `mshadow::Stream` from `RunContext`, we get `mshadow::Tensor` from `mshadow::TBlob`. | 
|  | `mshadow::F` is a shortcut to initiate a `mshadow` expression. The macro `MSHADOW_TYPE_SWITCH(type, DType, ...)` | 
|  | handles details on different types, and the macro `ASSIGN_DISPATCH(out, req, exp)` checks `OpReqType` and | 
|  | performs actions accordingly. `sigma2` is a special parameter in this loss, which we will cover later. | 
|  |  | 
|  | ### Define Gradients (Optional) | 
|  | Create a gradient function with various types of inputs. | 
|  |  | 
|  | ```cpp | 
// depending only on out_grad
typedef void (*UnaryGradFunctionT0)(const OutputGrad& out_grad,
                                    const EnvArguments& env,
                                    TBlob* in_grad,
                                    OpReqType req,
                                    RunContext ctx);
// depending on out_value (in addition to out_grad)
typedef void (*UnaryGradFunctionT1)(const OutputGrad& out_grad,
                                    const OutputValue& out_value,
                                    const EnvArguments& env,
                                    TBlob* in_grad,
                                    OpReqType req,
                                    RunContext ctx);
// depending on in_data (in addition to out_grad)
typedef void (*UnaryGradFunctionT2)(const OutputGrad& out_grad,
                                    const Input0& in_data0,
                                    const EnvArguments& env,
                                    TBlob* in_grad,
                                    OpReqType req,
                                    RunContext ctx);
|  | ``` | 
|  | Gradient functions of binary operators have similar structures, except that `Input`, `TBlob`, and `OpReqType` | 
|  | are doubled. | 
|  |  | 
|  | `GradFunctionArgument` | 
|  |  | 
|  | `Input0`, `Input`, `OutputValue`, and `OutputGrad` all share the structure of `GradFunctionArgument`, | 
|  | which is defined as: | 
|  |  | 
|  | ```cpp | 
struct GradFunctionArgument {
  TBlob data;
};
|  | ``` | 
|  |  | 
In our smooth l1 loss example, the gradient is an `f'(x)`
that uses the input for its calculation,
so `UnaryGradFunctionT2` is suitable.
To enable the chain rule of the gradient,
we also need to multiply `out_grad` from the top into the result `in_grad`.
|  |  | 
|  | ```cpp | 
template<typename xpu>
void SmoothL1BackwardUseIn_(const OutputGrad& out_grad,
                            const Input0& in_data0,
                            const EnvArguments& env,
                            TBlob *in_grad,
                            OpReqType req,
                            RunContext ctx) {
  using namespace mshadow;
  using namespace mshadow::expr;
  mshadow::Stream<xpu> *s = ctx.get_stream<xpu>();
  real_t sigma2 = env.scalar * env.scalar;
  MSHADOW_TYPE_SWITCH(in_grad->type_flag_, DType, {
    mshadow::Tensor<xpu, 2, DType> src = in_data0.data.get<xpu, 2, DType>(s);
    mshadow::Tensor<xpu, 2, DType> ograd = out_grad.data.get<xpu, 2, DType>(s);
    mshadow::Tensor<xpu, 2, DType> igrad = in_grad->get<xpu, 2, DType>(s);
    ASSIGN_DISPATCH(igrad, req,
                    ograd * F<mshadow_op::smooth_l1_gradient>(src, ScalarExp<DType>(sigma2)));
  });
}
|  | ``` | 
|  |  | 
|  | ### Register SimpleOp to MXNet | 
After creating the shape, function, and gradient, register them as both an NDArray operator and
a symbolic operator. To simplify this process, use the registration macro defined in `operator_util.h`.
|  |  | 
|  | ```cpp | 
|  | MXNET_REGISTER_SIMPLE_OP(Name, DEV) | 
|  | .set_shape_function(Shape) | 
|  | .set_function(DEV::kDevMask, Function<XPU>, SimpleOpInplaceOption) | 
|  | .set_gradient(DEV::kDevMask, Gradient<XPU>, SimpleOpInplaceOption) | 
|  | .describe("description"); | 
|  | ``` | 
|  | `SimpleOpInplaceOption` is defined as: | 
|  |  | 
|  | ```cpp | 
enum SimpleOpInplaceOption {
  kNoInplace,  // do not allow inplace in arguments
  kInplaceInOut,  // allow inplace in with out (unary)
  kInplaceOutIn,  // allow inplace out_grad with in_grad (unary)
  kInplaceLhsOut,  // allow inplace left operand with out (binary)
  kInplaceOutLhs  // allow inplace out_grad with lhs_grad (binary)
};
|  | ``` | 
|  |  | 
In our example, the gradient relies on the input data, so the forward function can't be written in
place (the input must remain intact for the backward pass). The output gradient has no purpose after the gradient computation, so the gradient can be written in place over it.
|  |  | 
|  | ```cpp | 
|  | MXNET_REGISTER_SIMPLE_OP(smooth_l1, XPU) | 
|  | .set_function(XPU::kDevMask, SmoothL1Forward_<XPU>, kNoInplace) | 
|  | .set_gradient(XPU::kDevMask, SmoothL1BackwardUseIn_<XPU>, kInplaceOutIn) | 
|  | .set_enable_scalar(true) | 
|  | .describe("Calculate Smooth L1 Loss(lhs, scalar)"); | 
|  | ``` | 
Remember from the discussion of shape functions that, without `set_shape_function`, the default behavior forces the inputs
(for a binary operator) to have the same shape and uses that shape for the output. We'll discuss `set_enable_scalar` later.
|  |  | 
|  | ### NDArray Operator Summary | 
|  | * Create a shape function for determining the output shape. | 
|  | * Create a function as the forward routine by choosing a suitable function type. | 
|  | * Create a gradient as the backward routine by choosing a suitable gradient type. | 
|  | * Register the operator using the registration process. | 
|  |  | 
|  | ## Additional Information on SimpleOp | 
|  | ### Using SimpleOp on EnvArguments | 
Some operations might need a scalar as input, such as a gradient scale, a set of keyword arguments
controlling behavior, or a temporary space to speed up calculations. `EnvArguments` provides additional arguments and resources to make calculations more scalable
and efficient.
|  |  | 
|  | ```cpp | 
struct EnvArguments {
  real_t scalar;  // scalar argument, if enabled
  std::vector<std::pair<std::string, std::string> > kwargs;  // keyword arguments
  std::vector<Resource> resource;  // pointer to the resources requested
};
|  | ``` | 
|  |  | 
More registration parameters are required to enable these additional features. To prevent confusion, `scalar` and `kwargs`
can't be present at the same time. To enable `scalar`, use
`set_enable_scalar(bool enable_scalar)` in registration. Then, in forward functions and gradients, the `scalar` can be accessed as `env.scalar` from the function parameter `EnvArguments env`.
|  |  | 
To enable `kwargs`, use `set_enable_kwargs(bool enable_kwargs)` in registration. Then, in forward
functions and gradients, additional arguments are contained in `env.kwargs`, which is defined as
`std::vector<std::pair<std::string, std::string> >`. Use the DMLC parameter structure to
|  | simplify parsing keyword arguments. For more details, see the [guide on parameter structure](https://github.com/dmlc/dmlc-core/blob/master/doc/parameter.md). | 
|  |  | 
|  | Additional resources like `mshadow::Random<xpu>` and temporary memory space can also be requested and | 
|  | accessed from `EnvArguments.resource`. The registration routine is `set_resource_request(ResourceRequest req)` | 
|  | or `set_resource_request(const std::vector<ResourceRequest>)`, where `mxnet::ResourceRequest` is defined as: | 
|  |  | 
|  | ```cpp | 
struct ResourceRequest {
  enum Type {  // Resource type, indicating what the pointer type is
    kRandom,  // mshadow::Random<xpu> object
    kTempSpace  // A dynamic temp space that can be arbitrary size
  };
  Type type;  // type of resources
};
|  | ``` | 
|  | Registration will request the declared resource requests from `mxnet::ResourceManager`, and place resources | 
|  | in `std::vector<Resource> resource` in `EnvArguments`. To access resources, use the following: | 
|  |  | 
|  | ```cpp | 
auto tmp_space_res = env.resource[0].get_space(some_shape, some_stream);
auto rand_res = env.resource[0].get_random(some_stream);
|  | ``` | 
|  | For an example, see `src/operator/loss_binary_op-inl.h`. | 
|  |  | 
|  | In our smooth l1 loss example, a scalar input is needed to mark the turning point of a loss function. Therefore, | 
|  | in the registration process, we use `set_enable_scalar(true)`, and use `env.scalar` in function and gradient | 
|  | declarations. | 
|  |  | 
|  | ### Crafting a Tensor Operation | 
Because computation utilizes the `mshadow` library and we sometimes don't have functions readily available, we
can craft tensor operations in operator implementations. If you define such functions element-wise, you
can implement them as an `mxnet::op::mshadow_op`; for example, `src/operator/mshadow_op.h` contains many of them.
`mshadow_op`s are expression mappers that deal with the scalar case of the desired functions. For details, see the
|  | [mshadow expression API guide](https://github.com/dmlc/mshadow/tree/master/doc). | 
|  |  | 
If an operation can't be done in an element-wise way, like the softmax loss and gradient, then you need to create a new tensor operation. You need to create an `mshadow` function and an `mshadow::cuda`
function directly. For details, see the `mshadow` library. For an example, see `src/operator/roi_pooling.cc`.
|  |  | 
|  | In our smooth l1 loss example, we create two mappers, namely the scalar cases of smooth l1 loss and gradient. | 
|  |  | 
|  | ```cpp | 
namespace mshadow_op {
struct smooth_l1_loss {
  // a is x, b is sigma2
  MSHADOW_XINLINE static real_t Map(real_t a, real_t b) {
    if (a > 1.0f / b) {
      return a - 0.5f / b;
    } else if (a < -1.0f / b) {
      return -a - 0.5f / b;
    } else {
      return 0.5f * a * a * b;
    }
  }
};
}  // namespace mshadow_op
|  | ``` | 
|  | The gradient, which can be found in `src/operator/smooth_l1_unary-inl.h`, is similar. | 
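
For completeness, here is a hedged sketch of the matching gradient mapper; the actual definition in `src/operator/smooth_l1_unary-inl.h` may differ in details:

```cpp
namespace mshadow_op {
struct smooth_l1_gradient {
  // a is x, b is sigma2; the derivative of the smooth_l1_loss mapper above
  MSHADOW_XINLINE static real_t Map(real_t a, real_t b) {
    if (a > 1.0f / b) {
      return 1.0f;
    } else if (a < -1.0f / b) {
      return -1.0f;
    } else {
      return a * b;
    }
  }
};
}  // namespace mshadow_op
```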
|  |  | 
|  | ### Beyond Two Operands | 
|  | The new unified API is designed to fulfill the fundamentals of an operation. For operators with more than two inputs, | 
|  | more than one output, or that need more features, see the original [Operator API](http://mxnet.io/architecture/overview.html#operators-in-mxnet). |