<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements. See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership. The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License. You may obtain a copy of the License at -->
<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied. See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->
# Step 3: Automatic differentiation with autograd
In this step, you learn how to use the MXNet `autograd` package to perform
gradient calculations.
## Basic use
To get started, import the `autograd` package with the following code.
```{.python .input}
from mxnet import np, npx
from mxnet import autograd
npx.set_np()
```
As an example, you can differentiate the function $f(x) = 2 x^2$ with respect to
the parameter $x$. With `autograd`, you start by assigning an initial value to $x$,
as follows:
```{.python .input}
x = np.array([[1, 2], [3, 4]])
x
```
When you compute the gradient of $f(x)$ with respect to $x$, you'll need a place
to store it. In MXNet, you can tell an ndarray that you plan to store a gradient
by invoking its `attach_grad` method, as shown in the following example.
```{.python .input}
x.attach_grad()
```
Next, define the function $y=f(x)$. To let MXNet record the computation of $y$,
so that you can compute gradients later, put the definition inside an
`autograd.record()` scope.
```{.python .input}
with autograd.record():
    y = 2 * x * x
```
You can invoke back propagation (backprop) by calling `y.backward()`. When $y$
has more than one entry, `y.backward()` is equivalent to `y.sum().backward()`.
```{.python .input}
y.backward()
```
Next, verify that the computed gradient matches the expected result. Since
$y=2x^2$, $\frac{dy}{dx} = 4x$, which for this `x` is `[[4, 8],[12, 16]]`. Check the
automatically computed result.
```{.python .input}
x.grad
```
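As a quick sanity check, you can also compare the automatically computed gradient against the analytic result $4x$ elementwise:

```{.python .input}
# Elementwise comparison against the analytic gradient 4 * x
x.grad == 4 * x
```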
Now dive a bit deeper into `y.backward()` by looking at gradients more closely. As
noted earlier, when `y` has more than one entry, `y.backward()` is equivalent to `y.sum().backward()`.
```{.python .input}
with autograd.record():
    y = np.sum(2 * x * x)
y.backward()
x.grad
```
Additionally, you can only run `backward()` once on a recorded computation, unless
you set the flag `retain_graph` to `True`.
```{.python .input}
with autograd.record():
    y = np.sum(2 * x * x)
y.backward(retain_graph=True)
print(x.grad)
print("Since you have retained your previous graph you can run backward again")
y.backward()
print(x.grad)
try:
    y.backward()
except Exception:
    print("However, you can't do backward twice unless you retain the graph.")
```
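If you need gradients again after the graph has been freed, simply record the computation once more; a quick sketch reusing the same `x`:

```{.python .input}
# Recording again builds a fresh graph, so backward works once more
with autograd.record():
    y = np.sum(2 * x * x)
y.backward()
x.grad
```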
## Custom MXNet ndarray operations
To understand the `backward()` method, it is helpful to first understand how you
can create custom operations. Custom MXNet operators are classes with a `forward()`
and a `backward()` method, where the number of arguments to `backward()` must
equal the number of values returned by `forward()`, and the number of values
returned by `backward()` must match the number of inputs to `forward()`. You can
modify the gradients in `backward()` to return custom gradients. For instance, below
you can return a different gradient than the actual derivative.
```{.python .input}
class MyFirstCustomOperation(autograd.Function):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        # Save the inputs so that backward can return them as custom gradients
        self.save_for_backward(x, y)
        return 2 * x, 2 * x * y, 2 * y

    def backward(self, dx, dxy, dy):
        """
        The number of input arguments must match the number of outputs from forward.
        Furthermore, the number of output arguments must match the number of inputs to forward.
        """
        # Return custom gradients (here, the saved inputs) instead of the actual derivatives
        x, y = self.saved_tensors
        return x, y
```
Now you can use your first custom operation.
```{.python .input}
x = np.random.uniform(-1, 1, (2, 3))
y = np.random.uniform(-1, 1, (2, 3))
x.attach_grad()
y.attach_grad()
with autograd.record():
    z = MyFirstCustomOperation()
    z1, z2, z3 = z(x, y)
    out = z1 + z2 + z3
out.backward()
# The custom backward returns x and y as the gradients, so x.grad equals x and y.grad equals y
print(np.array_equiv(x.grad.asnumpy(), x.asnumpy()))
print(np.array_equiv(y.grad.asnumpy(), y.asnumpy()))
```
Alternatively, you may want a function that behaves differently depending on
whether or not you are training. You can check for training mode with `autograd.is_training()`.
```{.python .input}
def my_first_function(x):
    if autograd.is_training():  # Return something else when training
        return 4 * x
    else:
        return x
```
```{.python .input}
y = my_first_function(x)
print(np.array_equiv(y.asnumpy(), x.asnumpy()))
with autograd.record(train_mode=False):
    y = my_first_function(x)
y.backward()
print(x.grad)
with autograd.record(train_mode=True): # train_mode = True by default
    y = my_first_function(x)
y.backward()
print(x.grad)
```
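You can inspect these modes directly with `autograd.is_recording()` and `autograd.is_training()`; a minimal sketch:

```{.python .input}
# Inspect the autograd flags inside and outside a record() scope
print(autograd.is_recording(), autograd.is_training())      # False, False outside record()
with autograd.record():
    print(autograd.is_recording(), autograd.is_training())  # True, True (train_mode defaults to True)
with autograd.record(train_mode=False):
    print(autograd.is_recording(), autograd.is_training())  # True, False
```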
You can also create functions that use `autograd.record()` internally.
```{.python .input}
def my_second_function(x):
    with autograd.record():
        return 2 * x
```
```{.python .input}
y = my_second_function(x)
y.backward()
print(x.grad)
```
You can also combine multiple functions.
```{.python .input}
y = my_second_function(x)
with autograd.record():
    z = my_second_function(y) + 2
z.backward()
print(x.grad)
```
Additionally, MXNet records the execution trace and computes the gradient
accordingly, even when the computation involves Python control flow. The following
function `f` doubles its input until the sum of its absolute values reaches 1000,
then selects one element of the result depending on the sign of its sum.
```{.python .input}
def f(a):
    b = a * 2
    while np.abs(b).sum() < 1000:
        b = b * 2
    if b.sum() >= 0:
        c = b[0]
    else:
        c = b[1]
    return c
```
In this example, you record the trace and feed in a random value.
```{.python .input}
a = np.random.uniform(size=2)
a.attach_grad()
with autograd.record():
    c = f(a)
c.backward()
```
You can see that `b` is a linear function of `a` and that `c` is chosen from `b`.
The gradient with respect to `a` will be either `[c/a[0], 0]` or `[0, c/a[1]]`,
depending on which element of `b` is picked. You can see the results of this
example with this code:
```{.python .input}
a.grad == c / a
```
Notice that the comparison returns `True` for the element of `a` that was selected
through `b` (its gradient is `c/a[i]`) and `False` for the other element, whose gradient is zero.
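To see which branch was taken, you can print the gradient next to the analytic ratio `c / a`; a small sketch:

```{.python .input}
# The selected element has gradient c / a[i]; the other element's gradient is 0
print(a.grad)
print(c / a)
```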
## Advanced MXNet ndarray operations with Autograd
You can control how gradients propagate through different ndarray operations. For
instance, you might want to check that gradients are propagating the way you
expect. Calling `attach_grad()` on an intermediate result such as `y` detaches it
from the computation graph, so gradients no longer flow through `y` back to `x`.
To illustrate this, notice that `x.grad` and `y.grad` are not the same in the
following two examples.
```{.python .input}
with autograd.record():
    y = 3 * x
    y.attach_grad()
    z = 4 * y + 2 * x
z.backward()
print(x.grad)
print(y.grad)
```
This is not the same as the following example, where `y` is not detached:
```{.python .input}
with autograd.record():
    y = 3 * x
    z = 4 * y + 2 * x
z.backward()
print(x.grad)
print(y.grad)
```
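Another way to stop gradients from flowing through part of a computation is to run that part inside an `autograd.pause()` scope, so it is not recorded and is treated as a constant. A minimal sketch that mirrors the example above:

```{.python .input}
with autograd.record():
    y = 3 * x
    with autograd.pause():
        w = 4 * y  # not recorded, so no gradient flows back through y here
    z = w + 2 * x
z.backward()
print(x.grad)  # only the 2 * x term contributes, as in the attach_grad example
```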
## Next steps
Learn how to initialize weights and choose a loss function, metrics, and an optimizer for training your neural network in [Step 4: Necessary components
to train the neural network](./4-components.ipynb).