 ============================
 Using Hamilton in a notebook
 ============================

 A quick five-minute primer on using Hamilton in a notebook environment.

 This tutorial can also be found `published on TDS <https://towardsdatascience.com/how-to-iterate-with-hamilton-in-a-notebook-8ec0f85851ed>`_.

 Step 1 — Install Jupyter & Hamilton
 -----------------------------------

 I assume you already have both installed, but just in case you don’t:

 .. code-block:: bash

     pip install notebook
     pip install sf-hamilton

 Then start the notebook server:

 .. code-block:: bash

     jupyter notebook

 Step 2 — Set up the files
 -------------------------

 #. Start up your Jupyter notebook.
 #. Go to the directory where you want your notebook and Hamilton function module(s) to live.
 #. Create one or more Python files. To do that, go to “New > text file”, which opens a “file” editor view. Name the file and give it a ``.py`` extension. Once you save it, Jupyter provides Python syntax highlighting. Keep this tab open so you can flip back to it to edit the file. See :ref:`google-colab-help` if this proves burdensome for you.
 #. In another browser tab, start up the notebook that you will use.

 Step 3 — The basic process of iteration
 ---------------------------------------

 At a high level, you will be switching back and forth between your tabs. You will add functions to your Hamilton
 function module, then import (or re-import) that module into your notebook to pick up the changes. From there, you
 use Hamilton as usual to execute things, and the notebook for everything else you normally use notebooks for.

 Let’s walk through an example.

 Here’s a function I added to our Hamilton function module. I named the module ``some_functions.py`` (obviously choose a
 better name for your situation).

 .. code-block:: python

     import pandas as pd

     def avg_3wk_spend(spend: pd.Series) -> pd.Series:
         """Rolling 3 week average spend."""
         print("foo") # will use this to prove it reloaded!
         return spend.rolling(3).mean()

 And here’s what I set up in my notebook to be able to use Hamilton and import this module:

 Cell 1: This just imports the base things we need; see the pro-tip at the bottom of this page for how to automatically reload changes.

 .. code-block:: python

     import importlib
     import pandas as pd
     from hamilton import driver

 Cell 2: Import your Hamilton function module(s)

 .. code-block:: python

     # import your hamilton function module(s) here
     import some_functions

 Cell 3: Run this cell anytime you make and save changes to ``some_functions.py``

 .. code-block:: python

     # use this to reload the module after making changes to it.
     importlib.reload(some_functions)

 This reloads the module, so the notebook uses the latest saved version of your code.
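
 The effect of ``importlib.reload`` can be seen in isolation with a throwaway module. The sketch below is only an
 illustration using the standard library: the module name ``demo_functions`` and the function ``greeting`` are made up
 for this example and are not part of Hamilton.

```python
import importlib
import pathlib
import sys
import tempfile

# Avoid writing bytecode caches, so the demo only depends on the source file.
sys.dont_write_bytecode = True

# Write a throwaway module (hypothetical name) into a temporary directory.
tmp_dir = tempfile.mkdtemp()
module_path = pathlib.Path(tmp_dir) / "demo_functions.py"
module_path.write_text("def greeting():\n    return 'foo'\n")

# Make the directory importable and import the module once.
sys.path.insert(0, tmp_dir)
import demo_functions

first = demo_functions.greeting()

# Edit the file on disk, as you would in the other browser tab.
module_path.write_text("def greeting():\n    return 'foo-bar'\n")

# Without a reload, the already-imported module still runs the old code.
stale = demo_functions.greeting()

# After a reload, the module object reflects the saved changes.
importlib.invalidate_caches()
importlib.reload(demo_functions)
fresh = demo_functions.greeting()

print(first, stale, fresh)
```

 Note that ``stale`` still holds the old value: saving the file is not enough on its own, which is why Cell 3 needs to
 be re-run after every change.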

 Cell 4: Use Hamilton

 .. code-block:: python

     config = {}
     dr = driver.Driver(config, some_functions)
     input_data = {'spend': pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])}
     df = dr.execute(['avg_3wk_spend'], inputs=input_data)

 You should see ``foo`` printed as an output after running this cell.
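
 To make the resulting numbers concrete: ``spend.rolling(3).mean()`` averages each value with the two values before it,
 and with pandas' default settings the first two positions are ``NaN`` because a full window is not yet available. Here
 is a plain-Python sketch of the same arithmetic; ``rolling_mean`` is a hypothetical helper written for illustration,
 not a pandas or Hamilton function, and it is not how pandas implements rolling windows.

```python
import math


def rolling_mean(values, window=3):
    """Mean over a trailing window; NaN until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough history yet for a full window.
            out.append(math.nan)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Same input values as the ``spend`` series in Cell 4.
result = rolling_mean([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(result)
```

 For this input the result is two ``NaN`` entries followed by ``1.0`` through ``9.0``.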

 Okay, so let’s now say we’re iterating on our Hamilton functions. Go to your Hamilton function module
 (``some_functions.py`` in this example) in your other browser tab, and change the ``print("foo")`` to something else,
 e.g. ``print("foo-bar")``. Save the file — it should look something like this:

 .. code-block:: python

     def avg_3wk_spend(spend: pd.Series) -> pd.Series:
         """Rolling 3 week average spend."""
         print("foo-bar")
         return spend.rolling(3).mean()

 Go back to your notebook, and re-run Cell 3 & Cell 4. You should now see a different output printed, e.g. ``foo-bar``.

 Congratulations! You just managed to iterate on Hamilton using a Jupyter notebook!

 **To summarize**, this is how things ended up looking on my end:

 * Here’s what my ``some_functions.py`` file looks like:

 .. image:: https://miro.medium.com/max/500/1\*iwbLF1dzfyX2ZxJqV7a\_YQ.png

 * Here’s what my notebook looks like:

 .. image:: https://miro.medium.com/max/680/1\*xNtsl3KtWdRjM6FbuaPr2w.png

 .. _google-colab-help:

 Help: I am using Google Colab and I can't do the above
 ------------------------------------------------------

 Since the ``1.8.0`` release, you can define functions inline in your notebook and pass them to your driver to
 build a DAG. *We strongly recommend only using this approach when absolutely necessary*, because it’s very easy to
 build spaghetti code this way.

 For example, say we want to add a function that computes the logarithm of ``avg_3wk_spend`` without adding it to
 ``some_functions.py``. We can do the following steps directly in our notebook:

 .. code-block:: python

     # Step 1 - define function
     import numpy as np

     def log_avg_3wk_spend(avg_3wk_spend: pd.Series) -> pd.Series:
         """Simple function taking the logarithm of the 3 week average spend."""
         return np.log(avg_3wk_spend)

 We then have to create a "temporary python module" to house it in. We do this by importing ``ad_hoc_utils`` and then
 calling the ``create_temporary_module`` function, passing in the functions we want, and providing a name for the module
 we're creating.

 .. code-block:: python

     # Step 2 - create a temporary module to house all notebook functions
     from hamilton import ad_hoc_utils
     temp_module = ad_hoc_utils.create_temporary_module(
          log_avg_3wk_spend, module_name='function_example')

 You can now treat ``temp_module`` like a python module and pass it to your driver and use Hamilton like normal:

 .. code-block:: python

     # Step 3 - add the module to the driver and continue as usual
     dr = driver.Driver(config, some_functions, temp_module)
     df = dr.execute(['avg_3wk_spend', 'log_avg_3wk_spend'], inputs=input_data)
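
 For intuition about what a "temporary python module" is, the idea can be sketched with the standard library alone. This
 is an illustration only and makes no claim about how ``ad_hoc_utils.create_temporary_module`` is implemented;
 ``make_module`` and ``double_spend`` are hypothetical names invented for this example.

```python
import inspect
import types


def make_module(*functions, module_name="scratch_module"):
    """Hypothetical helper: bundle functions into a fresh module object."""
    module = types.ModuleType(module_name)
    for fn in functions:
        # Attach each function under its own name, as if it were defined in a file.
        setattr(module, fn.__name__, fn)
    return module


def double_spend(spend: float) -> float:
    """Toy function standing in for a notebook-defined function."""
    return spend * 2


scratch = make_module(double_spend, module_name="function_example")

# The functions can now be looked up on the module object by name.
found = [name for name, _ in inspect.getmembers(scratch, inspect.isfunction)]
print(scratch.__name__, found)
```

 The resulting object behaves like an imported module for attribute lookup, but there is no ``.py`` file behind it.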

 Caveat with this approach:
 ##########################

 Using a "temporary python module" will not enable scaling of computation by using Ray, Dask, or Pandas on Spark, so we
 suggest using this approach for development purposes only.

 Pro-tip: You can import functions directly
 ------------------------------------------

 The nice thing about forcing Hamilton functions into a module is that they are very easy to re-use in another context,
 e.g. in another notebook, or by calling them directly.

 For example, it is easy to directly use the functions in the notebook, like so:

 .. code-block:: python

     some_functions.avg_3wk_spend(pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))

 This calls the ``avg_3wk_spend`` function we defined in the ``some_functions.py`` module.

 Pro-tip: You can use ipython magic to autoreload code
 -----------------------------------------------------

 Open a Python module and a Jupyter notebook side by side, and then add the
 `%autoreload ipython magic <https://ipython.org/ipython-doc/3/config/extensions/autoreload.html>`_ to the notebook to
 automatically reload the module:

 .. code-block:: python

     from hamilton.driver import Driver
     import my_module  # data transformation module that I have open in other tab

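     # Comments go on their own lines: IPython passes the rest of a magic line to the magic.
     # load the autoreload extension
     %load_ext autoreload
     # configure autoreload to only affect modules marked with %aimport
     %autoreload 1
     # mark my_module to be reloaded
     %aimport my_module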

     hamilton_driver = Driver({}, my_module)
     hamilton_driver.execute(['desired_output1', 'desired_output2'])

 You would then work as follows:

 #. Write your data transformation in the open Python module.
 #. In the notebook, instantiate a Hamilton driver and test the DAG with a small subset of data.
 #. Because of ``%autoreload``, the module is reimported with the latest changes each time the Hamilton DAG is executed. This approach prevents out-of-order notebook executions, and functions always reside in clean ``.py`` files.

 Credit: `Thierry Jean's blog post <https://medium.com/@thijean/the-perks-of-creating-dataflows-with-hamilton-36e8c56dd2a>`_.