basics.md - hamilton - Git at Google

 # Hamilton Basics

 There are two parts to Hamilton:

 1. Hamilton Functions.

    Hamilton Functions are what you, the end user write.

 2. Hamilton Driver.

    Once you've written your functions, you will need to use the Hamilton Driver to build the DAG and orchestrate
    execution.

 Let's dive deeper into these parts below, but first a word on terminology.

 We use the following terms interchangeably, e.g. a ____ in Hamilton is ... :

 * column
 * variable
 * node
 * function

 That's because we're representing columns as functions, which are parts of a directed acyclic graph. That is
  a column is a part of a dataframe. To compute a column we write a function that has input variables. From these functions
 we create a DAG and represent each function as a node, linking each input variable by an edge to its respective node.

 ## Hamilton Functions
 Using Hamilton is all about writing functions. From these functions a dataframe is constructed for you at execution time.

 A simple (but rather contrived) example of what Hamilton does that adds two numbers is as follows:

 ```python
 def _sum(*vars):
     """Helper function to sum numbers.
     This is here to demonstrate that functions starting with _ do not get processed by hamilton.
     """
     return sum(vars)

 def sum_a_b(a: int, b: int) -> int:
     """Adds a and b together
     :param a: The first number to add
     :param b: The second number to add
     :return: The sum of a and b
     """
     return _sum(a,b) # Delegates to a helper function
 ```

 While this looks like a simple python function, there are a few components to note:
 1. The function name `sum_a_b` is a globally unique key. In the DAG there can only be one function named `sum_a_b`.
    While this is not optimal for functionality reuse, it makes it extremely easy to learn exactly how a node in the DAG is generated,
    and separate out that logic for debugging/iterating.
 2. The function `sum_a_b` depends on two upstream nodes -- `a` and `b`. This means that these values must either be:
     * Defined by another function
     * Passed in by the user as a configuration variable (see `Hamilton Driver Code` below)
 3. The function `sum_a_b` makes full use of the python type-hint system. This is required in Hamilton,
    as it allows us to type-check the inputs and outputs to match with upstream producers and downstream consumers. In this case,
    we know that the input `a` has to be an integer, the input `b` has to also be an integer, and anything that declares `sum_a_b` as an input
    has to declare it as an integer.
 4. Standard python documentation is a first-class citizen. As we have a 1:1 relationship between python functions and
    nodes, each function documentation also describes a piece of business logic.
 5. Functions that start with _ are ignored, and not included in the DAG. Hamilton tries to make use of every function
    in a module, so this allows us to easily indicate helper functions that won't become part of the DAG.


 ### Python Types & Hamilton

 Hamilton makes use of python's type-hinting feature to check compatibility between function outputs and function inputs. However,
 this is not particularly sophisticated, largely due to the lack of available tooling in python. Thus, generic types do not function correctly.
 The following will not work:

 ```python
 def some_func() -> Dict[str, int]:
     return {1: 2}
 ```

 The following will both work:
 ```python
 def some_func() -> Dict:
     return {1: 2}
 ```

 ```python
 def some_func() -> dict:
     return {1: 2}
 ```

 While this is unfortunate, the typing API in python is not yet sophisticated enough to rely on accurate subclass validation.

 ## Hamilton Driver Code
 For documentation on the actual Hamilton Driver code, we invite the reader to [read the Driver class source code](/hamilton/driver.py) directly.

 At a high level, the driver code does two things:

 1. Create a Directed Acyclic Graph (DAG) from functions you define.
    ```python
    from hamilton import driver
    dr = driver.Driver(config, *modules_to_load)  # this creates the DAG from the modules you pass in.
    ```
 2. It orchestrates execution given expected output and provided input.
    ```python
    df = dr.execute(final_vars, overrides, display_graph)  # this executes the DAG appropriately to create the dataframe.
    ```

 The driver object also has a few other methods, e.g. `display_all_functions()`, `list_available_variables()`, but they're
 really only used for debugging purposes.

 Let's dive into the driver constructor call, and the execute method.

 ### Constructor Call to Driver()
 The constructor call is pretty simple. Each constructor call sets up a DAG for execution given some configuration.
 So if you want to change something about the DAG, very likely you'll need to create a new Driver() object.

 #### config: Dict[str, Any], e.g. Configuration
 The configuration is used not just to feed data to the DAG, but also to determine the structure of the DAG.
 As such, it is passed in to the constructor, and used during DAG creation. This enables such decorators like @config.when.

 Otherwise the contents of the _config_ dictionary should include all the inputs required for whatever final output you
 want to create. The configuration dictionary should not be used for overriding what Hamilton will compute.
 To do this, use the `override` parameter as part of the `execute()` -- see below.

 #### \*modules: ModuleType
 This can be any number of modules. We traverse the modules in the order they are provided.

 ### Driver.execute()
 The execute function determines the DAG walk required to get the requisite final variables (aka columns) that you want
 in the dataframe. It also ensures that you have provided everything to execute properly.

 Once it executes it uses a dictionary to memoize results, so that everything is only computed once. It executes the DAG
 via a recursive depth-first-traversal, which leads to the possibility (although highly unlikely) of hitting python
 recursion depth errors. If that happens, the culprit is almost always a circular reference in the graph. We suggest
 displaying the DAG to verify this.

 To help speed up development of new or existing Hamilton Functions, we enable you to _override_ parts of the DAG. What
 this means is that before calling `execute()`, you have computed some result that you want to use instead of what Hamilton
 would produce. To do so, you just pass in a dictionary of `{'col_name': YOUR_VALUE}` as the overrides argument to the
 execute function.

 To visualize the DAG that would be executed, pass the flag `display_graph=True` to execute. It will render an image in a pdf format.

 # Backstory
 For the backstory on Hamilton we invite you to watch ~9 minute lightning talk on it that we gave at the apply conference:
 [video](https://www.youtube.com/watch?v=B5Zp_30Knoo), [slides](https://www.slideshare.net/StefanKrawczyk/hamilton-a-micro-framework-for-creating-dataframes).
	# Hamilton Basics

	There are two parts to Hamilton:

	1. Hamilton Functions.

	Hamilton Functions are what you, the end user write.

	2. Hamilton Driver.

	Once you've written your functions, you will need to use the Hamilton Driver to build the DAG and orchestrate
	execution.

	Let's dive deeper into these parts below, but first a word on terminology.

	We use the following terms interchangeably, e.g. a ____ in Hamilton is ... :

	* column
	* variable
	* node
	* function

	That's because we're representing columns as functions, which are parts of a directed acyclic graph. That is
	a column is a part of a dataframe. To compute a column we write a function that has input variables. From these functions
	we create a DAG and represent each function as a node, linking each input variable by an edge to its respective node.

	## Hamilton Functions
	Using Hamilton is all about writing functions. From these functions a dataframe is constructed for you at execution time.

	A simple (but rather contrived) example of what Hamilton does that adds two numbers is as follows:

	```python
	def _sum(*vars):
	"""Helper function to sum numbers.
	This is here to demonstrate that functions starting with _ do not get processed by hamilton.
	"""
	return sum(vars)

	def sum_a_b(a: int, b: int) -> int:
	"""Adds a and b together
	:param a: The first number to add
	:param b: The second number to add
	:return: The sum of a and b
	"""
	return _sum(a,b) # Delegates to a helper function
	```

	While this looks like a simple python function, there are a few components to note:
	1. The function name `sum_a_b` is a globally unique key. In the DAG there can only be one function named `sum_a_b`.
	While this is not optimal for functionality reuse, it makes it extremely easy to learn exactly how a node in the DAG is generated,
	and separate out that logic for debugging/iterating.
	2. The function `sum_a_b` depends on two upstream nodes -- `a` and `b`. This means that these values must either be:
	* Defined by another function
	* Passed in by the user as a configuration variable (see `Hamilton Driver Code` below)
	3. The function `sum_a_b` makes full use of the python type-hint system. This is required in Hamilton,
	as it allows us to type-check the inputs and outputs to match with upstream producers and downstream consumers. In this case,
	we know that the input `a` has to be an integer, the input `b` has to also be an integer, and anything that declares `sum_a_b` as an input
	has to declare it as an integer.
	4. Standard python documentation is a first-class citizen. As we have a 1:1 relationship between python functions and
	nodes, each function documentation also describes a piece of business logic.
	5. Functions that start with _ are ignored, and not included in the DAG. Hamilton tries to make use of every function
	in a module, so this allows us to easily indicate helper functions that won't become part of the DAG.


	### Python Types & Hamilton

	Hamilton makes use of python's type-hinting feature to check compatibility between function outputs and function inputs. However,
	this is not particularly sophisticated, largely due to the lack of available tooling in python. Thus, generic types do not function correctly.
	The following will not work:

	```python
	def some_func() -> Dict[str, int]:
	return {1: 2}
	```

	The following will both work:
	```python
	def some_func() -> Dict:
	return {1: 2}
	```

	```python
	def some_func() -> dict:
	return {1: 2}
	```

	While this is unfortunate, the typing API in python is not yet sophisticated enough to rely on accurate subclass validation.

	## Hamilton Driver Code
	For documentation on the actual Hamilton Driver code, we invite the reader to [read the Driver class source code](/hamilton/driver.py) directly.

	At a high level, the driver code does two things:

	1. Create a Directed Acyclic Graph (DAG) from functions you define.
	```python
	from hamilton import driver
	dr = driver.Driver(config, *modules_to_load) # this creates the DAG from the modules you pass in.
	```
	2. It orchestrates execution given expected output and provided input.
	```python
	df = dr.execute(final_vars, overrides, display_graph) # this executes the DAG appropriately to create the dataframe.
	```

	The driver object also has a few other methods, e.g. `display_all_functions()`, `list_available_variables()`, but they're
	really only used for debugging purposes.

	Let's dive into the driver constructor call, and the execute method.

	### Constructor Call to Driver()
	The constructor call is pretty simple. Each constructor call sets up a DAG for execution given some configuration.
	So if you want to change something about the DAG, very likely you'll need to create a new Driver() object.

	#### config: Dict[str, Any], e.g. Configuration
	The configuration is used not just to feed data to the DAG, but also to determine the structure of the DAG.
	As such, it is passed in to the constructor, and used during DAG creation. This enables such decorators like @config.when.

	Otherwise the contents of the _config_ dictionary should include all the inputs required for whatever final output you
	want to create. The configuration dictionary should not be used for overriding what Hamilton will compute.
	To do this, use the `override` parameter as part of the `execute()` -- see below.

	#### \*modules: ModuleType
	This can be any number of modules. We traverse the modules in the order they are provided.

	### Driver.execute()
	The execute function determines the DAG walk required to get the requisite final variables (aka columns) that you want
	in the dataframe. It also ensures that you have provided everything to execute properly.

	Once it executes it uses a dictionary to memoize results, so that everything is only computed once. It executes the DAG
	via a recursive depth-first-traversal, which leads to the possibility (although highly unlikely) of hitting python
	recursion depth errors. If that happens, the culprit is almost always a circular reference in the graph. We suggest
	displaying the DAG to verify this.

	To help speed up development of new or existing Hamilton Functions, we enable you to _override_ parts of the DAG. What
	this means is that before calling `execute()`, you have computed some result that you want to use instead of what Hamilton
	would produce. To do so, you just pass in a dictionary of `{'col_name': YOUR_VALUE}` as the overrides argument to the
	execute function.

	To visualize the DAG that would be executed, pass the flag `display_graph=True` to execute. It will render an image in a pdf format.

	# Backstory
	For the backstory on Hamilton we invite you to watch ~9 minute lightning talk on it that we gave at the apply conference:
	[video](https://www.youtube.com/watch?v=B5Zp_30Knoo), [slides](https://www.slideshare.net/StefanKrawczyk/hamilton-a-micro-framework-for-creating-dataframes).