blob: c0bc6dd4964e9ddef49733910e45f0eb8d87a2a3 [file] [log] [blame] [view]
# Hamilton Basics
There are two parts to Hamilton:
1. Hamilton Functions.
Hamilton Functions are what you, the end user write.
2. Hamilton Driver.
Once you've written your functions, you will need to use the Hamilton Driver to build the DAG and orchestrate
execution.
Let's dive deeper into these parts below, but first a word on terminology.
We use the following terms interchangeably, e.g. a ____ in Hamilton is ... :
* column
* variable
* node
* function
That's because we're representing columns as functions, which are parts of a directed acyclic graph. That is
a column is a part of a dataframe. To compute a column we write a function that has input variables. From these functions
we create a DAG and represent each function as a node, linking each input variable by an edge to its respective node.
## Hamilton Functions
Using Hamilton is all about writing functions. From these functions a dataframe is constructed for you at execution time.
A simple (but rather contrived) example of what Hamilton does that adds two numbers is as follows:
```python
def _sum(*vars):
"""Helper function to sum numbers.
This is here to demonstrate that functions starting with _ do not get processed by hamilton.
"""
return sum(vars)
def sum_a_b(a: int, b: int) -> int:
"""Adds a and b together
:param a: The first number to add
:param b: The second number to add
:return: The sum of a and b
"""
return _sum(a,b) # Delegates to a helper function
```
While this looks like a simple python function, there are a few components to note:
1. The function name `sum_a_b` is a globally unique key. In the DAG there can only be one function named `sum_a_b`.
While this is not optimal for functionality reuse, it makes it extremely easy to learn exactly how a node in the DAG is generated,
and separate out that logic for debugging/iterating.
2. The function `sum_a_b` depends on two upstream nodes -- `a` and `b`. This means that these values must either be:
* Defined by another function
* Passed in by the user as a configuration variable (see `Hamilton Driver Code` below)
3. The function `sum_a_b` makes full use of the python type-hint system. This is required in Hamilton,
as it allows us to type-check the inputs and outputs to match with upstream producers and downstream consumers. In this case,
we know that the input `a` has to be an integer, the input `b` has to also be an integer, and anything that declares `sum_a_b` as an input
has to declare it as an integer.
4. Standard python documentation is a first-class citizen. As we have a 1:1 relationship between python functions and
nodes, each function documentation also describes a piece of business logic.
5. Functions that start with _ are ignored, and not included in the DAG. Hamilton tries to make use of every function
in a module, so this allows us to easily indicate helper functions that won't become part of the DAG.
### Python Types & Hamilton
Hamilton makes use of python's type-hinting feature to check compatibility between function outputs and function inputs. However,
this is not particularly sophisticated, largely due to the lack of available tooling in python. Thus, generic types do not function correctly.
The following will not work:
```python
def some_func() -> Dict[str, int]:
return {1: 2}
```
The following will both work:
```python
def some_func() -> Dict:
return {1: 2}
```
```python
def some_func() -> dict:
return {1: 2}
```
While this is unfortunate, the typing API in python is not yet sophisticated enough to rely on accurate subclass validation.
## Hamilton Driver Code
For documentation on the actual Hamilton Driver code, we invite the reader to [read the Driver class source code](/hamilton/driver.py) directly.
At a high level, the driver code does two things:
1. Create a Directed Acyclic Graph (DAG) from functions you define.
```python
from hamilton import driver
dr = driver.Driver(config, *modules_to_load) # this creates the DAG from the modules you pass in.
```
2. It orchestrates execution given expected output and provided input.
```python
df = dr.execute(final_vars, overrides, display_graph) # this executes the DAG appropriately to create the dataframe.
```
The driver object also has a few other methods, e.g. `display_all_functions()`, `list_available_variables()`, but they're
really only used for debugging purposes.
Let's dive into the driver constructor call, and the execute method.
### Constructor Call to Driver()
The constructor call is pretty simple. Each constructor call sets up a DAG for execution given some configuration.
So if you want to change something about the DAG, very likely you'll need to create a new Driver() object.
#### config: Dict[str, Any], e.g. Configuration
The configuration is used not just to feed data to the DAG, but also to determine the structure of the DAG.
As such, it is passed in to the constructor, and used during DAG creation. This enables such decorators like @config.when.
Otherwise the contents of the _config_ dictionary should include all the inputs required for whatever final output you
want to create. The configuration dictionary should not be used for overriding what Hamilton will compute.
To do this, use the `override` parameter as part of the `execute()` -- see below.
#### \*modules: ModuleType
This can be any number of modules. We traverse the modules in the order they are provided.
### Driver.execute()
The execute function determines the DAG walk required to get the requisite final variables (aka columns) that you want
in the dataframe. It also ensures that you have provided everything to execute properly.
Once it executes it uses a dictionary to memoize results, so that everything is only computed once. It executes the DAG
via a recursive depth-first-traversal, which leads to the possibility (although highly unlikely) of hitting python
recursion depth errors. If that happens, the culprit is almost always a circular reference in the graph. We suggest
displaying the DAG to verify this.
To help speed up development of new or existing Hamilton Functions, we enable you to _override_ parts of the DAG. What
this means is that before calling `execute()`, you have computed some result that you want to use instead of what Hamilton
would produce. To do so, you just pass in a dictionary of `{'col_name': YOUR_VALUE}` as the overrides argument to the
execute function.
To visualize the DAG that would be executed, pass the flag `display_graph=True` to execute. It will render an image in a pdf format.
# Backstory
For the backstory on Hamilton we invite you to watch ~9 minute lightning talk on it that we gave at the apply conference:
[video](https://www.youtube.com/watch?v=B5Zp_30Knoo), [slides](https://www.slideshare.net/StefanKrawczyk/hamilton-a-micro-framework-for-creating-dataframes).