blob: 71dbe13a8c18065b3e8750cebd67811bbc4007e3 [file] [log] [blame]
=====================
Running Your Code
=====================
Writing functions is great, but its meaningless if you have no way to execute them. We use "drivers" to execute the code.
We currently have a single Hamilton Driver that is responsible for the following:
#. Crawling your python modules to extract functions to turn into nodes in the DAG.
#. Running the DAG.
#. Assembling the results.
#. Enabling you to visualize the DAG.
It is provided as an easy way for the user to specify the data she wants without dealing with the complexities of DAGs,
function graphs, or nodes.
The basic structure of using the Hamilton Driver is:
.. code-block:: python
from hamilton import driver
from hamilton import base
# 1. Setup config. See the Parameterizing the DAG section for usage
config = {}
# 2. we need to tell hamilton where to load function definitions from
module_name = 'my_functions'
module = importlib.import_module(module_name) # or simply "import my_functions"
# 3. Determine the return type -- default is a pandas.DataFrame.
adapter = base.SimplePythonDataFrameGraphAdapter() # See GraphAdapter docs for more details.
# These all feed into creating the driver & thus DAG.
dr = driver.Driver(config, module, adapter=adapter)
# lastly to get something, we need to call execute
result = dr.execute(['desired_output1', 'desired_output2'])
# pip install sf-hamilton[visualization] for this next line to work:
dr.visualize_execution(['desired_output1', 'desired_output2'], './my_file.dot, {})
Note that the stock Hamilton driver is the API interface to use to execute your Hamilton dataflows. Before
diving into how to call/use the driver more, let's cover DAG parameterization.
.. _parameterizing-the-dag:
Parameterizing the DAG
------------------------------
Static dataflows are only so useful. In the real world, we need to be able to configure both the shape of the DAG and
the inputs to the DAG as part of the Hamilton driver. The default Hamilton driver comes with three input types that you
can control. They both take the form ``Dict[str, Any]``, i.e. a dictionary of string keys that maps to any object type.
#. **config** The config is a dictionary of strings to values. This is passed into the constructor of the Hamilton driver, as it is required to create the DAG. It `also` gets passed into the DAG at runtime, so you have access to parameter values. See :doc:`decorators-overview`, as well as the examples below, for how the config can be used.
#. **inputs** The `runtime inputs` to the DAG. These have to be mutually disjoint from the config -- overriding the config does not make sense here, as the DAG has been constructed assuming fixed configs.
#. **overrides** Values to override nodes in a DAG. During execution, nothing upstream of these are computed.
Let's go through some examples that show you how to write a Hamilton function that allows it to be conditionally used
depending on configuration.
**You have a DAG for region and business line**, where the rolling average for marketing spend is computed differently
(see :doc:`../getting-started/index` for the motivating example). In this case, you'll define the DAG as follows:
.. code-block:: python
@config.when(business_line='CA')
def avg_rolling_spend__CA(spend: pd.Series) -> pd.Series:
"""Rolling average of spend in the canada region."""
return spend.rolling(3).mean()
@config.when(business_line='US')
def avg_rolling_spend__US(spend: pd.Series) -> pd.Series:
"""Rolling average of spend in the US region."""
return spend.rolling(2).mean()
When the graph is compiled, the implementation of ``avg_rolling_spend`` varies based off of the configuration value.
You would construct the driver with ``config={'region' : 'US'}``, to get the desired behavior.
**You want to pass in the region/business line to change the behavior or a transform.** Say you have a big dataframe of
marketing spend with columns representing the region, and also want to filter it out for the individual region. You
would define the transform function as follows.
.. code-block:: python
def avg_rolling_spend(spend_by_country: pd.DataFrame, region: str) -> pd.Series:
"""Rolling average of spend in the specified region."""
return spend_by_country[spend_by_country.region==region].spend
You would execute the driver with ``input={'region' : 'US'}``, to get the desired behavior. You could `also` construct
the DAG with ``config={'region' : 'US'}``.
**You want to override the value of a transform**. In this case, you can just pass this into the execute function of the
driver as overrides. E.G.:
.. code-block:: python
df = dr.execute(
['acquisition_cost'],
overrides={'spend' : pd.Series(
[40, 80, 100, 400, 800, 1000], # what if we increased the marketing spend?
index=pd.date_range("2022-01-01", periods=6, freq="w"))})
Calling Execute()
#################
There are two ways to use ``execute()``:
#. Call it once -- you only request the outputs required. E.g. ``dr.execute(['desired_output1', 'desired_output2'])``
#. Call it in succession by providing it specific inputs, in addition to the outputs required. E.g. ``dr.execute(['desired_output1', 'desired_output2'], inputs={...})``
We recommend using option (1) where possible. Option (2) only makes sense if you want to reuse the dataflow created for
different data sets, or to chunk over large data or iterate over objects, e.g. images or text.
Visualizing Execution
#####################
Hamilton enables you to quickly and easily visualize your entire DAG, as well as the specific execution path to compute
an output. Underneath we default to use `graphviz <https://graphviz.org/>`_ for visualization.
Visualize just execution required to create outputs
***************************************************
.. code-block:: python
dr.visualize_execution(['desired_output1', 'desired_output2'], './my_file.dot', render_args)
In addition to specifying the outputs you desire, you need to provide a path to save the created dot file and image, and
then provide some arguments for rendering -- at minimum, pass in an empty dictionary.
Visualize the entire DAG constructed
************************************
.. code-block:: python
dr.display_all_functions('./my_file.dot', render_args)
You need to provide a path to save the created dot file and image, and then provide some optional arguments for
rendering.
Should I define my own Driver?
------------------------------
The APIs that the Hamilton Driver is built on, are considered internal. So it is possible for you to define your own
driver in place of the stock Hamilton driver, we suggest the following path if you don't like how the current Hamilton
Driver interface is designed:
`Write a "Wrapper" class that delegates to the Hamilton Driver.`
i.e.
.. code-block:: python
from hamilton import driver
class MyCustomDriver(object):
def __init__(self, constructor_arg, ...):
self.constructor_arg = constructor_arg
...
# some internal functions specific to your context
# ...
def my_execute_function(self, arg1, arg2, ...):
"""What actually calls the Hamilton"""
dr = driver.Driver(self.constructor_arg, ...)
df = dr.execute(self.outputs)
return self.augmetn(df)
That way, you can create the right API constructs to invoke Hamilton in your context, and then delegate to the stock
Hamilton Driver. By doing so, it will ensure that your code continues to work, since we intend to honor the Hamilton
Driver APIs with backwards compatibility as much as possible.