We translate the species distribution modeling example from scikit-learn into Hamilton to showcase pipe and pipe_output.
Example of a simple ETL pipeline broken into modules with external couplings.
grids.py or preprocessing.py, where we use @pipe to wrap and inject external functions as Hamilton nodes.
train_and_predict.py, where we use @pipe_output to evaluate our model on the individual test and train datasets.
We have two complementary decorators that can help with transforming the input / output of a node in the DAG: pipe_input and pipe_output.
We can directly transform the input of a node using pipe_input with
@pipe_input(
    step(baz, ...),
    step(qux, ...),
)
def foo(bar: Any) -> Any:
    ...
In the above case two nodes baz and qux get created and are inserted between bar and foo; thus we get
foo(qux(baz(bar)))
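To make the resulting call order concrete, here is a minimal plain-Python sketch. It does not use Hamilton: `chain_input` is a hypothetical stand-in that only mimics the order in which the steps are applied to the first argument, and the bodies of `baz`, `qux`, and `foo` are toy functions invented for illustration.

```python
from functools import wraps
from typing import Any, Callable


def chain_input(*steps: Callable[[Any], Any]) -> Callable:
    """Hypothetical stand-in for pipe_input: apply `steps` to the first argument, in order."""

    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(first: Any, *args: Any, **kwargs: Any) -> Any:
            for step_fn in steps:
                first = step_fn(first)  # baz first, then qux
            return fn(first, *args, **kwargs)

        return wrapper

    return decorator


def baz(x: int) -> int:
    return x + 1


def qux(x: int) -> int:
    return x * 10


@chain_input(baz, qux)
def foo(bar: int) -> int:
    return bar - 3


# Equivalent to foo(qux(baz(2))) = (2 + 1) * 10 - 3
result = foo(2)
```

The difference in Hamilton is that baz and qux are not hidden function calls: each one becomes its own node in the DAG between bar and foo.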
This can be particularly useful in the following cases:
While @does/@parameterize can do this, this presents an easier way to do it, especially in a chain.
On the other hand, we can also perform transformations on the output of a node using pipe_output with
@pipe_output(
    step(baz, ...),
    step(qux, ...),
)
def foo(bar: Any) -> Any:
    ...
In the above case two nodes, baz and qux, get created and are appended after foo; thus we get
qux(baz(foo(bar)))
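The output-side ordering can be sketched the same way in plain Python. Again, this is not Hamilton code: `chain_output` is a hypothetical helper that only reproduces the order of application, and the toy bodies of `foo`, `baz`, and `qux` are assumptions made for the example.

```python
from functools import wraps
from typing import Any, Callable


def chain_output(*steps: Callable[[Any], Any]) -> Callable:
    """Hypothetical stand-in for pipe_output: apply `steps` to the return value, in order."""

    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            out = fn(*args, **kwargs)  # the original node runs first
            for step_fn in steps:
                out = step_fn(out)  # baz first, then qux
            return out

        return wrapper

    return decorator


def baz(x: int) -> int:
    return x + 1


def qux(x: int) -> int:
    return x * 10


@chain_output(baz, qux)
def foo(bar: int) -> int:
    return bar - 3


# Equivalent to qux(baz(foo(5))) = ((5 - 3) + 1) * 10
result = foo(5)
```

Note that foo itself is unchanged; only what downstream consumers see as the output of foo is transformed.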
This can be particularly useful when:
model, aka feature engineering hyper-tuning.
step(...).when(...) and can choose at execution time which transformation will be applied to the output of a particular node. For example, each step represents a different model and we switch between them with a config dictionary in the Hamilton driver.
The original script has been left intact. The only change we made is to add
if __name__ == "__main__":
    plot_species_distribution()
    plt.show()
so that it does not run when imported into other scripts.
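The effect of this guard can be checked with a small, self-contained sketch. The module name `guarded_demo` and its contents are made up for illustration; the real script calls plot_species_distribution() and plt.show() rather than appending to a list.

```python
import importlib.util
import tempfile
from pathlib import Path

# Source of a throwaway module that mirrors the guard pattern (illustrative only).
MODULE_SOURCE = '''
calls = []

def plot_species_distribution():
    calls.append("plotted")

if __name__ == "__main__":
    plot_species_distribution()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "guarded_demo.py"
    path.write_text(MODULE_SOURCE)

    # Importing the file sets __name__ to "guarded_demo", so the guarded block is skipped.
    spec = importlib.util.spec_from_file_location("guarded_demo", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

calls_after_import = list(module.calls)  # snapshot taken right after the import

# The function is still available to importing code and runs only when called.
module.plot_species_distribution()
calls_after_call = list(module.calls)
```

Running the same file directly with the Python interpreter sets `__name__` to `"__main__"`, so the guarded block executes in that case.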
This script can be run as is for comparison. We directly import its functions construct_grids() and create_species_bunch() and use them as external functions to showcase how you can use our pipe and pipe_output functionality to “Hamiltonise” external modules.
A notebook that contains all the modules and has the execution cells to run the complete code. It also lets you visualize the DAG.
If you prefer to run the code from a shell, the same code is also available as a Python script.
python run.py
The original code and analysis are taken from scikit-learn. We don’t know them and they don’t know us, but they wrote a neat analysis and we wanted to show you what their procedural code would look like if it were written with the help of Hamilton.
Thanks to the authors for creating it:
License: BSD 3 clause