# Why use Hamilton?
There are many choices for building dataflows/pipelines/workflows/ETLs.
Let's compare Hamilton to some of the other options to help answer this question.
## Comparison to Other Frameworks
There are a lot of frameworks out there, especially in the pipeline space. This section should help you figure out when to
use Hamilton with another framework, in place of one, or when to use another framework altogether.
Let's go over some groups of "competitive" or "complementary" products. For a basic overview,
see the product matrix on the [homepage](../index.md).
### Orchestration Systems
Examples include:
- [Airflow](https://airflow.apache.org/)
- [Metaflow](https://github.com/Netflix/metaflow)
- [Luigi](https://github.com/spotify/luigi)
- [dbt](https://www.getdbt.com/)
Hamilton is not, in itself, a macro (i.e. high-level) task orchestration system. While it does orchestrate functions,
and the DAG abstraction is very powerful, it does not provision compute
or schedule long-running jobs. Hamilton works well in conjunction with these macro systems.
Hamilton provides the capabilities of fine-grained lineage, highly readable code, and self-documenting pipelines,
which many of these systems lack.
Hamilton can be used within any python orchestration system in the following ways:
1. _Hamilton DAGs can be called within orchestration system tasks._
See the [Hamilton + Airflow example](https://blog.dagworks.io/p/supercharge-your-airflow-dag-with). The integration is generally trivial -- all you have to do
is call out to the Hamilton library within your task. If your orchestrator supports python, then you're good to go. Some pseudocode (if your orchestrator handles scripts, like Airflow):
```python
# my_task.py
from hamilton import driver
import my_transformations  # module containing your Hamilton functions

dr = driver.Driver({}, my_transformations)
output = dr.execute(['final_var'], inputs=...)
do_something_with(output)
```
2. _Hamilton DAGs can be broken up to run as components within an orchestration system._
With the ability to include [overrides](../concepts/driver.rst),
you can run part of the DAG in each task, overriding the outputs of the previous task (plus passing in any static inputs/configuration), and hand the results on to the next task. This is more
of a manual/power-user feature. Some pseudocode:
```python
# my_task.py
from hamilton import driver
import my_functions  # module containing your Hamilton functions

prior_inputs = load_relevant_task_results()  # outputs of the previous task(s)
desired_outputs = ['final_var_1', 'final_var_2']
dr = driver.Driver({}, my_functions)
output = dr.execute(
    desired_outputs,
    inputs=...,  # any static inputs/configuration
    overrides=prior_inputs,
)
save_for_later(output)
```
### Feature Stores
Examples include:
- [Hopsworks](https://www.hopsworks.ai/)
- [Feast](https://feast.dev/)
- [Tecton](https://tecton.ai/)
One can think of Hamilton as being your "feature definition store", where "store" is code + git. While it does
not provide all the capabilities of a standard feature store, it provides a source of truth for the code that
generated the features, and can be run in a portable way. *So*, if your desire is just to be able to run the same
code in different environments, and have an online/offline store of features, you can use Hamilton both to save the
features offline and to generate features online on the fly.
See the [feature engineering example](../how-tos/use-for-feature-engineering.rst) for more possibilities, as
well as [blogs on the feature topic](https://blog.dagworks.io/?sort=search&search=features).
Note that in small cases, you probably don't need a true feature store -- recomputing derived features in an ETL
and online can be very efficient, as long as you have some database to look values up (or have them passed in).
Also note that joins and aggregations can get tricky. We often recommend using our "polymorphic function
definition" pattern, i.e. functions decorated with `@config.when`, to either load the non-online-friendly features
from a feature store or do an external lookup to simulate an online join.
We expect Hamilton to play a prominent role in the way feature stores work in the future.
### Data Science Ecosystems/ML platforms
Examples include:
- [Kedro](https://kedro.org/)
- [Domino Data Labs](https://www.dominodatalab.com/)
- [Dataiku](https://www.dataiku.com/)
- [SageMaker](https://aws.amazon.com/sagemaker/)
- [Google Cloud Vertex AI Platform](https://cloud.google.com/vertex-ai)
- etc.
We've kind of grouped a whole suite of platforms into the same bucket here. These
tend to have a lot of capabilities all related to ML. Hamilton can be used in conjunction with these
platforms in a variety of ways. For example, you can use Hamilton to generate features for a model
that you train in one of these platforms. Or you can use Hamilton to generate a model using the
platform's compute, and then save the model to the platform's registry.
### Registries / Experiment Tracking
Examples include:
- [MLflow](https://mlflow.org/)
- [Weights and Biases](https://wandb.ai/site)
- [DVC](https://dvc.org/)
Most pipelines have a "reverse ETL problem" -- they need to get the results of the pipeline into some
sort of datastore or registry. Hamilton can be used in conjunction with these tools as the glue code
that helps everything work together. For example, you can use Hamilton to generate a model
and then store metrics computed by Hamilton to one of these "destinations".
There are three main ways to integrate with these tools:
- inside a function that Hamilton orchestrates
- outside Hamilton (e.g. in a script that calls Hamilton)
- using "materializers" (see the [materializers reference](../reference/io/index.rst) and [this blog](https://blog.dagworks.io/p/separate-data-io-from-transformation)).
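For the first pattern, a minimal sketch -- here `log_metric` is a hypothetical stand-in for a real client call such as `mlflow.log_metric`, and the function/metric names are made up:

```python
# A dict standing in for an external metrics backend, for illustration.
metrics_store: dict = {}

def log_metric(name: str, value: float) -> None:
    # Stand-in for a tracking client call (e.g. mlflow.log_metric).
    metrics_store[name] = value

def validation_accuracy(predictions: list, labels: list) -> float:
    """A function Hamilton could orchestrate; it returns the metric
    as a node output and logs it to the tracker as a side effect."""
    accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
    log_metric("validation_accuracy", accuracy)
    return accuracy
```

Because the function both returns the value and pushes it to the tracker, the metric stays addressable as a node in the DAG while still landing in your registry of choice.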
See this [ML reference post](https://blog.dagworks.io/p/from-dev-to-prod-a-ml-pipeline-reference) for examples of how to use Hamilton with these tools.
### Python Dataframe/manipulation Libraries
Examples include:
- [pandas](https://pandas.pydata.org/)
- [dask](https://www.dask.org/)
- [modin](https://github.com/modin-project/modin)
- [polars](https://www.pola.rs/)
- [duckdb](https://duckdb.org/)
Hamilton works with any python dataframe/data-manipulation library.
See our [examples folder](https://github.com/dagworks-inc/hamilton/tree/main/examples)
to see how to use Hamilton with these libraries.
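To make the shape concrete: Hamilton transforms are plain Python functions, where parameter names refer to upstream nodes/columns and the function name becomes the output. A pandas-flavored sketch (the column names here are hypothetical):

```python
import pandas as pd

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend per signup; `spend` and `signups` are upstream nodes."""
    return spend / signups

def spend_zero_mean(spend: pd.Series) -> pd.Series:
    """Spend, centered at zero -- becomes the `spend_zero_mean` node."""
    return spend - spend.mean()
```

Because these are just typed functions, swapping pandas for polars or pandas-on-spark is largely a matter of changing the annotations and the underlying operations.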
### Python "big data" systems
The following systems are ones you would reach for when you want to scale up your data processing.
Examples include:
- [dask](https://www.dask.org/)
- [ray](https://ray.io/)
- [pyspark](https://spark.apache.org/docs/latest/api/python/)
- [pandas-on-spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html)
These all provide capabilities to either (a) express and execute computation over datasets in python or (b)
parallelize it -- often both. Hamilton has a variety of integrations with these systems. The basic idea is that Hamilton
can make use of these systems to execute the DAG via the [GraphAdapter](../reference/graph-adapters/index.rst) abstraction and [Lifecycle Hooks](../reference/lifecycle-hooks/index.rst).
See our [examples folder](https://github.com/dagworks-inc/hamilton/tree/main/examples)
to see how to use Hamilton with these systems.