 # Hamilton and DBT

 In this example, we show how easy it is to run Hamilton inside a dbt task. Making use of DBT's exciting new
 [Python API](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models), we can blend the two frameworks seamlessly.

 While the two frameworks might look similar at first glance, DBT and Hamilton are actually quite complementary.

 - DBT is best at managing SQL logic and handling materialization, while Hamilton excels at modeling transforms in Python.
 - DBT contains its own orchestration capabilities, whereas Hamilton often relies on an external framework to run the code.
 - DBT does not model micro-level transformations, whereas Hamilton excels at letting a user specify them in a readable, maintainable way.
 - DBT is focused on analytic/warehouse-level transformations, whereas Hamilton is well suited to expressing ML-specific transforms.

 At a high level, DBT can help you get the data and run large-scale operations in your warehouse,
 while Hamilton can help you build a model out of it.

 To demonstrate this, we've taken one of our favorite examples of writing data science code, [xLaszlo's code quality for DS tutorial](https://github.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise),
 and rewritten it using a combination of DBT + Hamilton. It models the classic Titanic problem.

 In this case, we're using [FAL](https://github.com/fal-ai/fal) to help run Python in dbt. It lets us manage environments,
 import packages without trouble, and so on.

 While the initial example is very simple, it should be enough for you to get started on your own!

 # Running

 To run the example, you'll need to do two things:

 1. Install the dependencies
 ```bash
 # Install the dependencies from PyPI
 $ cd examples/dbt
 $ pip install -r requirements.txt
 ```
 2. Execute!
 ```bash
 # Currently this has to be run from within the examples/dbt directory
 $ dbt run
 00:53:20  Running with dbt=1.3.1
 00:53:20  Found 2 models, 0 tests, 0 snapshots, 0 analyses, 292 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
 00:53:20
 00:53:20  Concurrency: 1 threads (target='dev')
 00:53:20
 00:53:20  1 of 2 START sql table model main.raw_passengers ............................... [RUN]
 00:53:20  1 of 2 OK created sql table model main.raw_passengers .......................... [OK in 0.06s]
 00:53:20  2 of 2 START python table model main.predict ................................... [RUN]
 00:53:21  2 of 2 OK created python table model main.predict .............................. [OK in 0.73s]
 00:53:21
 00:53:21  Finished running 2 table models in 0 hours 0 minutes and 0.84 seconds (0.84s).
 00:53:21
 00:53:21  Completed successfully
 00:53:21
 00:53:21  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
 ```

 This modifies a [DuckDB file](data/database.duckdb). You can inspect the results using Python or your favorite DuckDB interface.
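
As a minimal sketch of inspecting the results from Python, assuming the `duckdb` package is available in your environment: the table name `main.predict` comes from the log output above, while `build_preview_query` and `preview_table` are hypothetical helper names introduced only for this illustration.

```python
import re

# Matches a bare or schema-qualified identifier such as "main.predict".
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_preview_query(table: str, limit: int = 5) -> str:
    """Build a SELECT that previews the first `limit` rows of `table`."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


def preview_table(db_path: str, table: str, limit: int = 5) -> list:
    """Return the first rows of `table` from the DuckDB file at `db_path`."""
    # Imported here so the query helper above works without duckdb installed.
    import duckdb

    # Open read-only so inspecting the file cannot change what dbt wrote.
    con = duckdb.connect(db_path, read_only=True)
    try:
        return con.execute(build_preview_query(table, limit)).fetchall()
    finally:
        con.close()
```

After a successful `dbt run` from `examples/dbt`, calling `preview_table("data/database.duckdb", "main.predict")` should return the first few rows of the inference output as a list of tuples.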

 # Details

 We've organized the code into two separate DBT models:
 1. [raw_passengers](models/raw_passengers.sql): This is a simple select and join using DuckDB and DBT. Thanks to the simplicity of DBT, the SQL is just what you would write if it were embedded within a Python program or if you were executing it on your own!
    It does, however, automatically get materialized.
 2. [train_and_infer](models/train_and_infer.py)
     This uses the data output by (1) to do several things:

    - feature engineering to extract a test/train set
    - train a model using the train set
    - run inference over the entire data set

     It outputs the inference set. Note that it only runs a subset of the DAG. We could easily add more tasks that output metrics and so on, but we wanted to keep it simple.
     DBT's Python support is still in beta, and we'll be opening issues and contributing to help it mature! We're especially excited about FAL, as it helps solve some of the
     uglier Python problems we hit along the way.
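
To make the "subset of the DAG" point concrete, here is a toy sketch that uses only the standard library. It mimics the Hamilton convention that a function's name is the output it produces and its parameter names are its dependencies, and it computes only what the requested outputs need. This is not Hamilton's implementation or API, and the node functions (`age_filled`, `is_child`, `mean_age`) are hypothetical rather than taken from `train_and_infer.py`.

```python
import inspect


# Hypothetical nodes: the function name is the output, the parameters are inputs.
def age_filled(age: list, default_age: float) -> list:
    """Replace missing ages with a default value."""
    return [default_age if a is None else a for a in age]


def is_child(age_filled: list) -> list:
    """Flag passengers younger than 18."""
    return [a < 18 for a in age_filled]


def mean_age(age_filled: list) -> float:
    """A metrics-style node that nothing below asks for."""
    return sum(age_filled) / len(age_filled)


def resolve(name, nodes, inputs, cache, executed):
    """Compute `name` by first computing each of its parameters, recursively."""
    if name in inputs:
        return inputs[name]
    if name not in cache:
        fn = nodes[name]
        kwargs = {
            param: resolve(param, nodes, inputs, cache, executed)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
        executed.append(name)  # record which nodes actually ran
    return cache[name]


def execute(outputs, nodes, inputs):
    """Return the requested outputs plus the list of nodes that were run."""
    cache, executed = {}, []
    results = {out: resolve(out, nodes, inputs, cache, executed) for out in outputs}
    return results, executed


NODES = {fn.__name__: fn for fn in (age_filled, is_child, mean_age)}
results, executed = execute(
    ["is_child"], NODES, {"age": [4, None, 40], "default_age": 30.0}
)
```

Because only `is_child` is requested, `executed` contains `age_filled` and `is_child` but not `mean_age`. Adding a metrics output would amount to requesting one more name, without changing the functions themselves.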

 ## Visualizing Execution
 Here is the DAG generated by Hamilton for the above example:

 ![titanic_dbt](titanic_dbt.png)

 # Future Directions

 This is just a start, and we think that Hamilton + DBT have a long and exciting future together. In particular, we could:

 1. Compile Hamilton to DBT for orchestration -- the new [SQL adapter](https://github.com/dagworks-inc/hamilton/issues/197) we're working on would compile nicely to a dbt task.
 2. Add more natural integration, including a dbt plugin for a Hamilton task.
 3. Add more examples with different SQL dialects and different Python dialects. _hint_: _we're looking for contributors..._

 If you're excited by any of this, drop on by! Here are some resources to help you out:
 - [Hamilton slack channel](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1ko5snvxt-7KFPTMJyZTw1T7_Gpxryvw)
 - [DBT support](https://docs.getdbt.com/docs/dbt-support)
 - [xLaszlo's CQ4DS discord](https://discord.gg/8uUZNMCad2)