
Hamilton and DBT

In this example, we're going to show you how easy it is to run Hamilton inside a dbt task. By making use of dbt's new Python model API, we can blend the two frameworks seamlessly.

While the two frameworks might look similar at first glance, dbt and Hamilton are quite complementary:

  • dbt is best at managing SQL logic and handling materialization, while Hamilton excels at modeling transforms in Python.
  • dbt contains its own orchestration capabilities, whereas Hamilton often relies on an external framework to run the code.
  • dbt does not model micro-level transformations, whereas Hamilton thrives at enabling a user to specify them in a readable, maintainable way.
  • dbt is focused on analytics/warehouse-level transformations, whereas Hamilton excels at expressing ML-specific transforms.

At a high level, dbt can help you pull the data and run large-scale operations in your warehouse, while Hamilton can help you build a model out of it.
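
To make the comparison concrete, here is a small, hypothetical sketch (not code from this example) of how Hamilton expresses micro-level transforms: each function defines an output column, and its parameters declare the columns it depends on. The sibsp and parch inputs are standard Titanic columns; the actual feature functions under python_transforms/ may differ.

import pandas as pd


def family_size(sibsp: pd.Series, parch: pd.Series) -> pd.Series:
    """Family members traveling with the passenger, including themselves."""
    return sibsp + parch + 1


def is_alone(family_size: pd.Series) -> pd.Series:
    """1 if the passenger is traveling alone, else 0."""
    return (family_size == 1).astype(int)

Hamilton stitches these functions into a DAG by matching parameter names to function names, which is what makes the transforms readable, testable, and reusable.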

To demonstrate this, we've taken one of our favorite examples of writing data science code, xLaszlo's code quality for DS tutorial, and rewritten it using a combination of dbt + Hamilton. It models the classic Titanic problem.

In this case we're using fal to help run Python in dbt: it enables us to manage environments, import packages, and so on.

While the initial example is very simple, it should be enough for you to get started on your own!

Running

To run the example, you'll need to do two things:

  1. Install the dependencies
# Using pypi
$ cd examples/dbt
$ pip install -r requirements.txt
  2. Execute!
# Currently this has to be run from within the directory
$ dbt run
00:53:20  Running with dbt=1.3.1
00:53:20  Found 2 models, 0 tests, 0 snapshots, 0 analyses, 292 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:53:20
00:53:20  Concurrency: 1 threads (target='dev')
00:53:20
00:53:20  1 of 2 START sql table model main.raw_passengers ............................... [RUN]
00:53:20  1 of 2 OK created sql table model main.raw_passengers .......................... [OK in 0.06s]
00:53:20  2 of 2 START python table model main.predict ................................... [RUN]
00:53:21  2 of 2 OK created python table model main.predict .............................. [OK in 0.73s]
00:53:21
00:53:21  Finished running 2 table models in 0 hours 0 minutes and 0.84 seconds (0.84s).
00:53:21
00:53:21  Completed successfully
00:53:21
00:53:21  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

This will write the resulting tables to a DuckDB file. You can inspect the results using Python or your favorite DuckDB interface.
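
For example, here is a quick way to peek at the output with the DuckDB Python client; the database filename below is an assumption, so check profiles.yml for the path your profile actually points at:

import duckdb

# Hypothetical filename; see profiles.yml for the real one.
con = duckdb.connect("example.duckdb")
# main.predict is the table materialized by the run above.
print(con.sql("SELECT * FROM main.predict LIMIT 5").df())
con.close()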

Details

We've organized the code into two separate dbt models:

  1. raw_passengers: a simple select and join using DuckDB and dbt. Thanks to dbt's simplicity, it's just as you would write it if the SQL were embedded within a Python program or you were executing it on your own! It does, however, get materialized automatically.

  2. train_and_infer: uses the data output by (1) to do quite a few things:

    • feature engineering to extract a train/test split
    • training a model on the training set
    • running inference over the entire dataset

    It outputs the inference set. Note that it only runs a subset of the DAG; we could easily add more tasks that output metrics, etc. We just wanted to keep it simple. Python support in dbt is still in beta, and we'll be opening issues/contributing to make it more advanced! We're especially excited about fal, as it helps solve some of the uglier Python problems we hit along the way. A hedged sketch of what such a model can look like follows below.
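
For reference, here is a minimal sketch of a dbt Python model that wraps a Hamilton driver. The module import, the requested output name ("predictions"), and the input name are illustrative assumptions; the actual code lives in models/ and python_transforms/ in this directory.

from hamilton import driver

import python_transforms  # hypothetical import; the example organizes its Hamilton functions under python_transforms/


def model(dbt, session):
    # dbt.ref() hands us the upstream raw_passengers model; here we assume it
    # arrives as a pandas DataFrame (adapters differ in what they return).
    raw_passengers = dbt.ref("raw_passengers")
    dr = driver.Driver({}, python_transforms)
    # Request only the final output; Hamilton walks its DAG to run the
    # feature engineering and model training that the output depends on.
    return dr.execute(["predictions"], inputs={"raw_passengers": raw_passengers})

Returning a DataFrame is all dbt needs; it takes care of materializing the result as a table (main.predict in the run above).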

Visualizing Execution

Here is the DAG generated by Hamilton for the above example:

(Image: titanic_dbt.png)

Future Directions

This is just a start, and we think that Hamilton + dbt have a long and exciting future together. In particular, we could:

  1. Compile Hamilton to dbt for orchestration: the new SQL adapter we're working on would compile nicely to a dbt task.
  2. Add more natural integration, including a dbt plugin for a Hamilton task.
  3. Add more examples with different SQL dialects/different Python dialects. Hint: we're looking for contributors...

If you're excited by any of this, drop on by! Some resources to get you help: