
# Data Loaders

Among its many uses, Hamilton excels at building maintainable, scalable representations of ETLs. If you've read through the other guides, it should be pretty clear how Hamilton enables transformations (the T in ETL/ELT). In this example, we'll talk about an approach to extracting data, and how Hamilton enables you to build out extracts in a scalable, pluggable way. For example, being able to switch where data is loaded from between development and production is useful, since you might only want a subsample in development, or even load it from a different source entirely. Here we'll show you how to achieve this without cluttering your code with if/else statements, which will make your dataflow easier to maintain in the long run.

The goal is to show you two things:

  1. How to load data from various sources
  2. How to switch the sources of data you're loading from by swapping out modules, using polymorphism

As such, we have three data loaders to use:

  1. `load_data_mock.py`: Generates mock data on the fly. Meant to represent a unit-testing/quick-iteration scenario.
  2. `load_data_csv.py`: Uses CSV data. Meant to represent more ad-hoc research.
  3. `load_data_duckdb.py`: Uses a DuckDB database (saved locally). Meant to represent more production-ready dataflows, as well as to demonstrate the ease of working with DuckDB.

This is hardly exclusive or exhaustive, though. One can easily imagine loading from Snowflake, your custom data warehouse, HDFS, etc., all by swapping out the data loaders.
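The key mechanism is that every loader module exposes the same function names (and thus produces the same outputs) while taking different inputs. Below is a minimal sketch of what two interchangeable loader modules could look like; the `spend` function and column names here are hypothetical, not necessarily the ones defined in this example's modules:

```python
# Sketch of two interchangeable loader modules. The "spend" name is
# hypothetical -- see load_data_mock.py / load_data_csv.py for the real
# function names.
import pandas as pd


# --- load_data_mock.py (sketch) ---
def spend(n_rows: int = 100) -> pd.Series:
    """Fabricates spend data in memory -- no external dependencies."""
    return pd.Series(range(n_rows), name="spend", dtype=float)


# --- load_data_csv.py (sketch) ---
def spend(spend_csv_path: str) -> pd.Series:  # same name, different inputs
    """Reads spend data from a CSV file on disk."""
    return pd.read_csv(spend_csv_path)["spend"]
```

Because downstream transform functions only declare a dependency on `spend`, they neither know nor care which module produced it.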

## The data

The data comes pregenerated in [test_data](test_data), in both `.csv` and `.duckdb` formats.

## Loading/Analyzing the data

To load/analyze the data, run the script `run.py`:

* `python run.py csv` reads from the `.csv` files and computes all the output variables
* `python run.py duckdb` reads from the DuckDB database and computes all the output variables
* `python run.py mock` creates mock data on the fly and runs the pipeline

Note that you, as the user, currently have to handle connections (and related setup) for DuckDB manually; the sketch below shows what that might look like. We are designing the ability to do this natively in Hamilton: https://github.com/dagworks-inc/hamilton/issues/197.
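Concretely, the driver script only has to choose which loader module to hand to Hamilton; everything downstream stays the same. Here is a hedged sketch of that dispatch logic (the output variable name, the database path, and the `db_client` input are assumptions for illustration; see `run.py` for the actual details):

```python
# A minimal sketch of run.py-style dispatch; not the exact script.
import sys

import duckdb
from hamilton import driver

import load_data_csv
import load_data_duckdb
import load_data_mock
import prep_data

mode = sys.argv[1]  # one of: "csv", "duckdb", "mock"
loader = {
    "csv": load_data_csv,
    "duckdb": load_data_duckdb,
    "mock": load_data_mock,
}[mode]

# The transform module is the same every time; only the loader swaps.
dr = driver.Driver({}, loader, prep_data)

inputs = {}
if mode == "duckdb":
    # You manage the connection yourself for now (see issue #197).
    # Both "db_client" and the file path are hypothetical names.
    inputs["db_client"] = duckdb.connect("test_data/database.duckdb")

# "spend_zero_mean" is a hypothetical output variable name.
print(dr.execute(["spend_zero_mean"], inputs=inputs))
```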

## Execution Graphs

You'll see that the execution graphs are practically identical except for the top part, which is exactly what we want: the data-loader modules define the same set of functions but have different requirements for getting the data!
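If you want to render graphs like these yourself, the Hamilton driver can produce them via graphviz. A sketch, assuming the mock loader and a hypothetical output variable name (exact arguments may vary by Hamilton version):

```python
# Sketch: render the execution graph for a chosen set of outputs.
# Requires the graphviz package; "spend_zero_mean" is hypothetical.
from hamilton import driver

import load_data_mock
import prep_data

dr = driver.Driver({}, load_data_mock, prep_data)
dr.visualize_execution(
    ["spend_zero_mean"],       # outputs to trace back from
    "./mock_execution_graph",  # output file; graphviz appends the extension
    {"format": "png"},         # render kwargs passed through to graphviz
)
```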

### CSV

![csv execution graph](csv_execution_graph.png)

### DuckDB

![duckdb execution graph](duckdb_execution_graph.png)

### Mock

![mock execution graph](mock_execution_graph.png)