Hamilton and Airflow

In this example, we're going to show how to run Hamilton within an Airflow task. Both tools are used to author DAGs with Python code, but they operate on different levels:

  • Airflow is an orchestrator written in Python. Its purpose is to launch tasks (which can be of any kind: Python, SQL, Bash, etc.) and make sure they complete.
  • Hamilton is a data transformation framework. It helps developers write Python code that is modular and reusable, and that can be executed as a DAG.

Therefore, Hamilton allows you to write modular, easier-to-maintain code that can be used within or outside of Airflow. This results in simpler Airflow DAGs, which improves maintainability and might improve performance.
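Below is a minimal sketch of the pattern (not the code shipped in this repository): a Hamilton module defines transformations as plain Python functions, and an Airflow task builds a Hamilton Driver over that module and executes it. The module name my_functions, the node names, and the inputs are placeholders.

    # my_functions.py -- a hypothetical Hamilton module under /plugins
    import pandas as pd

    def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
        """Hamilton node: computed from the upstream `spend` and `signups` nodes."""
        return spend / signups


    # A DAG file under /dags that runs the Hamilton module inside one Airflow task.
    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
    def hamilton_within_airflow():
        @task
        def transform() -> dict:
            import pandas as pd
            from hamilton import driver
            import my_functions  # the Hamilton module sketched above

            dr = driver.Driver({}, my_functions)  # config dict + module(s)
            inputs = {"spend": pd.Series([10.0, 25.0]), "signups": pd.Series([1, 5])}
            df = dr.execute(["spend_per_signup"], inputs=inputs)
            return df.to_dict()  # keep the XCom payload JSON-serializable

        transform()

    hamilton_within_airflow()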

File organization

  • /dags/hamilton/hamilton_how_to_dag.py shows the basics of Hamilton and illustrates how to integrate with Airflow.
  • /dags/hamilton/absenteeism_prediction_dag.py shows a concrete example of loading data, training a machine learning model, and evaluating it.
  • Each example is powered by Python modules under /plugins/function_modules and /plugins/absenteeism respectively (details about this in the How-to DAG). Under /docs, you'll find the DAG visualization of these Hamilton modules.
  • For the purpose of this example repository, data will be read from and written to /plugins/data since it is easily accessible to Airflow. This shouldn't be used in production settings.

Airflow setup

We will use custom Docker containers based on the official Airflow docker-compose.yaml.

  1. git clone the Hamilton repository
  2. From the terminal, go to the airflow example directory hamilton/examples/airflow/
  3. Create a .env file with your Airflow UID using: echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
  4. Create the Docker containers using: docker compose up --build. This can take a few minutes. Afterwards, your Airflow Docker containers should be running.
  5. Connect via http://localhost:8080. The default username is airflow and the password is airflow.

Motivation

  1. Simplify Airflow DAGs. This respects the ethos of the Airflow reference:

    Make your DAG generate simpler structure. Every task dependency adds additional processing overhead for scheduling and execution. The DAG that has simple linear structure A -> B -> C will experience less delays in task scheduling than DAG that has a deeply nested tree structure with exponentially growing number of depending tasks for example. If you can make your DAGs more linear - where at single point in execution there are as few potential candidates to run among the tasks, this will likely improve overall scheduling performance.

  2. Separate infrastructure from data science concerns. Airflow is responsible for orchestrating tasks and is typically owned by data engineers. Hamilton is responsible for authoring clean, reusable, and maintainable data transformations and is typically owned by data scientists. By integrating both tools, these two personas/teams have clearer ownership and version control over their codebases, and it can promote reusability of both Airflow and Hamilton code. Notably, one can replace a dynamic Airflow DAG (generated via config) with a static Airflow DAG and let Hamilton handle the dynamic data transformation requirements (see the first sketch after this list). This better separates the consistent Airflow infrastructure pipeline from the project-specific Hamilton data transforms. Conversely, Hamilton data transforms can be reused across Airflow pipelines to move data that powers different initiatives (dashboards, apps, APIs, etc.), greatly improving consistency.

  3. Write efficient Python code efficiently. With Python being the 3rd most popular programming language in 2023, most data professionals should be able to pick up Airflow and Hamilton quickly. However, production systems are composed of multiple services (database, data warehouse, compute cluster, cache, serverless functions, etc.), most of which are SQL-based or have a Python SDK. This puts a strain on engineers who need to learn and maintain this sprawling codebase. For orchestration, Airflow providers standardize interactions between the Airflow DAG and external systems using the Python language (see providers). For data transformation, Hamilton graph adapters can automatically convert pandas code to production-grade computation engines such as Ray, Dask, Spark, Pandas on Spark (Koalas), and async Python (see graph adapters (experimental)). These graph adapters automatically provide benefits such as result caching, GPU computing, or out-of-core computation (see the second sketch after this list).
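Here is a rough sketch of the second point, under assumed names (the module my_transforms, the node final_output, and the fallback config are illustrative, not part of this repository): the Airflow DAG stays static, and per-run parameters are forwarded from dag_run.conf to the Hamilton Driver, which builds the transform DAG from that configuration.

    # A static Airflow DAG: its task graph never changes; Hamilton receives
    # per-run configuration and derives the data transformation DAG from it.
    from datetime import datetime
    from airflow.decorators import dag, task
    from airflow.operators.python import get_current_context

    @dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
    def static_dag_dynamic_transforms():
        @task
        def run_transforms() -> dict:
            from hamilton import driver
            import my_transforms  # hypothetical Hamilton module under /plugins

            # Parameters supplied when the run is triggered (e.g. a JSON payload)
            # drive which Hamilton functions end up in the graph.
            conf = get_current_context()["dag_run"].conf or {"region": "EU"}
            dr = driver.Driver(conf, my_transforms)
            return dr.execute(["final_output"]).to_dict()

        run_transforms()

    static_dag_dynamic_transforms()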
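And a rough sketch of the third point, assuming the experimental Ray adapter (exact import paths vary across Hamilton versions; newer releases expose adapters under hamilton.plugins, and my_transforms and final_output remain placeholders): the same pandas-style Hamilton module can be executed on Ray simply by attaching a graph adapter to the Driver.

    import ray
    from hamilton import base, driver
    from hamilton.experimental import h_ray  # hamilton.plugins.h_ray in newer versions
    import my_transforms  # hypothetical Hamilton module

    ray.init()  # or ray.init(address=...) to join an existing cluster
    adapter = h_ray.RayGraphAdapter(result_builder=base.PandasDataFrameResult())
    dr = driver.Driver({}, my_transforms, adapter=adapter)
    df = dr.execute(["final_output"])  # each Hamilton node runs as a Ray task
    ray.shutdown()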

Notes and limitations

  • The edits to the default Airflow docker-compose.yaml file include:
    • Building a custom Airflow image instead of pulling one from a registry
    • Adding a graphviz backend to the Airflow image (see Dockerfile)
    • Installing Python packages in the Airflow image (see Dockerfile)