| ========= |
| Dagster |
| ========= |
| |
| Here are some code snippets to compare the macro orchestrator Dagster to the micro orchestrator Hamilton. Hamilton can run inside Dagster, but you wouldn't run Dagster inside Hamilton. |
| |
| While the two have different scope, there's a lot of overlap between the two both in terms of functionality and API. Indeed, Dagster's software-defined assets introduced in 2022 matches Hamilton's declarative approach and should feel familiar to users of either. |
| |
| |
| ------ |
| TL;DR |
| ------ |
| |
| .. list-table:: |
| :widths: 24 39 39 |
| :header-rows: 1 |
| |
| * - Trait |
| - Hamilton |
| - Dagster |
| * - Declarative API |
| - ✅ |
| - ✅ |
| * - Dependencies |
| - Lightweight library with minimal dependencies (``numpy``, ``pandas``, ``typing_inspect``). Minimizes dependency conflicts. |
| - Heavier framework/system with several dependencies (``pydantic``, ``sqlalchemy``, ``requests``, ``Jinja2``, ``protobuf``). ``urllib3`` on which depends ``requests`` introduced breaking changes several times and ``pydantic`` v1 and v2 are incompatible. |
| * - Macro orchestration |
| - DIY or in tandem with Dagster, Airflow, Prefect, Metaflow, etc. |
| - Includes: manual, schedules, sensors, conditional execution |
| * - Micro orchestration (i.e., ``dbt``, ``LangChain``) |
| - Can run anywhere (locally, notebook, macro orchestrator, `FastAPI <https://hamilton.dagworks.io/en/latest/integrations/fastapi/>`_, `Streamlit <https://hamilton.dagworks.io/en/latest/integrations/streamlit/>`_, pyodide, etc.) |
| - ❌ |
| * - Code structure |
| - Since it's micro, there are no restrictions. |
| - Since it's macro, a certain code structure is required to properly package code. The prevalent use of relative imports in the tutorial reduces code reusability. |
| * - LLM applications |
| - Well-suited for `LLM applications <https://blog.dagworks.io/p/retrieval-augmented-generation-reference-arch>`_ since it's a micro orchestration framework. |
| - ❌ |
| * - Lineage |
| - Fine-grained / column-level lineage. Includes `utilities to explore lineage <https://hamilton.dagworks.io/en/latest/how-tos/use-hamilton-for-lineage/>`_. |
| - Coarser operations to reduce orchestration and I/O overhead. |
| * - Visualization |
| - `View the dataflow and produce visual artifacts <https://hamilton.dagworks.io/en/latest/concepts/visualization/>`_. Configurable and supports extensive custom styling. |
| - Export Daster UI in ``.svg``. No styling. |
| * - Run tracking |
| - `DAGWorks <https://docs.dagworks.io/capabilities>`_ (premium) |
| - Dagster UI |
| * - Experiment Managers |
| - Has an `experiment manager plugin <https://blog.dagworks.io/p/building-a-lightweight-experiment>`_ |
| - ❌ |
| * - Materializers |
| - `Data Savers & Loaders <https://hamilton.dagworks.io/en/latest/concepts/materialization/>`_ |
| - `IO Managers <https://docs.dagster.io/_apidocs/io-managers>`_ |
| * - Data validation |
| - `Native validators and pandera plugin <https://hamilton.dagworks.io/en/latest/how-tos/run-data-quality-checks/>`_ |
| - `Asset checks <https://docs.dagster.io/_apidocs/asset-checks>`_ (experimental), `pandera integration <https://docs.dagster.io/integrations/pandera>`_ |
| * - Versioning operations |
| - Nodes and dataflow versions are derived from code. |
| - `Asset code version <https://docs.dagster.io/concepts/assets/software-defined-assets#asset-code-versions>`_ is specified manually. |
| * - Versioning data |
| - Automated code version + data value are used to read from cache or compute new results with `DiskCacheAdapter <https://docs.dagster.io/concepts/assets/software-defined-assets#asset-code-versions>`_ |
| - Manual asset code version + upstream changes are used to `trigger re-materialization <https://docs.dagster.io/concepts/assets/software-defined-assets#asset-code-versions>`_ |
| * - In-memory Execution |
| - Default |
| - `Materialize in-memory <https://docs.dagster.io/_apidocs/io-managers>`_ |
| * - Task-based Execution |
| - `TaskBasedExecutor <https://hamilton.dagworks.io/en/latest/reference/drivers/Driver/#taskbasedgraphexecutor>`_ |
| - Default |
| * - Dynamic branching |
| - `Parallelizable/Collect <https://hamilton.dagworks.io/en/latest/concepts/parallel-task/>`_ |
| - `Mapping/Collect <https://docs.dagster.io/_apidocs/dynamic>`_ |
| * - Hooks |
| - `Lifecycle hooks <https://hamilton.dagworks.io/en/latest/reference/lifecycle-hooks/>`_ (easier to extend) |
| - `Op Hooks <https://docs.dagster.io/concepts/ops-jobs-graphs/op-hooks#op-hooks>`_ |
| * - Plugins |
| - `Spark <https://blog.dagworks.io/p/expressing-pyspark-transformations>`_, Dask, Ray, `Datadog <https://hamilton.dagworks.io/en/latest/reference/lifecycle-hooks/DDOGTracer/>`_, polars, pandera, and more (Hamilton is less restrictive and easier to extend) |
| - `Spark, Dask, polars, pandera, Databricks, Snowflake, Great Expections, and more <https://docs.dagster.io/integrations>`_ (Dagster integrations are more involved to develop) |
| * - Interactive Development |
| - `Jupyter Magic <https://hamilton.dagworks.io/en/latest/how-tos/use-in-jupyter-notebook/#use-hamilton-jupyter-magic>`_, `VSCode extension <https://marketplace.visualstudio.com/items?itemName=ThierryJean.hamilton>`_ |
| - ❌ |
| |
| |
| ---------------------- |
| Dataflow definition |
| ---------------------- |
| |
| .. table:: HackerNews top stories |
| :align: left |
| |
| +------------------------------------------------------------+----------------------------------------------------------+ |
| | Hamilton | Dagster | |
| +============================================================+==========================================================+ |
| | .. literalinclude:: _dagster_snippets/hamilton_dataflow.py | .. literalinclude:: _dagster_snippets/dagster_dataflow.py| |
| | | | |
| +------------------------------------------------------------+----------------------------------------------------------+ |
| | .. image:: _dagster_snippets/hamilton_dataflow.png | .. image:: _dagster_snippets/dagster_dataflow.png | |
| | | | |
| +------------------------------------------------------------+----------------------------------------------------------+ |
| |
| .. list-table:: Key points |
| :widths: 24 39 39 |
| :header-rows: 1 |
| |
| * - Trait |
| - Hamilton |
| - Dagster |
| * - Define operations |
| - Uses the native Python function signature. The dataflow is assembled based on function/parameter names and type annotations. |
| - Uses the ``@asset`` decorator to transform function in operations and specify dependencies by passing functions. |
| * - Data I/O |
| - Loading/Saving is decoupled from the dataflow definition. The code becomes `more portable and facilitates moving from dev to prod <https://blog.dagworks.io/p/separate-data-io-from-transformation>`_. |
| - Each asset code operations is coupled with I/O. Hard-coding this behavior reduces maintainability. |
| * - Lineage |
| - Favors granular operations and fine-grained lineage. For example, ``most_frequent_words()`` operates on a single column and the ``top_25_words_plot`` is its own function. |
| - Favors chunking dataflow into meaningful assets to reduce the orchestration and I/O overhead per operation. Finer lineage is complex to achieve and requires using ``@op``, ``@graph``, ``@job``, and ``@asset`` (`ref <https://docs.dagster.io/guides/dagster/how-assets-relate-to-ops-and-graphs>`_) |
| * - Documentation |
| - Uses the native Python docstrings. Further metadata can be added using the ``@tag`` decorator. |
| - Uses ``MaterializeResult`` to store metadata. |
| |
| ---------------------- |
| Dataflow execution |
| ---------------------- |
| |
| .. table:: HackerNews top stories |
| :align: left |
| |
| +-------------------------------------------------------------+------------------------------------------------------------+ |
| | Hamilton | Dagster | |
| +=============================================================+============================================================+ |
| | .. literalinclude:: _dagster_snippets/hamilton_execution.py | .. literalinclude:: _dagster_snippets/dagster_execution.py | |
| | | | |
| +-------------------------------------------------------------+------------------------------------------------------------+ |
| |
| .. list-table:: Key points |
| :widths: 24 39 39 |
| :header-rows: 1 |
| |
| * - Trait |
| - Hamilton |
| - Dagster |
| * - Execution instructions |
| - Define a ``Driver`` using the ``Builder`` object. It automatically assembles the graph from the dataflow definition found in ``dataflow.py`` |
| - Load assets from Python modules using ``load_assets_from_modules`` then create an asset job by selecting assets to include. Finally, create a ``Definitions`` object to register on the orchestrator. |
| * - Execution plane |
| - ``Driver.materialize()`` executes the dataflow in a Python process. Can be called as a script, `using the CLI <https://blog.dagworks.io/p/a-command-line-tool-to-improve-your>`_, or programmatically. |
| - The `asset job is executed by the orchestrator <https://docs.dagster.io/concepts/assets/asset-jobs>`_, either through Dagster UI, by a scheduler/sensor/trigger, or via the CLI. |
| * - Data I/O |
| - I/O is decoupled from dataflow definition. People responsible for deployment can manage data sources without refactoring the dataflow. (Data I/O can be coupled if wanted.) |
| - Data I/O is coupled with data assets which simplifies the execution code at the code of reusability. |
| * - Framework code |
| - Leverages a maximum of standard Python mechanisms (imports, env variables, etc.). |
| - Most constructs requires Dagster-specific code to leverage protobuf serialization. |
| |
| |
| ---------------- |
| More information |
| ---------------- |
| |
| For a full side-by-side example of Dagster and Hamilton, visit `this GitHub repository <https://github.com/dagworks-inc/hamilton/tree/main/examples/dagster>`_ |
| |
| For more questions, join our `Slack Channel <https://join.slack.com/t/hamilton-opensource/shared_invite/zt-2niepkra8-DGKGf_tTYhXuJWBTXtIs4g>`_! |
| |