===============
Materialization
===============
So far, we executed our dataflow using the ``Driver.execute()`` method, which can receive an ``inputs`` dictionary and return a ``results`` dictionary (by default). However, you can also execute code with ``Driver.materialize()`` to directly read from / write to external data sources (file, database, cloud data store).
On this page, you'll learn:
- The difference between ``.execute()`` and ``.materialize()``
- Why use materialization
- What ``DataLoader`` and ``DataSaver`` objects are
- The basics of writing your own materializer
Different ways to write the same dataflow
-----------------------------------------
Below are 3 ways to write a dataflow that:
1. loads a dataframe from a parquet file
2. preprocesses the dataframe
3. trains a machine learning model
4. saves the trained model
The first two options use ``Driver.execute()``; the third uses ``Driver.materialize()``. Notice where in the code data is loaded and saved, and how it affects the dataflow.
.. table:: Model training
:align: left
+----------------------------------------------+-----------------------------------------------+--------------------------------------------------------+
| Nodes / dataflow context | Driver context | Materialization |
+==============================================+===============================================+========================================================+
| .. literalinclude:: _snippets/node_ctx.py | .. literalinclude:: _snippets/driver_ctx.py | .. literalinclude:: _snippets/materializer_ctx.py |
| | | |
+----------------------------------------------+-----------------------------------------------+--------------------------------------------------------+
| .. image:: _snippets/node_ctx.png | .. image:: _snippets/driver_ctx.png | .. image:: _snippets/materializer_ctx.png |
| :width: 500px | :width: 500px | :width: 500px |
+----------------------------------------------+-----------------------------------------------+--------------------------------------------------------+
As explained previously, ``Driver.execute()`` walks the graph to compute the list of nodes you requested by name. With ``Driver.materialize()``, you instead pass a list of data loaders (``from_``) and data savers (``to``). Each one adds a node to the dataflow before execution.
.. note::
    ``Driver.materialize()`` can do everything ``Driver.execute()`` does, and more. It can receive ``inputs`` and ``overrides``. Instead of ``final_vars``, use ``additional_vars`` to request nodes that you want computed and returned without being materialized/saved.
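For example, here is a sketch of such a call (the module name, node names, and paths are illustrative):

.. code-block:: python

    from hamilton import driver
    from hamilton.io.materialization import from_, to
    import my_dataflow  # hypothetical module defining preprocessed_df() and model()

    dr = driver.Builder().with_modules(my_dataflow).build()
    metadata, additional_outputs = dr.materialize(
        from_.parquet(target="raw_df", path="data.parquet"),  # adds a loader node
        to.pickle(id="model_save", dependencies=["model"], path="model.pkl"),  # adds a saver node
        additional_vars=["preprocessed_df"],  # computed and returned, but not saved
    )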
Why use materialization
-----------------------
Let's compare the benefits and limitations of the 3 approaches.
Nodes / dataflow context
~~~~~~~~~~~~~~~~~~~~~~~~
This approach defines data loading and saving as part of the dataflow and uses ``Driver.execute()``. It is usually the simplest approach and the one you should start with.
Benefits
- the functions ``raw_df()`` and ``save_model()`` are transparent as to how they load/save data
- can easily change data location using the strings ``data_path`` and ``model_dir`` as inputs
- all operations are part of the dataflow
Limitations
- need to write a separate function for each parquet file loaded and each model saved. To reduce code duplication, one could write a utility function ``_load_parquet()``
- can be restrictive as to how data is loaded. Passing ``overrides`` to the ``.execute()`` call can add flexibility.
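For reference, here is a condensed sketch of this approach (the function bodies and the ``target`` column are illustrative):

.. code-block:: python

    import pandas as pd
    import xgboost

    def raw_df(data_path: str) -> pd.DataFrame:
        # loading is a node of the dataflow
        return pd.read_parquet(data_path)

    def preprocessed_df(raw_df: pd.DataFrame) -> pd.DataFrame:
        return raw_df.dropna()

    def model(preprocessed_df: pd.DataFrame) -> xgboost.XGBRegressor:
        # assumes an illustrative "target" column
        features = preprocessed_df.drop(columns=["target"])
        return xgboost.XGBRegressor().fit(features, preprocessed_df["target"])

    def save_model(model: xgboost.XGBRegressor, model_dir: str) -> None:
        # saving is also a node of the dataflow
        model.save_model(f"{model_dir}/model.json")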
Driver context
~~~~~~~~~~~~~~
This approach loads and saves data outside the dataflow and uses ``Driver.execute()``. Since the Driver is responsible for executing your dataflow, it makes sense to handle data loading/saving in the context of the "driver code" (e.g., ``run.py``) if they change often.
Benefits
- the Driver user is responsible for loading/saving data
- fewer dataflow functions to define and maintain
- the functions ``raw_df()`` and ``save_model()`` can live in a separate Python module that you can optionally include when building the Driver.
Limitations
- adds complexity to the "driver code".
- you lose the benefits of Hamilton for loading and saving operations (visualization, lifecycle hooks, etc.)
- to regain flexibility in data loading/saving, you can adopt the **nodes/dataflow context** approach and add functions with ``@config`` for alternative implementations (see :ref:`config-decorators`).
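For reference, here is a condensed sketch of the corresponding "driver code" (``my_dataflow`` and the paths are illustrative):

.. code-block:: python

    import pandas as pd
    from hamilton import driver
    import my_dataflow  # hypothetical module defining preprocessed_df() and model() only

    dr = driver.Builder().with_modules(my_dataflow).build()

    # loading happens outside the dataflow...
    raw_df = pd.read_parquet("data.parquet")
    results = dr.execute(["model"], inputs={"raw_df": raw_df})
    # ...and so does saving
    results["model"].save_model("model_dir/model.json")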
Materialization
~~~~~~~~~~~~~~~
This approach tries to strike a balance between the two previous methods and uses ``Driver.materialize()``.
Unique benefits
- Use the Hamilton logic to combine nodes (more on that later)
- Get tested code for common data loading and saving out-of-the-box (e.g., JSON, CSV, Parquet, pickle)
- Easily save the same node to multiple formats
Benefits
- Flexibility for Driver users to change data location
- Fewer dataflow functions to define and maintain
- All operations are part of the dataflow
Limitations
- Writing a custom DataSaver or DataLoader requires more effort than adding a function to the dataflow.
- Adds *some* complexity to the Driver (e.g., ``run.py``).
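For instance, saving the same node to multiple formats only takes passing several savers that share a dependency. Continuing the sketch above (the ids and paths are illustrative, and a JSON saver for your model type requires the matching plugin):

.. code-block:: python

    dr.materialize(
        from_.parquet(target="raw_df", path="data.parquet"),
        # two savers depending on the same "model" node
        to.json(id="model_json", dependencies=["model"], path="model.json"),
        to.pickle(id="model_pickle", dependencies=["model"], path="model.pkl"),
    )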
DataLoader and DataSaver
------------------------
In Hamilton, ``DataLoader`` and ``DataSaver`` are classes that define how to load or save a particular data format. Calling ``Driver.materialize(DataLoader(), DataSaver())`` adds nodes to the dataflow (see visualizations above).
Here are simplified snippets for saving and loading an XGBoost model to/from JSON.
+----------------------------------------------+-----------------------------------------------+
| DataLoader | DataSaver |
+==============================================+===============================================+
| .. literalinclude:: _snippets/data_loader.py | .. literalinclude:: _snippets/data_saver.py |
| | |
+----------------------------------------------+-----------------------------------------------+
To define your own DataSaver and DataLoader, the Hamilton `XGBoost extension <https://github.com/DAGWorks-Inc/hamilton/blob/main/hamilton/plugins/xgboost_extensions.py>`_ provides a good example.
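Once defined, adapters need to be registered for ``from_`` and ``to`` to discover them. A minimal sketch, following the registration pattern in that extension (``MyModelJsonReader`` and ``MyModelJsonWriter`` stand in for your own classes):

.. code-block:: python

    from hamilton import registry

    # make your adapters available as from_.json(...) and to.json(...);
    # MyModelJsonReader / MyModelJsonWriter are your DataLoader / DataSaver subclasses
    for adapter in [MyModelJsonReader, MyModelJsonWriter]:
        registry.register_adapter(adapter)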
``@load_from`` and ``@save_to``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The data loaders and savers also power the ``@load_from`` and ``@save_to`` :ref:`loader-saver-decorators`.
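For example, here is a sketch of decorated dataflow functions (the paths and node names are illustrative):

.. code-block:: python

    import pandas as pd
    from hamilton.function_modifiers import load_from, save_to, source

    @load_from.parquet(path=source("data_path"), inject_="df")
    def raw_df(df: pd.DataFrame) -> pd.DataFrame:
        # @load_from adds a loader node feeding the "df" parameter
        return df

    @save_to.parquet(path=source("output_path"), output_name_="save_preprocessed")
    def preprocessed_df(raw_df: pd.DataFrame) -> pd.DataFrame:
        # requesting "save_preprocessed" at execution time triggers the save
        return raw_df.dropna()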