===============
Materialization
===============
So far, we executed our dataflow using the ``Driver.execute()`` method, which can receive an ``inputs`` dictionary and return a ``results`` dictionary (by default). However, you can also execute code with ``Driver.materialize()`` to directly read from / write to external data sources (file, database, cloud data store).
On this page, you'll learn:
- The difference between ``.execute()`` and ``.materialize()``
- Why use materialization
- What ``DataLoader`` and ``DataSaver`` objects are
- The basics of writing your own materializer
Different ways to write the same dataflow
-----------------------------------------
Below are 3 ways to write a dataflow that:
1. loads a dataframe from a parquet file
2. preprocesses the dataframe
3. trains a machine learning model
4. saves the trained model
The first two options use ``Driver.execute()``; the third uses ``Driver.materialize()``. Notice where in the code data is loaded and saved, and how it affects the dataflow.
.. table:: Model training
:align: left
+----------------------------------------------+-----------------------------------------------+--------------------------------------------------------+
| Nodes / dataflow context | Driver context | Materialization |
+==============================================+===============================================+========================================================+
| .. literalinclude:: _snippets/node_ctx.py | .. literalinclude:: _snippets/driver_ctx.py | .. literalinclude:: _snippets/materializer_ctx.py |
| | | |
+----------------------------------------------+-----------------------------------------------+--------------------------------------------------------+
| .. image:: _snippets/node_ctx.png | .. image:: _snippets/driver_ctx.png | .. image:: _snippets/materializer_ctx.png |
| :width: 500px | :width: 500px | :width: 500px |
+----------------------------------------------+-----------------------------------------------+--------------------------------------------------------+
As explained previously, ``Driver.execute()`` walks the graph to compute the list of nodes you requested by name. With ``Driver.materialize()``, you instead pass a list of data loaders (``from_``) and data savers (``to``). Each one adds a node to the dataflow before execution.
.. note::
    ``Driver.materialize()`` can do everything ``Driver.execute()`` does, and more. It can receive ``inputs`` and ``overrides``. Instead of ``final_vars``, use ``additional_vars`` to request nodes that you want computed and returned without being materialized/saved.
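For example, here is a sketch of such a call (the module name, node names, and paths are illustrative):

.. code-block:: python

    from hamilton import driver
    from hamilton.io.materialization import from_, to
    import my_dataflow  # hypothetical module defining preprocessed_df() and model()

    dr = driver.Builder().with_modules(my_dataflow).build()
    metadata, additional_outputs = dr.materialize(
        from_.parquet(target="raw_df", path="data.parquet"),  # adds a loader node
        to.pickle(id="model_save", dependencies=["model"], path="model.pkl"),  # adds a saver node
        additional_vars=["preprocessed_df"],  # computed and returned, but not saved
    )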
Why use materialization
-----------------------
Let's compare the benefits and limitations of the 3 approaches.
Nodes / dataflow context
~~~~~~~~~~~~~~~~~~~~~~~~
This approach defines data loading and saving as part of the dataflow and uses ``Driver.execute()``. It is usually the simplest approach and the one you should start with.
Benefits
- the functions ``raw_df()`` and ``save_model()`` are transparent as to how they load/save data
- can easily change data location using the strings ``data_path`` and ``model_dir`` as inputs
- all operations are part of the dataflow
Limitations
- need to write a separate function for each parquet file loaded and each model saved. To reduce code duplication, one could write a utility function ``_load_parquet()``
- can be restrictive as to how data is loaded. Passing ``overrides`` to the ``.execute()`` call can add flexibility.
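For reference, here is a condensed sketch of this approach (the function bodies and the ``target`` column are illustrative):

.. code-block:: python

    import pandas as pd
    import xgboost

    def raw_df(data_path: str) -> pd.DataFrame:
        # loading is a node of the dataflow
        return pd.read_parquet(data_path)

    def preprocessed_df(raw_df: pd.DataFrame) -> pd.DataFrame:
        return raw_df.dropna()

    def model(preprocessed_df: pd.DataFrame) -> xgboost.XGBRegressor:
        # assumes an illustrative "target" column
        features = preprocessed_df.drop(columns=["target"])
        return xgboost.XGBRegressor().fit(features, preprocessed_df["target"])

    def save_model(model: xgboost.XGBRegressor, model_dir: str) -> None:
        # saving is also a node of the dataflow
        model.save_model(f"{model_dir}/model.json")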
Driver context
~~~~~~~~~~~~~~
This approach loads and saves data outside the dataflow and uses ``Driver.execute()``. Since the Driver is responsible for executing your dataflow, it makes sense to handle data loading/saving in the context of the "driver code" (e.g., ``run.py``) if they change often.
Benefits
- the Driver user is responsible for loading/saving data
- fewer dataflow functions to define and maintain
- the functions ``raw_df()`` and ``save_model()`` can live in a separate Python module that you can optionally include when building the Driver.
Limitations
- adds complexity to the "driver code".
- you lose the benefits of Hamilton for loading and saving operations (visualization, lifecycle hooks, etc.)
- to regain flexibility in data loading/saving, you can adopt the **nodes/dataflow context** approach and add functions with ``@config`` for alternative implementations (see :ref:`config-decorators`).
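For reference, here is a condensed sketch of the corresponding "driver code" (``my_dataflow`` and the paths are illustrative):

.. code-block:: python

    import pandas as pd
    from hamilton import driver
    import my_dataflow  # hypothetical module defining preprocessed_df() and model() only

    dr = driver.Builder().with_modules(my_dataflow).build()

    # loading happens outside the dataflow...
    raw_df = pd.read_parquet("data.parquet")
    results = dr.execute(["model"], inputs={"raw_df": raw_df})
    # ...and so does saving
    results["model"].save_model("model_dir/model.json")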
Materialization
~~~~~~~~~~~~~~~
This approach tries to strike a balance between the two previous methods and uses ``Driver.materialize()``.
Unique benefits
- Use the Hamilton logic to combine nodes (more on that later)
- Get tested code for common data loading and saving out-of-the-box (e.g., JSON, CSV, Parquet, pickle)
- Easily save the same node to multiple formats
Benefits
- Flexibility for Driver users to change data location
- Fewer dataflow functions to define and maintain
- All operations are part of the dataflow
Limitations
- Writing a custom DataSaver or DataLoader requires more effort than adding a function to the dataflow.
- Adds *some* complexity to the Driver (e.g., ``run.py``).
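For instance, saving the same node to multiple formats only takes passing several savers that share a dependency. Continuing the sketch above (the ids and paths are illustrative, and a JSON saver for your model type requires the matching plugin):

.. code-block:: python

    dr.materialize(
        from_.parquet(target="raw_df", path="data.parquet"),
        # two savers depending on the same "model" node
        to.json(id="model_json", dependencies=["model"], path="model.json"),
        to.pickle(id="model_pickle", dependencies=["model"], path="model.pkl"),
    )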
DataLoader and DataSaver
------------------------
In Hamilton, ``DataLoader`` and ``DataSaver`` are classes that define how to load or save a particular data format. Calling ``Driver.materialize(DataLoader(), DataSaver())`` adds nodes to the dataflow (see visualizations above).
Here are simplified snippets for saving and loading an XGBoost model to/from JSON.
+----------------------------------------------+-----------------------------------------------+
| DataLoader | DataSaver |
+==============================================+===============================================+
| .. literalinclude:: _snippets/data_loader.py | .. literalinclude:: _snippets/data_saver.py |
| | |
+----------------------------------------------+-----------------------------------------------+
To define your own DataSaver and DataLoader, the Hamilton `XGBoost extension <https://github.com/DAGWorks-Inc/hamilton/blob/main/hamilton/plugins/xgboost_extensions.py>`_ provides a good example.
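Once defined, adapters need to be registered for ``from_`` and ``to`` to discover them. A minimal sketch, following the registration pattern in that extension (``MyModelJsonReader`` and ``MyModelJsonWriter`` stand in for your own classes):

.. code-block:: python

    from hamilton import registry

    # make your adapters available as from_.json(...) and to.json(...);
    # MyModelJsonReader / MyModelJsonWriter are your DataLoader / DataSaver subclasses
    for adapter in [MyModelJsonReader, MyModelJsonWriter]:
        registry.register_adapter(adapter)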
``@load_from`` and ``@save_to``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The data loaders and savers also power the ``@load_from`` and ``@save_to`` :ref:`loader-saver-decorators`.
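For example, here is a sketch of decorated dataflow functions (the paths and node names are illustrative):

.. code-block:: python

    import pandas as pd
    from hamilton.function_modifiers import load_from, save_to, source

    @load_from.parquet(path=source("data_path"), inject_="df")
    def raw_df(df: pd.DataFrame) -> pd.DataFrame:
        # @load_from adds a loader node feeding the "df" parameter
        return df

    @save_to.parquet(path=source("output_path"), output_name_="save_preprocessed")
    def preprocessed_df(raw_df: pd.DataFrame) -> pd.DataFrame:
        # requesting "save_preprocessed" at execution time triggers the save
        return raw_df.dropna()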