# Materialization
Hamilton's driver allows for ad-hoc materialization. This enables you to take a DAG you already have,
and save your data to a set of custom locations/URLs.
Note that these materializers are _isomorphic_ in nature to the
[@save_to](https://hamilton.dagworks.io/en/latest/reference/decorators/save_to/)
decorator. Materializers inject the additional nodes at runtime, modifying the
DAG to include data saver nodes, and return metadata about the materialization.
This framework is meant to be highly pluggable. While the set of available data savers is currently
limited, we expect folks to build their own materializers (and, hopefully, contribute them back to the community!).
## Example
In this example we take the scikit-learn iris_loader pipeline, and materialize outputs to specific
locations through a driver call. We demonstrate:
1. Saving model parameters to a json file (using the default json materializer)
2. Writing custom data adapters for:
1. Pickling a model to an object file
2. Saving confusion matrices to a csv file
See [run.py](run.py) for the full example.
In this example we only pass literal values to the materializers. That said, you can use both `source` (to reference
the output of an upstream node) and `value` (the default, for literal values) when parameterizing a materializer.
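For instance, a materializer's `path` argument could be specified either way. This is a minimal sketch; the node names `model_parameters` and `output_path` are hypothetical stand-ins:

```python
from hamilton.function_modifiers import source, value
from hamilton.io.materialization import to

# Literal path -- wrapping in value() is optional, since literals are the default.
to.json(
    id="params_to_json",
    dependencies=["model_parameters"],
    path=value("./params.json"),
)

# Pull the path from an upstream node named "output_path" instead.
to.json(
    id="params_to_json",
    dependencies=["model_parameters"],
    path=source("output_path"),
)
```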
## `driver.materialize`
This will be a high-level overview. For more details,
see the [documentation](https://hamilton.dagworks.io/en/latest/reference/drivers/Driver/#hamilton.driver.Driver.materialize).
`driver.materialize()` does the following:
1. Processes a list of materializers to create a new DAG
2. Alters the output to include the materializer nodes
3. Processes a list of "additional variables" (for debugging) to return intermediary data
4. Executes the DAG, including the materializers
5. Returns a tuple of (`materialization metadata`, `additional variables`)
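A minimal sketch of such a call, assuming a hypothetical module `my_pipeline` that defines nodes named `model_parameters` and `training_accuracy`:

```python
from hamilton import driver
from hamilton.io.materialization import to

import my_pipeline  # hypothetical module containing your transform functions

dr = driver.Driver({}, my_pipeline)
metadata, additional_vars = dr.materialize(
    to.json(
        id="model_params_to_json",          # name of the injected saver node
        dependencies=["model_parameters"],  # what gets saved
        path="./params.json",
    ),
    additional_vars=["training_accuracy"],  # intermediary data returned for debugging
)
```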
Materializers each consume:
1. A `dependencies` list to materialize
2. An optional `combine` parameter to combine the outputs of the dependencies
(this is required if there are multiple dependencies). This is a [ResultMixin](https://hamilton.dagworks.io/en/latest/concepts/customizing-execution/#result-builders) object
3. An `id` parameter to identify the materializer, which serves as the node name in the DAG
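As a sketch of how these fit together, here is a materializer with multiple dependencies combined into a single dict before saving (the node and file names are hypothetical):

```python
from hamilton import base
from hamilton.io.materialization import to

to.json(
    id="metrics_to_json",                                 # node name in the DAG
    dependencies=["training_accuracy", "test_accuracy"],  # nodes to materialize
    combine=base.DictResult(),                            # required: more than one dependency
    path="./metrics.json",
)
```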
Materializers are referenced by the `to` object in `hamilton.io.materialization`, which utilizes
dynamic dispatch to create the appropriate materializer.
These refer to a `DataSaver` implementation, keyed by a string (e.g. `csv`).
Multiple data adapters can share the same key, each applying to a specific type
(e.g. pandas dataframe, numpy matrix, polars dataframe). New
data adapters are registered by calling `hamilton.registry.register_adapter`.
## Custom Materializers
To define a custom materializer, all you have to do is implement the `DataSaver` class
(which allows use in `save_to` as well). This is demonstrated in [custom_materializers.py](custom_materializers.py).
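A rough sketch of such an implementation, assuming `DataSaver` requires `save_data`, `applicable_types`, and `name` (see [custom_materializers.py](custom_materializers.py) for the canonical version):

```python
import dataclasses
from typing import Any, Collection, Dict, Type

import numpy as np

from hamilton import registry
from hamilton.io.data_adapters import DataSaver


@dataclasses.dataclass
class NumpyMatrixToCSV(DataSaver):
    """Saves a numpy matrix to a CSV file."""

    path: str

    def save_data(self, data: np.ndarray) -> Dict[str, Any]:
        np.savetxt(self.path, data, delimiter=",")
        # Returned dict becomes the materialization metadata for this node.
        return {"path": self.path, "shape": data.shape}

    @classmethod
    def applicable_types(cls) -> Collection[Type]:
        return [np.ndarray]

    @classmethod
    def name(cls) -> str:
        return "csv"  # shares the "csv" key with other savers; dispatch is by type


# Registration makes the saver available to `to.csv(...)` (and `@save_to.csv(...)`).
registry.register_adapter(NumpyMatrixToCSV)
```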
## `driver.materialize` vs `@save_to`
`driver.materialize` is an ad-hoc form of `save_to`. Use it during development, when you want to materialize
outputs on the fly. When you have a production ETL, you can choose between `save_to` and `materialize`.
If the save location/structure is unlikely to change, you might consider using `save_to`. Otherwise, `materialize`
is an idiomatic way of conducting materialization operations that cleanly separates side effects from transformations.