examples/cli/README.md - hamilton - Git at Google

 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 # Apache Hamilton CLI

 The Apache Hamilton CLI allows to build `Driver` objects from the command line.

 # Installation
 Install dependencies with (it only needs `typer`)

 ```pip install sf-hamilton[cli]```

 Test the installation with

 ```hamilton --help```

 # Features

 ## Commands
 - `build`: creates a Apache Hamilton `Driver` from specified modules. It"s useful to validate the dataflow definition
 - `validate`: calls `Driver.validate_execution()` for a set of `inputs` and `overrides` passed through the `--context` option.
 - `view`: calls `dr.display_all_functions()` on the built `Driver`
 - `version`: generates node hashes based on their source code, and a dataflow hash from the collection of node hashes.
 - `diff`: get a diff of added/deleted/edited nodes between the current version of Python modules and another git reference (`default=HEAD`, i.e., the last commited version). You can get a visualization of the diffs

 ## Options
 - all commands receive `MODULES` which is a list of path to Python modules to assembled as a single dataflow
 - all commands receive `--context` (`-ctx`), which is a file (`.py` or `.json`) that include top-level headers (see `config.py` and `config.json` in this repo for example):
     - `HAMILTON_CONFIG`: `typing.Mapping` passed to `driver.Builder.with_config()`
     - `HAMILTON_FINAL_VARS`: `typing.Sequence` passed to `driver.validate_execution(final_vars=...)`
     - `HAMILTON_INPUTS`: `typing.Mapping` passed to `driver.validate_execution(inputs=...)`
     - `HAMILTON_OVERRIDES`: `typing.Mapping` passed to `driver.validate_execution(overrides=...)`
 - Using a `.py` context file provides more flexibility than `.json` to define inputs and overrides objects.
 - all commands receive a `--name` (`-n`), which is used to name the output file (when the command produces a file). If `None`, a file name will be derived from the `MODULES` argument.
 - When using a command that generates a file:
     - passing a file path: will output the file with this name at this location
     - passing a directory: will output the file with the `--name` value (either explicit or default derived from `MODULES`) at this location
     - passing a file path with the name `default`: will output the file with the name replaced by `--name` value at this location. This is useful when you need to specify a type via filename. For example, `hamilton view -o /path/to/default.pdf my_dataflow.py` will create the file `/path/to/my_dataflow.pdf`. (This behavior may change)


 See [DOCS.md](./DOCS.md) for the full references

 # Usage

 - Useful to quickly check a dataflow definition is valid without creating a script or an interactive environment
 - `./example_script.py` shows you how to pipe out CLI responses as JSON objects. Can be useful in a CI pipeline
 - For programmatic use in Python, see `hamilton.cli.commands` Note that this API is experimental and unstable.

 # Examples
 To illustrate, let"s take the file `module_v1.py` which contains a few functions (details aren"t important).

 ```python
 import pandas as pd
 from hamilton.function_modifiers import extract_columns

 def customers_df(customers_path: str = "customers.csv") -> pd.DataFrame:
     """Load the customer dataset."""
     return pd.read_csv(customers_path)

 def orders_df(orders_path: str = "orders.csv") -> pd.DataFrame:
     """Load the orders dataset."""
     return pd.read_csv(orders_path)

 @extract_columns("amount", "age", "country")
 def customers_orders_df(customers_df: pd.DataFrame, orders_df: pd.DataFrame) -> pd.DataFrame:
     """Combine the customers and orders datasets.
     Setting index to (order_id, customer_id)."""
     _df = pd.merge(customers_df, orders_df, on="customer_id")
     _df = _df.set_index(["order_id", "customer_id"])
     return _df

 def orders_per_customer(customers_orders_df: pd.DataFrame) -> pd.Series:
     """Compute the number of orders per customer.
     Outputs series indexed by customer_id."""
     return customers_orders_df.groupby("customer_id").size().rename("orders_per_customer")

 def average_order_by_customer(amount: pd.Series) -> pd.Series:
     """Compute the average order amount per customer.
     Outputs series indexed by customer_id."""
     return amount.groupby("customer_id").mean().rename("average_order_by_customer")

 def customer_summary_table(
     orders_per_customer: pd.Series, average_order_by_customer: pd.Series
 ) -> pd.DataFrame:
     """Our customer summary table definition."""
     return pd.concat([orders_per_customer, average_order_by_customer], axis=1)
 ```

 ## `build`
 ### Command
 Call `build` with a single Python module path.

 ```
 hamilton build module_v1.py
 ```
 ### Response
 Returns the modules included in the dataflow
 ```json
 {"modules": ["module_v1"]}
 ```

 ## `view`
 ### Command
 Call `view` with a single Python module path, and specify the output visualization path.

 ```
 hamilton view --output ./dag.png module_v1.py
 ```
 ### Response
 Returns the modules included in the dataflow
 ```json
 {"path": "/home/tjean/projects/dagworks/hamilton/examples/cli/dag.png"}
 ```
 ![](./dag.png)

 ## `version`
 ### Command
 Call `version` with a single Python module path.

 ```
 hamilton version module_v1.py
 ```
 ### Response
 Returns the hashes for the nodes" function and the dataflow hash (hashes trimmed for readability).
 ```json
 {
   "dataflow_hash": "13b05...",
   "nodes_hash": {
     "age": "18eb2...",
     "amount": "18eb2...",
     "average_order_by_customer": "671e3...",
     "country": "18eb2...",
     "customer_summary_table": "19905...",
     "customers_df": "34f04...",
     "customers_orders_df": "18eb2...",
     "customers_path": "34f04...",
     "orders_df": "452f4...",
     "orders_path": "452f4...",
     "orders_per_distributor": "278b2..."
   }
 }
 ```

 ## `diff`
 After making the following changes to the last functions
 ```python
 # renamed this function
 def orders_per_distributor(customers_orders_df: pd.DataFrame) -> pd.Series:
     """Compute the number of orders per customer.
     Outputs series indexed by customer_id."""
     return customers_orders_df.groupby("customer_id").size().rename("orders_per_distributor")

 # added 1 to the return value
 def average_order_by_customer(amount: pd.Series) -> pd.Series:
     """Compute the average order amount per customer.
     Outputs series indexed by customer_id."""
     return 1 + (amount.groupby("customer_id").mean().rename("average_order_by_customer"))

 # renamed according to `orders_per_distributor`
 def customer_summary_table(
     orders_per_distributor: pd.Series, average_order_by_customer: pd.Series
 ) -> pd.DataFrame:
     """Our customer summary table definition."""
     return pd.concat([orders_per_distributor, average_order_by_customer], axis=1)
 ```

 ### Command
 ```
 hamilton diff --view --output ./diff.png module_v1.py
 ```

 ### Response
 ```json
 {
   "edit": [
     "average_order_by_customer",
     "customer_summary_table"
   ],
   "v1_only": ["orders_per_customer"],
   "v2_only": ["orders_per_distributor"]
 }
 ```

 ![](./diff.png)
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	# Apache Hamilton CLI

	The Apache Hamilton CLI allows to build `Driver` objects from the command line.

	# Installation
	Install dependencies with (it only needs `typer`)

	```pip install sf-hamilton[cli]```

	Test the installation with

	```hamilton --help```

	# Features

	## Commands
	- `build`: creates a Apache Hamilton `Driver` from specified modules. It"s useful to validate the dataflow definition
	- `validate`: calls `Driver.validate_execution()` for a set of `inputs` and `overrides` passed through the `--context` option.
	- `view`: calls `dr.display_all_functions()` on the built `Driver`
	- `version`: generates node hashes based on their source code, and a dataflow hash from the collection of node hashes.
	- `diff`: get a diff of added/deleted/edited nodes between the current version of Python modules and another git reference (`default=HEAD`, i.e., the last commited version). You can get a visualization of the diffs

	## Options
	- all commands receive `MODULES` which is a list of path to Python modules to assembled as a single dataflow
	- all commands receive `--context` (`-ctx`), which is a file (`.py` or `.json`) that include top-level headers (see `config.py` and `config.json` in this repo for example):
	- `HAMILTON_CONFIG`: `typing.Mapping` passed to `driver.Builder.with_config()`
	- `HAMILTON_FINAL_VARS`: `typing.Sequence` passed to `driver.validate_execution(final_vars=...)`
	- `HAMILTON_INPUTS`: `typing.Mapping` passed to `driver.validate_execution(inputs=...)`
	- `HAMILTON_OVERRIDES`: `typing.Mapping` passed to `driver.validate_execution(overrides=...)`
	- Using a `.py` context file provides more flexibility than `.json` to define inputs and overrides objects.
	- all commands receive a `--name` (`-n`), which is used to name the output file (when the command produces a file). If `None`, a file name will be derived from the `MODULES` argument.
	- When using a command that generates a file:
	- passing a file path: will output the file with this name at this location
	- passing a directory: will output the file with the `--name` value (either explicit or default derived from `MODULES`) at this location
	- passing a file path with the name `default`: will output the file with the name replaced by `--name` value at this location. This is useful when you need to specify a type via filename. For example, `hamilton view -o /path/to/default.pdf my_dataflow.py` will create the file `/path/to/my_dataflow.pdf`. (This behavior may change)


	See [DOCS.md](./DOCS.md) for the full references

	# Usage

	- Useful to quickly check a dataflow definition is valid without creating a script or an interactive environment
	- `./example_script.py` shows you how to pipe out CLI responses as JSON objects. Can be useful in a CI pipeline
	- For programmatic use in Python, see `hamilton.cli.commands` Note that this API is experimental and unstable.

	# Examples
	To illustrate, let"s take the file `module_v1.py` which contains a few functions (details aren"t important).

	```python
	import pandas as pd
	from hamilton.function_modifiers import extract_columns

	def customers_df(customers_path: str = "customers.csv") -> pd.DataFrame:
	"""Load the customer dataset."""
	return pd.read_csv(customers_path)

	def orders_df(orders_path: str = "orders.csv") -> pd.DataFrame:
	"""Load the orders dataset."""
	return pd.read_csv(orders_path)

	@extract_columns("amount", "age", "country")
	def customers_orders_df(customers_df: pd.DataFrame, orders_df: pd.DataFrame) -> pd.DataFrame:
	"""Combine the customers and orders datasets.
	Setting index to (order_id, customer_id)."""
	_df = pd.merge(customers_df, orders_df, on="customer_id")
	_df = _df.set_index(["order_id", "customer_id"])
	return _df

	def orders_per_customer(customers_orders_df: pd.DataFrame) -> pd.Series:
	"""Compute the number of orders per customer.
	Outputs series indexed by customer_id."""
	return customers_orders_df.groupby("customer_id").size().rename("orders_per_customer")

	def average_order_by_customer(amount: pd.Series) -> pd.Series:
	"""Compute the average order amount per customer.
	Outputs series indexed by customer_id."""
	return amount.groupby("customer_id").mean().rename("average_order_by_customer")

	def customer_summary_table(
	orders_per_customer: pd.Series, average_order_by_customer: pd.Series
	) -> pd.DataFrame:
	"""Our customer summary table definition."""
	return pd.concat([orders_per_customer, average_order_by_customer], axis=1)
	```

	## `build`
	### Command
	Call `build` with a single Python module path.

	```
	hamilton build module_v1.py
	```
	### Response
	Returns the modules included in the dataflow
	```json
	{"modules": ["module_v1"]}
	```

	## `view`
	### Command
	Call `view` with a single Python module path, and specify the output visualization path.

	```
	hamilton view --output ./dag.png module_v1.py
	```
	### Response
	Returns the modules included in the dataflow
	```json
	{"path": "/home/tjean/projects/dagworks/hamilton/examples/cli/dag.png"}
	```
	![](./dag.png)

	## `version`
	### Command
	Call `version` with a single Python module path.

	```
	hamilton version module_v1.py
	```
	### Response
	Returns the hashes for the nodes" function and the dataflow hash (hashes trimmed for readability).
	```json
	{
	"dataflow_hash": "13b05...",
	"nodes_hash": {
	"age": "18eb2...",
	"amount": "18eb2...",
	"average_order_by_customer": "671e3...",
	"country": "18eb2...",
	"customer_summary_table": "19905...",
	"customers_df": "34f04...",
	"customers_orders_df": "18eb2...",
	"customers_path": "34f04...",
	"orders_df": "452f4...",
	"orders_path": "452f4...",
	"orders_per_distributor": "278b2..."
	}
	}
	```

	## `diff`
	After making the following changes to the last functions
	```python
	# renamed this function
	def orders_per_distributor(customers_orders_df: pd.DataFrame) -> pd.Series:
	"""Compute the number of orders per customer.
	Outputs series indexed by customer_id."""
	return customers_orders_df.groupby("customer_id").size().rename("orders_per_distributor")

	# added 1 to the return value
	def average_order_by_customer(amount: pd.Series) -> pd.Series:
	"""Compute the average order amount per customer.
	Outputs series indexed by customer_id."""
	return 1 + (amount.groupby("customer_id").mean().rename("average_order_by_customer"))

	# renamed according to `orders_per_distributor`
	def customer_summary_table(
	orders_per_distributor: pd.Series, average_order_by_customer: pd.Series
	) -> pd.DataFrame:
	"""Our customer summary table definition."""
	return pd.concat([orders_per_distributor, average_order_by_customer], axis=1)
	```

	### Command
	```
	hamilton diff --view --output ./diff.png module_v1.py
	```

	### Response
	```json
	{
	"edit": [
	"average_order_by_customer",
	"customer_summary_table"
	],
	"v1_only": ["orders_per_customer"],
	"v2_only": ["orders_per_distributor"]
	}
	```

	![](./diff.png)