blob: 346d905d253d18d555f107a2fd6aa3d4a08c7cd4 [file] [log] [blame] [view]
# Hamilton CLI
The Hamilton CLI allows to build `Driver` objects from the command line.
# Installation
Install dependencies with (it only needs `typer`)
```pip install sf-hamilton[cli]```
Test the installation with
```hamilton --help```
# Features
## Commands
- `build`: creates a Hamilton `Driver` from specified modules. It"s useful to validate the dataflow definition
- `validate`: calls `Driver.validate_execution()` for a set of `inputs` and `overrides` passed through the `--context` option.
- `view`: calls `dr.display_all_functions()` on the built `Driver`
- `version`: generates node hashes based on their source code, and a dataflow hash from the collection of node hashes.
- `diff`: get a diff of added/deleted/edited nodes between the current version of Python modules and another git reference (`default=HEAD`, i.e., the last commited version). You can get a visualization of the diffs
## Options
- all commands receive `MODULES` which is a list of path to Python modules to assembled as a single dataflow
- all commands receive `--context` (`-ctx`), which is a file (`.py` or `.json`) that include top-level headers (see `config.py` and `config.json` in this repo for example):
- `HAMILTON_CONFIG`: `typing.Mapping` passed to `driver.Builder.with_config()`
- `HAMILTON_FINAL_VARS`: `typing.Sequence` passed to `driver.validate_execution(final_vars=...)`
- `HAMILTON_INPUTS`: `typing.Mapping` passed to `driver.validate_execution(inputs=...)`
- `HAMILTON_OVERRIDES`: `typing.Mapping` passed to `driver.validate_execution(overrides=...)`
- Using a `.py` context file provides more flexibility than `.json` to define inputs and overrides objects.
- all commands receive a `--name` (`-n`), which is used to name the output file (when the command produces a file). If `None`, a file name will be derived from the `MODULES` argument.
- When using a command that generates a file:
- passing a file path: will output the file with this name at this location
- passing a directory: will output the file with the `--name` value (either explicit or default derived from `MODULES`) at this location
- passing a file path with the name `default`: will output the file with the name replaced by `--name` value at this location. This is useful when you need to specify a type via filename. For example, `hamilton view -o /path/to/default.pdf my_dataflow.py` will create the file `/path/to/my_dataflow.pdf`. (This behavior may change)
See [DOCS.md](./DOCS.md) for the full references
# Usage
- Useful to quickly check a dataflow definition is valid without creating a script or an interactive environment
- `./example_script.py` shows you how to pipe out CLI responses as JSON objects. Can be useful in a CI pipeline
- For programmatic use in Python, see `hamilton.cli.commands` Note that this API is experimental and unstable.
# Examples
To illustrate, let"s take the file `module_v1.py` which contains a few functions (details aren"t important).
```python
import pandas as pd
from hamilton.function_modifiers import extract_columns
def customers_df(customers_path: str = "customers.csv") -> pd.DataFrame:
"""Load the customer dataset."""
return pd.read_csv(customers_path)
def orders_df(orders_path: str = "orders.csv") -> pd.DataFrame:
"""Load the orders dataset."""
return pd.read_csv(orders_path)
@extract_columns("amount", "age", "country")
def customers_orders_df(customers_df: pd.DataFrame, orders_df: pd.DataFrame) -> pd.DataFrame:
"""Combine the customers and orders datasets.
Setting index to (order_id, customer_id)."""
_df = pd.merge(customers_df, orders_df, on="customer_id")
_df = _df.set_index(["order_id", "customer_id"])
return _df
def orders_per_customer(customers_orders_df: pd.DataFrame) -> pd.Series:
"""Compute the number of orders per customer.
Outputs series indexed by customer_id."""
return customers_orders_df.groupby("customer_id").size().rename("orders_per_customer")
def average_order_by_customer(amount: pd.Series) -> pd.Series:
"""Compute the average order amount per customer.
Outputs series indexed by customer_id."""
return amount.groupby("customer_id").mean().rename("average_order_by_customer")
def customer_summary_table(
orders_per_customer: pd.Series, average_order_by_customer: pd.Series
) -> pd.DataFrame:
"""Our customer summary table definition."""
return pd.concat([orders_per_customer, average_order_by_customer], axis=1)
```
## `build`
### Command
Call `build` with a single Python module path.
```
hamilton build module_v1.py
```
### Response
Returns the modules included in the dataflow
```json
{"modules": ["module_v1"]}
```
## `view`
### Command
Call `view` with a single Python module path, and specify the output visualization path.
```
hamilton view --output ./dag.png module_v1.py
```
### Response
Returns the modules included in the dataflow
```json
{"path": "/home/tjean/projects/dagworks/hamilton/examples/cli/dag.png"}
```
![](./dag.png)
## `version`
### Command
Call `version` with a single Python module path.
```
hamilton version module_v1.py
```
### Response
Returns the hashes for the nodes" function and the dataflow hash (hashes trimmed for readability).
```json
{
"dataflow_hash": "13b05...",
"nodes_hash": {
"age": "18eb2...",
"amount": "18eb2...",
"average_order_by_customer": "671e3...",
"country": "18eb2...",
"customer_summary_table": "19905...",
"customers_df": "34f04...",
"customers_orders_df": "18eb2...",
"customers_path": "34f04...",
"orders_df": "452f4...",
"orders_path": "452f4...",
"orders_per_distributor": "278b2..."
}
}
```
## `diff`
After making the following changes to the last functions
```python
# renamed this function
def orders_per_distributor(customers_orders_df: pd.DataFrame) -> pd.Series:
"""Compute the number of orders per customer.
Outputs series indexed by customer_id."""
return customers_orders_df.groupby("customer_id").size().rename("orders_per_distributor")
# added 1 to the return value
def average_order_by_customer(amount: pd.Series) -> pd.Series:
"""Compute the average order amount per customer.
Outputs series indexed by customer_id."""
return 1 + (amount.groupby("customer_id").mean().rename("average_order_by_customer"))
# renamed according to `orders_per_distributor`
def customer_summary_table(
orders_per_distributor: pd.Series, average_order_by_customer: pd.Series
) -> pd.DataFrame:
"""Our customer summary table definition."""
return pd.concat([orders_per_distributor, average_order_by_customer], axis=1)
```
### Command
```
hamilton diff --view --output ./diff.png module_v1.py
```
### Response
```json
{
"edit": [
"average_order_by_customer",
"customer_summary_table"
],
"v1_only": ["orders_per_customer"],
"v2_only": ["orders_per_distributor"]
}
```
![](./diff.png)