The Hamilton CLI lets you build Driver objects from the command line.
Install the dependencies (the CLI only needs typer) with
pip install sf-hamilton[cli]
Test the installation with
hamilton --help
The CLI provides the following commands:

- build: creates a Hamilton Driver from the specified modules. It's useful for validating the dataflow definition.
- validate: calls Driver.validate_execution() for a set of inputs and overrides passed through the --context option.
- view: calls dr.display_all_functions() on the built Driver.
- version: generates node hashes based on their source code, and a dataflow hash from the collection of node hashes.
- diff: gets a diff of added/deleted/edited nodes between the current version of the Python modules and another git reference (default=HEAD, i.e., the last committed version). It can also produce a visualization of the diff.

The commands accept the following arguments and options:

- MODULES, which is a list of paths to Python modules to be assembled into a single dataflow.
- --context (-ctx), which is a file (.py or .json) that includes top-level headers (see config.py and config.json in this repo for an example):
  - HAMILTON_CONFIG: typing.Mapping passed to driver.Builder.with_config()
  - HAMILTON_FINAL_VARS: typing.Sequence passed to driver.validate_execution(final_vars=...)
  - HAMILTON_INPUTS: typing.Mapping passed to driver.validate_execution(inputs=...)
  - HAMILTON_OVERRIDES: typing.Mapping passed to driver.validate_execution(overrides=...)

  A .py context file provides more flexibility than .json to define inputs and overrides objects.
- --name (-n), which is used to name the output file (when the command produces a file). If None, a file name will be derived from the MODULES argument.
- For commands that produce a file, the output path (-o in the example below) is interpreted as follows:
  - a directory: the file is written with the --name value (either explicit or default derived from MODULES) at this location.
  - a file path named default: the file is written with the name replaced by the --name value at this location. This is useful when you need to specify a type via the filename. For example, hamilton view -o /path/to/default.pdf my_dataflow.py will create the file /path/to/my_dataflow.pdf. (This behavior may change.)

See DOCS.md for the full reference.
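As an illustration of the --context headers described above, a .py context file might look like the following minimal sketch. The file name context.py and every value in it are hypothetical; the node names are borrowed from the module_v1.py example used later in this README, and the config.py file in this repo remains the reference.

```python
# context.py: hypothetical context file, all values are illustrative
from typing import Any, Mapping, Sequence

# Passed to driver.Builder.with_config()
HAMILTON_CONFIG: Mapping[str, Any] = {}

# Passed to driver.validate_execution(final_vars=...)
HAMILTON_FINAL_VARS: Sequence[str] = ["customer_summary_table"]

# Passed to driver.validate_execution(inputs=...)
HAMILTON_INPUTS: Mapping[str, Any] = {
    "customers_path": "customers.csv",
    "orders_path": "orders.csv",
}

# Passed to driver.validate_execution(overrides=...)
HAMILTON_OVERRIDES: Mapping[str, Any] = {}
```

Such a file would presumably be passed along the lines of hamilton validate --context context.py module_v1.py; check hamilton --help for the exact invocation. Because it is plain Python, inputs and overrides can be arbitrary objects (for example a DataFrame), which a .json file cannot express.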
- ./example_script.py shows how to pipe out CLI responses as JSON objects, which can be useful in a CI pipeline.
- hamilton.cli.commands: note that this API is experimental and unstable.

To illustrate, let's take the file module_v1.py, which contains a few functions (the details aren't important).
import pandas as pd

from hamilton.function_modifiers import extract_columns


def customers_df(customers_path: str = "customers.csv") -> pd.DataFrame:
    """Load the customer dataset."""
    return pd.read_csv(customers_path)


def orders_df(orders_path: str = "orders.csv") -> pd.DataFrame:
    """Load the orders dataset."""
    return pd.read_csv(orders_path)


@extract_columns("amount", "age", "country")
def customers_orders_df(customers_df: pd.DataFrame, orders_df: pd.DataFrame) -> pd.DataFrame:
    """Combine the customers and orders datasets.
    Setting index to (order_id, customer_id)."""
    _df = pd.merge(customers_df, orders_df, on="customer_id")
    _df = _df.set_index(["order_id", "customer_id"])
    return _df


def orders_per_customer(customers_orders_df: pd.DataFrame) -> pd.Series:
    """Compute the number of orders per customer.
    Outputs series indexed by customer_id."""
    return customers_orders_df.groupby("customer_id").size().rename("orders_per_customer")


def average_order_by_customer(amount: pd.Series) -> pd.Series:
    """Compute the average order amount per customer.
    Outputs series indexed by customer_id."""
    return amount.groupby("customer_id").mean().rename("average_order_by_customer")


def customer_summary_table(
    orders_per_customer: pd.Series, average_order_by_customer: pd.Series
) -> pd.DataFrame:
    """Our customer summary table definition."""
    return pd.concat([orders_per_customer, average_order_by_customer], axis=1)
build
Call build with a single Python module path.
hamilton build module_v1.py
Returns the modules included in the dataflow
{"modules": ["module_v1"]}
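A CI script could consume a response like the one above roughly as follows. This is only a sketch: the helper names parse_response and run_cli are hypothetical, and it assumes that stdout contains nothing but the JSON object (the real CLI may need a flag for that, so check hamilton --help). The example_script.py file in this repo is the reference.

```python
import json
import subprocess
from typing import Any, Dict, List


def parse_response(stdout: str) -> Dict[str, Any]:
    """Parse the JSON object printed by a CLI command."""
    response = json.loads(stdout)
    if not isinstance(response, dict):
        raise ValueError(f"expected a JSON object, got {type(response).__name__}")
    return response


def run_cli(args: List[str]) -> Dict[str, Any]:
    """Run the hamilton CLI and parse its output (requires sf-hamilton[cli]).

    Assumption: stdout contains only the JSON response.
    check=True raises CalledProcessError on a non-zero exit code,
    which is usually what a CI job wants.
    """
    completed = subprocess.run(
        ["hamilton", *args], capture_output=True, text=True, check=True
    )
    return parse_response(completed.stdout)


# Parse the build response shown above without invoking the CLI
response = parse_response('{"modules": ["module_v1"]}')
print(response["modules"])
```

With the CLI installed, run_cli(["build", "module_v1.py"]) would be the corresponding call.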
view
Call view with a single Python module path, and specify the output visualization path.
hamilton view --output ./dag.png module_v1.py
Returns the path of the generated visualization
{"path": "/home/tjean/projects/dagworks/hamilton/examples/cli/dag.png"}
version
Call version with a single Python module path.
hamilton version module_v1.py
Returns the hash of each node's function and the dataflow hash (hashes trimmed for readability).
{
  "dataflow_hash": "13b05...",
  "nodes_hash": {
    "age": "18eb2...",
    "amount": "18eb2...",
    "average_order_by_customer": "671e3...",
    "country": "18eb2...",
    "customer_summary_table": "19905...",
    "customers_df": "34f04...",
    "customers_orders_df": "18eb2...",
    "customers_path": "34f04...",
    "orders_df": "452f4...",
    "orders_path": "452f4...",
    "orders_per_distributor": "278b2..."
  }
}
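The idea of hashing each node's source code and then deriving a dataflow hash from the node hashes can be sketched as below. This is not Hamilton's implementation: the choice of SHA-256, the helper names hash_source and hash_dataflow, and the abbreviated source strings are all assumptions made for illustration.

```python
import hashlib
from typing import Dict


def hash_source(source: str) -> str:
    """Hash the source code of the function that produces a node."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def hash_dataflow(nodes_hash: Dict[str, str]) -> str:
    """Combine node hashes into one hash, independent of dict ordering."""
    digest = hashlib.sha256()
    for name in sorted(nodes_hash):
        digest.update(name.encode("utf-8"))
        digest.update(nodes_hash[name].encode("utf-8"))
    return digest.hexdigest()


# Abbreviated stand-ins for function source code (hypothetical)
merge_source = "def customers_orders_df(customers_df, orders_df): ..."
summary_source = "def customer_summary_table(orders_per_customer, average_order_by_customer): ..."

nodes_hash = {
    "customers_orders_df": hash_source(merge_source),
    "amount": hash_source(merge_source),  # column extracted from the same function
    "customer_summary_table": hash_source(summary_source),
}
dataflow_hash = hash_dataflow(nodes_hash)
print(dataflow_hash[:5] + "...")
```

In this sketch, nodes that come from the same function share a hash, which would be consistent with age, amount, country, and customers_orders_df all showing 18eb2... in the output above.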
diff
Make the following changes to the last three functions of module_v1.py, then call diff with --view to also output a visualization of the diff.
# renamed this function
def orders_per_distributor(customers_orders_df: pd.DataFrame) -> pd.Series:
    """Compute the number of orders per customer.
    Outputs series indexed by customer_id."""
    return customers_orders_df.groupby("customer_id").size().rename("orders_per_distributor")


# added 1 to the return value
def average_order_by_customer(amount: pd.Series) -> pd.Series:
    """Compute the average order amount per customer.
    Outputs series indexed by customer_id."""
    return 1 + (amount.groupby("customer_id").mean().rename("average_order_by_customer"))


# renamed according to `orders_per_distributor`
def customer_summary_table(
    orders_per_distributor: pd.Series, average_order_by_customer: pd.Series
) -> pd.DataFrame:
    """Our customer summary table definition."""
    return pd.concat([orders_per_distributor, average_order_by_customer], axis=1)
hamilton diff --view --output ./diff.png module_v1.py
{
  "edit": [
    "average_order_by_customer",
    "customer_summary_table"
  ],
  "v1_only": ["orders_per_customer"],
  "v2_only": ["orders_per_distributor"]
}
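A response of this shape could be derived by comparing two node-hash mappings, as in the sketch below. The function diff_nodes and the placeholder hashes h0 to h6 are hypothetical, and Hamilton's actual diff logic may differ.

```python
from typing import Dict, List


def diff_nodes(v1: Dict[str, str], v2: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {node name: hash} mappings."""
    common = v1.keys() & v2.keys()
    return {
        # present in both versions, but the hash changed
        "edit": sorted(name for name in common if v1[name] != v2[name]),
        # present in only one of the two versions
        "v1_only": sorted(v1.keys() - v2.keys()),
        "v2_only": sorted(v2.keys() - v1.keys()),
    }


# Placeholder hashes for a subset of the nodes (hypothetical values)
v1 = {
    "customers_orders_df": "h0",
    "orders_per_customer": "h1",
    "average_order_by_customer": "h2",
    "customer_summary_table": "h3",
}
v2 = {
    "customers_orders_df": "h0",
    "orders_per_distributor": "h4",
    "average_order_by_customer": "h5",
    "customer_summary_table": "h6",
}

result = diff_nodes(v1, v2)
print(result)
```

Under this scheme a renamed function shows up as one removed node and one added node, and customer_summary_table counts as edited because renaming its parameter changes its source code.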