| # Decorators |
| |
| While the 1:1 mapping of column -> function implementation is powerful, we've implemented a few decorators to promote |
| business-logic reuse. The decorators we've defined are as follows |
| (source can be found in [function_modifiers](hamilton/function_modifiers.py)): |
| |
| ## @parameterize |
| Expands a single function into n, each of which correspond to a function in which the parameter value is replaced either by: |
| 1. A specified value |
| 2. The value from a specified upstream node. |
| |
| Note that this can take the place of any of the `@parameterize` decorators below. In fact, they delegate to this! |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import parameterize |
| from hamilton.function_modifiers import value, source |
| |
| |
| @parameterize( |
| D_ELECTION_2016_shifted=dict(n_off_date=source('D_ELECTION_2016'), shift_by=value(3)), |
| SOME_OUTPUT_NAME=dict(n_off_date=source('SOME_INPUT_NAME'), shift_by=value(1)), |
| ) |
| def date_shifter(n_off_date: pd.Series, shift_by: int = 1) -> pd.Series: |
| """{one_off_date} shifted by shift_by to create {output_name}""" |
| return n_off_date.shift(shift_by) |
| ``` |
| |
| By choosing `literal` or `upstream`, you can determine the source of your dependency. Note that you can |
| also pass documentation. If you don't, it will use the parameterized docstring. |
| |
| ```python |
| @parameterize( |
| D_ELECTION_2016_shifted=(dict(n_off_date=source('D_ELECTION_2016'), shift_by=value(3)), "D_ELECTION_2016 shifted by 3"), |
| SOME_OUTPUT_NAME=(dict(n_off_date=source('SOME_INPUT_NAME'), shift_by=value(1)),"SOME_INPUT_NAME shifted by 1") |
| ) |
| def date_shifter(n_off_date: pd.Series, shift_by: int=1) -> pd.Series: |
| """{one_off_date} shifted by shift_by to create {output_name}""" |
| return n_off_date.shift(shift_by) |
| ``` |
| |
| |
| |
| ## @parameterize_values (replacing @parametrized) |
| Expands a single function into n, each of which corresponds to a function in which the parameter value is replaced by |
| that *specific value*. |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import parameterize_values |
| import internal_package_with_logic |
| |
| ONE_OFF_DATES = { |
| #output name # doc string # input value to function |
| ('D_ELECTION_2016', 'US Election 2016 Dummy'): '2016-11-12', |
| ('SOME_OUTPUT_NAME', 'Doc string for this thing'): 'value to pass to function', |
| } |
| # parameter matches the name of the argument in the function below |
| @parameterize_values(parameter='one_off_date', assigned_output=ONE_OFF_DATES) |
| def create_one_off_dates(date_index: pd.Series, one_off_date: str) -> pd.Series: |
| """Given a date index, produces a series where a 1 is placed at the date index that would contain that event.""" |
| one_off_dates = internal_package_with_logic.get_business_week(one_off_date) |
| return internal_package_with_logic.bool_to_int(date_index.isin([one_off_dates])) |
| ``` |
| We see here that `parameterized` allows you keep your code DRY by reusing the same function to create multiple |
| distinct outputs. The _parameter_ key word argument has to match one of the arguments in the function. The rest of |
| the arguments are pulled from outside the DAG. The _assigned_output_ key word argument takes in a dictionary of |
| tuple(Output Name, Documentation string) -> value. |
| |
| Note that `@parametrized` is deprecated, and we intend for you to use `@parameterize_vales`. We're consolidating |
| to make the parameterization decorators more consistent! You have plenty of time to migrate, |
| we wont make this a hard change until we have a Hamilton 2.0.0 to release. |
| |
| |
| ## @parameterize_sources (replacing @parameterized_inputs) |
| |
| Expands a single function into _n_, each of which corresponds to a function in which the parameters specified are mapped |
| to the specified inputs. Note this decorator and `@parameterize_values` are quite similar, except that |
| the input here is another DAG node(s), i.e. column/input, rather than a specific scalar/static value. |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import parameterize_sources |
| |
| |
| @parameterize_sources( |
| D_ELECTION_2016_shifted=dict(one_off_date='D_ELECTION_2016'), |
| SOME_OUTPUT_NAME=dict(one_off_date='SOME_INPUT_NAME') |
| ) |
| def date_shifter(one_off_date: pd.Series) -> pd.Series: |
| """{one_off_date} shifted by 1 to create {output_name}""" |
| return one_off_date.shift(1) |
| |
| ``` |
| We see here that `parameterize_sources` allows you to keep your code DRY by reusing the same function to create multiple |
| distinct outputs. The key word arguments passed have to have the following structure: |
| > OUTPUT_NAME = Mapping of function argument to input that should go into it. |
| |
| So in the example, `D_ELECTION_2016_shifted` is an _output_ that will correspond to replacing `one_off_date` with `D_ELECTION_2016`. |
| Then similarly `SOME_OUTPUT_NAME` is an _output_ that will correspond to replacing `one_off_date` with `SOME_INPUT_NAME`. |
| The documentation for both uses the same function doc and will replace values that are templatized with the input |
| parameter names, and the reserved value `output_name`. |
| |
| To help visualize what the above is doing, it is equivalent to writing the following two function definitions: |
| |
| ```python |
| def D_ELECTION_2016_shifted(D_ELECTION_2016: pd.Series) -> pd.Series: |
| """D_ELECTION_2016 shifted by 1 to create D_ELECTION_2016_shifted""" |
| return D_ELECTION_2016.shift(1) |
| |
| def SOME_OUTPUT_NAME(SOME_INPUT_NAME: pd.Series) -> pd.Series: |
| """SOME_INPUT_NAME shifted by 1 to create SOME_OUTPUT_NAME""" |
| return SOME_INPUT_NAME.shift(1) |
| ``` |
| Note that `@parameterized_inputs` is deprecated, and we intend for you to use `@parameterize_sources`. We're consolidating |
| to make the parameterization decorators more consistent! But we will not break your workflow for a long time. |
| |
| *Note*: that the different input variables must all have compatible types with the original decorated input variable. |
| |
| ## Migrating @parameterized* |
| |
| As we've said above, we're planning on deprecating the following: |
| |
| - `@parameterized_inputs` (replaced by `@parameterize_sources`) |
| - `@parametrized` (replaced by `@parameterize_values`, as that's what its really doing) |
| - `@parametrized_input` (deprecated long ago, migrate to `@parameterize_sources` as that is more versatile.) |
| |
| In other words, we're aligning around the following `@parameterize` implementations: |
| |
| - `@parameterize` -- this does everything you want |
| - `@parameterize_values` -- this just changes the values, does not change the input source |
| - `@parameterize_sources`-- this just changes the source of the inputs. We also changed the name from inputs -> sources as it was clearer (values are inputs as well). |
| |
| The only non-drop-in change you'll have to do is for `@parameterized`. We won't update this until `hamilton==2.0.0`, though, |
| so you'll have time to migrate for a while. |
| |
| |
| ## @extract_columns |
| This works on a function that outputs a dataframe, that we want to extract the columns from and make them individually |
| available for consumption. So it expands a single function into _n functions_, each of which take in the output dataframe |
| and output a specific column as named in the `extract_columns` decorator. |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import extract_columns |
| |
| @extract_columns('fiscal_date', 'fiscal_week_name', 'fiscal_month', 'fiscal_quarter', 'fiscal_year') |
| def fiscal_columns(date_index: pd.Series, fiscal_dates: pd.DataFrame) -> pd.DataFrame: |
| """Extracts the fiscal column data. |
| We want to ensure that it has the same spine as date_index. |
| :param fiscal_dates: the input dataframe to extract. |
| :return: |
| """ |
| df = pd.DataFrame({'date_index': date_index}, index=date_index.index) |
| merged = df.join(fiscal_dates, how='inner') |
| return merged |
| ``` |
| Note: if you have a list of columns to extract, then when you call `@extract_columns` you should call it with an |
| asterisk like this: |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import extract_columns |
| |
| @extract_columns(*my_list_of_column_names) |
| def my_func(...) -> pd.DataFrame: |
| """...""" |
| ``` |
| |
| ## @does |
| `@does` is a decorator that allows you to replace the decorated function with the behavior from another |
| function. This allows for easy code-reuse when building repeated logic. You do this by decorating a |
| function with`@does`, which takes in two parameters: |
| 1. `replacing_function` Required -- a function that takes in a "compatible" set of arguments. This means that it |
| will work when passing the corresponding keyword arguments to the decorated function. |
| 2. `**argument_mapping` -- a mapping of arguments from the replacing function to the replacing function. This makes for easy reuse of |
| functions. Confused? See the examples below. |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import does |
| |
| def _sum_series(**series: pd.Series) -> pd.Series: |
| """This function takes any number of inputs and sums them all together.""" |
| return sum(series) |
| |
| @does(_sum_series) |
| def D_XMAS_GC_WEIGHTED_BY_DAY(D_XMAS_GC_WEIGHTED_BY_DAY_1: pd.Series, |
| D_XMAS_GC_WEIGHTED_BY_DAY_2: pd.Series) -> pd.Series: |
| """Adds D_XMAS_GC_WEIGHTED_BY_DAY_1 and D_XMAS_GC_WEIGHTED_BY_DAY_2""" |
| pass |
| ``` |
| |
| In the above example `@does` applies `_sum_series` to the function `D_XMAS_GC_WEIGHTED_BY_DAY`. |
| Note we don't need any parameter replacement as `_sum_series` takes in just `**kwargs`, enabling it |
| to work with any set of parameters (and thus any old function). |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import does |
| |
| import internal_company_logic |
| |
| def _load_data(db: str, table: str) -> pd.DataFrame: |
| """Helper function to load data using your internal company logic""" |
| return internal_company_logic.read_table(db=db, table=table) |
| |
| @does(_load_data, db='marketing_spend_db', table='marketing_spend_table') |
| def marketing_spend_data(marketing_spend_db: str, marketing_spend_table: str) -> pd.Series: |
| """Loads marketing spend data from the database""" |
| pass |
| |
| @does(_load_data, db='client_acquisition_db', table='client_acquisition_table') |
| def client_acquisition_data(client_acquisition_db: str, client_acquisition_table: str) -> pd.Series: |
| """Loads client acquisition data from the database""" |
| pass |
| ``` |
| |
| In the above example, `@does` applies our internal function `_load_data`, which applies custom |
| logic to load a table from a database in the data warehouse. Note that we map the parameters -- in the first example, |
| the value of the parameter `marketing_spend_db` is passed to `db`, and the value of the parameter `marketing_spend_table` |
| is passed to `table`. |
| |
| |
| ## @config.when* |
| |
| `@config.when` allows you to specify different implementations depending on configuration parameters. |
| |
| The following use cases are supported: |
| 1. A column is present for only one value of a config parameter -- in this case, we define a function only once, |
| with a `@config.when` |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import config |
| |
| # signups_parent_before_launch is only present in the kids business line |
| @config.when(business_line='kids') |
| def signups_parent_before_launch(signups_from_existing_womens_tf: pd.Series) -> pd.Series: |
| """TODO: |
| :param signups_from_existing_womens_tf: |
| :return: |
| """ |
| return signups_from_existing_womens_tf |
| ``` |
| 2. A column is implemented differently for different business inputs, e.g. in the case of Stitch Fix gender intent. |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import config, model |
| import internal_package_with_logic |
| |
| # Some 21 day autoship cadence does not exist for kids, so we just return 0s |
| @config.when(gender_intent='kids') |
| def percent_clients_something__kids(date_index: pd.Series) -> pd.Series: |
| return pd.Series(index=date_index.index, data=0.0) |
| |
| # In other business lines, we have a model for it |
| @config.when_not(gender_intent='kids') |
| @model(internal_package_with_logic.GLM, 'some_model_name', output_column='percent_clients_something') |
| def percent_clients_something_model() -> pd.Series: |
| pass |
| ``` |
| Note the following: |
| - The function cannot have the same name in the same file (or python gets unhappy), so we name it with a |
| __ (dunderscore) as a suffix. The dunderscore is removed before it goes into the DAG. |
| - There is currently no `@config.otherwise(...)` decorator, so make sure to have `config.when` specify set of |
| configuration possibilities. |
| Any missing cases will not have that output column (and subsequent downstream nodes may error out if they ask for it). |
| To make this easier, we have a few more `@config` decorators: |
| |
| - `@config.when_not(param=value)` Will be included if the parameter is _not_ equal to the value specified. |
| - `@config.when_in(param=[value1, value2, ...])` Will be included if the parameter is equal to one of the specified |
| values. |
| - `@config.when_not_in(param=[value1, value2, ...])` Will be included if the parameter is not equal to any of the |
| specified values. |
| - `@config` If you're feeling adventurous, you can pass in a lambda function that takes in the entire configuration |
| and resolves to |
| `True` or `False`. You probably don't want to do this. |
| |
| To pass in the right value, you would provide `param`, e.g. `gender_intent`, or `business_line`, as a field in the dictionary passed to instantiate the driver. E.g. |
| ```python |
| config = { |
| "business_line": "kids" |
| } |
| dr = driver.Driver(config, module1, ...) |
| ``` |
| |
| ## @tag and friends |
| |
| ### @tag |
| |
| Allows you to attach metadata to an output(s), i.e. all nodes generated by a function and its decorators (note, this, by default, only applies to "final" nodes -- |
| not any intermediate nodes that are generated...). See "A mental model for decorators" below for how/why this works. |
| |
| ```python |
| A common use of this is to enable marking nodes as part of some data product, or for GDPR/privacy purposes. |
| |
| For instance: |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import tag |
| |
| def intermediate_column() -> pd.Series: |
| pass |
| |
| @tag(data_product='final', pii='true') |
| def final_column(intermediate_column: pd.Series) -> pd.Series: |
| pass |
| ``` |
| |
| `@tag` also allows you to specify a target with the `target_` parameter. See "A mental model for decorators" below for more details. |
| |
| |
| ### @tag_outputs (deprecated) |
| |
| `tag_outputs` enables you to attach metadata to a function that outputs multiple nodes, |
| and give different tag values to different outputs: |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import tag_outputs, extract_columns |
| |
| def intermediate_column() -> pd.Series: |
| pass |
| |
| @tag_outputs( |
| public={'column_a' : 'public'}, |
| private={'column_b' : 'private'}) |
| @extract_columns('column_a', 'column_b') |
| def data_used_in_multiple_ways() -> pd.DataFrame: |
| return load_some_data(...) |
| ``` |
| |
| In this case, the tag `accessibility` would have different values for the two nodes produced by the `data_used_in_multiple_ways` |
| function -- `public` for `column_a` public `private` for `column_b`.. |
| |
| A note on decorator precedence. If using `@tag` together with `@tag_outputs` on a function (you might want to do this because you use `@tag` to |
| "tag" all nodes with a certain set of values, and `@tag_outputs` to "tag" specific outputs with specific values), |
| they will be applied in order up from the function. So if you desire to override |
| `@tag`, for that to work, you would put `tag_outputs` above `tag` (as it would be applied last). |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import tag_outputs, tag, extract_columns |
| |
| def intermediate_column() -> pd.Series: |
| pass |
| |
| @tag_outputs( |
| public={'column_a' : 'public'}, |
| private={'column_b' : 'private', 'common_tag' : 'bar'}) |
| @tag(common_tag="foo") |
| @extract_columns('column_a', 'column_b') |
| def data_used_in_multiple_ways() -> pd.DataFrame: |
| return load_some_data(...) |
| ``` |
| |
| In the case above, `common_tag` would resolve to `foo` for `column_a` and `bar` for `column_b`. |
| Attempting an override in the reverse direction is currently undefined behavior. |
| |
| ### How do I query by tags? |
| Right now, we don't have a specific interface to query by tags, however we do expose them via the driver. |
| Using the `list_available_variables()` capability exposes tags along with their names & types, |
| enabling querying of the available outputs for specific tag matches. |
| E.g. |
| ```python |
| |
| from hamilton import driver |
| dr = driver.Driver(...) # create driver as required |
| all_possible_outputs = dr.list_available_variables() |
| desired_outputs = [o.name for o in all_possible_outputs |
| if 'my_tag_value' == o.tags.get('my_tag_key')] |
| output = dr.execute(desired_outputs) |
| ``` |
| |
| ## @check_output |
| |
| The `@check_output` decorator enables you to add simple data quality checks to your code. |
| |
| For example: |
| |
| ```python |
| import pandas as pd |
| import numpy as np |
| from hamilton.function_modifiers import check_output |
| |
| @check_output( |
| data_type=np.int64, |
| data_in_range=(0,100), |
| ) |
| def some_int_data_between_0_and_100() -> pd.Series: |
| pass |
| ``` |
| |
| The check_output validator takes in arguments that each correspond to one of the default validators. |
| These arguments tell it to add the default validator to the list. The above thus creates |
| two validators, one that checks the datatype of the series, and one that checks whether the data is in a certain range. |
| |
| |
| Note that you can also specify custom decorators using the `@check_output_custom` decorator. |
| `check_output` and `check_output_custom` accept the `_target` parameter, which specifies *which* output to check. |
| |
| See [data_quality](data_quality.md) for more information on available validators and how to build custom ones. |
| |
| ## @subdag |
| |
| The `@subdag` decorator enables you to rerun components of your DAG with varying parameters. Note that this is immensely powerful -- if we |
| draw analogies from Hamilton to standard procedural programming paradigms, we might have the following correspondence: |
| |
| - `config.when` + friends -- `if/else` statements |
| - `parameterize`/`extract_columns` -- `for` loop |
| - `does` -- effectively macros |
| And so on. `@subdag` takes this one step further. |
| - `@subdag` -- subroutine definition |
| E.G. take a certain set of nodes, and run them with specified parameters. |
| |
| Why might you want to use this? Let's take a look at some examples: |
| |
| 1. You have a feature engineering pipeline that you want to run on multiple datasets. If its exactly the same, this is perfect. If not, this |
| works perfectly as well, you just have to utilize different functions in each or the `config.when` + `config` parameter to rerun it. |
| 2. You want to train multiple models in the same DAG that share some logic (in features or training) -- this allows you to reuse and continually add more. |
| 3. You want to combine multiple similar DAGs (e.g. one for each business line) into one so you can build a cross-business line model. |
| |
| This basically bridges the gap between the flexibility of non-declarative pipelining frameworks with the readability/maintainability of declarative ones. |
| |
| Let's take a look at a simplified example (in [examples/](examples/reusing_functions/reusable_subdags.py)). |
| |
| ```python |
| @extract_columns("timestamp", "user_id", "region") # one of "US", "CA" (canada) |
| def website_interactions() -> pd.DataFrame: |
| return ... |
| |
| def interactions_filtered(website_interactions: pd.DataFrame, region: str) -> pd.DataFrame: |
| """Filters interactions by region -- note this will be run differently depending on the region its in""" |
| pass |
| |
| def unique_users(filtered_interactions: pd.DataFrame, grain: str) -> pd.Series: |
| """Gives the number of shares traded by the frequency""" |
| return ... |
| |
| @subdag( |
| unique_users, interactions_filtered, |
| inputs={"grain": value("day")}, |
| config={"region": "US"}, |
| ) |
| def daily_users_US(unique_users: pd.Series) -> pd.Series: |
| """Calculates quarterly data for just US users |
| Note that this just returns the output. You can modify it if you'd like! |
| """ |
| |
| return unique_users |
| |
| |
| @subdag( |
| unique_users, interactions_filtered, |
| inputs={"grain": value("day")}, |
| config={"region": "CA"}, |
| ) |
| def daily_users_CA(unique_users: pd.Series) -> pd.Series: |
| """Calculates quarterly data for just canada users""" |
| return unique_users |
| ``` |
| |
| This example tracks users on a per-value user data. Specifically, we track the following: |
| |
| 1. Daily user data for canada |
| 2. Quarterly user data for the US |
| |
| These each live under a separate namespace -- this exists solely so the two sets of similar nodes can coexist. |
| Note this set is contrived to demonstrate functionality -- it should be easy to imagine how we could add more variations. |
| |
| The inputs to the `subdag` decorator takes in a variety of inputs that determine _which_ possible set of functions will constitute the DAG, and then specifically _what_ the DAG is and _how_ it is connected. |
| - _which_ functions constitute the subdag is specified by the `*args` input, which is a collection of modules/functions. These are used to determine the functions (i.e. nodes) that could end up in the produced subDAG. |
| - _what_ the sub-DAG is, and _how_ it is connected is specified by two parameters. `config` provides configuration to use to generate the subDAG (in this case the region), and `inputs` provides inputs to connect the subdag with the current DAG (similar in spirit to `@parameterize`). |
| |
| |
| Note that, if you wanted to do this functionality without this decorator, you'd have two options: |
| 1. Rewrite every function for each scenario -- this is repetetive and doesn't scale |
| 2. Utilize the `driver` within the functions -- E.G. |
| |
| ```python |
| def daily_users_CA(unique_users: pd.Series) -> pd.Series: |
| """Calculates quarterly data for just canada users""" |
| dr = hamilton.driver.Driver({"region" : "CA"}, unique_users, interactions_filtered) |
| return dr.execute(["unique_users"], inputs={"grain": value("day")})["unique_users"] |
| ``` |
| |
| While this is a clever approach (and you can use it if you want), there are a few drawbacks: |
| |
| 1. You have no visibility into the subdag -- hamilton has no knowledge about what's being run |
| 2. Any execution parameters have to be reproduced for the driver. E.G. running on dask, etc..., is unknown to the driver. |
| 3. DAG compilation is done at runtime -- you would not know any type-errors until the function is running |
| |
| That said, this *is* a good mental model for explaining subdag functionality. Think about |
| it as a driver within a function, but in a way that hamilton can understand and operate. |
| Best of both worlds! |
| |
| ## @parameterize_extract_columns |
| |
| `@parameterize_extract_columns` gives you the power of both `@extract_columns` and `@parameterize` in one decorator. |
| |
| It takes in a list of `Parameterized_Extract` objects, each of which is composed of: |
| 1. A list of columns to extract, and |
| 2. A parameterization that gets used |
| |
| In the following case, we produce four columns, two for each parameterization. |
| |
| ```python |
| import pandas as pd |
| from function_modifiers import parameterize_extract_columns, ParameterizedExtract, source, value |
| @parameterize_extract_columns( |
| ParameterizedExtract( |
| ("outseries1a", "outseries2a"), |
| {"input1": source("inseries1a"), "input2": source("inseries1b"), "input3": value(10)}, |
| ), |
| ParameterizedExtract( |
| ("outseries1b", "outseries2b"), |
| {"input1": source("inseries2a"), "input2": source("inseries2b"), "input3": value(100)}, |
| ), |
| ) |
| def fn(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame: |
| return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1) |
| ``` |
| |
| ## @parameterize_frame |
| |
| `@parameterize_frame` enables you to run parameterize_extract_columns with a dataframe specifying the parameterizations |
| -- allowing for less verbose specification. The above example can be rewritten as: |
| |
| ```python |
| from hamilton.experimental.decorators.parameterize_frame import parameterize_frame |
| df = pd.DataFrame( |
| [ |
| ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 10], |
| ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 100], |
| # ... |
| ], |
| # Have to switch as indices have to be unique |
| columns=[ |
| [ |
| "output1", |
| "output2", |
| "input1", |
| "input2", |
| "input3", |
| ], # configure whether column is source or value and also whether it's input ("source", "value") or output ("out") |
| ["out", "out", "source", "source", "value"], |
| ], |
| ) |
| |
| @parameterize_frame(df) |
| def my_func(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame: |
| return pd.DataFrame( |
| [input1 * input2 * input3, input1 + input2 + input3] |
| ) |
| ``` |
| |
| Note that we have a double-index. Note that this is still in experimental, |
| and has the possibility of being changed; we'd love feedback on this |
| API if you end up using it! |
| |
| |
| ## @model |
| `@model` allows you to abstract a function that is a model. You will need to implement models that make sense for |
| your business case. Reach out if you need examples. |
| |
| Under the hood, they're just DAG nodes whose inputs are determined by a configuration parameter. A model takes in |
| two required parameters: |
| 1. The class it uses to run the model. If external to Stitch Fix you will need to write your own, else internally |
| see the internal docs for this. Basically the class defined determines what the function actually does. |
| 2. The configuration key that determines how the model functions. This is just the name of a configuration parameter |
| that stores the way the model is run. |
| |
| The following is an example usage of `@model`: |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import model |
| import internal_package_with_logic |
| |
| @model(internal_package_with_logic.GLM, 'model_p_cancel_manual_res') |
| # This runs a GLM (Generalized Linear Model) |
| # The associated configuration parameter is 'model_p_cancel_manual_res', |
| # which points to the results of loading the model_p_cancel_manual_res table |
| def prob_cancel_manual_res() -> pd.Series: |
| pass |
| ``` |
| |
| `GLM` here is not part of the hamilton framework, and instead a user defined model. |
| |
| Models (optionally) accept a `output_column` parameter -- this is specifically if the name of the function differs |
| from the output column that it should represent. E.G. if you use the model result as an intermediate object, and manipulate |
| it all later. At Stitch Fix this is necessary because various dependent columns that a model queries |
| (e.g. `MULTIPLIER_...` and `OFFSET_...`) are derived from the model's name. |
| |
| ## A mental model for decorators |
| |
| Decorators fall under a few broad categories: |
| |
| 1. Decorators that decide whether a function should resolves into a node (E.G. config.when()) |
| 2. Decorators that *turn* a function into a set of nodes (E.G. `@does`) |
| 3. Decorators that modify a set of nodes, turning it into another set of nodes (E.G. `@parameterize`, `@extract_columns`, and `@check_output`) |
| |
| Note the distinction between "functions" and "nodes". "functions" are what the user writes, and in the plain case, |
| correspond 1:1 to nodes. Nodes are then generated from functions -- decorators control how that happens. Thus, one |
| function can correspond to 0, 1 or many nodes. |
| |
| The decorators that do this are broken up into subclasses that each have various nuances, one of which is whether or not they allow layering. |
| To layer effectively, we need to know *which* nodes a decorator will modify from the result of another decorators. For the ones |
| that allow this type of layers, they take in a `target_` parameter (with `_` due to the fact that they often have **kwargs) that tells *which* |
| nodes they should modify. |
| |
| There are 4 options for `target_`: |
| |
| 1. None -- this is the default behavior. This only applies to nodes generated by the function that are not |
| depended on by any other nodes generated by the function. |
| 2. `...` (the literal python Ellipsis). This means that *all* nodes generated by the function and its decorators |
| will have the transformation applied to them. |
| 3. A string. This means that only the node with the specified name will have the transformation applied. |
| 4. A list of strings. This means that all nodes with the specified name will have the transformation applied. |
| |
| This is very powerful, and we will likely be applying it to more decorators. Currently `tag` and `check_output` are the only |
| decorators that accept this parameter. |
| |
| To dive into this, let's take a look at the `tag` decorator with the default `target_`: |
| |
| This will only decorate `col1`, `col2`, and `col3`: |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import tag, extract_columns |
| |
| @tag(type="dataframe") |
| @extract_columns("col1", "col2", "col3") |
| def dataframe() -> pd.DataFrame: |
| pass |
| ``` |
| |
| Once we decorate it with `target_="original_dataframe"`, we are now *just* decorating |
| the dataframe that gets extracted from (that is a node after all). |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import tag, extract_columns |
| |
| @tag(type="dataframe", target_='original_dataframe') |
| @extract_columns("col1", "col2", "col3") |
| def dataframe() -> pd.DataFrame: |
| pass |
| ``` |
| |
| Whereas the following would tag all the columns we extract one value: |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import tag, extract_columns |
| |
| @tag(type="extracted_column", target_=["col1", "col2", "col3"]) |
| @extract_columns("col1", "col2", "col3") |
| def dataframe() -> pd.DataFrame: |
| pass |
| ``` |
| |
| The following would tag *everything* |
| |
| ```python |
| import pandas as pd |
| from hamilton.function_modifiers import tag, extract_columns |
| |
| @tag(type="any_node_created", target_=...) |
| @extract_columns("col1", "col2", "col3") |
| def dataframe() -> pd.DataFrame: |
| pass |
| ``` |
| |
| Passing `None` to `target_` is the same as not passing it at all -- this decorates the "final" nodes as we saw in the first example. |