blob: 95a397a477f0162104e0c47bcd2b0beac91d4cde [file] [log] [blame] [view]
# Classic Hamilton Hello World
In this example we show you a custom scikit-learn `Transformer` class. This class should be compliant with [scikit-learn transformers specifications](https://scikit-learn.org/stable/developers/develop.html). This class is meant to be used as part of broader scikit-learn pipelines. Scikit-learn estimators and pipelines allow for stateful objects, which are helpful when applying transformations on train-test splits notably. Also, all pipeline, estimator, and transformer objects should be picklable, enabling reproducible pipelines.
File organization:
* `my_functions_a.py` and `my_functions_b.py` house the logic that we want to compute.
* `run.py` runs the DAG and asserts the properties of the output for basic use cases.
To run things:
```bash
> python run.py
```
# DAG Visualization:
Here is the visualization of the execution that the transformer currently performs if you run `run.py`:
![scikit_transformer](scikit_transformer.png)
# Limitations and TODOs
- The current implementation relies on Hamilton defaults' `base.HamiltonGraphAdapter` and `base.PandasDataFrameResult` which limits the compatibility with other computation engines supported by Hamilton
- The current implementation could be improved for deeper object inspection. A particular challenge is that the Hamilton driver alters the number of columns / features of the input array, and does so by specifying the *output columns*. In contrast, scikit-learn typically reads columns / features of the *input array* and passes it down. It would be worth looking at [the output feature naming convention](https://scikit-learn.org/stable/developers/develop.html#developer-api-for-set-output) and the [`ColumnTransformer` class](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) which aims to fulfill a similar objective to Hamilton.
- The current implementation allows little direct access to the `Hamilton driver`. Currently, the driver accessible via the `.driver_` attribute after calling `.fit()` or `.fit_transform()` which seems to be coherent with typical scikit-learn behavior.
- The current implementation could be slightly modified to have no dependencies on scikit-learn itself. However, having inheriting from `BaseEstimator` and `TransformerMixin` provides some clarity about the class's purpose.