
Feature Engineering in Multiple Contexts

What is feature engineering? It's the process of transforming data for input to a “model”.

To make models better, it's common to perform and try a lot of “transforms”. This is where Hamilton comes in. Hamilton allows you to:

  • write different transformations in a straightforward and formulaic manner
  • keep them managed and versioned with computational lineage (if using something like git)
  • benefit from a great testing and documentation story

which allows you to sanely iterate, maintain, and determine what works best for your modeling domain.
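For a flavor of what this looks like, a Hamilton feature is just a Python function: its name is the output it computes, and its parameter names declare the inputs it depends on. Here's a minimal, hypothetical sketch (the real feature functions live in the scenario directories):

```python
import pandas as pd


# Hypothetical Hamilton-style transforms: the function name is the feature produced,
# and the parameter names are the upstream columns/features it depends on.
def age_mean(age: pd.Series) -> float:
    """Average age over the dataset -- an aggregation feature."""
    return age.mean()


def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Age with the mean subtracted -- a map feature."""
    return age - age_mean
```

A Hamilton Driver then stitches functions like these into a dataflow and computes whatever subset of features you request.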

In this series of examples we'll skip talking about the benefits of Hamilton, and instead focus on how to use it for feature engineering. But first, some context on the challenges you're likely to face with feature engineering in general.

What is hard about feature engineering?

There are certain dimensions that make feature engineering hard:

  1. Code: Organizing and maintaining code for reuse/collaboration/discoverability.
  2. Lineage: Keeping track of what data is being used for what purpose.
  3. Deployment: Offline vs online vs streaming needs.

Code: Organizing and maintaining code for reuse/collaboration/discoverability.

Individuals build features, but teams own them.

Have you ever dreaded taking over someone else's code? This is a common problem with feature engineering!

Why? The code for feature engineering is often spread out across many files (e.g. scripts, notebooks, libraries), created by many individuals, and written in many different ways. This makes it hard to reuse code, collaborate, and discover what code is available, and therefore hard to maintain a clear picture of what is actually being used in “production” and what is not.

Lineage: Keeping track of what data is being used for what purpose

With the growth of data teams, and with data governance & privacy regulations, businesses increasingly need to be able to easily answer what data is being used and for what purpose. A “modeler” is often not the stakeholder who needs this visibility; they just want to build models. Yet these concerns are frequently put on their plate to address, which slows down their ability to build and ship features, and thus models.

Not having lineage or visibility into what data is being used for what purpose can lead to a lot of problems:

  • teams break data assumptions without knowing it, e.g. upstream team stops updating data used downstream.
  • teams are not aware of what data is available to them, e.g. duplication of data & effort.
  • teams have to spend time figuring out what data is being used for what purpose, e.g. to audit models.
  • teams struggle to debug inherited feature workflows, e.g. to fix bugs or add new features.

Deployment: Offline vs online vs streaming needs

This is a big topic. We won't do it justice here, but let's try to give a brief overview of two main problems:

(1) There are a lot of different deployment needs when you get something to production. For example, you might want to:

  • run a batch job to generate features for a model
  • hit a web service that makes predictions in real time and needs features computed on the fly, or retrieved from a cache (e.g. a feature store).
  • run a streaming job to generate features for a model in real time
  • require all three, or some subset, of the above ways of deploying features.

So the challenge is: how do you design your processes to take into account your deployment needs?

(2) Implement features once, twice, or thrice? To enable (1), you need to ask yourself: can we share feature code, or do we need to reimplement it for every system that we want to use it in?

With (1) and (2) in mind, you can see that there are a lot of different dimensions to consider when designing your feature engineering processes. They have to connect with each other, and be flexible enough to support your specific deployment needs.

Using Hamilton for Feature Engineering for Batch/Offline

If you only need to deploy features for batch jobs, then stop right there. You don't need these examples, as they are focused on how to bridge the gap between “offline” and “online” feature engineering. You should instead browse the other examples like data_quality.

Using Hamilton for Feature Engineering for Batch/Offline and Online/Streaming

The example scenarios here are for those who have to deal with both batch and online feature engineering.

We provide two examples for two common scenarios that occur if you have this need. Note: the example code in these scenarios is meant to illustrate how to think about and frame your work with Hamilton. It contains minimal features so as not to overwhelm you, and it leaves out some implementation details that you would need to fill in for your specific use case, e.g. fitting a model using the features, or deciding where to store aggregate feature values.

Scenario Context

A not-too-uncommon task is that you need to do feature engineering in an offline setting (e.g. batch via Airflow), as well as in an online setting (e.g. a synchronous request via FastAPI). What commonly happens is that the feature code is not shared, which leads to two implementations, subtle bugs, and hard-to-maintain code.

With this example series, we show how you can use Hamilton to:

  1. write a feature once. (scenarios 1 and 2)
  2. leverage that feature code anywhere that Python runs, e.g. in batch and online. (scenarios 1 and 2)
  3. show how to modularize components so that if you have values cached in a feature store, you can inject those values into your feature computations. (scenario 2)

The task that we're modeling here isn't that important, but if you must know, we're trying to predict the number of hours of absence that an employee will have, given some information about them. This is based on the data_quality example, which is in turn based on the Metaflow+Hamilton example, where Hamilton was used for the feature engineering process -- in that example only offline feature engineering was modeled.

Assumptions we're using:

  1. You have a fixed set of features, determined a priori to be useful, that you want to compute for a model.
  2. We are agnostic of the actual model -- and skip any details of that in the examples.
  3. We use Pandas as the data structure in our examples here because it's easy to reuse in both a batch and an online context. However, you need not use Pandas if you don't want to.

Let's explain the context of the two scenarios a bit more.

Scenario 1: the simple case - ETL + Online API

In this scenario, we assume we can get the same raw inputs at prediction time as would be provided at training time.

This is a straightforward process if all your feature transforms are map operations. If, however, some of your transforms are aggregations, then you need to be careful about how you connect your offline ETL with your online serving.

In this example, there are two features, age_mean and age_std_dev, that we avoid recomputing in an online setting. Instead, we “store” their values when we compute features in the offline ETL, and then use those “stored” values at prediction time so that the right feature computation happens.
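Concretely, the pattern might look like the following sketch, where a hypothetical features module holds the transform functions and age_zero_mean_unit_variance stands in for a downstream feature; the stored values and how you persist them are placeholders, and the real code lives in scenario_1/:

```python
import pandas as pd
from hamilton import driver

import features  # hypothetical module holding age_mean, age_std_dev, age_zero_mean_unit_variance, ...

dr = driver.Driver({}, features)

# Offline ETL: compute features over the full training set, including the aggregates,
# then persist the aggregate values (e.g. alongside the model artifact).
offline_df = dr.execute(
    ['age_mean', 'age_std_dev', 'age_zero_mean_unit_variance'],
    inputs={'age': pd.Series([31, 44, 27, 52])},
)

# Online API: the exact same feature code runs, but the stored aggregates are injected
# via `overrides`, so they are not recomputed from a single request's worth of data.
stored_aggregates = {'age_mean': 38.5, 'age_std_dev': 11.2}  # loaded from wherever the ETL saved them
request_features = dr.execute(
    ['age_zero_mean_unit_variance'],
    inputs={'age': pd.Series([29])},
    overrides=stored_aggregates,
)
```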

Scenario 2: the more complex case - request doesn't have all the raw data - ETL + Online API

In this scenario, we assume we are not passed the data, but instead need to fetch it ourselves as part of the online API request.

We will pretend to hit a feature store that will provide us with the required data to compute the features for input to the model. This example shows one way to modularize your Hamilton code so that you can swap out the “source” of the data. To simplify the example, we assume that we can get all the input data we need from a feature store, rather than it also coming in via the request.

Note: if you're using a feature store, which is effectively a cache, you might not need Hamilton on the online side, if, and only if, you can get all the data you need from the feature store without performing any computations. In that situation, you would push computed features to the feature store from the offline ETL process that creates them.
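One way to structure this is to keep the shared transforms in one module and put the “where does the raw data come from” functions in swappable modules, then pick the data-source module when constructing the Driver. Here's a rough, hypothetical sketch (the module names, the dict-based fake feature store, and the feature names are all stand-ins; scenario_2/ shows the real layout):

```python
# online_loaders.py (hypothetical module): resolve raw inputs from a feature store / cache.
# An offline_loaders.py counterpart would define the same function names, but read from
# files or warehouse tables instead.
import pandas as pd


def age(client_id: int, feature_store: dict) -> pd.Series:
    """Raw age for a client. The feature store is faked as a dict; swap in your client of choice."""
    return pd.Series([feature_store[('age', client_id)]])
```

```python
# In the online app: the same shared `features` module, a different source of raw data.
from hamilton import driver

import features        # hypothetical module holding the shared transform functions
import online_loaders  # swap for offline_loaders in the batch ETL

dr = driver.Driver({}, features, online_loaders)
row = dr.execute(
    ['age_zero_mean_unit_variance'],
    inputs={'client_id': 42, 'feature_store': {('age', 42): 29}},
    overrides={'age_mean': 38.5, 'age_std_dev': 11.2},  # as in scenario 1
)
```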

A good exercise would be to note the differences between this scenario (2) and scenario (1) in how they structure the code with Hamilton.

What's next?

Jump into each directory and read the README; it'll explain how the example is set up and how things should work.

Extensions and uses not shown here that we know are possible

Here are two ideas that come to mind:

  1. Streaming settings. Given the examples, it should be clear how one could use Hamilton in a streaming setting.
  2. How to ask Hamilton what features are needed as input, so you know what to request from the feature store. With tags, and by querying the DAG at the start of the app, you could dynamically ask Hamilton what's required and then only go to the feature store for that data. If this type of example would be of interest, let us know. A rough sketch of the idea follows below.
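For instance, assuming Hamilton's @tag decorator, the ad_hoc_utils.create_temporary_module helper, and the driver's list_available_variables() method, the lookup could look roughly like this (the source='feature_store' tag is just a made-up convention):

```python
import pandas as pd
from hamilton import ad_hoc_utils, driver
from hamilton.function_modifiers import tag


@tag(source='feature_store')  # made-up convention: mark inputs that live in the feature store
def age(client_id: int) -> pd.Series:
    raise NotImplementedError("fetched from the feature store at request time")


temp_module = ad_hoc_utils.create_temporary_module(age, module_name='tagged_features')
dr = driver.Driver({}, temp_module)

# At app startup: ask the DAG which nodes are tagged as feature-store-backed,
# then only request those keys from the feature store on each request.
needed_from_store = [
    var.name
    for var in dr.list_available_variables()
    if (var.tags or {}).get('source') == 'feature_store'
]
print(needed_from_store)  # ['age']
```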