| # Feature Engineering in Multiple Contexts |
| |
What is feature engineering? It's the process of transforming raw data into inputs for a "model", e.g. turning a raw `age` column into a normalized `age_zero_mean` feature.
| |
To make models better, it's common to create and try out a lot of "transforms". This is where Hamilton comes in.
Hamilton allows you to:
* write different transformations in a straightforward and formulaic manner
* keep them managed and versioned with computational lineage (if using something like git)
* benefit from a great testing and documentation story
| |
| which allows you to sanely iterate, maintain, and determine what works best for your modeling domain. |
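
To make this concrete, here's a minimal sketch of what feature code looks like as Hamilton functions. The
names (`age`, `age_mean`, `age_zero_mean`) are illustrative, borrowed from the `data_quality` example:

```python
# features.py -- each function defines one transform; its parameter names
# declare its dependencies, so Hamilton can assemble the dataflow (a DAG) for you.
import pandas as pd


def age_mean(age: pd.Series) -> float:
    """Average age in the dataset."""
    return age.mean()


def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Age with the dataset mean subtracted."""
    return age - age_mean
```

A `Driver` then stitches these functions together and computes only what you request:

```python
# run.py
import pandas as pd

import features
from hamilton import driver

dr = driver.Driver({}, features)  # {} == no configuration needed for this sketch
df = dr.execute(["age_zero_mean"], inputs={"age": pd.Series([25, 35, 45])})
```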
| |
In this series of examples, we'll skip over the benefits of Hamilton and instead focus on how to use it
for feature engineering. But first, some context on the challenges you're likely to face with feature engineering
in general.
| |
| # What is hard about feature engineering? |
| There are certain dimensions that make feature engineering hard: |
| |
| 1. Code: Organizing and maintaining code for reuse/collaboration/discoverability. |
| 2. Lineage: Keeping track of what data is being used for what purpose. |
| 3. Deployment: Offline vs online vs streaming needs. |
| |
| ## Code: Organizing and maintaining code for reuse/collaboration/discoverability. |
| > Individuals build features, but teams own them. |
| |
| Have you ever dreaded taking over someone else's code? This is a common problem with feature engineering! |
| |
Why? The code for feature engineering is often spread across many files (scripts, notebooks, libraries, etc.),
created by many individuals, and written in many styles. This makes it hard to reuse code, collaborate, and
discover what code is available, and therefore hard to know what is actually being used in "production" and
what is not.
| |
| ## Lineage: Keeping track of what data is being used for what purpose |
As data teams grow, and data governance & privacy regulations multiply, being able to easily answer what data
is being used, and for what purpose, becomes important for the business. The "modeler" is often not the
stakeholder who needs this visibility (they just want to build models), but these concerns frequently land on
their plate, which slows down their ability to build and ship features, and thus models.
| |
| Not having lineage or visibility into what data is being used for what purpose can lead to a lot of problems: |
| - teams break data assumptions without knowing it, e.g. upstream team stops updating data used downstream. |
| - teams are not aware of what data is available to them, e.g. duplication of data & effort. |
| - teams have to spend time figuring out what data is being used for what purpose, e.g. to audit models. |
| - teams struggle to debug inherited feature workflows, e.g. to fix bugs or add new features. |
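
Hamilton can help here because each feature is a named function, and you can attach metadata to it with the
`@tag` decorator. A minimal sketch; the tag keys (`owner`, `pii`, `source`) are conventions we made up for
illustration, not names Hamilton requires:

```python
import pandas as pd
from hamilton.function_modifiers import tag


@tag(owner="data-science", pii="false", source="prod.employee_table")
def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Age with the dataset mean subtracted."""
    return age - age_mean
```

Because the metadata lives on the function itself, answering "what uses PII?" becomes a query over the DAG
rather than an archaeology project.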
| |
| |
| ## Deployment: Offline vs online vs streaming needs |
This is a big topic. We won't do it justice here, but let's give a brief overview of two main problems:
| |
(1) There are a lot of different deployment needs when you get something to production. For example, you might want to:
- run a batch job to generate features for a model
- serve predictions in real time from a webservice, which needs features computed on the fly or retrieved from a cache (e.g. a feature store)
- run a streaming job to generate features for a model in real time
- support all three, or some subset, of the above ways of deploying features.
| |
So the challenge is: how do you design your processes to take into account your deployment needs?
| |
(2) Do you implement features once, twice, or thrice? To enable (1), you need to ask yourself: can we share
feature code, or do we need to reimplement it for every system we want to use it in?
| |
| With (1) and (2) in mind, you can see that there are a lot of different dimensions to consider when designing your |
| feature engineering processes. They have to connect with each other, and be flexible enough to support your specific |
| deployment needs. |
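
With Hamilton, the answer to (2) is that feature definitions live in one plain Python module, and each
deployment target just builds a `Driver` over it. A minimal sketch, reusing the hypothetical `features.py`
module from above:

```python
import features  # one shared module of feature definitions
from hamilton import driver

# Batch: an Airflow task (or any script) builds a driver over the module...
batch_dr = driver.Driver({}, features)

# Online: a webservice process builds a driver over the *same* module.
online_dr = driver.Driver({}, features)
```

The scenarios below show what differs between the two contexts (where inputs come from, and how aggregate
values are handled); the feature code itself does not change.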
| |
| # Using Hamilton for Feature Engineering for Batch/Offline |
| If you fall into **only** needing to deploy features for batch jobs, then stop right there. You don't need these examples, |
| as they are focused on how to bridge the gap between "offline" and "online" feature engineering. You should instead |
| browse the other examples like `data_quality`. |
| |
| # Using Hamilton for Feature Engineering for Batch/Offline and Online/Streaming |
The example scenarios here are for people who have to deal with both batch and online feature engineering.
| |
We provide examples for two common scenarios that occur if you have this need. Note: the example code in these
scenarios aims to illustrate how to think about and structure things with Hamilton. It contains minimal features so as
not to overwhelm you, and leaves out some implementation details that you would need to fill in for your specific use case,
e.g. fitting a model using the features, or where to store aggregate feature values.
| |
| ## Scenario Context |
A fairly common situation is that you need to do feature engineering in an offline setting (e.g. batch via Airflow),
as well as in an online setting (e.g. a synchronous request via FastAPI). What commonly
happens is that the feature code is not shared, resulting in two implementations
that diverge, which leads to subtle bugs and hard-to-maintain code.
| |
| With this example series, we show how you can use Hamilton to: |
| |
| 1. write a feature once. (scenarios 1 and 2) |
| 2. leverage that feature code anywhere that python runs. e.g. in batch and online. (scenarios 1 and 2) |
3. modularize components, so that if you have values cached in a feature store,
you can inject those values into your feature computations. (scenario 2)
| |
The task that we're modeling here isn't that important, but if you must know, we're trying to predict the number of
hours of absence that an employee will have, given some information about them. This is based on the `data_quality`
example, which in turn is based on the [Metaflow+Hamilton example](https://outerbounds.com/blog/developing-scalable-feature-engineering-dags/),
where Hamilton was used for the feature engineering process; in that example only offline feature engineering was modeled.
| |
| Assumptions we're using: |
1. You have a fixed set of features, determined a priori to be useful, that you want to compute for a model.
| 2. We are agnostic of the actual model -- and skip any details of that in the examples. |
3. We use Pandas as the data structure in our examples here because it's easy to reuse in both a batch and an online context. However, you
need not use Pandas if you don't want to.
| |
| Let's explain the context of the two scenarios a bit more. |
| |
| ## Scenario 1: the simple case - ETL + Online API |
In this scenario we assume we can get the same raw inputs at prediction time as would be provided at training time.
| |
| This is a straightforward process if all your feature transforms are [map operations](https://en.wikipedia.org/wiki/Map_(higher-order_function)). |
If, however, some of your transforms are aggregations, then you need to be careful about how you connect your offline
ETL with your online setting.
| |
| In this example, there are two features, `age_mean` and `age_std_dev`, that we avoid recomputing in an online setting. |
| Instead, we "store" the values for them when we compute features in the offline ETL, and then use those "stored" values |
| at prediction time to get the right feature computation to happen. |
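
Here is a hedged sketch of that pattern, using the `overrides` argument to `Driver.execute()`. It assumes
`features.py` also defines `age_std_dev` and `age_zero_mean_unit_variance`, as in the `data_quality` example,
and uses a JSON file as a stand-in for wherever you'd really persist aggregate values:

```python
import json

import pandas as pd

import features
from hamilton import driver

dr = driver.Driver({}, features)

# Offline ETL: compute the aggregates over the full training data and persist them.
train = pd.DataFrame({"age": [25, 35, 45, 55]})  # stand-in for real training data
df = dr.execute(["age_zero_mean", "age_mean", "age_std_dev"], inputs={"age": train["age"]})
with open("aggregates.json", "w") as f:
    json.dump({"age_mean": float(df["age_mean"].iloc[0]),
               "age_std_dev": float(df["age_std_dev"].iloc[0])}, f)

# Online: inject the stored values, so the aggregates are NOT recomputed
# over the single row that comes in with the request.
with open("aggregates.json") as f:
    stored = json.load(f)
row = dr.execute(
    ["age_zero_mean_unit_variance"],
    inputs={"age": pd.Series([39])},
    overrides={"age_mean": stored["age_mean"],
               "age_std_dev": stored["age_std_dev"]},
)
```

With `overrides`, Hamilton skips computing those nodes entirely and uses the provided values instead.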
| |
| ## Scenario 2: the more complex case - request doesn't have all the raw data - ETL + Online API |
In this scenario we assume we are not passed the data, but need to fetch it ourselves as part of the online API request.
| |
| We will pretend to hit a feature store, that will provide us with the required data to compute the features for |
| input to the model. This example shows one way to modularize your Hamilton code so that you can swap out the "source" |
| of the data. To simplify the example, we assume that we can get all the input data we need from a feature store, rather |
than having it also come in via the request. Note: if you're using a feature store, which is effectively a cache, you might not need
Hamilton on the online side, if, and only if, you can get all the data you need from the feature store without needing
to perform any computation. In that situation, you would push computed features to the feature store from your offline
ETL process that creates them.
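
One way to modularize the "source" of data in Hamilton is the `@config.when` decorator, which selects an
implementation of a node based on driver configuration. A hedged sketch; the `mode` key, the `__offline`/`__online`
suffixes (which Hamilton strips, so both functions resolve to the node `age`), and the feature store helper are
all illustrative:

```python
import pandas as pd
from hamilton.function_modifiers import config


def _fetch_from_feature_store(client_id: int, key: str) -> float:
    """Stand-in for a real feature store client call.
    Underscore-prefixed functions are ignored by Hamilton."""
    return 39.0  # hypothetical cached value


@config.when(mode="offline")
def age__offline(raw_df: pd.DataFrame) -> pd.Series:
    """In batch, age comes straight out of the raw dataframe."""
    return raw_df["age"]


@config.when(mode="online")
def age__online(client_id: int) -> pd.Series:
    """Online, age is fetched from the feature store."""
    return pd.Series([_fetch_from_feature_store(client_id, "age")])
```

The driver then picks the right implementation via its config, e.g. `driver.Driver({"mode": "online"}, data_module, features)`.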
| |
A good exercise is to note the differences between scenario (2) and scenario (1) in how they structure
the code with Hamilton.
| |
| # What's next? |
Jump into each directory and read the README; it'll explain how the example is set up and how things should work.
| |
# Extensions/uses not shown here that we know you can do
| Here are two ideas that come to mind: |
| |
1. Streaming settings. Given the examples, it should be clear how to make it possible to use Hamilton in a streaming setting.
2. Asking Hamilton what features are needed as input, so you know what to request from the feature store. With tags, and by
querying the DAG at the start of the app, you could dynamically ask Hamilton what's required and then only go to the
feature store for that data; see the sketch after this list. If this type of example would be of interest, let us know.
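
For idea 2, here's a hedged sketch of what that query could look like, assuming your version of Hamilton exposes
tags on the variables returned by `list_available_variables()`, and assuming a hypothetical `store="feature_store"`
tagging convention on the relevant functions:

```python
import features
from hamilton import driver

dr = driver.Driver({}, features)

# At app start-up, ask the DAG which nodes should come from the feature store.
feature_store_keys = [
    var.name
    for var in dr.list_available_variables()
    if var.tags.get("store") == "feature_store"  # hypothetical tagging convention
]
# Then, per request, fetch only `feature_store_keys` from the feature store.
```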