
Split / Apply / Combine

This example demonstrates how to perform a split-apply-combine transformation using Hamilton.

Many data analysis or processing tasks involve one or more of the following steps:

  • Split: splitting a data set into groups,
  • Apply: applying some functions to each of the groups,
  • Combine: combining the results.

For this example, we want to split a DataFrame into two DataFrames, apply a different transformation pipeline to each of them, and then combine the results into a single DataFrame.

Note: we also add an adapter that strictly type checks the inputs and outputs as the code runs. This isn't required but is here to show how to exercise it.

Example

The example consists of calculating the tax of individuals or families based on their income and the number of children they have.

The following rules apply to income:

  • < 50k: Tax rate is 15 %
  • 50k to 70k: Tax rate is 18 %
  • 70k to 100k: Tax rate is 20 %
  • 100k to 120k: Tax rate is 22 %
  • 120k to 150k: Tax rate is 25 %
  • over 150k: Tax rate is 28 %
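The brackets above can be sketched as a sorted lookup. This is a minimal illustration, not the code in my_functions.py: the names `BRACKET_BOUNDS`, `BRACKET_RATES`, and `tax_rate` are hypothetical, and it assumes that an income sitting exactly on a boundary (for example 50,000) falls into the higher bracket, which the rules do not state.

```python
from bisect import bisect_right

# Lower bound of every bracket above the first one.
# Assumption: a value exactly on a boundary belongs to the higher bracket.
BRACKET_BOUNDS = [50_000, 70_000, 100_000, 120_000, 150_000]
# One rate per bracket, in the same order as the list of rules.
BRACKET_RATES = [0.15, 0.18, 0.20, 0.22, 0.25, 0.28]


def tax_rate(income: float) -> float:
    """Return the tax rate (as a fraction) for the given income."""
    # bisect_right counts how many bounds are <= income,
    # which is exactly the index of the matching bracket.
    return BRACKET_RATES[bisect_right(BRACKET_BOUNDS, income)]
```

For example, `tax_rate(75_600)` returns `0.20`, since 75,600 lies between 70k and 100k.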

The following rules apply to the number of children when the income is under 100k:

  • 0 children: Tax credit 0 %
  • 1 child: Tax credit 2 %
  • 2 children: Tax credit 4 %
  • 3 children: Tax credit 6 %
  • 4 children: Tax credit 8 %
  • over 4 children: Tax credit 10 %
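The credit rules can be sketched the same way. Again, this is only an illustration with hypothetical names (`CREDIT_BY_CHILDREN`, `MAX_CREDIT`, `tax_credit`), and it assumes "under 100k" means strictly less than 100,000. For incomes at or above that limit the sketch returns a credit of 0, which leaves the tax unchanged.

```python
# Credit for 0 to 4 children, indexed by the number of children.
CREDIT_BY_CHILDREN = (0.00, 0.02, 0.04, 0.06, 0.08)
# Credit for more than 4 children.
MAX_CREDIT = 0.10
# Assumption: the credit applies only when income is strictly below this limit.
CREDIT_INCOME_LIMIT = 100_000


def tax_credit(income: float, children: int) -> float:
    """Return the tax credit (as a fraction of the tax) for a household."""
    if income >= CREDIT_INCOME_LIMIT:
        return 0.0  # no credit at or above the limit
    if children > 4:
        return MAX_CREDIT
    return CREDIT_BY_CHILDREN[children]
```

The tax owed is then `income * rate * (1 - credit)`: the credit reduces the tax itself, not the taxable income.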

The following data needs to be processed:

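| Name     | Income | Children |
|----------|--------|----------|
| John     | 75600  | 2        |
| Bob      | 34000  | 1        |
| Chloe    | 111500 | 3        |
| Thomas   | 234546 | 1        |
| Ellis    | 144865 | 2        |
| Deane    | 138500 | 4        |
| Mariella | 69412  | 5        |
| Carlos   | 65535  | 0        |
| Toney    | 43642  | 3        |
| Ramiro   | 117850 | 2        |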
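To make the split, apply, and combine steps concrete, here is a minimal sketch in plain pandas, without Hamilton. It is not the code shipped in my_functions.py or my_script.py: the function names are hypothetical, and it assumes bracket boundaries are lower-inclusive, that the credit applies only to incomes strictly under 100,000, and that the tax is rounded to the nearest whole number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Bob", "Chloe", "Thomas", "Ellis",
             "Deane", "Mariella", "Carlos", "Toney", "Ramiro"],
    "Income": [75600, 34000, 111500, 234546, 144865,
               138500, 69412, 65535, 43642, 117850],
    "Children": [2, 1, 3, 1, 2, 4, 5, 0, 3, 2],
})

# Assumption: brackets include their lower bound and exclude their upper bound.
BOUNDS = [0, 50_000, 70_000, 100_000, 120_000, 150_000, np.inf]
RATES = [0.15, 0.18, 0.20, 0.22, 0.25, 0.28]


def add_tax_rate(part: pd.DataFrame) -> pd.DataFrame:
    """Shared step: look up the tax rate from the income brackets."""
    part = part.copy()
    part["Tax Rate"] = pd.cut(
        part["Income"], bins=BOUNDS, labels=RATES, right=False
    ).astype(float)
    return part


def under_100k_pipeline(part: pd.DataFrame) -> pd.DataFrame:
    """Apply step for incomes under 100k: rate, then credit for children."""
    part = add_tax_rate(part)
    # 2 % per child, capped at 10 % (reached with 5 children or more).
    part["Tax Credit"] = part["Children"].clip(upper=5) * 0.02
    part["Tax"] = part["Income"] * part["Tax Rate"] * (1 - part["Tax Credit"])
    return part


def over_100k_pipeline(part: pd.DataFrame) -> pd.DataFrame:
    """Apply step for incomes of 100k and above: rate only, no credit."""
    part = add_tax_rate(part)
    part["Tax"] = part["Income"] * part["Tax Rate"]
    return part


def compute_tax(data: pd.DataFrame) -> pd.DataFrame:
    # Split: one DataFrame per income group.
    is_under = data["Income"] < 100_000
    under, over = data[is_under], data[~is_under]
    # Apply a different pipeline to each group, then combine.
    combined = pd.concat([under_100k_pipeline(under), over_100k_pipeline(over)])
    combined = combined.sort_index()  # restore the original row order
    combined["Tax"] = combined["Tax"].round().astype(int)
    return combined


result = compute_tax(df)
print(result)
```

Rows from the 100k-and-above group have no "Tax Credit" value after the combine step, so pandas fills that column with NaN for them.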

Running the example

You can run the example with:

# cd examples/pandas/split-apply-combine/
python my_script.py

The expected result is:

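| Name     | Income | Children | Tax Rate | Tax Credit | Tax   | Tax Formula                            |
|----------|--------|----------|----------|------------|-------|----------------------------------------|
| John     | 75600  | 2        | 20 %     | 4 %        | 14515 | (75600 * 0.2) - (75600 * 0.2) * 0.04   |
| Bob      | 34000  | 1        | 15 %     | 2 %        | 4998  | (34000 * 0.15) - (34000 * 0.15) * 0.02 |
| Chloe    | 111500 | 3        | 22 %     |            | 24530 | (111500 * 0.22)                        |
| Thomas   | 234546 | 1        | 28 %     |            | 65673 | (234546 * 0.28)                        |
| Ellis    | 144865 | 2        | 25 %     |            | 36216 | (144865 * 0.25)                        |
| Deane    | 138500 | 4        | 25 %     |            | 34625 | (138500 * 0.25)                        |
| Mariella | 69412  | 5        | 18 %     | 10 %       | 11245 | (69412 * 0.18) - (69412 * 0.18) * 0.1  |
| Carlos   | 65535  | 0        | 18 %     | 0 %        | 11796 | (65535 * 0.18) - (65535 * 0.18) * 0.0  |
| Toney    | 43642  | 3        | 15 %     | 6 %        | 6154  | (43642 * 0.15) - (43642 * 0.15) * 0.06 |
| Ramiro   | 117850 | 2        | 22 %     |            | 25927 | (117850 * 0.22)                        |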

The generated DAG should look like:

my_full_dag.png