
Hamilton on Koalas (Spark 3.2+)

This is a hello-world example showing how you can take some Hamilton functions and run them in a distributed setting on Spark 3.2+ via Koalas (the pandas API on Spark).

Run pip install sf-hamilton[pyspark] or pip install sf-hamilton pyspark[pandas_on_spark] to get the right dependencies to run this example.
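To give a flavor of what those Hamilton functions look like, here is a minimal sketch in the style of business_logic.py. The column names (spend, signups) are illustrative and not necessarily the ones used in this directory; the point is that the functions are plain pandas, which is what lets the same code run over pandas-on-Spark unchanged.

```python
# Illustrative Hamilton-style functions (a sketch, not the actual business_logic.py).
# Each function name defines an output; each parameter name declares a dependency on
# another function's output or on an input column.
import pandas as pd


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend normalized by the number of signups."""
    return spend / signups


def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling three-week average of spend."""
    return spend.rolling(3).mean()
```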

File organization:

  • business_logic.py houses logic that should be invariant to how Hamilton is executed.
  • data_loaders.py houses logic to load data for the business_logic.py module. The idea is that you'd swap this module out for other ways of loading data.
  • run.py is the script that ties everything together (a rough sketch of that wiring is shown below).
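For orientation, here is a rough sketch of how run.py typically wires these modules together with a pandas-on-Spark graph adapter. The import path, adapter arguments, spine column, and requested outputs are assumptions based on Hamilton's Spark/Koalas integration (the adapter has lived under hamilton.experimental.h_spark in older releases and hamilton.plugins.h_spark in newer ones); defer to the actual run.py in this directory.

```python
# Sketch only -- see run.py for the real wiring.
from pyspark.sql import SparkSession

from hamilton import base, driver
from hamilton.experimental import h_spark  # hamilton.plugins.h_spark in newer releases

import business_logic
import data_loaders

spark = SparkSession.builder.getOrCreate()

# The adapter tells Hamilton to execute the functions over pandas-on-Spark (Koalas)
# series and to assemble the requested outputs into a single pandas dataframe.
adapter = h_spark.SparkKoalasGraphAdapter(
    spark_session=spark,
    result_builder=base.PandasDataFrameResult(),
    spine_column="spend",  # placeholder: the column that outputs are joined against
)

dr = driver.Driver({}, business_logic, data_loaders, adapter=adapter)
df = dr.execute(["spend_per_signup", "avg_3wk_spend"])  # placeholder outputs
print(df)
spark.stop()
```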

DAG Visualization:

Here is the visualization of the execution path produced when you run run.py:

pandas_on_spark.png
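If you want to regenerate an image like this yourself, Hamilton's Driver can render the execution path for a set of requested outputs. A minimal sketch, assuming the dr object from the run.py sketch above and a local graphviz install (the exact signature of visualize_execution can vary across Hamilton versions):

```python
# Sketch: render the execution DAG for the requested outputs to an image file.
dr.visualize_execution(
    final_vars=["spend_per_signup", "avg_3wk_spend"],  # placeholder outputs
    output_file_path="./pandas_on_spark.png",
    render_kwargs={},
)
```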