
# Dataproc on GCP via Airflow

This directory contains a small example of using DataSketches from PySpark in an Airflow DAG on Dataproc.

The workflow is intentionally incomplete and is meant to show only a few key fields. The main points are specifying an initialization script to ensure the wheels and jars are placed correctly, and adding the jars so Spark includes them on the classpath. As of Spark 3.5, the user classpath must be loaded first, since Spark bundles an older version of the HLL sketch that uses an incompatible library version.
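
As a rough illustration, the cluster-creation step of such a DAG might look like the following. This is a hedged sketch, not the `partial_workflow.py` from this directory: the project ID, bucket, region, and file paths are placeholders, and the exact operator arguments depend on your `apache-airflow-providers-google` version.

```python
# Hypothetical sketch of the cluster-creation step; names and paths are placeholders.
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

CLUSTER_CONFIG = {
    # Initialization action that installs the datasketches wheels (see install_wheels.sh).
    "initialization_actions": [
        {"executable_file": "gs://my-bucket/scripts/install_wheels.sh"}  # placeholder bucket
    ],
    "software_config": {
        "properties": {
            # Put the datasketches-spark jar on the Spark classpath.
            "spark:spark.jars": "file:///usr/lib/spark/jars/datasketches-spark.jar",
            # Spark 3.5 bundles an older, incompatible HLL sketch, so load user classes first.
            "spark:spark.driver.userClassPathFirst": "true",
            "spark:spark.executor.userClassPathFirst": "true",
        }
    },
}

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",      # placeholder
    region="us-central1",         # placeholder
    cluster_name="datasketches-example",
    cluster_config=CLUSTER_CONFIG,
)
```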

The initialization script, install_wheels.sh, is important here, as it installs the wheels -- both a wheel built for datasketches-spark and the regular datasketches-python package. To simplify workflow management, the example script symlinks the versioned artifacts to generic names, so that the workflow itself can remain unchanged across version updates, although the initialization script, and possibly the real compute task, would need updates.
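
The payoff of the generic-name symlinks shows up at job-submission time, where the DAG can reference stable paths. A minimal, hypothetical submission step (the IDs and GCS paths are again placeholders, and the generic jar name assumes install_wheels.sh created a symlink like the one shown):

```python
# Hypothetical job-submission step; assumes install_wheels.sh symlinked the
# versioned jar to the generic name datasketches-spark.jar.
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

PYSPARK_JOB = {
    "reference": {"project_id": "my-project"},             # placeholder
    "placement": {"cluster_name": "datasketches-example"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/jobs/simple_kll_example.py",  # placeholder
        # Generic symlinked name, so a version bump only touches install_wheels.sh.
        "jar_file_uris": ["file:///usr/lib/spark/jars/datasketches-spark.jar"],
    },
}

run_example = DataprocSubmitJobOperator(
    task_id="run_simple_kll_example",
    project_id="my-project",   # placeholder
    region="us-central1",      # placeholder
    job=PYSPARK_JOB,
)
```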

simple_kll_example.py is a deliberately trivial example: it feeds some points into a KLL sketch and queries it. The example shows how sketches can be used from within PySpark, and how they can then move seamlessly into Python for additional analysis.
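
To give a flavor of that flow without reproducing simple_kll_example.py, here is a hedged sketch using only the datasketches-python API: each Spark partition builds a KLL sketch, the serialized sketches are collected, and the merged result is queried in plain Python.

```python
# Hedged sketch of the PySpark-to-Python flow; not a copy of simple_kll_example.py.
from datasketches import kll_floats_sketch
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kll-sketch-demo").getOrCreate()

def sketch_partition(values):
    # Build one KLL sketch per partition and ship it back serialized.
    sk = kll_floats_sketch(200)  # k=200 is the library's usual default accuracy parameter
    for v in values:
        sk.update(float(v))
    yield sk.serialize()

parts = (
    spark.sparkContext
    .parallelize(range(100000), numSlices=8)
    .mapPartitions(sketch_partition)
    .collect()
)

# Merge the per-partition sketches on the driver...
merged = kll_floats_sketch(200)
for raw in parts:
    merged.merge(kll_floats_sketch.deserialize(raw))

# ...then query the result from regular Python.
print("median estimate:", merged.get_quantile(0.5))
print("rank of 25000:", merged.get_rank(25000.0))

spark.stop()
```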