Parallelism Example

Overview

This is a simple example of dynamically parameterizing sections of the DAG and running them in parallel. It loads data from the Kaggle dataset Airbnb Prices in European Cities.

Take home

This demonstrates two powerful capabilities:

  1. Dynamically generating sets of nodes based on the result of another node
  2. Running these nodes in parallel

Note that this does not do anything particularly complex: the datasets are small and the data processing is quite simple. The computation is only meant to illustrate how you could use these capabilities.
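The two capabilities above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the code in this example's run.py. The function names (list_files, process_file, aggregate) and the in-memory FAKE_FILES data are hypothetical stand-ins chosen for the sketch, and a thread pool is assumed as the parallel executor.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Dict, List

# Hypothetical in-memory stand-in for the data files; not the real dataset.
FAKE_FILES: Dict[str, List[float]] = {
    "city_a.csv": [100.0, 200.0],
    "city_b.csv": [50.0, 150.0, 250.0],
}


def list_files() -> List[str]:
    """Upstream step: its result decides how many processing tasks exist."""
    return sorted(FAKE_FILES)


def process_file(name: str) -> dict:
    """Per-file step: one task is created for each name from list_files()."""
    prices = FAKE_FILES[name]
    return {"file": name, "rows": len(prices), "mean_price": mean(prices)}


def aggregate(results: List[dict]) -> dict:
    """Downstream step: combine the per-file results into one summary."""
    total_rows = sum(r["rows"] for r in results)
    weighted = sum(r["mean_price"] * r["rows"] for r in results)
    return {
        "files": len(results),
        "rows": total_rows,
        "mean_price": weighted / total_rows,
    }


def run() -> dict:
    names = list_files()
    # The number of tasks is not known until list_files() has run.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process_file, names))
    return aggregate(results)


if __name__ == "__main__":
    print(run())
```

The point of the sketch is the shape of the computation: one step produces a list, a second step fans out over that list in parallel, and a third step collects the results.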

Running

First, download the data from Kaggle (you'll need a Kaggle account) and place it inside a directory called data.

You can run the basic analysis in the terminal with:

python run.py

You can also explore the data interactively using the notebook.ipynb notebook.