Parallelism Example

Overview

This is a very simple example of dynamically parameterizing sections of the DAG in parallel. This loads data from the kaggle dataset Airbnb Prices in European Cities, and does the following:

Lists out the cities in the dataset See list_data.py
Loads the files and runs a set of very basic summary statistics. See process_data.py
Aggregates the summary into a single dataframe See aggregate_data.py

Take home

This demonstrates two powerful capabilities:

Dynamically generating sets of nodes based on the result of another node
Running these nodes in parallel

Note that this does not do anything particularly complex -- the dataset/computation is meant to illustrate how you could use these powers. These datasets are small and the data processing is quite simple.

Running

First, download the data and place inside a directory called data. You can download the data from here. You'll need a kaggle account.

You can run the basic analysis in the terminal with:

python run.py

And you can play around with the data using the notebook.ipynb notebook.