 # Parallelism Example

 ## Overview
 This is a simple example of dynamically parameterizing sections of the DAG and running them in parallel.
 It loads data from the Kaggle dataset [Airbnb Prices in European Cities](https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities)
 and does the following:

 - Lists the cities in the dataset. See [list_data.py](list_data.py)
 - Loads the files and computes a set of very basic summary statistics. See [process_data.py](process_data.py)
 - Aggregates the summaries into a single dataframe. See [aggregate_data.py](aggregate_data.py)

 ## Take home

 This demonstrates two powerful capabilities:

 1. Dynamically generating sets of nodes based on the result of another node
 2. Running these nodes in parallel

 Note that this example does nothing particularly complex: the datasets are small and the data processing is quite simple.
 The dataset and computation are only meant to illustrate how you could use these capabilities.
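
To make the two capabilities concrete, here is a minimal plain-Python sketch of the same list / process / aggregate shape. It is an analogy only: it uses the standard library's `concurrent.futures` rather than Hamilton's API, and the names `FAKE_PRICES`, `list_cities`, `summarize_city`, `aggregate`, and `run` are hypothetical and do not come from this example's modules.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical in-memory stand-in for the per-city files (not the real dataset).
FAKE_PRICES = {
    "amsterdam": [100.0, 200.0, 300.0],
    "berlin": [50.0, 150.0],
    "lisbon": [80.0],
}


def list_cities(data):
    """Stage 1: discover the work items at runtime."""
    return sorted(data)


def summarize_city(city, data):
    """Stage 2: compute basic summary statistics for one city."""
    prices = data[city]
    return {"city": city, "count": len(prices), "mean_price": mean(prices)}


def aggregate(summaries):
    """Stage 3: collect the per-city results into one mapping keyed by city."""
    return {s["city"]: s for s in summaries}


def run(data, max_workers=4):
    # The number of parallel branches depends on the output of stage 1.
    cities = list_cities(data)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(lambda c: summarize_city(c, data), cities))
    return aggregate(summaries)


if __name__ == "__main__":
    print(run(FAKE_PRICES))
```

The point of the sketch is that the fan-out width is not known until `list_cities` has run, which is what "dynamically generating sets of nodes" refers to; how Hamilton expresses this is shown in the example's own modules.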

 ## Running

 First, download the data and place it inside a directory called `data`.
 You can download the data from [Kaggle](https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities). You'll need a Kaggle account.

 You can run the basic analysis in the terminal with:

 ```bash
 python run.py
 ```

 You can also explore the data interactively in the `notebook.ipynb` notebook.