Here we show how one can define transformation logic that uses the new data quality feature. We also show how you can manage execution on top of Ray, Dask, and Pandas on Spark.
Note: as of release 1.9.0, Dask data types do not work well with the default validators invoked by @check_output. Using Dask purely for multiprocessing over non-Dask types works fine. If you do need to validate Dask data types, we recommend the pandera integration instead; see examples/data_quality/pandera for this same code rewritten to validate outputs with pandera.
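For a flavor of what that alternative looks like, here is a minimal, hypothetical sketch of validating a function's output by passing a pandera schema to @check_output. The function name and schema below are illustrative, not taken from the example code:

```python
import pandas as pd
import pandera as pa
from hamilton.function_modifiers import check_output

# Illustrative schema: a float series with no nulls.
age_schema = pa.SeriesSchema(float, nullable=False)

@check_output(schema=age_schema)  # pandera performs validation instead of the default validators
def age(raw_df: pd.DataFrame) -> pd.Series:
    """Hypothetical feature: pulls the age column out of the raw dataframe."""
    return raw_df["age"].astype(float)
```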
We want to create some features to use as input for training a model. The task is to predict absenteeism at work. Inspiration comes from here and here.
There are three parts to this task:

1. Loading the data.
2. Transforming the data into features.
3. Running the dataflow to produce a dataframe of features.
These three parts map to the files laid out below -- with part (3) having multiple possible files.
We will be adding @check_output decorators to step (2); a sketch of what that looks like follows. You could also add them to step (1) if you like, but we omit that here and leave it as an exercise for the reader.
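Here is a minimal, hypothetical sketch of a feature function guarded by @check_output. The function name, inputs, and the exact validator kwargs are illustrative (the kwargs map to Hamilton's default validators; see feature_logic.py for the decorators the example actually uses):

```python
import numpy as np
import pandas as pd
from hamilton.function_modifiers import check_output

# Kwarg names (e.g. range) follow the default validators; they may vary by Hamilton version.
@check_output(data_type=np.float64, range=(0.0, 1.0), importance="warn")
def absenteeism_rate(absent_hours: pd.Series, work_hours: pd.Series) -> pd.Series:
    """Hypothetical feature: fraction of scheduled hours missed."""
    return absent_hours / work_hours
```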
The code demonstrates @check_output and also shows how one might support running the same code on Spark, Dask, and Ray when some of it needs to change per execution environment. Each file has some documentation at the top to help identify what it is and how you can use it.
Running the code involves installing the right Python dependencies and then executing the relevant script. It is best practice to create a Python virtual environment for each project/example; we omit showing that step here.
The DAG created for each way of executing is logically the same, though different functions may be executed in the case of Spark and Dask; the @config.when* decorators are used for this purpose, as sketched below.
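Here is a minimal, hypothetical sketch of that pattern (the function and config key names are illustrative). Hamilton strips the `__suffix` from the function name, so both functions define the same node; the config dict passed to the driver selects which implementation runs:

```python
import pandas as pd
from hamilton.function_modifiers import config

@config.when(execution="pandas")
def age_cleaned__pandas(age: pd.Series) -> pd.Series:
    """Pandas-only implementation (illustrative)."""
    return age.fillna(age.mean())

@config.when_in(execution=["dask", "spark"])
def age_cleaned__distributed(age: pd.Series) -> pd.Series:
    """Variant for distributed backends where, say, the fill strategy differs (illustrative)."""
    return age.fillna(0.0)
```

With config `{"execution": "dask"}`, requesting `age_cleaned` runs the second implementation; with `{"execution": "pandas"}`, the first.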
Note: importance is not specified in the @check_output decorators found in feature_logic.py. The default behavior is therefore invoked, which is to log a warning and not stop execution. If you want validation failures to stop execution, add importance="fail" to the decorators; more centralized control is planned for future releases.
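For example (function name illustrative), escalating a validator from warning to failing looks like this:

```python
import numpy as np
import pandas as pd
from hamilton.function_modifiers import check_output

@check_output(data_type=np.float64, importance="fail")  # raise on failure instead of logging a warning
def absenteeism_rate(absent_hours: pd.Series, work_hours: pd.Series) -> pd.Series:
    return absent_hours / work_hours
```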
To run the vanilla pandas version:

```bash
pip install -r requirements.txt
python run.py
```
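Under the hood, run.py builds a Hamilton driver over the feature module and executes it, roughly like this minimal sketch (the config key and output names are illustrative; the real script also wires up data loading):

```python
from hamilton import driver

import feature_logic  # the module holding the @check_output-decorated functions

dr = driver.Driver({"execution": "pandas"}, feature_logic)  # config dict, then modules
# Ask for the feature columns we want; Hamilton walks the DAG and runs validators along the way.
df = dr.execute(["age", "absenteeism_rate"])
print(df.head())
```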
To run on Dask:

```bash
pip install -r requirements-dask.txt
python run_dask.py
```
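run_dask.py differs mainly in how the driver is constructed: it attaches a Dask graph adapter so Hamilton delegates execution to a Dask cluster. A rough, hypothetical sketch (module paths, signatures, and output names may differ by Hamilton version; see run_dask.py for the real wiring):

```python
from dask.distributed import Client

from hamilton import base, driver
from hamilton.experimental import h_dask

import feature_logic

client = Client()  # local Dask cluster
adapter = h_dask.DaskGraphAdapter(client, base.PandasDataFrameResult())
dr = driver.Driver({"execution": "dask"}, feature_logic, adapter=adapter)
df = dr.execute(["age", "absenteeism_rate"])
client.close()
```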
To run on Ray:

```bash
pip install -r requirements-ray.txt
python run_ray.py
```
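run_ray.py follows the same pattern with a Ray graph adapter. Again a rough, hypothetical sketch (module paths, signatures, and output names may differ by version):

```python
import ray

from hamilton import base, driver
from hamilton.experimental import h_ray

import feature_logic

ray.init()
adapter = h_ray.RayGraphAdapter(result_builder=base.PandasDataFrameResult())
dr = driver.Driver({"execution": "ray"}, feature_logic, adapter=adapter)
df = dr.execute(["age", "absenteeism_rate"])
ray.shutdown()
```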
To run on Pandas on Spark:

```bash
pip install -r requirements-spark.txt
python run_spark.py
```
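run_spark.py does the same thing with a Pandas-on-Spark (Koalas) graph adapter. A rough, hypothetical sketch (the spine_column value and other parameters are illustrative; check run_spark.py for the real wiring):

```python
from pyspark.sql import SparkSession

from hamilton import base, driver
from hamilton.experimental import h_spark

import feature_logic

spark = SparkSession.builder.getOrCreate()
adapter = h_spark.SparkKoalasGraphAdapter(
    spark, result_builder=base.PandasDataFrameResult(), spine_column="index"
)
dr = driver.Driver({"execution": "spark"}, feature_logic, adapter=adapter)
df = dr.execute(["age", "absenteeism_rate"])
spark.stop()
```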