| .. ------------------------------------------------------------- |
| .. |
| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| .. |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| .. |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| .. |
| .. ------------------------------------------------------------ |
| |
| Python end-to-end tutorial |
| ========================== |
| |
| The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API. |
| For this, we want to use the `Adult <https://archive.ics.uci.edu/ml/datasets/adult/>`_ dataset and predict whether the income of a person exceeds $50K/yr based on census data. |
| The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K. |
| Most of these features are categorical string values, but the dataset also includes continuous features. |
| We define two levels with an increasing level of detail with regard to the features provided by SystemDS. |
| The first level shows the built-in preprocessing capabilities of SystemDS. |
| The second level shows how we can integrate custom-built networks or algorithms into our Python program. |
| |
| Prerequisite: |
| |
| - :doc:`/getting_started/install` |
| |
| Level 1 |
| ------- |
| |
| This example shows how one can work with the SystemDS framework. |
| More precisely, we will make use of the built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. |
| The dataset used in this tutorial is a preprocessed version of the "UCI Adult Data Set". |
| To skip the explanation, see the full script at the end of this level. |
| |
| We will train a Multinomial Logistic Regression model on the training dataset and subsequently use the test dataset |
| to assess how well our model can predict if the income is above or below $50K/yr based on the features. |
| |
| Step 1: Load and prepare data |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the |
| labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally, we only take |
| a fraction of the training and test sets into account to speed up the execution. |
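| |
| The label shift and the subsampling described above can be sketched in plain Python. This is a minimal illustration on made-up toy values, not the Adult data and not the SystemDS API; the names ``labels``, ``features``, and ``n`` are hypothetical, and the sketch assumes the original labels start at 0. |
| |
```python
# Toy stand-ins for the label column and the feature rows (not the Adult data).
labels = [0, 1, 1, 0, 1, 0]
features = [[25, 1], [38, 0], [52, 1], [19, 0], [44, 1], [31, 0]]

# Shift every label by 1 so that the smallest label becomes 1 instead of 0.
shifted = [y + 1 for y in labels]

# Keep only the first n rows to shorten the run; n is an arbitrary choice here.
n = 4
x_small = features[:n]
y_small = shifted[:n]

print(min(shifted), len(x_small), len(y_small))
```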
| |
| .. include:: ../code/guide/end_to_end/part1.py |
| :code: python |
| :start-line: 20 |
| :end-line: 51 |
| |
| Here the DataManager contains the code for downloading and setting up either Pandas DataFrames or internal SystemDS Frames. |
| For the best performance, and to avoid data transfer from pandas to SystemDS, it is recommended to read directly from disk into SystemDS. |
| |
| Step 2: Training |
| ~~~~~~~~~~~~~~~~ |
| |
| Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our |
| training data. Afterward, we can make predictions on the test data and assess the performance of the model. |
| |
| .. include:: ../code/guide/end_to_end/part1.py |
| :code: python |
| :start-line: 53 |
| :end-line: 54 |
| |
| Note that nothing has been calculated yet. In SystemDS, the calculation is executed only once ``compute()`` is called, |
| for example ``betas_res = betas.compute()``. |
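| |
| To make the idea of deferred execution concrete, here is a toy sketch in plain Python. The class ``LazyNode`` and the helper ``constant`` are hypothetical and only mimic the pattern of building an operation graph first and evaluating it on ``compute()``; they are not how SystemDS implements it. |
| |
```python
class LazyNode:
    """Toy deferred computation; a hypothetical class, not part of SystemDS."""

    def __init__(self, fn, *inputs):
        self.fn = fn            # function applied once compute() is called
        self.inputs = inputs    # other LazyNode objects or plain values
        self.evaluated = False  # flips to True only inside compute()

    def __add__(self, other):
        # Record the addition as a new node instead of evaluating it now.
        return LazyNode(lambda a, b: a + b, self, other)

    def compute(self):
        # Evaluate the inputs first, then apply this node's function.
        args = [i.compute() if isinstance(i, LazyNode) else i
                for i in self.inputs]
        self.evaluated = True
        return self.fn(*args)


def constant(value):
    """Wrap a plain value as a leaf node of the graph."""
    return LazyNode(lambda: value)


a = constant(2)
b = constant(3)
total = a + b                # only builds the graph
was_evaluated_before = total.evaluated
result = total.compute()     # the addition runs here
print(was_evaluated_before, result)
```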
| |
| We can now use the trained model to make predictions on the test data. |
| |
| .. include:: ../code/guide/end_to_end/part1.py |
| :code: python |
| :start-line: 56 |
| :end-line: 57 |
| |
| The multiLogRegPredict function has three return values: |
| |
| - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example. |
| - y_pred, the predictions made using the model. |
| - acc, the accuracy achieved by the model. |
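| |
| As a rough illustration of what an accuracy value expresses, the following plain-Python sketch compares predicted and true labels. The function ``accuracy_percent`` is hypothetical, and reporting the result as a percentage is an assumption made for this sketch rather than a statement about the exact scale returned by multiLogRegPredict. |
| |
```python
def accuracy_percent(y_true, y_pred):
    """Share of predictions that match the true label, as a percentage.

    Hypothetical helper for illustration; the percent scale is an assumption.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * matches / len(y_true)


# Toy labels using the shifted encoding (1 and 2) from Step 1.
y_true = [1, 2, 2, 1, 2]
y_pred = [1, 2, 1, 1, 2]
acc = accuracy_percent(y_true, y_pred)  # 4 of the 5 predictions match
print(acc)
```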
| |
| Step 3: Confusion Matrix |
| ~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding |
| of which classes the model has difficulties separating. |
| The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix |
| for the predictions and the confusion matrix averaged over each true class. |
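| |
| The two results can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the SystemDS implementation: rows are assumed to index the predicted label and columns the true label, labels are assumed to start at 1, and the helper names ``confusion_counts`` and ``average_per_true_class`` are hypothetical. |
| |
```python
def confusion_counts(y_pred, y_true, num_classes):
    """Count (predicted, true) label pairs.

    Assumption for this sketch: rows are predictions, columns are true labels,
    and labels run from 1 to num_classes.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(y_pred, y_true):
        counts[p - 1][t - 1] += 1
    return counts


def average_per_true_class(counts):
    """Divide every column by its sum, guarding against empty classes."""
    k = len(counts)
    col_sums = [max(1, sum(counts[r][c] for r in range(k))) for c in range(k)]
    return [[counts[r][c] / col_sums[c] for c in range(k)] for r in range(k)]


y_true = [1, 2, 2, 1, 2]
y_pred = [1, 2, 1, 1, 2]
counts = confusion_counts(y_pred, y_true, num_classes=2)
averages = average_per_true_class(counts)
print(counts)
print(averages)
```
| |
| In this toy example, the second column of the averaged matrix shows that one third of the samples with true label 2 were predicted as label 1. |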
| |
| .. include:: ../code/guide/end_to_end/part1.py |
| :code: python |
| :start-line: 59 |
| :end-line: 60 |
| |
| Full Script |
| ~~~~~~~~~~~ |
| |
| In the full script, some steps are combined to shorten the overall script. |
| |
| .. include:: ../code/guide/end_to_end/part1.py |
| :code: python |
| :start-line: 20 |
| :end-line: 65 |
| |
| Level 2 |
| ------- |
| |
| In this level, we want to show how we can integrate a custom-built algorithm using the Python API. |
| For this, we will introduce another DML file, which can be used to train a basic feed-forward network. |
| |
| Step 1: Obtain data |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| For the whole data setup, please refer to Level 1, Step 1, as these steps are almost identical. |
| The only difference is that, instead of also preparing the test data, we only prepare the training data. |
| |
| .. include:: ../code/guide/end_to_end/part2.py |
| :code: python |
| :start-line: 20 |
| :end-line: 47 |
| |
| Step 2: Load the algorithm |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| We use a neural network with 2 hidden layers, each consisting of 200 neurons. |
| First, we need to source the dml file for neural networks. |
| This file includes all the necessary functions for training, evaluating, and storing the model. |
| The returned object of the source call is further used for calling the functions. |
| The file can be found here: |
| |
| .. include:: ../code/guide/end_to_end/part2.py |
| :code: python |
| :start-line: 48 |
| :end-line: 51 |
| |
| |
| Step 3: Training the neural network |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Training a neural network in SystemDS using the train function is straightforward. |
| The first two arguments are the training features and the target values we want to fit our model on. |
| Then we need to set the hyperparameters of the model. |
| We choose to train for 1 epoch with a batch size of 16 and a learning rate of 0.01, which are common parameters for neural networks. |
| The seed argument ensures that running the code again yields the same results. |
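| |
| To illustrate how epochs, batch size, and seed interact, here is a generic mini-batch sketch in plain Python. It is not the logic of the sourced train function: the helper ``minibatch_indices`` is hypothetical, shuffling the rows is an assumption of this sketch, and the row count 100 and seed 42 are arbitrary values. |
| |
```python
import random


def minibatch_indices(num_rows, batch_size, epochs, seed):
    """Yield lists of row indices, one list per mini-batch.

    Hypothetical helper: shuffles with a seeded generator so that the same
    seed always produces the same batches.
    """
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(range(num_rows))
        rng.shuffle(order)
        for start in range(0, num_rows, batch_size):
            yield order[start:start + batch_size]


# 100 rows with a batch size of 16 give 7 batches per epoch; the last holds 4 rows.
batches = list(minibatch_indices(num_rows=100, batch_size=16, epochs=1, seed=42))
repeat = list(minibatch_indices(num_rows=100, batch_size=16, epochs=1, seed=42))
print(len(batches), len(batches[-1]), batches == repeat)
```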
| |
| .. include:: ../code/guide/end_to_end/part2.py |
| :code: python |
| :start-line: 52 |
| :end-line: 58 |
| |
| |
| Step 4: Saving the model |
| ~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| For later usage, we can save the trained model. |
| We only need to specify the name of our model and the file path. |
| This call stores the weights and biases of our model. |
| Similarly, the transformation metadata needed to transform input data for the model |
| is saved. |
| |
| .. include:: ../code/guide/end_to_end/part2.py |
| :code: python |
| :start-line: 59 |
| :end-line: 65 |
| |
| Step 5: Predict on Unseen data |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Once the model is saved along with metadata, it is simple to apply it all to |
| unseen data: |
| |
| .. include:: ../code/guide/end_to_end/part2.py |
| :code: python |
| :start-line: 66 |
| :end-line: 77 |
| |
| |
| Full Script NN |
| ~~~~~~~~~~~~~~ |
| |
| The complete script can now be seen here: |
| |
| |
| .. include:: ../code/guide/end_to_end/part2.py |
| :code: python |
| :start-line: 20 |
| :end-line: 80 |