| { |
| "nbformat": 4, |
| "nbformat_minor": 0, |
| "metadata": { |
| "colab": { |
| "name": "SystemDS on Colaboratory.ipynb", |
| "provenance": [], |
| "collapsed_sections": [] |
| }, |
| "kernelspec": { |
| "name": "python3", |
| "display_name": "Python 3" |
| } |
| }, |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "XX60cA7YuZsw", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### Copyright © 2020 The Apache Software Foundation." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "8GEGDZ9GuZGp", |
| "colab_type": "code", |
| "cellView": "form", |
| "colab": {} |
| }, |
| "source": [ |
| "# @title Apache Version 2.0 (The \"License\");\n", |
| "#-------------------------------------------------------------\n", |
| "#\n", |
| "# Licensed to the Apache Software Foundation (ASF) under one\n", |
| "# or more contributor license agreements. See the NOTICE file\n", |
| "# distributed with this work for additional information\n", |
| "# regarding copyright ownership. The ASF licenses this file\n", |
| "# to you under the Apache License, Version 2.0 (the\n", |
| "# \"License\"); you may not use this file except in compliance\n", |
| "# with the License. You may obtain a copy of the License at\n", |
| "#\n", |
| "# http://www.apache.org/licenses/LICENSE-2.0\n", |
| "#\n", |
| "# Unless required by applicable law or agreed to in writing,\n", |
| "# software distributed under the License is distributed on an\n", |
| "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", |
| "# KIND, either express or implied. See the License for the\n", |
| "# specific language governing permissions and limitations\n", |
| "# under the License.\n", |
| "#\n", |
| "#-------------------------------------------------------------" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "_BbCdLjRoy2A", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "### Developer notebook for Apache SystemDS" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "zhdfvxkEq1BX", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "Run this notebook online at [Google Colab ↗](https://colab.research.google.com/github/apache/systemds/blob/master/notebooks/systemds_dev.ipynb).\n", |
| "\n", |
| "\n" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "efFVuggts1hr", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "This Jupyter/Colab-based tutorial walks interactively through the development setup and through running SystemDS in two modes:\n", |
| "\n", |
| "A. standalone mode \\\n", |
| "B. with Apache Spark.\n", |
| "\n", |
| "Flow of the notebook:\n", |
| "1. Download and Install the dependencies\n", |
| "2. Go to section **A** or **B**" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "vBC5JPhkGbIV", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "#### Download and Install the dependencies\n", |
| "\n", |
| "1. **Runtime:** Java (OpenJDK 8 is preferred)\n", |
| "2. **Build:** Apache Maven\n", |
| "3. **Backend:** Apache Spark (optional)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "VkLasseNylPO", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### Setup\n", |
| "\n", |
| "A custom function to run OS commands." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "4Wmf-7jfydVH", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# Run and print a shell command.\n", |
| "def run(command):\n", |
| " print('>> {}'.format(command))\n", |
| " !{command}\n", |
| " print('')" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "kvD4HBMi0ohY", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### Install Java\n", |
| "Let us install OpenJDK 8. More about [OpenJDK ↗](https://openjdk.java.net/install/)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "8Xnb_ePUyQIL", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", |
| "\n", |
| "# run the below command to replace the existing installation\n", |
| "!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java\n", |
| "\n", |
| "import os\n", |
| "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", |
| "\n", |
| "!java -version" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "BhmBWf3u3Q0o", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### Install Apache Maven\n", |
| "\n", |
| "SystemDS uses Apache Maven to build and manage the project. More about [Apache Maven ↗](http://maven.apache.org/).\n", |
| "\n", |
| "Maven builds SystemDS using its project object model (POM) and a set of plugins. You will find `pom.xml` in the root of the codebase!" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "I81zPDcblchL", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# Download the Maven binary distribution.\n", |
| "maven_version = 'apache-maven-3.6.3'\n", |
| "maven_path = f\"/opt/{maven_version}\"\n", |
| "\n", |
| "if not os.path.exists(maven_path):\n", |
| " run(f\"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip\")\n", |
| " run('unzip -q -d /opt apache-maven.zip')\n", |
| " run('rm -f apache-maven.zip')\n", |
| "\n", |
| "# Let's choose the absolute path instead of $PATH environment variable.\n", |
| "def maven(args):\n", |
| " run(f\"{maven_path}/bin/mvn {args}\")\n", |
| "\n", |
| "maven('-v')" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "Xphbe3R43XLw", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### Install Apache Spark (optional, only needed for the Spark backend)\n" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "_WgEa00pTs3w", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "NOTE: Before downloading Spark, make sure the version you are about to download is officially supported at\n", |
| "https://spark.apache.org/downloads.html" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "3zdtkFkLnskx", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# Spark and Hadoop version\n", |
| "spark_version = 'spark-2.4.6'\n", |
| "hadoop_version = 'hadoop2.7'\n", |
| "spark_path = f\"/opt/{spark_version}-bin-{hadoop_version}\"\n", |
| "if not os.path.exists(spark_path):\n", |
| " run(f\"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz\")\n", |
| " run('tar zxf apache-spark.tgz -C /opt')\n", |
| " run('rm -f apache-spark.tgz')\n", |
| "\n", |
| "os.environ[\"SPARK_HOME\"] = spark_path\n", |
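| "# Python does not expand $SPARK_HOME inside a string, so build the path explicitly.\n", |
| "os.environ[\"PATH\"] += os.pathsep + os.path.join(spark_path, \"bin\")\n" |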
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "91pJ5U8k3cjk", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "#### Get Apache SystemDS\n", |
| "\n", |
| "Apache SystemDS development happens on GitHub at [apache/systemds ↗](https://github.com/apache/systemds)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "SaPIprmg3lKE", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "!git clone https://github.com/apache/systemds systemds --depth=1\n", |
| "%cd systemds" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "40Fo9tPUzbWK", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### Build the project" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "s0Iorb0ICgHa", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# Logging flags: -q shows only errors; -X enables debug output; -e prints full error stack traces\n", |
| "# Option 1: Build only the java codebase\n", |
| "maven('clean package -q')\n", |
| "\n", |
| "# Option 2: For building along with python distribution\n", |
| "# maven('clean package -P distribution')" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "SUGac5w9ZRBQ", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "### A. Working with SystemDS in **standalone** mode\n", |
| "\n", |
| "NOTE: Let's pay attention to *directories* and *relative paths*. :)\n", |
| "\n" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "g5Nk2Bb4UU2O", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### 1. Set SystemDS environment variables\n", |
| "\n", |
| "These are useful for the `./bin/systemds` script." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "2ZnSzkq8UT32", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
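| "# `!export` runs in a throwaway subshell, so set the variables through os.environ to make them persist.\n", |
| "import os\n", |
| "os.environ[\"SYSTEMDS_ROOT\"] = os.getcwd()\n", |
| "os.environ[\"PATH\"] = os.path.join(os.environ[\"SYSTEMDS_ROOT\"], \"bin\") + os.pathsep + os.environ[\"PATH\"]" |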
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "zyLmFCv6ZYk5", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### 2. Download Haberman data\n", |
| "\n", |
| "Data source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival\n", |
| "\n", |
| "About: The survival of patients who had undergone surgery for breast cancer.\n", |
| "\n", |
| "Data Attributes:\n", |
| "1. Age of patient at time of operation (numerical)\n", |
| "2. Patient's year of operation (year - 1900, numerical)\n", |
| "3. Number of positive axillary nodes detected (numerical)\n", |
| "4. Survival status (class attribute)\n", |
| " - 1 = the patient survived 5 years or longer\n", |
| " - 2 = the patient died within 5 years" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "ZrQFBQehV8SF", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "!mkdir -p ../data" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "E1ZFCTFmXFY_", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "!wget -P ../data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "FTo8Py_vOGpX", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# Display first 10 lines of the dataset\n", |
| "# Notice that the text is plain CSV with no header row!\n", |
| "!sed -n 1,10p ../data/haberman.data" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "Oy2kgVdkaeWK", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### 2.1 Set `metadata` for the data\n", |
| "\n", |
| "The data file carries no information about its size or value types. A `metadata` file\n", |
| "supplies the size and format of the matrix data; it is stored as an `.mtd` file with the\n", |
| "same name and location as the `.data` file." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "vfypIgJWXT6K", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# generate metadata file for the dataset\n", |
| "!echo '{\"rows\": 306, \"cols\": 4, \"format\": \"csv\"}' > ../data/haberman.data.mtd\n", |
| "\n", |
| "# generate type description for the data\n", |
| "!echo '1,1,1,2' > ../data/types.csv\n", |
| "!echo '{\"rows\": 1, \"cols\": 4, \"format\": \"csv\"}' > ../data/types.csv.mtd" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "7Vis3V31bA53", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### 3. Find the algorithm to run with `systemds`" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "L_0KosFhbhun", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# Inspect the directory structure of systemds code base\n", |
| "!ls" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "R7C5DVM7YfTb", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# List all the scripts (also called top level algorithms!)\n", |
| "!ls scripts/algorithms" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "5PrxwviWJhNd", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# Let's choose the univariate statistics script.\n", |
| "# Print the algorithm documentation, which spans\n", |
| "# lines 22 to 35 of the script.\n", |
| "!sed -n 22,35p ./scripts/algorithms/Univar-Stats.dml" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "zv_7wRPFSeuJ", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "IqY_ARNnavrC", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### 3.1 Let us inspect the output data" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "k-_eQg9TauPi", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# output first 10 lines only.\n", |
| "!sed -n 1,10p ../data/univarOut.mtx" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "o5VCCweiDMjf", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "### B. Run SystemDS with Apache Spark" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "6gJhL7lc1vf7", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "#### Playground for DML scripts\n", |
| "\n", |
| "DML - A custom language designed for SystemDS with R-like syntax." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "zzqeSor__U6M", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### A test `dml` script to prototype algorithms\n", |
| "\n", |
| "Modify the code in the cell below and run it to develop data science tasks\n", |
| "in a high-level language." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "t59rTyNbOF5b", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "%%writefile ../test.dml\n", |
| "\n", |
| "# This code acts as a playground for DML code\n", |
| "X = rand (rows = 20, cols = 10)\n", |
| "y = X %*% rand(rows = ncol(X), cols = 1)\n", |
| "lm(X = X, y = y)" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "VDfeuJYE1JfK", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "Submit the `dml` script to Spark with `spark-submit`.\n", |
| "More about [Spark Submit ↗](https://spark.apache.org/docs/latest/submitting-applications.html)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "YokktyNE1Cig", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "!$SPARK_HOME/bin/spark-submit \\\n", |
| " ./target/SystemDS.jar -f ../test.dml" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "id": "gCMkudo_-8_8", |
| "colab_type": "text" |
| }, |
| "source": [ |
| "##### Run a binary classification example with sample data\n", |
| "\n", |
| "Note that this example is written entirely in DML; no other scripts are used." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "metadata": { |
| "id": "OSLq2cZb_SUl", |
| "colab_type": "code", |
| "colab": {} |
| }, |
| "source": [ |
| "# Example binary classification task with sample data.\n", |
| "# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml" |
| ], |
| "execution_count": null, |
| "outputs": [] |
| } |
| ] |
| } |