 {
   "nbformat": 4,
   "nbformat_minor": 0,
   "metadata": {
     "colab": {
       "name": "SystemDS on Colaboratory.ipynb",
       "provenance": [],
       "collapsed_sections": []
     },
     "kernelspec": {
       "name": "python3",
       "display_name": "Python 3"
     }
   },
   "cells": [
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "XX60cA7YuZsw",
         "colab_type": "text"
       },
       "source": [
         "##### Copyright &copy; 2020 The Apache Software Foundation."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "8GEGDZ9GuZGp",
         "colab_type": "code",
         "cellView": "form",
         "colab": {}
       },
       "source": [
         "# @title Apache Version 2.0 (The \"License\");\n",
         "#-------------------------------------------------------------\n",
         "#\n",
         "# Licensed to the Apache Software Foundation (ASF) under one\n",
         "# or more contributor license agreements.  See the NOTICE file\n",
         "# distributed with this work for additional information\n",
         "# regarding copyright ownership.  The ASF licenses this file\n",
         "# to you under the Apache License, Version 2.0 (the\n",
         "# \"License\"); you may not use this file except in compliance\n",
         "# with the License.  You may obtain a copy of the License at\n",
         "#\n",
         "#   http://www.apache.org/licenses/LICENSE-2.0\n",
         "#\n",
         "# Unless required by applicable law or agreed to in writing,\n",
         "# software distributed under the License is distributed on an\n",
         "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
         "# KIND, either express or implied.  See the License for the\n",
         "# specific language governing permissions and limitations\n",
         "# under the License.\n",
         "#\n",
         "#-------------------------------------------------------------"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "_BbCdLjRoy2A",
         "colab_type": "text"
       },
       "source": [
         "### Developer notebook for Apache SystemDS"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "zhdfvxkEq1BX",
         "colab_type": "text"
       },
       "source": [
         "Run this notebook online at [Google Colab ↗](https://colab.research.google.com/github/apache/systemds/blob/master/notebooks/systemds_dev.ipynb).\n",
         "\n",
         "\n"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "efFVuggts1hr",
         "colab_type": "text"
       },
       "source": [
         "This Jupyter/Colab-based tutorial walks interactively through the development setup and through running SystemDS in two ways:\n",
         "\n",
         "A. in standalone mode \\\n",
         "B. with Apache Spark.\n",
         "\n",
         "Flow of the notebook:\n",
         "1. Download and Install the dependencies\n",
         "2. Go to section **A** or **B**"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "vBC5JPhkGbIV",
         "colab_type": "text"
       },
       "source": [
         "#### Download and Install the dependencies\n",
         "\n",
         "1. **Runtime:** Java (OpenJDK 8 is preferred)\n",
         "2. **Build:** Apache Maven\n",
         "3. **Backend:** Apache Spark (optional)"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "VkLasseNylPO",
         "colab_type": "text"
       },
       "source": [
         "##### Setup\n",
         "\n",
         "A custom function to run OS commands."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "4Wmf-7jfydVH",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# Run and print a shell command.\n",
         "def run(command):\n",
         "  print('>> {}'.format(command))\n",
         "  !{command}\n",
         "  print('')"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "kvD4HBMi0ohY",
         "colab_type": "text"
       },
       "source": [
         "##### Install Java\n",
         "Let us install OpenJDK 8. More about [OpenJDK ↗](https://openjdk.java.net/install/)."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "8Xnb_ePUyQIL",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
         "\n",
         "# Make OpenJDK 8 the default `java`, replacing any existing default\n",
         "!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java\n",
         "\n",
         "import os\n",
         "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
         "\n",
         "!java -version"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "BhmBWf3u3Q0o",
         "colab_type": "text"
       },
       "source": [
         "##### Install Apache Maven\n",
         "\n",
         "SystemDS uses Apache Maven to build and manage the project. More about [Apache Maven ↗](http://maven.apache.org/).\n",
         "\n",
         "Maven builds SystemDS using its project object model (POM) and a set of plugins. You will find `pom.xml` at the root of the codebase."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "I81zPDcblchL",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# Download the Maven binary distribution.\n",
         "maven_version = 'apache-maven-3.6.3'\n",
         "maven_path = f\"/opt/{maven_version}\"\n",
         "\n",
         "if not os.path.exists(maven_path):\n",
         "  run(f\"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip\")\n",
         "  run('unzip -q -d /opt apache-maven.zip')\n",
         "  run('rm -f apache-maven.zip')\n",
         "\n",
         "# Call Maven by its absolute path instead of relying on the PATH environment variable.\n",
         "def maven(args):\n",
         "  run(f\"{maven_path}/bin/mvn {args}\")\n",
         "\n",
         "maven('-v')"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "Xphbe3R43XLw",
         "colab_type": "text"
       },
       "source": [
         "##### Install Apache Spark (optional, only needed for the Spark backend)\n"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "_WgEa00pTs3w",
         "colab_type": "text"
       },
       "source": [
         "NOTE: If the Spark download fails, check that the version requested below is still listed at\n",
         "https://spark.apache.org/downloads.html"
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "3zdtkFkLnskx",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# Spark and Hadoop version\n",
         "spark_version = 'spark-2.4.6'\n",
         "hadoop_version = 'hadoop2.7'\n",
         "spark_path = f\"/opt/{spark_version}-bin-{hadoop_version}\"\n",
         "if not os.path.exists(spark_path):\n",
         "  run(f\"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz\")\n",
         "  run('tar zxf apache-spark.tgz -C /opt')\n",
         "  run('rm -f apache-spark.tgz')\n",
         "\n",
         "os.environ[\"SPARK_HOME\"] = spark_path\n",
         "# os.environ does not expand shell variables, so append the resolved path.\n",
         "os.environ[\"PATH\"] += f\":{spark_path}/bin\"\n"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "91pJ5U8k3cjk",
         "colab_type": "text"
       },
       "source": [
         "#### Get Apache SystemDS\n",
         "\n",
         "Apache SystemDS development happens on GitHub at [apache/systemds ↗](https://github.com/apache/systemds)"
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "SaPIprmg3lKE",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "!git clone https://github.com/apache/systemds systemds --depth=1\n",
         "%cd systemds"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "40Fo9tPUzbWK",
         "colab_type": "text"
       },
       "source": [
         "##### Build the project"
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "s0Iorb0ICgHa",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# Maven logging flags: -q prints only errors; -X enables debug output; -e prints stack traces for errors\n",
         "# Option 1: Build only the java codebase\n",
         "maven('clean package -q')\n",
         "\n",
         "# Option 2: For building along with python distribution\n",
         "# maven('clean package -P distribution')"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "SUGac5w9ZRBQ",
         "colab_type": "text"
       },
       "source": [
         "### A. Working with SystemDS in **standalone** mode\n",
         "\n",
         "NOTE: Let's pay attention to *directories* and *relative paths*. :)\n",
         "\n"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "g5Nk2Bb4UU2O",
         "colab_type": "text"
       },
       "source": [
         "##### 1. Set SystemDS environment variables\n",
         "\n",
         "These are useful for the `./bin/systemds` script."
       ]
     },
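     {
       "cell_type": "code",
       "metadata": {
         "id": "2ZnSzkq8UT32",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# `!export` only affects a throwaway subshell; set the variables on the kernel process instead.\n",
         "import os\n",
         "os.environ[\"SYSTEMDS_ROOT\"] = os.getcwd()\n",
         "os.environ[\"PATH\"] = os.environ[\"SYSTEMDS_ROOT\"] + \"/bin:\" + os.environ[\"PATH\"]"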
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "zyLmFCv6ZYk5",
         "colab_type": "text"
       },
       "source": [
         "##### 2. Download Haberman data\n",
         "\n",
         "Data source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival\n",
         "\n",
         "About: The survival of patients who had undergone surgery for breast cancer.\n",
         "\n",
         "Data Attributes:\n",
         "1. Age of patient at time of operation (numerical)\n",
         "2. Patient's year of operation (year - 1900, numerical)\n",
         "3. Number of positive axillary nodes detected (numerical)\n",
         "4. Survival status (class attribute)\n",
         "    - 1 = the patient survived 5 years or longer\n",
         "    - 2 = the patient died within 5 years"
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "ZrQFBQehV8SF",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "!mkdir -p ../data"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "E1ZFCTFmXFY_",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "!wget -P ../data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "FTo8Py_vOGpX",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# Display first 10 lines of the dataset\n",
         "# Notice that the text is plain CSV with no header row!\n",
         "!sed -n 1,10p ../data/haberman.data"
       ],
       "execution_count": null,
       "outputs": []
     },
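     {
       "cell_type": "markdown",
       "metadata": {
         "colab_type": "text"
       },
       "source": [
         "The four attributes listed above correspond to the four CSV columns, in order.\n",
         "As a minimal sketch, assuming every field is an integer and the file was downloaded\n",
         "to `../data/haberman.data`, the records could be loaded in plain Python to check the\n",
         "row count and the class balance. The column labels below are illustrative names\n",
         "chosen for this sketch; they do not appear in the dataset."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "import csv\n",
         "from collections import Counter\n",
         "\n",
         "# Hypothetical labels for the four attributes; the file has no header row.\n",
         "columns = ['age', 'op_year', 'pos_nodes', 'survival']\n",
         "\n",
         "# Parse each non-empty CSV row into a dict of integers.\n",
         "with open('../data/haberman.data', newline='') as f:\n",
         "  records = [dict(zip(columns, map(int, row))) for row in csv.reader(f) if row]\n",
         "\n",
         "# Report the number of records and how many fall into each survival class.\n",
         "print(len(records), 'records')\n",
         "print(Counter(r['survival'] for r in records))"
       ],
       "execution_count": null,
       "outputs": []
     },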
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "Oy2kgVdkaeWK",
         "colab_type": "text"
       },
       "source": [
         "##### 2.1 Set `metadata` for the data\n",
         "\n",
         "The data file itself carries no information about its dimensions, format, or value types.\n",
         "So we describe the size and format of the matrix in a `metadata` file: an `.mtd` file with\n",
         "the same name and location as the `.data` file."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "vfypIgJWXT6K",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# generate metadata file for the dataset\n",
         "!echo '{\"rows\": 306, \"cols\": 4, \"format\": \"csv\"}' > ../data/haberman.data.mtd\n",
         "\n",
         "# generate type description for the data\n",
         "!echo '1,1,1,2' > ../data/types.csv\n",
         "!echo '{\"rows\": 1, \"cols\": 4, \"format\": \"csv\"}' > ../data/types.csv.mtd"
       ],
       "execution_count": null,
       "outputs": []
     },
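     {
       "cell_type": "markdown",
       "metadata": {
         "colab_type": "text"
       },
       "source": [
         "The row and column counts in the `.mtd` file above are hard-coded. As a minimal sketch,\n",
         "assuming the CSV has no header row and all rows have the same number of fields, the same\n",
         "metadata could instead be derived from the data file. The helper name `write_mtd` is\n",
         "hypothetical and not part of SystemDS; it writes only the three keys used above."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "import csv\n",
         "import json\n",
         "\n",
         "def write_mtd(data_path):\n",
         "  # Count the rows and columns of a header-less CSV file, skipping blank lines.\n",
         "  with open(data_path, newline='') as f:\n",
         "    rows = [row for row in csv.reader(f) if row]\n",
         "  meta = {'rows': len(rows), 'cols': len(rows[0]) if rows else 0, 'format': 'csv'}\n",
         "  # Write the metadata next to the data file, using the same name plus '.mtd'.\n",
         "  with open(data_path + '.mtd', 'w') as f:\n",
         "    json.dump(meta, f)\n",
         "  return meta\n",
         "\n",
         "print(write_mtd('../data/haberman.data'))"
       ],
       "execution_count": null,
       "outputs": []
     },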
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "7Vis3V31bA53",
         "colab_type": "text"
       },
       "source": [
         "##### 3. Find the algorithm to run with `systemds`"
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "L_0KosFhbhun",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# Inspect the directory structure of systemds code base\n",
         "!ls"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "R7C5DVM7YfTb",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# List all the scripts (also called top level algorithms!)\n",
         "!ls scripts/algorithms"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "5PrxwviWJhNd",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# Let's choose the univariate statistics script.\n",
         "# Print its documentation header, which covers\n",
         "# lines 22 to 35 of the script.\n",
         "!sed -n 22,35p ./scripts/algorithms/Univar-Stats.dml"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "zv_7wRPFSeuJ",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "IqY_ARNnavrC",
         "colab_type": "text"
       },
       "source": [
         "##### 3.1 Let us inspect the output data"
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "k-_eQg9TauPi",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# output first 10 lines only.\n",
         "!sed -n 1,10p ../data/univarOut.mtx"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "o5VCCweiDMjf",
         "colab_type": "text"
       },
       "source": [
         "### B. Run SystemDS with Apache Spark"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "6gJhL7lc1vf7",
         "colab_type": "text"
       },
       "source": [
         "#### Playground for DML scripts\n",
         "\n",
         "DML - A custom language designed for SystemDS with R-like syntax."
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "zzqeSor__U6M",
         "colab_type": "text"
       },
       "source": [
         "##### A test `dml` script to prototype algorithms\n",
         "\n",
         "Modify the code in the cell below and run it to develop data science tasks\n",
         "in a high-level language."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "t59rTyNbOF5b",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "%%writefile ../test.dml\n",
         "\n",
         "# This code acts as a playground for DML code\n",
         "X = rand (rows = 20, cols = 10)\n",
         "y = X %*% rand(rows = ncol(X), cols = 1)\n",
         "lm(X = X, y = y)"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "VDfeuJYE1JfK",
         "colab_type": "text"
       },
       "source": [
         "Submit the `dml` script to Spark with `spark-submit`.\n",
         "More about [Spark Submit ↗](https://spark.apache.org/docs/latest/submitting-applications.html)"
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "YokktyNE1Cig",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "!$SPARK_HOME/bin/spark-submit \\\n",
         "    ./target/SystemDS.jar -f ../test.dml"
       ],
       "execution_count": null,
       "outputs": []
     },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "gCMkudo_-8_8",
         "colab_type": "text"
       },
       "source": [
         "##### Run a binary classification example with sample data\n",
         "\n",
         "Note that this example is written entirely in plain DML; no other scripts are involved."
       ]
     },
     {
       "cell_type": "code",
       "metadata": {
         "id": "OSLq2cZb_SUl",
         "colab_type": "code",
         "colab": {}
       },
       "source": [
         "# Example binary classification task with sample data.\n",
         "# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml"
       ],
       "execution_count": null,
       "outputs": []
     }
   ]
 }