blob: dd9d7064784cf21b97e9c767071620fc6ee1ed4a [file] [log] [blame]
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "SystemDS on Colaboratory.ipynb",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "XX60cA7YuZsw",
"colab_type": "text"
},
"source": [
"##### Copyright © 2020 The Apache Software Foundation."
]
},
{
"cell_type": "code",
"metadata": {
"id": "8GEGDZ9GuZGp",
"colab_type": "code",
"cellView": "form",
"colab": {}
},
"source": [
"# @title Apache Version 2.0 (The \"License\");\n",
"#-------------------------------------------------------------\n",
"#\n",
"# Licensed to the Apache Software Foundation (ASF) under one\n",
"# or more contributor license agreements. See the NOTICE file\n",
"# distributed with this work for additional information\n",
"# regarding copyright ownership. The ASF licenses this file\n",
"# to you under the Apache License, Version 2.0 (the\n",
"# \"License\"); you may not use this file except in compliance\n",
"# with the License. You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing,\n",
"# software distributed under the License is distributed on an\n",
"# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
"# KIND, either express or implied. See the License for the\n",
"# specific language governing permissions and limitations\n",
"# under the License.\n",
"#\n",
"#-------------------------------------------------------------"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "_BbCdLjRoy2A",
"colab_type": "text"
},
"source": [
"### Developer notebook for Apache SystemDS"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zhdfvxkEq1BX",
"colab_type": "text"
},
"source": [
"Run this notebook online at [Google Colab ↗](https://colab.research.google.com/github/apache/systemds/blob/master/notebooks/systemds_dev.ipynb).\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "efFVuggts1hr",
"colab_type": "text"
},
"source": [
"This Jupyter/Colab-based tutorial will interactively walk through development setup and running SystemDS in both the\n",
"\n",
"A. standalone mode \\\n",
"B. with Apache Spark.\n",
"\n",
"Flow of the notebook:\n",
"1. Download and Install the dependencies\n",
"2. Go to section **A** or **B**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vBC5JPhkGbIV",
"colab_type": "text"
},
"source": [
"#### Download and Install the dependencies\n",
"\n",
"1. **Runtime:** Java (OpenJDK 8 is preferred)\n",
"2. **Build:** Apache Maven\n",
"3. **Backend:** Apache Spark (optional)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VkLasseNylPO",
"colab_type": "text"
},
"source": [
"##### Setup\n",
"\n",
"A custom function to run OS commands."
]
},
{
"cell_type": "code",
"metadata": {
"id": "4Wmf-7jfydVH",
"colab_type": "code",
"colab": {}
},
"source": [
"# Run and print a shell command.\n",
"def run(command):\n",
" print('>> {}'.format(command))\n",
" !{command}\n",
" print('')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "kvD4HBMi0ohY",
"colab_type": "text"
},
"source": [
"##### Install Java\n",
"Let us install OpenJDK 8. More about [OpenJDK ↗](https://openjdk.java.net/install/)."
]
},
{
"cell_type": "code",
"metadata": {
"id": "8Xnb_ePUyQIL",
"colab_type": "code",
"colab": {}
},
"source": [
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"\n",
"# run the below command to replace the existing installation\n",
"!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java\n",
"\n",
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"\n",
"!java -version"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BhmBWf3u3Q0o",
"colab_type": "text"
},
"source": [
"##### Install Apache Maven\n",
"\n",
"SystemDS uses Apache Maven to build and manage the project. More about [Apache Maven ↗](http://maven.apache.org/).\n",
"\n",
"Maven builds SystemDS using its project object model (POM) and a set of plugins. One would find `pom.xml` find the codebase!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "I81zPDcblchL",
"colab_type": "code",
"colab": {}
},
"source": [
"# Download the maven source.\n",
"maven_version = 'apache-maven-3.6.3'\n",
"maven_path = f\"/opt/{maven_version}\"\n",
"\n",
"if not os.path.exists(maven_path):\n",
" run(f\"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip\")\n",
" run('unzip -q -d /opt apache-maven.zip')\n",
" run('rm -f apache-maven.zip')\n",
"\n",
"# Let's choose the absolute path instead of $PATH environment variable.\n",
"def maven(args):\n",
" run(f\"{maven_path}/bin/mvn {args}\")\n",
"\n",
"maven('-v')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Xphbe3R43XLw",
"colab_type": "text"
},
"source": [
"##### Install Apache Spark (Optional, if you want to work with spark backend)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_WgEa00pTs3w",
"colab_type": "text"
},
"source": [
"NOTE: If spark is not downloaded. Let us make sure the version we are trying to download is officially supported at\n",
"https://spark.apache.org/downloads.html"
]
},
{
"cell_type": "code",
"metadata": {
"id": "3zdtkFkLnskx",
"colab_type": "code",
"colab": {}
},
"source": [
"# Spark and Hadoop version\n",
"spark_version = 'spark-2.4.6'\n",
"hadoop_version = 'hadoop2.7'\n",
"spark_path = f\"/opt/{spark_version}-bin-{hadoop_version}\"\n",
"if not os.path.exists(spark_path):\n",
" run(f\"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz\")\n",
" run('tar zxf apache-spark.tgz -C /opt')\n",
" run('rm -f apache-spark.tgz')\n",
"\n",
"os.environ[\"SPARK_HOME\"] = spark_path\n",
"os.environ[\"PATH\"] += \":$SPARK_HOME/bin\"\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "91pJ5U8k3cjk",
"colab_type": "text"
},
"source": [
"#### Get Apache SystemDS\n",
"\n",
"Apache SystemDS development happens on GitHub at [apache/systemds ↗](https://github.com/apache/systemds)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "SaPIprmg3lKE",
"colab_type": "code",
"colab": {}
},
"source": [
"!git clone https://github.com/apache/systemds systemds --depth=1\n",
"%cd systemds"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "40Fo9tPUzbWK",
"colab_type": "text"
},
"source": [
"##### Build the project"
]
},
{
"cell_type": "code",
"metadata": {
"id": "s0Iorb0ICgHa",
"colab_type": "code",
"colab": {}
},
"source": [
"# Logging flags: -q only for ERROR; -X for DEBUG; -e for ERROR\n",
"# Option 1: Build only the java codebase\n",
"maven('clean package -q')\n",
"\n",
"# Option 2: For building along with python distribution\n",
"# maven('clean package -P distribution')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SUGac5w9ZRBQ",
"colab_type": "text"
},
"source": [
"### A. Working with SystemDS in **standalone** mode\n",
"\n",
"NOTE: Let's pay attention to *directories* and *relative paths*. :)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "g5Nk2Bb4UU2O",
"colab_type": "text"
},
"source": [
"##### 1. Set SystemDS environment variables\n",
"\n",
"These are useful for the `./bin/systemds` script."
]
},
{
"cell_type": "code",
"metadata": {
"id": "2ZnSzkq8UT32",
"colab_type": "code",
"colab": {}
},
"source": [
"!export SYSTEMDS_ROOT=$(pwd)\n",
"!export PATH=$SYSTEMDS_ROOT/bin:$PATH"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "zyLmFCv6ZYk5",
"colab_type": "text"
},
"source": [
"##### 2. Download Haberman data\n",
"\n",
"Data source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival\n",
"\n",
"About: The survival of patients who had undergone surgery for breast cancer.\n",
"\n",
"Data Attributes:\n",
"1. Age of patient at time of operation (numerical)\n",
"2. Patient's year of operation (year - 1900, numerical)\n",
"3. Number of positive axillary nodes detected (numerical)\n",
"4. Survival status (class attribute)\n",
" - 1 = the patient survived 5 years or longer\n",
" - 2 = the patient died within 5 year"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZrQFBQehV8SF",
"colab_type": "code",
"colab": {}
},
"source": [
"!mkdir ../data"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "E1ZFCTFmXFY_",
"colab_type": "code",
"colab": {}
},
"source": [
"!wget -P ../data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "FTo8Py_vOGpX",
"colab_type": "code",
"colab": {}
},
"source": [
"# Display first 10 lines of the dataset\n",
"# Notice that the test is plain csv with no headers!\n",
"!sed -n 1,10p ../data/haberman.data"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Oy2kgVdkaeWK",
"colab_type": "text"
},
"source": [
"##### 2.1 Set `metadata` for the data\n",
"\n",
"The data does not have any info on the value types. So, `metadata` for the data\n",
"helps know the size and format for the matrix data as `.mtd` file with the same\n",
"name and location as `.data` file."
]
},
{
"cell_type": "code",
"metadata": {
"id": "vfypIgJWXT6K",
"colab_type": "code",
"colab": {}
},
"source": [
"# generate metadata file for the dataset\n",
"!echo '{\"rows\": 306, \"cols\": 4, \"format\": \"csv\"}' > ../data/haberman.data.mtd\n",
"\n",
"# generate type description for the data\n",
"!echo '1,1,1,2' > ../data/types.csv\n",
"!echo '{\"rows\": 1, \"cols\": 4, \"format\": \"csv\"}' > ../data/types.csv.mtd"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7Vis3V31bA53",
"colab_type": "text"
},
"source": [
"##### 3. Find the algorithm to run with `systemds`"
]
},
{
"cell_type": "code",
"metadata": {
"id": "L_0KosFhbhun",
"colab_type": "code",
"colab": {}
},
"source": [
"# Inspect the directory structure of systemds code base\n",
"!ls"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "R7C5DVM7YfTb",
"colab_type": "code",
"colab": {}
},
"source": [
"# List all the scripts (also called top level algorithms!)\n",
"!ls scripts/algorithms"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "5PrxwviWJhNd",
"colab_type": "code",
"colab": {}
},
"source": [
"# Lets choose univariate statistics script.\n",
"# Output the algorithm documentation\n",
"# start from line no. 22 onwards. Till 35th line the command looks like\n",
"!sed -n 22,35p ./scripts/algorithms/Univar-Stats.dml"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "zv_7wRPFSeuJ",
"colab_type": "code",
"colab": {}
},
"source": [
"!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "IqY_ARNnavrC",
"colab_type": "text"
},
"source": [
"##### 3.1 Let us inspect the output data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "k-_eQg9TauPi",
"colab_type": "code",
"colab": {}
},
"source": [
"# output first 10 lines only.\n",
"!sed -n 1,10p ../data/univarOut.mtx"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "o5VCCweiDMjf",
"colab_type": "text"
},
"source": [
"#### B. Run SystemDS with Apache Spark"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6gJhL7lc1vf7",
"colab_type": "text"
},
"source": [
"#### Playground for DML scripts\n",
"\n",
"DML - A custom language designed for SystemDS with R-like syntax."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zzqeSor__U6M",
"colab_type": "text"
},
"source": [
"##### A test `dml` script to prototype algorithms\n",
"\n",
"Modify the code in the below cell and run to work develop data science tasks\n",
"in a high level language."
]
},
{
"cell_type": "code",
"metadata": {
"id": "t59rTyNbOF5b",
"colab_type": "code",
"colab": {}
},
"source": [
"%%writefile ../test.dml\n",
"\n",
"# This code code acts as a playground for dml code\n",
"X = rand (rows = 20, cols = 10)\n",
"y = X %*% rand(rows = ncol(X), cols = 1)\n",
"lm(X = X, y = y)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "VDfeuJYE1JfK",
"colab_type": "text"
},
"source": [
"Submit the `dml` script to Spark with `spark-submit`.\n",
"More about [Spark Submit ↗](https://spark.apache.org/docs/latest/submitting-applications.html)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "YokktyNE1Cig",
"colab_type": "code",
"colab": {}
},
"source": [
"!$SPARK_HOME/bin/spark-submit \\\n",
" ./target/SystemDS.jar -f ../test.dml"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "gCMkudo_-8_8",
"colab_type": "text"
},
"source": [
"##### Run a binary classification example with sample data\n",
"\n",
"One would notice that no other script than simple dml is used in this example completely."
]
},
{
"cell_type": "code",
"metadata": {
"id": "OSLq2cZb_SUl",
"colab_type": "code",
"colab": {}
},
"source": [
"# Example binary classification task with sample data.\n",
"# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml"
],
"execution_count": null,
"outputs": []
}
]
}