 {
  "cells": [
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Linear Regression Algorithms using Apache SystemML\n",
     "\n",
     "This notebook shows:\n",
     "- Install SystemML Python package and jar file\n",
     "  - pip\n",
     "  - SystemML 'Hello World'\n",
     "- Example 1: Matrix Multiplication\n",
     "  - SystemML script to generate a random matrix, perform matrix multiplication, and compute the sum of the output\n",
     "  - Examine execution plans, and increase data size to observe changed execution plans\n",
     "- Load diabetes dataset from scikit-learn\n",
     "- Example 2: Implement three different algorithms to train a linear regression model\n",
     "  - Algorithm 1: Linear Regression - Direct Solve (no regularization)\n",
     "  - Algorithm 2: Linear Regression - Batch Gradient Descent (no regularization)\n",
     "  - Algorithm 3: Linear Regression - Conjugate Gradient (no regularization)\n",
     "- Example 3: Invoke existing SystemML algorithm script LinearRegDS.dml using MLContext API\n",
     "- Example 4: Invoke existing SystemML algorithm using scikit-learn/SparkML pipeline-like API\n",
     "- Uninstall/Clean up SystemML Python package and jar file"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Install SystemML Python package and jar file"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "!pip uninstall systemml -y\n",
     "!pip install --user https://repository.apache.org/content/groups/snapshots/org/apache/systemml/systemml/1.0.0-SNAPSHOT/systemml-1.0.0-20171201.070207-23-python.tar.gz"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "!pip show systemml"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Import SystemML API "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "from systemml import MLContext, dml, dmlFromResource\n",
     "\n",
     "# sc is assumed to be the SparkContext predefined by the PySpark notebook kernel\n",
     "ml = MLContext(sc)\n",
     "\n",
     "print(\"Spark Version: {}\".format(sc.version))\n",
     "print(\"SystemML Version: {}\".format(ml.version()))\n",
     "print(\"SystemML Built-Time: {}\".format(ml.buildTime()))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "ml.execute(dml(\"\"\"s = 'Hello World!'\"\"\").output(\"s\")).get(\"s\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Import matplotlib, numpy, and sklearn"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "import matplotlib.pyplot as plt\n",
     "import numpy as np\n",
     "from sklearn import datasets\n",
     "plt.switch_backend('agg')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Example 1: Matrix Multiplication"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### SystemML script to generate a random matrix, perform matrix multiplication, and compute the sum of the output"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true,
     "slideshow": {
      "slide_type": "-"
     }
    },
    "outputs": [],
    "source": [
     "script = \"\"\"\n",
     "    X = rand(rows=$nr, cols=1000, sparsity=0.5)\n",
     "    A = t(X) %*% X\n",
     "    s = sum(A)\n",
     "\"\"\""
    ]
   },
   {
    "cell_type": "raw",
    "metadata": {},
    "source": [
     "ml.setStatistics(False)"
    ]
   },
   {
    "cell_type": "raw",
    "metadata": {},
    "source": [
     "ml.setExplain(True).setExplainLevel(\"runtime\")"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = dml(script).input('$nr', 1e5).output('s')\n",
     "s = ml.execute(prog).get('s')\n",
     "print (s)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Load diabetes dataset from scikit-learn "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "%matplotlib inline"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "diabetes = datasets.load_diabetes()\n",
     "diabetes_X = diabetes.data[:, np.newaxis, 2]\n",
     "diabetes_X_train = diabetes_X[:-20]\n",
     "diabetes_X_test = diabetes_X[-20:]\n",
     "diabetes_y_train = diabetes.target[:-20].reshape(-1,1)\n",
     "diabetes_y_test = diabetes.target[-20:].reshape(-1,1)\n",
     "\n",
     "plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')\n",
     "plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "diabetes.data.shape"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Example 2: Implement three different algorithms to train a linear regression model"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "collapsed": true
    },
    "source": [
     "## Algorithm 1: Linear Regression - Direct Solve (no regularization) "
    ]
   },
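   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "#### Least squares formulation\n",
     "$$w^* = \\arg\\min_w \\lVert Xw - y \\rVert_2^2 = \\arg\\min_w\\, (y - Xw)'(y - Xw) = \\arg\\min_w\\, \\tfrac{1}{2}\\, w'(X'X)w - w'(X'y)$$\n",
     "\n",
     "Expanding the product gives `w'(X'X)w - 2w'(X'y) + y'y`; dropping the constant `y'y` and halving the rest leaves the minimizer unchanged.\n",
     "\n",
     "#### Setting the gradient\n",
     "Setting the gradient $dw = (X'X)w - (X'y)$ to 0 gives $w = (X'X)^{-1}(X'y)$, computed as `solve(X'X, X'y)`."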
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "script = \"\"\"\n",
     "    # add constant feature to X to model intercept\n",
     "    X = cbind(X, matrix(1, rows=nrow(X), cols=1))\n",
     "    A = t(X) %*% X\n",
     "    b = t(X) %*% y\n",
     "    w = solve(A, b)                  # solve the normal equations A w = b\n",
     "    bias = as.scalar(w[nrow(w),1])   # last entry is the intercept\n",
     "    w = w[1:(nrow(w)-1),]            # remaining entries are the feature weights\n",
     "\"\"\""
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')\n",
     "w, bias = ml.execute(prog).get('w','bias')\n",
     "w = w.toNumPy()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')\n",
     "plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')\n",
     "\n",
     "plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='blue', linestyle ='dotted')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "collapsed": true
    },
    "source": [
     "## Algorithm 2: Linear Regression - Batch Gradient Descent (no regularization)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "#### Algorithm\n",
     "`Step 1: Start with an initial point \n",
     "while(not converged) { \n",
     "  Step 2: Compute gradient dw. \n",
     "  Step 3: Compute stepsize alpha.     \n",
     "  Step 4: Update: w_new = w_old + alpha*dw \n",
     "}`\n",
     "\n",
     "#### Gradient formula\n",
     "`dw = r = (X'X)w - (X'y)`\n",
     "\n",
     "#### Step size formula\n",
     "`Find number alpha to minimize f(w + alpha*r) \n",
     "alpha = -(r'r)/(r'X'Xr)`\n",
     "\n",
     "![Gradient Descent](http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "script = \"\"\"\n",
     "    # add constant feature to X to model intercepts\n",
     "    X = cbind(X, matrix(1, rows=nrow(X), cols=1))\n",
     "    max_iter = 100\n",
     "    w = matrix(0, rows=ncol(X), cols=1)\n",
     "    # X'X and X'y do not change between iterations, so compute them once\n",
     "    XtX = t(X) %*% X\n",
     "    Xty = t(X) %*% y\n",
     "    for(i in 1:max_iter){\n",
     "        dw = XtX %*% w - Xty                              # gradient\n",
     "        alpha = -(t(dw) %*% dw) / (t(dw) %*% XtX %*% dw)  # exact line-search step (negative)\n",
     "        w = w + dw*alpha\n",
     "    }\n",
     "    bias = as.scalar(w[nrow(w),1])\n",
     "    w = w[1:nrow(w)-1,]    \n",
     "\"\"\""
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')\n",
     "w, bias = ml.execute(prog).get('w', 'bias')\n",
     "w = w.toNumPy()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')\n",
     "plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')\n",
     "\n",
     "plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle ='dashed')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Algorithm 3: Linear Regression - Conjugate Gradient (no regularization)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Problem with gradient descent: it takes very similar directions many times.\n",
     "\n",
     "Solution: enforce conjugacy between successive search directions.\n",
     "\n",
     "`Step 1: Start with an initial point \n",
     "while(not converged) {\n",
     "   Step 2: Compute gradient dw.\n",
     "   Step 3: Compute next direction p by enforcing conjugacy with previous direction.\n",
     "   Step 4: Compute stepsize alpha along p.\n",
     "   Step 5: Update: w_new = w_old + alpha*p\n",
     "}`\n",
     "\n",
     "![Gradient Descent vs Conjugate Gradient](http://i.stack.imgur.com/zh1HH.png)\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "script = \"\"\"\n",
     "    # add constant feature to X to model intercepts\n",
     "    X = cbind(X, matrix(1, rows=nrow(X), cols=1))\n",
     "    m = ncol(X);\n",
     "    max_iter = 20;\n",
     "    w = matrix (0, rows = m, cols = 1); # initialize weights to 0\n",
     "    dw = - t(X) %*% y; p = - dw;        # dw = (X'X)w - (X'y) with w = 0\n",
     "    norm_r2 = sum (dw ^ 2); \n",
     "    for(i in 1:max_iter) {\n",
     "        q = t(X) %*% (X %*% p)\n",
     "        alpha = norm_r2 / sum (p * q);  # minimizes f(w + alpha*p)\n",
     "        w = w + alpha * p;              # update weights\n",
     "        dw = dw + alpha * q;            # update gradient\n",
     "        old_norm_r2 = norm_r2; norm_r2 = sum (dw ^ 2);\n",
     "        p = -dw + (norm_r2 / old_norm_r2) * p; # next direction - conjugacy to previous direction\n",
     "    }\n",
     "    bias = as.scalar(w[nrow(w),1])\n",
     "    w = w[1:nrow(w)-1,]    \n",
     "\"\"\""
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')\n",
     "w, bias = ml.execute(prog).get('w','bias')\n",
     "w = w.toNumPy()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')\n",
     "plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')\n",
     "\n",
     "plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle ='dashed')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Example 3: Invoke existing SystemML algorithm script LinearRegDS.dml using MLContext API"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = dmlFromResource('scripts/algorithms/LinearRegDS.dml').input(X=diabetes_X_train, y=diabetes_y_train).input('$icpt',1.0).output('beta_out')\n",
     "w = ml.execute(prog).get('beta_out')\n",
     "w = w.toNumPy()\n",
     "# with $icpt=1, beta_out holds the slope first and the intercept last\n",
     "bias = w[1]"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')\n",
     "plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')\n",
     "\n",
     "plt.plot(diabetes_X_test, (w[0]*diabetes_X_test)+bias, color='red', linestyle ='dashed')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Example 4: Invoke existing SystemML algorithm using scikit-learn/SparkML pipeline-like API"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The *mllearn* API lets a Python programmer invoke SystemML's algorithms through a scikit-learn-like API as well as Spark's MLPipeline API."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from pyspark.sql import SQLContext\n",
     "from systemml.mllearn import LinearRegression\n",
     "sqlCtx = SQLContext(sc)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "regr = LinearRegression(sqlCtx)\n",
     "# Train the model using the training sets\n",
     "regr.fit(diabetes_X_train, diabetes_y_train)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "predictions = regr.predict(diabetes_X_test)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "# Plot the model's predictions on the test set against the data\n",
     "%matplotlib inline\n",
     "plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')\n",
     "plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')\n",
     "\n",
     "plt.plot(diabetes_X_test, predictions, color='black')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Uninstall/Clean up SystemML Python package and jar file"
    ]
   },
   {
    "cell_type": "raw",
    "metadata": {},
    "source": [
     "!pip uninstall systemml -y"
    ]
   }
  ],
  "metadata": {
   "kernelspec": {
    "display_name": "Python 2",
    "language": "python",
    "name": "python2"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
     "version": 2
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython2",
    "version": "2.7.11"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 1
 }