samples/jupyter-notebooks/DML Tips and Tricks (aka Fun With DML).ipynb - systemds - Git at Google

 {
  "cells": [
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "1. [Replace NaN with mode](#NaN2Mode)\n",
     "* [Use sample builtin function to create sample from matrix](#sample)\n",
     "* [Count of Matching Values in two Matrices/Vectors](#MatchinRows)\n",
     "* [Cross Validation](#CrossValidation)\n",
     "* [Value-based join of two Matrices](#JoinMatrices)\n",
     "* [Filter Matrix to include only Frequent Column Values](#FilterMatrix)\n",
     "* [Construct (sparse) Matrix from (rowIndex, colIndex, values) triplets](#Construct_sparse_Matrix)\n",
     "* [Find and remove duplicates in columns or rows](#Find_and_remove_duplicates)\n",
     "* [Set based Indexing](#Set_based_Indexing)\n",
     "* [Group by Aggregate using Linear Algebra](#Multi_column_Sorting)\n",
     "* [Cumulative Summation with Decay Multiplier](#CumSum_Product)\n",
     "* [Invert Lower Triangular Matrix](#Invert_Lower_Triangular_Matrix)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 15,
    "metadata": {
     "collapsed": false,
     "scrolled": false
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "2017-09-22 07:57:57 UTC\n"
      ]
     }
    ],
    "source": [
     "from systemml import MLContext, dml, jvm_stdout\n",
     "ml = MLContext(sc)\n",
     "\n",
     "print (ml.buildTime())"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Replace NaN with mode<a id=\"NaN2Mode\" />"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "This functions replaces NaN in column with mode of column"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 13,
    "metadata": {
     "collapsed": false
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "Before: \n",
       "1.000 NaN\n",
       "1.000 NaN\n",
       "1.000 2.000\n",
       "2.000 1.000\n",
       "1.000 2.000\n",
       "\n",
       "After: \n",
       "1.000 2.000\n",
       "1.000 2.000\n",
       "1.000 2.000\n",
       "2.000 1.000\n",
       "1.000 2.000\n",
       "\n",
       "SystemML Statistics:\n",
       "Total execution time:\t\t0.001 sec.\n",
       "Number of executed Spark inst:\t0.\n",
       "\n",
       "\n"
      ]
     }
    ],
    "source": [
     "prog=\"\"\"\n",
     "# Function for NaN-aware replacement with mode\n",
     "replaceNaNwithMode = function (matrix[double] X, integer colId) \n",
     "      return (matrix[double] X) \n",
     "{\n",
     "   Xi = replace (target=X[,colId], pattern=0/0, replacement=max(X[,colId])+1)   # replace NaN with largest value + 1\n",
     "   agg = aggregate (target=Xi, groups=Xi, fn=\"count\")                           # count each distinct value\n",
     "   mode = as.scalar (rowIndexMax(t(agg[1:nrow(agg)-1, ])))                      # mode is max frequent value except last value\n",
     "   X[,colId] = replace (target=Xi, pattern=max(Xi), replacement=mode)           # fill in mode\n",
     "}\n",
     "\n",
     "X = matrix('1 NaN 1 NaN 1 2 2 1 1 2', rows = 5, cols = 2)\n",
     "\n",
     "Y = replaceNaNwithMode (X, 2)\n",
     "\n",
     "print (\"Before: \\n\" + toString(X))\n",
     "print (\"After: \\n\" + toString(Y))\n",
     "\"\"\"\n",
     "with jvm_stdout(True):\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Use sample builtin function to create sample from matrix<a id=\"sample\" />"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Use sample() function, create permutation matrix using table(), and pull sample from X."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 18,
    "metadata": {
     "collapsed": false
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "X: \n",
       "2.000 1.000\n",
       "8.000 3.000\n",
       "5.000 6.000\n",
       "7.000 9.000\n",
       "4.000 4.000\n",
       "\n",
       "sv: \n",
       "1.000\n",
       "4.000\n",
       "\n",
       "samples: \n",
       "2.000 1.000\n",
       "7.000 9.000\n",
       "\n",
       "SystemML Statistics:\n",
       "Total execution time:\t\t0.001 sec.\n",
       "Number of executed Spark inst:\t0.\n",
       "\n",
       "\n"
      ]
     }
    ],
    "source": [
     "prog=\"\"\"\n",
     "X = matrix ('2 1 8 3 5 6 7 9 4 4', rows = 5, cols = 2 )\n",
     "\n",
     "nbrSamples = 2\n",
     "\n",
     "sv = order (target = sample (nrow (X), nbrSamples, FALSE))  # samples w/o replacement, and order            \n",
     "P = table (seq (1, nbrSamples), sv, nbrSamples, nrow(X))    # permutation matrix\n",
     "samples = P %*% X;                                          # apply P to perform selection\n",
     "\n",
     "\n",
     "print (\"X: \\n\" + toString(X))\n",
     "print (\"sv: \\n\" + toString(sv))\n",
     "print (\"samples: \\n\" + toString(samples))\n",
     "\"\"\"\n",
     "with jvm_stdout(True):\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Count of Matching Values in two Matrices/Vectors<a id=\"MatchingRows\" />"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Given two matrices/vectors X and Y, get a count of the rows where X and Y have the same value."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 19,
    "metadata": {
     "collapsed": false
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "t(X): 8.000 4.000 5.000 4.000 9.000 10.000\n",
       "\n",
       "t(Y): 4.000 9.000 5.000 1.000 9.000 7.000\n",
       "\n",
       "Number of Matches: 2.0\n",
       "\n",
       "SystemML Statistics:\n",
       "Total execution time:\t\t0.001 sec.\n",
       "Number of executed Spark inst:\t0.\n",
       "\n",
       "\n"
      ]
     }
    ],
    "source": [
     "prog=\"\"\"\n",
     "X = matrix('8 4 5 4 9 10', rows = 6, cols = 1)\n",
     "Y = matrix('4 9 5 1 9 7 ', rows = 6, cols = 1)\n",
     "\n",
     "matches = sum (X == Y)\n",
     "\n",
     "print (\"t(X): \" + toString(t(X)))\n",
     "print (\"t(Y): \" + toString(t(Y)))\n",
     "print (\"Number of Matches: \" + matches + \"\\n\")\n",
     "\"\"\"\n",
     "with jvm_stdout(True):\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Cross Validation<a id=\"CrossValidation\" />"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Perform kFold cross validation by running in parallel fold creation, training algorithm, test algorithm, and evaluation."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 4,
    "metadata": {
     "collapsed": true
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "Test data Xyi2\n",
       "10.000 11.000 12.000 4.000\n",
       "16.000 17.000 18.000 6.000\n",
       "\n",
       "Train data Xyni2\n",
       "1.000 2.000 3.000 1.000\n",
       "4.000 5.000 6.000 2.000\n",
       "7.000 8.000 9.000 3.000\n",
       "13.000 14.000 15.000 5.000\n",
       "\n",
       "w2\n",
       "95.000\n",
       "106.000\n",
       "117.000\n",
       "\n",
       "stats2\n",
       "8938.000\n",
       "\n",
       "\n",
       "Test data Xyi3\n",
       "1.000 2.000 3.000 1.000\n",
       "7.000 8.000 9.000 3.000\n",
       "\n",
       "Train data Xyni3\n",
       "4.000 5.000 6.000 2.000\n",
       "10.000 11.000 12.000 4.000\n",
       "13.000 14.000 15.000 5.000\n",
       "16.000 17.000 18.000 6.000\n",
       "\n",
       "w3\n",
       "209.000\n",
       "226.000\n",
       "243.000\n",
       "\n",
       "stats3\n",
       "6844.000\n",
       "\n",
       "\n",
       "Test data Xyi1\n",
       "4.000 5.000 6.000 2.000\n",
       "13.000 14.000 15.000 5.000\n",
       "\n",
       "Train data Xyni1\n",
       "1.000 2.000 3.000 1.000\n",
       "7.000 8.000 9.000 3.000\n",
       "10.000 11.000 12.000 4.000\n",
       "16.000 17.000 18.000 6.000\n",
       "\n",
       "w1\n",
       "158.000\n",
       "172.000\n",
       "186.000\n",
       "\n",
       "stats1\n",
       "9853.000\n",
       "\n",
       "\n",
       "SV selection vector:\n",
       "3.000\n",
       "1.000\n",
       "3.000\n",
       "2.000\n",
       "1.000\n",
       "2.000\n",
       "\n",
       "SystemML Statistics:\n",
       "Total execution time:\t\t0.024 sec.\n",
       "Number of executed Spark inst:\t0.\n",
       "\n",
       "\n"
      ]
     }
    ],
    "source": [
     "prog = \"\"\"\n",
     "holdOut = 1/3\n",
     "kFolds = 1/holdOut\n",
     "\n",
     "nRows = 6; nCols = 3; \n",
     "\n",
     "X = matrix(seq(1, nRows * nCols), rows = nRows, cols = nCols)             # X data\n",
     "y = matrix(seq(1, nRows), rows = nRows, cols = 1)                         # y label data\n",
     "Xy = cbind (X,y)                                                          # Xy Data for CV\n",
     "\n",
     "sv = rand (rows = nRows, cols = 1, min = 0.0, max = 1.0, pdf = \"uniform\") # sv selection vector for fold creation \n",
     "sv = (order(target=sv, by=1, index.return=TRUE)) %% kFolds + 1            #    with numbers between 1 .. kFolds \n",
     "\n",
     "stats = matrix(0, rows=kFolds, cols=1)                                    # stats per kFolds model on test data\n",
     "\n",
     "parfor (i in 1:kFolds)\n",
     "{\n",
     "   # Skip empty training data or test data. \n",
     "   if  ( sum (sv == i) > 0 & sum (sv == i) < nrow(X) )    \n",
     "   {\n",
     "      Xyi  = removeEmpty(target = Xy, margin = \"rows\", select = (sv == i))  # Xyi fold, i.e. 1/k of rows (test data)\n",
     "      Xyni = removeEmpty(target = Xy, margin = \"rows\", select = (sv != i))  # Xyni data, i.e. (k-1)/k of rows (train data)\n",
     "\n",
     "      # Skip extreme label inbalance\n",
     "      distinctLabels = aggregate( target = Xyni[,1], groups = Xyni[,1], fn = \"count\")\n",
     "      if ( nrow(distinctLabels) > 1)\n",
     "      {\n",
     "         wi = trainAlg (Xyni[ ,1:ncol(Xy)-1], Xyni[ ,ncol(Xy)])             # wi Model for i-th training data\n",
     "         pi = testAlg  (Xyi [ ,1:ncol(Xy)-1], wi)                           # pi Prediction for i-th test data\n",
     "         ei = evalPrediction (pi, Xyi[ ,ncol(Xy)])                          # stats[i,] evaluation of prediction of i-th fold\n",
     "         stats[i,] = ei\n",
     "    \n",
     "         print (  \"Test data Xyi\" + i + \"\\n\" + toString(Xyi)  \n",
     "               + \"\\nTrain data Xyni\" + i + \"\\n\" + toString(Xyni)  \n",
     "               + \"\\nw\" + i + \"\\n\" + toString(wi) \n",
     "               + \"\\nstats\" + i + \"\\n\" + toString(stats[i,]) \n",
     "               + \"\\n\")\n",
     "      }\n",
     "      else\n",
     "      {\n",
     "        print (\"Training data for fold \" + i + \" has only \" + nrow(distinctLabels) + \" distinct labels. Needs to be > 1.\")\n",
     "      }    \n",
     "   } \n",
     "   else \n",
     "   {\n",
     "      print (\"Training data or test data for fold \" + i + \" is empty. Fold not validated.\")\n",
     "   }\n",
     "\n",
     "}\n",
     "\n",
     "print (\"SV selection vector:\\n\" + toString(sv))\n",
     "\n",
     "trainAlg = function (matrix[double] X, matrix[double] y)\n",
     "  return (matrix[double] w)\n",
     "{\n",
     "   w = t(X) %*% y\n",
     "}\n",
     "\n",
     "testAlg = function (matrix[double] X, matrix[double] w)\n",
     "  return (matrix[double] p)\n",
     "{\n",
     "   p = X %*% w\n",
     "}\n",
     "\n",
     "evalPrediction = function (matrix[double] p, matrix[double] y)\n",
     "  return (matrix[double] e)\n",
     "{\n",
     "   e = as.matrix(sum (p - y))\n",
     "}\n",
     "\"\"\"\n",
     "\n",
     "with jvm_stdout(True):\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Value-based join of two Matrices<a id=\"JoinMatrices\"/>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Given matrix M1 and M2, join M1 on column 2 with M2 on column 2, and return matching rows of M1."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 55,
    "metadata": {
     "collapsed": false
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "M1 \n",
       "1.000 1.000\n",
       "2.000 3.000\n",
       "3.000 3.000\n",
       "4.000 4.000\n",
       "5.000 3.000\n",
       "6.000 4.000\n",
       "7.000 1.000\n",
       "8.000 2.000\n",
       "9.000 1.000\n",
       "\n",
       "M2 \n",
       "1.000 1.000\n",
       "2.000 8.000\n",
       "3.000 3.000\n",
       "4.000 3.000\n",
       "5.000 1.000\n",
       "\n",
       "M1[,2] joined with M2[,2], and return matching M1 rows\n",
       "1.000 1.000\n",
       "2.000 3.000\n",
       "3.000 3.000\n",
       "5.000 3.000\n",
       "7.000 1.000\n",
       "9.000 1.000\n",
       "\n",
       "SystemML Statistics:\n",
       "Total execution time:\t\t0.001 sec.\n",
       "Number of executed Spark inst:\t0.\n",
       "\n",
       "\n"
      ]
     }
    ],
    "source": [
     "prog = \"\"\"\n",
     "M1 = matrix ('1 1 2 3 3 3 4 4 5 3 6 4 7 1 8 2 9 1', rows = 9, cols = 2)\n",
     "M2 = matrix ('1 1 2 8 3 3 4 3 5 1', rows = 5, cols = 2)\n",
     "\n",
     "I = rowSums (outer (M1[,2], t(M2[,2]), \"==\"))                  # I : indicator matrix for M1\n",
     "M12 = removeEmpty (target = M1, margin = \"rows\", select = I)   # apply filter to retrieve join result\n",
     "\n",
     "print (\"M1 \\n\" + toString(M1))\n",
     "print (\"M2 \\n\" + toString(M2))\n",
     "print (\"M1[,2] joined with M2[,2], and return matching M1 rows\\n\" + toString(M12))\n",
     "\"\"\"\n",
     "with jvm_stdout():\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Filter Matrix to include only Frequent Column Values <a id=\"FilterMatrix\"/>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Given a matrix, filter the matrix to only include rows with column values that appear more often than MinFreq."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 53,
    "metadata": {
     "collapsed": false
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "1.000 1.000\n",
       "2.000 3.000\n",
       "3.000 3.000\n",
       "4.000 4.000\n",
       "5.000 3.000\n",
       "6.000 4.000\n",
       "7.000 1.000\n",
       "8.000 2.000\n",
       "9.000 1.000\n",
       "\n",
       "1.000 1.000\n",
       "2.000 3.000\n",
       "3.000 3.000\n",
       "5.000 3.000\n",
       "7.000 1.000\n",
       "9.000 1.000\n",
       "\n",
       "SystemML Statistics:\n",
       "Total execution time:\t\t0.001 sec.\n",
       "Number of executed Spark inst:\t0.\n",
       "\n",
       "\n"
      ]
     }
    ],
    "source": [
     "prog = \"\"\"\n",
     "MinFreq = 3                                                           # minimum frequency of tokens\n",
     "\n",
     "M = matrix ('1 1 2 3 3 3 4 4 5 3 6 4 7 1 8 2 9 1', rows = 9, cols = 2)\n",
     "\n",
     "gM = aggregate (target = M[,2], groups = M[,2], fn = \"count\")         # gM: group by and count (grouped matrix)\n",
     "gv = cbind (seq(1,nrow(gM)), gM)                                      # gv: add group values to counts (group values)\n",
     "fg = removeEmpty (target = gv * (gv[,2] >= MinFreq), margin = \"rows\") # fg: filtered groups\n",
     "I = rowSums (outer (M[,2] ,t(fg[,1]), \"==\"))                          # I : indicator of size M with filtered groups\n",
     "fM = removeEmpty (target = M, margin = \"rows\", select = I)            # FM: filter matrix\n",
     "\n",
     "print (toString(M))\n",
     "print (toString(fM))\n",
     "\"\"\"\n",
     "with jvm_stdout():\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Construct (sparse) Matrix from (rowIndex, colIndex, values) triplets<a id=\"Construct_sparse_Matrix\"></a>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Given rowIndex, colIndex, and values as column vectors, construct (sparse) matrix."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = \"\"\"\n",
     "I = matrix (\"1 3 3 4 5\", rows = 5, cols = 1)\n",
     "J = matrix (\"2 3 4 1 6\", rows = 5, cols = 1)\n",
     "V = matrix (\"10 20 30 40 50\", rows = 5, cols = 1)\n",
     "\n",
     "M = table (I, J, V)\n",
     "print (toString (M))\n",
     "\"\"\"\n",
     "ml.execute(dml(prog).output('M')).get('M').toNumPy()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Find and remove duplicates in columns or rows<a id=\"Find_and_remove_duplicates\"></a>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Assuming values are sorted."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = \"\"\"\n",
     "X = matrix (\"1 2 3 3 3 4 5 10\", rows = 8, cols = 1)\n",
     "\n",
     "I = rbind (matrix (1,1,1), (X[1:nrow (X)-1,] != X[2:nrow (X),]));             # compare current with next value\n",
     "res = removeEmpty (target = X, margin = \"rows\", select = I);                  # select where different\n",
     "\"\"\"\n",
     "ml.execute(dml(prog).output('res')).get('res').toNumPy()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### No assumptions on values."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = \"\"\"\n",
     "X = matrix (\"3 2 1 3 3 4 5 10\", rows = 8, cols = 1)\n",
     "\n",
     "I = aggregate (target = X, groups = X[,1], fn = \"count\")                                 # group and count duplicates\n",
     "res = removeEmpty (target = seq (1, max (X[,1])), margin = \"rows\", select = (I != 0));   # select groups\n",
     "\"\"\"\n",
     "ml.execute(dml(prog).output('res')).get('res').toNumPy()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Order the values and then remove duplicates."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = \"\"\"\n",
     "X = matrix (\"3 2 1 3 3 4 5 10\", rows = 8, cols = 1)\n",
     "\n",
     "X = order (target = X, by = 1)                                      # order values\n",
     "I = rbind (matrix (1,1,1), (X[1:nrow (X)-1,] != X[2:nrow (X),]));\n",
     "res = removeEmpty (target = X, margin = \"rows\", select = I);\n",
     "\"\"\"\n",
     "ml.execute(dml(prog).output('res')).get('res').toNumPy()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Set based Indexing<a id=\"Set_based_Indexing\"></a>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Given a matrix X, and a indicator matrix J with indices into X. \n",
     "Use J to perform operation on X, e.g. add value 10 to cells in X indicated by J. "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = \"\"\"\n",
     "X = matrix (1, rows = 1, cols = 100)\n",
     "J = matrix (\"10 20 25 26 28 31 50 67 79\", rows = 1, cols = 9)\n",
     "\n",
     "res = X + table (matrix (1, rows = 1, cols = ncol (J)), J, 10)\n",
     "\n",
     "print (toString (res))\n",
     "\"\"\"\n",
     "ml.execute(dml(prog).output('res')).get('res').toNumPy()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "collapsed": true
    },
    "source": [
     "## Group by Aggregate using Linear Algebra<a id=\"Multi_column_Sorting\"></a>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Given a matrix PCV as (Position, Category, Value), sort PCV by category, and within each category by value in descending order. Create indicator vector for category changes, create distinct categories, and perform linear algebra operations."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = \"\"\"\n",
     "C = matrix ('50 40 20 10 30 20 40 20 30', rows = 9, cols = 1)                              # category data\n",
     "V = matrix ('20 11 49 33 94 29 48 74 57', rows = 9, cols = 1)                              # value data\n",
     "\n",
     "PCV = cbind (cbind (seq (1, nrow (C), 1), C), V);                                      # PCV representation\n",
     "PCV = order (target = PCV, by = 3, decreasing = TRUE,  index.return = FALSE);\n",
     "PCV = order (target = PCV, by = 2, decreasing = FALSE, index.return = FALSE);\n",
     "\n",
     "# Find all rows of PCV where the category has a new value, in comparison to the previous row\n",
     "\n",
     "is_new_C = matrix (1, rows = 1, cols = 1);\n",
     "if (nrow (C) > 1) {\n",
     "  is_new_C = rbind (is_new_C, (PCV [1:nrow(C) - 1, 2] < PCV [2:nrow(C), 2]));\n",
     "}\n",
     "\n",
     "# Associate each category with its index\n",
     "\n",
     "index_C = cumsum (is_new_C);                                                          # cumsum\n",
     "\n",
     "# For each category, compute:\n",
     "#   - the list of distinct categories\n",
     "#   - the maximum value for each category\n",
     "#   - 0-1 aggregation matrix that adds records of the same category\n",
     "\n",
     "distinct_C  = removeEmpty (target = PCV [, 2], margin = \"rows\", select = is_new_C);\n",
     "max_V_per_C = removeEmpty (target = PCV [, 3], margin = \"rows\", select = is_new_C);\n",
     "C_indicator = table (index_C, PCV [, 1], max (index_C), nrow (C));                    # table\n",
     "\n",
     "sum_V_per_C = C_indicator %*% V\n",
     "\"\"\"\n",
     "\n",
     "res = ml.execute(dml(prog).output('PCV','distinct_C', 'max_V_per_C', 'C_indicator', 'sum_V_per_C'))\n",
     "print (res.get('PCV').toNumPy())\n",
     "print (res.get('distinct_C').toNumPy())\n",
     "print (res.get('max_V_per_C').toNumPy())\n",
     "print (res.get('C_indicator').toNumPy())\n",
     "print (res.get('sum_V_per_C').toNumPy())"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Cumulative Summation with Decay Multiplier<a id=\"CumSum_Product\"></a>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Given matrix X, compute:\n",
     "\n",
     "    Y[i] = X[i]\n",
     "         + X[i-1] * C[i]\n",
     "         + X[i-2] * C[i] * C[i-1]\n",
     "         + X[i-3] * C[i] * C[i-1] * C[i-2]\n",
     "         + ...\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "cumsum_prod_def = \"\"\"\n",
     "cumsum_prod = function (Matrix[double] X, Matrix[double] C, double start)\n",
     "  return (Matrix[double] Y)\n",
     "# Computes the following recurrence in log-number of steps:\n",
     "#   Y [1, ] = X [1, ] + C [1, ] * start;\n",
     "#   Y [i+1, ] = X [i+1, ] + C [i+1, ] * Y [i, ]\n",
     "{\n",
     "   Y = X;  P = C;  m = nrow(X);  k = 1;\n",
     "   Y [1,] = Y [1,] + C [1,] * start;\n",
     "   while (k < m) {\n",
     "     Y [k + 1:m,] = Y [k + 1:m,] + Y [1:m - k,] * P [k + 1:m,];\n",
     "     P [k + 1:m,] = P [1:m - k,] * P [k + 1:m,];\n",
     "     k = 2 * k;\n",
     "   }\n",
     "}\n",
     "\"\"\""
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "In this example we use cumsum_prod for cumulative summation with \"breaks\", that is, multiple cumulative summations in one."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = cumsum_prod_def + \"\"\"\n",
     "X = matrix (\"1 2 3 4 5 6 7 8 9\", rows = 9, cols = 1);\n",
     "\n",
     "#Zeros in C cause \"breaks\" that restart the cumulative summation from 0\n",
     "C = matrix (\"0 1 1 0 1 1 1 0 1\", rows = 9, cols = 1);\n",
     "\n",
     "Y = cumsum_prod (X, C, 0);\n",
     "\n",
     "print (toString(Y))\n",
     "\"\"\"\n",
     "with jvm_stdout():\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "In this example, we copy selected rows downward to all consecutive non-selected rows."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = cumsum_prod_def + \"\"\"\n",
     "X = matrix (\"1 2 3 4 5 6 7 8 9\", rows = 9, cols = 1);\n",
     "\n",
     "# Ones in S represent selected rows to be copied, zeros represent non-selected rows\n",
     "S = matrix (\"1 0 0 1 0 0 0 1 0\", rows = 9, cols = 1);\n",
     "\n",
     "Y = cumsum_prod (X * S, 1 - S, 0);\n",
     "\n",
     "print (toString(Y))\n",
     "\"\"\"\n",
     "with jvm_stdout():\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "This is a naive implementation of cumulative summation with decay multiplier."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "cumsum_prod_naive_def = \"\"\"\n",
     "cumsum_prod_naive = function (Matrix[double] X, Matrix[double] C, double start)\n",
     "  return (Matrix[double] Y)\n",
     "{\n",
     "  Y = matrix (0, rows = nrow(X), cols = ncol(X));\n",
     "  Y [1,] = X [1,] + C [1,] * start;\n",
     "  for (i in 2:nrow(X))\n",
     "  {\n",
     "    Y [i,] = X [i,] + C [i,] * Y [i - 1,]\n",
     "  }\n",
     "}\n",
     "\"\"\""
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "There is a significant performance difference between the <b>naive</b> implementation and the <b>tricky</b> implementation."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = cumsum_prod_def + cumsum_prod_naive_def + \"\"\"\n",
     "X = rand (rows = 20000, cols = 10, min = 0, max = 1, pdf = \"uniform\", sparsity = 1.0);\n",
     "C = rand (rows = 20000, cols = 10, min = 0, max = 1, pdf = \"uniform\", sparsity = 1.0);\n",
     "\n",
     "Y1 = cumsum_prod_naive (X, C, 0.123);\n",
     "\"\"\"\n",
     "with jvm_stdout():\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog = cumsum_prod_def + cumsum_prod_naive_def + \"\"\"\n",
     "X = rand (rows = 20000, cols = 10, min = 0, max = 1, pdf = \"uniform\", sparsity = 1.0);\n",
     "C = rand (rows = 20000, cols = 10, min = 0, max = 1, pdf = \"uniform\", sparsity = 1.0);\n",
     "\n",
     "Y2 = cumsum_prod (X, C, 0.123);\n",
     "\"\"\"\n",
     "with jvm_stdout():\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Invert Lower Triangular Matrix<a id=\"Invert_Lower_Triangular_Matrix\"></a>"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "In this example, we invert a lower triangular matrix using a the following divide-and-conquer approach. Given lower triangular matrix L, we compute its inverse X which is also lower triangular by splitting both matrices in the middle into 4 blocks (in a 2x2 fashion), and multiplying them together to get the identity matrix:\n",
     "\n",
     "\\begin{equation}\n",
     "L \\text{ %*% } X = \\left(\\begin{matrix} L_1 & 0 \\\\ L_2 & L_3 \\end{matrix}\\right)\n",
     "\\text{ %*% } \\left(\\begin{matrix} X_1 & 0 \\\\ X_2 & X_3 \\end{matrix}\\right)\n",
     "= \\left(\\begin{matrix} L_1 X_1 & 0 \\\\ L_2 X_1 + L_3 X_2 & L_3 X_3 \\end{matrix}\\right)\n",
     "= \\left(\\begin{matrix} I & 0 \\\\ 0 & I \\end{matrix}\\right)\n",
     "\\nonumber\n",
     "\\end{equation}\n",
     "\n",
     "If we multiply blockwise, we get three equations: \n",
     "\n",
     "$\n",
     "\\begin{equation}\n",
     "L1 \\text{ %*% } X1 = 1\\\\ \n",
     "L3 \\text{ %*% } X3 = 1\\\\\n",
     "L2 \\text{ %*% } X1 + L3 \\text{ %*% } X2 = 0\\\\\n",
     "\\end{equation}\n",
     "$\n",
     "\n",
     "Solving these equation gives the following formulas for X:\n",
     "\n",
     "$\n",
     "\\begin{equation}\n",
     "X1 = inv(L1) \\\\\n",
     "X3 = inv(L3) \\\\\n",
     "X2 = - X3 \\text{ %*% } L2 \\text{ %*% } X1 \\\\\n",
     "\\end{equation}\n",
     "$\n",
     "\n",
     "If we already recursively inverted L1 and L3, we can invert L2.  This suggests an algorithm that starts at the diagonal and iterates away from the diagonal, involving bigger and bigger blocks (of size 1, 2, 4, 8, etc.)  There is a logarithmic number of steps, and inside each step, the inversions can be performed in parallel using a parfor-loop.\n",
     "\n",
     "Function \"invert_lower_triangular\" occurs within more general inverse operations and matrix decompositions.  The divide-and-conquer idea allows to derive more efficient algorithms for other matrix decompositions.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "invert_lower_triangular_def = \"\"\"\n",
     "invert_lower_triangular = function (Matrix[double] LI)\n",
     "  return   (Matrix[double] LO)\n",
     "{\n",
     "  n = nrow (LI);\n",
     "  LO = matrix (0, rows = n, cols = n);\n",
     "  LO = LO + diag (1 / diag (LI));\n",
     "  \n",
     "  k = 1;\n",
     "  while (k < n)\n",
     "  {\n",
     "    LPF = matrix (0, rows = n, cols = n);\n",
     "    parfor (p in 0:((n - 1) / (2 * k)), check = 0)\n",
     "    {\n",
     "    i = 2 * k * p;\n",
     "    j = i + k;\n",
     "    q = min (n, j + k);\n",
     "    if (j + 1 <= q) {\n",
     "      L1 = LO [i + 1:j, i + 1:j];\n",
     "      L2 = LI [j + 1:q, i + 1:j];\n",
     "      L3 = LO [j + 1:q, j + 1:q];\n",
     "      LPF [j + 1:q, i + 1:j] = -L3 %*% L2 %*% L1;\n",
     "    }\n",
     "    }\n",
     "    LO = LO + LPF;\n",
     "    k = 2 * k;\n",
     "  }\n",
     "}\n",
     "\"\"\""
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog =  invert_lower_triangular_def + \"\"\"\n",
     "n = 1000;\n",
     "A = rand (rows = n, cols = n, min = -1, max = 1, pdf = \"uniform\", sparsity = 1.0);\n",
     "Mask = cumsum (diag (matrix (1, rows = n, cols = 1)));\n",
     "\n",
     "L = (A %*% t(A)) * Mask;     # Generate L for stability of the inverse\n",
     "\n",
     "X = invert_lower_triangular (L);\n",
     "\n",
     "print (\"Maximum difference between X %*% L and Identity = \" + max (abs (X %*% L - diag (matrix (1, rows = n, cols = 1)))));\n",
     "\"\"\"\n",
     "with jvm_stdout():\n",
     "    ml.execute(dml(prog))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "This is a naive implementation of inverting a lower triangular matrix."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "invert_lower_triangular_naive_def = \"\"\"\n",
     "invert_lower_triangular_naive = function (Matrix[double] LI)\n",
     "  return   (Matrix[double] LO)\n",
     "{\n",
     "  n = nrow (LI);\n",
     "  LO = diag (matrix (1, rows = n, cols = 1));\n",
     "  for (i in 1:n - 1)\n",
     "  {\n",
     "    LO [i,] = LO [i,] / LI [i, i];\n",
     "    LO [i + 1:n,] = LO [i + 1:n,] - LI [i + 1:n, i] %*% LO [i,];\n",
     "  }\n",
     "  LO [n,] = LO [n,] / LI [n, n];\n",
     "}\n",
     "\"\"\""
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The naive implementation is significantly slower than the divide-and-conquer implementation."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "collapsed": false
    },
    "outputs": [],
    "source": [
     "prog =  invert_lower_triangular_naive_def + \"\"\"\n",
     "n = 1000;\n",
     "A = rand (rows = n, cols = n, min = -1, max = 1, pdf = \"uniform\", sparsity = 1.0);\n",
     "Mask = cumsum (diag (matrix (1, rows = n, cols = 1)));\n",
     "\n",
     "L = (A %*% t(A)) * Mask;     # Generate L for stability of the inverse\n",
     "\n",
     "X = invert_lower_triangular_naive (L);\n",
     "\n",
     "print (\"Maximum difference between X %*% L and Identity = \" + max (abs (X %*% L - diag (matrix (1, rows = n, cols = 1)))));\n",
     "\"\"\"\n",
     "with jvm_stdout():\n",
     "    ml.execute(dml(prog))"
    ]
   }
  ],
  "metadata": {
   "kernelspec": {
    "display_name": "Python 2",
    "language": "python",
    "name": "python2"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
     "version": 2
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython2",
    "version": "2.7.11"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 1
 }