* This will become a table of contents (this text will be scraped).
{:toc}

SystemML enables *flexible*, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors, one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).

Algorithm scripts written in DML and PyDML can be run on Spark, on Hadoop, or in Standalone mode. SystemML also features an MLContext API that allows SystemML to be accessed via Scala or Python from a Spark Shell, a Jupyter Notebook, or a Zeppelin Notebook.

This Beginner's Guide serves as a starting point for writing DML and PyDML scripts.

DML and PyDML scripts can be invoked in a variety of ways. Suppose that we have `hello.dml` and `hello.pydml` scripts containing the following:

    print('hello ' + $1)

One way to begin working with SystemML is to download a standalone tar.gz or zip distribution of SystemML and use the `runStandaloneSystemML.sh` and `runStandaloneSystemML.bat` scripts to run SystemML in standalone mode. The name of the DML or PyDML script is passed as the first argument to these scripts, followed by any additional arguments.

    ./runStandaloneSystemML.sh hello.dml -args world
    ./runStandaloneSystemML.sh hello.pydml -python -args world

SystemML has four value data types. In DML, these are: **double**, **integer**, **string**, and **boolean**. In PyDML, these are: **float**, **int**, **str**, and **bool**. In normal usage, the data type of a variable is implicit based on its value. Mathematical operations typically operate on doubles/floats, whereas integers/ints are typically useful for tasks such as iteration and accessing elements in a matrix.
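As a minimal sketch of the numeric value types in DML, a double and an integer can be assigned and combined as shown here. The variable names `aDouble` and `bInteger` are chosen for illustration only.

```dml
# Illustrative sketch; variable names are hypothetical
aDouble = 3.0     # type is inferred as double from the literal
bInteger = 2      # type is inferred as integer from the literal

print('aDouble = ' + aDouble)
print('bInteger = ' + bInteger)

# Mixing an integer and a double in an arithmetic expression
print('aDouble + bInteger = ' + (aDouble + bInteger))
```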

{% highlight r %}
cBoolean = TRUE
print('cBoolean = ' + cBoolean)
print('(2 < 1) = ' + (2 < 1))

dString = 'Hello'
eString = dString + ' World'
print('dString = ' + dString)
print('eString = ' + eString)
{% endhighlight %}

{% highlight python %}
cBool = True
print('cBool = ' + cBool)
print('(2 < 1) = ' + (2 < 1))

dStr = 'Hello'
eStr = dStr + ' World'
print('dStr = ' + dStr)
print('eStr = ' + eStr)
{% endhighlight %}

A matrix can be created in DML using the **`matrix()`** function and in PyDML using the `full()` function.
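A minimal DML sketch of matrix creation follows. The matrix values and the variable names `m` and `n` are chosen for illustration. Because a single matrix element is still a matrix, it is cast with `as.scalar()` before printing, and the loops start at 1 since DML indexes from 1.

```dml
# Illustrative sketch: build a 4x3 matrix from a string of values
m = matrix("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)

# Visit every cell; DML row and column indices start at 1
for (i in 1:nrow(m)) {
    for (j in 1:ncol(m)) {
        n = m[i,j]    # a 1x1 matrix, not yet a scalar
        print('[' + i + ',' + j + ']:' + as.scalar(n))
    }
}
```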

We can also output the matrix element values using the **`toString()`** function.
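For instance, assuming the same illustrative 4x3 matrix as a starting point, the whole matrix can be printed in one call rather than cell by cell:

```dml
# Illustrative sketch: print an entire matrix at once
m = matrix("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
print(toString(m))
```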

For additional information about the **`matrix()`**, `full()`, and `toString()` functions, see the Language Reference.

A matrix can be saved using the **`write()`** function in DML and the `save()` function in PyDML. The supported formats are `text` (`i,j,v`), `mm` (`Matrix Market`), `csv` (`delimiter-separated values`), and `binary`.

Saving a matrix automatically creates a metadata file for each format except for Matrix Market, since Matrix Market contains metadata within the `.mm` file. All formats are text-based except binary. The contents of the resulting files are shown here. *Note that the text (`i,j,v`) and mm (Matrix Market) formats index from 1, even when working with PyDML, which is 0-based.*
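A minimal DML sketch of saving one matrix in each of the four formats is given here. The matrix values and the output file names (`m.txt`, `m.mm`, `m.csv`, `m.binary`) are assumptions made for illustration.

```dml
# Illustrative sketch; file names are hypothetical
m = matrix("1 2 3 0 0 0 7 8 9 0 0 0", rows=4, cols=3)

write(m, "m.txt", format="text")       # i,j,v text format
write(m, "m.mm", format="mm")          # Matrix Market format
write(m, "m.csv", format="csv")        # delimiter-separated values
write(m, "m.binary", format="binary")  # binary format
```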

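A matrix can be loaded using the **`read()`** function in DML and the `load()` function in PyDML. As with saving, the supported formats are `text` (`i,j,v`), `mm` (`Matrix Market`), `csv` (`delimiter-separated values`), and `binary`. The file format can be given explicitly by specifying the `format` parameter to the `read()` or `load()` function.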
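A minimal DML sketch of loading a matrix and computing a few summaries is shown here. It assumes a file named `m.csv` (a hypothetical name) that was previously saved with `write()`, so that its metadata file is present.

```dml
# Illustrative sketch; assumes m.csv and its metadata file already exist
m = read("m.csv")

print("min:" + min(m))
print("max:" + max(m))
print("sum:" + sum(m))

# rowSums() returns a column vector with one entry per row
mRowSums = rowSums(m)
for (i in 1:nrow(mRowSums)) {
    print("row " + i + " sum:" + as.scalar(mRowSums[i,1]))
}
```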


DML and PyDML offer a rich set of operators and built-in functions to perform various operations on matrices and scalars. Operators and built-in functions are described in great detail in the Language Reference (Expressions, Built-In Functions).

In this example, we create a matrix A. Next, we create another matrix B by adding 4 to each element in A. We then flip B by taking its transpose. We multiply A and B and store the product in matrix C. We create a matrix D with the same number of rows and columns as C and initialize its elements to 5. Finally, we subtract D from C, divide each element of the result by 2, and assign the resulting matrix to D.
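The steps described above can be sketched in DML as follows. The starting values of A (a 3x2 matrix holding 1 through 6) are an assumption, chosen so that the intermediate matrices can be checked by hand.

```dml
# Illustrative sketch; the contents of A are assumed
A = matrix("1 2 3 4 5 6", rows=3, cols=2)

B = A + 4            # add 4 to every element of A
B = t(B)             # transpose: B becomes 2x3
print(toString(B))

C = A %*% B          # matrix multiplication: (3x2) times (2x3) gives 3x3
print(toString(C))

D = matrix(5, rows=nrow(C), cols=ncol(C))  # same shape as C, filled with 5
D = (C - D) / 2      # element-wise subtraction, then element-wise division
print(toString(D))
```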


    5.000 7.000 9.000
    6.000 8.000 10.000

    17.000 23.000 29.000
    39.000 53.000 67.000
    61.000 83.000 105.000

    6.000 9.000 12.000
    17.000 24.000 31.000
    28.000 39.000 50.000

The elements in a matrix can be accessed by their row and column indices. In the example below, we have a 3x3 matrix A. First, we access the element at the third row and third column. Next, we obtain a row slice (vector) of the matrix by specifying the row and leaving the column blank. We obtain a column slice (vector) by leaving the row blank and specifying the column. After that, we obtain a submatrix via range indexing, where we specify a range of rows and a range of columns, each written as a start and end index separated by a colon.
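These four kinds of indexing can be sketched in DML as follows. The contents of A (the values 1 through 9) and the variable names B through E are assumptions made for illustration.

```dml
# Illustrative sketch; the contents of A are assumed
A = matrix("1 2 3 4 5 6 7 8 9", rows=3, cols=3)

B = A[3,3]        # single element at row 3, column 3
print(toString(B))

C = A[2,]         # row slice: all columns of row 2
print(toString(C))

D = A[,3]         # column slice: all rows of column 3
print(toString(D))

E = A[2:3,1:2]    # submatrix: rows 2 to 3, columns 1 to 2
print(toString(E))
```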


    9.000

    4.000 5.000 6.000

    3.000
    6.000
    9.000

    4.000 5.000
    7.000 8.000

DML and PyDML both feature `if`, `if-else`, and `if-else-if` conditional statements.
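A minimal DML sketch of the three forms is given here. The variable `i` and the printed strings are chosen for illustration.

```dml
# Illustrative sketch of the three conditional forms
i = 1

# if
if (i > 0) {
    print('hello')
}

# if-else
if (i > 0) {
    print('hello')
} else {
    print('world')
}

# if-else-if
if (i > 0) {
    print('hello')
} else if (i < 0) {
    print('world')
} else {
    print('!!!')
}
```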

DML and PyDML feature three loop statements: `while`, `for`, and `parfor` (parallel for). In the example, note that the `print` statements within the `parfor` loop can occur in any order since the iterations occur in parallel rather than sequentially as in a regular `for` loop. The `parfor` statement can include several optional parameters, as described in the Language Reference (ParFor Statement).

{% highlight r %}
A = matrix("1 2 3 4 5 6", rows=3, cols=2)

for (i in 1:nrow(A)) {
    print("for A[" + i + ",1]:" + as.scalar(A[i,1]))
}

parfor(i in 1:nrow(A)) {
    print("parfor A[" + i + ",1]:" + as.scalar(A[i,1]))
}
{% endhighlight %}

{% highlight python %}
A = full("1 2 3 4 5 6", rows=3, cols=2)

for (i in 0:nrow(A)-1):
    print("for A[" + i + ",0]:" + scalar(A[i,0]))

parfor(i in 0:nrow(A)-1):
    print("parfor A[" + i + ",0]:" + scalar(A[i,0]))
{% endhighlight %}
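The examples above cover `for` and `parfor`. As a complement, a `while` loop in DML can be sketched as follows, with a counter variable `i` chosen for illustration. Unlike a `for` loop, the counter has to be initialized and advanced explicitly.

```dml
# Illustrative sketch of a while loop
i = 1
while (i < 3) {
    print('iteration ' + i)
    i = i + 1    # advance the counter, otherwise the loop never ends
}
```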

Functions encapsulate useful functionality in SystemML. In addition to built-in functions, users can define their own functions. Functions take 0 or more parameters and return 0 or more values. Currently, if a function returns nothing, it still needs to be assigned to a variable.

{% highlight r %}
A = rand(rows=3, cols=2, min=0, max=2) # random 3x2 matrix with values 0 to 2
B = doSomething(A)
write(A, "A.csv", format="csv")
write(B, "B.csv", format="csv")
{% endhighlight %}

{% highlight python %}
A = rand(rows=3, cols=2, min=0, max=2) # random 3x2 matrix with values 0 to 2
B = doSomething(A)
save(A, "A.csv", format="csv")
save(B, "B.csv", format="csv")
{% endhighlight %}

In the above example, a 3x2 matrix of random doubles between 0 and 2 is created using the **`rand()`** function. Additional parameters can be passed to `rand()` to control other characteristics of the generated matrix, such as its sparsity.

Matrix A is passed to the `doSomething` function. A column of 1 values is concatenated to the matrix. A column consisting of the values `(0, 1, 2)` is concatenated to the matrix. Next, a column consisting of the maximum row values is concatenated to the matrix. A column consisting of the row sums is concatenated to the matrix, and this resulting matrix is returned to variable B. Matrix A is output to the `A.csv` file and matrix B is saved as the `B.csv` file.
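One way the `doSomething` function described above could be written in DML is sketched here. The parameter name `mat`, the return name `ret`, and the choice of `cbind()`, `seq()`, `rowMaxs()`, and `rowSums()` are assumptions that follow the description, not necessarily the exact original definition.

```dml
# Illustrative sketch of a user-defined function; names are assumed
doSomething = function(matrix[double] mat) return (matrix[double] ret) {
    additionalCol = matrix(1, rows=nrow(mat), cols=1)  # column of 1 values
    ret = cbind(mat, additionalCol)   # concatenate the column of 1s
    ret = cbind(ret, seq(0, 2, 1))    # concatenate the column (0, 1, 2)
    ret = cbind(ret, rowMaxs(ret))    # concatenate the per-row maximums
    ret = cbind(ret, rowSums(ret))    # concatenate the per-row sums
}

A = rand(rows=3, cols=2, min=0, max=2)  # random 3x2 matrix with values 0 to 2
B = doSomething(A)
print(toString(B))
```

Note that the `(0, 1, 2)` column has three entries, so this sketch only works for input matrices with exactly three rows.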

Command-line arguments can be passed to DML and PyDML scripts either as named arguments or as positional arguments. Named arguments are the preferred technique. Named arguments can be passed utilizing the `-nvargs` switch, and positional arguments can be passed using the `-args` switch.

Default values can be set using the **`ifdef()`** function.

In the example below, a matrix is read from the file system using named argument `M`. The number of rows to print is specified using the `rowsToPrint` argument, which defaults to 2 if no argument is supplied. Likewise, the number of columns is specified using `colsToPrint` with a default value of 2.

{% highlight r %}
fileM = $M

numRowsToPrint = ifdef($rowsToPrint, 2) # default to 2
numColsToPrint = ifdef($colsToPrint, 2) # default to 2

m = read(fileM)

for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}
{% endhighlight %}

{% highlight python %}
fileM = $M

numRowsToPrint = ifdef($rowsToPrint, 2) # default to 2
numColsToPrint = ifdef($colsToPrint, 2) # default to 2

m = load(fileM)

for (i in 0:numRowsToPrint-1):
    for (j in 0:numColsToPrint-1):
        print('[' + i + ',' + j + ']:' + scalar(m[i,j]))
{% endhighlight %}

    Example #1 Results:
    [1,1]:1.0
    [1,2]:2.0
    [1,3]:3.0

    Example #2 Arguments:
    -f ex.dml -nvargs M=m.csv

    Example #2 Results:
    [1,1]:1.0
    [1,2]:2.0
    [2,1]:0.0
    [2,2]:0.0

    Example #1 Results:
    [0,0]:1.0
    [0,1]:2.0
    [0,2]:3.0

    Example #2 Arguments:
    -f ex.pydml -python -nvargs M=m.csv

    Example #2 Results:
    [0,0]:1.0
    [0,1]:2.0
    [1,0]:0.0
    [1,1]:0.0

Here, we see identical functionality but with positional arguments.

{% highlight r %}
fileM = $1

numRowsToPrint = ifdef($2, 2) # default to 2
numColsToPrint = ifdef($3, 2) # default to 2

m = read(fileM)

for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}
{% endhighlight %}

{% highlight python %}
fileM = $1

numRowsToPrint = ifdef($2, 2) # default to 2
numColsToPrint = ifdef($3, 2) # default to 2

m = load(fileM)

for (i in 0:numRowsToPrint-1):
    for (j in 0:numColsToPrint-1):
        print('[' + i + ',' + j + ']:' + scalar(m[i,j]))
{% endhighlight %}

    Example #1 Results:
    [1,1]:1.0
    [1,2]:2.0
    [1,3]:3.0

    Example #2 Arguments:
    -f ex.dml -args m.csv

    Example #2 Results:
    [1,1]:1.0
    [1,2]:2.0
    [2,1]:0.0
    [2,2]:0.0

    Example #1 Results:
    [0,0]:1.0
    [0,1]:2.0
    [0,2]:3.0

    Example #2 Arguments:
    -f ex.pydml -python -args m.csv

    Example #2 Results:
    [0,0]:1.0
    [0,1]:2.0
    [1,0]:0.0
    [1,1]:0.0

The Language Reference contains highly detailed information regarding DML.

In addition, many excellent examples of DML and PyDML can be found in the `scripts` directory.