layout: global
title: Beginner's Guide to DML and PyDML
description: Beginner's Guide to DML and PyDML


Overview

SystemML enables flexible, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors, one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).

Algorithm scripts written in DML and PyDML can be run on Hadoop, on Spark, or in Standalone mode. No script modifications are required to change between modes. SystemML automatically performs advanced optimizations based on data and cluster characteristics, which reduces or eliminates much of the need to tune algorithms by hand.

This Beginner's Guide serves as a starting point for writing DML and PyDML scripts.

Script Invocation

DML and PyDML scripts can be invoked in a variety of ways. Suppose that we have hello.dml and hello.pydml scripts containing the following:

print('hello ' + $1)

One way to begin working with SystemML is to build the project and unpack the standalone distribution, which features the runStandaloneSystemML.sh and runStandaloneSystemML.bat scripts. The name of the DML or PyDML script is passed as the first argument to these scripts, followed by any additional arguments.

./runStandaloneSystemML.sh hello.dml -args world
./runStandaloneSystemML.sh hello.pydml -python -args world

For DML and PyDML script invocations that take multiple arguments, a common technique is to create a standard script that invokes runStandaloneSystemML.sh or runStandaloneSystemML.bat with the arguments specified.
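As a rough illustration of such a wrapper, the sketch below assembles the command line in Python instead of a shell script. The function names `build_command` and `run` are hypothetical, and detecting PyDML by the `.pydml` file extension is an assumption of this sketch; only the `-python` and `-args` switches come from the invocations shown above.

```python
import subprocess


def build_command(script, args=(), launcher="./runStandaloneSystemML.sh"):
    """Build the argument list for a standalone SystemML invocation."""
    cmd = [launcher, script]
    # Assumption of this sketch: a .pydml extension means the -python switch is needed.
    if script.endswith(".pydml"):
        cmd.append("-python")
    # Positional script arguments follow the -args switch.
    if args:
        cmd.append("-args")
        cmd.extend(str(a) for a in args)
    return cmd


def run(script, args=()):
    """Hypothetical helper: run the launcher and raise if it exits with an error."""
    return subprocess.run(build_command(script, args), check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch works without SystemML.
    print(" ".join(build_command("hello.dml", ["world"])))
    print(" ".join(build_command("hello.pydml", ["world"])))
```

Keeping the command construction separate from the execution makes it easy to check the generated arguments before anything is launched.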

SystemML itself is written in Java and is managed using Maven. As a result, SystemML can readily be imported into a standard development environment such as Eclipse. The DMLScript class serves as the main entrypoint to SystemML. Executing DMLScript with no arguments displays usage information. A script file can be specified using the -f argument.

In Eclipse, a Debug Configuration can be created with DMLScript as the Main class and any arguments specified as Program arguments. A PyDML script requires the addition of a -python switch.

SystemML contains a default set of configuration information. In addition to this, SystemML looks for a default ./SystemML-config.xml file in the working directory, where overriding configuration information can be specified. Furthermore, a config file can be specified using the -config argument, as in this example:

-f hello.dml -config=src/main/standalone/SystemML-config.xml -args world

When operating in a distributed environment, it is highly recommended that cluster-specific configuration information is provided to SystemML via a configuration file for optimal performance.

Data Types

SystemML has four value data types. In DML, these are: double, integer, string, and boolean. In PyDML, these are: float, int, str, and bool. In normal usage, the data type of a variable is implicit based on its value. Mathematical operations typically operate on doubles/floats, whereas integers/ints are typically useful for tasks such as iteration and accessing elements in a matrix.

DML:
cBoolean = TRUE
print('cBoolean = ' + cBoolean)
print('(2 < 1) = ' + (2 < 1))

dString = 'Hello'
eString = dString + ' World'
print('dString = ' + dString)
print('eString = ' + eString)

PyDML:
cBool = True
print('cBool = ' + cBool)
print('(2 < 1) = ' + (2 < 1))

dStr = 'Hello'
eStr = dStr + ' World'
print('dStr = ' + dStr)
print('eStr = ' + eStr)

Matrix Basics

Creating a Matrix

A matrix can be created in DML using the matrix() function and in PyDML using the full() function. In the example below, a matrix element is still considered to be of the matrix data type, so the value is cast to a scalar in order to print it. Matrix element values are of type double/float. Note that matrices index from 1 in both DML and PyDML.

For additional information about the matrix() and full() functions, please see the DML Language Reference (Matrix Construction) and the PyDML Language Reference (Matrix Construction).

Saving a Matrix

A matrix can be saved using the write() function in DML and the save() function in PyDML. SystemML supports four different formats: text (i,j,v), mm (Matrix Market), csv (delimiter-separated values), and binary.
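To give a feel for the text (i,j,v) cell format named above, here is a small plain-Python sketch that turns a dense matrix into row, column, value triples with 1-based indices. It is not SystemML code and does not reproduce the exact file contents; skipping zero-valued cells and the printed layout are assumptions of this sketch, and `to_ijv` is a hypothetical helper name.

```python
def to_ijv(matrix):
    """Return (row, column, value) triples for the nonzero cells, using 1-based indices."""
    cells = []
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            # Assumption: zero cells are left out, as is common for cell-based formats.
            if value != 0:
                cells.append((i, j, float(value)))
    return cells


m = [[1, 2], [0, 4]]
for i, j, v in to_ijv(m):
    print(i, j, v)
```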

Saving a matrix automatically creates a metadata file for each format except for Matrix Market, since Matrix Market contains metadata within the *.mm file. All formats are text-based except binary. The contents of the resulting files are shown here.

Loading a Matrix

A matrix can be loaded using the read() function in DML and the load() function in PyDML. As with saving, SystemML supports four formats: text (i,j,v), mm (Matrix Market), csv (delimiter-separated values), and binary. To read a file, a corresponding metadata file is required, except for the Matrix Market format.

Matrix Operations

DML and PyDML offer a rich set of operators and built-in functions to perform various operations on matrices and scalars. Operators and built-in functions are described in great detail in the DML Language Reference (Expressions, Built-In Functions) and the PyDML Language Reference (Expressions, Built-In Functions).

In this example, we create a matrix A. Next, we create another matrix B by adding 4 to each element in A, and then flip B by taking its transpose. We multiply A and B and store the product in matrix C. We create a matrix D with the same number of rows and columns as C and initialize its elements to 5. Finally, we subtract D from C, divide each element of the result by 2, and assign the resulting matrix to D.

This example also shows a user-defined function called printMatrix(), which takes a string and matrix as arguments and returns nothing.

DML:
A = matrix("1 2 3 4 5 6", rows=3, cols=2)
z = printMatrix('Matrix A:', A)
B = A + 4
B = t(B)
z = printMatrix('Matrix B:', B)
C = A %*% B
z = printMatrix('Matrix C:', C)
D = matrix(5, rows=nrow(C), cols=ncol(C))
D = (C - D) / 2
z = printMatrix('Matrix D:', D)

PyDML:
A = full("1 2 3 4 5 6", rows=3, cols=2)
z = printMatrix('Matrix A:', A)
B = A + 4
B = transpose(B)
z = printMatrix('Matrix B:', B)
C = dot(A, B)
z = printMatrix('Matrix C:', C)
D = full(5, rows=nrow(C), cols=ncol(C))
D = (C - D) / 2
z = printMatrix('Matrix D:', D)
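To make the arithmetic in this example easier to follow, the same steps can be traced in plain Python with nested lists. This is only an illustrative sketch, not SystemML code: it assumes the string "1 2 3 4 5 6" fills the 3x2 matrix row by row, and the `transpose` and `matmul` helpers are defined here for the sketch.

```python
def transpose(m):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two nested-list matrices (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Assumption: the six values fill the 3x2 matrix row by row.
A = [[1, 2], [3, 4], [5, 6]]

# Add 4 to every element, then transpose: B is 2x3.
B = transpose([[v + 4 for v in row] for row in A])

# Matrix product of a 3x2 and a 2x3 matrix: C is 3x3.
C = matmul(A, B)

# D starts as a 3x3 matrix of 5s, then becomes (C - D) / 2 element by element.
D = [[5] * len(C[0]) for _ in C]
D = [[(c - d) / 2 for c, d in zip(c_row, d_row)] for c_row, d_row in zip(C, D)]

for name, m in (("A", A), ("B", B), ("C", C), ("D", D)):
    print("Matrix " + name + ":", m)
```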

Matrix Indexing

The elements in a matrix can be accessed by their row and column indices. In the example below, we have a 3x3 matrix A. First, we access the element at the third row and third column. Next, we obtain a row slice (vector) of the matrix by specifying row 2 and leaving the column blank. We obtain a column slice (vector) by leaving the row blank and specifying column 3. After that, we obtain a submatrix via range indexing, where we specify rows 2 to 3 and columns 1 to 2, with the bounds of each range separated by a colon.

DML:
A = matrix("1 2 3 4 5 6 7 8 9", rows=3, cols=3)
z = printMatrix('Matrix A:', A)
B = A[3,3]
z = printMatrix('Matrix B:', B)
C = A[2,]
z = printMatrix('Matrix C:', C)
D = A[,3]
z = printMatrix('Matrix D:', D)
E = A[2:3,1:2]
z = printMatrix('Matrix E:', E)

PyDML:
A = full("1 2 3 4 5 6 7 8 9", rows=3, cols=3)
z = printMatrix('Matrix A:', A)
B = A[3,3]
z = printMatrix('Matrix B:', B)
C = A[2,]
z = printMatrix('Matrix C:', C)
D = A[,3]
z = printMatrix('Matrix D:', D)
E = A[2:3,1:2]
z = printMatrix('Matrix E:', E)
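For readers used to 0-based indexing, the sketch below mirrors the four lookups in plain Python, translating the 1-based, inclusive indices described above into Python's 0-based, end-exclusive slices. The helper names are hypothetical, and the sketch assumes the nine values fill the matrix row by row.

```python
def cell(m, i, j):
    """Element at 1-based row i, column j."""
    return m[i - 1][j - 1]


def row_slice(m, i):
    """Row i as a 1 x n matrix."""
    return [list(m[i - 1])]


def col_slice(m, j):
    """Column j as an n x 1 matrix."""
    return [[row[j - 1]] for row in m]


def submatrix(m, r1, r2, c1, c2):
    """Rows r1..r2 and columns c1..c2, both ranges inclusive and 1-based."""
    return [row[c1 - 1:c2] for row in m[r1 - 1:r2]]


# Assumption: the values 1..9 fill the 3x3 matrix row by row.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(cell(A, 3, 3))
print(row_slice(A, 2))
print(col_slice(A, 3))
print(submatrix(A, 2, 3, 1, 2))
```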

Control Statements

DML and PyDML both feature if and if-else conditional statements. In addition, DML features else-if, which avoids the need for nested conditional statements.

DML and PyDML feature three loop statements: while, for, and parfor (parallel for). In the example below, note that the print statements within the parfor loop can occur in any order, since the iterations run in parallel rather than sequentially as in a regular for loop. The parfor statement can include several optional parameters, as described in the DML Language Reference (ParFor Statement) and PyDML Language Reference (ParFor Statement).

DML:
A = matrix("1 2 3 4 5 6", rows=3, cols=2)

for (i in 1:nrow(A)) {
    print("for A[" + i + ",1]:" + as.scalar(A[i,1]))
}

parfor(i in 1:nrow(A)) {
    print("parfor A[" + i + ",1]:" + as.scalar(A[i,1]))
}

PyDML:
A = full("1 2 3 4 5 6", rows=3, cols=2)

for (i in 1:nrow(A)):
    print("for A[" + i + ",1]:" + scalar(A[i,1]))

parfor(i in 1:nrow(A)):
    print("parfor A[" + i + ",1]:" + scalar(A[i,1]))
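The difference in ordering between for and parfor can be imitated in plain Python with a thread pool. This is only an analogy and says nothing about how SystemML actually executes parfor; the `first_column` helper is hypothetical, and the matrix values assume row-by-row filling.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Assumption: the six values fill the 3x2 matrix row by row.
A = [[1, 2], [3, 4], [5, 6]]


def first_column(i):
    """Return the 1-based row index together with the value in column 1."""
    return i, float(A[i - 1][0])


# A regular loop visits the rows strictly in order.
sequential = [first_column(i) for i in range(1, len(A) + 1)]

# Tasks submitted to a pool may finish in any order, much like parfor iterations.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(first_column, i) for i in range(1, len(A) + 1)]
    parallel = [f.result() for f in as_completed(futures)]

for i, v in sequential:
    print(f"for A[{i},1]:{v}")
for i, v in parallel:
    print(f"parfor A[{i},1]:{v}")
```

Both loops produce the same set of results; only the order in which the parallel results arrive is not guaranteed.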

User-Defined Functions

Functions encapsulate useful functionality in SystemML. In addition to built-in functions, users can define their own functions. Functions take 0 or more parameters and return 0 or more values. Currently, if a function returns nothing, it still needs to be assigned to a variable.

DML:
A = rand(rows=3, cols=2, min=0, max=2) # random 3x2 matrix with values 0 to 2
B = doSomething(A)
write(A, "A.csv", format="csv")
write(B, "B.csv", format="csv")

PyDML:
A = rand(rows=3, cols=2, min=0, max=2) # random 3x2 matrix with values 0 to 2
B = doSomething(A)
save(A, "A.csv", format="csv")
save(B, "B.csv", format="csv")

In the above example, a 3x2 matrix of random doubles between 0 and 2 is created using the rand() function. Additional parameters can be passed to rand() to control sparsity and other matrix characteristics.

Matrix A is passed to the doSomething function, which concatenates four columns to it in turn: a column of 1 values, a column consisting of the values (0, 1, 2), a column of the maximum row values, and a column of the row sums. The resulting matrix is returned and assigned to variable B. Matrix A is then written to the A.csv file and matrix B to the B.csv file.
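The column concatenations described here can be sketched in plain Python as follows. This is not the DML function itself: `do_something` is a hypothetical stand-in, a fixed input replaces the random matrix so the result is reproducible, and the sketch assumes that the row maximum and row sum are each computed over the matrix as built up so far, including the columns already appended.

```python
def do_something(mat):
    """Append four columns to a nested-list matrix, one step at a time."""
    # Step 1: a column of 1 values.
    ret = [row + [1.0] for row in mat]
    # Step 2: a column holding 0, 1, 2, ... (one value per row).
    ret = [row + [float(k)] for k, row in enumerate(ret)]
    # Step 3: the maximum of each row (assumption: taken over all columns so far).
    ret = [row + [max(row)] for row in ret]
    # Step 4: the sum of each row (assumption: taken over all columns so far).
    ret = [row + [sum(row)] for row in ret]
    return ret


# Fixed sample values in the range 0 to 2, standing in for the random matrix.
A = [[0.5, 1.5], [2.0, 0.0], [1.0, 1.0]]
B = do_something(A)
for row in B:
    print(row)
```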

Command-Line Arguments and Default Values

Command-line arguments can be passed to DML and PyDML scripts either as named arguments or as positional arguments. Named arguments are the preferred technique. Named arguments are passed with the -nvargs switch, and positional arguments are passed with the -args switch.

Default values can be set using the ifdef() function.

In the example below, a matrix is read from the file system using named argument M. The number of rows to print is specified using the rowsToPrint argument, which defaults to 2 if no argument is supplied. Likewise, the number of columns is specified using colsToPrint with a default value of 2.

DML:
fileM = $M

numRowsToPrint = ifdef($rowsToPrint, 2) # default to 2
numColsToPrint = ifdef($colsToPrint, 2) # default to 2

m = read(fileM)

for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}

PyDML:
fileM = $M

numRowsToPrint = ifdef($rowsToPrint, 2) # default to 2
numColsToPrint = ifdef($colsToPrint, 2) # default to 2

m = load(fileM)

for (i in 1:numRowsToPrint):
    for (j in 1:numColsToPrint):
        print('[' + i + ',' + j + ']:' + scalar(m[i,j]))

Example #1 Results:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0

Example #2 Arguments:
-f ex.dml -nvargs M=M.txt

Example #2 Results:
[1,1]:1.0
[1,2]:2.0
[2,1]:0.0
[2,2]:0.0

Example #1 Results:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0

Example #2 Arguments:
-f ex.pydml -python -nvargs M=M.txt

Example #2 Results:
[1,1]:1.0
[1,2]:2.0
[2,1]:0.0
[2,2]:0.0
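The way named arguments fall back to default values can be imitated in plain Python with a dictionary lookup. The sketch below is not SystemML's argument parser; `parse_nvargs` and this Python `ifdef` are hypothetical helpers that only mimic the behavior described above, and the second argument list is a made-up input for illustration.

```python
def parse_nvargs(pairs):
    """Turn ['name=value', ...] into a dictionary of named arguments."""
    return dict(pair.split("=", 1) for pair in pairs)


def ifdef(args, name, default):
    """Return the named argument if it was supplied, otherwise the default."""
    return args.get(name, default)


# Only M is supplied, so both counts fall back to their default of 2.
args = parse_nvargs(["M=M.txt"])
file_m = args["M"]
rows = int(ifdef(args, "rowsToPrint", 2))
cols = int(ifdef(args, "colsToPrint", 2))
print(file_m, rows, cols)

# Hypothetical input that supplies all three named arguments.
full_args = parse_nvargs(["M=M.txt", "rowsToPrint=1", "colsToPrint=3"])
full_rows = int(ifdef(full_args, "rowsToPrint", 2))
full_cols = int(ifdef(full_args, "colsToPrint", 2))
print(full_rows, full_cols)
```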

Here, we see identical functionality but with positional arguments.

DML:
fileM = $1

numRowsToPrint = ifdef($2, 2) # default to 2
numColsToPrint = ifdef($3, 2) # default to 2

m = read(fileM)

for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}

PyDML:
fileM = $1

numRowsToPrint = ifdef($2, 2) # default to 2
numColsToPrint = ifdef($3, 2) # default to 2

m = load(fileM)

for (i in 1:numRowsToPrint):
    for (j in 1:numColsToPrint):
        print('[' + i + ',' + j + ']:' + scalar(m[i,j]))

Example #1 Results:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0

Example #2 Arguments:
-f ex.dml -args M.txt

Example #2 Results:
[1,1]:1.0
[1,2]:2.0
[2,1]:0.0
[2,2]:0.0

Example #1 Results:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0

Example #2 Arguments:
-f ex.pydml -python -args M.txt

Example #2 Results:
[1,1]:1.0
[1,2]:2.0
[2,1]:0.0
[2,2]:0.0

Additional Information

The DML Language Reference and PyDML Language Reference contain detailed information regarding DML and PyDML.

In addition, many excellent examples of DML and PyDML can be found in the system-ml/scripts and system-ml/test/scripts/applications directories.