This tutorial provides a quick introduction to using SystemML by running existing SystemML algorithms in standalone mode.
SystemML enables large-scale machine learning (ML) via a high-level declarative language with R-like syntax called DML and Python-like syntax called PyDML. DML and PyDML allow data scientists to express their ML algorithms with full flexibility but without the need to fine-tune distributed runtime execution plans and system configurations. These ML programs are dynamically compiled and optimized based on data and cluster characteristics using rule-based and cost-based optimization techniques. The compiler automatically generates hybrid runtime execution plans ranging from in-memory, single node execution to distributed computation for Hadoop or Spark Batch execution. SystemML features a suite of algorithms for Descriptive Statistics, Classification, Clustering, Regression, Matrix Factorization, and Survival Analysis. Detailed descriptions of these algorithms can be found in the Algorithms Reference.
Apache incubator releases of SystemML are available from the downloads page.
The SystemML project is available on GitHub at https://github.com/apache/incubator-systemml. SystemML can be downloaded from GitHub and built with Maven. Instructions to build and test SystemML can be found in the SystemML GitHub README.
SystemML's standalone mode is designed to allow data scientists to rapidly prototype algorithms on a single machine. The standalone release packages all required libraries into a single distribution file. In standalone mode, all operations occur on a single node in a non-Hadoop environment. Standalone mode is not appropriate for large datasets.
For large-scale production environments, SystemML algorithm execution can be distributed across multi-node clusters using Apache Hadoop or Apache Spark. We will make use of standalone mode throughout this tutorial.
To follow along with this guide, first build a standalone package of SystemML using Apache Maven and unpack it.
$ git clone https://github.com/apache/incubator-systemml.git
$ cd incubator-systemml
$ mvn clean package -P distribution
$ tar -xvzf target/systemml-*-standalone.tar.gz -C ..
$ cd ..
The extracted package should have these contents:
$ ls -lF systemml-{{site.SYSTEMML_VERSION}}/
total 96
-rw-r--r-- LICENSE
-rw-r--r-- NOTICE
-rw-r--r-- SystemML-config.xml
drwxr-xr-x docs/
drwxr-xr-x lib/
-rw-r--r-- log4j.properties
-rw-r--r-- readme.txt
-rwxr-xr-x runStandaloneSystemML.bat*
-rwxr-xr-x runStandaloneSystemML.sh*
drwxr-xr-x scripts/

For the rest of the tutorial we will switch to the systemml-{{site.SYSTEMML_VERSION}} directory.
$ cd ~/systemml-{{site.SYSTEMML_VERSION}}
Note that standalone mode supports both Mac/UNIX and Windows. To run the following examples on Windows, replace the “./runStandaloneSystemML.sh ...” commands with “./runStandaloneSystemML.bat ...” commands.

In this tutorial we will use Haberman's Survival Data Set, which can be downloaded in CSV format from the Center for Machine Learning and Intelligent Systems:
$ wget -P data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
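The Haberman Data Set has 306 instances and 4 attributes (including the class attribute):

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute):
   * 1 = the patient survived 5 years or longer
   * 2 = the patient died within 5 years

We will need to create a metadata file (MTD) which stores metadata information about the content of the data file. The name of the MTD file associated with the data file <filename> must be <filename>.mtd.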
$ echo '{"rows": 306, "cols": 4, "format": "csv"}' > data/haberman.data.mtd
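Instead of typing the row and column counts by hand, the metadata file can be derived from the data file itself. The following Python sketch assumes a comma-separated file without a header row; the helper name `write_mtd` is hypothetical and not part of SystemML.

```python
import csv
import json


def write_mtd(data_path):
    """Count rows and columns of a headerless CSV file and write <data_path>.mtd.

    Hypothetical helper for illustration; not part of SystemML.
    """
    with open(data_path, newline="") as f:
        rows = [row for row in csv.reader(f) if row]  # skip blank lines
    if not rows:
        raise ValueError("empty data file: " + data_path)
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("rows have differing column counts")
    # Same three keys as the metadata file created with echo above.
    meta = {"rows": len(rows), "cols": n_cols, "format": "csv"}
    with open(data_path + ".mtd", "w") as f:
        json.dump(meta, f)
    return meta
```

Calling `write_mtd("data/haberman.data")` on the downloaded file should produce the same values as the command above, provided the file has the 306 rows and 4 columns described earlier.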
Let's start with a simple example, computing certain univariate statistics for each feature column using the algorithm Univar-Stats.dml, which requires 3 arguments:

* X: location of the input data file to analyze
* TYPES: location of the file that contains the feature column types encoded by integer numbers: 1 = scale, 2 = nominal, 3 = ordinal
* STATS: location where the output matrix of computed statistics will be stored

We need to create a file types.csv that describes the type of each column in the data, along with its metadata file types.csv.mtd.

$ echo '1,1,1,2' > data/types.csv
$ echo '{"rows": 1, "cols": 4, "format": "csv"}' > data/types.csv.mtd

To run the Univar-Stats.dml algorithm, issue the following command:
$ ./runStandaloneSystemML.sh scripts/algorithms/Univar-Stats.dml -nvargs X=data/haberman.data TYPES=data/types.csv STATS=data/univarOut.mtx
The resulting matrix has one row per univariate statistic and one column per input feature. The output file univarOut.mtx stores that matrix as one entry per line: the first column denotes the number of the statistic, the second column refers to the number of the feature column in the input data, and the third column shows the value of the univariate statistic.
1 1 30.0
1 2 58.0
2 1 83.0
2 2 69.0
2 3 52.0
3 1 53.0
3 2 11.0
3 3 52.0
4 1 52.45751633986928
4 2 62.85294117647059
4 3 4.026143790849673
5 1 116.71458266366658
5 2 10.558630665380907
5 3 51.691117539912135
6 1 10.803452349303281
6 2 3.2494046632238507
6 3 7.189653506248555
7 1 0.6175922641866753
7 2 0.18575610076612029
7 3 0.41100513466216837
8 1 0.20594669940735139
8 2 0.051698529971741194
8 3 1.7857418611299172
9 1 0.1450718616532357
9 2 0.07798443581479181
9 3 2.954633471088322
10 1 -0.6150152487211726
10 2 -1.1324380182967442
10 3 11.425776549251449
11 1 0.13934809593495995
11 2 0.13934809593495995
11 3 0.13934809593495995
12 1 0.277810485320835
12 2 0.277810485320835
12 3 0.277810485320835
13 1 52.0
13 2 63.0
13 3 1.0
14 1 52.16013071895425
14 2 62.80392156862745
14 3 1.2483660130718954
15 4 2.0
16 4 1.0
17 4 1.0
The following table lists the number and name of each univariate statistic. The row numbers below correspond to the elements of the first column in the output matrix above. A “+” sign shows whether a statistic applies to scale features, to categorical features, or to both.
Row | Name of Statistic | Scale | Categ. |
---|---|---|---|
1 | Minimum | + | |
2 | Maximum | + | |
3 | Range | + | |
4 | Mean | + | |
5 | Variance | + | |
6 | Standard deviation | + | |
7 | Standard error of mean | + | |
8 | Coefficient of variation | + | |
9 | Skewness | + | |
10 | Kurtosis | + | |
11 | Standard error of skewness | + | |
12 | Standard error of kurtosis | + | |
13 | Median | + | |
14 | Inter quartile mean | + | |
15 | Number of categories | | + |
16 | Mode | | + |
17 | Number of modes | | + |
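To read the output more comfortably, the triples can be grouped by feature column and labeled with the statistic names from the table. The Python sketch below assumes the whitespace-separated "statistic feature value" layout shown above; `parse_stats` and `STAT_NAMES` are hypothetical names introduced for illustration only.

```python
# Statistic numbers and names, copied from the table above.
STAT_NAMES = {
    1: "Minimum", 2: "Maximum", 3: "Range", 4: "Mean", 5: "Variance",
    6: "Standard deviation", 7: "Standard error of mean",
    8: "Coefficient of variation", 9: "Skewness", 10: "Kurtosis",
    11: "Standard error of skewness", 12: "Standard error of kurtosis",
    13: "Median", 14: "Inter quartile mean", 15: "Number of categories",
    16: "Mode", 17: "Number of modes",
}


def parse_stats(text):
    """Parse 'statistic feature value' triples.

    Returns {feature_column: {statistic_name: value}}. Entries that do not
    appear in the text are simply absent from the result.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        stat, col, value = int(parts[0]), int(parts[1]), float(parts[2])
        result.setdefault(col, {})[STAT_NAMES[stat]] = value
    return result
```

For example, `parse_stats(open("data/univarOut.mtx").read())[1]["Mean"]` would look up the mean of the first feature column, assuming the output was written as a single text file at that path.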
Let's take the same haberman.data to explore the binary-class support vector machines algorithm l2-svm.dml. This example also illustrates how to use the sampling algorithm sample.dml and the data split algorithm splitXY.dml.

First we need to use the sample.dml algorithm to separate the input into one training data set and one data set for model prediction.
Parameters:

* X: (input) filename of the input data set
* sv: (input) sampling vector: filename of a 1-column vector with percentages (given as fractions); sum(sv) must be 1
* O: (output) name of the folder in which the generated samples are stored
* ofmt: (output) format of O: “csv”, “binary” (default)

We will create the files perc.csv and perc.csv.mtd to define the sampling vector with a sampling rate of 50% to generate 2 data sets:
$ printf "0.5\n0.5" > data/perc.csv
$ echo '{"rows": 2, "cols": 1, "format": "csv"}' > data/perc.csv.mtd
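Because sum(sv) must be 1, it can help to validate the fractions before writing the two files. The following Python sketch does that check and then writes the vector and its metadata; `write_sampling_vector` is a hypothetical helper, not part of SystemML.

```python
import json
import math


def write_sampling_vector(path, fractions):
    """Write a 1-column sampling vector as CSV plus its .mtd metadata file.

    Hypothetical helper for illustration; not part of SystemML.
    """
    if not fractions or any(f <= 0 for f in fractions):
        raise ValueError("fractions must be a non-empty list of positive numbers")
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1, got %r" % sum(fractions))
    # One fraction per line, matching the printf command above.
    with open(path, "w") as f:
        f.write("\n".join(str(x) for x in fractions))
    with open(path + ".mtd", "w") as f:
        json.dump({"rows": len(fractions), "cols": 1, "format": "csv"}, f)
```

With this helper, `write_sampling_vector("data/perc.csv", [0.5, 0.5])` would write the same two files as the commands above, while a vector such as `[0.5, 0.6]` would be rejected.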
Let's run the sampling algorithm to create the two data samples:
$ ./runStandaloneSystemML.sh scripts/utils/sample.dml -nvargs X=data/haberman.data sv=data/perc.csv O=data/haberman.part ofmt="csv"
Next we use the splitXY.dml algorithm to separate the feature columns from the label column(s).
Parameters:

* X: (input) filename of data matrix
* y: (input) column index of the label column y; indexing starts at 1
* OX: (output) filename of output matrix with all columns except y
* OY: (output) filename of output matrix with y column
* ofmt: (output) format of OX and OY output matrix: “csv”, “binary” (default)

We specify y=4 as the 4th column contains the labels to be predicted and run the splitXY.dml algorithm on our training and test data sets.
$ ./runStandaloneSystemML.sh scripts/utils/splitXY.dml -nvargs X=data/haberman.part/1 y=4 OX=data/haberman.train.data.csv OY=data/haberman.train.labels.csv ofmt="csv"
$ ./runStandaloneSystemML.sh scripts/utils/splitXY.dml -nvargs X=data/haberman.part/2 y=4 OX=data/haberman.test.data.csv OY=data/haberman.test.labels.csv ofmt="csv"
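Conceptually, the split moves column y into its own one-column matrix and keeps the remaining columns as features. The plain-Python sketch below illustrates that effect on a small in-memory matrix with illustrative values; it is not how splitXY.dml is implemented, and it only handles a single label column.

```python
def split_xy(rows, y):
    """Split rows into (features, labels) using the 1-based column index y.

    Illustration only: features keep every column except y, labels hold
    column y as a one-column matrix.
    """
    if not rows:
        return [], []
    if not 1 <= y <= len(rows[0]):
        raise ValueError("y must be between 1 and the number of columns")
    features = [row[:y - 1] + row[y:] for row in rows]
    labels = [[row[y - 1]] for row in rows]
    return features, labels


# Two illustrative rows with 3 feature columns and the label in column 4.
features, labels = split_xy([[30, 64, 1, 1], [34, 59, 0, 2]], y=4)
```

Here `features` holds the first three columns of each row and `labels` holds the fourth, mirroring the roles of OX and OY.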
Now we need to train our model using the l2-svm.dml algorithm.
Parameters:

* X: (input) filename of training data features
* Y: (input) filename of training data labels
* model: (output) filename of model that contains the learnt weights
* fmt: (output) format of model: “csv”, “text” (sparse-matrix)
* Log: (output) log file for metrics and progress while training
* confusion: (output) filename of confusion matrix computed using a held-out test set (optional)

The l2-svm.dml algorithm is used on our training data sample to train the model.
$ ./runStandaloneSystemML.sh scripts/algorithms/l2-svm.dml -nvargs X=data/haberman.train.data.csv Y=data/haberman.train.labels.csv model=data/l2-svm-model.csv fmt="csv" Log=data/l2-svm-log.csv
The l2-svm-predict.dml algorithm is used on our test data sample to predict the labels based on the trained model.
$ ./runStandaloneSystemML.sh scripts/algorithms/l2-svm-predict.dml -nvargs X=data/haberman.test.data.csv Y=data/haberman.test.labels.csv model=data/l2-svm-model.csv fmt="csv" confusion=data/l2-svm-confusion.csv
The console output should show the accuracy of the trained model in percent, e.g.:
15/09/01 01:32:51 INFO api.DMLScript: BEGIN DML run 09/01/2015 01:32:51
15/09/01 01:32:51 INFO conf.DMLConfig: Updating localtmpdir with value /tmp/systemml
15/09/01 01:32:51 INFO conf.DMLConfig: Updating scratch with value scratch_space
15/09/01 01:32:51 INFO conf.DMLConfig: Updating optlevel with value 2
15/09/01 01:32:51 INFO conf.DMLConfig: Updating numreducers with value 10
15/09/01 01:32:51 INFO conf.DMLConfig: Updating jvmreuse with value false
15/09/01 01:32:51 INFO conf.DMLConfig: Updating defaultblocksize with value 1000
15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.appmaster with value false
15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.appmaster.mem with value 2048
15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.mapreduce.mem with value 2048
15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.app.queue with value default
15/09/01 01:32:51 INFO conf.DMLConfig: Updating cp.parallel.matrixmult with value true
15/09/01 01:32:51 INFO conf.DMLConfig: Updating cp.parallel.textio with value true
Accuracy (%): 74.14965986394557
15/09/01 01:32:52 INFO api.DMLScript: SystemML Statistics:
Total execution time: 0.130 sec.
Number of executed MR Jobs: 0.
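When the prediction step is run from another script, the accuracy can be picked out of the captured console text. The Python sketch below assumes the "Accuracy (%): <number>" line format shown in the output above; `extract_accuracy` is a hypothetical helper name.

```python
import re

# Matches the "Accuracy (%): 74.14..." line of the console output.
ACCURACY_RE = re.compile(r"Accuracy \(%\):\s*([0-9]+(?:\.[0-9]+)?)")


def extract_accuracy(console_output):
    """Return the reported accuracy in percent as a float, or None if absent."""
    match = ACCURACY_RE.search(console_output)
    return float(match.group(1)) if match else None
```

The function returns `None` when no accuracy line is present, so a caller can tell a failed run apart from a low score.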
The generated file l2-svm-confusion.csv should contain a confusion matrix of this form:
|t1 t2|
|t3 t4|
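* t1 and t4 (the diagonal) count the instances whose label the model predicted correctly
* t2 and t3 (the off-diagonal) count the instances whose label the model predicted incorrectly

If the confusion matrix looks like this ...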
107.0,38.0
0.0,2.0
... then the accuracy of the model is (t1+t4)/(t1+t2+t3+t4) = (107+2)/(107+38+0+2) = 0.741496599
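The same arithmetic can be checked programmatically. The Python sketch below assumes a 2x2 confusion matrix stored as comma-separated text, as in the example above; `accuracy_from_confusion` is a hypothetical helper name.

```python
def accuracy_from_confusion(csv_text):
    """Compute (t1 + t4) / (t1 + t2 + t3 + t4) from a 2x2 matrix in CSV text."""
    rows = [
        [float(value) for value in line.split(",")]
        for line in csv_text.splitlines()
        if line.strip()
    ]
    if len(rows) != 2 or any(len(row) != 2 for row in rows):
        raise ValueError("expected a 2x2 confusion matrix")
    (t1, t2), (t3, t4) = rows
    total = t1 + t2 + t3 + t4
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (t1 + t4) / total


# The example matrix from above: 109 correct predictions out of 147.
accuracy = accuracy_from_confusion("107.0,38.0\n0.0,2.0\n")
```

Multiplying the result by 100 gives the percentage that the console output reports.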
Refer to the Algorithms Reference for more details.
If you encounter a "java.lang.OutOfMemoryError" you can edit the invocation script (runStandaloneSystemML.sh or runStandaloneSystemML.bat) to increase the memory available to the JVM, e.g.:
java -Xmx16g -Xms4g -Xmn1g -cp ${CLASSPATH} org.apache.sysml.api.DMLScript \
  -f ${SCRIPT_FILE} -exec singlenode -config=SystemML-config.xml \
  $@