---
layout: global
title: Beginner's Guide for Python Users
description: Beginner's Guide for Python Users
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->
* This will become a table of contents (this text will be scraped).
{:toc}
<br/>
## Introduction
SystemDS enables flexible, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors,
one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).
Algorithm scripts written in DML and PyDML can be run on Hadoop, on Spark, or in Standalone mode.
No script modifications are required to change between modes. SystemDS automatically performs advanced optimizations
based on data and cluster characteristics, so the need to manually tweak algorithms is largely reduced or eliminated.
To understand more about DML and PyDML, we recommend that you read [Beginner's Guide to DML and PyDML](https://apache.github.io/systemml/beginners-guide-to-dml-and-pydml.html).
For the convenience of Python users, SystemDS exposes several language-level APIs that allow them to use SystemDS
and its algorithms without the need to know DML or PyDML. We explain these APIs in the sections below with example use cases.
## Download & Setup
Before you get started on SystemDS, make sure that your environment is set up and ready to go.
### Install Java (Java 8 is required) and Apache Spark
If you already have an Apache Spark installation, you can skip this step.
<div class="codetabs">
<div data-lang="OSX" markdown="1">
```bash
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew tap caskroom/versions
brew cask install java8
brew install apache-spark
```
</div>
<div data-lang="Linux" markdown="1">
```bash
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"
brew tap caskroom/versions
brew cask install java8
brew install apache-spark
```
</div>
</div>
### Install SystemDS
To install the released version of SystemDS, use the following commands:
<div class="codetabs">
<div data-lang="Python 2" markdown="1">
```bash
pip install systemml
```
</div>
<div data-lang="Python 3" markdown="1">
```bash
pip3 install systemml
```
</div>
</div>
If you want to try out the bleeding-edge version, use the following commands:
<div class="codetabs">
<div data-lang="Python 2" markdown="1">
```bash
git clone https://github.com/apache/systemml.git
cd systemml
mvn clean package -P distribution
pip install target/systemml-1.0.0-SNAPSHOT-python.tar.gz
```
</div>
<div data-lang="Python 3" markdown="1">
```bash
git clone https://github.com/apache/systemml.git
cd systemml
mvn clean package -P distribution
pip3 install target/systemml-1.0.0-SNAPSHOT-python.tar.gz
```
</div>
</div>
### Uninstall SystemDS
To uninstall SystemDS, use the following command:
<div class="codetabs">
<div data-lang="Python 2" markdown="1">
```bash
pip uninstall systemml
```
</div>
<div data-lang="Python 3" markdown="1">
```bash
pip3 uninstall systemml
```
</div>
</div>
### Start Pyspark shell
<div class="codetabs">
<div data-lang="Python 2" markdown="1">
```bash
pyspark
```
</div>
<div data-lang="Python 3" markdown="1">
```bash
PYSPARK_PYTHON=python3 pyspark
```
</div>
</div>
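Once the shell is up, a quick smoke test confirms that SystemDS is importable and can execute a script on Spark. This minimal sketch only uses the `sc` SparkContext provided by the pyspark shell and the MLContext API covered at the end of this guide:
```python
import systemml as sml

# Create an MLContext against the shell's SparkContext and run a one-line DML script
ml = sml.MLContext(sc)
script = sml.dml("x = sum(seq(1, 10))").output("x")
print(ml.execute(script).get("x"))   # should print 55.0
```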
---
## Matrix operations
To get started with SystemDS, let's try a few elementary matrix operations:
```python
import systemml as sml
import numpy as np
m1 = sml.matrix(np.ones((3,3)) + 2)
m2 = sml.matrix(np.ones((3,3)) + 3)
m3 = m1 * (m2 + m1)
m4 = 1.0 - m3
m4.sum(axis=1).toNumPy()
```
Output:
```python
array([[-60.],
[-60.],
[-60.]])
```
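Note that `*` on `sml.matrix` objects is an element-wise multiplication, just as in NumPy (matrix multiplication is done with `dot`, as shown in the next example). For comparison, the equivalent NumPy computation produces the same result:
```python
import numpy as np

m1 = np.ones((3,3)) + 2
m2 = np.ones((3,3)) + 3
m3 = m1 * (m2 + m1)                     # element-wise product, as in the SystemDS script above
m4 = 1.0 - m3
print(m4.sum(axis=1, keepdims=True))    # [[-60.], [-60.], [-60.]]
```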
Let us now write a simple script to train a [linear regression](https://apache.github.io/systemml/algorithms-regression.html#linear-regression)
model: $ \beta = solve(X^T X, X^T y) $. For simplicity, we will use the direct-solve method and ignore the
regularization parameter as well as the intercept.
```python
import numpy as np
from sklearn import datasets
import systemml as sml
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
X_train = diabetes_X[:-20]
X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
y_train = diabetes.target[:-20]
y_test = diabetes.target[-20:]
# Train Linear Regression model
X = sml.matrix(X_train)
y = sml.matrix(np.matrix(y_train).T)
A = X.transpose().dot(X)
b = X.transpose().dot(y)
beta = sml.solve(A, b).toNumPy()
y_predicted = X_test.dot(beta)
print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
```
Output:
```bash
Residual sum of squares: 25282.12
```
We can improve the residual error by adding an intercept and a regularization parameter. To do so, we
will use the `mllearn` API described in the next section.
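For readers curious what that looks like with the matrix API directly, here is a minimal sketch that folds an intercept column and an L2 regularizer into the normal equations. It reuses `X_train`, `X_test`, `y_train`, and `y_test` from the script above; the value `lam = 0.1` is an arbitrary illustrative choice, not what `mllearn` uses internally:
```python
import numpy as np
import systemml as sml

lam = 0.1  # illustrative regularization strength

# Append a column of ones so the last coefficient acts as the intercept
X_train_i = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_test_i  = np.hstack([X_test,  np.ones((X_test.shape[0], 1))])

# Ridge term: penalize all weights except the intercept
reg = lam * np.eye(X_train_i.shape[1])
reg[-1, -1] = 0.0

X = sml.matrix(X_train_i)
y = sml.matrix(np.matrix(y_train).T)
A = X.transpose().dot(X) + sml.matrix(reg)
b = X.transpose().dot(y)
beta = sml.solve(A, b).toNumPy()

y_predicted = X_test_i.dot(beta).flatten()
print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
```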
---
## Invoke SystemDS's algorithms
SystemDS also exposes a subpackage, [mllearn](https://apache.github.io/systemml/python-reference#mllearn-api), which allows Python users to invoke SystemDS algorithms
using the scikit-learn or MLPipeline APIs.
### Scikit-learn interface
In the example below, we invoke SystemDS's [Linear Regression](https://apache.github.io/systemml/algorithms-regression.html#linear-regression)
algorithm.
```python
import numpy as np
from sklearn import datasets
from systemml.mllearn import LinearRegression
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
X_train = diabetes_X[:-20]
X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
y_train = diabetes.target[:-20]
y_test = diabetes.target[-20:]
# Create linear regression object
regr = LinearRegression(spark, fit_intercept=True, C=float("inf"), solver='direct-solve')
# Train the model using the training sets
regr.fit(X_train, y_train)
y_predicted = regr.predict(X_test)
print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
```
Output:
```bash
Residual sum of squares: 6991.17
```
As expected, by adding an intercept and a regularizer, the residual error drops significantly.
Here is another example, where we invoke SystemDS's [Logistic Regression](https://apache.github.io/systemml/algorithms-classification.html#multinomial-logistic-regression)
algorithm on the digits dataset.
```python
# Scikit-learn way
from sklearn import datasets, neighbors
from systemml.mllearn import LogisticRegression
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
n_samples = len(X_digits)
X_train = X_digits[:int(.9 * n_samples)]
y_train = y_digits[:int(.9 * n_samples)]
X_test = X_digits[int(.9 * n_samples):]
y_test = y_digits[int(.9 * n_samples):]
logistic = LogisticRegression(spark)
print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
```
Output:
```bash
LogisticRegression score: 0.927778
```
You can also save the trained model and load it later for prediction:
```python
# Assuming logistic.fit(X_train, y_train) is already invoked
logistic.save('logistic_model')
new_logistic = LogisticRegression(spark)
new_logistic.load('logistic_model')
print('LogisticRegression score: %f' % new_logistic.score(X_test, y_test))
```
### Passing PySpark DataFrame
To train the above algorithm on a larger dataset, we can load the dataset into a PySpark DataFrame and pass it to the `fit` method:
```python
from sklearn import datasets
from systemml.mllearn import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score
import systemml as sml
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
n_samples = len(X_digits)
# Split the data into training/testing sets and convert to PySpark DataFrame
df_train = sml.convertToLabeledDF(sqlCtx, X_digits[:int(.9 * n_samples)], y_digits[:int(.9 * n_samples)])
X_test = spark.createDataFrame(pd.DataFrame(X_digits[int(.9 * n_samples):]))
logistic = LogisticRegression(spark)
logistic.fit(df_train)
y_predicted = logistic.predict(X_test)
y_predicted = y_predicted.select('prediction').toPandas().values.flatten()
y_test = y_digits[int(.9 * n_samples):]
print('LogisticRegression score: %f' % accuracy_score(y_test, y_predicted))
```
Output:
```bash
LogisticRegression score: 0.922222
```
### MLPipeline interface
In the example below, we demonstrate how the same `LogisticRegression` class allows SystemDS to fit seamlessly into
large data pipelines.
```python
# MLPipeline way
from pyspark.ml import Pipeline
from systemml.mllearn import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 2.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 2.0),
(4, "b spark who", 1.0),
(5, "g d a y", 2.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 2.0),
(8, "e spark program", 1.0),
(9, "a e c l", 2.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 2.0)
], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
lr = LogisticRegression(sqlCtx)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
test = spark.createDataFrame([
(12, "spark i j k"),
(13, "l m n"),
(14, "mapreduce spark"),
(15, "apache hadoop")], ["id", "text"])
prediction = model.transform(test)
prediction.show()
```
Output:
```bash
+-------+---+---------------+------------------+--------------------+--------------------+----------+
|__INDEX| id| text| words| features| probability|prediction|
+-------+---+---------------+------------------+--------------------+--------------------+----------+
| 1.0| 12| spark i j k| [spark, i, j, k]|(20,[5,6,7],[2.0,...|[0.99999999999975...| 1.0|
| 2.0| 13| l m n| [l, m, n]|(20,[8,9,10],[1.0...|[1.37552128844736...| 2.0|
| 3.0| 14|mapreduce spark|[mapreduce, spark]|(20,[5,10],[1.0,1...|[0.99860290938153...| 1.0|
| 4.0| 15| apache hadoop| [apache, hadoop]|(20,[9,14],[1.0,1...|[5.41688748236143...| 2.0|
+-------+---+---------------+------------------+--------------------+--------------------+----------+
```
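The `prediction` output is a regular PySpark DataFrame, so the usual DataFrame operations apply if you only need some of the columns shown above, for example:
```python
# Keep only the id, the input text, and the predicted label
prediction.select("id", "text", "prediction").show(truncate=False)
```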
---
## Invoking DML/PyDML scripts using MLContext
The example below demonstrates how to invoke the algorithm script [scripts/algorithms/MultiLogReg.dml](https://github.com/apache/systemml/blob/master/scripts/algorithms/MultiLogReg.dml)
using the Python [MLContext API](https://apache.github.io/systemml/spark-mlcontext-programming-guide).
```python
from sklearn import datasets
from pyspark.sql import SQLContext
import systemml as sml
import pandas as pd
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target + 1  # shift labels to 1-based, as expected by MultiLogReg.dml
n_samples = len(X_digits)
# Split the data into training/testing sets and convert to PySpark DataFrame
X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:int(.9 * n_samples)]))
y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:int(.9 * n_samples)]))
ml = sml.MLContext(sc)
# Run the MultiLogReg.dml script at the given URL
scriptUrl = "https://raw.githubusercontent.com/apache/systemml/master/scripts/algorithms/MultiLogReg.dml"
script = sml.dml(scriptUrl).input(X=X_df, Y_vec=y_df).output("B_out")
beta = ml.execute(script).get('B_out').toNumPy()
```
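MLContext is not limited to scripts hosted at a URL or on disk; a DML snippet can also be passed as a plain string. The tiny script below is purely illustrative (it is not part of the SystemDS distribution, and it assumes MLContext accepts NumPy arrays as inputs; a DataFrame as in the example above can be passed instead):
```python
import numpy as np
import systemml as sml

ml = sml.MLContext(sc)

# Inline DML: scale the input matrix by alpha and return its column sums
scriptCode = """
Xs = X * alpha
s = colSums(Xs)
"""
script = sml.dml(scriptCode).input(X=np.ones((3, 3)), alpha=2.0).output("s")
s = ml.execute(script).get("s").toNumPy()
print(s)   # each column should sum to 6.0
```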