docs/beginners-guide-python.md - systemds - Git at Google

 ---
 layout: global
 title: Beginner's Guide for Python users
 description: Beginner's Guide for Python users
 ---
 <!--
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to you under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 {% endcomment %}
 -->

 * This will become a table of contents (this text will be scraped).
 {:toc}

 <br/>

 ## Introduction

 SystemML enables flexible, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors,
 one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).

 Algorithm scripts written in DML and PyDML can be run on Hadoop, on Spark, or in Standalone mode.
 No script modifications are required to change between modes. SystemML automatically performs advanced optimizations
 based on data and cluster characteristics, so much of the need to manually tweak algorithms is largely reduced or eliminated.
 To understand more about DML and PyDML, we recommend that you read [Beginner's Guide to DML and PyDML](https://apache.github.io/incubator-systemml/beginners-guide-to-dml-and-pydml.html).

 For convenience of Python users, SystemML exposes several language-level APIs that allow Python users to use SystemML
 and its algorithms without the need to know DML or PyDML. We explain these APIs in the below sections with example usecases.

 ## Download & Setup

 Before you get started on SystemML, make sure that your environment is set up and ready to go.

 ### Install Java (need Java 8) and Apache Spark

 If you already have a Apache Spark installation, you can skip this step.

 <div class="codetabs">
 <div data-lang="OSX" markdown="1">
 ```bash
 /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
 brew tap caskroom/cask
 brew install Caskroom/cask/java
 brew install apache-spark
 ```
 </div>
 <div data-lang="Linux" markdown="1">
 ```bash
 ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"
 brew tap caskroom/cask
 brew install Caskroom/cask/java
 brew tap homebrew/versions
 brew install apache-spark16
 ```
 </div>
 </div>

 ### Install SystemML

 #### Step 1: Install SystemML Python package

 ```bash
 pip install systemml
 ```

 #### Step 2: Download SystemML Java binaries

 SystemML Python package downloads the corresponding Java binaries (along with algorithms) and places them
 into the installed location. To find the location of the downloaded Java binaries, use the following command:

 ```bash
 python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'
 ```

 #### Step 3: (Optional but recommended) Set SYSTEMML_HOME environment variable
 <div class="codetabs">
 <div data-lang="OSX" markdown="1">
 ```bash
 SYSTEMML_HOME=`python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'`
 # If you are using zsh or ksh or csh, append it to ~/.zshrc or ~/.profile or ~/.login respectively.
 echo '' >> ~/.bashrc
 echo 'export SYSTEMML_HOME='$SYSTEMML_HOME >> ~/.bashrc
 ```
 </div>
 <div data-lang="Linux" markdown="1">
 ```bash
 SYSTEMML_HOME=`python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'`
 # If you are using zsh or ksh or csh, append it to ~/.zshrc or ~/.profile or ~/.login respectively.
 echo '' >> ~/.bashrc
 echo 'export SYSTEMML_HOME='$SYSTEMML_HOME >> ~/.bashrc
 ```
 </div>
 </div>

 Note: the user is free to either use the prepackaged Java binaries
 or download them from [SystemML website](http://systemml.apache.org/download.html)
 or build them from the [source](https://github.com/apache/incubator-systemml).

 ### Start Pyspark shell

 <div class="codetabs">
 <div data-lang="OSX" markdown="1">
 ```bash
 pyspark --master local[*] --driver-class-path $SYSTEMML_HOME"/SystemML.jar"
 ```
 </div>
 <div data-lang="Linux" markdown="1">
 ```bash
 pyspark --master local[*] --driver-class-path $SYSTEMML_HOME"/SystemML.jar"
 ```
 </div>
 </div>

 ## Matrix operations

 To get started with SystemML, let's try few elementary matrix multiplication operations:

 ```python
 import systemml as sml
 import numpy as np
 sml.setSparkContext(sc)
 m1 = sml.matrix(np.ones((3,3)) + 2)
 m2 = sml.matrix(np.ones((3,3)) + 3)
 m2 = m1 * (m2 + m1)
 m4 = 1.0 - m2
 m4.sum(axis=1).toNumPyArray()
 ```

 Output:

 ```bash
 array([[-60.],
        [-60.],
        [-60.]])
 ```

 Let us now write a simple script to train [linear regression](https://apache.github.io/incubator-systemml/algorithms-regression.html#linear-regression)
 model: $ \beta = solve(X^T X, X^T y) $. For simplicity, we will use direct-solve method and ignore regularization parameter as well as intercept.

 ```python
 import numpy as np
 from sklearn import datasets
 import systemml as sml
 from pyspark.sql import SQLContext
 # Load the diabetes dataset
 diabetes = datasets.load_diabetes()
 # Use only one feature
 diabetes_X = diabetes.data[:, np.newaxis, 2]
 # Split the data into training/testing sets
 X_train = diabetes_X[:-20]
 X_test = diabetes_X[-20:]
 # Split the targets into training/testing sets
 y_train = diabetes.target[:-20]
 y_test = diabetes.target[-20:]
 # Train Linear Regression model
 sml.setSparkContext(sc)
 X = sml.matrix(X_train)
 y = sml.matrix(y_train)
 A = X.transpose().dot(X)
 b = X.transpose().dot(y)
 beta = sml.solve(A, b).toNumPyArray()
 y_predicted = X_test.dot(beta)
 print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
 ```

 Output:

 ```bash
 Residual sum of squares: 25282.12
 ```

 We can improve the residual error by adding an intercept and regularization parameter. To do so, we will use `mllearn` API described in the next section.

 ## Invoke SystemML's algorithms

 SystemML also exposes a subpackage `mllearn`. This subpackage allows Python users to invoke SystemML algorithms
 using Scikit-learn or MLPipeline API.

 ### Scikit-learn interface

 In the below example, we invoke SystemML's [Linear Regression](https://apache.github.io/incubator-systemml/algorithms-regression.html#linear-regression)
 algorithm.

 ```python
 import numpy as np
 from sklearn import datasets
 from systemml.mllearn import LinearRegression
 from pyspark.sql import SQLContext
 # Load the diabetes dataset
 diabetes = datasets.load_diabetes()
 # Use only one feature
 diabetes_X = diabetes.data[:, np.newaxis, 2]
 # Split the data into training/testing sets
 X_train = diabetes_X[:-20]
 X_test = diabetes_X[-20:]
 # Split the targets into training/testing sets
 y_train = diabetes.target[:-20]
 y_test = diabetes.target[-20:]
 # Create linear regression object
 regr = LinearRegression(sqlCtx, fit_intercept=True, C=1, solver='direct-solve')
 # Train the model using the training sets
 regr.fit(X_train, y_train)
 y_predicted = regr.predict(X_test)
 print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
 ```

 Output:

 ```bash
 Residual sum of squares: 6991.17
 ```

 As expected, by adding intercept and regularizer the residual error drops significantly.

 Here is another example that where we invoke SystemML's [Logistic Regression](https://apache.github.io/incubator-systemml/algorithms-classification.html#multinomial-logistic-regression)
 algorithm on digits datasets.

 ```python
 # Scikit-learn way
 from sklearn import datasets, neighbors
 from systemml.mllearn import LogisticRegression
 from pyspark.sql import SQLContext
 sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target + 1
 n_samples = len(X_digits)
 X_train = X_digits[:.9 * n_samples]
 y_train = y_digits[:.9 * n_samples]
 X_test = X_digits[.9 * n_samples:]
 y_test = y_digits[.9 * n_samples:]
 logistic = LogisticRegression(sqlCtx)
 print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
 ```

 Output:

 ```bash
 LogisticRegression score: 0.922222
 ```

 ### Passing PySpark DataFrame

 To train the above algorithm on larger dataset, we can load the dataset into DataFrame and pass it to the `fit` method:

 ```python
 from sklearn import datasets, neighbors
 from systemml.mllearn import LogisticRegression
 from pyspark.sql import SQLContext
 import systemml as sml
 sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target + 1
 n_samples = len(X_digits)
 # Split the data into training/testing sets and convert to PySpark DataFrame
 df_train = sml.convertToLabeledDF(sqlContext, X_digits[:.9 * n_samples], y_digits[:.9 * n_samples])
 X_test = X_digits[.9 * n_samples:]
 y_test = y_digits[.9 * n_samples:]
 logistic = LogisticRegression(sqlCtx)
 print('LogisticRegression score: %f' % logistic.fit(df_train).score(X_test, y_test))
 ```

 Output:

 ```bash
 LogisticRegression score: 0.922222
 ```

 ### MLPipeline interface

 In the below example, we demonstrate how the same `LogisticRegression` class can allow SystemML to fit seamlessly into
 large data pipelines.

 ```python
 # MLPipeline way
 from pyspark.ml import Pipeline
 from systemml.mllearn import LogisticRegression
 from pyspark.ml.feature import HashingTF, Tokenizer
 from pyspark.sql import SQLContext
 sqlCtx = SQLContext(sc)
 training = sqlCtx.createDataFrame([
     (0L, "a b c d e spark", 1.0),
     (1L, "b d", 2.0),
     (2L, "spark f g h", 1.0),
     (3L, "hadoop mapreduce", 2.0),
     (4L, "b spark who", 1.0),
     (5L, "g d a y", 2.0),
     (6L, "spark fly", 1.0),
     (7L, "was mapreduce", 2.0),
     (8L, "e spark program", 1.0),
     (9L, "a e c l", 2.0),
     (10L, "spark compile", 1.0),
     (11L, "hadoop software", 2.0)
 ], ["id", "text", "label"])
 tokenizer = Tokenizer(inputCol="text", outputCol="words")
 hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
 lr = LogisticRegression(sqlCtx)
 pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
 model = pipeline.fit(training)
 test = sqlCtx.createDataFrame([
     (12L, "spark i j k"),
     (13L, "l m n"),
     (14L, "mapreduce spark"),
     (15L, "apache hadoop")], ["id", "text"])
 prediction = model.transform(test)
 prediction.show()
 ```

 Output:

 ```bash
 +--+---------------+--------------------+--------------------+--------------------+---+----------+
 |id|           text|               words|            features|         probability| ID|prediction|
 +--+---------------+--------------------+--------------------+--------------------+---+----------+
 |12|    spark i j k|ArrayBuffer(spark...|(20,[5,6,7],[2.0,...|[0.99999999999975...|1.0|       1.0|
 |13|          l m n|ArrayBuffer(l, m, n)|(20,[8,9,10],[1.0...|[1.37552128844736...|2.0|       2.0|
 |14|mapreduce spark|ArrayBuffer(mapre...|(20,[5,10],[1.0,1...|[0.99860290938153...|3.0|       1.0|
 |15|  apache hadoop|ArrayBuffer(apach...|(20,[9,14],[1.0,1...|[5.41688748236143...|4.0|       2.0|
 +--+---------------+--------------------+--------------------+--------------------+---+----------+
 ```

 ## Invoking DML/PyDML scripts using MLContext

 The below example demonstrates how to invoke the algorithm [scripts/algorithms/MultiLogReg.dml](https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/MultiLogReg.dml)
 using Python [MLContext API](https://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide).

 ```python
 from sklearn import datasets, neighbors
 from pyspark.sql import DataFrame, SQLContext
 import systemml as sml
 import pandas as pd
 import os
 sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target + 1
 n_samples = len(X_digits)
 # Split the data into training/testing sets and convert to PySpark DataFrame
 X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:.9 * n_samples]))
 y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:.9 * n_samples]))
 ml = sml.MLContext(sc)
 script = os.path.join(os.environ['SYSTEMML_HOME'], 'scripts', 'algorithms', 'MultiLogReg.dml')
 script = sml.dml(script).input(X=X_df, Y_vec=y_df).output("B_out")
 beta = ml.execute(script).get('B_out').toNumPy()
 ```
	---
	layout: global
	title: Beginner's Guide for Python users
	description: Beginner's Guide for Python users
	---
	<!--
	{% comment %}
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to you under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	{% endcomment %}
	-->

	* This will become a table of contents (this text will be scraped).
	{:toc}

	<br/>

	## Introduction

	SystemML enables flexible, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors,
	one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).

	Algorithm scripts written in DML and PyDML can be run on Hadoop, on Spark, or in Standalone mode.
	No script modifications are required to change between modes. SystemML automatically performs advanced optimizations
	based on data and cluster characteristics, so much of the need to manually tweak algorithms is largely reduced or eliminated.
	To understand more about DML and PyDML, we recommend that you read [Beginner's Guide to DML and PyDML](https://apache.github.io/incubator-systemml/beginners-guide-to-dml-and-pydml.html).

	For convenience of Python users, SystemML exposes several language-level APIs that allow Python users to use SystemML
	and its algorithms without the need to know DML or PyDML. We explain these APIs in the below sections with example usecases.

	## Download & Setup

	Before you get started on SystemML, make sure that your environment is set up and ready to go.

	### Install Java (need Java 8) and Apache Spark

	If you already have a Apache Spark installation, you can skip this step.

	<div class="codetabs">
	<div data-lang="OSX" markdown="1">
	```bash
	/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
	brew tap caskroom/cask
	brew install Caskroom/cask/java
	brew install apache-spark
	```
	</div>
	<div data-lang="Linux" markdown="1">
	```bash
	ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"
	brew tap caskroom/cask
	brew install Caskroom/cask/java
	brew tap homebrew/versions
	brew install apache-spark16
	```
	</div>
	</div>

	### Install SystemML

	#### Step 1: Install SystemML Python package

	```bash
	pip install systemml
	```

	#### Step 2: Download SystemML Java binaries

	SystemML Python package downloads the corresponding Java binaries (along with algorithms) and places them
	into the installed location. To find the location of the downloaded Java binaries, use the following command:

	```bash
	python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'
	```

	#### Step 3: (Optional but recommended) Set SYSTEMML_HOME environment variable
	<div class="codetabs">
	<div data-lang="OSX" markdown="1">
	```bash
	SYSTEMML_HOME=`python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'`
	# If you are using zsh or ksh or csh, append it to ~/.zshrc or ~/.profile or ~/.login respectively.
	echo '' >> ~/.bashrc
	echo 'export SYSTEMML_HOME='$SYSTEMML_HOME >> ~/.bashrc
	```
	</div>
	<div data-lang="Linux" markdown="1">
	```bash
	SYSTEMML_HOME=`python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'`
	# If you are using zsh or ksh or csh, append it to ~/.zshrc or ~/.profile or ~/.login respectively.
	echo '' >> ~/.bashrc
	echo 'export SYSTEMML_HOME='$SYSTEMML_HOME >> ~/.bashrc
	```
	</div>
	</div>

	Note: the user is free to either use the prepackaged Java binaries
	or download them from [SystemML website](http://systemml.apache.org/download.html)
	or build them from the [source](https://github.com/apache/incubator-systemml).

	### Start Pyspark shell

	<div class="codetabs">
	<div data-lang="OSX" markdown="1">
	```bash
	pyspark --master local[*] --driver-class-path $SYSTEMML_HOME"/SystemML.jar"
	```
	</div>
	<div data-lang="Linux" markdown="1">
	```bash
	pyspark --master local[*] --driver-class-path $SYSTEMML_HOME"/SystemML.jar"
	```
	</div>
	</div>

	## Matrix operations

	To get started with SystemML, let's try few elementary matrix multiplication operations:

	```python
	import systemml as sml
	import numpy as np
	sml.setSparkContext(sc)
	m1 = sml.matrix(np.ones((3,3)) + 2)
	m2 = sml.matrix(np.ones((3,3)) + 3)
	m2 = m1 * (m2 + m1)
	m4 = 1.0 - m2
	m4.sum(axis=1).toNumPyArray()
	```

	Output:

	```bash
	array([[-60.],
	[-60.],
	[-60.]])
	```

	Let us now write a simple script to train [linear regression](https://apache.github.io/incubator-systemml/algorithms-regression.html#linear-regression)
	model: $ \beta = solve(X^T X, X^T y) $. For simplicity, we will use direct-solve method and ignore regularization parameter as well as intercept.

	```python
	import numpy as np
	from sklearn import datasets
	import systemml as sml
	from pyspark.sql import SQLContext
	# Load the diabetes dataset
	diabetes = datasets.load_diabetes()
	# Use only one feature
	diabetes_X = diabetes.data[:, np.newaxis, 2]
	# Split the data into training/testing sets
	X_train = diabetes_X[:-20]
	X_test = diabetes_X[-20:]
	# Split the targets into training/testing sets
	y_train = diabetes.target[:-20]
	y_test = diabetes.target[-20:]
	# Train Linear Regression model
	sml.setSparkContext(sc)
	X = sml.matrix(X_train)
	y = sml.matrix(y_train)
	A = X.transpose().dot(X)
	b = X.transpose().dot(y)
	beta = sml.solve(A, b).toNumPyArray()
	y_predicted = X_test.dot(beta)
	print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
	```

	Output:

	```bash
	Residual sum of squares: 25282.12
	```

	We can improve the residual error by adding an intercept and regularization parameter. To do so, we will use `mllearn` API described in the next section.

	## Invoke SystemML's algorithms

	SystemML also exposes a subpackage `mllearn`. This subpackage allows Python users to invoke SystemML algorithms
	using Scikit-learn or MLPipeline API.

	### Scikit-learn interface

	In the below example, we invoke SystemML's [Linear Regression](https://apache.github.io/incubator-systemml/algorithms-regression.html#linear-regression)
	algorithm.

	```python
	import numpy as np
	from sklearn import datasets
	from systemml.mllearn import LinearRegression
	from pyspark.sql import SQLContext
	# Load the diabetes dataset
	diabetes = datasets.load_diabetes()
	# Use only one feature
	diabetes_X = diabetes.data[:, np.newaxis, 2]
	# Split the data into training/testing sets
	X_train = diabetes_X[:-20]
	X_test = diabetes_X[-20:]
	# Split the targets into training/testing sets
	y_train = diabetes.target[:-20]
	y_test = diabetes.target[-20:]
	# Create linear regression object
	regr = LinearRegression(sqlCtx, fit_intercept=True, C=1, solver='direct-solve')
	# Train the model using the training sets
	regr.fit(X_train, y_train)
	y_predicted = regr.predict(X_test)
	print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
	```

	Output:

	```bash
	Residual sum of squares: 6991.17
	```

	As expected, by adding intercept and regularizer the residual error drops significantly.

	Here is another example that where we invoke SystemML's [Logistic Regression](https://apache.github.io/incubator-systemml/algorithms-classification.html#multinomial-logistic-regression)
	algorithm on digits datasets.

	```python
	# Scikit-learn way
	from sklearn import datasets, neighbors
	from systemml.mllearn import LogisticRegression
	from pyspark.sql import SQLContext
	sqlCtx = SQLContext(sc)
	digits = datasets.load_digits()
	X_digits = digits.data
	y_digits = digits.target + 1
	n_samples = len(X_digits)
	X_train = X_digits[:.9 * n_samples]
	y_train = y_digits[:.9 * n_samples]
	X_test = X_digits[.9 * n_samples:]
	y_test = y_digits[.9 * n_samples:]
	logistic = LogisticRegression(sqlCtx)
	print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
	```

	Output:

	```bash
	LogisticRegression score: 0.922222
	```

	### Passing PySpark DataFrame

	To train the above algorithm on larger dataset, we can load the dataset into DataFrame and pass it to the `fit` method:

	```python
	from sklearn import datasets, neighbors
	from systemml.mllearn import LogisticRegression
	from pyspark.sql import SQLContext
	import systemml as sml
	sqlCtx = SQLContext(sc)
	digits = datasets.load_digits()
	X_digits = digits.data
	y_digits = digits.target + 1
	n_samples = len(X_digits)
	# Split the data into training/testing sets and convert to PySpark DataFrame
	df_train = sml.convertToLabeledDF(sqlContext, X_digits[:.9 * n_samples], y_digits[:.9 * n_samples])
	X_test = X_digits[.9 * n_samples:]
	y_test = y_digits[.9 * n_samples:]
	logistic = LogisticRegression(sqlCtx)
	print('LogisticRegression score: %f' % logistic.fit(df_train).score(X_test, y_test))
	```

	Output:

	```bash
	LogisticRegression score: 0.922222
	```

	### MLPipeline interface

	In the below example, we demonstrate how the same `LogisticRegression` class can allow SystemML to fit seamlessly into
	large data pipelines.

	```python
	# MLPipeline way
	from pyspark.ml import Pipeline
	from systemml.mllearn import LogisticRegression
	from pyspark.ml.feature import HashingTF, Tokenizer
	from pyspark.sql import SQLContext
	sqlCtx = SQLContext(sc)
	training = sqlCtx.createDataFrame([
	(0L, "a b c d e spark", 1.0),
	(1L, "b d", 2.0),
	(2L, "spark f g h", 1.0),
	(3L, "hadoop mapreduce", 2.0),
	(4L, "b spark who", 1.0),
	(5L, "g d a y", 2.0),
	(6L, "spark fly", 1.0),
	(7L, "was mapreduce", 2.0),
	(8L, "e spark program", 1.0),
	(9L, "a e c l", 2.0),
	(10L, "spark compile", 1.0),
	(11L, "hadoop software", 2.0)
	], ["id", "text", "label"])
	tokenizer = Tokenizer(inputCol="text", outputCol="words")
	hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
	lr = LogisticRegression(sqlCtx)
	pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
	model = pipeline.fit(training)
	test = sqlCtx.createDataFrame([
	(12L, "spark i j k"),
	(13L, "l m n"),
	(14L, "mapreduce spark"),
	(15L, "apache hadoop")], ["id", "text"])
	prediction = model.transform(test)
	prediction.show()
	```

	Output:

	```bash
	+--+---------------+--------------------+--------------------+--------------------+---+----------+
	\|id\| text\| words\| features\| probability\| ID\|prediction\|
	+--+---------------+--------------------+--------------------+--------------------+---+----------+
	\|12\| spark i j k\|ArrayBuffer(spark...\|(20,[5,6,7],[2.0,...\|[0.99999999999975...\|1.0\| 1.0\|
	\|13\| l m n\|ArrayBuffer(l, m, n)\|(20,[8,9,10],[1.0...\|[1.37552128844736...\|2.0\| 2.0\|
	\|14\|mapreduce spark\|ArrayBuffer(mapre...\|(20,[5,10],[1.0,1...\|[0.99860290938153...\|3.0\| 1.0\|
	\|15\| apache hadoop\|ArrayBuffer(apach...\|(20,[9,14],[1.0,1...\|[5.41688748236143...\|4.0\| 2.0\|
	+--+---------------+--------------------+--------------------+--------------------+---+----------+
	```

	## Invoking DML/PyDML scripts using MLContext

	The below example demonstrates how to invoke the algorithm [scripts/algorithms/MultiLogReg.dml](https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/MultiLogReg.dml)
	using Python [MLContext API](https://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide).

	```python
	from sklearn import datasets, neighbors
	from pyspark.sql import DataFrame, SQLContext
	import systemml as sml
	import pandas as pd
	import os
	sqlCtx = SQLContext(sc)
	digits = datasets.load_digits()
	X_digits = digits.data
	y_digits = digits.target + 1
	n_samples = len(X_digits)
	# Split the data into training/testing sets and convert to PySpark DataFrame
	X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:.9 * n_samples]))
	y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:.9 * n_samples]))
	ml = sml.MLContext(sc)
	script = os.path.join(os.environ['SYSTEMML_HOME'], 'scripts', 'algorithms', 'MultiLogReg.dml')
	script = sml.dml(script).input(X=X_df, Y_vec=y_df).output("B_out")
	beta = ml.execute(script).get('B_out').toNumPy()
	```