blob: 65eef50d8ebedbcaed0652b23abce2f70f90f5b1 [file] [log] [blame] [view]
---
layout: global
title: Beginner's Guide for Python users
description: Beginner's Guide for Python users
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->
* This will become a table of contents (this text will be scraped).
{:toc}
<br/>
## Introduction
SystemML enables flexible, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors,
one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).
Algorithm scripts written in DML and PyDML can be run on Hadoop, on Spark, or in Standalone mode.
No script modifications are required to change between modes. SystemML automatically performs advanced optimizations
based on data and cluster characteristics, so much of the need to manually tweak algorithms is largely reduced or eliminated.
To understand more about DML and PyDML, we recommend that you read [Beginner's Guide to DML and PyDML](https://apache.github.io/incubator-systemml/beginners-guide-to-dml-and-pydml.html).
For convenience of Python users, SystemML exposes several language-level APIs that allow Python users to use SystemML
and its algorithms without the need to know DML or PyDML. We explain these APIs in the below sections with example usecases.
## Download & Setup
Before you get started on SystemML, make sure that your environment is set up and ready to go.
### Install Java (need Java 8) and Apache Spark
If you already have a Apache Spark installation, you can skip this step.
<div class="codetabs">
<div data-lang="OSX" markdown="1">
```bash
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew tap caskroom/cask
brew install Caskroom/cask/java
brew install apache-spark
```
</div>
<div data-lang="Linux" markdown="1">
```bash
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"
brew tap caskroom/cask
brew install Caskroom/cask/java
brew tap homebrew/versions
brew install apache-spark16
```
</div>
</div>
### Install SystemML
#### Step 1: Install SystemML Python package
```bash
pip install systemml
```
#### Step 2: Download SystemML Java binaries
SystemML Python package downloads the corresponding Java binaries (along with algorithms) and places them
into the installed location. To find the location of the downloaded Java binaries, use the following command:
```bash
python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'
```
#### Step 3: (Optional but recommended) Set SYSTEMML_HOME environment variable
<div class="codetabs">
<div data-lang="OSX" markdown="1">
```bash
SYSTEMML_HOME=`python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'`
# If you are using zsh or ksh or csh, append it to ~/.zshrc or ~/.profile or ~/.login respectively.
echo '' >> ~/.bashrc
echo 'export SYSTEMML_HOME='$SYSTEMML_HOME >> ~/.bashrc
```
</div>
<div data-lang="Linux" markdown="1">
```bash
SYSTEMML_HOME=`python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'`
# If you are using zsh or ksh or csh, append it to ~/.zshrc or ~/.profile or ~/.login respectively.
echo '' >> ~/.bashrc
echo 'export SYSTEMML_HOME='$SYSTEMML_HOME >> ~/.bashrc
```
</div>
</div>
Note: the user is free to either use the prepackaged Java binaries
or download them from [SystemML website](http://systemml.apache.org/download.html)
or build them from the [source](https://github.com/apache/incubator-systemml).
### Start Pyspark shell
<div class="codetabs">
<div data-lang="OSX" markdown="1">
```bash
pyspark --master local[*] --driver-class-path $SYSTEMML_HOME"/SystemML.jar"
```
</div>
<div data-lang="Linux" markdown="1">
```bash
pyspark --master local[*] --driver-class-path $SYSTEMML_HOME"/SystemML.jar"
```
</div>
</div>
## Matrix operations
To get started with SystemML, let's try few elementary matrix multiplication operations:
```python
import systemml as sml
import numpy as np
sml.setSparkContext(sc)
m1 = sml.matrix(np.ones((3,3)) + 2)
m2 = sml.matrix(np.ones((3,3)) + 3)
m2 = m1 * (m2 + m1)
m4 = 1.0 - m2
m4.sum(axis=1).toNumPyArray()
```
Output:
```bash
array([[-60.],
[-60.],
[-60.]])
```
Let us now write a simple script to train [linear regression](https://apache.github.io/incubator-systemml/algorithms-regression.html#linear-regression)
model: $ \beta = solve(X^T X, X^T y) $. For simplicity, we will use direct-solve method and ignore regularization parameter as well as intercept.
```python
import numpy as np
from sklearn import datasets
import systemml as sml
from pyspark.sql import SQLContext
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
X_train = diabetes_X[:-20]
X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
y_train = diabetes.target[:-20]
y_test = diabetes.target[-20:]
# Train Linear Regression model
sml.setSparkContext(sc)
X = sml.matrix(X_train)
y = sml.matrix(y_train)
A = X.transpose().dot(X)
b = X.transpose().dot(y)
beta = sml.solve(A, b).toNumPyArray()
y_predicted = X_test.dot(beta)
print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
```
Output:
```bash
Residual sum of squares: 25282.12
```
We can improve the residual error by adding an intercept and regularization parameter. To do so, we will use `mllearn` API described in the next section.
## Invoke SystemML's algorithms
SystemML also exposes a subpackage `mllearn`. This subpackage allows Python users to invoke SystemML algorithms
using Scikit-learn or MLPipeline API.
### Scikit-learn interface
In the below example, we invoke SystemML's [Linear Regression](https://apache.github.io/incubator-systemml/algorithms-regression.html#linear-regression)
algorithm.
```python
import numpy as np
from sklearn import datasets
from systemml.mllearn import LinearRegression
from pyspark.sql import SQLContext
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
X_train = diabetes_X[:-20]
X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
y_train = diabetes.target[:-20]
y_test = diabetes.target[-20:]
# Create linear regression object
regr = LinearRegression(sqlCtx, fit_intercept=True, C=1, solver='direct-solve')
# Train the model using the training sets
regr.fit(X_train, y_train)
y_predicted = regr.predict(X_test)
print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
```
Output:
```bash
Residual sum of squares: 6991.17
```
As expected, by adding intercept and regularizer the residual error drops significantly.
Here is another example that where we invoke SystemML's [Logistic Regression](https://apache.github.io/incubator-systemml/algorithms-classification.html#multinomial-logistic-regression)
algorithm on digits datasets.
```python
# Scikit-learn way
from sklearn import datasets, neighbors
from systemml.mllearn import LogisticRegression
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target + 1
n_samples = len(X_digits)
X_train = X_digits[:.9 * n_samples]
y_train = y_digits[:.9 * n_samples]
X_test = X_digits[.9 * n_samples:]
y_test = y_digits[.9 * n_samples:]
logistic = LogisticRegression(sqlCtx)
print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
```
Output:
```bash
LogisticRegression score: 0.922222
```
### Passing PySpark DataFrame
To train the above algorithm on larger dataset, we can load the dataset into DataFrame and pass it to the `fit` method:
```python
from sklearn import datasets, neighbors
from systemml.mllearn import LogisticRegression
from pyspark.sql import SQLContext
import systemml as sml
sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target + 1
n_samples = len(X_digits)
# Split the data into training/testing sets and convert to PySpark DataFrame
df_train = sml.convertToLabeledDF(sqlContext, X_digits[:.9 * n_samples], y_digits[:.9 * n_samples])
X_test = X_digits[.9 * n_samples:]
y_test = y_digits[.9 * n_samples:]
logistic = LogisticRegression(sqlCtx)
print('LogisticRegression score: %f' % logistic.fit(df_train).score(X_test, y_test))
```
Output:
```bash
LogisticRegression score: 0.922222
```
### MLPipeline interface
In the below example, we demonstrate how the same `LogisticRegression` class can allow SystemML to fit seamlessly into
large data pipelines.
```python
# MLPipeline way
from pyspark.ml import Pipeline
from systemml.mllearn import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
training = sqlCtx.createDataFrame([
(0L, "a b c d e spark", 1.0),
(1L, "b d", 2.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 2.0),
(4L, "b spark who", 1.0),
(5L, "g d a y", 2.0),
(6L, "spark fly", 1.0),
(7L, "was mapreduce", 2.0),
(8L, "e spark program", 1.0),
(9L, "a e c l", 2.0),
(10L, "spark compile", 1.0),
(11L, "hadoop software", 2.0)
], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
lr = LogisticRegression(sqlCtx)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
test = sqlCtx.createDataFrame([
(12L, "spark i j k"),
(13L, "l m n"),
(14L, "mapreduce spark"),
(15L, "apache hadoop")], ["id", "text"])
prediction = model.transform(test)
prediction.show()
```
Output:
```bash
+--+---------------+--------------------+--------------------+--------------------+---+----------+
|id| text| words| features| probability| ID|prediction|
+--+---------------+--------------------+--------------------+--------------------+---+----------+
|12| spark i j k|ArrayBuffer(spark...|(20,[5,6,7],[2.0,...|[0.99999999999975...|1.0| 1.0|
|13| l m n|ArrayBuffer(l, m, n)|(20,[8,9,10],[1.0...|[1.37552128844736...|2.0| 2.0|
|14|mapreduce spark|ArrayBuffer(mapre...|(20,[5,10],[1.0,1...|[0.99860290938153...|3.0| 1.0|
|15| apache hadoop|ArrayBuffer(apach...|(20,[9,14],[1.0,1...|[5.41688748236143...|4.0| 2.0|
+--+---------------+--------------------+--------------------+--------------------+---+----------+
```
## Invoking DML/PyDML scripts using MLContext
The below example demonstrates how to invoke the algorithm [scripts/algorithms/MultiLogReg.dml](https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/MultiLogReg.dml)
using Python [MLContext API](https://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide).
```python
from sklearn import datasets, neighbors
from pyspark.sql import DataFrame, SQLContext
import systemml as sml
import pandas as pd
import os
sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target + 1
n_samples = len(X_digits)
# Split the data into training/testing sets and convert to PySpark DataFrame
X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:.9 * n_samples]))
y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:.9 * n_samples]))
ml = sml.MLContext(sc)
script = os.path.join(os.environ['SYSTEMML_HOME'], 'scripts', 'algorithms', 'MultiLogReg.dml')
script = sml.dml(script).input(X=X_df, Y_vec=y_df).output("B_out")
beta = ml.execute(script).get('B_out').toNumPy()
```