<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Linear Methods - ML - Spark 1.5.2 Documentation</title>
<link rel="stylesheet" href="css/bootstrap.min.css">
<style>
body {
padding-top: 60px;
padding-bottom: 40px;
}
</style>
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="css/bootstrap-responsive.min.css">
<link rel="stylesheet" href="css/main.css">
<script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
<link rel="stylesheet" href="css/pygments-default.css">
<!-- Google analytics script -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-32518208-2']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<!--[if lt IE 7]>
<p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p>
<![endif]-->
<!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html -->
<div class="navbar navbar-fixed-top" id="topbar">
<div class="navbar-inner">
<div class="container">
<div class="brand"><a href="index.html">
<img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">1.5.2</span>
</div>
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li some how.-->
<li><a href="index.html">Overview</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
<li><a href="programming-guide.html">Spark Programming Guide</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">DataFrames and SQL</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
<li><a href="sparkr.html">SparkR (R on Spark)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
<li><a href="api/java/index.html">Java</a></li>
<li><a href="api/python/index.html">Python</a></li>
<li><a href="api/R/index.html">R</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="cluster-overview.html">Overview</a></li>
<li><a href="submitting-applications.html">Submitting Applications</a></li>
<li class="divider"></li>
<li><a href="spark-standalone.html">Spark Standalone</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="running-on-yarn.html">YARN</a></li>
<li class="divider"></li>
<li><a href="ec2-scripts.html">Amazon EC2</a></li>
</ul>
</li>
<li class="dropdown">
<a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="configuration.html">Configuration</a></li>
<li><a href="monitoring.html">Monitoring</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li><a href="security.html">Security</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li><a href="hadoop-third-party-distributions.html">3<sup>rd</sup>-Party Hadoop Distros</a></li>
<li class="divider"></li>
<li><a href="building-spark.html">Building Spark</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects">Supplemental Projects</a></li>
</ul>
</li>
</ul>
<!--<p class="navbar-text pull-right"><span class="version-text">v1.5.2</span></p>-->
</div>
</div>
</div>
<div class="container" id="content">
<h1 class="title"><a href="ml-guide.html">ML</a> - Linear Methods</h1>
<p><code>\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]</code></p>
<p>In MLlib, we implement popular linear methods such as logistic
regression and linear least squares with $L_1$ or $L_2$ regularization.
Refer to <a href="mllib-linear-methods.html">the linear methods in mllib</a> for
details. In <code>spark.ml</code>, we also include a Pipelines API for <a href="http://en.wikipedia.org/wiki/Elastic_net_regularization">elastic
net</a>, a hybrid
of $L_1$ and $L_2$ regularization proposed in <a href="http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf">Zou et al., Regularization
and variable selection via the elastic
net</a>.
Mathematically, it is defined as a convex combination of the $L_1$ and
the $L_2$ regularization terms:
<code>\[
\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
\]</code>
With suitable choices of $\alpha$, elastic net covers both $L_1$ and $L_2$
regularization as special cases. For example, if a <a href="https://en.wikipedia.org/wiki/Linear_regression">linear
regression</a> model is
trained with the elastic net parameter $\alpha$ set to $1$, it is
equivalent to a
<a href="http://en.wikipedia.org/wiki/Least_squares#Lasso_method">Lasso</a> model.
On the other hand, if $\alpha$ is set to $0$, the trained model reduces
to a <a href="http://en.wikipedia.org/wiki/Tikhonov_regularization">ridge
regression</a> model.
We implement a Pipelines API for both linear regression and logistic
regression with elastic net regularization.</p>
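<p>As a minimal sketch of the formula above, the penalty can be evaluated directly in plain Python. The function name <code>elastic_net_penalty</code> and the sample weights are illustrative assumptions rather than part of any Spark API, and the sketch does not reflect how <code>spark.ml</code> applies the penalty internally:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Return alpha * (lambda * ||w||_1) + (1 - alpha) * (lambda / 2 * ||w||_2^2).

    reg_param plays the role of lambda and elastic_net_param the role of alpha.
    """
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    l1_term = reg_param * l1_norm
    l2_term = (reg_param / 2.0) * l2_norm_squared
    return elastic_net_param * l1_term + (1.0 - elastic_net_param) * l2_term


# Hypothetical weight vector, chosen only for illustration
weights = [0.5, -2.0, 0.0]

# alpha = 1: pure L1 (Lasso) penalty, 0.3 * 2.5
lasso_penalty = elastic_net_penalty(weights, 0.3, 1.0)
# alpha = 0: pure L2 (ridge) penalty, 0.15 * 4.25
ridge_penalty = elastic_net_penalty(weights, 0.3, 0.0)
# alpha = 0.8: convex combination of the two terms above
mixed_penalty = elastic_net_penalty(weights, 0.3, 0.8)

print(lasso_penalty, ridge_penalty, mixed_penalty)
</code></pre></div>
<p>For any $\alpha$ in $[0, 1]$ the mixed value lies between the pure $L_1$ and pure $L_2$ penalties, which is what the convex combination implies.</p>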
<h2 id="example-logistic-regression">Example: Logistic Regression</h2>
<p>The following example shows how to train a logistic regression model
with elastic net regularization. <code>elasticNetParam</code> corresponds to
$\alpha$ and <code>regParam</code> corresponds to $\lambda$.</p>
<div class="codetabs">
<div data-lang="scala">
<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">org.apache.spark.ml.classification.LogisticRegression</span>
<span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span>
<span class="c1">// Load training data</span>
<span class="k">val</span> <span class="n">training</span> <span class="k">=</span> <span class="nc">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">&quot;data/mllib/sample_libsvm_data.txt&quot;</span><span class="o">).</span><span class="n">toDF</span><span class="o">()</span>
<span class="k">val</span> <span class="n">lr</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">LogisticRegression</span><span class="o">()</span>
<span class="o">.</span><span class="n">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span>
<span class="o">.</span><span class="n">setRegParam</span><span class="o">(</span><span class="mf">0.3</span><span class="o">)</span>
<span class="o">.</span><span class="n">setElasticNetParam</span><span class="o">(</span><span class="mf">0.8</span><span class="o">)</span>
<span class="c1">// Fit the model</span>
<span class="k">val</span> <span class="n">lrModel</span> <span class="k">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="o">(</span><span class="n">training</span><span class="o">)</span>
<span class="c1">// Print the weights and intercept for logistic regression</span>
<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;Weights: ${lrModel.weights} Intercept: ${lrModel.intercept}&quot;</span><span class="o">)</span></code></pre></div>
</div>
<div data-lang="java">
<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.spark.ml.classification.LogisticRegression</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.ml.classification.LogisticRegressionModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.regression.LabeledPoint</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.SparkConf</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.SparkContext</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.sql.DataFrame</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.sql.SQLContext</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">LogisticRegressionWithElasticNetExample</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">SparkConf</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">()</span>
<span class="o">.</span><span class="na">setAppName</span><span class="o">(</span><span class="s">&quot;Logistic Regression with Elastic Net Example&quot;</span><span class="o">);</span>
<span class="n">SparkContext</span> <span class="n">sc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">);</span>
<span class="n">SQLContext</span> <span class="n">sql</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">);</span>
<span class="n">String</span> <span class="n">path</span> <span class="o">=</span> <span class="s">&quot;data/mllib/sample_libsvm_data.txt&quot;</span><span class="o">;</span>
<span class="c1">// Load training data</span>
<span class="n">DataFrame</span> <span class="n">training</span> <span class="o">=</span> <span class="n">sql</span><span class="o">.</span><span class="na">createDataFrame</span><span class="o">(</span><span class="n">MLUtils</span><span class="o">.</span><span class="na">loadLibSVMFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="n">path</span><span class="o">).</span><span class="na">toJavaRDD</span><span class="o">(),</span> <span class="n">LabeledPoint</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="n">LogisticRegression</span> <span class="n">lr</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">LogisticRegression</span><span class="o">()</span>
<span class="o">.</span><span class="na">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span>
<span class="o">.</span><span class="na">setRegParam</span><span class="o">(</span><span class="mf">0.3</span><span class="o">)</span>
<span class="o">.</span><span class="na">setElasticNetParam</span><span class="o">(</span><span class="mf">0.8</span><span class="o">);</span>
<span class="c1">// Fit the model</span>
<span class="n">LogisticRegressionModel</span> <span class="n">lrModel</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="na">fit</span><span class="o">(</span><span class="n">training</span><span class="o">);</span>
<span class="c1">// Print the weights and intercept for logistic regression</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;Weights: &quot;</span> <span class="o">+</span> <span class="n">lrModel</span><span class="o">.</span><span class="na">weights</span><span class="o">()</span> <span class="o">+</span> <span class="s">&quot; Intercept: &quot;</span> <span class="o">+</span> <span class="n">lrModel</span><span class="o">.</span><span class="na">intercept</span><span class="o">());</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></div>
</div>
<div data-lang="python">
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyspark.ml.classification</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">pyspark.mllib.regression</span> <span class="kn">import</span> <span class="n">LabeledPoint</span>
<span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span>
<span class="c"># Load training data</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&quot;data/mllib/sample_libsvm_data.txt&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">regParam</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">elasticNetParam</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="c"># Fit the model</span>
<span class="n">lrModel</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
<span class="c"># Print the weights and intercept for logistic regression</span>
<span class="k">print</span><span class="p">(</span><span class="s">&quot;Weights: &quot;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">lrModel</span><span class="o">.</span><span class="n">weights</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">&quot;Intercept: &quot;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">lrModel</span><span class="o">.</span><span class="n">intercept</span><span class="p">))</span></code></pre></div>
</div>
</div>
<p>The <code>spark.ml</code> implementation of logistic regression also supports
extracting a summary of the model over the training set. Note that the
predictions and metrics, which are stored as <code>DataFrame</code>s in
<code>BinaryLogisticRegressionSummary</code>, are annotated <code>@transient</code> and hence
are only available on the driver.</p>
<div class="codetabs">
<div data-lang="scala">
<p><a href="api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary"><code>LogisticRegressionTrainingSummary</code></a>
provides a summary for a
<a href="api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel"><code>LogisticRegressionModel</code></a>.
Currently, only binary classification is supported and the
summary must be explicitly cast to
<a href="api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary"><code>BinaryLogisticRegressionTrainingSummary</code></a>.
This will likely change when multiclass classification is supported.</p>
<p>Continuing the earlier example:</p>
<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">org.apache.spark.ml.classification.BinaryLogisticRegressionSummary</span>
<span class="k">import</span> <span class="nn">org.apache.spark.sql.functions.max</span>
<span class="c1">// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example</span>
<span class="k">val</span> <span class="n">trainingSummary</span> <span class="k">=</span> <span class="n">lrModel</span><span class="o">.</span><span class="n">summary</span>
<span class="c1">// Obtain the objective per iteration.</span>
<span class="k">val</span> <span class="n">objectiveHistory</span> <span class="k">=</span> <span class="n">trainingSummary</span><span class="o">.</span><span class="n">objectiveHistory</span>
<span class="n">objectiveHistory</span><span class="o">.</span><span class="n">foreach</span><span class="o">(</span><span class="n">loss</span> <span class="k">=&gt;</span> <span class="n">println</span><span class="o">(</span><span class="n">loss</span><span class="o">))</span>
<span class="c1">// Obtain the metrics computed on the training set.</span>
<span class="c1">// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a</span>
<span class="c1">// binary classification problem.</span>
<span class="k">val</span> <span class="n">binarySummary</span> <span class="k">=</span> <span class="n">trainingSummary</span><span class="o">.</span><span class="n">asInstanceOf</span><span class="o">[</span><span class="kt">BinaryLogisticRegressionSummary</span><span class="o">]</span>
<span class="c1">// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.</span>
<span class="k">val</span> <span class="n">roc</span> <span class="k">=</span> <span class="n">binarySummary</span><span class="o">.</span><span class="n">roc</span>
<span class="n">roc</span><span class="o">.</span><span class="n">show</span><span class="o">()</span>
<span class="n">println</span><span class="o">(</span><span class="n">binarySummary</span><span class="o">.</span><span class="n">areaUnderROC</span><span class="o">)</span>
<span class="c1">// Set the model threshold to maximize F-Measure</span>
<span class="k">val</span> <span class="n">fMeasure</span> <span class="k">=</span> <span class="n">binarySummary</span><span class="o">.</span><span class="n">fMeasureByThreshold</span>
<span class="k">val</span> <span class="n">maxFMeasure</span> <span class="k">=</span> <span class="n">fMeasure</span><span class="o">.</span><span class="n">select</span><span class="o">(</span><span class="n">max</span><span class="o">(</span><span class="s">&quot;F-Measure&quot;</span><span class="o">)).</span><span class="n">head</span><span class="o">().</span><span class="n">getDouble</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span>
<span class="k">val</span> <span class="n">bestThreshold</span> <span class="k">=</span> <span class="n">fMeasure</span><span class="o">.</span><span class="n">where</span><span class="o">(</span><span class="n">$</span><span class="s">&quot;F-Measure&quot;</span> <span class="o">===</span> <span class="n">maxFMeasure</span><span class="o">).</span>
<span class="n">select</span><span class="o">(</span><span class="s">&quot;threshold&quot;</span><span class="o">).</span><span class="n">head</span><span class="o">().</span><span class="n">getDouble</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span>
<span class="n">lrModel</span><span class="o">.</span><span class="n">setThreshold</span><span class="o">(</span><span class="n">bestThreshold</span><span class="o">)</span></code></pre></div>
</div>
<div data-lang="java">
<p><a href="api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html"><code>LogisticRegressionTrainingSummary</code></a>
provides a summary for a
<a href="api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html"><code>LogisticRegressionModel</code></a>.
Currently, only binary classification is supported and the
summary must be explicitly cast to
<a href="api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html"><code>BinaryLogisticRegressionTrainingSummary</code></a>.
This will likely change when multiclass classification is supported.</p>
<p>Continuing the earlier example:</p>
<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.spark.ml.classification.LogisticRegressionTrainingSummary</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.ml.classification.BinaryLogisticRegressionSummary</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.sql.functions</span><span class="o">;</span>
<span class="c1">// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier example</span>
<span class="n">LogisticRegressionTrainingSummary</span> <span class="n">trainingSummary</span> <span class="o">=</span> <span class="n">lrModel</span><span class="o">.</span><span class="na">summary</span><span class="o">();</span>
<span class="c1">// Obtain the loss per iteration.</span>
<span class="kt">double</span><span class="o">[]</span> <span class="n">objectiveHistory</span> <span class="o">=</span> <span class="n">trainingSummary</span><span class="o">.</span><span class="na">objectiveHistory</span><span class="o">();</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">double</span> <span class="n">lossPerIteration</span> <span class="o">:</span> <span class="n">objectiveHistory</span><span class="o">)</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">lossPerIteration</span><span class="o">);</span>
<span class="o">}</span>
<span class="c1">// Obtain the metrics computed on the training set.</span>
<span class="c1">// We cast the summary to a BinaryLogisticRegressionSummary since the problem is a</span>
<span class="c1">// binary classification problem.</span>
<span class="n">BinaryLogisticRegressionSummary</span> <span class="n">binarySummary</span> <span class="o">=</span> <span class="o">(</span><span class="n">BinaryLogisticRegressionSummary</span><span class="o">)</span> <span class="n">trainingSummary</span><span class="o">;</span>
<span class="c1">// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.</span>
<span class="n">DataFrame</span> <span class="n">roc</span> <span class="o">=</span> <span class="n">binarySummary</span><span class="o">.</span><span class="na">roc</span><span class="o">();</span>
<span class="n">roc</span><span class="o">.</span><span class="na">show</span><span class="o">();</span>
<span class="n">roc</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="s">&quot;FPR&quot;</span><span class="o">).</span><span class="na">show</span><span class="o">();</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">binarySummary</span><span class="o">.</span><span class="na">areaUnderROC</span><span class="o">());</span>
<span class="c1">// Get the threshold corresponding to the maximum F-Measure and set it as the</span>
<span class="c1">// model threshold.</span>
<span class="n">DataFrame</span> <span class="n">fMeasure</span> <span class="o">=</span> <span class="n">binarySummary</span><span class="o">.</span><span class="na">fMeasureByThreshold</span><span class="o">();</span>
<span class="kt">double</span> <span class="n">maxFMeasure</span> <span class="o">=</span> <span class="n">fMeasure</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="n">functions</span><span class="o">.</span><span class="na">max</span><span class="o">(</span><span class="s">&quot;F-Measure&quot;</span><span class="o">)).</span><span class="na">head</span><span class="o">().</span><span class="na">getDouble</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
<span class="kt">double</span> <span class="n">bestThreshold</span> <span class="o">=</span> <span class="n">fMeasure</span><span class="o">.</span><span class="na">where</span><span class="o">(</span><span class="n">fMeasure</span><span class="o">.</span><span class="na">col</span><span class="o">(</span><span class="s">&quot;F-Measure&quot;</span><span class="o">).</span><span class="na">equalTo</span><span class="o">(</span><span class="n">maxFMeasure</span><span class="o">)).</span>
<span class="n">select</span><span class="o">(</span><span class="s">&quot;threshold&quot;</span><span class="o">).</span><span class="na">head</span><span class="o">().</span><span class="na">getDouble</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
<span class="n">lrModel</span><span class="o">.</span><span class="na">setThreshold</span><span class="o">(</span><span class="n">bestThreshold</span><span class="o">);</span></code></pre></div>
</div>
<!--- TODO: Add python model summaries once implemented -->
<div data-lang="python">
<p>Logistic regression model summary is not yet supported in Python.</p>
</div>
</div>
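<p>The threshold selection in the Scala and Java examples above is a two-step lookup: find the maximum F-Measure, then take the threshold of a row that attains it. Since the model summary is not yet available in Python, the following is only a plain-Python sketch of that logic. The <code>(threshold, F-Measure)</code> pairs are made up for illustration and the function name <code>best_threshold</code> is hypothetical:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Hypothetical (threshold, F-Measure) pairs standing in for the rows of
# the fMeasureByThreshold DataFrame; the values are invented.
f_measure_rows = [
    (0.9, 0.40),
    (0.7, 0.65),
    (0.5, 0.80),
    (0.3, 0.72),
]


def best_threshold(rows):
    """Return the threshold of the first row with the largest F-Measure."""
    if not rows:
        raise ValueError("rows must not be empty")
    # Step 1: find the maximum F-Measure
    max_f_measure = max(f_measure for _, f_measure in rows)
    # Step 2: return the threshold of the first row attaining that maximum
    for threshold, f_measure in rows:
        if f_measure == max_f_measure:
            return threshold


selected = best_threshold(f_measure_rows)
print(selected)  # prints 0.5
</code></pre></div>
<p>If several thresholds share the maximum F-Measure, this sketch keeps the first one it encounters.</p>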
<h2 id="example-linear-regression">Example: Linear Regression</h2>
<p>The interface for working with linear regression models and model
summaries is similar to the logistic regression case. The following
example demonstrates training an elastic net regularized linear
regression model and extracting model summary statistics.</p>
<div class="codetabs">
<div data-lang="scala">
<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">org.apache.spark.ml.regression.LinearRegression</span>
<span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span>
<span class="c1">// Load training data</span>
<span class="k">val</span> <span class="n">training</span> <span class="k">=</span> <span class="nc">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">&quot;data/mllib/sample_libsvm_data.txt&quot;</span><span class="o">).</span><span class="n">toDF</span><span class="o">()</span>
<span class="k">val</span> <span class="n">lr</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">LinearRegression</span><span class="o">()</span>
<span class="o">.</span><span class="n">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span>
<span class="o">.</span><span class="n">setRegParam</span><span class="o">(</span><span class="mf">0.3</span><span class="o">)</span>
<span class="o">.</span><span class="n">setElasticNetParam</span><span class="o">(</span><span class="mf">0.8</span><span class="o">)</span>
<span class="c1">// Fit the model</span>
<span class="k">val</span> <span class="n">lrModel</span> <span class="k">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="o">(</span><span class="n">training</span><span class="o">)</span>
<span class="c1">// Print the weights and intercept for linear regression</span>
<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;Weights: ${lrModel.weights} Intercept: ${lrModel.intercept}&quot;</span><span class="o">)</span>
<span class="c1">// Summarize the model over the training set and print out some metrics</span>
<span class="k">val</span> <span class="n">trainingSummary</span> <span class="k">=</span> <span class="n">lrModel</span><span class="o">.</span><span class="n">summary</span>
<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;numIterations: ${trainingSummary.totalIterations}&quot;</span><span class="o">)</span>
<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;objectiveHistory: ${trainingSummary.objectiveHistory.toList}&quot;</span><span class="o">)</span>
<span class="n">trainingSummary</span><span class="o">.</span><span class="n">residuals</span><span class="o">.</span><span class="n">show</span><span class="o">()</span>
<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;RMSE: ${trainingSummary.rootMeanSquaredError}&quot;</span><span class="o">)</span>
<span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;r2: ${trainingSummary.r2}&quot;</span><span class="o">)</span></code></pre></div>
</div>
<div data-lang="java">
<div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.spark.ml.regression.LinearRegression</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.ml.regression.LinearRegressionModel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.ml.regression.LinearRegressionTrainingSummary</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.regression.LabeledPoint</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.SparkConf</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.SparkContext</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.sql.DataFrame</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.spark.sql.SQLContext</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">LinearRegressionWithElasticNetExample</span> <span class="o">{</span>
  <span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">SparkConf</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkConf</span><span class="o">()</span>
      <span class="o">.</span><span class="na">setAppName</span><span class="o">(</span><span class="s">&quot;Linear Regression with Elastic Net Example&quot;</span><span class="o">);</span>
    <span class="n">SparkContext</span> <span class="n">sc</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">);</span>
    <span class="n">SQLContext</span> <span class="n">sql</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">);</span>
    <span class="n">String</span> <span class="n">path</span> <span class="o">=</span> <span class="s">&quot;data/mllib/sample_libsvm_data.txt&quot;</span><span class="o">;</span>
    <span class="c1">// Load training data</span>
    <span class="n">DataFrame</span> <span class="n">training</span> <span class="o">=</span> <span class="n">sql</span><span class="o">.</span><span class="na">createDataFrame</span><span class="o">(</span><span class="n">MLUtils</span><span class="o">.</span><span class="na">loadLibSVMFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="n">path</span><span class="o">).</span><span class="na">toJavaRDD</span><span class="o">(),</span> <span class="n">LabeledPoint</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
    <span class="n">LinearRegression</span> <span class="n">lr</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">LinearRegression</span><span class="o">()</span>
      <span class="o">.</span><span class="na">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span>
      <span class="o">.</span><span class="na">setRegParam</span><span class="o">(</span><span class="mf">0.3</span><span class="o">)</span>
      <span class="o">.</span><span class="na">setElasticNetParam</span><span class="o">(</span><span class="mf">0.8</span><span class="o">);</span>
    <span class="c1">// Fit the model</span>
    <span class="n">LinearRegressionModel</span> <span class="n">lrModel</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="na">fit</span><span class="o">(</span><span class="n">training</span><span class="o">);</span>
    <span class="c1">// Print the weights and intercept for linear regression</span>
    <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;Weights: &quot;</span> <span class="o">+</span> <span class="n">lrModel</span><span class="o">.</span><span class="na">weights</span><span class="o">()</span> <span class="o">+</span> <span class="s">&quot; Intercept: &quot;</span> <span class="o">+</span> <span class="n">lrModel</span><span class="o">.</span><span class="na">intercept</span><span class="o">());</span>
    <span class="c1">// Summarize the model over the training set and print out some metrics</span>
    <span class="n">LinearRegressionTrainingSummary</span> <span class="n">trainingSummary</span> <span class="o">=</span> <span class="n">lrModel</span><span class="o">.</span><span class="na">summary</span><span class="o">();</span>
    <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;numIterations: &quot;</span> <span class="o">+</span> <span class="n">trainingSummary</span><span class="o">.</span><span class="na">totalIterations</span><span class="o">());</span>
    <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;objectiveHistory: &quot;</span> <span class="o">+</span> <span class="n">Vectors</span><span class="o">.</span><span class="na">dense</span><span class="o">(</span><span class="n">trainingSummary</span><span class="o">.</span><span class="na">objectiveHistory</span><span class="o">()));</span>
    <span class="n">trainingSummary</span><span class="o">.</span><span class="na">residuals</span><span class="o">().</span><span class="na">show</span><span class="o">();</span>
    <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;RMSE: &quot;</span> <span class="o">+</span> <span class="n">trainingSummary</span><span class="o">.</span><span class="na">rootMeanSquaredError</span><span class="o">());</span>
    <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;r2: &quot;</span> <span class="o">+</span> <span class="n">trainingSummary</span><span class="o">.</span><span class="na">r2</span><span class="o">());</span>
  <span class="o">}</span>
<span class="o">}</span></code></pre></div>
</div>
<div data-lang="python">
<!--- TODO: Add python model summaries once implemented -->
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyspark.ml.regression</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">pyspark.mllib.regression</span> <span class="kn">import</span> <span class="n">LabeledPoint</span>
<span class="kn">from</span> <span class="nn">pyspark.mllib.util</span> <span class="kn">import</span> <span class="n">MLUtils</span>
<span class="c"># Load training data</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="s">&quot;data/mllib/sample_libsvm_data.txt&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">regParam</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">elasticNetParam</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="c"># Fit the model</span>
<span class="n">lrModel</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
<span class="c"># Print the weights and intercept for linear regression</span>
<span class="k">print</span><span class="p">(</span><span class="s">&quot;Weights: &quot;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">lrModel</span><span class="o">.</span><span class="n">weights</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">&quot;Intercept: &quot;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">lrModel</span><span class="o">.</span><span class="n">intercept</span><span class="p">))</span>
<span class="c"># Linear regression model summary is not yet supported in Python.</span></code></pre></div>
</div>
</div>
<h1 id="optimization">Optimization</h1>
<p>The optimization algorithm underlying the implementation is
<a href="http://research-srv.microsoft.com/en-us/um/people/jfgao/paper/icml07scalable.pdf">Orthant-Wise Limited-memory
Quasi-Newton</a>
(OWL-QN), an extension of L-BFGS that can effectively handle L1
regularization and elastic net.</p>
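<p>To make the roles of the two regularization parameters and of OWL-QN concrete, the following is a minimal sketch in plain Python; it is not Spark's implementation. It assumes the usual elastic net convention, in which <code>elasticNetParam</code> is the mixing weight between the L1 and L2 terms and <code>regParam</code> is the overall strength, and it illustrates the orthant-wise pseudo-gradient that OWL-QN uses in place of the ordinary gradient where the L1 term is not differentiable. The function names <code>split_elastic_net</code> and <code>pseudo_gradient</code>, and the sample weights and gradients, are hypothetical and chosen only for illustration.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import math


def split_elastic_net(reg_param, elastic_net_param):
    """Return (l1_strength, l2_strength) for the given regParam and elasticNetParam.

    Assumed convention: penalty = l1_strength * ||w||_1 + l2_strength * 0.5 * ||w||_2^2.
    """
    l1_strength = elastic_net_param * reg_param
    l2_strength = (1.0 - elastic_net_param) * reg_param
    return l1_strength, l2_strength


def pseudo_gradient(weights, smooth_grad, l1_strength):
    """Orthant-wise pseudo-gradient of smooth_part(w) + l1_strength * ||w||_1.

    smooth_grad is the gradient of the differentiable part (loss plus any L2 term).
    """
    result = []
    for w_i, g_i in zip(weights, smooth_grad):
        if w_i == 0.0:
            # At zero the L1 term has a kink. Keep the one-sided derivative along
            # which the objective can still decrease, or 0.0 if there is none.
            result.append(max(g_i - l1_strength, 0.0) + min(g_i + l1_strength, 0.0))
        else:
            # Away from zero the L1 term is differentiable: add l1_strength * sign(w_i).
            result.append(g_i + l1_strength * math.copysign(1.0, w_i))
    return result


# Illustrative values only: regParam=0.3 and elasticNetParam=0.8 as in the examples.
l1, l2 = split_elastic_net(0.3, 0.8)
print(l1, l2)
print(pseudo_gradient([0.5, -0.5, 0.0, 0.0], [0.1, 0.1, 0.1, 1.0], l1))</code></pre></div>
<p>With the settings used in the examples (<code>regParam</code> 0.3, <code>elasticNetParam</code> 0.8), the L1 term gets a strength of about 0.24 and the L2 term about 0.06. A coordinate that sits at zero receives a nonzero pseudo-gradient only when the smooth gradient exceeds the L1 strength in magnitude, which is what lets the optimizer keep weights exactly at zero.</p>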
</div> <!-- /container -->
<script src="js/vendor/jquery-1.8.0.min.js"></script>
<script src="js/vendor/bootstrap.min.js"></script>
<script src="js/vendor/anchor.min.js"></script>
<script src="js/main.js"></script>
<!-- MathJax Section -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<script>
// Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS.
// We could use "//cdn.mathjax...", but that won't support "file://".
(function(d, script) {
script = d.createElement('script');
script.type = 'text/javascript';
script.async = true;
script.onload = function(){
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
}
});
};
script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') +
'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
d.getElementsByTagName('head')[0].appendChild(script);
}(document));
</script>
</body>
</html>