<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
<head>
<title>SystemML Algorithms Reference - Regression - SystemML 1.2.0</title>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/main.css">
<link rel="stylesheet" href="css/pygments-default.css">
<link rel="shortcut icon" href="img/favicon.png">
</head>
<body>
<!--[if lt IE 7]>
<p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p>
<![endif]-->
<header class="navbar navbar-default navbar-fixed-top" id="topbar">
<div class="container">
<div class="navbar-header">
<div class="navbar-brand brand projectlogo">
<a href="http://systemml.apache.org/"><img class="logo" src="img/systemml-logo.png" alt="Apache SystemML" title="Apache SystemML"/></a>
</div>
<div class="navbar-brand brand projecttitle">
<a href="http://systemml.apache.org/">Apache SystemML<sup id="trademark"></sup></a><br/>
<span class="version">1.2.0</span>
</div>
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<nav class="navbar-collapse collapse">
<ul class="nav navbar-nav navbar-right">
<li><a href="index.html">Overview</a></li>
<li><a href="https://github.com/apache/systemml">GitHub</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Documentation<b class="caret"></b></a>
<ul class="dropdown-menu" role="menu">
<li><b>Running SystemML:</b></li>
<li><a href="https://github.com/apache/systemml">SystemML GitHub README</a></li>
<li><a href="spark-mlcontext-programming-guide.html">Spark MLContext</a></li>
<li><a href="spark-batch-mode.html">Spark Batch Mode</a>
<li><a href="hadoop-batch-mode.html">Hadoop Batch Mode</a>
<li><a href="standalone-guide.html">Standalone Guide</a></li>
<li><a href="jmlc.html">Java Machine Learning Connector (JMLC)</a>
<li class="divider"></li>
<li><b>Language Guides:</b></li>
<li><a href="dml-language-reference.html">DML Language Reference</a></li>
<li><a href="beginners-guide-to-dml-and-pydml.html">Beginner's Guide to DML and PyDML</a></li>
<li><a href="beginners-guide-python.html">Beginner's Guide for Python Users</a></li>
<li><a href="python-reference.html">Reference Guide for Python Users</a></li>
<li class="divider"></li>
<li><b>ML Algorithms:</b></li>
<li><a href="algorithms-reference.html">Algorithms Reference</a></li>
<li class="divider"></li>
<li><b>Tools:</b></li>
<li><a href="debugger-guide.html">Debugger Guide</a></li>
<li><a href="developer-tools-systemml.html">IDE Guide</a></li>
<li class="divider"></li>
<li><b>Other:</b></li>
<li><a href="contributing-to-systemml.html">Contributing to SystemML</a></li>
<li><a href="engine-dev-guide.html">Engine Developer Guide</a></li>
<li><a href="troubleshooting-guide.html">Troubleshooting Guide</a></li>
<li><a href="release-process.html">Release Process</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu" role="menu">
<li><a href="./api/java/index.html">Java</a></li>
<li><a href="./api/python/index.html">Python</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Issues<b class="caret"></b></a>
<ul class="dropdown-menu" role="menu">
<li><b>JIRA:</b></li>
<li><a href="https://issues.apache.org/jira/browse/SYSTEMML">SystemML JIRA</a></li>
</ul>
</li>
</ul>
</nav>
</div>
</header>
<div class="container" id="content">
<h1 class="title"><a href="algorithms-reference.html">SystemML Algorithms Reference</a></h1>
<h1 id="regression">4. Regression</h1>
<h2 id="linear-regression">4.1. Linear Regression</h2>
<h3 id="description">Description</h3>
<p>Linear Regression scripts are used to model the relationship between one
numerical response variable and one or more explanatory (feature)
variables. The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$
where $x_i$ is a numerical vector of feature variables and $y_i$ is a
numerical response value for each training data record. The feature
vectors are provided as a matrix $X$ of size $n\,{\times}\,m$, where $n$
is the number of records and $m$ is the number of features. The observed
response values are provided as a 1-column matrix $Y$, with a numerical
value $y_i$ for each $x_i$ in the corresponding row of matrix $X$.</p>
<p>In linear regression, we predict the distribution of the response $y_i$
based on a fixed linear combination of the features in $x_i$. We assume
that there exist constant regression coefficients
$\beta_0, \beta_1, \ldots, \beta_m$ and a constant residual
variance $\sigma^2$ such that</p>
<script type="math/tex; mode=display">\begin{equation}
y_i \sim Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\,
\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m}
\end{equation}</script>
<p>Distribution
$y_i \sim Normal(\mu_i, \sigma^2)$
models the &#8220;unexplained&#8221; residual noise and is assumed independent
across different records.</p>
<p>The goal is to estimate the regression coefficients and the residual
variance. Once they are accurately estimated, we can make predictions
about $y_i$ given $x_i$ in new records. We can also use the $\beta_j$’s
to analyze the influence of individual features on the response value,
and assess the quality of this model by comparing residual variance in
the response, left after prediction, with its total variance.</p>
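<p>As a small illustration of model (1), the following NumPy sketch generates a synthetic dataset that obeys these assumptions; the sizes and parameter values are arbitrary, and such a pair $(X, Y)$ is exactly what the scripts below expect as input.</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

# Hypothetical sizes and true parameters for model (1)
n, m = 1000, 5                        # records and features
rng = np.random.default_rng(42)
X = rng.standard_normal((n, m))       # feature matrix, n x m
beta0, beta = 2.0, rng.uniform(-1, 1, m)
sigma = 0.5                           # residual standard deviation

mu = beta0 + X @ beta                 # linear predictor mu_i
Y = rng.normal(mu, sigma)             # y_i ~ Normal(mu_i, sigma^2)
# (X, Y) could then be written out and handed to LinearRegDS.dml / LinearRegCG.dml</code></pre></div>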
<p>There are two scripts in our library, both doing the same estimation,
but using different computational methods. Depending on the size and the
sparsity of the feature matrix $X$, one or the other script may be more
efficient. The &#8220;direct solve&#8221; script <code>LinearRegDS.dml</code> is more
efficient when the number of features $m$ is relatively small
($m \sim 1000$ or less) and matrix $X$ is either tall or fairly dense
(has ${\gg}\,m^2$ nonzeros); otherwise, the &#8220;conjugate gradient&#8221; script
<code>LinearRegCG.dml</code> is more efficient. If $m &gt; 50000$, use only
<code>LinearRegCG.dml</code>.</p>
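<p>The rule of thumb above can be stated as a tiny helper. The sketch below merely encodes the heuristic from this paragraph; the cut-off factors standing in for a tall matrix and for much greater than $m^2$ nonzeros are arbitrary illustration choices, not constants taken from the scripts.</p>
<div class="highlight"><pre><code class="language-python">def pick_linreg_script(num_rows, num_features, num_nonzeros):
    # Encodes the rule of thumb above; the factors 100 and 10 are arbitrary stand-ins.
    if num_features &gt; 50000:
        return "LinearRegCG.dml"
    tall = num_rows &gt; 100 * num_features
    dense_enough = num_nonzeros &gt; 10 * num_features ** 2
    if num_features &lt;= 1000 and (tall or dense_enough):
        return "LinearRegDS.dml"
    return "LinearRegCG.dml"</code></pre></div>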
<h3 id="usage">Usage</h3>
<p><strong>Linear Regression - Direct Solve</strong>:</p>
<div class="codetabs">
<div data-lang="Python">
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="c"># C = 1/reg (to disable regularization, use float("inf"))</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">sqlCtx</span><span class="p">,</span> <span class="n">fit_intercept</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">),</span> <span class="n">solver</span><span class="o">=</span><span class="s">'direct-solve'</span><span class="p">)</span>
<span class="c"># X_train, y_train and X_test can be NumPy matrices or Pandas DataFrame or SciPy Sparse Matrix</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="c"># df_train is DataFrame that contains two columns: "features" (of type Vector) and "label". df_test is a DataFrame that contains the column "features"</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_train</span><span class="p">)</span></code></pre></div>
</div>
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f LinearRegDS.dml
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
O=[file]
icpt=[int]
reg=[double]
fmt=[format]
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f LinearRegDS.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
O=[file]
icpt=[int]
reg=[double]
fmt=[format]
</code></pre>
</div>
</div>
<p><strong>Linear Regression - Conjugate Gradient</strong>:</p>
<div class="codetabs">
<div data-lang="Python">
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="c"># C = 1/reg (to disable regularization, use float("inf"))</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">sqlCtx</span><span class="p">,</span> <span class="n">fit_intercept</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">tol</span><span class="o">=</span><span class="mf">0.000001</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">),</span> <span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">)</span>
<span class="c"># X_train, y_train and X_test can be NumPy matrices or Pandas DataFrames or SciPy Sparse matrices</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="c"># df_train is DataFrame that contains two columns: "features" (of type Vector) and "label". df_test is a DataFrame that contains the column "features"</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_train</span><span class="p">)</span></code></pre></div>
</div>
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f LinearRegCG.dml
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
O=[file]
Log=[file]
icpt=[int]
reg=[double]
tol=[double]
maxi=[int]
fmt=[format]
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f LinearRegCG.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
O=[file]
Log=[file]
icpt=[int]
reg=[double]
tol=[double]
maxi=[int]
fmt=[format]
</code></pre>
</div>
</div>
<h3 id="arguments-for-spark-and-hadoop-invocation">Arguments for Spark and Hadoop invocation</h3>
<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors, each row
constitutes one feature vector</p>
<p><strong>Y</strong>: Location to read the 1-column matrix of response values</p>
<p><strong>B</strong>: Location to store the estimated regression parameters (the $\beta_j$’s),
with the intercept parameter $\beta_0$ at position
B[$m\,{+}\,1$, 1] if available</p>
<p><strong>O</strong>: (default: <code>" "</code>) Location to store the CSV file of summary statistics defined
in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>; by default the statistics are printed to
standard output</p>
<p><strong>Log</strong>: (default: <code>" "</code>, <code>LinearRegCG.dml</code> only) Location to store
iteration-specific variables for monitoring and debugging purposes, see
<a href="algorithms-regression.html#table8"><strong>Table 8</strong></a>
for details.</p>
<p><strong>icpt</strong>: (default: <code>0</code>) Intercept presence and shifting/rescaling the features
in $X$:</p>
<ul>
<li>0 = no intercept (hence no $\beta_0$), no shifting or
rescaling of the features</li>
<li>1 = add intercept, but do not shift/rescale the features
in $X$</li>
<li>2 = add intercept, shift/rescale the features in $X$ to
mean 0, variance 1</li>
</ul>
<p><strong>reg</strong>: (default: <code>0.000001</code>) L2-regularization parameter $\lambda\geq 0$; set to nonzero for highly
dependent, sparse, or numerous ($m \gtrsim n/10$) features</p>
<p><strong>tol</strong>: (default: <code>0.000001</code>, <code>LinearRegCG.dml</code> only) Tolerance $\varepsilon\geq 0$ used in the
convergence criterion: we terminate conjugate gradient iterations when
the $\beta$-residual reduces in L2-norm by this factor</p>
<p><strong>maxi</strong>: (default: <code>0</code>, <code>LinearRegCG.dml</code> only) Maximum number of conjugate
gradient iterations, or <code>0</code> if no maximum limit provided</p>
<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>,
<code>mm</code>, or <code>csv</code>; see read/write functions in
SystemML Language Reference for details.</p>
<p>Please see <a href="https://apache.github.io/systemml/python-reference#mllearn-api">mllearn documentation</a> for
more details on the Python API.</p>
<h3 id="examples">Examples</h3>
<p><strong>Linear Regression - Direct Solve</strong>:</p>
<div class="codetabs">
<div data-lang="Python">
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="c"># Load the diabetes dataset</span>
<span class="n">diabetes</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_diabetes</span><span class="p">()</span>
<span class="c"># Use only one feature</span>
<span class="n">diabetes_X</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="c"># Split the data into training/testing sets</span>
<span class="n">diabetes_X_train</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span>
<span class="n">diabetes_X_test</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span>
<span class="c"># Split the targets into training/testing sets</span>
<span class="n">diabetes_y_train</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span>
<span class="n">diabetes_y_test</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span>
<span class="c"># Create linear regression object</span>
<span class="n">regr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">'direct-solve'</span><span class="p">)</span>
<span class="c"># Train the model using the training sets</span>
<span class="n">regr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">diabetes_X_train</span><span class="p">,</span> <span class="n">diabetes_y_train</span><span class="p">)</span>
<span class="c"># The mean square error</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Residual sum of squares: </span><span class="si">%.2</span><span class="s">f"</span> <span class="o">%</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">regr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">diabetes_X_test</span><span class="p">)</span> <span class="o">-</span> <span class="n">diabetes_y_test</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span></code></pre></div>
</div>
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f LinearRegDS.dml
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
fmt=csv
O=/user/ml/stats.csv
icpt=2
reg=1.0
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f LinearRegDS.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
fmt=csv
O=/user/ml/stats.csv
icpt=2
reg=1.0
</code></pre>
</div>
</div>
<p><strong>Linear Regression - Conjugate Gradient</strong>:</p>
<div class="codetabs">
<div data-lang="Python">
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="c"># Load the diabetes dataset</span>
<span class="n">diabetes</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_diabetes</span><span class="p">()</span>
<span class="c"># Use only one feature</span>
<span class="n">diabetes_X</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="c"># Split the data into training/testing sets</span>
<span class="n">diabetes_X_train</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span>
<span class="n">diabetes_X_test</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span>
<span class="c"># Split the targets into training/testing sets</span>
<span class="n">diabetes_y_train</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span>
<span class="n">diabetes_y_test</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span>
<span class="c"># Create linear regression object</span>
<span class="n">regr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">)</span>
<span class="c"># Train the model using the training sets</span>
<span class="n">regr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">diabetes_X_train</span><span class="p">,</span> <span class="n">diabetes_y_train</span><span class="p">)</span>
<span class="c"># The mean square error</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Residual sum of squares: </span><span class="si">%.2</span><span class="s">f"</span> <span class="o">%</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">regr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">diabetes_X_test</span><span class="p">)</span> <span class="o">-</span> <span class="n">diabetes_y_test</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span></code></pre></div>
</div>
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f LinearRegCG.dml
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
fmt=csv
O=/user/ml/stats.csv
icpt=2
reg=1.0
tol=0.00000001
maxi=100
Log=/user/ml/log.csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f LinearRegCG.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
fmt=csv
O=/user/ml/stats.csv
icpt=2
reg=1.0
tol=0.00000001
maxi=100
Log=/user/ml/log.csv
</code></pre>
</div>
</div>
<hr />
<p><a name="table7"></a>
<strong>Table 7</strong>: Besides $\beta$, the linear regression scripts compute a few summary statistics
listed below. The statistics are provided in CSV format, one comma-separated name-value
pair per line.</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>AVG_TOT_Y</td>
<td>Average of the response value $Y$</td>
</tr>
<tr>
<td>STDEV_TOT_Y</td>
<td>Standard Deviation of the response value $Y$</td>
</tr>
<tr>
<td>AVG_RES_Y</td>
<td>Average of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$, i.e. residual bias</td>
</tr>
<tr>
<td>STDEV_RES_Y</td>
<td>Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$</td>
</tr>
<tr>
<td>DISPERSION</td>
<td>GLM-style dispersion, i.e. residual sum of squares / #deg. fr.</td>
</tr>
<tr>
<td>R2</td>
<td>$R^2$ of residual with bias included vs. total average</td>
</tr>
<tr>
<td>ADJUSTED_R2</td>
<td>Adjusted $R^2$ of residual with bias included vs. total average</td>
</tr>
<tr>
<td>R2_NOBIAS</td>
<td>Plain $R^2$ of residual with bias subtracted vs. total average</td>
</tr>
<tr>
<td>ADJUSTED_R2_NOBIAS</td>
<td>Adjusted $R^2$ of residual with bias subtracted vs. total average</td>
</tr>
<tr>
<td>R2_VS_0</td>
<td>* $R^2$ of residual with bias included vs. zero constant</td>
</tr>
<tr>
<td>ADJUSTED_R2_VS_0</td>
<td>* Adjusted $R^2$ of residual with bias included vs. zero constant</td>
</tr>
</tbody>
</table>
<p>* The last two statistics are only printed if there is no intercept (<code>icpt=0</code>)</p>
<hr />
<p><a name="table8"></a>
<strong>Table 8</strong>: The <code>Log</code> file for the <code>LinearRegCG.dml</code> script
contains the following iteration variables in CSV format, each line
containing a triple (Name, Iteration#, Value) with Iteration# being 0
for initial values.</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>CG_RESIDUAL_NORM</td>
<td>L2-norm of conjug. grad. residual, which is $A$ %*% $\beta - t(X)$ %*% $y$ where $A = t(X)$ %*% $X + diag(\lambda)$, or a similar quantity</td>
</tr>
<tr>
<td>CG_RESIDUAL_RATIO</td>
<td>Ratio of current L2-norm of conjug. grad. residual over the initial</td>
</tr>
</tbody>
</table>
<hr />
<h3 id="details">Details</h3>
<p>To solve a linear regression problem over feature matrix $X$ and
response vector $Y$, we can find coefficients
$\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize the
joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed
statistical model (1). Since the joint likelihood of the
independent
$y_i \sim Normal(\mu_i, \sigma^2)$
is proportional to the product of
$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the
logarithm of this product, then multiply by $-2\sigma^2 &lt; 0$ to obtain a
least squares problem:</p>
<script type="math/tex; mode=display">\begin{equation}
\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\,
\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
\,\,\to\,\,\min
\end{equation}</script>
<p>This may not be enough, however. The minimum may
sometimes be attained over infinitely many $\beta$-vectors, for example
if $X$ has an all-0 column, or has linearly dependent columns, or has
fewer rows than columns. Even if (2) has a unique
solution, other $\beta$-vectors may be just a little suboptimal<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, yet
give significantly different predictions for new feature vectors. This
results in <em>overfitting</em>: prediction error for the training data ($X$
and $Y$) is much smaller than for the test data (new records).</p>
<p>Overfitting and degeneracy in the data is commonly mitigated by adding a
regularization penalty term to the least squares function:</p>
<script type="math/tex; mode=display">\begin{equation}
\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
\,+\,\, \lambda \sum_{j=1}^m \beta_j^2
\,\,\to\,\,\min
\end{equation}</script>
<p>The choice of $\lambda&gt;0$, the regularization
constant, typically involves cross-validation where the dataset is
repeatedly split into a training part (to estimate the $\beta_j$’s) and
a test part (to evaluate prediction accuracy), with the goal of
maximizing the test accuracy. In our scripts, $\lambda$ is provided as
input parameter <code>reg</code>.</p>
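<p>A minimal NumPy sketch of such a split-based search is given below; <code>fit</code> and <code>predict</code> are placeholders for any solver (for instance a wrapper that invokes one of the scripts, or the normal-equation solve sketched after equation (4)), and the grid, number of splits, and split fraction are arbitrary.</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def choose_reg(X, y, reg_grid, fit, predict, n_splits=5, seed=0):
    """Pick the reg value with the best average hold-out MSE.
    `fit(X, y, reg)` and `predict(model, X)` are placeholders for any solver."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = {reg: [] for reg in reg_grid}
    for _ in range(n_splits):
        idx = rng.permutation(n)
        train, test = idx[: int(0.8 * n)], idx[int(0.8 * n):]
        for reg in reg_grid:
            model = fit(X[train], y[train], reg)
            mse = np.mean((predict(model, X[test]) - y[test]) ** 2)
            scores[reg].append(mse)
    return min(reg_grid, key=lambda r: np.mean(scores[r]))</code></pre></div>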
<p>The solution to the least squares problem (3), through
taking the derivative and setting it to 0, has the matrix linear
equation form</p>
<script type="math/tex; mode=display">\begin{equation}
A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\,
\textrm{where}\,\,\,
A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} diag(\hspace{0.5pt}
\underbrace{\lambda,\ldots, \lambda}_{\scriptstyle m}, 0)
\end{equation}</script>
<p>where $[X,\,1]$ is $X$ with an extra column of 1s
appended on the right, and the diagonal matrix of $\lambda$’s has a zero
to keep the intercept $\beta_0$ unregularized. If the intercept is
disabled by setting <code>icpt=0</code>, the equation is simply $X^T X \beta = X^T Y$.</p>
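<p>The normal equations in (4) can be checked with a few lines of NumPy. This is a minimal sketch of a dense solve, not the external solver invoked by <code>LinearRegDS.dml</code>, and the argument names are hypothetical.</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def linreg_direct_solve(X, y, lam=1e-6, icpt=1):
    """Solve eq. (4); with icpt=0 this reduces to the regularized X^T X beta = X^T y."""
    if icpt == 0:
        A = X.T @ X + lam * np.eye(X.shape[1])      # all features penalized, no intercept
        return np.linalg.solve(A, X.T @ y)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # [X, 1]
    D = np.diag([lam] * X.shape[1] + [0.0])         # the intercept stays unregularized
    A = X1.T @ X1 + D
    return np.linalg.solve(A, X1.T @ y)             # beta_{1..m} followed by beta_0</code></pre></div>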
<p>We implemented two scripts for solving equation (4): one
is a &#8220;direct solver&#8221; that computes $A$ and then solves
$A\beta = [X,\,1]^T Y$ by calling an external package, the other
performs linear conjugate gradient (CG) iterations without ever
materializing $A$. The CG algorithm closely follows Algorithm 5.2 in
Chapter 5 of <a href="algorithms-bibliography.html">[Nocedal2006]</a>. Each step in the CG algorithm
computes a matrix-vector multiplication $q = Ap$ by first computing
$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$. Usually the number of
such multiplications, one per CG iteration, is much smaller than $m$.
The user can put a hard bound on it with input
parameter <code>maxi</code>, or use the default maximum of $m+1$ (or $m$
if no intercept) by having <code>maxi=0</code>. The CG iterations
terminate when the L2-norm of vector $r = A\beta - [X,\,1]^T Y$
decreases from its initial value (for $\beta=0$) by the tolerance factor
specified in input parameter <code>tol</code>.</p>
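<p>The following NumPy sketch mirrors the structure just described: it runs plain linear CG on $A\beta = [X,\,1]^T Y$, forming each product $q = Ap$ as $[X,\,1]^T([X,\,1]\,p)$ plus the regularization term, and stopping once the residual norm has dropped by the factor <code>tol</code>. It is an illustration of the idea, not a transcription of <code>LinearRegCG.dml</code>.</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def linreg_cg(X, y, lam=1e-6, tol=1e-6, maxi=0):
    n, m = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])            # [X, 1]
    reg = np.append(np.full(m, lam), 0.0)           # no penalty on the intercept
    c = X1.T @ y                                    # right-hand side [X,1]^T Y

    def matvec(p):                                  # q = A p without materializing A
        return X1.T @ (X1 @ p) + reg * p

    beta = np.zeros(m + 1)
    r = c.copy()                                    # residual for beta = 0
    p = r.copy()
    norm0 = np.linalg.norm(r)
    max_iter = maxi if maxi &gt; 0 else m + 1
    for _ in range(max_iter):
        q = matvec(p)
        alpha = (r @ r) / (p @ q)
        beta += alpha * p
        r_new = r - alpha * q
        if np.linalg.norm(r_new) &lt;= tol * norm0:    # residual shrank by the tol factor
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta</code></pre></div>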
<p>The CG algorithm is more efficient if computing
$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$,
an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix. The Direct Solver (DS) is more
efficient if $X$ takes up a lot more memory than $A$ (i.e. $X$ has a lot
more nonzeros than $m^2$) and if $m^2$ is small enough for the external
solver ($m \lesssim 50000$). A more precise determination between CG
and DS is subject to further research.</p>
<p>In addition to the $\beta$-vector, the scripts estimate the residual
standard deviation $\sigma$ and the $R^2$, the ratio of &#8220;explained&#8221;
variance to the total variance of the response variable. These
statistics only make sense if the number of degrees of freedom
$n\,{-}\,m\,{-}\,1$ is positive and the regularization constant
$\lambda$ is negligible or zero. The formulas for $\sigma$ and
$R^2$ are:</p>
<script type="math/tex; mode=display">R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad
\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad
R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}</script>
<p>where</p>
<script type="math/tex; mode=display">\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i -
\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad
\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2</script>
<p>Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the
estimated regression coefficients and the feature vectors. They may be
biased when no intercept is present, hence the RSS formula subtracts the
bias.</p>
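<p>A short NumPy sketch of these formulas, with hypothetical arrays <code>y</code> and <code>mu_hat</code> for the observed responses and predicted means, may be useful for checking the reported statistics:</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def linreg_summary(y, mu_hat, m):
    """RSS, TSS, sigma, R^2 and adjusted R^2 as defined above; m = number of features."""
    n = len(y)
    res = y - mu_hat
    rss = np.sum((res - res.mean()) ** 2)       # mean residual (bias) subtracted, as in the text
    tss = np.sum((y - y.mean()) ** 2)
    sigma = np.sqrt(rss / (n - m - 1))
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - sigma ** 2 * (n - 1) / tss
    return rss, tss, sigma, r2, adj_r2</code></pre></div>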
<p>Lastly, note that by choosing the input option <code>icpt=2</code> the
user can shift and rescale the columns of $X$ to have mean 0 and
variance 1. This is particularly important when applying
regularization to features with very different scales, because regularization
tends to penalize small-variance columns (which need large $\beta_j$’s)
more than large-variance columns (with small $\beta_j$’s). At the end,
the estimated regression coefficients are shifted and rescaled to apply
to the original features.</p>
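<p>The back-transformation implied by <code>icpt=2</code> follows directly from substituting the standardized features into the fitted model. A hedged NumPy sketch of that algebra (not the script's own code) is:</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def unscale_coefficients(beta_scaled, col_mean, col_std):
    """beta_scaled: coefficients for the standardized columns, intercept last.
    Returns coefficients that apply to the original, unscaled features."""
    b, b0 = beta_scaled[:-1], beta_scaled[-1]
    beta_orig = b / col_std                            # rescale the slopes
    beta0_orig = b0 - np.sum(b * col_mean / col_std)   # shift absorbed into the intercept
    return np.append(beta_orig, beta0_orig)</code></pre></div>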
<h3 id="returns">Returns</h3>
<p>The estimated regression coefficients (the $\hat{\beta}_j$’s) are
populated into a matrix and written to an HDFS file whose path/name was
provided as the <code>B</code> input argument. What this matrix
contains, and its size, depends on the input argument <code>icpt</code>,
which specifies the user’s intercept and rescaling choice:</p>
<ul>
<li><strong>icpt=0</strong>: No intercept, matrix $B$ has size $m\,{\times}\,1$, with
$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$.</li>
<li><strong>icpt=1</strong>: There is intercept, but no shifting/rescaling of $X$; matrix $B$ has
size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from
1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept
coefficient.</li>
<li><strong>icpt=2</strong>: There is intercept, and the features in $X$ are shifted to mean$ = 0$
and rescaled to variance$ = 1$; then there are two versions of
the $\hat{\beta}_j$’s, one for the original features and another for the
shifted/rescaled features. Now matrix $B$ has size
$(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and
$B[\cdot, 2]$ for the shifted/rescaled features, in the above format.
Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are
obtained from $B[\cdot, 2]$ by complementary shifting and rescaling.</li>
</ul>
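<p>As an illustration of the layouts just described, the following sketch applies a coefficient matrix $B$ (loaded as a dense 2-D NumPy array from whatever format was written) to a new feature matrix; the file path and the variable <code>X_new</code> are placeholders.</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def predict_from_B(X_new, B, icpt):
    """Apply the coefficient matrix B written by the linear regression scripts."""
    if icpt == 0:                        # B is m x 1, no intercept
        return X_new @ B[:, 0]
    # icpt = 1 or 2: intercept stored in the last row; for icpt = 2,
    # column 1 already refers to the original (unscaled) features.
    beta, beta0 = B[:-1, 0], B[-1, 0]
    return X_new @ beta + beta0

# B = np.loadtxt("B.mtx.csv", delimiter=",").reshape(-1, 1)   # placeholder path
# y_hat = predict_from_B(X_new, B, icpt=2)</code></pre></div>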
<p>The estimated summary statistics, including residual standard
deviation $\sigma$ and the $R^2$, are printed out or sent into a file
(if specified) in CSV format as defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>.
For conjugate gradient iterations, a log file with monitoring variables
can also be made available, see <a href="algorithms-regression.html#table8"><strong>Table 8</strong></a>.</p>
<hr />
<h2 id="stepwise-linear-regression">4.2. Stepwise Linear Regression</h2>
<h3 id="description-1">Description</h3>
<p>Our stepwise linear regression script selects a linear model based on
the Akaike information criterion (AIC): the model that gives rise to the
lowest AIC is computed.</p>
<h3 id="usage-1">Usage</h3>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f StepLinearRegDS.dml
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
S=[file]
O=[file]
icpt=[int]
thr=[double]
fmt=[format]
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f StepLinearRegDS.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
S=[file]
O=[file]
icpt=[int]
thr=[double]
fmt=[format]
</code></pre>
</div>
</div>
<h3 id="arguments-for-spark-and-hadoop-invocation-1">Arguments for Spark and Hadoop invocation</h3>
<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors, each row
contains one feature vector.</p>
<p><strong>Y</strong>: Location (on HDFS) to read the 1-column matrix of response values</p>
<p><strong>B</strong>: Location (on HDFS) to store the estimated regression parameters (the
$\beta_j$’s), with the intercept parameter $\beta_0$ at position
B[$m\,{+}\,1$, 1] if available</p>
<p><strong>S</strong>: (default: <code>" "</code>) Location (on HDFS) to store the selected feature-ids in the
order as computed by the algorithm; by default the selected feature-ids
are forwarded to the standard output.</p>
<p><strong>O</strong>: (default: <code>" "</code>) Location (on HDFS) to store the CSV-file of summary
statistics defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>; by default the
summary statistics are forwarded to the standard output.</p>
<p><strong>icpt</strong>: (default: <code>0</code>) Intercept presence and shifting/rescaling the features
in $X$:</p>
<ul>
<li>0 = no intercept (hence no $\beta_0$), no shifting or
rescaling of the features;</li>
<li>1 = add intercept, but do not shift/rescale the features
in $X$;</li>
<li>2 = add intercept, shift/rescale the features in $X$ to
mean 0, variance 1</li>
</ul>
<p><strong>thr</strong>: (default: <code>0.01</code>) Threshold to stop the algorithm: if the decrease in the value
of the AIC falls below <code>thr</code>, no further features are
checked and the algorithm stops.</p>
<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>,
<code>mm</code>, or <code>csv</code>; see read/write functions in
SystemML Language Reference for details.</p>
<h3 id="examples-1">Examples</h3>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f StepLinearRegDS.dml
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
S=/user/ml/selected.csv
O=/user/ml/stats.csv
icpt=2
thr=0.05
fmt=csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f StepLinearRegDS.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
S=/user/ml/selected.csv
O=/user/ml/stats.csv
icpt=2
thr=0.05
fmt=csv
</code></pre>
</div>
</div>
<h3 id="details-1">Details</h3>
<p>Stepwise linear regression iteratively selects predictive variables in
an automated procedure. Currently, our implementation supports forward
selection: starting from an empty model (without any variable) the
algorithm examines the addition of each variable based on the AIC as a
model comparison criterion. The AIC is defined as</p>
<script type="math/tex; mode=display">\begin{equation}
AIC = -2 \log{L} + 2 edf,\label{eq:AIC}
\end{equation}</script>
<p>where $L$ denotes the
likelihood of the fitted model and $edf$ is the equivalent degrees of
freedom, i.e., the number of estimated parameters. This procedure is
repeated until including no additional variable improves the model by a
certain threshold specified in the input parameter <code>thr</code>.</p>
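<p>For the Gaussian linear models fitted here, $-2\log L$ in (5) can be computed from the residual sum of squares, so a hypothetical helper for comparing candidate feature sets might look as follows (additive constants that cancel between models are dropped):</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def gaussian_aic(y, y_hat, edf):
    """AIC = -2 log L + 2 edf for a Gaussian model, dropping the constant n*(log(2*pi) + 1)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * edf

# Forward selection keeps the candidate feature whose addition lowers this value
# the most, and stops once the improvement falls below `thr`.</code></pre></div>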
<p>For fitting a model in each iteration we use the <code>direct solve</code> method
as in the script <code>LinearRegDS.dml</code> discussed in
<a href="algorithms-regression.html#linear-regression">Linear Regression</a>.</p>
<h3 id="returns-1">Returns</h3>
<p>Similar to the outputs from <code>LinearRegDS.dml</code> the stepwise
linear regression script computes the estimated regression coefficients
and stores them in matrix $B$ on HDFS. The format of matrix $B$ is
identical to the one produced by the scripts for linear regression (see
<a href="algorithms-regression.html#linear-regression">Linear Regression</a>). Additionally, <code>StepLinearRegDS.dml</code>
outputs the variable indices (stored in the 1-column matrix $S$) in the
order they have been selected by the algorithm, i.e., the $i$th entry in
matrix $S$ corresponds to the variable which improves the AIC the most
in the $i$th iteration. If the model with the lowest AIC includes no
variables, matrix $S$ contains a single 0. Moreover, the
estimated summary statistics as defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>
are printed out or stored in a file (if requested). In the case where an
empty model achieves the best AIC, these statistics are not produced.</p>
<hr />
<h2 id="generalized-linear-models">4.3. Generalized Linear Models</h2>
<h3 id="description-2">Description</h3>
<p>Generalized Linear Models 
[<a href="algorithms-bibliography.html">Gill2000</a>,
<a href="algorithms-bibliography.html">McCullagh1989</a>,
<a href="algorithms-bibliography.html">Nelder1972</a>]
extend the methodology of linear
and logistic regression to a variety of distributions commonly assumed
as noise effects in the response variable. As before, we are given a
collection of records $(x_1, y_1)$, …, $(x_n, y_n)$ where $x_i$ is a
numerical vector of explanatory (feature) variables of size $\dim x_i = m$, and $y_i$
is the response (dependent) variable observed for this vector. GLMs
assume that some linear combination of the features in $x_i$ determines
the <em>mean</em> $\mu_i$ of $y_i$, while the observed $y_i$ is a random
outcome of a noise distribution
$Prob[y\mid \mu_i]\,$<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>
with that mean $\mu_i$:</p>
<script type="math/tex; mode=display">x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim Prob[y\mid \mu_i]</script>
<p>In linear regression the response mean $\mu_i$ <em>equals</em> some linear
combination over $x_i$, denoted above by $\eta_i$. In logistic
regression with <script type="math/tex">y\in\{0, 1\}</script> (Bernoulli) the mean of $y$ is the same
as $Prob[y=1]$
and equals $1/(1+e^{-\eta_i})$, the logistic function of $\eta_i$. In
GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone
function called the <em>link function</em>: $\eta_i = g(\mu_i)$. The unknown
linear combination parameters $\beta_j$ are assumed to be the same for
all records.</p>
<p>The goal of the regression is to estimate the parameters $\beta_j$ from
the observed data. Once the $\beta_j$’s are accurately estimated, we can
make predictions about $y$ for a new feature vector $x$. To do so,
compute $\eta$ from $x$ and use the inverted link function
$\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of $y$; then use the
distribution $Prob[y\mid \mu]$
to make predictions about $y$. Both $g(\mu)$ and
$Prob[y\mid \mu]$
are user-provided. Our GLM script supports a standard set of
distributions and link functions, see below for details.</p>
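<p>To make the prediction chain $x \mapsto \eta \mapsto \mu$ concrete, the sketch below applies a fitted coefficient vector and an inverse link; the two inverse links shown correspond to the log and logit links listed in the tables below, and all variable names are hypothetical.</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def glm_predict_mean(X, beta, beta0, inv_link):
    eta = beta0 + X @ beta            # linear predictor eta_i
    return inv_link(eta)              # mean mu_i = g^{-1}(eta_i)

def inv_log(eta):
    return np.exp(eta)                # log link: eta = log(mu)

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta)) # logit link (Bernoulli / binomial)</code></pre></div>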
<h3 id="usage-2">Usage</h3>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM.dml
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
fmt=[format]
O=[file]
Log=[file]
dfam=[int]
vpow=[double]
link=[int]
lpow=[double]
yneg=[double]
icpt=[int]
reg=[double]
tol=[double]
disp=[double]
moi=[int]
mii=[int]
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
fmt=[format]
O=[file]
Log=[file]
dfam=[int]
vpow=[double]
link=[int]
lpow=[double]
yneg=[double]
icpt=[int]
reg=[double]
tol=[double]
disp=[double]
moi=[int]
mii=[int]
</code></pre>
</div>
</div>
<h3 id="arguments-for-spark-and-hadoop-invocation-2">Arguments for Spark and Hadoop invocation</h3>
<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors; each row
constitutes an example.</p>
<p><strong>Y</strong>: Location to read the response matrix, which may have 1 or 2 columns</p>
<p><strong>B</strong>: Location to store the estimated regression parameters (the $\beta_j$’s),
with the intercept parameter $\beta_0$ at position
B[$m\,{+}\,1$, 1] if available</p>
<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>,
<code>mm</code>, or <code>csv</code>; see read/write functions in
SystemML Language Reference for details.</p>
<p><strong>O</strong>: (default: <code>" "</code>) Location to write certain summary statistics described
in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a>,
by default it is standard output.</p>
<p><strong>Log</strong>: (default: <code>" "</code>) Location to store iteration-specific variables for monitoring
and debugging purposes, see <a href="algorithms-regression.html#table10"><strong>Table 10</strong></a> for details.</p>
<p><strong>dfam</strong>: (default: <code>1</code>) Distribution family code to specify
$Prob[y\mid \mu]$,
see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>:</p>
<ul>
<li>1 = power distributions with $Var(y) = \mu^{\alpha}$</li>
<li>2 = binomial or Bernoulli</li>
</ul>
<p><strong>vpow</strong>: (default: <code>0.0</code>) When <code>dfam=1</code>, this provides the $q$ in
$Var(y) = a\mu^q$, the power
dependence of the variance of $y$ on its mean. In particular, use:</p>
<ul>
<li>0.0 = Gaussian</li>
<li>1.0 = Poisson</li>
<li>2.0 = Gamma</li>
<li>3.0 = inverse Gaussian</li>
</ul>
<p><strong>link</strong>: (default: <code>0</code>) Link function code to determine the link
function $\eta = g(\mu)$:</p>
<ul>
<li>0 = canonical link (depends on the distribution family),
see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a></li>
<li>1 = power functions</li>
<li>2 = logit</li>
<li>3 = probit</li>
<li>4 = cloglog</li>
<li>5 = cauchit</li>
</ul>
<p><strong>lpow</strong>: (default: <code>1.0</code>) When link=1, this provides the $s$ in
$\eta = \mu^s$, the power link function; <code>lpow=0.0</code> gives the
log link $\eta = \log\mu$. Common power links:</p>
<ul>
<li>-2.0 = $1/\mu^2$</li>
<li>-1.0 = reciprocal</li>
<li>0.0 = log</li>
<li>0.5 = sqrt</li>
<li>1.0 = identity</li>
</ul>
<p><strong>yneg</strong>: (default: <code>0.0</code>) When <code>dfam=2</code> and the response matrix $Y$ has
1 column, this specifies the $y$-value used for Bernoulli &#8220;No&#8221; label
(&#8220;Yes&#8221; is $1$):
0.0 when $y\in\{0, 1\}$; -1.0 when
$y\in\{-1, 1\}$</p>
<p><strong>icpt</strong>: (default: <code>0</code>) Intercept and shifting/rescaling of the features in $X$:</p>
<ul>
<li>0 = no intercept (hence no $\beta_0$), no
shifting/rescaling of the features</li>
<li>1 = add intercept, but do not shift/rescale the features
in $X$</li>
<li>2 = add intercept, shift/rescale the features in $X$ to
mean 0, variance 1</li>
</ul>
<p><strong>reg</strong>: (default: <code>0.0</code>) L2-regularization parameter ($\lambda$)</p>
<p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance ($\varepsilon$) used in the convergence criterion: we
terminate the outer iterations when the deviance changes by less than
this factor; see below for details</p>
<p><strong>disp</strong>: (default: <code>0.0</code>) Dispersion parameter, or 0.0 to estimate it from
data</p>
<p><strong>moi</strong>: (default: <code>200</code>) Maximum number of outer (Fisher scoring) iterations</p>
<p><strong>mii</strong>: (default: <code>0</code>) Maximum number of inner (conjugate gradient) iterations, or 0
if no maximum limit provided</p>
<h3 id="examples-2">Examples</h3>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM.dml
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
fmt=csv
dfam=2
link=2
yneg=-1.0
icpt=2
reg=0.01
tol=0.00000001
disp=1.0
moi=100
mii=10
O=/user/ml/stats.csv
Log=/user/ml/log.csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
fmt=csv
dfam=2
link=2
yneg=-1.0
icpt=2
reg=0.01
tol=0.00000001
disp=1.0
moi=100
mii=10
O=/user/ml/stats.csv
Log=/user/ml/log.csv
</code></pre>
</div>
</div>
<hr />
<p><a name="table9"></a>
<strong>Table 9</strong>: Besides $\beta$, the GLM regression script computes a few summary
statistics listed below. They are provided in CSV format, one
comma-separated name-value pair per line.</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>TERMINATION_CODE</td>
<td>A positive integer indicating success/failure as follows: <br />1 = Converged successfully; 2 = Maximum # of iterations reached; 3 = Input (X, Y) out of range; 4 = Distribution/link not supported</td>
</tr>
<tr>
<td>BETA_MIN</td>
<td>Smallest beta value (regression coefficient), excluding the intercept</td>
</tr>
<tr>
<td>BETA_MIN_INDEX</td>
<td>Column index for the smallest beta value</td>
</tr>
<tr>
<td>BETA_MAX</td>
<td>Largest beta value (regression coefficient), excluding the intercept</td>
</tr>
<tr>
<td>BETA_MAX_INDEX</td>
<td>Column index for the largest beta value</td>
</tr>
<tr>
<td>INTERCEPT</td>
<td>Intercept value, or NaN if there is no intercept (if <code>icpt=0</code>)</td>
</tr>
<tr>
<td>DISPERSION</td>
<td>Dispersion used to scale deviance, provided in disp input argument or estimated (same as DISPERSION_EST) if disp argument is $\leq 0$</td>
</tr>
<tr>
<td>DISPERSION_EST</td>
<td>Dispersion estimated from the dataset</td>
</tr>
<tr>
<td>DEVIANCE_UNSCALED</td>
<td>Deviance from the saturated model, assuming dispersion $= 1.0$</td>
</tr>
<tr>
<td>DEVIANCE_SCALED</td>
<td>Deviance from the saturated model, scaled by DISPERSION value</td>
</tr>
</tbody>
</table>
<hr />
<p><a name="table10"></a>
<strong>Table 10</strong>: The <code>Log</code> file for GLM regression contains the following
iteration variables in CSV format, each line containing a triple (Name,
Iteration#, Value) with Iteration# being 0 for initial values.</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>NUM_CG_ITERS</td>
<td>Number of inner (Conj. Gradient) iterations in this outer iteration</td>
</tr>
<tr>
<td>IS_TRUST_REACHED</td>
<td>1 = trust region boundary was reached, 0 = otherwise</td>
</tr>
<tr>
<td>POINT_STEP_NORM</td>
<td>L2-norm of iteration step from old point ($\beta$-vector) to new point</td>
</tr>
<tr>
<td>OBJECTIVE</td>
<td>The loss function we minimize (negative partial log-likelihood)</td>
</tr>
<tr>
<td>OBJ_DROP_REAL</td>
<td>Reduction in the objective during this iteration, actual value</td>
</tr>
<tr>
<td>OBJ_DROP_PRED</td>
<td>Reduction in the objective predicted by a quadratic approximation</td>
</tr>
<tr>
<td>OBJ_DROP_RATIO</td>
<td>Actual-to-predicted reduction ratio, used to update the trust region</td>
</tr>
<tr>
<td>GRADIENT_NORM</td>
<td>L2-norm of the loss function gradient (omitted if point is rejected)</td>
</tr>
<tr>
<td>LINEAR_TERM_MIN</td>
<td>The minimum value of $X$ %*% $\beta$, used to check for overflows</td>
</tr>
<tr>
<td>LINEAR_TERM_MAX</td>
<td>The maximum value of $X$ %*% $\beta$, used to check for overflows</td>
</tr>
<tr>
<td>IS_POINT_UPDATED</td>
<td>1 = new point accepted; 0 = new point rejected, old point restored</td>
</tr>
<tr>
<td>TRUST_DELTA</td>
<td>Updated trust region size, the &#8220;delta&#8221;</td>
</tr>
</tbody>
</table>
<hr />
<p><a name="table11"></a>
<strong>Table 11</strong>: Common GLM distribution families and link functions. (Here &#8220;*&#8221; stands
for &#8220;any value.&#8221;)</p>
<table>
<thead>
<tr>
<th style="text-align: center">dfam</th>
<th style="text-align: center">vpow</th>
<th style="text-align: center">link</th>
<th style="text-align: center">lpow</th>
<th style="text-align: center">Distribution<br />Family</th>
<th style="text-align: center">Link<br /> Function</th>
<th style="text-align: center">Canonical</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">0.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">-1.0</td>
<td style="text-align: center">Gaussian</td>
<td style="text-align: center">inverse</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">0.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 0.0</td>
<td style="text-align: center">Gaussian</td>
<td style="text-align: center">log</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">0.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 1.0</td>
<td style="text-align: center">Gaussian</td>
<td style="text-align: center">identity</td>
<td style="text-align: center">Yes</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">1.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 0.0</td>
<td style="text-align: center">Poisson</td>
<td style="text-align: center">log</td>
<td style="text-align: center">Yes</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">1.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 0.5</td>
<td style="text-align: center">Poisson</td>
<td style="text-align: center">sq.root</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">1.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 1.0</td>
<td style="text-align: center">Poisson</td>
<td style="text-align: center">identity</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">2.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">-1.0</td>
<td style="text-align: center">Gamma</td>
<td style="text-align: center">inverse</td>
<td style="text-align: center">Yes</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">2.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 0.0</td>
<td style="text-align: center">Gamma</td>
<td style="text-align: center">log</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">2.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 1.0</td>
<td style="text-align: center">Gamma</td>
<td style="text-align: center">identity</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">3.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">-2.0</td>
<td style="text-align: center">Inverse Gauss</td>
<td style="text-align: center">$1/\mu^2$</td>
<td style="text-align: center">Yes</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">3.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">-1.0</td>
<td style="text-align: center">Inverse Gauss</td>
<td style="text-align: center">inverse</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">3.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 0.0</td>
<td style="text-align: center">Inverse Gauss</td>
<td style="text-align: center">log</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">3.0</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 1.0</td>
<td style="text-align: center">Inverse Gauss</td>
<td style="text-align: center">identity</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">*</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 0.0</td>
<td style="text-align: center">Binomial</td>
<td style="text-align: center">log</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">*</td>
<td style="text-align: center">1</td>
<td style="text-align: center"> 0.5</td>
<td style="text-align: center">Binomial</td>
<td style="text-align: center">sq.root</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">*</td>
<td style="text-align: center">2</td>
<td style="text-align: center">  *</td>
<td style="text-align: center">Binomial</td>
<td style="text-align: center">logit</td>
<td style="text-align: center">Yes</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">*</td>
<td style="text-align: center">3</td>
<td style="text-align: center">  *</td>
<td style="text-align: center">Binomial</td>
<td style="text-align: center">probit</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">*</td>
<td style="text-align: center">4</td>
<td style="text-align: center">  *</td>
<td style="text-align: center">Binomial</td>
<td style="text-align: center">cloglog</td>
<td style="text-align: center">&#160;</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">*</td>
<td style="text-align: center">5</td>
<td style="text-align: center">  *</td>
<td style="text-align: center">Binomial</td>
<td style="text-align: center">cauchit</td>
<td style="text-align: center">&#160;</td>
</tr>
</tbody>
</table>
<hr />
<p><a name="table12"></a>
<strong>Table 12</strong>: The supported non-power link functions for the Bernoulli and the
binomial distributions. Here $\mu$ is the Bernoulli mean.</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Link Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>Logit</td>
<td>$\displaystyle \eta = \log\big(\mu / (1 - \mu)\big)^{\mathstrut}$</td>
</tr>
<tr>
<td>Probit</td>
<td><script type="math/tex">\displaystyle \mu = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut} e^{-\frac{t^2}{2}} dt</script></td>
</tr>
<tr>
<td>Cloglog</td>
<td>$\displaystyle \eta = \log \big(- \log(1 - \mu)\big)^{\mathstrut}$</td>
</tr>
<tr>
<td>Cauchit</td>
<td>$\displaystyle \eta = \tan\big(\pi(\mu - 1/2)\big)$</td>
</tr>
</tbody>
</table>
<hr />
<h3 id="details-2">Details</h3>
<p>In GLM, the noise distribution
$Prob[y\mid \mu]$
of the response variable $y$ given its mean $\mu$ is restricted to have
the <em>exponential family</em> form</p>
<script type="math/tex; mode=display">\begin{equation}
Y \sim\, Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a}
+ c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = E(Y) = b'(\theta).
\end{equation}</script>
<p>Changing the mean in such a distribution simply
multiplies all
$Prob[y\mid \mu]$
by $e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so
that they again integrate to 1. Parameter $\theta$ is called
<em>canonical</em>, and the function $\theta = b'^{\,-1}(\mu)$ that relates it
to the mean is called the <em>canonical link</em>; constant $a$ is called
<em>dispersion</em> and rescales the variance of $y$. Many common distributions
can be put into this form, see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>. The canonical
parameter $\theta$ is often chosen to coincide with $\eta$, the linear
combination of the regression features; other choices for $\eta$ are
possible too.</p>
<p>Rather than specifying the canonical link, GLM distributions are
commonly defined by their variance
$Var(y)$ as the function of
the mean $\mu$. It can be shown from (6) that
$Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$.
For example, for the Bernoulli distribution
$Var(y) = \mu(1-\mu)$, for the
Poisson distribution
$Var(y) = \mu$, and for the Gaussian distribution
$Var(y) = a\cdot 1 = \sigma^2$.
It turns out that for many common distributions
$Var(y) = a\mu^q$, a power
function. We support all distributions where
$Var(y) = a\mu^q$, as well as
the Bernoulli and the binomial distributions.</p>
<p>For distributions with
$Var(y) = a\mu^q$ the
canonical link is also a power function, namely
$\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose
canonical link is $\theta = \log\mu$. We support all power link
functions in the form $\eta = \mu^s$, dropping any constant factor, with
$\eta = \log\mu$ for $s=0$. The binomial distribution has its own family
of link functions, which includes logit (the canonical link), probit,
cloglog, and cauchit (see <a href="algorithms-regression.html#table12"><strong>Table 12</strong></a>); we support
these only for the binomial and Bernoulli distributions. Links and
distributions are specified via four input parameters:
<code>dfam</code>, <code>vpow</code>, <code>link</code>, and
<code>lpow</code> (see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>).</p>
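<p>For intuition about how these four parameters pin down a model, the following
minimal NumPy sketch (an illustration outside SystemML, not the DML code) builds the
variance function $v(\mu) = \mu^q$ from <code>vpow</code> and the power link
$\eta = \mu^s$ from <code>lpow</code>, with the convention that <code>lpow=0.0</code>
means the log link.</p>
<pre><code>import numpy as np

def variance_fn(vpow):
    # v(mu) = mu^q for the power families (dfam=1):
    # vpow=0 Gaussian, 1 Poisson, 2 Gamma, 3 inverse Gaussian
    return lambda mu: np.power(mu, vpow)

def power_link(lpow):
    # eta = mu^s, with lpow=0.0 standing for eta = log(mu)
    if lpow == 0.0:
        return np.log, np.exp                       # link and inverse link
    return (lambda mu: np.power(mu, lpow),
            lambda eta: np.power(eta, 1.0 / lpow))

# Example: Poisson regression with the canonical log link (vpow=1.0, lpow=0.0)
v = variance_fn(1.0)
g, g_inv = power_link(0.0)
mu = np.array([0.5, 2.0, 4.0])
print(v(mu), g(mu), g_inv(g(mu)))
</code></pre>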
<p>The observed response values are provided to the regression script as a
matrix $Y$ having 1 or 2 columns. If a power distribution family is
selected (<code>dfam=1</code>), matrix $Y$ must have 1 column that
provides $y_i$ for each $x_i$ in the corresponding row of matrix $X$.
When <code>dfam=2</code> and $Y$ has 1 column, we assume the Bernoulli
distribution for <script type="math/tex">y_i\in\{y_{\mathrm{neg}}, 1\}</script> with $y_{\mathrm{neg}}$
from the input parameter <code>yneg</code>. When <code>dfam=2</code> and
$Y$ has 2 columns, we assume the binomial distribution; for each row $i$
in $X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide the positive and the
negative binomial counts respectively. Internally we convert the
1-column Bernoulli into the 2-column binomial with 0-versus-1 counts.</p>
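<p>As a small illustration (in NumPy, not the DML code), the Bernoulli-to-binomial
conversion described above simply turns each response into a (positive count,
negative count) pair:</p>
<pre><code>import numpy as np

def bernoulli_to_binomial(y, yneg=0.0):
    # y: length-n vector with entries in {yneg, 1};
    # returns an (n, 2) matrix: column 1 = positive count, column 2 = negative count
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    pos = (y == 1.0).astype(float)
    neg = (y == yneg).astype(float)
    return np.hstack([pos, neg])

Y = [1.0, -1.0, 1.0, -1.0]
print(bernoulli_to_binomial(Y, yneg=-1.0))
</code></pre>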
<p>We estimate the regression parameters via L2-regularized negative
log-likelihood minimization:</p>
<script type="math/tex; mode=display">f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big)
\,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min</script>
<p>where
$\theta_i$ and $b(\theta_i)$ are from (6); note that $a$ and
$c(y, a)$ are constant w.r.t. $\beta$ and can be ignored here. The
canonical parameter $\theta_i$ depends on both $\beta$ and $x_i$:</p>
<script type="math/tex; mode=display">\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\,
\big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)</script>
<p>The user-provided (via <code>reg</code>) regularization coefficient
$\lambda\geq 0$ can be used to mitigate overfitting and degeneracy in
the data. Note that the intercept is never regularized.</p>
<p>Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring
approximation to the difference
$\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$,
recomputed at each iteration:</p>
<script type="math/tex; mode=display">\begin{gathered}
\varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z,
\,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!diag(w) X \,+\, \lambda I\\
\textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta,
\,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\
\forall\,i = 1\ldots n: \,\,\,\,
w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1}
\!\!\!\!\!\!,\,\,\,\,\,\,\,\,\,
u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1}
\!\!\!\!\!\!.\,\,\,\,\end{gathered}</script>
<p>Here
$v(\mu_i)=Var(y_i)/a$, the
variance of $y_i$ as the function of the mean, and
$g'(\mu_i) = d\eta_i/d\mu_i$ is the link function derivative. The
Fisher scoring approximation is minimized by trust-region conjugate
gradient iterations (called the <em>inner</em> iterations, with the Fisher
scoring iterations as the <em>outer</em> iterations), which approximately solve
the following problem:</p>
<script type="math/tex; mode=display">1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\,
\|z\|_2 \leq \delta</script>
<p>The conjugate gradient algorithm closely follows
Algorithm 7.2 on page 171 of <a href="algorithms-bibliography.html">[Nocedal2006]</a>. The trust region
size $\delta$ is initialized as
$0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$ and updated as described
in <a href="algorithms-bibliography.html">[Nocedal2006]</a>.
The user can specify the maximum number of
the outer and the inner iterations with input parameters
<code>moi</code> and <code>mii</code>, respectively. The Fisher scoring
algorithm terminates successfully if
$2|\varDelta f(z; \beta)| &lt; (D_1(\beta) + 0.1)\hspace{0.5pt}{\varepsilon}$
where ${\varepsilon}&gt; 0$ is a tolerance supplied by the user via
<code>tol</code>, and $D_1(\beta)$ is the unit-dispersion deviance
estimated as</p>
<script type="math/tex; mode=display">D_1(\beta) \,\,=\,\, 2 \cdot \big(Prob[Y \mid \!
\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
\,\,-\,\,Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)</script>
<p>The deviance estimate is also produced as part of the output. Once the
Fisher scoring algorithm terminates, if requested by the user, we
estimate the dispersion $a$ from (6) using Pearson residuals</p>
<script type="math/tex; mode=display">\begin{equation}
\hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)}
\end{equation}</script>
<p>and use it to adjust our deviance estimate:
$D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$. If input argument
<code>disp</code> is <code>0.0</code>, we estimate $\hat{a}$; otherwise
we use its value as $a$. Note that in (7) $m$ counts
the intercept ($m \leftarrow m+1$) if it is present.</p>
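<p>To make one outer iteration concrete, here is a simplified NumPy sketch (an
illustration, not the trust-region code in <code>GLM.dml</code>) for the Poisson
family with the log link: it forms $w$, $u$, $A$, and $G$ as defined above and
solves $Az = -G$ directly instead of running the inner conjugate gradient
iterations.</p>
<pre><code>import numpy as np

def fisher_scoring_step(X, y, beta, lam=0.0):
    # Poisson family, log link: mu = exp(eta), v(mu) = mu, g'(mu) = 1/mu
    eta = X @ beta
    mu = np.exp(eta)
    v = mu                          # variance function v(mu)
    g_prime = 1.0 / mu              # d eta / d mu
    w = 1.0 / (v * g_prime**2)      # here w = mu
    u = (y - mu) / (v * g_prime)    # here u = y - mu
    A = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
    G = -X.T @ u + lam * beta
    z = np.linalg.solve(A, -G)      # exact minimizer of 1/2 z'Az + G'z
    return beta + z

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.poisson(lam=np.exp(X @ np.array([0.2, -0.1, 0.3])))
beta = np.zeros(3)
for _ in range(10):
    beta = fisher_scoring_step(X, y, beta, lam=0.1)
print(beta)
</code></pre>
<p>In <code>GLM.dml</code> the quadratic model is instead minimized within the trust
region $\|z\|_2 \leq \delta$ by the inner conjugate gradient iterations, which need
only matrix-vector products with $A$ rather than a direct solve.</p>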
<h3 id="returns-2">Returns</h3>
<p>The estimated regression parameters (the $\hat{\beta}_j$’s) are
populated into a matrix and written to an HDFS file whose path/name was
provided as the <code>B</code> input argument. What this matrix
contains, and its size, depends on the input argument <code>icpt</code>,
which specifies the user’s intercept and rescaling choice:</p>
<ul>
<li><strong>icpt=0</strong>: No intercept, matrix $B$ has size $m\,{\times}\,1$, with
$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$.</li>
<li><strong>icpt=1</strong>: There is an intercept, but no shifting/rescaling of $X$; matrix $B$ has
size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from
1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept
coefficient.</li>
<li><strong>icpt=2</strong>: There is an intercept, and the features in $X$ are shifted to mean${} = 0$
and rescaled to variance${} = 1$; then there are two versions of
the $\hat{\beta}_j$’s, one for the original features and another for the
shifted/rescaled features. Now matrix $B$ has size
$(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and
$B[\cdot, 2]$ for the shifted/rescaled features, in the above format.
Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are
obtained from $B[\cdot, 2]$ by complementary shifting and rescaling.</li>
</ul>
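<p>For the <code>icpt=2</code> case, the complementary shifting and rescaling can be
written out explicitly. Below is a small NumPy sketch (an illustration only, assuming
the features were standardized with their column means and standard deviations)
showing that the two columns of $B$ parameterize the same linear predictor:</p>
<pre><code>import numpy as np

def unscale_betas(b_scaled, means, stds):
    # b_scaled: (m+1,) coefficients for the standardized features,
    #           with the intercept in the last position (as in B[, 2])
    # returns the coefficients for the original features (as in B[, 1])
    b_feat = b_scaled[:-1] / stds
    b0 = b_scaled[-1] - np.sum(b_scaled[:-1] * means / stds)
    return np.append(b_feat, b0)

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
means, stds = X.mean(axis=0), X.std(axis=0)
Xs = (X - means) / stds                         # shifted/rescaled features
b_scaled = np.array([0.4, -0.2, 0.1, 1.5])      # hypothetical shifted/rescaled betas
b_orig = unscale_betas(b_scaled, means, stds)
eta_scaled = Xs @ b_scaled[:-1] + b_scaled[-1]
eta_orig = X @ b_orig[:-1] + b_orig[-1]
print(np.allclose(eta_scaled, eta_orig))        # True: same linear predictor
</code></pre>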
<p>Our script also estimates the dispersion $\hat{a}$ (or takes it from the
user’s input) and the deviances $D_1(\hat{\beta})$ and
$D_{\hat{a}}(\hat{\beta})$, see <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a> for details. A
log file with variables monitoring progress through the iterations can
also be made available, see <a href="algorithms-regression.html#table10"><strong>Table 10</strong></a>.</p>
<h3 id="see-also">See Also</h3>
<p>In case of binary classification problems, consider using L2-SVM or
binary logistic regression; for multiclass classification, use
multiclass SVM or multinomial logistic regression. For the special cases
of linear regression and logistic regression, it may be more efficient
to use the corresponding specialized scripts instead of GLM.</p>
<hr />
<h2 id="stepwise-generalized-linear-regression">4.4. Stepwise Generalized Linear Regression</h2>
<h3 id="description-3">Description</h3>
<p>Our stepwise generalized linear regression script selects a model based
on the Akaike information criterion (AIC): the model that gives rise to
the lowest AIC is provided. Note that currently only the Bernoulli
distribution family is supported (see below for details).</p>
<h3 id="usage-3">Usage</h3>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f StepGLM.dml
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
S=[file]
O=[file]
link=[int]
yneg=[double]
icpt=[int]
tol=[double]
disp=[double]
moi=[int]
mii=[int]
thr=[double]
fmt=[format]
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f StepGLM.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
S=[file]
O=[file]
link=[int]
yneg=[double]
icpt=[int]
tol=[double]
disp=[double]
moi=[int]
mii=[int]
thr=[double]
fmt=[format]
</code></pre>
</div>
</div>
<h3 id="arguments-for-spark-and-hadoop-invocation-3">Arguments for Spark and Hadoop invocation</h3>
<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors; each row is an
example.</p>
<p><strong>Y</strong>: Location (on HDFS) to read the response matrix, which may have 1 or 2
columns</p>
<p><strong>B</strong>: Location (on HDFS) to store the estimated regression parameters (the
$\beta_j$’s), with the intercept parameter $\beta_0$ at position
B[$m\,{+}\,1$, 1] if available</p>
<p><strong>S</strong>: (default: <code>" "</code>) Location (on HDFS) to store the selected feature-ids in the
order as computed by the algorithm; by default it is standard output.</p>
<p><strong>O</strong>: (default: <code>" "</code>) Location (on HDFS) to write certain summary statistics
described in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a>; by default it is standard
output.</p>
<p><strong>link</strong>: (default: <code>2</code>) Link function code to determine the link
function $\eta = g(\mu)$, see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>; currently the
following link functions are supported:</p>
<ul>
<li>1 = log</li>
<li>2 = logit</li>
<li>3 = probit</li>
<li>4 = cloglog</li>
</ul>
<p><strong>yneg</strong>: (default: <code>0.0</code>) Response value for Bernoulli &#8220;No&#8221; label, usually 0.0 or -1.0</p>
<p><strong>icpt</strong>: (default: <code>0</code>) Intercept and shifting/rescaling of the features in $X$:</p>
<ul>
<li>0 = no intercept (hence no $\beta_0$), no
shifting/rescaling of the features</li>
<li>1 = add intercept, but do not shift/rescale the features
in $X$</li>
<li>2 = add intercept, shift/rescale the features in $X$ to
mean 0, variance 1</li>
</ul>
<p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance ($\epsilon$) used in the convergence criterion: we
terminate the outer iterations when the deviance changes by less than
this factor; see below for details.</p>
<p><strong>disp</strong>: (default: <code>0.0</code>) Dispersion parameter, or <code>0.0</code> to estimate it from
data</p>
<p><strong>moi</strong>: (default: <code>200</code>) Maximum number of outer (Fisher scoring) iterations</p>
<p><strong>mii</strong>: (default: <code>0</code>) Maximum number of inner (conjugate gradient) iterations, or 0
if no maximum limit provided</p>
<p><strong>thr</strong>: (default: <code>0.01</code>) Threshold to stop the algorithm: if the decrease in the value
of the AIC falls below <code>thr</code>, no further features are
checked and the algorithm stops.</p>
<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>,
<code>mm</code>, or <code>csv</code>; see read/write functions in
SystemML Language Reference for details.</p>
<h3 id="examples-3">Examples</h3>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f StepGLM.dml
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
S=/user/ml/selected.csv
O=/user/ml/stats.csv
link=2
yneg=-1.0
icpt=2
tol=0.000001
moi=100
mii=10
thr=0.05
fmt=csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f StepGLM.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=/user/ml/X.mtx
Y=/user/ml/Y.mtx
B=/user/ml/B.mtx
S=/user/ml/selected.csv
O=/user/ml/stats.csv
link=2
yneg=-1.0
icpt=2
tol=0.000001
moi=100
mii=10
thr=0.05
fmt=csv
</code></pre>
</div>
</div>
<h3 id="details-3">Details</h3>
<p>Similar to <code>StepLinearRegDS.dml</code> our stepwise GLM script
builds a model by iteratively selecting predictive variables using a
forward selection strategy based on the AIC (5). Note that
currently only the Bernoulli distribution family (<code>fam=2</code> in
<a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>) together with the following link functions
are supported: <code>log</code>, <code>logit</code>, <code>probit</code>, and <code>cloglog</code> (link
<script type="math/tex">\in\{1,2,3,4\}</script> in <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>).</p>
<h3 id="returns-3">Returns</h3>
<p>Similar to the outputs from <code>GLM.dml</code> the stepwise GLM script
computes the estimated regression coefficients and stores them in matrix
$B$ on HDFS; matrix $B$ follows the same format as the one produced by
<code>GLM.dml</code> (see <a href="algorithms-regression.html#generalized-linear-models">Generalized Linear Models</a>). Additionally,
<code>StepGLM.dml</code> outputs the variable indices (stored in the
1-column matrix $S$) in the order they have been selected by the
algorithm, i.e., $i$th entry in matrix $S$ stores the variable which
improves the AIC the most in $i$th iteration. If the model with the
lowest AIC includes no variables matrix $S$ will be empty. Moreover, the
estimated summary statistics as defined in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a> are
printed out or stored in a file on HDFS (if requested); these statistics
will be provided only if the selected model is nonempty, i.e., contains
at least one variable.</p>
<hr />
<h2 id="regression-scoring-and-prediction">4.5. Regression Scoring and Prediction</h2>
<h3 id="description-4">Description</h3>
<p>Script <code>GLM-predict.dml</code> is intended to cover all linear
model based regressions, including linear regression, binomial and
multinomial logistic regression, and GLM regressions (Poisson, gamma,
binomial with probit link etc.). Having just one scoring script for all
these regressions simplifies maintenance and enhancement while ensuring
compatible interpretations for output statistics.</p>
<p>The script performs two functions, prediction and scoring. To perform
prediction, the script takes two matrix inputs: a collection of records
$X$ (without the response attribute) and the estimated regression
parameters $B$, also known as $\beta$. To perform scoring, in addition
to $X$ and $B$, the script takes the matrix of actual response
values $Y$ that are compared to the predictions made with $X$ and $B$.
Of course there are other, non-matrix, input arguments that specify the
model and the output format, see below for the full list.</p>
<p>We assume that our test/scoring dataset is given by
$n\,{\times}\,m$-matrix $X$ of numerical feature vectors, where each
row $x_i$ represents one feature vector of one record; we have $\dim x_i = m$. Each
record also includes the response variable $y_i$ that may be numerical,
single-label categorical, or multi-label categorical. A single-label
categorical $y_i$ is an integer category label, one label per record; a
multi-label $y_i$ is a vector of integer counts, one count for each
possible label, which represents multiple single-label events
(observations) for the same $x_i$. Internally we convert single-label
categoricals into multi-label categoricals by replacing each label $l$
with an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$. In
prediction-only tasks the actual $y_i$’s are not needed by the script,
but they are needed for scoring.</p>
<p>To perform prediction, the script matrix-multiplies $X$ and $B$, adding
the intercept if available, then applies the inverse of the model’s link
function. All GLMs assume that the linear combination of the features
in $x_i$ and the betas in $B$ determines the means $\mu_i$ of
the $y_i$’s (in numerical or multi-label categorical form) with
$\dim \mu_i = \dim y_i$. The observed $y_i$ is assumed to follow a
specified GLM family distribution
$Prob[y\mid \mu_i]$
with mean(s) $\mu_i$:</p>
<script type="math/tex; mode=display">x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim Prob[y\mid \mu_i]</script>
<p>If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number. Then
our script’s output matrix $M$ is the $n\,{\times}\,1$-vector of these
means $\mu_i$. Note that $\mu_i$ predicts the mean of $y_i$, not the
actual $y_i$. For example, in the Poisson distribution the mean is usually
fractional, but the actual $y_i$ is always an integer.</p>
<p>If $y_i$ is categorical, i.e. a vector of label counts for record $i$,
then $\mu_i$ is a vector of non-negative real numbers, one number
<script type="math/tex">\mu_{i,l}</script> per each label $l$. In this case we divide the <script type="math/tex">\mu_{i,l}</script>
by their sum $\sum_l \mu_{i,l}$ to obtain predicted label
probabilities <script type="math/tex">p_{i,l}\in [0, 1]</script>. The output matrix $M$ is the
$n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the
number of records and $k\,{+}\,1$ is the number of categories<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. Note
again that we do not predict the labels themselves, nor their actual
counts per record, but we predict the labels’ probabilities.</p>
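<p>A minimal NumPy sketch of this prediction step (an illustration, not
<code>GLM-predict.dml</code> itself) for the binomial case with the logit link:
append a column of ones for the intercept, form the linear predictor, apply the
inverse link, and report both the &#8220;Yes&#8221; and &#8220;No&#8221; probabilities.</p>
<pre><code>import numpy as np

def predict_binomial_logit(X, B):
    # B has m+1 entries: the feature coefficients followed by the intercept
    n = X.shape[0]
    X1 = np.hstack([X, np.ones((n, 1))])          # conceptual extra column of ones
    eta = X1 @ B                                  # linear predictor
    p_yes = 1.0 / (1.0 + np.exp(-eta))            # inverse logit
    return np.column_stack([p_yes, 1.0 - p_yes])  # matrix M: [P(Yes), P(No)]

X = np.array([[0.5, 1.2], [-0.3, 0.8]])
B = np.array([1.0, -0.5, 0.2])                    # two betas plus the intercept
print(predict_binomial_logit(X, B))
</code></pre>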
<p>Going from predicted probabilities to predicted labels, in the
single-label categorical case, requires extra information such as the
cost of false positive versus false negative errors. For example, if
there are 5 categories and we <em>accurately</em> predicted their probabilities
as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the highest-probability
label would be wrong 70% of the time, whereas picking the
lowest-probability label might be right if, say, it represents a
diagnosis of cancer or another rare and serious outcome. Hence, we keep
this step outside the scope of <code>GLM-predict.dml</code> for now.</p>
<h3 id="usage-4">Usage</h3>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM-predict.dml
-nvargs X=&lt;file&gt;
Y=[file]
B=&lt;file&gt;
M=[file]
O=[file]
dfam=[int]
vpow=[double]
link=[int]
lpow=[double]
disp=[double]
fmt=[format]
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM-predict.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs X=&lt;file&gt;
Y=[file]
B=&lt;file&gt;
M=[file]
O=[file]
dfam=[int]
vpow=[double]
link=[int]
lpow=[double]
disp=[double]
fmt=[format]
</code></pre>
</div>
</div>
<h3 id="arguments-for-spark-and-hadoop-invocation-4">Arguments for Spark and Hadoop invocation</h3>
<p><strong>X</strong>: Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature
vectors, each row constitutes one feature vector (one record)</p>
<p><strong>Y</strong>: (default: <code>" "</code>) Location to read the response matrix $Y$ needed for scoring
(but optional for prediction), with the following dimensions:</p>
<ul>
<li>$n {\times} 1$: acceptable for all distributions
(<code>dfam=1</code> or <code>2</code> or <code>3</code>)</li>
<li>$n {\times} 2$: for binomial (<code>dfam=2</code>) if given by
(#pos, #neg) counts</li>
<li>$n {\times} (k\,{+}\,1)$: for multinomial (<code>dfam=3</code>) if
given by category counts</li>
</ul>
<p><strong>M</strong>: (default: <code>" "</code>) Location to write, if requested, the matrix of predicted
response means (for <code>dfam=1</code>) or probabilities (for
<code>dfam=2</code> or <code>3</code>):</p>
<ul>
<li>$n {\times} 1$: for power-type distributions (<code>dfam=1</code>)</li>
<li>$n {\times} 2$: for binomial distribution (<code>dfam=2</code>),
col# 2 is the &#8220;No&#8221; probability</li>
<li>$n {\times} (k\,{+}\,1)$: for multinomial logit (<code>dfam=3</code>),
col# $k\,{+}\,1$ is for the baseline</li>
</ul>
<p><strong>B</strong>: Location to read matrix $B$ of the betas, i.e. estimated GLM regression
parameters, with the intercept at row# $m\,{+}\,1$ if available:</p>
<ul>
<li>$\dim(B) \,=\, m {\times} k$: do not add intercept</li>
<li>$\dim(B) \,=\, (m\,{+}\,1) {\times} k$: add intercept as given by the
last $B$-row</li>
<li>if $k &gt; 1$, use only $B[, 1]$ unless it is Multinomial Logit
(<code>dfam=3</code>)</li>
</ul>
<p><strong>O</strong>: (default: <code>" "</code>) Location to store the CSV-file with goodness-of-fit
statistics defined in <a href="algorithms-regression.html#table13"><strong>Table 13</strong></a>,
the default is to
print them to the standard output</p>
<p><strong>dfam</strong>: (default: <code>1</code>) GLM distribution family code to specify the type of
distribution
$Prob[y\,|\,\mu]$
that we assume:</p>
<ul>
<li>1 = power distributions with
$Var(y) = \mu^{\alpha}$, see
<a href="algorithms-regression.html#table11"><strong>Table 11</strong></a></li>
<li>2 = binomial</li>
<li>3 = multinomial logit</li>
</ul>
<p><strong>vpow</strong>: (default: <code>0.0</code>) Power for variance defined as (mean)$^{\textrm{power}}$
(ignored if <code>dfam</code>$\,{\neq}\,1$): when <code>dfam=1</code>,
this provides the $q$ in
$Var(y) = a\mu^q$, the power
dependence of the variance of $y$ on its mean. In particular, use:</p>
<ul>
<li>0.0 = Gaussian</li>
<li>1.0 = Poisson</li>
<li>2.0 = Gamma</li>
<li>3.0 = inverse Gaussian</li>
</ul>
<p><strong>link</strong>: (default: <code>0</code>) Link function code to determine the link
function $\eta = g(\mu)$, ignored for multinomial logit
(<code>dfam=3</code>):</p>
<ul>
<li>0 = canonical link (depends on the distribution family),
see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a></li>
<li>1 = power functions</li>
<li>2 = logit</li>
<li>3 = probit</li>
<li>4 = cloglog</li>
<li>5 = cauchit</li>
</ul>
<p><strong>lpow</strong>: (default: <code>1.0</code>) Power for link function defined as
(mean)$^{\textrm{power}}$ (ignored if <code>link</code>$\,{\neq}\,1$):
when <code>link=1</code>, this provides the $s$ in $\eta = \mu^s$, the
power link function; <code>lpow=0.0</code> gives the log link
$\eta = \log\mu$. Common power links:</p>
<ul>
<li>-2.0 = $1/\mu^2$</li>
<li>-1.0 = reciprocal</li>
<li>0.0 = log</li>
<li>0.5 = sqrt</li>
<li>1.0 = identity</li>
</ul>
<p><strong>disp</strong>: (default: <code>1.0</code>) Dispersion value, when available; must be positive</p>
<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix M file output format, such as
<code>text</code>, <code>mm</code>, or <code>csv</code>; see read/write
functions in SystemML Language Reference for details.</p>
<h3 id="examples-4">Examples</h3>
<p>Note that in the examples below the value for the <code>disp</code> input
argument is set arbitrarily. The correct dispersion value should be
computed from the training data during model estimation, or omitted if
unknown (which sets it to <code>1.0</code>).</p>
<p><strong>Linear regression example</strong>:</p>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM-predict.dml
-nvargs dfam=1
vpow=0.0
link=1
lpow=1.0
disp=5.67
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Means.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM-predict.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs dfam=1
vpow=0.0
link=1
lpow=1.0
disp=5.67
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Means.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
</div>
<p><strong>Linear regression example, prediction only (no Y given)</strong>:</p>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM-predict.dml
-nvargs dfam=1
vpow=0.0
link=1
lpow=1.0
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Means.mtx
fmt=csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM-predict.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs dfam=1
vpow=0.0
link=1
lpow=1.0
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Means.mtx
fmt=csv
</code></pre>
</div>
</div>
<p><strong>Binomial logistic regression example</strong>:</p>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM-predict.dml
-nvargs dfam=2
link=2
disp=3.0004464
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Probabilities.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM-predict.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs dfam=2
link=2
disp=3.0004464
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Probabilities.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
</div>
<p><strong>Binomial probit regression example</strong>:</p>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM-predict.dml
-nvargs dfam=2
link=3
disp=3.0004464
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Probabilities.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM-predict.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs dfam=2
link=3
disp=3.0004464
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Probabilities.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
</div>
<p><strong>Multinomial logistic regression example</strong>:</p>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM-predict.dml
-nvargs dfam=3
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Probabilities.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM-predict.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs dfam=3
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Probabilities.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
</div>
<p><strong>Poisson regression with the log link example</strong>:</p>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM-predict.dml
-nvargs dfam=1
vpow=1.0
link=1
lpow=0.0
disp=3.45
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Means.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM-predict.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs dfam=1
vpow=1.0
link=1
lpow=0.0
disp=3.45
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Means.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
</div>
<p><strong>Gamma regression with the inverse (reciprocal) link example</strong>:</p>
<div class="codetabs">
<div data-lang="Hadoop">
<pre><code>hadoop jar SystemML.jar -f GLM-predict.dml
-nvargs dfam=1
vpow=2.0
link=1
lpow=-1.0
disp=1.99118
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Means.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
<div data-lang="Spark">
<pre><code>$SPARK_HOME/bin/spark-submit --master yarn
--deploy-mode cluster
--conf spark.driver.maxResultSize=0
SystemML.jar
-f GLM-predict.dml
-config SystemML-config.xml
-exec hybrid_spark
-nvargs dfam=1
vpow=2.0
link=1
lpow=-1.0
disp=1.99118
X=/user/ml/X.mtx
B=/user/ml/B.mtx
M=/user/ml/Means.mtx
fmt=csv
Y=/user/ml/Y.mtx
O=/user/ml/stats.csv
</code></pre>
</div>
</div>
<hr />
<p><a name="table13"></a>
<strong>Table 13</strong>: The goodness-of-fit statistics are provided in CSV format, one statistic per line, with four
columns: (Name, CID, Disp?, Value). The columns are:
&#8220;Name&#8221; is the string identifier for the statistic;
&#8220;CID&#8221; is an optional integer value that specifies the $Y$-column index for per-column statistics
(note that a bi/multinomial one-column Y-input is converted into multi-column);
&#8220;Disp?&#8221; is an optional Boolean value ($TRUE$ or $FALSE$) that tells us
whether or not scaling by the input dispersion parameter <code>disp</code> has been applied to this
statistic;
&#8220;Value&#8221; is the value of the statistic.</p>
<table>
<thead>
<tr>
<th>Name</th>
<th style="text-align: center">CID</th>
<th style="text-align: center">Disp?</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOGLHOOD_Z</td>
<td style="text-align: center">&#160;</td>
<td style="text-align: center">+</td>
<td>Log-likelihood $Z$-score (in st. dev.&#8217;s from the mean)</td>
</tr>
<tr>
<td>LOGLHOOD_Z_PVAL</td>
<td style="text-align: center">&#160;</td>
<td style="text-align: center">+</td>
<td>Log-likelihood $Z$-score p-value, two-sided</td>
</tr>
<tr>
<td>PEARSON_X2</td>
<td style="text-align: center">&#160;</td>
<td style="text-align: center">+</td>
<td>Pearson residual $X^2$-statistic</td>
</tr>
<tr>
<td>PEARSON_X2_BY_DF</td>
<td style="text-align: center">&#160;</td>
<td style="text-align: center">+</td>
<td>Pearson $X^2$ divided by degrees of freedom</td>
</tr>
<tr>
<td>PEARSON_X2_PVAL</td>
<td style="text-align: center">&#160;</td>
<td style="text-align: center">+</td>
<td>Pearson $X^2$ p-value</td>
</tr>
<tr>
<td>DEVIANCE_G2</td>
<td style="text-align: center">&#160;</td>
<td style="text-align: center">+</td>
<td>Deviance from the saturated model $G^2$-statistic</td>
</tr>
<tr>
<td>DEVIANCE_G2_BY_DF</td>
<td style="text-align: center">&#160;</td>
<td style="text-align: center">+</td>
<td>Deviance $G^2$ divided by degrees of freedom</td>
</tr>
<tr>
<td>DEVIANCE_G2_PVAL</td>
<td style="text-align: center">&#160;</td>
<td style="text-align: center">+</td>
<td>Deviance $G^2$ p-value</td>
</tr>
<tr>
<td>AVG_TOT_Y</td>
<td style="text-align: center">+</td>
<td style="text-align: center">&#160;</td>
<td>$Y$-column average for an individual response value</td>
</tr>
<tr>
<td>STDEV_TOT_Y</td>
<td style="text-align: center">+</td>
<td style="text-align: center">&#160;</td>
<td>$Y$-column st. dev. for an individual response value</td>
</tr>
<tr>
<td>AVG_RES_Y</td>
<td style="text-align: center">+</td>
<td style="text-align: center">&#160;</td>
<td>$Y$-column residual average of $Y - pred. mean(Y|X)$</td>
</tr>
<tr>
<td>STDEV_RES_Y</td>
<td style="text-align: center">+</td>
<td style="text-align: center">&#160;</td>
<td>$Y$-column residual st. dev. of $Y - pred. mean(Y|X)$</td>
</tr>
<tr>
<td>PRED_STDEV_RES</td>
<td style="text-align: center">+</td>
<td style="text-align: center">+</td>
<td>Model-predicted $Y$-column residual st. deviation</td>
</tr>
<tr>
<td>R2</td>
<td style="text-align: center">+</td>
<td style="text-align: center">&#160;</td>
<td>$R^2$ of $Y$-column residual with bias included</td>
</tr>
<tr>
<td>ADJUSTED_R2</td>
<td style="text-align: center">+</td>
<td style="text-align: center">&#160;</td>
<td>Adjusted $R^2$ of $Y$-column residual w. bias included</td>
</tr>
<tr>
<td>R2_NOBIAS</td>
<td style="text-align: center">+</td>
<td style="text-align: center">&#160;</td>
<td>$R^2$ of $Y$-column residual, bias subtracted</td>
</tr>
<tr>
<td>ADJUSTED_R2_NOBIAS</td>
<td style="text-align: center">+</td>
<td style="text-align: center">&#160;</td>
<td>Adjusted $R^2$ of $Y$-column residual, bias subtracted</td>
</tr>
</tbody>
</table>
<hr />
<h3 id="details-4">Details</h3>
<p>The output matrix $M$ of predicted means (or probabilities) is computed
by matrix-multiplying $X$ with the first column of $B$ or with the
whole $B$ in the multinomial case, adding the intercept if available
(conceptually, appending an extra column of ones to $X$); then applying
the inverse of the model’s link function. The difference between &#8220;means&#8221;
and &#8220;probabilities&#8221; in the categorical case becomes significant when
there are ${\geq}\,2$ observations per record (with the multi-label
records) or when the labels such as $-1$ and $1$ are viewed and averaged
as numerical response values (with the single-label records). To avoid
any ambiguity or information loss, we separately return the predicted probability
of each category label for each record.</p>
<p>When the &#8220;actual&#8221; response values $Y$ are available, the summary
statistics are computed and written out as described in
<a href="algorithms-regression.html#table13"><strong>Table 13</strong></a>. Below we discuss each of these statistics
in detail. Note that in the categorical case (binomial and multinomial)
$Y$ is internally represented as the matrix of observation counts for
each label in each record, rather than just the label ID for each
record. The input $Y$ may already be a matrix of counts, in which case
it is used as-is. But if $Y$ is given as a vector of response labels,
each response label is converted into an indicator vector
$(0,\ldots,0,1_l,0,\ldots,0)$ where $l$ is the label ID for this record.
All negative (e.g. $-1$) or zero label IDs are converted to the
maximum label ID plus one. The largest label ID is viewed as the
&#8220;baseline&#8221; as explained in the section on Multinomial Logistic
Regression. We assume that there are $k\geq 1$ non-baseline categories
and one (last) baseline category.</p>
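<p>The conversion from a single column of label IDs to the matrix of per-label
counts can be sketched as follows (a NumPy illustration only; the script&#8217;s internal
handling may differ in minor details):</p>
<pre><code>import numpy as np

def labels_to_counts(y):
    # y: length-n vector of integer label IDs; non-positive IDs (e.g. -1 or 0)
    # are mapped to the maximum label ID plus one, which serves as the baseline.
    y = np.asarray(y, dtype=int)
    y = np.where(y &lt;= 0, y.max() + 1, y)
    k_plus_1 = y.max()                    # number of categories incl. the baseline
    counts = np.zeros((y.shape[0], k_plus_1))
    counts[np.arange(y.shape[0]), y - 1] = 1.0
    return counts

print(labels_to_counts([1, 2, -1, 2, 1]))
</code></pre>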
<p>We also estimate residual variances for each response value, although we
do not output them, but use them only inside the summary statistics,
scaled and unscaled by the input dispersion parameter <code>disp</code>,
as described below.</p>
<p><code>LOGLHOOD_Z</code> and <code>LOGLHOOD_Z_PVAL</code> statistics measure how far the
log-likelihood of $Y$ deviates from its expected value according to the
model. The script implements them only for the binomial and the
multinomial distributions, returning <code>NaN</code> for all other distributions.
Pearson’s $X^2$ and deviance $G^2$ often perform poorly with bi- and
multinomial distributions due to low cell counts, hence we need this
extra goodness-of-fit measure. To compute these statistics, we use:</p>
<ul>
<li>the $n\times (k\,{+}\,1)$-matrix $Y$ of multi-label response counts, in
which $y_{i,j}$ is the number of times label $j$ was observed in
record $i$</li>
<li>the model-estimated probability matrix $P$ of the same dimensions that
satisfies <script type="math/tex">\sum_{j=1}^{k+1} p_{i,j} = 1</script> for all $i=1,\ldots,n$ and
where $p_{i,j}$ is the model probability of observing label $j$ in
record $i$</li>
<li>the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of
observations in record $i$ (all $N_i = 1$ if each record has only one
response label)</li>
</ul>
<p>We start by computing the multinomial log-likelihood of $Y$ given $P$
and $N$, as well as the expected log-likelihood given a random $Y$ and
the variance of this log-likelihood if $Y$ indeed follows the proposed
distribution:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\ell (Y) \,\,&=\,\, \log Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1} \,y_{i,j}\log p_{i,j} \\
E_Y \ell (Y) \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j}
\,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\
Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2
- \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right)
\end{aligned} %]]></script>
<p>Then we compute the $Z$-score as the difference between the actual and
the expected log-likelihood $\ell(Y)$ divided by its expected standard
deviation, and its two-sided p-value in the Normal distribution
assumption ($\ell(Y)$ should approach normality due to the Central Limit
Theorem):</p>
<script type="math/tex; mode=display">Z \,=\, \frac {\ell(Y) - E_Y \ell(Y)}{\sqrt{Var_Y \ell(Y)}};\quad
\mathop{\textrm{p-value}}(Z) \,=\, Prob\Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big]</script>
<p>A low p-value would indicate &#8220;underfitting&#8221; if $Z\ll 0$ or &#8220;overfitting&#8221;
if $Z\gg 0$. Here &#8220;overfitting&#8221; means that higher-probability labels
occur more often than their probabilities suggest.</p>
<p>We also apply the dispersion input (<code>disp</code>) to compute the
&#8220;scaled&#8221; version of the $Z$-score and its p-value. Since $\ell(Y)$ is a
linear function of $Y$, multiplying the GLM-predicted variance of $Y$ by
<code>disp</code> results in multiplying
$Var_Y \ell(Y)$ by the same
<code>disp</code>. This, in turn, translates into dividing the $Z$-score
by the square root of the dispersion:</p>
<script type="math/tex; mode=display">Z_{\texttt{disp}} \,=\, \big(\ell(Y) \,-\, E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot Var_Y \ell(Y)}
\,=\, Z / \sqrt{\texttt{disp}}</script>
<p>Finally, we recalculate the p-value
with this new $Z$-score.</p>
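<p>Putting the three formulas together, a compact NumPy/SciPy sketch of the scaled
$Z$-score and its p-value (an illustration, not the DML code) is:</p>
<pre><code>import numpy as np
from scipy.stats import norm

def loglhood_z(Y, P, N, disp=1.0):
    # Y: (n, k+1) observed label counts, P: (n, k+1) predicted probabilities,
    # N: (n,) total observation counts per record, disp: dispersion parameter
    logP = np.log(P)
    ell = np.sum(Y * logP)                                 # actual log-likelihood
    e_ell = np.sum(N * np.sum(P * logP, axis=1))           # expected log-likelihood
    var_ell = np.sum(N * (np.sum(P * logP**2, axis=1)
                          - np.sum(P * logP, axis=1)**2))  # its variance
    z = (ell - e_ell) / np.sqrt(disp * var_ell)
    pval = 2.0 * norm.sf(abs(z))                           # two-sided p-value
    return z, pval
</code></pre>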
<p><code>PEARSON_X2</code>, <code>PEARSON_X2_BY_DF</code>, and <code>PEARSON_X2_PVAL</code>:
Pearson’s residual $X^2$-statistic is a commonly used goodness-of-fit
measure for linear models <a href="algorithms-bibliography.html">[McCullagh1989]</a>.
The idea is to measure how
well the model-predicted means and variances match the actual behavior
of response values. For each record $i$, we estimate the mean $\mu_i$
and the variance $v_i$ (or <code>disp</code> $\cdot v_i$) and use them to
normalize the residual: $r_i = (y_i - \mu_i) / \sqrt{v_i}$. These
normalized residuals are then squared, aggregated by summation, and
tested against an appropriate $\chi^2$ distribution. The computation
of $X^2$ is slightly different for categorical data (bi- and
multinomial) than it is for numerical data, since $y_i$ has multiple
correlated dimensions <a href="algorithms-bibliography.html">[McCullagh1989]</a>:</p>
<script type="math/tex; mode=display">X^2\,\textrm{(numer.)} \,=\, \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad
X^2\,\textrm{(categ.)} \,=\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i
\hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}</script>
<p>The number of
degrees of freedom #d.f. for the $\chi^2$ distribution is $n - m$ for
numerical data and $(n - m)k$ for categorical data, where
$k = \mathop{\texttt{ncol}}(Y) - 1$. Given the dispersion parameter
<code>disp</code> the $X^2$ statistic is scaled by division: <script type="math/tex">X^2_{\texttt{disp}} = X^2 / \texttt{disp}</script>. If the
dispersion is accurate, $X^2 / \texttt{disp}$ should be close to #d.f.
In fact, $X^2 / \textrm{#d.f.}$ over the <em>training</em> data is the
dispersion estimator used in our <code>GLM.dml</code> script,
see (7). Here we provide $X^2 / \textrm{#d.f.}$ and
$X^2_{\texttt{disp}} / \textrm{#d.f.}$ as
<code>PEARSON_X2_BY_DF</code> to enable dispersion comparison between
the training data and the test data.</p>
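<p>For categorical data, the Pearson statistic, its per-d.f. version, and its p-value
can be sketched as follows (a NumPy/SciPy illustration, not the DML code):</p>
<pre><code>import numpy as np
from scipy.stats import chi2

def pearson_x2_categorical(Y, P, N, m, disp=1.0):
    # Y: (n, k+1) observed counts, P: (n, k+1) model probabilities,
    # N: (n,) totals per record, m: number of regression features
    n, kp1 = Y.shape
    expected = N[:, None] * P
    x2 = np.sum((Y - expected)**2 / expected)
    df = (n - m) * (kp1 - 1)             # (n - m) * k degrees of freedom
    return x2, x2 / df, (x2 / disp) / df, chi2.sf(x2, df)
</code></pre>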
<p>NOTE: For categorical data, both Pearson’s $X^2$ and the deviance $G^2$
are unreliable (i.e. do not approach the $\chi^2$ distribution) unless
the predicted means of multi-label counts
<script type="math/tex">\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}</script> are fairly large: all
${\geq}\,1$ and 80% are at least $5$ <a href="algorithms-bibliography.html">[Cochran1954]</a>. They should not
be used for &#8220;one label per record&#8221; categoricals.</p>
<p><code>DEVIANCE_G2</code>, <code>DEVIANCE_G2_BY_DF</code>, and
<code>DEVIANCE_G2_PVAL</code>: Deviance $G^2$ is the log of the
likelihood ratio between the &#8220;saturated&#8221; model and the linear model
being tested for the given dataset, multiplied by two:</p>
<script type="math/tex; mode=display">\begin{equation}
G^2 \,=\, 2 \,\log \frac{Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}{Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]}
\end{equation}</script>
<p>The &#8220;saturated&#8221; model sets the mean
$\mu_i^{\mathrm{sat}}$ to equal $y_i$ for every record (for categorical
data, <script type="math/tex">p_{i,j}^{sat} = y_{i,j} / N_i</script>), which represents the
&#8220;perfect fit.&#8221; For records with <script type="math/tex">y_{i,j} \in \{0, N_i\}</script> or otherwise at
a boundary, by continuity we set $0 \log 0 = 0$. The GLM likelihood
functions defined in (6) become simplified in
ratio (8) due to canceling out the term $c(y, a)$
since it is the same in both models.</p>
<p>The log of a likelihood ratio between two nested models, times two, is
known to approach a $\chi^2$ distribution as $n\to\infty$ if both models
have fixed parameter spaces. But this is not the case for the
&#8220;saturated&#8221; model: it adds more parameters with each record. In
practice, however, $\chi^2$ distributions are used to compute the
p-value of $G^2$ <a href="algorithms-bibliography.html">[McCullagh1989]</a>. The number of degrees of
freedom #d.f. and the treatment of dispersion are the same as for
Pearson’s $X^2$, see above.</p>
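<p>For categorical data, the ratio in (8) simplifies to the familiar form
<script type="math/tex">G^2 = 2\sum_{i=1}^n\sum_{j=1}^{k+1} y_{i,j}\log\big(y_{i,j}/(N_i\hspace{0.5pt} p_{i,j})\big)</script>
with the $0 \log 0 = 0$ convention; a NumPy/SciPy sketch (illustration only):</p>
<pre><code>import numpy as np
from scipy.stats import chi2

def deviance_g2_categorical(Y, P, N, m, disp=1.0):
    # Y: (n, k+1) observed counts, P: (n, k+1) model probabilities, N: (n,) totals
    expected = N[:, None] * P
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(Y &gt; 0, Y * np.log(Y / expected), 0.0)   # 0*log(0) := 0
    g2 = 2.0 * np.sum(terms)
    n, kp1 = Y.shape
    df = (n - m) * (kp1 - 1)
    return g2, (g2 / disp) / df, chi2.sf(g2, df)
</code></pre>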
<h3 id="column-wise-statistics">Column-Wise Statistics</h3>
<p>The rest of the statistics are computed separately for each column
of $Y$. As explained above, $Y$ has two or more columns in bi- and
multinomial case, either at input or after conversion. Moreover, each
<script type="math/tex">y_{i,j}</script> in record $i$ with $N_i \geq 2$ is counted as $N_i$ separate
observations <script type="math/tex">y_{i,j,l}</script> of 0 or 1 (where $l=1,\ldots,N_i$) with
<script type="math/tex">y_{i,j}</script> ones and <script type="math/tex">N_i-y_{i,j}</script> zeros. For power distributions,
including linear regression, $Y$ has only one column and all $N_i = 1$,
so the statistics are computed for all $Y$ with each record counted
once. Below we denote <script type="math/tex">N = \sum_{i=1}^n N_i \,\geq n</script>. Here is the total
average and the residual average (residual bias) of $y_{i,j,l}$ for each
$Y$-column:</p>
<script type="math/tex; mode=display">\texttt{AVG_TOT_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n y_{i,j}; \quad
\texttt{AVG_RES_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})</script>
<p>Dividing by $N$ (rather than $n$) gives the averages for <script type="math/tex">y_{i,j,l}</script>
(rather than <script type="math/tex">y_{i,j}</script>). The total variance, and the standard deviation,
for individual observations <script type="math/tex">y_{i,j,l}</script> is estimated from the total
variance for response values <script type="math/tex">y_{i,j}</script> using independence assumption:
<script type="math/tex">Var \,y_{i,j} = Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} Var y_{i,j,l}</script>.
This allows us to estimate the sum of squares for $y_{i,j,l}$ via the
sum of squares for <script type="math/tex">y_{i,j}</script>:</p>
<script type="math/tex; mode=display">\texttt{STDEV_TOT_Y}_j \,=\,
\Bigg[\frac{1}{N-1} \sum_{i=1}^n \Big( y_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n y_{i'\!,j}\Big)^2\Bigg]^{1/2}</script>
<p>Analogously, we estimate the standard deviation of the residual
<script type="math/tex">y_{i,j,l} - \mu_{i,j,l}</script>:</p>
<script type="math/tex; mode=display">\texttt{STDEV_RES_Y}_j \,=\,
\Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n \Big( y_{i,j} - \mu_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}</script>
<p>Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if
it does not. The estimated standard deviations can be compared to the
model-predicted residual standard deviation computed from the predicted
means by the GLM variance formula and scaled by the dispersion:</p>
<script type="math/tex; mode=display">\texttt{PRED_STDEV_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}</script>
<p>We also compute the $R^2$ statistics for each column of $Y$, see
<a href="algorithms-regression.html#table14"><strong>Table 14</strong></a> and <a href="algorithms-regression.html#table15"><strong>Table 15</strong></a> for details. We compute two versions
of $R^2$: in one version the residual sum-of-squares (RSS) includes any
bias in the residual that might be present (due to the lack of, or
inaccuracy in, the intercept); in the other version of RSS the bias is
subtracted by &#8220;centering&#8221; the residual. In both cases we subtract the
bias in the total sum-of-squares (in the denominator), and $m'$ equals
$m$ with the intercept or $m+1$ without the intercept.</p>
<hr />
<p><a name="table14"></a>
<strong>Table 14</strong>: $R^2$ where the residual sum-of-squares includes the bias contribution.</p>
<table>
<thead>
<tr>
<th>Statistic</th>
<th>Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\texttt{R2}_j$</td>
<td><script type="math/tex">\displaystyle 1 - \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}}</script></td>
</tr>
<tr>
<td>$\texttt{ADJUSTED_R2}_j$</td>
<td><script type="math/tex">\displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}} \, \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}}</script></td>
</tr>
</tbody>
</table>
<hr />
<p><a name="table15"></a>
<strong>Table 15</strong>: $R^2$ where the residual sum-of-squares is centered so that the bias is subtracted.</p>
<table>
<thead>
<tr>
<th>Statistic</th>
<th>Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\texttt{R2_NOBIAS}_j$</td>
<td><script type="math/tex">\displaystyle 1 - \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}}</script></td>
</tr>
<tr>
<td>$\texttt{ADJUSTED_R2_NOBIAS}_j$</td>
<td><script type="math/tex">\displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}}</script></td>
</tr>
</tbody>
</table>
<hr />
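<p>A compact NumPy sketch of the column-wise averages, standard deviations, and the
four $R^2$ variants defined above (an illustration only; the degrees-of-freedom
correction $m$ or $m'$ from Tables 14 and 15 is passed in by the caller):</p>
<pre><code>import numpy as np

def column_stats(Y, Mu, N, m_adj):
    # Y, Mu: (n, c) observed responses and predicted means, one column per Y-column;
    # N: (n,) observation counts per record; m_adj: d.o.f. correction (m or m').
    N_tot = N.sum()
    avg_tot = Y.sum(axis=0) / N_tot                            # AVG_TOT_Y
    res = Y - Mu
    avg_res = res.sum(axis=0) / N_tot                          # AVG_RES_Y
    ss_tot = np.sum((Y - np.outer(N, avg_tot))**2, axis=0)     # centered total SS
    ss_res = np.sum(res**2, axis=0)                            # RSS, bias included
    ss_res_nobias = np.sum((res - np.outer(N, avg_res))**2, axis=0)
    return {
        "AVG_TOT_Y": avg_tot,
        "AVG_RES_Y": avg_res,
        "STDEV_TOT_Y": np.sqrt(ss_tot / (N_tot - 1)),
        "STDEV_RES_Y": np.sqrt(ss_res_nobias / (N_tot - m_adj)),
        "R2": 1.0 - ss_res / ss_tot,
        "ADJUSTED_R2": 1.0 - (N_tot - 1) / (N_tot - m_adj) * ss_res / ss_tot,
        "R2_NOBIAS": 1.0 - ss_res_nobias / ss_tot,
        "ADJUSTED_R2_NOBIAS":
            1.0 - (N_tot - 1) / (N_tot - m_adj) * ss_res_nobias / ss_tot,
    }
</code></pre>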
<h3 id="returns-4">Returns</h3>
<p>The matrix of predicted means (if the response is numerical) or
probabilities (if the response is categorical), see Description
subsection above for more information. Given <code>Y</code>, we return
some statistics in CSV format as described in
<a href="algorithms-regression.html#table13"><strong>Table 13</strong></a> and in the above text.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Smaller likelihood difference between two models suggests less
statistical evidence to pick one model over the other.&#160;<a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>$Prob[y\mid \mu_i]$
is given by a density function if $y$ is continuous.&#160;<a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:3">
<p>We use $k+1$ because there are $k$ non-baseline categories and one
baseline category, with regression parameters $B$ having
$k$ columns.&#160;<a href="#fnref:3" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
</div> <!-- /container -->
<script src="js/vendor/jquery-1.12.0.min.js"></script>
<script src="js/vendor/bootstrap.min.js"></script>
<script src="js/vendor/anchor.min.js"></script>
<script src="js/main.js"></script>
<!-- Analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-71553733-1', 'auto');
ga('send', 'pageview');
</script>
<!-- MathJax Section -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<script>
// Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS.
// We could use "//cdn.mathjax...", but that won't support "file://".
(function(d, script) {
script = d.createElement('script');
script.type = 'text/javascript';
script.async = true;
script.onload = function(){
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
}
});
};
script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') +
'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
d.getElementsByTagName('head')[0].appendChild(script);
}(document));
</script>
</body>
</html>