| <!DOCTYPE html> |
| <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> |
| <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> |
| <!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> |
| <!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]--> |
| <head> |
| <title>SystemML Algorithms Reference - Regression - SystemML 1.2.0</title> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| |
| <meta name="viewport" content="width=device-width"> |
| <link rel="stylesheet" href="css/bootstrap.min.css"> |
| <link rel="stylesheet" href="css/main.css"> |
| <link rel="stylesheet" href="css/pygments-default.css"> |
| <link rel="shortcut icon" href="img/favicon.png"> |
| </head> |
| <body> |
| <!--[if lt IE 7]> |
| <p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p> |
| <![endif]--> |
| |
| <header class="navbar navbar-default navbar-fixed-top" id="topbar"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <div class="navbar-brand brand projectlogo"> |
| <a href="http://systemml.apache.org/"><img class="logo" src="img/systemml-logo.png" alt="Apache SystemML" title="Apache SystemML"/></a> |
| </div> |
| <div class="navbar-brand brand projecttitle"> |
| <a href="http://systemml.apache.org/">Apache SystemML<sup id="trademark">™</sup></a><br/> |
| <span class="version">1.2.0</span> |
| </div> |
| <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| </div> |
| <nav class="navbar-collapse collapse"> |
| <ul class="nav navbar-nav navbar-right"> |
| <li><a href="index.html">Overview</a></li> |
| <li><a href="https://github.com/apache/systemml">GitHub</a></li> |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Documentation<b class="caret"></b></a> |
| <ul class="dropdown-menu" role="menu"> |
| <li><b>Running SystemML:</b></li> |
| <li><a href="https://github.com/apache/systemml">SystemML GitHub README</a></li> |
| <li><a href="spark-mlcontext-programming-guide.html">Spark MLContext</a></li> |
<li><a href="spark-batch-mode.html">Spark Batch Mode</a></li>
<li><a href="hadoop-batch-mode.html">Hadoop Batch Mode</a></li>
<li><a href="standalone-guide.html">Standalone Guide</a></li>
<li><a href="jmlc.html">Java Machine Learning Connector (JMLC)</a></li>
| <li class="divider"></li> |
| <li><b>Language Guides:</b></li> |
| <li><a href="dml-language-reference.html">DML Language Reference</a></li> |
| <li><a href="beginners-guide-to-dml-and-pydml.html">Beginner's Guide to DML and PyDML</a></li> |
| <li><a href="beginners-guide-python.html">Beginner's Guide for Python Users</a></li> |
| <li><a href="python-reference.html">Reference Guide for Python Users</a></li> |
| <li class="divider"></li> |
| <li><b>ML Algorithms:</b></li> |
| <li><a href="algorithms-reference.html">Algorithms Reference</a></li> |
| <li class="divider"></li> |
| <li><b>Tools:</b></li> |
| <li><a href="debugger-guide.html">Debugger Guide</a></li> |
| <li><a href="developer-tools-systemml.html">IDE Guide</a></li> |
| <li class="divider"></li> |
| <li><b>Other:</b></li> |
| <li><a href="contributing-to-systemml.html">Contributing to SystemML</a></li> |
| <li><a href="engine-dev-guide.html">Engine Developer Guide</a></li> |
| <li><a href="troubleshooting-guide.html">Troubleshooting Guide</a></li> |
| <li><a href="release-process.html">Release Process</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a> |
| <ul class="dropdown-menu" role="menu"> |
| <li><a href="./api/java/index.html">Java</a></li> |
| <li><a href="./api/python/index.html">Python</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Issues<b class="caret"></b></a> |
| <ul class="dropdown-menu" role="menu"> |
| <li><b>JIRA:</b></li> |
| <li><a href="https://issues.apache.org/jira/browse/SYSTEMML">SystemML JIRA</a></li> |
| |
| </ul> |
| </li> |
| </ul> |
| </nav> |
| </div> |
| </header> |
| |
| <div class="container" id="content"> |
| |
| <h1 class="title"><a href="algorithms-reference.html">SystemML Algorithms Reference</a></h1> |
| |
| |
| |
| <h1 id="regression">4. Regression</h1> |
| |
| <h2 id="linear-regression">4.1. Linear Regression</h2> |
| |
| <h3 id="description">Description</h3> |
| |
| <p>Linear Regression scripts are used to model the relationship between one |
| numerical response variable and one or more explanatory (feature) |
| variables. The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ |
| where $x_i$ is a numerical vector of feature variables and $y_i$ is a |
| numerical response value for each training data record. The feature |
| vectors are provided as a matrix $X$ of size $n\,{\times}\,m$, where $n$ |
| is the number of records and $m$ is the number of features. The observed |
| response values are provided as a 1-column matrix $Y$, with a numerical |
| value $y_i$ for each $x_i$ in the corresponding row of matrix $X$.</p> |
| |
| <p>In linear regression, we predict the distribution of the response $y_i$ |
| based on a fixed linear combination of the features in $x_i$. We assume |
| that there exist constant regression coefficients |
| $\beta_0, \beta_1, \ldots, \beta_m$ and a constant residual |
| variance $\sigma^2$ such that</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| y_i \sim Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\, |
| \mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} |
| \end{equation}</script> |
| |
| <p>Distribution |
| $y_i \sim Normal(\mu_i, \sigma^2)$ |
| models the “unexplained” residual noise and is assumed independent |
| across different records.</p> |
| |
| <p>The goal is to estimate the regression coefficients and the residual |
| variance. Once they are accurately estimated, we can make predictions |
| about $y_i$ given $x_i$ in new records. We can also use the $\beta_j$’s |
| to analyze the influence of individual features on the response value, |
| and assess the quality of this model by comparing residual variance in |
| the response, left after prediction, with its total variance.</p> |
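<p>As a quick numerical sanity check of model (1), one can simulate records and recover the coefficients by ordinary least squares. The sketch below uses plain NumPy with made-up coefficient values; it is separate from the SystemML scripts themselves.</p>

```python
import numpy as np

# Hypothetical ground-truth parameters for model (1)
rng = np.random.default_rng(0)
n, m = 5000, 3
beta_true = np.array([2.0, -1.0, 0.5])   # beta_1, ..., beta_m
beta0_true = 4.0                         # intercept beta_0
sigma = 0.1                              # residual standard deviation

# Simulate y_i ~ Normal(mu_i, sigma^2) with mu_i = beta_0 + x_i . beta
X = rng.standard_normal((n, m))
y = beta0_true + X @ beta_true + sigma * rng.standard_normal(n)

# Ordinary least squares on [X, 1] recovers the coefficients
X1 = np.hstack([X, np.ones((n, 1))])     # append a column of 1s for the intercept
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta_hat, beta0_hat = coef[:m], coef[m]
```

With $n = 5000$ records and small residual noise, the estimates land very close to the true coefficients.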
| |
<p>There are two scripts in our library, both performing the same estimation
but using different computational methods. Depending on the size and
sparsity of the feature matrix $X$, one or the other script may be more
efficient. The “direct solve” script <code>LinearRegDS.dml</code> is more
efficient when the number of features $m$ is relatively small
($m \sim 1000$ or less) and matrix $X$ is either tall or fairly dense
(has ${\gg}\:m^2$ nonzeros); otherwise, the “conjugate gradient” script
<code>LinearRegCG.dml</code> is more efficient. If $m > 50000$, use only
<code>LinearRegCG.dml</code>.</p>
| |
| <h3 id="usage">Usage</h3> |
| |
| <p><strong>Linear Regression - Direct Solve</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Python"> |
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> |
| <span class="c"># C = 1/reg (to disable regularization, use float("inf"))</span> |
| <span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">sqlCtx</span><span class="p">,</span> <span class="n">fit_intercept</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">),</span> <span class="n">solver</span><span class="o">=</span><span class="s">'direct-solve'</span><span class="p">)</span> |
| <span class="c"># X_train, y_train and X_test can be NumPy matrices or Pandas DataFrame or SciPy Sparse Matrix</span> |
<span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="c"># df_train is a DataFrame that contains two columns: "features" (of type Vector) and "label". df_test is a DataFrame that contains the column "features"</span>
<span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_train</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">df_test</span><span class="p">)</span></code></pre></div>
| |
| </div> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f LinearRegDS.dml |
-nvargs X=&lt;file&gt;
        Y=&lt;file&gt;
        B=&lt;file&gt;
| O=[file] |
| icpt=[int] |
| reg=[double] |
| fmt=[format] |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f LinearRegDS.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
-nvargs X=&lt;file&gt;
        Y=&lt;file&gt;
        B=&lt;file&gt;
| O=[file] |
| icpt=[int] |
| reg=[double] |
| fmt=[format] |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>Linear Regression - Conjugate Gradient</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Python"> |
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> |
| <span class="c"># C = 1/reg (to disable regularization, use float("inf"))</span> |
| <span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">sqlCtx</span><span class="p">,</span> <span class="n">fit_intercept</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">tol</span><span class="o">=</span><span class="mf">0.000001</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">),</span> <span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">)</span> |
| <span class="c"># X_train, y_train and X_test can be NumPy matrices or Pandas DataFrames or SciPy Sparse matrices</span> |
<span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="c"># df_train is a DataFrame that contains two columns: "features" (of type Vector) and "label". df_test is a DataFrame that contains the column "features"</span>
<span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_train</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">df_test</span><span class="p">)</span></code></pre></div>
| |
| </div> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f LinearRegCG.dml |
-nvargs X=&lt;file&gt;
        Y=&lt;file&gt;
        B=&lt;file&gt;
| O=[file] |
| Log=[file] |
| icpt=[int] |
| reg=[double] |
| tol=[double] |
| maxi=[int] |
| fmt=[format] |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f LinearRegCG.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
-nvargs X=&lt;file&gt;
        Y=&lt;file&gt;
        B=&lt;file&gt;
| O=[file] |
| Log=[file] |
| icpt=[int] |
| reg=[double] |
| tol=[double] |
| maxi=[int] |
| fmt=[format] |
| </code></pre> |
| </div> |
| </div> |
| |
| <h3 id="arguments-for-spark-and-hadoop-invocation">Arguments for Spark and Hadoop invocation</h3> |
| |
| <p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors, each row |
| constitutes one feature vector</p> |
| |
| <p><strong>Y</strong>: Location to read the 1-column matrix of response values</p> |
| |
| <p><strong>B</strong>: Location to store the estimated regression parameters (the $\beta_j$’s), |
| with the intercept parameter $\beta_0$ at position |
| B[$m\,{+}\,1$, 1] if available</p> |
| |
<p><strong>O</strong>: (default: <code>" "</code>) Location to store the CSV file of summary statistics defined
in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>; the default is to print them to
standard output</p>
| |
| <p><strong>Log</strong>: (default: <code>" "</code>, <code>LinearRegCG.dml</code> only) Location to store |
| iteration-specific variables for monitoring and debugging purposes, see |
| <a href="algorithms-regression.html#table8"><strong>Table 8</strong></a> |
| for details.</p> |
| |
| <p><strong>icpt</strong>: (default: <code>0</code>) Intercept presence and shifting/rescaling the features |
| in $X$:</p> |
| |
| <ul> |
| <li>0 = no intercept (hence no $\beta_0$), no shifting or |
| rescaling of the features</li> |
| <li>1 = add intercept, but do not shift/rescale the features |
| in $X$</li> |
| <li>2 = add intercept, shift/rescale the features in $X$ to |
| mean 0, variance 1</li> |
| </ul> |
| |
| <p><strong>reg</strong>: (default: <code>0.000001</code>) L2-regularization parameter $\lambda\geq 0$; set to nonzero for highly |
| dependent, sparse, or numerous ($m \gtrsim n/10$) features</p> |
| |
| <p><strong>tol</strong>: (default: <code>0.000001</code>, <code>LinearRegCG.dml</code> only) Tolerance $\varepsilon\geq 0$ used in the |
| convergence criterion: we terminate conjugate gradient iterations when |
| the $\beta$-residual reduces in L2-norm by this factor</p> |
| |
<p><strong>maxi</strong>: (default: <code>0</code>, <code>LinearRegCG.dml</code> only) Maximum number of conjugate
gradient iterations, or <code>0</code> for no maximum limit</p>
| |
| <p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, |
| <code>mm</code>, or <code>csv</code>; see read/write functions in |
| SystemML Language Reference for details.</p> |
| |
| <p>Please see <a href="https://apache.github.io/systemml/python-reference#mllearn-api">mllearn documentation</a> for |
| more details on the Python API.</p> |
| |
| <h3 id="examples">Examples</h3> |
| |
| <p><strong>Linear Regression - Direct Solve</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Python"> |
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> |
| <span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span> |
| <span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> |
| <span class="c"># Load the diabetes dataset</span> |
| <span class="n">diabetes</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_diabetes</span><span class="p">()</span> |
| <span class="c"># Use only one feature</span> |
| <span class="n">diabetes_X</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> |
| <span class="c"># Split the data into training/testing sets</span> |
| <span class="n">diabetes_X_train</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> |
| <span class="n">diabetes_X_test</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> |
| <span class="c"># Split the targets into training/testing sets</span> |
| <span class="n">diabetes_y_train</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> |
| <span class="n">diabetes_y_test</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> |
| <span class="c"># Create linear regression object</span> |
| <span class="n">regr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">'direct-solve'</span><span class="p">)</span> |
| <span class="c"># Train the model using the training sets</span> |
| <span class="n">regr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">diabetes_X_train</span><span class="p">,</span> <span class="n">diabetes_y_train</span><span class="p">)</span> |
| <span class="c"># The mean square error</span> |
| <span class="k">print</span><span class="p">(</span><span class="s">"Residual sum of squares: </span><span class="si">%.2</span><span class="s">f"</span> <span class="o">%</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">regr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">diabetes_X_test</span><span class="p">)</span> <span class="o">-</span> <span class="n">diabetes_y_test</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span></code></pre></div> |
| |
| </div> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f LinearRegDS.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| reg=1.0 |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f LinearRegDS.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| reg=1.0 |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>Linear Regression - Conjugate Gradient</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Python"> |
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> |
| <span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span> |
| <span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> |
| <span class="c"># Load the diabetes dataset</span> |
| <span class="n">diabetes</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_diabetes</span><span class="p">()</span> |
| <span class="c"># Use only one feature</span> |
| <span class="n">diabetes_X</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> |
| <span class="c"># Split the data into training/testing sets</span> |
| <span class="n">diabetes_X_train</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> |
| <span class="n">diabetes_X_test</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> |
| <span class="c"># Split the targets into training/testing sets</span> |
| <span class="n">diabetes_y_train</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> |
| <span class="n">diabetes_y_test</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> |
| <span class="c"># Create linear regression object</span> |
| <span class="n">regr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">)</span> |
| <span class="c"># Train the model using the training sets</span> |
| <span class="n">regr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">diabetes_X_train</span><span class="p">,</span> <span class="n">diabetes_y_train</span><span class="p">)</span> |
| <span class="c"># The mean square error</span> |
| <span class="k">print</span><span class="p">(</span><span class="s">"Residual sum of squares: </span><span class="si">%.2</span><span class="s">f"</span> <span class="o">%</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">regr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">diabetes_X_test</span><span class="p">)</span> <span class="o">-</span> <span class="n">diabetes_y_test</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span></code></pre></div> |
| |
| </div> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f LinearRegCG.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| reg=1.0 |
| tol=0.00000001 |
| maxi=100 |
| Log=/user/ml/log.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f LinearRegCG.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| reg=1.0 |
| tol=0.00000001 |
| maxi=100 |
| Log=/user/ml/log.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <hr /> |
| |
| <p><a name="table7"></a> |
<strong>Table 7</strong>: Besides $\beta$, the linear regression scripts compute a few summary statistics
listed below. The statistics are provided in CSV format, one comma-separated name-value
pair per line.</p>
| |
| <table> |
| <thead> |
| <tr> |
| <th>Name</th> |
| <th>Meaning</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>AVG_TOT_Y</td> |
| <td>Average of the response value $Y$</td> |
| </tr> |
| <tr> |
| <td>STDEV_TOT_Y</td> |
| <td>Standard Deviation of the response value $Y$</td> |
| </tr> |
| <tr> |
| <td>AVG_RES_Y</td> |
| <td>Average of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$, i.e. residual bias</td> |
| </tr> |
| <tr> |
| <td>STDEV_RES_Y</td> |
| <td>Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$</td> |
| </tr> |
| <tr> |
| <td>DISPERSION</td> |
| <td>GLM-style dispersion, i.e. residual sum of squares / #deg. fr.</td> |
| </tr> |
| <tr> |
| <td>R2</td> |
| <td>$R^2$ of residual with bias included vs. total average</td> |
| </tr> |
| <tr> |
| <td>ADJUSTED_R2</td> |
| <td>Adjusted $R^2$ of residual with bias included vs. total average</td> |
| </tr> |
| <tr> |
| <td>R2_NOBIAS</td> |
| <td>Plain $R^2$ of residual with bias subtracted vs. total average</td> |
| </tr> |
| <tr> |
| <td>ADJUSTED_R2_NOBIAS</td> |
| <td>Adjusted $R^2$ of residual with bias subtracted vs. total average</td> |
| </tr> |
| <tr> |
| <td>R2_VS_0</td> |
| <td>* $R^2$ of residual with bias included vs. zero constant</td> |
| </tr> |
| <tr> |
| <td>ADJUSTED_R2_VS_0</td> |
| <td>* Adjusted $R^2$ of residual with bias included vs. zero constant</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <p>* The last two statistics are only printed if there is no intercept (<code>icpt=0</code>)</p> |
| |
| <hr /> |
| |
| <p><a name="table8"></a> |
<strong>Table 8</strong>: The <code>Log</code> file for the <code>LinearRegCG.dml</code> script
contains the following iteration variables in CSV format, each line
containing a triple (Name, Iteration#, Value), with Iteration# being 0
for initial values.</p>
| |
| <table> |
| <thead> |
| <tr> |
| <th>Name</th> |
| <th>Meaning</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>CG_RESIDUAL_NORM</td> |
<td>L2-norm of the conjugate gradient residual $A\beta - [X,\,1]^T Y$, where $A = [X,\,1]^T [X,\,1] + \mathrm{diag}(\lambda,\ldots,\lambda,0)$, or a similar quantity</td>
| </tr> |
| <tr> |
| <td>CG_RESIDUAL_RATIO</td> |
| <td>Ratio of current L2-norm of conjug. grad. residual over the initial</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <h3 id="details">Details</h3> |
| |
| <p>To solve a linear regression problem over feature matrix $X$ and |
| response vector $Y$, we can find coefficients |
| $\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize the |
| joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed |
| statistical model (1). Since the joint likelihood of the |
| independent |
| $y_i \sim Normal(\mu_i, \sigma^2)$ |
| is proportional to the product of |
| $\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the |
| logarithm of this product, then multiply by $-2\sigma^2 < 0$ to obtain a |
| least squares problem:</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| \sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\, |
| \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 |
| \,\,\to\,\,\min |
| \end{equation}</script> |
| |
| <p>This may not be enough, however. The minimum may |
| sometimes be attained over infinitely many $\beta$-vectors, for example |
| if $X$ has an all-0 column, or has linearly dependent columns, or has |
fewer rows than columns. Even if (2) has a unique
| solution, other $\beta$-vectors may be just a little suboptimal<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, yet |
| give significantly different predictions for new feature vectors. This |
| results in <em>overfitting</em>: prediction error for the training data ($X$ |
| and $Y$) is much smaller than for the test data (new records).</p> |
| |
| <p>Overfitting and degeneracy in the data is commonly mitigated by adding a |
| regularization penalty term to the least squares function:</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 |
| \,+\,\, \lambda \sum_{j=1}^m \beta_j^2 |
| \,\,\to\,\,\min |
| \end{equation}</script> |
| |
| <p>The choice of $\lambda>0$, the regularization |
| constant, typically involves cross-validation where the dataset is |
| repeatedly split into a training part (to estimate the $\beta_j$’s) and |
| a test part (to evaluate prediction accuracy), with the goal of |
| maximizing the test accuracy. In our scripts, $\lambda$ is provided as |
| input parameter <code>reg</code>.</p> |
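<p>A minimal holdout version of this selection can be sketched in NumPy as follows; the candidate $\lambda$ grid, split size, and synthetic data are arbitrary illustrative choices, not part of the SystemML scripts.</p>

```python
import numpy as np

# Synthetic data: 400 records, 20 features, only two of them informative
rng = np.random.default_rng(3)
n, m = 400, 20
X = rng.standard_normal((n, m))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

def ridge_fit(X, y, lam):
    """Solve the regularized normal equations with an unregularized intercept."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    A = X1.T @ X1 + np.diag(np.r_[np.full(X.shape[1], lam), 0.0])
    return np.linalg.solve(A, X1.T @ y)

def holdout_mse(lam, split=300):
    """Train on the first `split` records, score on the remainder."""
    b = ridge_fit(X[:split], y[:split], lam)
    X1 = np.hstack([X[split:], np.ones((n - split, 1))])
    return np.mean((X1 @ b - y[split:]) ** 2)

# Pick the candidate lambda with the best holdout accuracy
candidates = [1e-6, 1e-3, 1.0, 10.0]
best_lam = min(candidates, key=holdout_mse)
```

Proper cross-validation would repeat this over several random splits and average the test errors before choosing $\lambda$.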
| |
| <p>The solution to the least squares problem (3), through |
| taking the derivative and setting it to 0, has the matrix linear |
| equation form</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\, |
| \textrm{where}\,\,\, |
| A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} diag(\hspace{0.5pt} |
| \underbrace{\lambda,\ldots, \lambda}_{\scriptstyle m}, 0) |
| \end{equation}</script> |
| |
| <p>where $[X,\,1]$ is $X$ with an extra column of 1s |
| appended on the right, and the diagonal matrix of $\lambda$’s has a zero |
| to keep the intercept $\beta_0$ unregularized. If the intercept is |
| disabled by setting $icpt=0$, the equation is simply $X^T X \beta = X^T Y$.</p> |
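<p>In NumPy, system (4) can be assembled and solved directly (a sketch with synthetic data; <code>LinearRegDS.dml</code> follows the same algebra but is not shown here):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 200, 5, 0.001
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)

# A = [X,1]^T [X,1] + diag(lam, ..., lam, 0): the final 0 leaves beta_0 unregularized
X1 = np.hstack([X, np.ones((n, 1))])
A = X1.T @ X1 + np.diag(np.r_[np.full(m, lam), 0.0])
b = X1.T @ y

beta = np.linalg.solve(A, b)   # beta[:m] are the coefficients, beta[m] the intercept
```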
| |
| <p>We implemented two scripts for solving equation (4): one |
| is a “direct solver” that computes $A$ and then solves |
| $A\beta = [X,\,1]^T Y$ by calling an external package, the other |
| performs linear conjugate gradient (CG) iterations without ever |
| materializing $A$. The CG algorithm closely follows Algorithm 5.2 in |
| Chapter 5 of <a href="algorithms-bibliography.html">[Nocedal2006]</a>. Each step in the CG algorithm |
| computes a matrix-vector multiplication $q = Ap$ by first computing |
| $[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$. Usually the number of |
| such multiplications, one per CG iteration, is much smaller than $m$. |
| The user can put a hard bound on it with input |
| parameter <code>maxi</code>, or use the default maximum of $m+1$ (or $m$ |
| if no intercept) by having <code>maxi=0</code>. The CG iterations |
| terminate when the L2-norm of vector $r = A\beta - [X,\,1]^T Y$ |
| decreases from its initial value (for $\beta=0$) by the tolerance factor |
| specified in input parameter <code>tol</code>.</p> |
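<p>The matrix-free multiplication at the heart of the CG script can be sketched in NumPy as follows; this is a simplified illustration of the same iteration, not the DML implementation itself:</p>

```python
import numpy as np

def cg_linreg(X, y, lam=1e-6, tol=1e-6, maxi=0):
    """Conjugate gradient for (X1^T X1 + diag(lam,...,lam,0)) beta = X1^T y,
    where X1 = [X, 1]; the matrix A is never materialized."""
    n, m = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])
    d = np.r_[np.full(m, lam), 0.0]        # intercept stays unregularized

    def A_times(p):                        # q = A p via [X,1] p, then [X,1]^T (...)
        return X1.T @ (X1 @ p) + d * p

    beta = np.zeros(m + 1)
    r = X1.T @ y - A_times(beta)           # residual for the initial beta = 0
    p = r.copy()
    norm0 = np.linalg.norm(r)
    max_iters = maxi if maxi > 0 else m + 1
    for _ in range(max_iters):
        if np.linalg.norm(r) <= tol * norm0:   # converged by the tolerance factor
            break
        q = A_times(p)
        alpha = (r @ r) / (p @ q)
        beta = beta + alpha * p
        r_new = r - alpha * q
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta
```

Each iteration touches $X$ only through two matrix-vector products, which is the reason CG scales to large $m$ where forming $A$ would be prohibitive.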
| |
| <p>The CG algorithm is more efficient if computing |
| $[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$, |
| an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix. The Direct Solver (DS) is more |
| efficient if $X$ takes up a lot more memory than $A$ (i.e. $X$ has a lot |
| more nonzeros than $m^2$) and if $m^2$ is small enough for the external |
| solver ($m \lesssim 50000$). A more precise determination between CG |
| and DS is subject to further research.</p> |
| |
| <p>In addition to the $\beta$-vector, the scripts estimate the residual |
| standard deviation $\sigma$ and the $R^2$, the ratio of “explained” |
| variance to the total variance of the response variable. These |
| statistics only make sense if the number of degrees of freedom |
| $n\,{-}\,m\,{-}\,1$ is positive and the regularization constant |
| $\lambda$ is negligible or zero. The formulas for $\sigma$ and |
| $R^2$ are:</p> |
| |
| <script type="math/tex; mode=display">R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad |
| \sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad |
| R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}</script> |
| |
| <p>where</p> |
| |
| <script type="math/tex; mode=display">\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i - |
| \frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad |
| \mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2</script> |
| |
| <p>Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the |
| estimated regression coefficients and the feature vectors. They may be |
| biased when no intercept is present, hence the RSS formula subtracts the |
| bias.</p> |
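<p>The formulas above can be sketched directly in Python (names are hypothetical, not from the scripts); note how RSS subtracts the mean residual, which absorbs the bias of the predictions when no intercept is present:</p>

```python
import math

def summary_stats(y, mu_hat, m):
    """R^2, residual standard deviation sigma, and adjusted R^2."""
    n = len(y)
    res = [yi - mi for yi, mi in zip(y, mu_hat)]
    res_mean = sum(res) / n
    # RSS subtracts the mean residual, as in the displayed formula
    rss = sum((r - res_mean) ** 2 for r in res)
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1.0 - rss / tss
    sigma = math.sqrt(rss / (n - m - 1))       # requires n - m - 1 > 0
    r2_adj = 1.0 - sigma ** 2 * (n - 1) / tss
    return r2, sigma, r2_adj
```

<p>A perfect fit gives $R^2 = 1$ and $\sigma = 0$; any residual noise lowers $R^2$ and, through the degrees-of-freedom correction, lowers $R^2_{\textrm{adj.}}$ a little more.</p>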
| |
<p>Lastly, note that by choosing the input option <code>icpt=2</code> the
user can shift and rescale the columns of $X$ to have zero mean and
unit variance. This is particularly important when applying
regularization over highly imbalanced features, because regularization
tends to penalize small-variance columns (which need large $\beta_j$’s)
more than large-variance columns (with small $\beta_j$’s). At the end,
the estimated regression coefficients are shifted and rescaled back to
apply to the original features.</p>
| |
| <h3 id="returns">Returns</h3> |
| |
| <p>The estimated regression coefficients (the $\hat{\beta}_j$’s) are |
| populated into a matrix and written to an HDFS file whose path/name was |
| provided as the <code>B</code> input argument. What this matrix |
| contains, and its size, depends on the input argument <code>icpt</code>, |
| which specifies the user’s intercept and rescaling choice:</p> |
| |
| <ul> |
| <li><strong>icpt=0</strong>: No intercept, matrix $B$ has size $m\,{\times}\,1$, with |
| $B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$.</li> |
<li><strong>icpt=1</strong>: There is an intercept, but no shifting/rescaling of $X$; matrix $B$ has
size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from
1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept
coefficient.</li>
<li><strong>icpt=2</strong>: There is an intercept, and the features in $X$ are shifted to mean $0$
and rescaled to variance $1$; then there are two versions of
| the $\hat{\beta}_j$’s, one for the original features and another for the |
| shifted/rescaled features. Now matrix $B$ has size |
| $(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and |
| $B[\cdot, 2]$ for the shifted/rescaled features, in the above format. |
| Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are |
| obtained from $B[\cdot, 2]$ by complementary shifting and rescaling.</li> |
| </ul> |
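<p>For <code>icpt=2</code>, the complementary shifting and rescaling can be sketched as follows (an assumed convention for illustration; consult the scripts for the exact mapping). Given coefficients estimated on standardized features together with the column means and standard deviations used for standardization:</p>

```python
# Recover original-feature coefficients (B[.,1]) from the
# standardized-feature estimates (B[.,2]). Hypothetical helper.

def destandardize(beta_std, intercept_std, means, sds):
    # eta = b0' + sum_j b_j' (x_j - mu_j)/s_j
    #     = (b0' - sum_j (b_j'/s_j) mu_j) + sum_j (b_j'/s_j) x_j
    beta_orig = [b / s for b, s in zip(beta_std, sds)]
    intercept_orig = intercept_std - sum(b * m for b, m in zip(beta_orig, means))
    return beta_orig, intercept_orig
```

<p>Both parameterizations produce the same prediction $\eta$ for any feature vector, which is why the two columns of $B$ describe the same fitted model.</p>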
| |
| <p>The estimated summary statistics, including residual standard |
| deviation $\sigma$ and the $R^2$, are printed out or sent into a file |
| (if specified) in CSV format as defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>. |
| For conjugate gradient iterations, a log file with monitoring variables |
| can also be made available, see <a href="algorithms-regression.html#table8"><strong>Table 8</strong></a>.</p> |
| |
| <hr /> |
| |
| <h2 id="stepwise-linear-regression">4.2. Stepwise Linear Regression</h2> |
| |
| <h3 id="description-1">Description</h3> |
| |
| <p>Our stepwise linear regression script selects a linear model based on |
| the Akaike information criterion (AIC): the model that gives rise to the |
| lowest AIC is computed.</p> |
| |
| <h3 id="usage-1">Usage</h3> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f StepLinearRegDS.dml |
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
| S=[file] |
| O=[file] |
| icpt=[int] |
| thr=[double] |
| fmt=[format] |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f StepLinearRegDS.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
| S=[file] |
| O=[file] |
| icpt=[int] |
| thr=[double] |
| fmt=[format] |
| </code></pre> |
| </div> |
| </div> |
| |
| <h3 id="arguments-for-spark-and-hadoop-invocation-1">Arguments for Spark and Hadoop invocation</h3> |
| |
| <p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors, each row |
| contains one feature vector.</p> |
| |
| <p><strong>Y</strong>: Location (on HDFS) to read the 1-column matrix of response values</p> |
| |
| <p><strong>B</strong>: Location (on HDFS) to store the estimated regression parameters (the |
| $\beta_j$’s), with the intercept parameter $\beta_0$ at position |
| B[$m\,{+}\,1$, 1] if available</p> |
| |
<p><strong>S</strong>: (default: <code>" "</code>) Location (on HDFS) to store the selected feature-ids in the
order in which they were selected by the algorithm; by default the selected
feature-ids are forwarded to the standard output.</p>
| |
| <p><strong>O</strong>: (default: <code>" "</code>) Location (on HDFS) to store the CSV-file of summary |
| statistics defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>; by default the |
| summary statistics are forwarded to the standard output.</p> |
| |
| <p><strong>icpt</strong>: (default: <code>0</code>) Intercept presence and shifting/rescaling the features |
| in $X$:</p> |
| |
| <ul> |
| <li>0 = no intercept (hence no $\beta_0$), no shifting or |
| rescaling of the features;</li> |
| <li>1 = add intercept, but do not shift/rescale the features |
| in $X$;</li> |
| <li>2 = add intercept, shift/rescale the features in $X$ to |
| mean 0, variance 1</li> |
| </ul> |
| |
<p><strong>thr</strong>: (default: <code>0.01</code>) Threshold to stop the algorithm: if the decrease in the value
of the AIC falls below <code>thr</code>, no further features are
checked and the algorithm stops.</p>
| |
| <p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, |
| <code>mm</code>, or <code>csv</code>; see read/write functions in |
| SystemML Language Reference for details.</p> |
| |
| <h3 id="examples-1">Examples</h3> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f StepLinearRegDS.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| S=/user/ml/selected.csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| thr=0.05 |
| fmt=csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f StepLinearRegDS.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| S=/user/ml/selected.csv |
| O=/user/ml/stats.csv |
| icpt=2 |
| thr=0.05 |
| fmt=csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <h3 id="details-1">Details</h3> |
| |
| <p>Stepwise linear regression iteratively selects predictive variables in |
| an automated procedure. Currently, our implementation supports forward |
| selection: starting from an empty model (without any variable) the |
| algorithm examines the addition of each variable based on the AIC as a |
| model comparison criterion. The AIC is defined as</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| AIC = -2 \log{L} + 2 edf,\label{eq:AIC} |
| \end{equation}</script> |
| |
| <p>where $L$ denotes the |
| likelihood of the fitted model and $edf$ is the equivalent degrees of |
| freedom, i.e., the number of estimated parameters. This procedure is |
| repeated until including no additional variable improves the model by a |
| certain threshold specified in the input parameter <code>thr</code>.</p> |
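<p>For Gaussian noise with the variance profiled out, $-2\log L$ equals $n\log(\mathrm{RSS}/n)$ up to an additive constant that cancels when models are compared on the same data, so the AIC comparison above can be sketched as (an assumed form, for illustration only):</p>

```python
import math

def aic_gaussian(rss, n, edf):
    # -2 log L  ~  n * log(RSS/n)  (up to a constant), plus the 2*edf penalty
    return n * math.log(rss / n) + 2 * edf
```

<p>A candidate variable is accepted only when the fit improvement outweighs the 2-per-parameter penalty; a variable that barely lowers RSS therefore raises the AIC and is rejected.</p>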
| |
| <p>For fitting a model in each iteration we use the <code>direct solve</code> method |
| as in the script <code>LinearRegDS.dml</code> discussed in |
| <a href="algorithms-regression.html#linear-regression">Linear Regression</a>.</p> |
| |
| <h3 id="returns-1">Returns</h3> |
| |
| <p>Similar to the outputs from <code>LinearRegDS.dml</code> the stepwise |
| linear regression script computes the estimated regression coefficients |
| and stores them in matrix $B$ on HDFS. The format of matrix $B$ is |
| identical to the one produced by the scripts for linear regression (see |
| <a href="algorithms-regression.html#linear-regression">Linear Regression</a>). Additionally, <code>StepLinearRegDS.dml</code> |
outputs the variable indices (stored in the 1-column matrix $S$) in the
order they have been selected by the algorithm, i.e., the $i$th entry in
matrix $S$ corresponds to the variable which improves the AIC the most
in the $i$th iteration. If the model with the lowest AIC includes no
variables, matrix $S$ contains a single zero entry. Moreover, the
estimated summary statistics as defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>
are printed out or stored in a file (if requested). In the case where an
empty model achieves the best AIC, these statistics are not produced.</p>
| |
| <hr /> |
| |
| <h2 id="generalized-linear-models">4.3. Generalized Linear Models</h2> |
| |
| <h3 id="description-2">Description</h3> |
| |
| <p>Generalized Linear Models |
| [<a href="algorithms-bibliography.html">Gill2000</a>, |
| <a href="algorithms-bibliography.html">McCullagh1989</a>, |
| <a href="algorithms-bibliography.html">Nelder1972</a>] |
| extend the methodology of linear |
| and logistic regression to a variety of distributions commonly assumed |
| as noise effects in the response variable. As before, we are given a |
| collection of records $(x_1, y_1)$, …, $(x_n, y_n)$ where $x_i$ is a |
| numerical vector of explanatory (feature) variables of size $\dim x_i = m$, and $y_i$ |
| is the response (dependent) variable observed for this vector. GLMs |
| assume that some linear combination of the features in $x_i$ determines |
| the <em>mean</em> $\mu_i$ of $y_i$, while the observed $y_i$ is a random |
| outcome of a noise distribution |
| $Prob[y\mid \mu_i]\,$<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> |
| with that mean $\mu_i$:</p> |
| |
| <script type="math/tex; mode=display">x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} |
| \,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim Prob[y\mid \mu_i]</script> |
| |
| <p>In linear regression the response mean $\mu_i$ <em>equals</em> some linear |
| combination over $x_i$, denoted above by $\eta_i$. In logistic |
| regression with <script type="math/tex">y\in\{0, 1\}</script> (Bernoulli) the mean of $y$ is the same |
| as $Prob[y=1]$ |
| and equals $1/(1+e^{-\eta_i})$, the logistic function of $\eta_i$. In |
| GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone |
| function called the <em>link function</em>: $\eta_i = g(\mu_i)$. The unknown |
| linear combination parameters $\beta_j$ are assumed to be the same for |
| all records.</p> |
| |
| <p>The goal of the regression is to estimate the parameters $\beta_j$ from |
| the observed data. Once the $\beta_j$’s are accurately estimated, we can |
| make predictions about $y$ for a new feature vector $x$. To do so, |
| compute $\eta$ from $x$ and use the inverted link function |
| $\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of $y$; then use the |
| distribution $Prob[y\mid \mu]$ |
| to make predictions about $y$. Both $g(\mu)$ and |
| $Prob[y\mid \mu]$ |
| are user-provided. Our GLM script supports a standard set of |
| distributions and link functions, see below for details.</p> |
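<p>The prediction chain $x \mapsto \eta \mapsto \mu$ described above can be sketched in a few lines, here with the log link as an example ($g(\mu) = \log\mu$, so $g^{-1}(\eta) = e^{\eta}$); function names are illustrative, not from <code>GLM.dml</code>:</p>

```python
import math

def predict_mean(x, beta, beta0, link_inv=math.exp):
    # eta: the linear combination of features
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    # mu = g^{-1}(eta): invert the link to obtain the predicted mean of y
    return link_inv(eta)
```

<p>Predictions about $y$ itself then follow from the noise distribution $Prob[y\mid\mu]$ with this mean.</p>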
| |
| <h3 id="usage-2">Usage</h3> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM.dml |
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
| fmt=[format] |
| O=[file] |
| Log=[file] |
| dfam=[int] |
| vpow=[double] |
| link=[int] |
| lpow=[double] |
| yneg=[double] |
| icpt=[int] |
| reg=[double] |
| tol=[double] |
| disp=[double] |
| moi=[int] |
| mii=[int] |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
-nvargs X=&lt;file&gt;
Y=&lt;file&gt;
B=&lt;file&gt;
| fmt=[format] |
| O=[file] |
| Log=[file] |
| dfam=[int] |
| vpow=[double] |
| link=[int] |
| lpow=[double] |
| yneg=[double] |
| icpt=[int] |
| reg=[double] |
| tol=[double] |
| disp=[double] |
| moi=[int] |
| mii=[int] |
| </code></pre> |
| </div> |
| </div> |
| |
| <h3 id="arguments-for-spark-and-hadoop-invocation-2">Arguments for Spark and Hadoop invocation</h3> |
| |
| <p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors; each row |
| constitutes an example.</p> |
| |
| <p><strong>Y</strong>: Location to read the response matrix, which may have 1 or 2 columns</p> |
| |
| <p><strong>B</strong>: Location to store the estimated regression parameters (the $\beta_j$’s), |
| with the intercept parameter $\beta_0$ at position |
| B[$m\,{+}\,1$, 1] if available</p> |
| |
| <p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, |
| <code>mm</code>, or <code>csv</code>; see read/write functions in |
| SystemML Language Reference for details.</p> |
| |
| <p><strong>O</strong>: (default: <code>" "</code>) Location to write certain summary statistics described |
| in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a>, |
| by default it is standard output.</p> |
| |
| <p><strong>Log</strong>: (default: <code>" "</code>) Location to store iteration-specific variables for monitoring |
| and debugging purposes, see <a href="algorithms-regression.html#table10"><strong>Table 10</strong></a> for details.</p> |
| |
| <p><strong>dfam</strong>: (default: <code>1</code>) Distribution family code to specify |
| $Prob[y\mid \mu]$, |
| see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>:</p> |
| |
| <ul> |
<li>1 = power distributions with $Var(y) = a\mu^q$</li>
| <li>2 = binomial or Bernoulli</li> |
| </ul> |
| |
| <p><strong>vpow</strong>: (default: <code>0.0</code>) When <code>dfam=1</code>, this provides the $q$ in |
| $Var(y) = a\mu^q$, the power |
| dependence of the variance of $y$ on its mean. In particular, use:</p> |
| |
| <ul> |
| <li>0.0 = Gaussian</li> |
| <li>1.0 = Poisson</li> |
| <li>2.0 = Gamma</li> |
| <li>3.0 = inverse Gaussian</li> |
| </ul> |
| |
| <p><strong>link</strong>: (default: <code>0</code>) Link function code to determine the link |
| function $\eta = g(\mu)$:</p> |
| |
| <ul> |
| <li>0 = canonical link (depends on the distribution family), |
| see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a></li> |
| <li>1 = power functions</li> |
| <li>2 = logit</li> |
| <li>3 = probit</li> |
| <li>4 = cloglog</li> |
| <li>5 = cauchit</li> |
| </ul> |
| |
| <p><strong>lpow</strong>: (default: <code>1.0</code>) When link=1, this provides the $s$ in |
| $\eta = \mu^s$, the power link function; <code>lpow=0.0</code> gives the |
| log link $\eta = \log\mu$. Common power links:</p> |
| |
| <ul> |
| <li>-2.0 = $1/\mu^2$</li> |
| <li>-1.0 = reciprocal</li> |
| <li>0.0 = log</li> |
| <li>0.5 = sqrt</li> |
| <li>1.0 = identity</li> |
| </ul> |
| |
| <p><strong>yneg</strong>: (default: <code>0.0</code>) When <code>dfam=2</code> and the response matrix $Y$ has |
| 1 column, this specifies the $y$-value used for Bernoulli “No” label |
| (“Yes” is $1$): |
| 0.0 when $y\in\{0, 1\}$; -1.0 when |
| $y\in\{-1, 1\}$</p> |
| |
| <p><strong>icpt</strong>: (default: <code>0</code>) Intercept and shifting/rescaling of the features in $X$:</p> |
| |
| <ul> |
| <li>0 = no intercept (hence no $\beta_0$), no |
| shifting/rescaling of the features</li> |
| <li>1 = add intercept, but do not shift/rescale the features |
| in $X$</li> |
| <li>2 = add intercept, shift/rescale the features in $X$ to |
| mean 0, variance 1</li> |
| </ul> |
| |
| <p><strong>reg</strong>: (default: <code>0.0</code>) L2-regularization parameter ($\lambda$)</p> |
| |
| <p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance ($\varepsilon$) used in the convergence criterion: we |
| terminate the outer iterations when the deviance changes by less than |
| this factor; see below for details</p> |
| |
| <p><strong>disp</strong>: (default: <code>0.0</code>) Dispersion parameter, or 0.0 to estimate it from |
| data</p> |
| |
| <p><strong>moi</strong>: (default: <code>200</code>) Maximum number of outer (Fisher scoring) iterations</p> |
| |
<p><strong>mii</strong>: (default: <code>0</code>) Maximum number of inner (conjugate gradient) iterations, or 0
for no limit</p>
| |
| <h3 id="examples-2">Examples</h3> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| dfam=2 |
| link=2 |
| yneg=-1.0 |
| icpt=2 |
| reg=0.01 |
| tol=0.00000001 |
| disp=1.0 |
| moi=100 |
| mii=10 |
| O=/user/ml/stats.csv |
| Log=/user/ml/log.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| fmt=csv |
| dfam=2 |
| link=2 |
| yneg=-1.0 |
| icpt=2 |
| reg=0.01 |
| tol=0.00000001 |
| disp=1.0 |
| moi=100 |
| mii=10 |
| O=/user/ml/stats.csv |
| Log=/user/ml/log.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <hr /> |
| |
| <p><a name="table9"></a> |
<strong>Table 9</strong>: Besides $\beta$, the GLM regression script computes a few summary
statistics listed below. They are provided in CSV format, one
comma-separated name-value pair per line.</p>
| |
| <table> |
| <thead> |
| <tr> |
| <th>Name</th> |
| <th>Meaning</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>TERMINATION_CODE</td> |
| <td>A positive integer indicating success/failure as follows: <br />1 = Converged successfully; 2 = Maximum # of iterations reached; 3 = Input (X, Y) out of range; 4 = Distribution/link not supported</td> |
| </tr> |
| <tr> |
| <td>BETA_MIN</td> |
| <td>Smallest beta value (regression coefficient), excluding the intercept</td> |
| </tr> |
| <tr> |
| <td>BETA_MIN_INDEX</td> |
| <td>Column index for the smallest beta value</td> |
| </tr> |
| <tr> |
| <td>BETA_MAX</td> |
| <td>Largest beta value (regression coefficient), excluding the intercept</td> |
| </tr> |
| <tr> |
| <td>BETA_MAX_INDEX</td> |
| <td>Column index for the largest beta value</td> |
| </tr> |
| <tr> |
| <td>INTERCEPT</td> |
| <td>Intercept value, or NaN if there is no intercept (if <code>icpt=0</code>)</td> |
| </tr> |
| <tr> |
| <td>DISPERSION</td> |
| <td>Dispersion used to scale deviance, provided in disp input argument or estimated (same as DISPERSION_EST) if disp argument is $\leq 0$</td> |
| </tr> |
| <tr> |
| <td>DISPERSION_EST</td> |
| <td>Dispersion estimated from the dataset</td> |
| </tr> |
| <tr> |
| <td>DEVIANCE_UNSCALED</td> |
| <td>Deviance from the saturated model, assuming dispersion $= 1.0$</td> |
| </tr> |
| <tr> |
| <td>DEVIANCE_SCALED</td> |
| <td>Deviance from the saturated model, scaled by DISPERSION value</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <p><a name="table10"></a> |
<strong>Table 10</strong>: The Log file for GLM regression contains the following
iteration variables in CSV format, each line containing a triple (Name,
Iteration#, Value), with Iteration# being 0 for initial values.</p>
| |
| <table> |
| <thead> |
| <tr> |
| <th>Name</th> |
| <th>Meaning</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>NUM_CG_ITERS</td> |
| <td>Number of inner (Conj. Gradient) iterations in this outer iteration</td> |
| </tr> |
| <tr> |
| <td>IS_TRUST_REACHED</td> |
| <td>1 = trust region boundary was reached, 0 = otherwise</td> |
| </tr> |
| <tr> |
| <td>POINT_STEP_NORM</td> |
| <td>L2-norm of iteration step from old point ($\beta$-vector) to new point</td> |
| </tr> |
| <tr> |
| <td>OBJECTIVE</td> |
| <td>The loss function we minimize (negative partial log-likelihood)</td> |
| </tr> |
| <tr> |
| <td>OBJ_DROP_REAL</td> |
| <td>Reduction in the objective during this iteration, actual value</td> |
| </tr> |
| <tr> |
| <td>OBJ_DROP_PRED</td> |
| <td>Reduction in the objective predicted by a quadratic approximation</td> |
| </tr> |
| <tr> |
| <td>OBJ_DROP_RATIO</td> |
| <td>Actual-to-predicted reduction ratio, used to update the trust region</td> |
| </tr> |
| <tr> |
| <td>GRADIENT_NORM</td> |
| <td>L2-norm of the loss function gradient (omitted if point is rejected)</td> |
| </tr> |
| <tr> |
| <td>LINEAR_TERM_MIN</td> |
| <td>The minimum value of $X$ %*% $\beta$, used to check for overflows</td> |
| </tr> |
| <tr> |
| <td>LINEAR_TERM_MAX</td> |
| <td>The maximum value of $X$ %*% $\beta$, used to check for overflows</td> |
| </tr> |
| <tr> |
| <td>IS_POINT_UPDATED</td> |
| <td>1 = new point accepted; 0 = new point rejected, old point restored</td> |
| </tr> |
| <tr> |
| <td>TRUST_DELTA</td> |
| <td>Updated trust region size, the “delta”</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <p><a name="table11"></a> |
| <strong>Table 11</strong>: Common GLM distribution families and link functions. (Here “*” stands |
| for “any value.”)</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th style="text-align: center">dfam</th> |
| <th style="text-align: center">vpow</th> |
| <th style="text-align: center">link</th> |
| <th style="text-align: center">lpow</th> |
| <th style="text-align: center">Distribution<br />Family</th> |
| <th style="text-align: center">Link<br /> Function</th> |
| <th style="text-align: center">Canonical</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">0.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">-1.0</td> |
| <td style="text-align: center">Gaussian</td> |
| <td style="text-align: center">inverse</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">0.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 0.0</td> |
| <td style="text-align: center">Gaussian</td> |
| <td style="text-align: center">log</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">0.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 1.0</td> |
| <td style="text-align: center">Gaussian</td> |
| <td style="text-align: center">identity</td> |
| <td style="text-align: center">Yes</td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">1.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 0.0</td> |
| <td style="text-align: center">Poisson</td> |
| <td style="text-align: center">log</td> |
| <td style="text-align: center">Yes</td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">1.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 0.5</td> |
| <td style="text-align: center">Poisson</td> |
| <td style="text-align: center">sq.root</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">1.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 1.0</td> |
| <td style="text-align: center">Poisson</td> |
| <td style="text-align: center">identity</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">2.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">-1.0</td> |
| <td style="text-align: center">Gamma</td> |
| <td style="text-align: center">inverse</td> |
| <td style="text-align: center">Yes</td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">2.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 0.0</td> |
| <td style="text-align: center">Gamma</td> |
| <td style="text-align: center">log</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">2.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 1.0</td> |
| <td style="text-align: center">Gamma</td> |
| <td style="text-align: center">identity</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">3.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">-2.0</td> |
| <td style="text-align: center">Inverse Gauss</td> |
| <td style="text-align: center">$1/\mu^2$</td> |
| <td style="text-align: center">Yes</td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">3.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">-1.0</td> |
| <td style="text-align: center">Inverse Gauss</td> |
| <td style="text-align: center">inverse</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">3.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 0.0</td> |
| <td style="text-align: center">Inverse Gauss</td> |
| <td style="text-align: center">log</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center">3.0</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 1.0</td> |
| <td style="text-align: center">Inverse Gauss</td> |
| <td style="text-align: center">identity</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">2</td> |
| <td style="text-align: center">*</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 0.0</td> |
| <td style="text-align: center">Binomial</td> |
| <td style="text-align: center">log</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">2</td> |
| <td style="text-align: center">*</td> |
| <td style="text-align: center">1</td> |
| <td style="text-align: center"> 0.5</td> |
| <td style="text-align: center">Binomial</td> |
| <td style="text-align: center">sq.root</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">2</td> |
| <td style="text-align: center">*</td> |
| <td style="text-align: center">2</td> |
| <td style="text-align: center"> *</td> |
| <td style="text-align: center">Binomial</td> |
| <td style="text-align: center">logit</td> |
| <td style="text-align: center">Yes</td> |
| </tr> |
| <tr> |
| <td style="text-align: center">2</td> |
| <td style="text-align: center">*</td> |
| <td style="text-align: center">3</td> |
| <td style="text-align: center"> *</td> |
| <td style="text-align: center">Binomial</td> |
| <td style="text-align: center">probit</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">2</td> |
| <td style="text-align: center">*</td> |
| <td style="text-align: center">4</td> |
| <td style="text-align: center"> *</td> |
| <td style="text-align: center">Binomial</td> |
| <td style="text-align: center">cloglog</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| <tr> |
| <td style="text-align: center">2</td> |
| <td style="text-align: center">*</td> |
| <td style="text-align: center">5</td> |
| <td style="text-align: center"> *</td> |
| <td style="text-align: center">Binomial</td> |
| <td style="text-align: center">cauchit</td> |
| <td style="text-align: center"> </td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <p><a name="table12"></a> |
| <strong>Table 12</strong>: The supported non-power link functions for the Bernoulli and the |
| binomial distributions. Here $\mu$ is the Bernoulli mean.</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th>Name</th> |
| <th>Link Function</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>Logit</td> |
<td>$\displaystyle \eta = \log\big(\mu \,/\, (1 - \mu)\big)^{\mathstrut}$</td>
| </tr> |
| <tr> |
| <td>Probit</td> |
| <td><script type="math/tex">\displaystyle \mu = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut} e^{-\frac{t^2}{2}} dt</script></td> |
| </tr> |
| <tr> |
| <td>Cloglog</td> |
| <td>$\displaystyle \eta = \log \big(- \log(1 - \mu)\big)^{\mathstrut}$</td> |
| </tr> |
| <tr> |
| <td>Cauchit</td> |
<td>$\displaystyle \eta = \tan\big(\pi(\mu - 1/2)\big)$</td>
| </tr> |
| </tbody> |
| </table> |
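<p>The four non-power links of Table 12, written as plain functions $\eta = g(\mu)$; the probit link is the inverse of the standard normal CDF, obtained here by bisection via the error function (an illustrative implementation, not the script's own):</p>

```python
import math

def logit(mu):
    return math.log(mu / (1.0 - mu))

def cloglog(mu):
    return math.log(-math.log(1.0 - mu))

def cauchit(mu):
    return math.tan(math.pi * (mu - 0.5))

def probit(mu):
    # solve Phi(eta) = mu by bisection, with Phi the standard normal CDF
    phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

<p>All four links map $\mu = 1/2$ near the center of their range; logit, probit, and cauchit map it exactly to $\eta = 0$.</p>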
| |
| <hr /> |
| |
| <h3 id="details-2">Details</h3> |
| |
| <p>In GLM, the noise distribution |
| $Prob[y\mid \mu]$ |
| of the response variable $y$ given its mean $\mu$ is restricted to have |
| the <em>exponential family</em> form</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| Y \sim\, Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a} |
| + c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = E(Y) = b'(\theta). |
| \end{equation}</script> |
| |
| <p>Changing the mean in such a distribution simply |
| multiplies all |
| $Prob[y\mid \mu]$ |
| by $e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so |
that they again integrate to 1. Parameter $\theta$ is called
<em>canonical</em>, and the function $\theta = b'^{\,-1}(\mu)$ that relates it
to the mean is called the <em>canonical link</em>; constant $a$ is called
<em>dispersion</em> and rescales the variance of $y$. Many common distributions
| can be put into this form, see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>. The canonical |
| parameter $\theta$ is often chosen to coincide with $\eta$, the linear |
| combination of the regression features; other choices for $\eta$ are |
| possible too.</p> |
| |
| <p>Rather than specifying the canonical link, GLM distributions are |
| commonly defined by their variance |
| $Var(y)$ as the function of |
the mean $\mu$. It can be shown from Eq. 5 that
$Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$.
| For example, for the Bernoulli distribution |
| $Var(y) = \mu(1-\mu)$, for the |
| Poisson distribution |
| $Var(y) = \mu$, and for the Gaussian distribution |
| $Var(y) = a\cdot 1 = \sigma^2$. |
| It turns out that for many common distributions |
| $Var(y) = a\mu^q$, a power |
| function. We support all distributions where |
| $Var(y) = a\mu^q$, as well as |
| the Bernoulli and the binomial distributions.</p> |
| |
| <p>For distributions with |
| $Var(y) = a\mu^q$ the |
| canonical link is also a power function, namely |
| $\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose |
| canonical link is $\theta = \log\mu$. We support all power link |
| functions in the form $\eta = \mu^s$, dropping any constant factor, with |
| $\eta = \log\mu$ for $s=0$. The binomial distribution has its own family |
| of link functions, which includes logit (the canonical link), probit, |
| cloglog, and cauchit (see <a href="algorithms-regression.html#table12"><strong>Table 12</strong></a>); we support |
| these only for the binomial and Bernoulli distributions. Links and |
| distributions are specified via four input parameters: |
| <code>dfam</code>, <code>vpow</code>, <code>link</code>, and |
| <code>lpow</code> (see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>).</p> |
| |
| <p>The observed response values are provided to the regression script as a |
| matrix $Y$ having 1 or 2 columns. If a power distribution family is |
| selected (<code>dfam=1</code>), matrix $Y$ must have 1 column that |
| provides $y_i$ for each $x_i$ in the corresponding row of matrix $X$. |
| When <code>dfam=2</code> and $Y$ has 1 column, we assume the Bernoulli |
| distribution for <script type="math/tex">y_i\in\{y_{\mathrm{neg}}, 1\}</script> with $y_{\mathrm{neg}}$ |
| from the input parameter <code>yneg</code>. When <code>dfam=2</code> and |
| $Y$ has 2 columns, we assume the binomial distribution; for each row $i$ |
| in $X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide the positive and the |
| negative binomial counts respectively. Internally we convert the |
| 1-column Bernoulli into the 2-column binomial with 0-versus-1 counts.</p> |
| |
| <p>We estimate the regression parameters via L2-regularized negative |
| log-likelihood minimization:</p> |
| |
| <script type="math/tex; mode=display">f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big) |
| \,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min</script> |
| |
| <p>where |
| $\theta_i$ and $b(\theta_i)$ are from (6); note that $a$ and |
| $c(y, a)$ are constant w.r.t. $\beta$ and can be ignored here. The |
| canonical parameter $\theta_i$ depends on both $\beta$ and $x_i$:</p> |
| |
| <script type="math/tex; mode=display">\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\, |
| \big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)</script> |
| |
| <p>The user-provided (via <code>reg</code>) regularization coefficient |
| $\lambda\geq 0$ can be used to mitigate overfitting and degeneracy in |
| the data. Note that the intercept is never regularized.</p> |
| |
| <p>Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring |
| approximation to the difference |
| $\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$, |
| recomputed at each iteration:</p> |
| |
| <script type="math/tex; mode=display">\begin{gathered} |
| \varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z, |
| \,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!diag(w) X \,+\, \lambda I\\ |
| \textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta, |
| \,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\ |
| \forall\,i = 1\ldots n: \,\,\,\, |
| w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1} |
| \!\!\!\!\!\!,\,\,\,\,\,\,\,\,\, |
| u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1} |
| \!\!\!\!\!\!.\,\,\,\,\end{gathered}</script> |
| |
| <p>Here |
| $v(\mu_i)=Var(y_i)/a$ is the |
| variance of $y_i$ as a function of the mean, and |
| $g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative. The |
| Fisher scoring approximation is minimized by trust-region conjugate |
| gradient iterations (called the <em>inner</em> iterations, with the Fisher |
| scoring iterations as the <em>outer</em> iterations), which approximately solve |
| the following problem:</p> |
| |
| <script type="math/tex; mode=display">1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\, |
| \|z\|_2 \leq \delta</script> |
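<p>The matrices $A$ and $G$ above are straightforward to assemble once $w$ and $u$ are known. A sketch follows (our own Python, again specialized to the Poisson family with log link, for which $v(\mu) = \mu$ and $g'(\mu) = 1/\mu$, hence $w_i = \mu_i$ and $u_i = y_i - \mu_i$); note that the actual script solves the trust-region subproblem by conjugate gradient rather than materializing $A$:</p>

```python
import numpy as np

def fisher_terms(X, y, beta, lam):
    """A = X^T diag(w) X + lam*I and G = -X^T u + lam*beta
    for Poisson regression with the log link."""
    mu = np.exp(X @ beta)          # inverse log link
    w = mu                         # w_i = [v(mu_i) g'(mu_i)^2]^{-1} = mu_i
    u = y - mu                     # u_i = (y_i - mu_i)[v(mu_i) g'(mu_i)]^{-1}
    A = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
    G = -X.T @ u + lam * beta
    return A, G
```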
| |
| <p>The conjugate gradient algorithm closely follows |
| Algorithm 7.2 on page 171 of <a href="algorithms-bibliography.html">[Nocedal2006]</a>. The trust region |
| size $\delta$ is initialized as |
| $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$ and updated as described |
| in <a href="algorithms-bibliography.html">[Nocedal2006]</a>. |
| The user can specify the maximum number of |
| the outer and the inner iterations with input parameters |
| <code>moi</code> and <code>mii</code>, respectively. The Fisher scoring |
| algorithm terminates successfully if |
| $2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}{\varepsilon}$ |
| where ${\varepsilon}> 0$ is a tolerance supplied by the user via |
| <code>tol</code>, and $D_1(\beta)$ is the unit-dispersion deviance |
| estimated as</p> |
| |
| <script type="math/tex; mode=display">D_1(\beta) \,\,=\,\, 2 \cdot \big(\log Prob[Y \mid \! |
| \begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1] |
| \,\,-\,\,\log Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)</script> |
| |
| <p>The deviance estimate is also produced as part of the output. Once the |
| Fisher scoring algorithm terminates, if requested by the user, we |
| estimate the dispersion $a$ from (6) using Pearson residuals</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| \hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)} |
| \end{equation}</script> |
| |
| <p>and use it to adjust our deviance estimate: |
| $D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$. If input argument |
| <code>disp</code> is <code>0.0</code>, we estimate $\hat{a}$; otherwise |
| we use its value as $a$. Note that in (7) $m$ counts |
| the intercept ($m \leftarrow m+1$) if it is present.</p> |
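<p>The Pearson-residual dispersion estimate (7) can be sketched as follows (our own Python helper, for a power variance $v(\mu) = \mu^q$; remember that $m$ must count the intercept when one is fitted):</p>

```python
import numpy as np

def pearson_dispersion(y, mu, q, m):
    """Estimate a_hat = 1/(n - m) * sum_i (y_i - mu_i)^2 / v(mu_i)."""
    n = len(y)
    v = mu ** q                    # power variance function v(mu) = mu^q
    return np.sum((y - mu) ** 2 / v) / (n - m)
```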
| |
| <h3 id="returns-2">Returns</h3> |
| |
| <p>The estimated regression parameters (the $\hat{\beta}_j$’s) are |
| populated into a matrix and written to an HDFS file whose path/name was |
| provided as the <code>B</code> input argument. What this matrix |
| contains, and its size, depends on the input argument <code>icpt</code>, |
| which specifies the user’s intercept and rescaling choice:</p> |
| |
| <ul> |
| <li><strong>icpt=0</strong>: No intercept, matrix $B$ has size $m\,{\times}\,1$, with |
| $B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$.</li> |
| <li><strong>icpt=1</strong>: There is an intercept, but no shifting/rescaling of $X$; matrix $B$ has |
| size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from |
| 1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept |
| coefficient.</li> |
| <li><strong>icpt=2</strong>: There is an intercept, and the features in $X$ are shifted to mean${} = 0$ |
| and rescaled to variance${} = 1$; then there are two versions of |
| the $\hat{\beta}_j$’s, one for the original features and another for the |
| shifted/rescaled features. Now matrix $B$ has size |
| $(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and |
| $B[\cdot, 2]$ for the shifted/rescaled features, in the above format. |
| Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are |
| obtained from $B[\cdot, 2]$ by complementary shifting and rescaling.</li> |
| </ul> |
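<p>The complementary shifting and rescaling that converts $B[\cdot, 2]$ into $B[\cdot, 1]$ can be sketched as follows (our own Python, under the assumption that column $j$ of $X$ was standardized as $(x - \mu_j)/\sigma_j$):</p>

```python
import numpy as np

def unscale_betas(beta_scaled, intercept_scaled, means, stds):
    """Map coefficients fit on standardized features back to the
    original features: beta_j / std_j, with the shifts folded into
    the intercept."""
    beta_orig = beta_scaled / stds
    intercept_orig = intercept_scaled - np.sum(beta_scaled * means / stds)
    return beta_orig, intercept_orig
```

Both parameterizations give the same linear predictor: $\beta_0' + \sum_j \beta_j'(x_j - \mu_j)/\sigma_j = \beta_0 + \sum_j \beta_j x_j$ with $\beta_j = \beta_j'/\sigma_j$.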
| |
| <p>Our script also estimates the dispersion $\hat{a}$ (or takes it from the |
| user’s input) and the deviances $D_1(\hat{\beta})$ and |
| $D_{\hat{a}}(\hat{\beta})$, see <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a> for details. A |
| log file with variables monitoring progress through the iterations can |
| also be made available, see <a href="algorithms-regression.html#table10"><strong>Table 10</strong></a>.</p> |
| |
| <h3 id="see-also">See Also</h3> |
| |
| <p>In case of binary classification problems, consider using L2-SVM or |
| binary logistic regression; for multiclass classification, use |
| multiclass SVM or multinomial logistic regression. For the special cases |
| of linear regression and logistic regression, it may be more efficient |
| to use the corresponding specialized scripts instead of GLM.</p> |
| |
| <hr /> |
| |
| <h2 id="stepwise-generalized-linear-regression">4.4. Stepwise Generalized Linear Regression</h2> |
| |
| <h3 id="description-3">Description</h3> |
| |
| <p>Our stepwise generalized linear regression script selects a model based |
| on the Akaike information criterion (AIC): the model that gives rise to |
| the lowest AIC is provided. Note that currently only the Bernoulli |
| distribution family is supported (see below for details).</p> |
| |
| <h3 id="usage-3">Usage</h3> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f StepGLM.dml |
| -nvargs X=&lt;file&gt; |
| Y=&lt;file&gt; |
| B=&lt;file&gt; |
| S=[file] |
| O=[file] |
| link=[int] |
| yneg=[double] |
| icpt=[int] |
| tol=[double] |
| disp=[double] |
| moi=[int] |
| mii=[int] |
| thr=[double] |
| fmt=[format] |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f StepGLM.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=&lt;file&gt; |
| Y=&lt;file&gt; |
| B=&lt;file&gt; |
| S=[file] |
| O=[file] |
| link=[int] |
| yneg=[double] |
| icpt=[int] |
| tol=[double] |
| disp=[double] |
| moi=[int] |
| mii=[int] |
| thr=[double] |
| fmt=[format] |
| </code></pre> |
| </div> |
| </div> |
| |
| <h3 id="arguments-for-spark-and-hadoop-invocation-3">Arguments for Spark and Hadoop invocation</h3> |
| |
| <p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors; each row is an |
| example.</p> |
| |
| <p><strong>Y</strong>: Location (on HDFS) to read the response matrix, which may have 1 or 2 |
| columns</p> |
| |
| <p><strong>B</strong>: Location (on HDFS) to store the estimated regression parameters (the |
| $\beta_j$’s), with the intercept parameter $\beta_0$ at position |
| B[$m\,{+}\,1$, 1] if available</p> |
| |
| <p><strong>S</strong>: (default: <code>" "</code>) Location (on HDFS) to store the selected feature-ids in the |
| order as computed by the algorithm; by default it is standard output.</p> |
| |
| <p><strong>O</strong>: (default: <code>" "</code>) Location (on HDFS) to write certain summary statistics |
| described in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a>; by default it is standard |
| output.</p> |
| |
| <p><strong>link</strong>: (default: <code>2</code>) Link function code to determine the link |
| function $\eta = g(\mu)$, see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>; currently the |
| following link functions are supported:</p> |
| |
| <ul> |
| <li>1 = log</li> |
| <li>2 = logit</li> |
| <li>3 = probit</li> |
| <li>4 = cloglog</li> |
| </ul> |
| |
| <p><strong>yneg</strong>: (default: <code>0.0</code>) Response value for Bernoulli “No” label, usually 0.0 or -1.0</p> |
| |
| <p><strong>icpt</strong>: (default: <code>0</code>) Intercept and shifting/rescaling of the features in $X$:</p> |
| |
| <ul> |
| <li>0 = no intercept (hence no $\beta_0$), no |
| shifting/rescaling of the features</li> |
| <li>1 = add intercept, but do not shift/rescale the features |
| in $X$</li> |
| <li>2 = add intercept, shift/rescale the features in $X$ to |
| mean 0, variance 1</li> |
| </ul> |
| |
| <p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance ($\epsilon$) used in the convergence criterion: we |
| terminate the outer iterations when the deviance changes by less than |
| this factor; see below for details.</p> |
| |
| <p><strong>disp</strong>: (default: <code>0.0</code>) Dispersion parameter, or <code>0.0</code> to estimate it from |
| data</p> |
| |
| <p><strong>moi</strong>: (default: <code>200</code>) Maximum number of outer (Fisher scoring) iterations</p> |
| |
| <p><strong>mii</strong>: (default: <code>0</code>) Maximum number of inner (conjugate gradient) iterations, or 0 |
| if no maximum limit provided</p> |
| |
| <p><strong>thr</strong>: (default: <code>0.01</code>) Threshold to stop the algorithm: if the decrease in the value |
| of the AIC falls below <code>thr</code>, no further features are |
| checked and the algorithm stops.</p> |
| |
| <p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, |
| <code>mm</code>, or <code>csv</code>; see read/write functions in |
| SystemML Language Reference for details.</p> |
| |
| <h3 id="examples-3">Examples</h3> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f StepGLM.dml |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| S=/user/ml/selected.csv |
| O=/user/ml/stats.csv |
| link=2 |
| yneg=-1.0 |
| icpt=2 |
| tol=0.000001 |
| moi=100 |
| mii=10 |
| thr=0.05 |
| fmt=csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f StepGLM.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| Y=/user/ml/Y.mtx |
| B=/user/ml/B.mtx |
| S=/user/ml/selected.csv |
| O=/user/ml/stats.csv |
| link=2 |
| yneg=-1.0 |
| icpt=2 |
| tol=0.000001 |
| moi=100 |
| mii=10 |
| thr=0.05 |
| fmt=csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <h3 id="details-3">Details</h3> |
| |
| <p>Similar to <code>StepLinearRegDS.dml</code>, our stepwise GLM script |
| builds a model by iteratively selecting predictive variables using a |
| forward selection strategy based on the AIC (5). Note that |
| currently only the Bernoulli distribution family (<code>fam=2</code> in |
| <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>) together with the following link functions |
| are supported: <code>log</code>, <code>logit</code>, <code>probit</code>, and <code>cloglog</code> (link |
| <script type="math/tex">\in\{1,2,3,4\}</script> in <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>).</p> |
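<p>The forward-selection loop can be sketched as follows (plain Python with our own names; <code>fit_and_aic</code> is a hypothetical stand-in for fitting the GLM on the chosen columns and returning its AIC):</p>

```python
def forward_select(num_features, fit_and_aic, thr=0.01):
    """Greedily add the feature that lowers the AIC the most; stop once
    the improvement falls below thr, mirroring the thr input above."""
    selected = []
    best_aic = fit_and_aic(selected)
    while True:
        candidates = [j for j in range(num_features) if j not in selected]
        if not candidates:
            break
        scores = {j: fit_and_aic(selected + [j]) for j in candidates}
        j_best = min(scores, key=scores.get)
        if best_aic - scores[j_best] < thr:
            break                  # no sufficient AIC improvement left
        selected.append(j_best)
        best_aic = scores[j_best]
    return selected
```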
| |
| <h3 id="returns-3">Returns</h3> |
| |
| <p>Similar to the outputs from <code>GLM.dml</code>, the stepwise GLM script |
| computes the estimated regression coefficients and stores them in matrix |
| $B$ on HDFS; matrix $B$ follows the same format as the one produced by |
| <code>GLM.dml</code> (see <a href="algorithms-regression.html#generalized-linear-models">Generalized Linear Models</a>). Additionally, |
| <code>StepGLM.dml</code> outputs the variable indices (stored in the |
| 1-column matrix $S$) in the order they have been selected by the |
| algorithm, i.e., the $i$th entry in matrix $S$ stores the variable which |
| improves the AIC the most in the $i$th iteration. If the model with the |
| lowest AIC includes no variables, matrix $S$ will be empty. Moreover, the |
| estimated summary statistics as defined in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a> are |
| printed out or stored in a file on HDFS (if requested); these statistics |
| will be provided only if the selected model is nonempty, i.e., contains |
| at least one variable.</p> |
| |
| <hr /> |
| |
| <h2 id="regression-scoring-and-prediction">4.5. Regression Scoring and Prediction</h2> |
| |
| <h3 id="description-4">Description</h3> |
| |
| <p>Script <code>GLM-predict.dml</code> is intended to cover all linear |
| model based regressions, including linear regression, binomial and |
| multinomial logistic regression, and GLM regressions (Poisson, gamma, |
| binomial with probit link etc.). Having just one scoring script for all |
| these regressions simplifies maintenance and enhancement while ensuring |
| compatible interpretations for output statistics.</p> |
| |
| <p>The script performs two functions, prediction and scoring. To perform |
| prediction, the script takes two matrix inputs: a collection of records |
| $X$ (without the response attribute) and the estimated regression |
| parameters $B$, also known as $\beta$. To perform scoring, in addition |
| to $X$ and $B$, the script takes the matrix of actual response |
| values $Y$ that are compared to the predictions made with $X$ and $B$. |
| Of course there are other, non-matrix, input arguments that specify the |
| model and the output format, see below for the full list.</p> |
| |
| <p>We assume that our test/scoring dataset is given by |
| $n\,{\times}\,m$-matrix $X$ of numerical feature vectors, where each |
| row $x_i$ represents one feature vector of one record; we have $\dim x_i = m$. Each |
| record also includes the response variable $y_i$ that may be numerical, |
| single-label categorical, or multi-label categorical. A single-label |
| categorical $y_i$ is an integer category label, one label per record; a |
| multi-label $y_i$ is a vector of integer counts, one count for each |
| possible label, which represents multiple single-label events |
| (observations) for the same $x_i$. Internally we convert single-label |
| categoricals into multi-label categoricals by replacing each label $l$ |
| with an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$. In |
| prediction-only tasks the actual $y_i$’s are not needed by the script, |
| but they are needed for scoring.</p> |
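<p>The single-label to multi-label conversion described above amounts to one-hot encoding. A sketch (our own Python helper, assuming 1-based integer labels):</p>

```python
import numpy as np

def labels_to_indicators(labels, num_categories):
    """Replace label l by the indicator vector (0,...,0,1_l,0,...,0)."""
    Y = np.zeros((len(labels), num_categories))
    Y[np.arange(len(labels)), np.asarray(labels) - 1] = 1.0
    return Y
```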
| |
| <p>To perform prediction, the script matrix-multiplies $X$ and $B$, adding |
| the intercept if available, then applies the inverse of the model’s link |
| function. All GLMs assume that the linear combination of the features |
| in $x_i$ and the betas in $B$ determines the means $\mu_i$ of |
| the $y_i$’s (in numerical or multi-label categorical form) with |
| $\dim \mu_i = \dim y_i$. The observed $y_i$ is assumed to follow a |
| specified GLM family distribution |
| $Prob[y\mid \mu_i]$ |
| with mean(s) $\mu_i$:</p> |
| |
| <script type="math/tex; mode=display">x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} |
| \,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim Prob[y\mid \mu_i]</script> |
| |
| <p>If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number. Then |
| our script’s output matrix $M$ is the $n\,{\times}\,1$-vector of these |
| means $\mu_i$. Note that $\mu_i$ predicts the mean of $y_i$, not the |
| actual $y_i$. For example, in the Poisson distribution the mean is usually |
| fractional, but the actual $y_i$ is always an integer.</p> |
| |
| <p>If $y_i$ is categorical, i.e. a vector of label counts for record $i$, |
| then $\mu_i$ is a vector of non-negative real numbers, one number |
| <script type="math/tex">\mu_{i,l}</script> for each label $l$. In this case we divide the <script type="math/tex">\mu_{i,l}</script> |
| by their sum $\sum_l \mu_{i,l}$ to obtain predicted label |
| probabilities <script type="math/tex">p_{i,l}\in [0, 1]</script>. The output matrix $M$ is the |
| $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the |
| number of records and $k\,{+}\,1$ is the number of categories<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. Note |
| again that we do not predict the labels themselves, nor their actual |
| counts per record, but we predict the labels’ probabilities.</p> |
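<p>The normalization step above is a simple row-wise division; a sketch (our own Python, where $M$ holds the per-label means $\mu_{i,l}$):</p>

```python
import numpy as np

def means_to_probabilities(M):
    """p_{i,l} = mu_{i,l} / sum_l mu_{i,l}, one row per record."""
    return M / M.sum(axis=1, keepdims=True)
```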
| |
| <p>Going from predicted probabilities to predicted labels, in the |
| single-label categorical case, requires extra information such as the |
| cost of false positive versus false negative errors. For example, if |
| there are 5 categories and we <em>accurately</em> predicted their probabilities |
| as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the highest-probability |
| label would be wrong 70% of the time, whereas picking the |
| lowest-probability label might be right if, say, it represents a |
| diagnosis of cancer or another rare and serious outcome. Hence, we keep |
| this step outside the scope of <code>GLM-predict.dml</code> for now.</p> |
| |
| <h3 id="usage-4">Usage</h3> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs X=&lt;file&gt; |
| Y=[file] |
| B=&lt;file&gt; |
| M=[file] |
| O=[file] |
| dfam=[int] |
| vpow=[double] |
| link=[int] |
| lpow=[double] |
| disp=[double] |
| fmt=[format] |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=&lt;file&gt; |
| Y=[file] |
| B=&lt;file&gt; |
| M=[file] |
| O=[file] |
| dfam=[int] |
| vpow=[double] |
| link=[int] |
| lpow=[double] |
| disp=[double] |
| fmt=[format] |
| </code></pre> |
| </div> |
| </div> |
| |
| <h3 id="arguments-for-spark-and-hadoop-invocation-4">Arguments for Spark and Hadoop invocation</h3> |
| |
| <p><strong>X</strong>: Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature |
| vectors, each row constitutes one feature vector (one record)</p> |
| |
| <p><strong>Y</strong>: (default: <code>" "</code>) Location to read the response matrix $Y$ needed for scoring |
| (but optional for prediction), with the following dimensions:</p> |
| |
| <ul> |
| <li>$n {\times} 1$: acceptable for all distributions |
| (<code>dfam=1</code> or <code>2</code> or <code>3</code>)</li> |
| <li>$n {\times} 2$: for binomial (<code>dfam=2</code>) if given by |
| (#pos, #neg) counts</li> |
| <li>$n {\times} (k\,{+}\,1)$: for multinomial (<code>dfam=3</code>) if |
| given by category counts</li> |
| </ul> |
| |
| <p><strong>M</strong>: (default: <code>" "</code>) Location to write, if requested, the matrix of predicted |
| response means (for <code>dfam=1</code>) or probabilities (for |
| <code>dfam=2</code> or <code>3</code>):</p> |
| |
| <ul> |
| <li>$n {\times} 1$: for power-type distributions (<code>dfam=1</code>)</li> |
| <li>$n {\times} 2$: for binomial distribution (<code>dfam=2</code>), |
| col# 2 is the “No” probability</li> |
| <li>$n {\times} (k\,{+}\,1)$: for multinomial logit (<code>dfam=3</code>), |
| col# $k\,{+}\,1$ is for the baseline</li> |
| </ul> |
| |
| <p><strong>B</strong>: Location to read matrix $B$ of the betas, i.e. estimated GLM regression |
| parameters, with the intercept at row# $m\,{+}\,1$ if available:</p> |
| |
| <ul> |
| <li>$\dim(B) \,=\, m {\times} k$: do not add intercept</li> |
| <li>$\dim(B) \,=\, (m\,{+}\,1) {\times} k$: add intercept as given by the |
| last $B$-row</li> |
| <li>if $k > 1$, use only $B[, 1]$ unless it is Multinomial Logit |
| (<code>dfam=3</code>)</li> |
| </ul> |
| |
| <p><strong>O</strong>: (default: <code>" "</code>) Location to store the CSV-file with goodness-of-fit |
| statistics defined in <a href="algorithms-regression.html#table13"><strong>Table 13</strong></a>, |
| the default is to |
| print them to the standard output</p> |
| |
| <p><strong>dfam</strong>: (default: <code>1</code>) GLM distribution family code to specify the type of |
| distribution |
| $Prob[y\,|\,\mu]$ |
| that we assume:</p> |
| |
| <ul> |
| <li>1 = power distributions with |
| $Var(y) = \mu^{\alpha}$, see |
| <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a></li> |
| <li>2 = binomial</li> |
| <li>3 = multinomial logit</li> |
| </ul> |
| |
| <p><strong>vpow</strong>: (default: <code>0.0</code>) Power for variance defined as (mean)$^{\textrm{power}}$ |
| (ignored if <code>dfam</code>$\,{\neq}\,1$): when <code>dfam=1</code>, |
| this provides the $q$ in |
| $Var(y) = a\mu^q$, the power |
| dependence of the variance of $y$ on its mean. In particular, use:</p> |
| |
| <ul> |
| <li>0.0 = Gaussian</li> |
| <li>1.0 = Poisson</li> |
| <li>2.0 = Gamma</li> |
| <li>3.0 = inverse Gaussian</li> |
| </ul> |
| |
| <p><strong>link</strong>: (default: <code>0</code>) Link function code to determine the link |
| function $\eta = g(\mu)$, ignored for multinomial logit |
| (<code>dfam=3</code>):</p> |
| |
| <ul> |
| <li>0 = canonical link (depends on the distribution family), |
| see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a></li> |
| <li>1 = power functions</li> |
| <li>2 = logit</li> |
| <li>3 = probit</li> |
| <li>4 = cloglog</li> |
| <li>5 = cauchit</li> |
| </ul> |
| |
| <p><strong>lpow</strong>: (default: <code>1.0</code>) Power for link function defined as |
| (mean)$^{\textrm{power}}$ (ignored if <code>link</code>$\,{\neq}\,1$): |
| when <code>link=1</code>, this provides the $s$ in $\eta = \mu^s$, the |
| power link function; <code>lpow=0.0</code> gives the log link |
| $\eta = \log\mu$. Common power links:</p> |
| |
| <ul> |
| <li>-2.0 = $1/\mu^2$</li> |
| <li>-1.0 = reciprocal</li> |
| <li>0.0 = log</li> |
| <li>0.5 = sqrt</li> |
| <li>1.0 = identity</li> |
| </ul> |
| |
| <p><strong>disp</strong>: (default: <code>1.0</code>) Dispersion value, when available; must be positive</p> |
| |
| <p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix M file output format, such as |
| <code>text</code>, <code>mm</code>, or <code>csv</code>; see read/write |
| functions in SystemML Language Reference for details.</p> |
| |
| <h3 id="examples-4">Examples</h3> |
| |
| <p>Note that in the examples below the value for the <code>disp</code> input |
| argument is set arbitrarily. The correct dispersion value should be |
| computed from the training data during model estimation, or omitted if |
| unknown (which sets it to <code>1.0</code>).</p> |
| |
| <p><strong>Linear regression example</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=1 |
| vpow=0.0 |
| link=1 |
| lpow=1.0 |
| disp=5.67 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=1 |
| vpow=0.0 |
| link=1 |
| lpow=1.0 |
| disp=5.67 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>Linear regression example, prediction only (no Y given)</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=1 |
| vpow=0.0 |
| link=1 |
| lpow=1.0 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=1 |
| vpow=0.0 |
| link=1 |
| lpow=1.0 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>Binomial logistic regression example</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=2 |
| link=2 |
| disp=3.0004464 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=2 |
| link=2 |
| disp=3.0004464 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>Binomial probit regression example</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=2 |
| link=3 |
| disp=3.0004464 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=2 |
| link=3 |
| disp=3.0004464 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>Multinomial logistic regression example</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=3 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=3 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Probabilities.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>Poisson regression with the log link example</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=1 |
| vpow=1.0 |
| link=1 |
| lpow=0.0 |
| disp=3.45 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=1 |
| vpow=1.0 |
| link=1 |
| lpow=0.0 |
| disp=3.45 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>Gamma regression with the inverse (reciprocal) link example</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f GLM-predict.dml |
| -nvargs dfam=1 |
| vpow=2.0 |
| link=1 |
| lpow=-1.0 |
| disp=1.99118 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f GLM-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs dfam=1 |
| vpow=2.0 |
| link=1 |
| lpow=-1.0 |
| disp=1.99118 |
| X=/user/ml/X.mtx |
| B=/user/ml/B.mtx |
| M=/user/ml/Means.mtx |
| fmt=csv |
| Y=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <hr /> |
| |
| <p><a name="table13"></a> |
| <strong>Table 13</strong>: The goodness-of-fit statistics are provided in CSV format, one per line, with four |
| columns: (Name, CID, Disp?, Value). The columns are: |
| “Name” is the string identifier for the statistic; |
| “CID” is an optional integer value that specifies the $Y$-column index for per-column statistics |
| (note that a bi/multinomial one-column Y-input is converted into multi-column); |
| “Disp?” is an optional Boolean value ($TRUE$ or $FALSE$) that tells us |
| whether or not scaling by the input dispersion parameter <code>disp</code> has been applied to this |
| statistic; |
| “Value” is the value of the statistic.</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th>Name</th> |
| <th style="text-align: center">CID</th> |
| <th style="text-align: center">Disp?</th> |
| <th>Meaning</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>LOGLHOOD_Z</td> |
| <td style="text-align: center"> </td> |
| <td style="text-align: center">+</td> |
| <td>Log-likelihood $Z$-score (in st. dev.’s from the mean)</td> |
| </tr> |
| <tr> |
| <td>LOGLHOOD_Z_PVAL</td> |
| <td style="text-align: center"> </td> |
| <td style="text-align: center">+</td> |
| <td>Log-likelihood $Z$-score p-value, two-sided</td> |
| </tr> |
| <tr> |
| <td>PEARSON_X2</td> |
| <td style="text-align: center"> </td> |
| <td style="text-align: center">+</td> |
| <td>Pearson residual $X^2$-statistic</td> |
| </tr> |
| <tr> |
| <td>PEARSON_X2_BY_DF</td> |
| <td style="text-align: center"> </td> |
| <td style="text-align: center">+</td> |
| <td>Pearson $X^2$ divided by degrees of freedom</td> |
| </tr> |
| <tr> |
| <td>PEARSON_X2_PVAL</td> |
| <td style="text-align: center"> </td> |
| <td style="text-align: center">+</td> |
| <td>Pearson $X^2$ p-value</td> |
| </tr> |
| <tr> |
| <td>DEVIANCE_G2</td> |
| <td style="text-align: center"> </td> |
| <td style="text-align: center">+</td> |
| <td>Deviance from the saturated model $G^2$-statistic</td> |
| </tr> |
| <tr> |
| <td>DEVIANCE_G2_BY_DF</td> |
| <td style="text-align: center"> </td> |
| <td style="text-align: center">+</td> |
| <td>Deviance $G^2$ divided by degrees of freedom</td> |
| </tr> |
| <tr> |
| <td>DEVIANCE_G2_PVAL</td> |
| <td style="text-align: center"> </td> |
| <td style="text-align: center">+</td> |
| <td>Deviance $G^2$ p-value</td> |
| </tr> |
| <tr> |
| <td>AVG_TOT_Y</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center"> </td> |
| <td>$Y$-column average for an individual response value</td> |
| </tr> |
| <tr> |
| <td>STDEV_TOT_Y</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center"> </td> |
| <td>$Y$-column st. dev. for an individual response value</td> |
| </tr> |
| <tr> |
| <td>AVG_RES_Y</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center"> </td> |
<td>$Y$-column residual average of $Y - \textrm{pred.}\ \textrm{mean}(Y|X)$</td>
| </tr> |
| <tr> |
| <td>STDEV_RES_Y</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center"> </td> |
<td>$Y$-column residual st. dev. of $Y - \textrm{pred.}\ \textrm{mean}(Y|X)$</td>
| </tr> |
| <tr> |
| <td>PRED_STDEV_RES</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center">+</td> |
| <td>Model-predicted $Y$-column residual st. deviation</td> |
| </tr> |
| <tr> |
| <td>R2</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center"> </td> |
| <td>$R^2$ of $Y$-column residual with bias included</td> |
| </tr> |
| <tr> |
| <td>ADJUSTED_R2</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center"> </td> |
| <td>Adjusted $R^2$ of $Y$-column residual w. bias included</td> |
| </tr> |
| <tr> |
| <td>R2_NOBIAS</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center"> </td> |
| <td>$R^2$ of $Y$-column residual, bias subtracted</td> |
| </tr> |
| <tr> |
| <td>ADJUSTED_R2_NOBIAS</td> |
| <td style="text-align: center">+</td> |
| <td style="text-align: center"> </td> |
| <td>Adjusted $R^2$ of $Y$-column residual, bias subtracted</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <h3 id="details-4">Details</h3> |
| |
| <p>The output matrix $M$ of predicted means (or probabilities) is computed |
| by matrix-multiplying $X$ with the first column of $B$ or with the |
| whole $B$ in the multinomial case, adding the intercept if available |
| (conceptually, appending an extra column of ones to $X$); then applying |
| the inverse of the model’s link function. The difference between “means” |
| and “probabilities” in the categorical case becomes significant when |
| there are ${\geq}\,2$ observations per record (with the multi-label |
| records) or when the labels such as $-1$ and $1$ are viewed and averaged |
as numerical response values (with the single-label records). To avoid
any ambiguity or information loss, we separately return the predicted
probability of each category label for each record.</p>
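<p>The mean computation described above can be sketched in Python (a minimal illustration, not the SystemML implementation; the function name and the choice of a log link here are our own for the example):</p>

```python
import math

# Hypothetical helper, not part of GLM-predict.dml: predicted means are
# the inverse link applied to the linear predictor X B (+ intercept).
def glm_predict_means(X, B, intercept=None, inv_link=math.exp):
    means = []
    for row in X:
        eta = sum(x * b for x, b in zip(row, B))  # linear predictor
        if intercept is not None:
            eta += intercept                      # the appended column of ones
        means.append(inv_link(eta))               # inverse of the link function
    return means

# Log link example: mu = exp(1.0*0.5 + 2.0*0.25) = exp(1)
mu = glm_predict_means([[1.0, 2.0]], [0.5, 0.25])
```

<p>For a binomial logit link one would instead pass <code>inv_link=lambda t: 1.0 / (1.0 + math.exp(-t))</code>.</p>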
| |
| <p>When the “actual” response values $Y$ are available, the summary |
| statistics are computed and written out as described in |
| <a href="algorithms-regression.html#table13"><strong>Table 13</strong></a>. Below we discuss each of these statistics |
| in detail. Note that in the categorical case (binomial and multinomial) |
| $Y$ is internally represented as the matrix of observation counts for |
| each label in each record, rather than just the label ID for each |
| record. The input $Y$ may already be a matrix of counts, in which case |
| it is used as-is. But if $Y$ is given as a vector of response labels, |
| each response label is converted into an indicator vector |
| $(0,\ldots,0,1_l,0,\ldots,0)$ where $l$ is the label ID for this record. |
| All negative (e.g. $-1$) or zero label IDs are converted to the |
| $1 +$ maximum label ID. The largest label ID is viewed as the |
| “baseline” as explained in the section on Multinomial Logistic |
| Regression. We assume that there are $k\geq 1$ non-baseline categories |
| and one (last) baseline category.</p> |
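<p>The label-to-counts conversion just described can be sketched as follows (a hypothetical helper assuming at least one positive label ID; not the DML code itself):</p>

```python
def labels_to_counts(labels):
    """Convert a vector of response label IDs into indicator rows
    (0,...,0,1_l,0,...,0); negative or zero IDs are mapped to
    1 + the maximum label ID, i.e. the baseline category."""
    max_id = max(l for l in labels if l > 0)
    ncol = max_id + (1 if any(l <= 0 for l in labels) else 0)
    rows = []
    for l in labels:
        cid = l if l > 0 else ncol    # remap non-positive IDs to the baseline
        row = [0] * ncol
        row[cid - 1] = 1
        rows.append(row)
    return rows
```
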
| |
<p>We also estimate residual variances for each response value. They are
not written to the output, but are used inside the summary statistics,
both scaled and unscaled by the input dispersion parameter <code>disp</code>,
as described below.</p>
| |
| <p><code>LOGLHOOD_Z</code> and <code>LOGLHOOD_Z_PVAL</code> statistics measure how far the |
| log-likelihood of $Y$ deviates from its expected value according to the |
| model. The script implements them only for the binomial and the |
| multinomial distributions, returning <code>NaN</code> for all other distributions. |
| Pearson’s $X^2$ and deviance $G^2$ often perform poorly with bi- and |
| multinomial distributions due to low cell counts, hence we need this |
| extra goodness-of-fit measure. To compute these statistics, we use:</p> |
| |
| <ul> |
| <li>the $n\times (k\,{+}\,1)$-matrix $Y$ of multi-label response counts, in |
| which $y_{i,j}$ is the number of times label $j$ was observed in |
| record $i$</li> |
| <li>the model-estimated probability matrix $P$ of the same dimensions that |
| satisfies <script type="math/tex">\sum_{j=1}^{k+1} p_{i,j} = 1</script> for all $i=1,\ldots,n$ and |
| where $p_{i,j}$ is the model probability of observing label $j$ in |
| record $i$</li> |
| <li>the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of |
| observations in record $i$ (all $N_i = 1$ if each record has only one |
| response label)</li> |
| </ul> |
| |
| <p>We start by computing the multinomial log-likelihood of $Y$ given $P$ |
| and $N$, as well as the expected log-likelihood given a random $Y$ and |
| the variance of this log-likelihood if $Y$ indeed follows the proposed |
| distribution:</p> |
| |
| <script type="math/tex; mode=display">% <![CDATA[ |
| \begin{aligned} |
| \ell (Y) \,\,&=\,\, \log Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1} \,y_{i,j}\log p_{i,j} \\ |
| E_Y \ell (Y) \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j} |
| \,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\ |
| Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2 |
| - \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right) |
| \end{aligned} %]]></script> |
| |
| <p>Then we compute the $Z$-score as the difference between the actual and |
| the expected log-likelihood $\ell(Y)$ divided by its expected standard |
| deviation, and its two-sided p-value in the Normal distribution |
| assumption ($\ell(Y)$ should approach normality due to the Central Limit |
| Theorem):</p> |
| |
| <script type="math/tex; mode=display">Z \,=\, \frac {\ell(Y) - E_Y \ell(Y)}{\sqrt{Var_Y \ell(Y)}};\quad |
| \mathop{\textrm{p-value}}(Z) \,=\, Prob\Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big]</script> |
| |
| <p>A low p-value would indicate “underfitting” if $Z\ll 0$ or “overfitting” |
| if $Z\gg 0$. Here “overfitting” means that higher-probability labels |
| occur more often than their probabilities suggest.</p> |
| |
| <p>We also apply the dispersion input (<code>disp</code>) to compute the |
| “scaled” version of the $Z$-score and its p-value. Since $\ell(Y)$ is a |
| linear function of $Y$, multiplying the GLM-predicted variance of $Y$ by |
| <code>disp</code> results in multiplying |
| $Var_Y \ell(Y)$ by the same |
| <code>disp</code>. This, in turn, translates into dividing the $Z$-score |
| by the square root of the dispersion:</p> |
| |
| <script type="math/tex; mode=display">Z_{\texttt{disp}} \,=\, \big(\ell(Y) \,-\, E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot Var_Y \ell(Y)} |
| \,=\, Z / \sqrt{\texttt{disp}}</script> |
| |
| <p>Finally, we recalculate the p-value |
| with this new $Z$-score.</p> |
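<p>Putting the formulas above together, the (scaled) $Z$-score computation can be sketched in Python; <code>loglhood_z</code> is a hypothetical name, and the two-sided normal p-value is computed via the complementary error function:</p>

```python
import math

def loglhood_z(Y, P, N, disp=1.0):
    """Log-likelihood Z-score: actual minus expected log-likelihood of the
    counts Y under probabilities P, divided by the (dispersion-scaled)
    standard deviation, with a two-sided normal p-value."""
    ll = exp_ll = var_ll = 0.0
    for yi, pi, ni in zip(Y, P, N):
        logs = [math.log(p) for p in pi]
        e1 = sum(p * lg for p, lg in zip(pi, logs))   # E[log p] per observation
        e2 = sum(p * lg * lg for p, lg in zip(pi, logs))
        ll += sum(y * lg for y, lg in zip(yi, logs))
        exp_ll += ni * e1
        var_ll += ni * (e2 - e1 * e1)
    z = (ll - exp_ll) / math.sqrt(disp * var_ll)
    pval = math.erfc(abs(z) / math.sqrt(2.0))         # two-sided normal tail
    return z, pval
```

<p>Passing <code>disp</code> divides the $Z$-score by $\sqrt{\texttt{disp}}$, as derived above.</p>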
| |
| <p><code>PEARSON_X2</code>, <code>PEARSON_X2_BY_DF</code>, and <code>PEARSON_X2_PVAL</code>: |
| Pearson’s residual $X^2$-statistic is a commonly used goodness-of-fit |
| measure for linear models <a href="algorithms-bibliography.html">[McCullagh1989]</a>. |
| The idea is to measure how |
| well the model-predicted means and variances match the actual behavior |
| of response values. For each record $i$, we estimate the mean $\mu_i$ |
| and the variance $v_i$ (or <code>disp</code> $\cdot v_i$) and use them to |
| normalize the residual: $r_i = (y_i - \mu_i) / \sqrt{v_i}$. These |
| normalized residuals are then squared, aggregated by summation, and |
| tested against an appropriate $\chi^2$ distribution. The computation |
| of $X^2$ is slightly different for categorical data (bi- and |
| multinomial) than it is for numerical data, since $y_i$ has multiple |
| correlated dimensions <a href="algorithms-bibliography.html">[McCullagh1989]</a>:</p> |
| |
| <script type="math/tex; mode=display">X^2\,\textrm{(numer.)} \,=\, \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad |
| X^2\,\textrm{(categ.)} \,=\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i |
| \hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}</script> |
| |
| <p>The number of |
| degrees of freedom #d.f. for the $\chi^2$ distribution is $n - m$ for |
| numerical data and $(n - m)k$ for categorical data, where |
| $k = \mathop{\texttt{ncol}}(Y) - 1$. Given the dispersion parameter |
| <code>disp</code> the $X^2$ statistic is scaled by division: <script type="math/tex">X^2_{\texttt{disp}} = X^2 / \texttt{disp}</script>. If the |
| dispersion is accurate, $X^2 / \texttt{disp}$ should be close to #d.f. |
| In fact, $X^2 / \textrm{#d.f.}$ over the <em>training</em> data is the |
| dispersion estimator used in our <code>GLM.dml</code> script, |
| see (7). Here we provide $X^2 / \textrm{#d.f.}$ and |
| $X^2_{\texttt{disp}} / \textrm{#d.f.}$ as |
| <code>PEARSON_X2_BY_DF</code> to enable dispersion comparison between |
| the training data and the test data.</p> |
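<p>Both $X^2$ variants admit a direct sketch (hypothetical helper names, not <code>GLM-predict.dml</code> itself):</p>

```python
def pearson_x2_numerical(y, mu, v):
    """Pearson X^2 for a numerical response: squared residuals, each
    normalized by the model-estimated (possibly dispersion-scaled) variance."""
    return sum((yi - mi) ** 2 / vi for yi, mi, vi in zip(y, mu, v))

def pearson_x2_categorical(Y, P, N):
    """Pearson X^2 for categorical data: observed counts y_ij against
    expected counts N_i * p_ij."""
    return sum((y - ni * p) ** 2 / (ni * p)
               for yi, pi, ni in zip(Y, P, N)
               for y, p in zip(yi, pi))
```

<p>The resulting statistic is then tested against a $\chi^2$ distribution with $n-m$ (numerical) or $(n-m)k$ (categorical) degrees of freedom.</p>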
| |
| <p>NOTE: For categorical data, both Pearson’s $X^2$ and the deviance $G^2$ |
| are unreliable (i.e. do not approach the $\chi^2$ distribution) unless |
| the predicted means of multi-label counts |
| <script type="math/tex">\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}</script> are fairly large: all |
| ${\geq}\,1$ and 80% are at least $5$ <a href="algorithms-bibliography.html">[Cochran1954]</a>. They should not |
| be used for “one label per record” categoricals.</p> |
| |
| <p><code>DEVIANCE_G2</code>, <code>DEVIANCE_G2_BY_DF</code>, and |
| <code>DEVIANCE_G2_PVAL</code>: Deviance $G^2$ is the log of the |
| likelihood ratio between the “saturated” model and the linear model |
| being tested for the given dataset, multiplied by two:</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| G^2 \,=\, 2 \,\log \frac{Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}{Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]} |
| \end{equation}</script> |
| |
| <p>The “saturated” model sets the mean |
| $\mu_i^{\mathrm{sat}}$ to equal $y_i$ for every record (for categorical |
| data, <script type="math/tex">p_{i,j}^{sat} = y_{i,j} / N_i</script>), which represents the |
“perfect fit.” For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at
| a boundary, by continuity we set $0 \log 0 = 0$. The GLM likelihood |
| functions defined in (6) become simplified in |
| ratio (8) due to canceling out the term $c(y, a)$ |
| since it is the same in both models.</p> |
| |
| <p>The log of a likelihood ratio between two nested models, times two, is |
| known to approach a $\chi^2$ distribution as $n\to\infty$ if both models |
| have fixed parameter spaces. But this is not the case for the |
| “saturated” model: it adds more parameters with each record. In |
| practice, however, $\chi^2$ distributions are used to compute the |
| p-value of $G^2$ <a href="algorithms-bibliography.html">[McCullagh1989]</a>. The number of degrees of |
| freedom #d.f. and the treatment of dispersion are the same as for |
| Pearson’s $X^2$, see above.</p> |
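<p>For categorical data, the deviance in (8) reduces to a sum over the observed counts; a minimal sketch (hypothetical helper name, with the $0\log 0 = 0$ boundary convention):</p>

```python
import math

def deviance_g2(Y, P, N):
    """Deviance G^2 for categorical data: twice the log of the likelihood
    ratio of the saturated model (p_sat = y / N_i) over the tested model."""
    g2 = 0.0
    for yi, pi, ni in zip(Y, P, N):
        for y, p in zip(yi, pi):
            if y > 0:                   # skip 0 * log(0) terms by continuity
                g2 += 2.0 * y * math.log((y / ni) / p)
    return g2
```
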
| |
| <h3 id="column-wise-statistics">Column-Wise Statistics</h3> |
| |
| <p>The rest of the statistics are computed separately for each column |
of $Y$. As explained above, $Y$ has two or more columns in the bi- and
multinomial case, either at input or after conversion. Moreover, each
| <script type="math/tex">y_{i,j}</script> in record $i$ with $N_i \geq 2$ is counted as $N_i$ separate |
| observations <script type="math/tex">y_{i,j,l}</script> of 0 or 1 (where $l=1,\ldots,N_i$) with |
| <script type="math/tex">y_{i,j}</script> ones and <script type="math/tex">N_i-y_{i,j}</script> zeros. For power distributions, |
| including linear regression, $Y$ has only one column and all $N_i = 1$, |
| so the statistics are computed for all $Y$ with each record counted |
| once. Below we denote <script type="math/tex">N = \sum_{i=1}^n N_i \,\geq n</script>. Here is the total |
| average and the residual average (residual bias) of $y_{i,j,l}$ for each |
| $Y$-column:</p> |
| |
| <script type="math/tex; mode=display">\texttt{AVG_TOT_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n y_{i,j}; \quad |
| \texttt{AVG_RES_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})</script> |
| |
| <p>Dividing by $N$ (rather than $n$) gives the averages for <script type="math/tex">y_{i,j,l}</script> |
| (rather than <script type="math/tex">y_{i,j}</script>). The total variance, and the standard deviation, |
| for individual observations <script type="math/tex">y_{i,j,l}</script> is estimated from the total |
variance for response values <script type="math/tex">y_{i,j}</script> using the independence assumption:
<script type="math/tex">Var \,y_{i,j} = Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} Var\, y_{i,j,l}</script>.
| This allows us to estimate the sum of squares for $y_{i,j,l}$ via the |
| sum of squares for <script type="math/tex">y_{i,j}</script>:</p> |
| |
| <script type="math/tex; mode=display">\texttt{STDEV_TOT_Y}_j \,=\, |
| \Bigg[\frac{1}{N-1} \sum_{i=1}^n \Big( y_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n y_{i'\!,j}\Big)^2\Bigg]^{1/2}</script> |
| |
| <p>Analogously, we estimate the standard deviation of the residual |
| <script type="math/tex">y_{i,j,l} - \mu_{i,j,l}</script>:</p> |
| |
| <script type="math/tex; mode=display">\texttt{STDEV_RES_Y}_j \,=\, |
| \Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n \Big( y_{i,j} - \mu_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}</script> |
| |
<p>Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if
it does not. The estimated standard deviations can be compared to the
| model-predicted residual standard deviation computed from the predicted |
| means by the GLM variance formula and scaled by the dispersion:</p> |
| |
| <script type="math/tex; mode=display">\texttt{PRED_STDEV_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}</script> |
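<p>The per-column averages and standard deviations above can be sketched together (a hypothetical helper, not the DML script; <code>N_total</code> denotes $N = \sum_i N_i$):</p>

```python
import math

def column_stats(y, mu, N, m_prime):
    """AVG_TOT_Y, AVG_RES_Y, STDEV_TOT_Y and STDEV_RES_Y for a single
    Y-column, following the formulas above."""
    N_total = sum(N)
    avg_tot = sum(y) / N_total
    avg_res = sum(yi - mi for yi, mi in zip(y, mu)) / N_total
    # Sums of squares over records, centered by N_i times the average
    tot_ss = sum((yi - Ni * avg_tot) ** 2 for yi, Ni in zip(y, N))
    res_ss = sum((yi - mi - Ni * avg_res) ** 2
                 for yi, mi, Ni in zip(y, mu, N))
    stdev_tot = math.sqrt(tot_ss / (N_total - 1))
    stdev_res = math.sqrt(res_ss / (N_total - m_prime))
    return avg_tot, avg_res, stdev_tot, stdev_res
```
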
| |
| <p>We also compute the $R^2$ statistics for each column of $Y$, see |
| <a href="algorithms-regression.html#table14"><strong>Table 14</strong></a> and <a href="algorithms-regression.html#table15"><strong>Table 15</strong></a> for details. We compute two versions |
| of $R^2$: in one version the residual sum-of-squares (RSS) includes any |
| bias in the residual that might be present (due to the lack of, or |
| inaccuracy in, the intercept); in the other version of RSS the bias is |
subtracted by “centering” the residual. In both cases we subtract the
bias in the total sum-of-squares (in the denominator), and $m'$ equals
$m$ with the intercept or $m+1$ without the intercept.</p>
| |
| <hr /> |
| |
| <p><a name="table14"></a> |
| <strong>Table 14</strong>: $R^2$ where the residual sum-of-squares includes the bias contribution.</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th>Statistic</th> |
| <th>Formula</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>$\texttt{R2}_j$</td> |
| <td><script type="math/tex">\displaystyle 1 - \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}}</script></td> |
| </tr> |
| <tr> |
| <td>$\texttt{ADJUSTED_R2}_j$</td> |
| <td><script type="math/tex">\displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}} \, \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}}</script></td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <p><a name="table15"></a> |
| <strong>Table 15</strong>: $R^2$ where the residual sum-of-squares is centered so that the bias is subtracted.</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th>Statistic</th> |
| <th>Formula</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>$\texttt{R2_NOBIAS}_j$</td> |
| <td><script type="math/tex">\displaystyle 1 - \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}}</script></td> |
| </tr> |
| <tr> |
| <td>$\texttt{ADJUSTED_R2_NOBIAS}_j$</td> |
| <td><script type="math/tex">\displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}}</script></td> |
| </tr> |
| </tbody> |
| </table> |
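<p>The four $R^2$ statistics of Tables 14 and 15 can be sketched in one helper (hypothetical name, not the DML script; <code>N_total</code> denotes $N = \sum_i N_i$):</p>

```python
def r2_stats(y, mu, N, m, m_prime):
    """R2 / ADJUSTED_R2 (bias included in the residual sum-of-squares) and
    R2_NOBIAS / ADJUSTED_R2_NOBIAS (bias subtracted), following
    Tables 14 and 15, for a single Y-column."""
    N_total = sum(N)
    y_avg = sum(y) / N_total
    tss = sum((yi - Ni * y_avg) ** 2 for yi, Ni in zip(y, N))
    rss = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    bias = sum(yi - mi for yi, mi in zip(y, mu)) / N_total
    rss_nb = sum((yi - mi - Ni * bias) ** 2
                 for yi, mi, Ni in zip(y, mu, N))
    return (1.0 - rss / tss,
            1.0 - (N_total - 1) / (N_total - m) * rss / tss,
            1.0 - rss_nb / tss,
            1.0 - (N_total - 1) / (N_total - m_prime) * rss_nb / tss)
```
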
| |
| <hr /> |
| |
| <h3 id="returns-4">Returns</h3> |
| |
| <p>The matrix of predicted means (if the response is numerical) or |
| probabilities (if the response is categorical), see Description |
| subsection above for more information. Given <code>Y</code>, we return |
| some statistics in CSV format as described in |
| <a href="algorithms-regression.html#table13"><strong>Table 13</strong></a> and in the above text.</p> |
| |
| <hr /> |
| |
| <div class="footnotes"> |
| <ol> |
| <li id="fn:1"> |
| <p>Smaller likelihood difference between two models suggests less |
| statistical evidence to pick one model over the other. <a href="#fnref:1" class="reversefootnote">↩</a></p> |
| </li> |
| <li id="fn:2"> |
| <p>$Prob[y\mid \mu_i]$ |
| is given by a density function if $y$ is continuous. <a href="#fnref:2" class="reversefootnote">↩</a></p> |
| </li> |
| <li id="fn:3"> |
| <p>We use $k+1$ because there are $k$ non-baseline categories and one |
| baseline category, with regression parameters $B$ having |
| $k$ columns. <a href="#fnref:3" class="reversefootnote">↩</a></p> |
| </li> |
| </ol> |
| </div> |
| |
| |
| </div> <!-- /container --> |
| |
| |
| |
| <script src="js/vendor/jquery-1.12.0.min.js"></script> |
| <script src="js/vendor/bootstrap.min.js"></script> |
| <script src="js/vendor/anchor.min.js"></script> |
| <script src="js/main.js"></script> |
| |
| |
| |
| |
| |
| <!-- Analytics --> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| ga('create', 'UA-71553733-1', 'auto'); |
| ga('send', 'pageview'); |
| </script> |
| |
| |
| |
| <!-- MathJax Section --> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| TeX: { equationNumbers: { autoNumber: "AMS" } } |
| }); |
| </script> |
| <script> |
| // Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS. |
| // We could use "//cdn.mathjax...", but that won't support "file://". |
| (function(d, script) { |
| script = d.createElement('script'); |
| script.type = 'text/javascript'; |
| script.async = true; |
| script.onload = function(){ |
| MathJax.Hub.Config({ |
| tex2jax: { |
| inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ], |
| displayMath: [ ["$$","$$"], ["\\[", "\\]"] ], |
| processEscapes: true, |
| skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] |
| } |
| }); |
| }; |
| script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') + |
| 'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; |
| d.getElementsByTagName('head')[0].appendChild(script); |
| }(document)); |
| </script> |
| </body> |
| </html> |