\begin{comment}

 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.

\end{comment}

\subsection{Linear Regression}
\label{sec:LinReg}

\noindent{\bf Description}
\smallskip

Linear Regression scripts are used to model the relationship between one numerical
response variable and one or more explanatory (feature) variables.
The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a
numerical vector of feature variables and $y_i$ is a numerical response value for
each training data record.  The feature vectors are provided as a matrix $X$ of size
$n\,{\times}\,m$, where $n$ is the number of records and $m$ is the number of features.
The observed response values are provided as a 1-column matrix~$Y$, with a numerical
value $y_i$ for each~$x_i$ in the corresponding row of matrix~$X$.

In linear regression, we predict the distribution of the response~$y_i$ based on
a fixed linear combination of the features in~$x_i$.  We assume that
there exist constant regression coefficients $\beta_0, \beta_1, \ldots, \beta_m$
and a constant residual variance~$\sigma^2$ such that
\begin{equation}
y_i \sim \Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\,
\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m}
\label{eqn:linregdef}
\end{equation}
The distribution $y_i \sim \Normal(\mu_i, \sigma^2)$ models the ``unexplained'' residual
noise, which is assumed to be independent across different records.
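As a concrete illustration, the following sketch generates a small dataset from
model~(\ref{eqn:linregdef}); it assumes NumPy, and all coefficient values are
made up for the example:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(42)
n, m = 100, 3
X = rng.normal(size=(n, m))          # n records, m features
beta0 = 1.5                          # intercept (illustrative)
beta = np.array([2.0, -1.0, 0.5])    # coefficients (illustrative)
sigma = 0.3                          # residual standard deviation
mu = beta0 + X @ beta                # mu_i = beta_0 + <x_i, beta>
y = rng.normal(mu, sigma)            # y_i ~ Normal(mu_i, sigma^2)
\end{verbatim}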

The goal is to estimate the regression coefficients and the residual variance.
Once they are accurately estimated, we can make predictions about $y_i$ given~$x_i$
in new records.  We can also use the $\beta_j$'s to analyze the influence of individual
features on the response value, and assess the quality of this model by comparing
the residual variance in the response, left after prediction, with its total variance.

There are two scripts in our library that perform the same estimation using different
computational methods.  Depending on the size and the sparsity of the feature matrix~$X$,
one or the other script may be more efficient.  The ``direct solve'' script
{\tt LinearRegDS} is more efficient when the number of features $m$ is relatively small
($m \sim 1000$ or less) and matrix~$X$ is either tall or fairly dense
(has~${\gg}\:m^2$ nonzeros); otherwise, the ``conjugate gradient'' script {\tt LinearRegCG}
is more efficient.  If $m > 50000$, use only {\tt LinearRegCG}.
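This rule of thumb can be expressed as a small helper; the sketch below is
hypothetical (the function name and the density threshold are illustrative, not
part of the library):
\begin{verbatim}
def choose_linreg_script(m, nnz):
    """Pick a script per the rule of thumb above (hypothetical helper).

    m   -- number of features (columns of X)
    nnz -- number of nonzero entries in X
    """
    if m > 50000:
        return "LinearRegCG.dml"   # A = t(X) %*% X too large for a direct solve
    if m <= 1000 and nnz > 10 * m * m:
        return "LinearRegDS.dml"   # X tall/dense enough that forming A pays off
    return "LinearRegCG.dml"
\end{verbatim}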

\smallskip

\noindent{\bf Usage}
\smallskip

{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}LinearRegDS.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} O=}path/file
{\tt{} icpt=}int
{\tt{} reg=}double
{\tt{} fmt=}format
}\smallskip

{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}LinearRegCG.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} O=}path/file
{\tt{} Log=}path/file
{\tt{} icpt=}int
{\tt{} reg=}double
{\tt{} tol=}double
{\tt{} maxi=}int
{\tt{} fmt=}format
}

\smallskip

\noindent{\bf Arguments}
\begin{Description}

\item[{\tt X}:]
Location (on HDFS) to read the matrix of feature vectors; each row constitutes
one feature vector
\item[{\tt Y}:]
Location to read the 1-column matrix of response values
\item[{\tt B}:]
Location to store the estimated regression parameters (the $\beta_j$'s), with the
intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location to store the CSV file of summary statistics defined in
Table~\ref{table:linreg:stats}; by default they are printed to the standard output
\item[{\tt Log}:] (default:\mbox{ }{\tt " "}, {\tt LinearRegCG} only)
Location to store iteration-specific variables for monitoring and debugging purposes;
see Table~\ref{table:linreg:log} for details.
\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
Intercept presence and shifting/rescaling of the features in~$X$:\\
{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\
{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
\item[{\tt reg}:] (default:\mbox{ }{\tt 0.000001})
L2-regularization parameter~\mbox{$\lambda\geq 0$}; set to nonzero for highly dependent,
sparse, or numerous ($m \gtrsim n/10$) features
\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}, {\tt LinearRegCG} only)
Tolerance \mbox{$\eps\geq 0$} used in the convergence criterion: conjugate
gradient iterations terminate when the $\beta$-residual decreases in L2-norm by this factor
\item[{\tt maxi}:] (default:\mbox{ }{\tt 0}, {\tt LinearRegCG} only)
Maximum number of conjugate gradient iterations, or~0 for no maximum limit
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see the read/write functions in the SystemML Language Reference for details.
\end{Description}

\begin{table}[t]\small\centerline{%
\begin{tabular}{|ll|}
\hline
Name & Meaning \\
\hline
{\tt AVG\_TOT\_Y} & Average of the response value $Y$ \\
{\tt STDEV\_TOT\_Y} & Standard deviation of the response value $Y$ \\
{\tt AVG\_RES\_Y} & Average of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$, i.e.\ the residual bias \\
{\tt STDEV\_RES\_Y} & Standard deviation of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$ \\
{\tt DISPERSION} & GLM-style dispersion, i.e.\ residual sum of squares / \#deg.\ fr. \\
{\tt PLAIN\_R2} & Plain $R^2$ of residual with bias included vs.\ total average \\
{\tt ADJUSTED\_R2} & Adjusted $R^2$ of residual with bias included vs.\ total average \\
{\tt PLAIN\_R2\_NOBIAS} & Plain $R^2$ of residual with bias subtracted vs.\ total average \\
{\tt ADJUSTED\_R2\_NOBIAS} & Adjusted $R^2$ of residual with bias subtracted vs.\ total average \\
{\tt PLAIN\_R2\_VS\_0} & ${}^*$Plain $R^2$ of residual with bias included vs.\ zero constant \\
{\tt ADJUSTED\_R2\_VS\_0} & ${}^*$Adjusted $R^2$ of residual with bias included vs.\ zero constant \\
\hline
\multicolumn{2}{r}{${}^{*\mathstrut}$ The last two statistics are only printed if there is no intercept ({\tt icpt=0})} \\
\end{tabular}}
\caption{Besides~$\beta$, the linear regression scripts compute a few summary statistics
listed above.  The statistics are provided in CSV format, one comma-separated name-value
pair per line.}
\label{table:linreg:stats}
\end{table}

\begin{table}[t]\small\centerline{%
\begin{tabular}{|ll|}
\hline
Name & Meaning \\
\hline
{\tt CG\_RESIDUAL\_NORM} & L2-norm of the conjug.\ grad.\ residual, which is $A \pxp \beta - t(X) \pxp y$ \\
& where $A = t(X) \pxp X + \diag (\lambda)$, or a similar quantity \\
{\tt CG\_RESIDUAL\_RATIO} & Ratio of the current L2-norm of the conjug.\ grad.\ residual over the initial one \\
\hline
\end{tabular}}
\caption{
The {\tt Log} file for the {\tt{}LinearRegCG} script contains the above \mbox{per-}iteration
variables in CSV format, each line containing a (Name, Iteration\#, Value) triple,
with Iteration\# equal to~0 for the initial values.}
\label{table:linreg:log}
\end{table}

\noindent{\bf Details}
\smallskip

To solve a linear regression problem over feature matrix~$X$ and response vector~$Y$,
we can find coefficients $\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize
the joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed statistical
model~(\ref{eqn:linregdef}).  Since the joint likelihood of the independent
$y_i \sim \Normal(\mu_i, \sigma^2)$ is proportional to the product of
$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the logarithm of this
product, then multiply by $-2\sigma^2 < 0$ to obtain a least squares problem:
\begin{equation}
\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\,
\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
\,\,\to\,\,\min
\label{eqn:linregls}
\end{equation}
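In more detail, the log-likelihood of the data under model~(\ref{eqn:linregdef}) is
\begin{equation*}
\log L(\beta, \sigma^2) \,=\, -\frac{n}{2}\log(2\pi\sigma^2)
\,-\, \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu_i)^2
\end{equation*}
so for any fixed~$\sigma^2$, maximizing the likelihood over the $\beta_j$'s is
equivalent to minimizing the sum of squares in~(\ref{eqn:linregls}).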

This may not be enough, however.  The minimum may sometimes be attained over infinitely many
$\beta$-vectors, for example if $X$ has an all-0 column, or has linearly dependent columns,
or has fewer rows than columns~\mbox{($n < m$)}.  Even if~(\ref{eqn:linregls}) has a unique
solution, other $\beta$-vectors may be just slightly suboptimal\footnote{A smaller likelihood
difference between two models suggests less statistical evidence to pick one model over the
other.}, yet give significantly different predictions for new feature vectors.  This results
in \emph{overfitting}: the prediction error for the training data ($X$ and~$Y$) is much smaller
than for the test data (new records).

Overfitting and degeneracy in the data are commonly mitigated by adding a regularization penalty
term to the least squares function:
\begin{equation}
\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
\,+\,\, \lambda \sum_{j=1}^m \beta_j^2
\,\,\to\,\,\min
\label{eqn:linreglsreg}
\end{equation}
The choice of $\lambda>0$, the regularization constant, typically involves cross-validation,
where the dataset is repeatedly split into a training part (to estimate the~$\beta_j$'s) and
a test part (to evaluate prediction accuracy), with the goal of maximizing the test accuracy.
In our scripts, $\lambda$~is provided as the input parameter~{\tt reg}.
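A minimal sketch of such a cross-validation loop, assuming NumPy; the helper names
and the fold count are illustrative, and the ridge solver mirrors
problem~(\ref{eqn:linreglsreg}) without an intercept for brevity:
\begin{verbatim}
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam*I) beta = X^T y (no intercept, for brevity)."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

def cv_choose_lambda(X, y, lambdas, k=5, seed=0):
    """k-fold cross-validation: pick the lambda with the smallest test error."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, k)
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        err = 0.0
        for f in folds:
            train = np.setdiff1d(idx, f)          # all records outside fold f
            beta = ridge_fit(X[train], y[train], lam)
            err += np.sum((y[f] - X[f] @ beta) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
\end{verbatim}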

The solution to the least squares problem~(\ref{eqn:linreglsreg}), obtained by taking the derivative
and setting it to~0, satisfies the matrix linear equation
\begin{equation}
A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\,
\textrm{where}\,\,\,
A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} \diag(\hspace{0.5pt}
\underbrace{\raisebox{0pt}[0pt][0.5pt]{$\lambda,\ldots, \lambda$}}_{\raisebox{2pt}{$\scriptstyle m$}}
\hspace{0.5pt}, 0)
\label{eqn:linregeq}
\end{equation}
where $[X,\,1]$ is $X$~with an extra column of~1s appended on the right, and the
diagonal matrix of $\lambda$'s has a zero to keep the intercept~$\beta_0$ unregularized.
If the intercept is disabled by setting {\tt icpt=0}, the equation is simply
\mbox{$X^T X \beta = X^T Y$}.
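A minimal NumPy sketch of this direct solve follows; it is illustrative only, since
the actual {\tt LinearRegDS.dml} script performs the equivalent computation on
SystemML matrices:
\begin{verbatim}
import numpy as np

def linreg_ds(X, y, lam=1e-6, icpt=1):
    """Direct solve of A b = [X,1]^T y as in equation (4) (sketch)."""
    n, m = X.shape
    if icpt >= 1:
        X = np.hstack([X, np.ones((n, 1))])   # append the column of 1s
    d = np.full(X.shape[1], lam)
    if icpt >= 1:
        d[-1] = 0.0                           # keep the intercept unregularized
    A = X.T @ X + np.diag(d)
    return np.linalg.solve(A, X.T @ y)        # beta, intercept last if present
\end{verbatim}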

We implemented two scripts for solving equation~(\ref{eqn:linregeq}): one is a ``direct solver''
that computes $A$ and then solves $A\beta = [X,\,1]^T Y$ by calling an external package;
the other performs linear conjugate gradient~(CG) iterations without ever materializing~$A$.
The CG~algorithm closely follows Algorithm~5.2 in Chapter~5 of~\cite{Nocedal2006:Optimization}.
Each step in the CG~algorithm computes a matrix-vector multiplication $q = Ap$ by first computing
$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$.  Usually the number of such multiplications,
one per CG iteration, is much smaller than~$m$.  The user can put a hard bound on it with the input
parameter~{\tt maxi}, or use the default maximum of~$m+1$ (or~$m$ if there is no intercept) by
setting {\tt maxi=0}.  The CG~iterations terminate when the L2-norm of the vector
$r = A\beta - [X,\,1]^T Y$ decreases from its initial value (for~$\beta=0$) by the tolerance
factor specified in the input parameter~{\tt tol}.
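The matrix-free iteration can be sketched as follows (illustrative NumPy code in the
spirit of Algorithm~5.2 of~\cite{Nocedal2006:Optimization}; no intercept, for brevity):
\begin{verbatim}
import numpy as np

def linreg_cg(X, y, lam=1e-6, tol=1e-6, maxi=0):
    """CG for (X^T X + lam*I) beta = X^T y without forming the matrix A."""
    n, m = X.shape
    beta = np.zeros(m)
    r = -(X.T @ y)              # residual r = A beta - X^T y at beta = 0
    p = -r
    norm0 = np.linalg.norm(r)   # initial residual norm, for the tolerance test
    for _ in range(maxi if maxi > 0 else m):
        if np.linalg.norm(r) <= tol * norm0:
            break
        q = X.T @ (X @ p) + lam * p           # q = A p, matrix-free
        alpha = (r @ r) / (p @ q)
        beta = beta + alpha * p
        r_new = r + alpha * q
        p = -r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta
\end{verbatim}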

The CG algorithm is more efficient if computing
$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$,
an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix.  The direct solver~(DS) is more efficient if
$X$ takes up much more memory than $A$ (i.e.\ $X$~has many more nonzeros than~$m^2$)
and if $m^2$ is small enough for the external solver ($m \lesssim 50000$).  A more precise
rule for choosing between CG and~DS is subject to further research.

In addition to the $\beta$-vector, the scripts estimate the residual standard
deviation~$\sigma$ and the~$R^2$, the ratio of ``explained'' variance to the total
variance of the response variable.  These statistics only make sense if the number
of degrees of freedom $n\,{-}\,m\,{-}\,1$ is positive and the regularization constant
$\lambda$ is negligible or zero.  The formulas for $\sigma$ and $R^2$~are:
\begin{equation*}
R^2_{\textrm{plain}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad
\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad
R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}
\end{equation*}

where
\begin{equation*}
\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i -
\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad
\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2
\end{equation*}
Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the estimated
regression coefficients and the feature vectors.  They may be biased when no
intercept is present, hence the RSS formula subtracts the bias.
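These formulas translate directly into code; a minimal sketch, assuming NumPy and
that {\tt m} is the number of features:
\begin{verbatim}
import numpy as np

def linreg_stats(y, mu_hat, m):
    """sigma, plain R^2, and adjusted R^2 per the formulas above."""
    n = len(y)
    res = y - mu_hat
    rss = np.sum((res - res.mean()) ** 2)   # residual bias subtracted
    tss = np.sum((y - y.mean()) ** 2)
    sigma2 = rss / (n - m - 1)              # requires n - m - 1 > 0
    r2_plain = 1.0 - rss / tss
    r2_adj = 1.0 - sigma2 * (n - 1) / tss
    return np.sqrt(sigma2), r2_plain, r2_adj
\end{verbatim}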

Lastly, note that by choosing the input option {\tt icpt=2} the user can shift
and rescale the columns of~$X$ to have zero mean and variance of~1.
This is particularly important when using regularization over highly imbalanced
features, because regularization tends to penalize small-variance columns (which
need large~$\beta_j$'s) more than large-variance columns (with small~$\beta_j$'s).
At the end, the estimated regression coefficients are shifted and rescaled to
apply to the original features.
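Concretely, if feature~$j$ is standardized to $(x_{i,j} - c_j)/s_j$ with column
mean~$c_j$ and standard deviation~$s_j$, then coefficients $\tilde\beta_j$ fitted on
the standardized features map back to the original features as
$\beta_j = \tilde\beta_j / s_j$ and
$\beta_0 = \tilde\beta_0 - \sum_{j=1}^m c_j \tilde\beta_j / s_j$.
A minimal sketch of this back-transformation, assuming NumPy and that the intercept
is stored last:
\begin{verbatim}
import numpy as np

def unscale_coefficients(beta_scaled, col_means, col_sds):
    """Map coefficients for standardized features back to the originals."""
    b, b0 = beta_scaled[:-1], beta_scaled[-1]
    beta = b / col_sds                       # beta_j = beta~_j / s_j
    beta0 = b0 - np.sum(col_means * beta)    # shift the intercept
    return np.append(beta, beta0)
\end{verbatim}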

\smallskip

\noindent{\bf Returns}
\smallskip

The estimated regression coefficients (the $\hat{\beta}_j$'s) are populated into
a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
input argument.  The contents and size of this matrix depend on the input
argument {\tt icpt}, which specifies the user's intercept and rescaling choice:

\begin{Description}
\item[{\tt icpt=0}:] No intercept; matrix~$B$ has size $m\,{\times}\,1$, with
$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
\item[{\tt icpt=1}:] An intercept is present, but there is no shifting/rescaling of~$X$; matrix~$B$
has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
\item[{\tt icpt=2}:] An intercept is present, and the features in~$X$ are shifted to
mean${} = 0$ and rescaled to variance${} = 1$; there are then two versions of
the~$\hat{\beta}_j$'s, one for the original features and another for the
shifted/rescaled features.  Matrix~$B$ now has size $(m\,{+}\,1) \times 2$, with
$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
features, in the above format.  Note that $B[\cdot, 2]$ are iteratively estimated
and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and
rescaling.
\end{Description}

The estimated summary statistics, including the residual standard deviation~$\sigma$ and
the~$R^2$, are printed to the standard output or written to a file (if specified) in CSV format as
defined in Table~\ref{table:linreg:stats}.  For conjugate gradient iterations,
a log file with monitoring variables can also be made available; see
Table~\ref{table:linreg:log}.

\smallskip

\noindent{\bf Examples}
\smallskip

{\hangindent=\parindent\noindent\tt
\hml -f LinearRegCG.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
icpt=2 reg=1.0 tol=0.00000001 maxi=100 Log=/user/biadmin/log.csv
}

{\hangindent=\parindent\noindent\tt
\hml -f LinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
icpt=2 reg=1.0
}

% \smallskip
% \noindent{\bf See Also}
% \smallskip
%
% In case of binary classification problems, please consider using L2-SVM or
% binary logistic regression; for multiclass classification, use multiclass~SVM
% or multinomial logistic regression.  For more complex distributions of the
% response variable use the Generalized Linear Models script.