\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Linear Regression}
\label{sec:LinReg}
\noindent{\bf Description}
\smallskip
Linear Regression scripts are used to model the relationship between one numerical
response variable and one or more explanatory (feature) variables.
The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a
numerical vector of feature variables and $y_i$ is a numerical response value for
each training data record. The feature vectors are provided as a matrix $X$ of size
$n\,{\times}\,m$, where $n$ is the number of records and $m$ is the number of features.
The observed response values are provided as a 1-column matrix~$Y$, with a numerical
value $y_i$ for each~$x_i$ in the corresponding row of matrix~$X$.
In linear regression, we predict the distribution of the response~$y_i$ based on
a fixed linear combination of the features in~$x_i$. We assume that
there exist constant regression coefficients $\beta_0, \beta_1, \ldots, \beta_m$
and a constant residual variance~$\sigma^2$ such that
\begin{equation}
y_i \sim \Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\,
\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m}
\label{eqn:linregdef}
\end{equation}
The distribution $y_i \sim \Normal(\mu_i, \sigma^2)$ models the ``unexplained'' residual
noise, which is assumed to be independent across records.
The goal is to estimate the regression coefficients and the residual variance.
Once they are accurately estimated, we can make predictions about $y_i$ given~$x_i$
in new records. We can also use the $\beta_j$'s to analyze the influence of individual
features on the response value, and assess the quality of the model by comparing the
residual variance of the response (the variance left after prediction) with its total variance.
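
For illustration only, the following Python/NumPy sketch (not part of the SystemML scripts;
all names and sizes below are hypothetical) generates data from model~(\ref{eqn:linregdef})
and computes the mean prediction $\mu_i$ for a new feature vector:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3                                 # records and features (toy sizes)
beta0 = 1.5                                   # assumed intercept
beta  = np.array([2.0, -1.0, 0.5])            # assumed coefficients beta_1..beta_m
sigma = 0.3                                   # assumed residual std. deviation

X  = rng.normal(size=(n, m))                  # feature matrix X (n x m)
mu = beta0 + X @ beta                         # mu_i = beta_0 + sum_j beta_j x_{i,j}
Y  = mu + rng.normal(scale=sigma, size=n)     # y_i ~ Normal(mu_i, sigma^2)

x_new = np.array([0.2, 1.0, -0.7])            # a new record
print("predicted mean response:", beta0 + x_new @ beta)
\end{verbatim}
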
Our library provides two scripts that perform the same estimation but use different
computational methods. Depending on the size and sparsity of the feature matrix~$X$,
one or the other script may be more efficient. The ``direct solve'' script
{\tt LinearRegDS} is more efficient when the number of features $m$ is relatively small
($m \sim 1000$ or less) and matrix~$X$ is either tall or fairly dense
(has~${\gg}\:m^2$ nonzeros); otherwise, the ``conjugate gradient'' script {\tt LinearRegCG}
is more efficient. If $m > 50000$, use only {\tt LinearRegCG}.
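
This rule of thumb can be written down as a small helper; the sketch below (Python,
purely illustrative, with a hypothetical factor of~$10$ standing in for~``${\gg}$'')
merely restates the guidance above:
\begin{verbatim}
def choose_linreg_script(n_features, n_nonzeros):
    """Heuristic choice between LinearRegDS and LinearRegCG,
    restating the guidance above (illustrative only)."""
    m = n_features
    if m > 50000:
        return "LinearRegCG"       # matrix A would be too large to solve directly
    if m <= 1000 and n_nonzeros >= 10 * m * m:
        return "LinearRegDS"       # X is tall/dense enough to justify forming A
    return "LinearRegCG"
\end{verbatim}
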
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}LinearRegDS.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} O=}path/file
{\tt{} icpt=}int
{\tt{} reg=}double
{\tt{} fmt=}format
}\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}LinearRegCG.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} O=}path/file
{\tt{} Log=}path/file
{\tt{} icpt=}int
{\tt{} reg=}double
{\tt{} tol=}double
{\tt{} maxi=}int
{\tt{} fmt=}format
}
\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the matrix of feature vectors; each row constitutes
one feature vector
\item[{\tt Y}:]
Location to read the 1-column matrix of response values
\item[{\tt B}:]
Location to store the estimated regression parameters (the $\beta_j$'s), with the
intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location to store the CSV file of summary statistics defined in
Table~\ref{table:linreg:stats}; by default they are printed to standard output
\item[{\tt Log}:] (default:\mbox{ }{\tt " "}, {\tt LinearRegCG} only)
Location to store iteration-specific variables for monitoring and debugging purposes;
see Table~\ref{table:linreg:log} for details.
\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
Intercept presence and shifting/rescaling the features in~$X$:\\
{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\
{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
\item[{\tt reg}:] (default:\mbox{ }{\tt 0.000001})
L2-regularization parameter~\mbox{$\lambda\geq 0$}; set to a nonzero value for highly dependent,
sparse, or numerous ($m \gtrsim n/10$) features
\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}, {\tt LinearRegCG} only)
Tolerance \mbox{$\eps\geq 0$} used in the convergence criterion: the conjugate gradient
iterations terminate once the L2-norm of the $\beta$-residual has decreased by this factor
from its initial value
\item[{\tt maxi}:] (default:\mbox{ }{\tt 0}, {\tt LinearRegCG} only)
Maximum number of conjugate gradient iterations, or~0 if no maximum
limit provided
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\end{Description}
\begin{table}[t]\small\centerline{%
\begin{tabular}{|ll|}
\hline
Name & Meaning \\
\hline
{\tt AVG\_TOT\_Y} & Average of the response value $Y$ \\
{\tt STDEV\_TOT\_Y} & Standard Deviation of the response value $Y$ \\
{\tt AVG\_RES\_Y} & Average of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$, i.e.\ residual bias \\
{\tt STDEV\_RES\_Y} & Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$ \\
{\tt DISPERSION} & GLM-style dispersion, i.e.\ residual sum of squares / \#deg.\ fr. \\
{\tt PLAIN\_R2} & Plain $R^2$ of residual with bias included vs.\ total average \\
{\tt ADJUSTED\_R2} & Adjusted $R^2$ of residual with bias included vs.\ total average \\
{\tt PLAIN\_R2\_NOBIAS} & Plain $R^2$ of residual with bias subtracted vs.\ total average \\
{\tt ADJUSTED\_R2\_NOBIAS} & Adjusted $R^2$ of residual with bias subtracted vs.\ total average \\
{\tt PLAIN\_R2\_VS\_0} & ${}^*$Plain $R^2$ of residual with bias included vs.\ zero constant \\
{\tt ADJUSTED\_R2\_VS\_0} & ${}^*$Adjusted $R^2$ of residual with bias included vs.\ zero constant \\
\hline
\multicolumn{2}{r}{${}^{*\mathstrut}$ The last two statistics are only printed if there is no intercept ({\tt icpt=0})} \\
\end{tabular}}
\caption{Besides~$\beta$, linear regression scripts compute a few summary statistics
listed above. The statistics are provided in CSV format, one comma-separated name-value
pair per line.}
\label{table:linreg:stats}
\end{table}
\begin{table}[t]\small\centerline{%
\begin{tabular}{|ll|}
\hline
Name & Meaning \\
\hline
{\tt CG\_RESIDUAL\_NORM} & L2-norm of conjug.\ grad.\ residual, which is $A \pxp \beta - t(X) \pxp y$ \\
& where $A = t(X) \pxp X + \diag (\lambda)$, or a similar quantity \\
{\tt CG\_RESIDUAL\_RATIO} & Ratio of current L2-norm of conjug.\ grad.\ residual over the initial \\
\hline
\end{tabular}}
\caption{
The {\tt Log} file for the {\tt{}LinearRegCG} script contains the above \mbox{per-}iteration
variables in CSV format, each line containing the triple (Name, Iteration\#, Value), with
Iteration\# equal to~0 for the initial values.}
\label{table:linreg:log}
\end{table}
\noindent{\bf Details}
\smallskip
To solve a linear regression problem over feature matrix~$X$ and response vector~$Y$,
we can find coefficients $\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize
the joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed statistical
model~(\ref{eqn:linregdef}). Since the joint likelihood of the independent
$y_i \sim \Normal(\mu_i, \sigma^2)$ is proportional to the product of
$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the logarithm of this
product and multiply it by $-2\sigma^2 < 0$, which turns likelihood maximization into
the least squares problem:
\begin{equation}
\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\,
\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
\,\,\to\,\,\min
\label{eqn:linregls}
\end{equation}

This may not be enough, however. The minimum may sometimes be attained over infinitely many
$\beta$-vectors, for example if $X$ has an all-0 column, or has linearly dependent columns,
or has fewer rows than columns~\mbox{($n < m$)}. Even if~(\ref{eqn:linregls}) has a unique
solution, other $\beta$-vectors may be just a little suboptimal\footnote{Smaller likelihood
difference between two models suggests less statistical evidence to pick one model over the
other.}, yet give significantly different predictions for new feature vectors. This results
in \emph{overfitting}: prediction error for the training data ($X$ and~$Y$) is much smaller
than for the test data (new records).

Overfitting and degeneracy in the data are commonly mitigated by adding a regularization
penalty term to the least squares function:
\begin{equation}
\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
\,+\,\, \lambda \sum_{j=1}^m \beta_j^2
\,\,\to\,\,\min
\label{eqn:linreglsreg}
\end{equation}
The choice of $\lambda>0$, the regularization constant, typically involves cross-validation
where the dataset is repeatedly split into a training part (to estimate the~$\beta_j$'s) and
a test part (to evaluate prediction accuracy), with the goal of maximizing the test accuracy.
In our scripts, $\lambda$~is provided as input parameter~{\tt reg}.
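
As a hedged illustration of such a selection procedure (not part of the library; the grid of
$\lambda$ values, the split ratio, and all names below are arbitrary choices), one could
evaluate candidate values of~$\lambda$ by repeated random train/test splits:
\begin{verbatim}
import numpy as np

def fit_ridge(X, Y, lam):
    """Solve the regularized least squares problem with an unpenalized
    intercept, via the normal equations (illustrative only)."""
    n, m = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])
    A = X1.T @ X1 + np.diag([lam] * m + [0.0])
    return np.linalg.solve(A, X1.T @ np.ravel(Y))

def choose_lambda(X, Y, lambdas, n_splits=5, test_frac=0.2, seed=0):
    """Pick the lambda with the smallest average test-set squared error."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Y = np.ravel(Y)
    errors = np.zeros(len(lambdas))
    for _ in range(n_splits):
        perm = rng.permutation(n)
        n_test = int(test_frac * n)
        test, train = perm[:n_test], perm[n_test:]
        for k, lam in enumerate(lambdas):
            b = fit_ridge(X[train], Y[train], lam)
            pred = np.hstack([X[test], np.ones((n_test, 1))]) @ b
            errors[k] += np.mean((Y[test] - pred) ** 2)
    return lambdas[int(np.argmin(errors))]
\end{verbatim}
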
The solution to the least squares problem~(\ref{eqn:linreglsreg}), obtained by taking the
derivative with respect to~$\beta$ and setting it to~0, satisfies the matrix linear equation
\begin{equation}
A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\,
\textrm{where}\,\,\,
A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} \diag(\hspace{0.5pt}
\underbrace{\raisebox{0pt}[0pt][0.5pt]{$\lambda,\ldots, \lambda$}}_{\raisebox{2pt}{$\scriptstyle m$}}
\hspace{0.5pt}, 0)
\label{eqn:linregeq}
\end{equation}
where $[X,\,1]$ is $X$~with an extra column of~1s appended on the right, and the
diagonal matrix of $\lambda$'s has a zero to keep the intercept~$\beta_0$ unregularized.
If the intercept is disabled by setting {\tt icpt=0}, the equation is simply
\mbox{$X^T X \beta = X^T Y$}.
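
The computation behind the ``direct solve'' approach can be sketched as follows in
Python/NumPy (illustrative only; the actual script is written in DML and may differ
in its details):
\begin{verbatim}
import numpy as np

def linreg_direct_solve(X, Y, lam=1e-6, icpt=True):
    """Form A = [X,1]^T [X,1] + diag(lam,...,lam,0) and solve
    A beta = [X,1]^T Y, as in the matrix equation above (illustrative)."""
    n, m = X.shape
    if icpt:
        X1 = np.hstack([X, np.ones((n, 1))])
        penalty = np.diag([lam] * m + [0.0])   # intercept stays unregularized
    else:                                      # icpt=0: plain X^T X beta = X^T Y
        X1 = X
        penalty = lam * np.eye(m)
    A = X1.T @ X1 + penalty
    return np.linalg.solve(A, X1.T @ np.ravel(Y))
\end{verbatim}
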
We implemented two scripts for solving equation~(\ref{eqn:linregeq}): one is a ``direct solver''
that computes $A$ and then solves $A\beta = [X,\,1]^T Y$ by calling an external package,
the other performs linear conjugate gradient~(CG) iterations without ever materializing~$A$.
The CG~algorithm closely follows Algorithm~5.2 in Chapter~5 of~\cite{Nocedal2006:Optimization}.
Each step in the CG~algorithm computes a matrix-vector multiplication $q = Ap$ by first computing
$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$. Usually the number of such multiplications,
one per CG iteration, is much smaller than~$m$. The user can put a hard bound on it with input
parameter~{\tt maxi}, or use the default maximum of~$m+1$ (or~$m$ if no intercept) by
setting {\tt maxi=0}. The CG~iterations terminate when the L2-norm of vector
$r = A\beta - [X,\,1]^T Y$ decreases from its initial value (for~$\beta=0$) by the tolerance
factor specified in input parameter~{\tt tol}.
The CG algorithm is more efficient if computing
$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$,
an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix. The Direct Solver~(DS) is more efficient if
$X$ takes up a lot more memory than $A$ (i.e.\ $X$~has a lot more nonzeros than~$m^2$)
and if $m^2$ is small enough for the external solver ($m \lesssim 50000$). A more precise
rule for choosing between CG and~DS is a subject for further research.
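
The following NumPy sketch illustrates the conjugate gradient approach for the
no-intercept case ({\tt icpt=0}): it solves $(X^T X + \lambda I)\,\beta = X^T Y$ while
computing each product as $X^T(X p) + \lambda p$, so the $(m\,{\times}\,m)$ matrix is never
materialized. It mirrors the linear CG recurrences referenced above and is only an
illustration, not the DML implementation:
\begin{verbatim}
import numpy as np

def linreg_cg(X, Y, lam=1e-6, tol=1e-6, max_iter=0):
    """Linear conjugate gradient for (X^T X + lam*I) beta = X^T Y,
    never materializing the (m x m) matrix (illustrative sketch)."""
    m = X.shape[1]
    Y = np.ravel(Y)
    if max_iter <= 0:
        max_iter = m                     # default bound (maxi=0, no intercept)
    beta = np.zeros(m)
    r = -(X.T @ Y)                       # residual r = A beta - X^T Y at beta = 0
    p = -r                               # initial search direction
    norm0 = np.linalg.norm(r)            # initial residual norm, for the tol test
    for _ in range(max_iter):
        q = X.T @ (X @ p) + lam * p      # q = A p, computed without forming A
        alpha = (r @ r) / (p @ q)
        beta = beta + alpha * p
        r_new = r + alpha * q
        if np.linalg.norm(r_new) <= tol * norm0:
            break
        p = -r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta
\end{verbatim}
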
In addition to the $\beta$-vector, the scripts estimate the residual standard
deviation~$\sigma$ and the~$R^2$, the ratio of ``explained'' variance to the total
variance of the response variable. These statistics only make sense if the number
of degrees of freedom $n\,{-}\,m\,{-}\,1$ is positive and the regularization constant
$\lambda$ is negligible or zero. The formulas for $\sigma$ and $R^2$~are:
\begin{equation*}
R^2_{\textrm{plain}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad
\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad
R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}
\end{equation*}
where
\begin{equation*}
\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i -
\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad
\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2
\end{equation*}
Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the estimated
regression coefficients and the feature vectors. They may be biased when no
intercept is present, hence the RSS formula subtracts the bias.
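
These summary statistics can be reproduced from the observed responses and the predicted
means, for example as in this illustrative NumPy sketch (names are hypothetical; the
scripts compute the corresponding quantities in DML):
\begin{verbatim}
import numpy as np

def linreg_stats(Y, mu_hat, m):
    """Residual std. deviation sigma, plain and adjusted R^2 from the
    observed responses Y and predicted means mu_hat (illustrative)."""
    Y, mu_hat = np.ravel(Y), np.ravel(mu_hat)
    n = Y.size
    res = Y - mu_hat
    rss = np.sum((res - res.mean()) ** 2)   # residual sum of squares, bias subtracted
    tss = np.sum((Y - Y.mean()) ** 2)       # total sum of squares
    sigma = np.sqrt(rss / (n - m - 1))
    r2_plain = 1.0 - rss / tss
    r2_adj = 1.0 - sigma ** 2 * (n - 1) / tss
    return sigma, r2_plain, r2_adj
\end{verbatim}
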
Lastly, note that by choosing the input option {\tt icpt=2} the user can shift
and rescale the columns of~$X$ to zero mean and unit variance.
This is particularly important when regularization is applied to features of very
different scales, because regularization tends to penalize small-variance columns (which
need large~$\beta_j$'s) more heavily than large-variance columns (whose $\beta_j$'s are small).
At the end, the estimated regression coefficients are shifted and rescaled so that they
apply to the original features.
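
To make this transformation concrete: if the model is fit on standardized columns
$x'_j = (x_j - \overline{x}_j)/s_j$, the coefficients for the original features can be
recovered as sketched below (a hedged illustration of the complementary shifting and
rescaling, not the DML code):
\begin{verbatim}
import numpy as np

def unscale_coefficients(beta_scaled, beta0_scaled, col_means, col_stds):
    """Map coefficients estimated on standardized features back to the
    original feature scale (illustrative of the icpt=2 post-processing)."""
    beta_orig = beta_scaled / col_stds
    beta0_orig = beta0_scaled - np.sum(beta_scaled * col_means / col_stds)
    return beta_orig, beta0_orig
\end{verbatim}
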
\smallskip
\noindent{\bf Returns}
\smallskip
The estimated regression coefficients (the $\hat{\beta}_j$'s) are populated into
a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
input argument. The contents and size of this matrix depend on the input
argument {\tt icpt}, which specifies the user's intercept and rescaling choice:
\begin{Description}
\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with
$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
\item[{\tt icpt=1}:] There is an intercept, but no shifting/rescaling of~$X$; matrix~$B$
has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
\item[{\tt icpt=2}:] There is an intercept, and the features in~$X$ are shifted to
mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of
the~$\hat{\beta}_j$'s, one for the original features and another for the
shifted/rescaled features. Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with
$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
features, in the above format. Note that the coefficients in $B[\cdot, 2]$ are the ones
estimated on the shifted/rescaled features, and $B[\cdot, 1]$ is obtained from $B[\cdot, 2]$
by the complementary shifting and rescaling.
\end{Description}
The estimated summary statistics, including the residual standard deviation~$\sigma$ and
the~$R^2$, are printed to standard output or written to a file (if specified) in the CSV
format defined in Table~\ref{table:linreg:stats}. For conjugate gradient iterations,
a log file with monitoring variables can also be made available; see
Table~\ref{table:linreg:log}.
\smallskip
\noindent{\bf Examples}
\smallskip
{\hangindent=\parindent\noindent\tt
\hml -f LinearRegCG.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
icpt=2 reg=1.0 tol=0.00000001 maxi=100 Log=/user/biadmin/log.csv
}
{\hangindent=\parindent\noindent\tt
\hml -f LinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
icpt=2 reg=1.0
}
% \smallskip
% \noindent{\bf See Also}
% \smallskip
%
% In case of binary classification problems, please consider using L2-SVM or
% binary logistic regression; for multiclass classification, use multiclass~SVM
% or multinomial logistic regression. For more complex distributions of the
% response variable use the Generalized Linear Models script.