
 \begin{comment}

  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.

 \end{comment}

 \subsection{Stepwise Linear Regression}

 \noindent{\bf Description}
 \smallskip

 Our stepwise linear regression script selects a linear model based on the Akaike information criterion (AIC):
 among the candidate models it examines, it returns the one with the lowest AIC. \\

 \smallskip
 \noindent{\bf Usage}
 \smallskip

 {\hangindent=\parindent\noindent\it%
 {\tt{}-f }path/\/{\tt{}StepLinearRegDS.dml}
 {\tt{} -nvargs}
 {\tt{} X=}path/file
 {\tt{} Y=}path/file
 {\tt{} B=}path/file
 {\tt{} S=}path/file
 {\tt{} O=}path/file
 {\tt{} icpt=}int
 {\tt{} thr=}double
 {\tt{} fmt=}format

 }

 \smallskip
 \noindent{\bf Arguments}
 \begin{Description}
 \item[{\tt X}:]
 Location (on HDFS) to read the matrix of feature vectors; each row contains
 one feature vector.
 \item[{\tt Y}:]
 Location (on HDFS) to read the 1-column matrix of response values.
 \item[{\tt B}:]
 Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the
 intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if an intercept is present.
 \item[{\tt S}:] (default:\mbox{ }{\tt " "})
 Location (on HDFS) to store the selected feature-ids in the order in which the algorithm selected them;
 by default the selected feature-ids are written to standard output.
 \item[{\tt O}:] (default:\mbox{ }{\tt " "})
 Location (on HDFS) to store the CSV-file of summary statistics defined in
 Table~\ref{table:linreg:stats}; by default the summary statistics are written to standard output.
 \item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
 Intercept presence and shifting/rescaling of the features in~$X$:\\
 {\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\
 {\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
 {\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1.
 \item[{\tt thr}:] (default:\mbox{ }{\tt 0.01})
 Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr},
 no further features are checked and the algorithm stops.
 \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
 Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
 see the read/write functions in the SystemML Language Reference for details.
 \end{Description}


 \noindent{\bf Details}
 \smallskip

 Stepwise linear regression iteratively selects predictive variables in an automated procedure.
 Currently, our implementation supports forward selection: starting from an empty model (without any variables),
 the algorithm examines the addition of each variable, using the AIC as the model comparison criterion. The AIC is defined as
 \begin{equation}
 AIC = -2 \log{L} + 2 edf,\label{eq:AIC}
 \end{equation}
 where $L$ denotes the likelihood of the fitted model and $edf$ is the equivalent degrees of freedom, i.e., the number of estimated parameters.
 In each iteration, the variable whose addition improves the AIC the most is added to the model.
 This procedure is repeated until no additional variable improves the AIC by at least the threshold
 specified in the input parameter {\tt thr}.
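
 As an illustration (a sketch under stated assumptions, not taken from the script itself), assume the usual linear model with independent Gaussian errors of unknown variance.
 Let $n$ denote the number of rows of $X$ and let $RSS$ denote the residual sum of squares of the fitted model; both symbols are introduced here for illustration only.
 Substituting the maximum-likelihood variance estimate $\hat{\sigma}^2 = RSS / n$ into the Gaussian log-likelihood gives
 \[
 -2 \log{L} = n \log\left(\frac{RSS}{n}\right) + n \left(1 + \log{2\pi}\right),
 \]
 so that~(\ref{eq:AIC}) becomes
 \[
 AIC = n \log\left(\frac{RSS}{n}\right) + 2 edf + n \left(1 + \log{2\pi}\right).
 \]
 The last term is the same for every candidate model fitted to the same data.
 Under these assumptions, comparing models by AIC therefore reduces to comparing $n \log(RSS / n) + 2 edf$:
 adding one variable lowers the AIC only if the resulting drop in $n \log(RSS / n)$ exceeds the penalty of~2 incurred by the extra parameter.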

 To fit the model in each iteration, we use the ``direct solve'' method of the script {\tt LinearRegDS.dml} discussed in Section~\ref{sec:LinReg}.
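
 As a rough sketch of what a direct solve amounts to (assuming ordinary least squares without regularization; the symbols $X_S$, $y$, and $\beta_S$ are introduced here for illustration only),
 let $X_S$ be the matrix formed by the currently selected columns of $X$, with a column of ones appended when an intercept is used,
 and let $y$ be the response vector read from {\tt Y}.
 The least-squares coefficients $\beta_S$ then satisfy the normal equations
 \[
 X_S^{T} X_S \, \beta_S = X_S^{T} y,
 \]
 a square linear system whose size equals the number of parameters in the candidate model.
 Section~\ref{sec:LinReg} describes the exact system solved by {\tt LinearRegDS.dml}.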


 \smallskip
 \noindent{\bf Returns}
 \smallskip

 Like {\tt LinearRegDS.dml}, the stepwise linear regression script computes
 the estimated regression coefficients and stores them in matrix $B$ on HDFS.
 The format of matrix $B$ is identical to the one produced by the scripts for linear regression (see Section~\ref{sec:LinReg}).
 Additionally, {\tt StepLinearRegDS.dml} outputs the variable indices (stored in the 1-column matrix $S$)
 in the order in which they were selected by the algorithm, i.e., the $i$th entry in matrix $S$ corresponds to
 the variable that improves the AIC the most in the $i$th iteration.
 If the model with the lowest AIC includes no variables, matrix $S$ is treated as empty and contains a single entry equal to~0.
 Moreover, the estimated summary statistics defined in Table~\ref{table:linreg:stats}
 are printed out or stored in a file (if requested).
 If the empty model achieves the best AIC, these statistics are not produced.


 \smallskip
 \noindent{\bf Examples}
 \smallskip

 {\hangindent=\parindent\noindent\tt
 	\hml -f StepLinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
 	B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv
 	icpt=2 thr=0.05 fmt=csv

 }