\begin{comment} | |

Licensed to the Apache Software Foundation (ASF) under one | |

or more contributor license agreements. See the NOTICE file | |

distributed with this work for additional information | |

regarding copyright ownership. The ASF licenses this file | |

to you under the Apache License, Version 2.0 (the | |

"License"); you may not use this file except in compliance | |

with the License. You may obtain a copy of the License at | |

http://www.apache.org/licenses/LICENSE-2.0 | |

Unless required by applicable law or agreed to in writing, | |

software distributed under the License is distributed on an | |

"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |

KIND, either express or implied. See the License for the | |

specific language governing permissions and limitations | |

under the License. | |

\end{comment} | |

\subsection{Stepwise Linear Regression} | |

\noindent{\bf Description} | |

\smallskip | |

Our stepwise linear regression script selects a linear model based on the Akaike information criterion (AIC): | |

the model that gives rise to the lowest AIC is computed. \\ | |

\smallskip | |

\noindent{\bf Usage} | |

\smallskip | |

{\hangindent=\parindent\noindent\it% | |

{\tt{}-f }path/\/{\tt{}StepLinearRegDS.dml} | |

{\tt{} -nvargs} | |

{\tt{} X=}path/file | |

{\tt{} Y=}path/file | |

{\tt{} B=}path/file | |

{\tt{} S=}path/file | |

{\tt{} O=}path/file | |

{\tt{} icpt=}int | |

{\tt{} thr=}double | |

{\tt{} fmt=}format | |

} | |

\smallskip | |

\noindent{\bf Arguments} | |

\begin{Description} | |

\item[{\tt X}:] | |

Location (on HDFS) to read the matrix of feature vectors, each row contains | |

one feature vector. | |

\item[{\tt Y}:] | |

Location (on HDFS) to read the 1-column matrix of response values | |

\item[{\tt B}:] | |

Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the | |

intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available | |

\item[{\tt S}:] (default:\mbox{ }{\tt " "}) | |

Location (on HDFS) to store the selected feature-ids in the order as computed by the algorithm; | |

by default the selected feature-ids are forwarded to the standard output. | |

\item[{\tt O}:] (default:\mbox{ }{\tt " "}) | |

Location (on HDFS) to store the CSV-file of summary statistics defined in | |

Table~\ref{table:linreg:stats}; by default the summary statistics are forwarded to the standard output. | |

\item[{\tt icpt}:] (default:\mbox{ }{\tt 0}) | |

Intercept presence and shifting/rescaling the features in~$X$:\\ | |

{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\ | |

{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\ | |

{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1 | |

\item[{\tt thr}:] (default:\mbox{ }{\tt 0.01}) | |

Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr} | |

no further features are being checked and the algorithm stops. | |

\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) | |

Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; | |

see read/write functions in SystemML Language Reference for details. | |

\end{Description} | |

\noindent{\bf Details} | |

\smallskip | |

Stepwise linear regression iteratively selects predictive variables in an automated procedure. | |

Currently, our implementation supports forward selection: starting from an empty model (without any variable) | |

the algorithm examines the addition of each variable based on the AIC as a model comparison criterion. The AIC is defined as | |

\begin{equation} | |

AIC = -2 \log{L} + 2 edf,\label{eq:AIC} | |

\end{equation} | |

where $L$ denotes the likelihood of the fitted model and $edf$ is the equivalent degrees of freedom, i.e., the number of estimated parameters. | |

This procedure is repeated until including no additional variable improves the model by a certain threshold | |

specified in the input parameter {\tt thr}. | |

For fitting a model in each iteration we use the ``direct solve'' method as in the script {\tt LinearRegDS.dml} discussed in Section~\ref{sec:LinReg}. | |

\smallskip | |

\noindent{\bf Returns} | |

\smallskip | |

Similar to the outputs from {\tt LinearRegDS.dml} the stepwise linear regression script computes | |

the estimated regression coefficients and stores them in matrix $B$ on HDFS. | |

The format of matrix $B$ is identical to the one produced by the scripts for linear regression (see Section~\ref{sec:LinReg}). | |

Additionally, {\tt StepLinearRegDS.dml} outputs the variable indices (stored in the 1-column matrix $S$) | |

in the order they have been selected by the algorithm, i.e., $i$th entry in matrix $S$ corresponds to | |

the variable which improves the AIC the most in $i$th iteration. | |

If the model with the lowest AIC includes no variables matrix $S$ will be empty (contains one 0). | |

Moreover, the estimated summary statistics as defined in Table~\ref{table:linreg:stats} | |

are printed out or stored in a file (if requested). | |

In the case where an empty model achieves the best AIC these statistics will not be produced. | |

\smallskip | |

\noindent{\bf Examples} | |

\smallskip | |

{\hangindent=\parindent\noindent\tt | |

\hml -f StepLinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx | |

B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv | |

icpt=2 thr=0.05 fmt=csv | |

} | |