\begin{comment}

 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.

\end{comment}

\subsection{Stepwise Generalized Linear Regression}

\noindent{\bf Description}
\smallskip

Our stepwise generalized linear regression script selects a model based on the Akaike information criterion (AIC): among the models it considers, it returns the one with the lowest AIC. Note that currently only the Bernoulli distribution family is supported (see below for details). \\

\smallskip

\noindent{\bf Usage}
\smallskip

{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}StepGLM.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} S=}path/file
{\tt{} O=}path/file
{\tt{} link=}int
{\tt{} yneg=}double
{\tt{} icpt=}int
{\tt{} tol=}double
{\tt{} disp=}double
{\tt{} moi=}int
{\tt{} mii=}int
{\tt{} thr=}double
{\tt{} fmt=}format

}

\smallskip

\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the matrix of feature vectors; each row is
an example.
\item[{\tt Y}:]
Location (on HDFS) to read the response matrix, which may have 1 or 2 columns
\item[{\tt B}:]
Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the
intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
\item[{\tt S}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the selected feature-ids in the order in which the algorithm selects them;
by default they are written to standard output.
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to write certain summary statistics described in Table~\ref{table:GLM:stats};
by default they are written to standard output.

\item[{\tt link}:] (default:\mbox{ }{\tt 2})
Link function code to determine the link function~$\eta = g(\mu)$, see Table~\ref{table:commonGLMs}; currently the following link functions are supported: \\
{\tt 1} = log,
{\tt 2} = logit,
{\tt 3} = probit,
{\tt 4} = cloglog.
\item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0})
Response value for Bernoulli ``No'' label, usually 0.0 or -1.0
\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
Intercept and shifting/rescaling of the features in~$X$:\\
{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1

\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations
when the deviance changes by less than this factor; see Section~\ref{sec:GLM} for details.
\item[{\tt disp}:] (default:\mbox{ }{\tt 0.0})
Dispersion parameter, or {\tt 0.0} to estimate it from data
\item[{\tt moi}:] (default:\mbox{ }{\tt 200})
Maximum number of outer (Fisher scoring) iterations
\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
limit is provided
\item[{\tt thr}:] (default:\mbox{ }{\tt 0.01})
Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr},
no further features are checked and the algorithm stops.
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\end{Description}
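\smallskip
\noindent
For quick reference, the four {\tt link} codes correspond to the following textbook link functions, where $\mu$ is the mean of the Bernoulli response and $\Phi$ denotes the standard normal cumulative distribution function. This summary is given for convenience only; Table~\ref{table:commonGLMs} remains the authoritative listing of what the script implements.
\[
\begin{array}{lll}
\mbox{\tt link=1} & \mbox{(log)}     & \eta = \log \mu \\
\mbox{\tt link=2} & \mbox{(logit)}   & \eta = \log\bigl(\mu / (1 - \mu)\bigr) \\
\mbox{\tt link=3} & \mbox{(probit)}  & \eta = \Phi^{-1}(\mu) \\
\mbox{\tt link=4} & \mbox{(cloglog)} & \eta = \log\bigl(-\log(1 - \mu)\bigr)
\end{array}
\]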

\noindent{\bf Details}
\smallskip

Similar to {\tt StepLinearRegDS.dml}, our stepwise GLM script builds a model by iteratively selecting predictive variables
using a forward selection strategy based on the AIC (\ref{eq:AIC}).
Note that currently only the Bernoulli distribution family ({\tt fam=2} in Table~\ref{table:commonGLMs}) is supported, together with the following link functions: log, logit, probit, and cloglog ({\tt link $\in\{1,2,3,4\}$ } in Table~\ref{table:commonGLMs}).
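\smallskip

As a hedged sketch of a single forward-selection step (the symbols $S_t$, $j_{t+1}$, and $\Delta_{t+1}$ are notation introduced here for illustration and do not appear in the script), let $S_t$ be the set of features selected after $t$ steps and let $\mathrm{AIC}(S)$ denote the AIC of the GLM fitted on the features in~$S$. In its general form the criterion is $\mathrm{AIC} = 2k - 2\log\hat{L}$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood. The next candidate feature and the resulting decrease in AIC are
\[
j_{t+1} = \arg\min_{j \notin S_t} \mathrm{AIC}\bigl(S_t \cup \{j\}\bigr),
\qquad
\Delta_{t+1} = \mathrm{AIC}(S_t) - \mathrm{AIC}\bigl(S_t \cup \{j_{t+1}\}\bigr).
\]
Feature $j_{t+1}$ is added to the model unless $\Delta_{t+1}$ falls below the threshold set by {\tt thr}, in which case the search stops and $S_t$ is the selected model. How exactly {\tt thr} is compared with $\Delta_{t+1}$ (for example, as an absolute or a relative decrease) is not specified in this section, so the sketch leaves that choice open.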

\smallskip
\noindent{\bf Returns}
\smallskip

Similar to {\tt GLM.dml}, the stepwise GLM script computes the estimated regression coefficients and stores them in matrix $B$ on HDFS; matrix $B$ follows the same format as the one produced by {\tt GLM.dml} (see Section~\ref{sec:GLM}).
Additionally, {\tt StepGLM.dml} outputs the variable indices (stored in the 1-column matrix $S$) in the order in which the algorithm selected them, i.e., the $i$th entry of matrix $S$ stores the variable that improves the AIC the most in the $i$th iteration.
If the model with the lowest AIC includes no variables, matrix $S$ will be empty.
Moreover, the estimated summary statistics as defined in Table~\ref{table:GLM:stats}
are printed out or stored in a file on HDFS (if requested);
these statistics are provided only if the selected model is nonempty, i.e., contains at least one variable.

\smallskip
\noindent{\bf Examples}
\smallskip

{\hangindent=\parindent\noindent\tt
\hml -f StepGLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv link=2 yneg=-1.0 icpt=2 tol=0.000001 moi=100 mii=10 thr=0.05 fmt=csv

}
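\smallskip
\noindent
As a further, hedged illustration, the invocation below relies on the documented defaults. It assumes that {\tt X}, {\tt Y}, and {\tt B} are the only required arguments (they are the only ones listed without a default) and that the response in {\tt Y} encodes the ``No'' label as 0.0, matching the default {\tt yneg}. The file paths are placeholders, as in the example above. With these settings the script would use the logit link ({\tt link=2}), fit no intercept ({\tt icpt=0}), stop at {\tt thr=0.01}, and write the selected feature-ids and the summary statistics to standard output.

{\hangindent=\parindent\noindent\tt
\hml -f StepGLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx

}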