blob: 38699907a5d046246363a1c70c15f3c62a60735f [file] [log] [blame]
\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Stepwise Generalized Linear Regression}
\noindent{\bf Description}
\smallskip
Our stepwise generalized linear regression script selects a model based on the Akaike information criterion (AIC): the model that gives rise to the lowest AIC is provided. Note that currently only the Bernoulli distribution family is supported (see below for details). \\
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}StepGLM.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} S=}path/file
{\tt{} O=}path/file
{\tt{} link=}int
{\tt{} yneg=}double
{\tt{} icpt=}int
{\tt{} tol=}double
{\tt{} disp=}double
{\tt{} moi=}int
{\tt{} mii=}int
{\tt{} thr=}double
{\tt{} fmt=}format
}
\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the matrix of feature vectors; each row is
an example.
\item[{\tt Y}:]
Location (on HDFS) to read the response matrix, which may have 1 or 2 columns
\item[{\tt B}:]
Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the
intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
\item[{\tt S}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the selected feature-ids in the order as computed by the algorithm,
by default it is standard output.
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to write certain summary statistics described in Table~\ref{table:GLM:stats},
by default it is standard output.
\item[{\tt link}:] (default:\mbox{ }{\tt 2})
Link function code to determine the link function~$\eta = g(\mu)$, see Table~\ref{table:commonGLMs}; currently the following link functions are supported: \\
{\tt 1} = log,
{\tt 2} = logit,
{\tt 3} = probit,
{\tt 4} = cloglog.
\item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0})
Response value for Bernoulli ``No'' label, usually 0.0 or -1.0
\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
Intercept and shifting/rescaling of the features in~$X$:\\
{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations
when the deviance changes by less than this factor; see below for details.
\item[{\tt disp}:] (default:\mbox{ }{\tt 0.0})
Dispersion parameter, or {\tt 0.0} to estimate it from data
\item[{\tt moi}:] (default:\mbox{ }{\tt 200})
Maximum number of outer (Fisher scoring) iterations
\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
limit provided
\item[{\tt thr}:] (default:\mbox{ }{\tt 0.01})
Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr}
no further features are being checked and the algorithm stops.
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\end{Description}
\noindent{\bf Details}
\smallskip
Similar to {\tt StepLinearRegDS.dml} our stepwise GLM script builds a model by iteratively selecting predictive variables
using a forward selection strategy based on the AIC (\ref{eq:AIC}).
Note that currently only the Bernoulli distribution family ({\tt fam=2} in Table~\ref{table:commonGLMs}) together with the following link functions are supported: log, logit, probit, and cloglog ({\tt link $\in\{1,2,3,4\}$ } in Table~\ref{table:commonGLMs}).
\smallskip
\noindent{\bf Returns}
\smallskip
Similar to the outputs from {\tt GLM.dml} the stepwise GLM script computes the estimated regression coefficients and stores them in matrix $B$ on HDFS; matrix $B$ follows the same format as the one produced by {\tt GLM.dml} (see Section~\ref{sec:GLM}).
Additionally, {\tt StepGLM.dml} outputs the variable indices (stored in the 1-column matrix $S$) in the order they have been selected by the algorithm, i.e., $i$th entry in matrix $S$ stores the variable which improves the AIC the most in $i$th iteration.
If the model with the lowest AIC includes no variables matrix $S$ will be empty.
Moreover, the estimated summary statistics as defined in Table~\ref{table:GLM:stats}
are printed out or stored in a file on HDFS (if requested);
these statistics will be provided only if the selected model is nonempty, i.e., contains at least one variable.
\smallskip
\noindent{\bf Examples}
\smallskip
{\hangindent=\parindent\noindent\tt
\hml -f StepGLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv link=2 yneg=-1.0 icpt=2 tol=0.000001 moi=100 mii=10 thr=0.05 fmt=csv
}