| \begin{comment} |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| \end{comment} |
| |
| \subsection{Stepwise Generalized Linear Regression} |
| |
| \noindent{\bf Description} |
| \smallskip |
| |
| Our stepwise generalized linear regression script selects a model based on the Akaike information criterion (AIC): the model that gives rise to the lowest AIC is provided. Note that currently only the Bernoulli distribution family is supported (see below for details). \\ |
| |
| \smallskip |
| \noindent{\bf Usage} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\it% |
| {\tt{}-f }path/\/{\tt{}StepGLM.dml} |
| {\tt{} -nvargs} |
| {\tt{} X=}path/file |
| {\tt{} Y=}path/file |
| {\tt{} B=}path/file |
| {\tt{} S=}path/file |
| {\tt{} O=}path/file |
| {\tt{} link=}int |
| {\tt{} yneg=}double |
| {\tt{} icpt=}int |
| {\tt{} tol=}double |
| {\tt{} disp=}double |
| {\tt{} moi=}int |
| {\tt{} mii=}int |
| {\tt{} thr=}double |
| {\tt{} fmt=}format |
| |
| } |
| |
| |
| \smallskip |
| \noindent{\bf Arguments} |
| \begin{Description} |
| \item[{\tt X}:] |
| Location (on HDFS) to read the matrix of feature vectors; each row is |
| an example. |
| \item[{\tt Y}:] |
| Location (on HDFS) to read the response matrix, which may have 1 or 2 columns |
| \item[{\tt B}:] |
| Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the |
| intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available |
| \item[{\tt S}:] (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to store the selected feature-ids in the order as computed by the algorithm, |
| by default it is standard output. |
| \item[{\tt O}:] (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to write certain summary statistics described in Table~\ref{table:GLM:stats}, |
| by default it is standard output. |
| \item[{\tt link}:] (default:\mbox{ }{\tt 2}) |
| Link function code to determine the link function~$\eta = g(\mu)$, see Table~\ref{table:commonGLMs}; currently the following link functions are supported: \\ |
| {\tt 1} = log, |
| {\tt 2} = logit, |
| {\tt 3} = probit, |
| {\tt 4} = cloglog. |
| \item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0}) |
| Response value for Bernoulli ``No'' label, usually 0.0 or -1.0 |
| \item[{\tt icpt}:] (default:\mbox{ }{\tt 0}) |
| Intercept and shifting/rescaling of the features in~$X$:\\ |
| {\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\ |
| {\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\ |
| {\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1 |
| \item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}) |
| Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations |
| when the deviance changes by less than this factor; see below for details. |
| \item[{\tt disp}:] (default:\mbox{ }{\tt 0.0}) |
| Dispersion parameter, or {\tt 0.0} to estimate it from data |
| \item[{\tt moi}:] (default:\mbox{ }{\tt 200}) |
| Maximum number of outer (Fisher scoring) iterations |
| \item[{\tt mii}:] (default:\mbox{ }{\tt 0}) |
| Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum |
| limit provided |
| \item[{\tt thr}:] (default:\mbox{ }{\tt 0.01}) |
| Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr} |
| no further features are being checked and the algorithm stops. |
| \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) |
| Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; |
| see read/write functions in SystemML Language Reference for details. |
| \end{Description} |
| |
| |
| \noindent{\bf Details} |
| \smallskip |
| |
| Similar to {\tt StepLinearRegDS.dml} our stepwise GLM script builds a model by iteratively selecting predictive variables |
| using a forward selection strategy based on the AIC (\ref{eq:AIC}). |
| Note that currently only the Bernoulli distribution family ({\tt fam=2} in Table~\ref{table:commonGLMs}) together with the following link functions are supported: log, logit, probit, and cloglog ({\tt link $\in\{1,2,3,4\}$ } in Table~\ref{table:commonGLMs}). |
| |
| |
| \smallskip |
| \noindent{\bf Returns} |
| \smallskip |
| |
| Similar to the outputs from {\tt GLM.dml} the stepwise GLM script computes the estimated regression coefficients and stores them in matrix $B$ on HDFS; matrix $B$ follows the same format as the one produced by {\tt GLM.dml} (see Section~\ref{sec:GLM}). |
| Additionally, {\tt StepGLM.dml} outputs the variable indices (stored in the 1-column matrix $S$) in the order they have been selected by the algorithm, i.e., $i$th entry in matrix $S$ stores the variable which improves the AIC the most in $i$th iteration. |
| If the model with the lowest AIC includes no variables matrix $S$ will be empty. |
| Moreover, the estimated summary statistics as defined in Table~\ref{table:GLM:stats} |
| are printed out or stored in a file on HDFS (if requested); |
| these statistics will be provided only if the selected model is nonempty, i.e., contains at least one variable. |
| |
| |
| \smallskip |
| \noindent{\bf Examples} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\tt |
| \hml -f StepGLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv link=2 yneg=-1.0 icpt=2 tol=0.000001 moi=100 mii=10 thr=0.05 fmt=csv |
| |
| } |
| |
| |