| \begin{comment} |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| \end{comment} |
| |
| \subsection{Generalized Linear Models (GLM)} |
| \label{sec:GLM} |
| |
| \noindent{\bf Description} |
| \smallskip |
| |
| Generalized Linear Models~\cite{Gill2000:GLM,McCullagh1989:GLM,Nelder1972:GLM} |
| extend the methodology of linear and logistic regression to a variety of |
| distributions commonly assumed as noise effects in the response variable. |
| As before, we are given a collection |
| of records $(x_1, y_1)$, \ldots, $(x_n, y_n)$ where $x_i$ is a numerical vector of |
| explanatory (feature) variables of size~\mbox{$\dim x_i = m$}, and $y_i$ is the |
| response (dependent) variable observed for this vector. GLMs assume that some |
| linear combination of the features in~$x_i$ determines the \emph{mean}~$\mu_i$ |
| of~$y_i$, while the observed $y_i$ is a random outcome of a noise distribution |
| $\Prob[y\mid \mu_i]\,$\footnote{$\Prob[y\mid \mu_i]$ is given by a density function |
| if $y$ is continuous.} |
| with that mean~$\mu_i$: |
| \begin{equation*} |
| x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} |
| \,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i] |
| \end{equation*} |
| |
| In linear regression the response mean $\mu_i$ \emph{equals} some linear combination |
| over~$x_i$, denoted above by~$\eta_i$. |
| In logistic regression with $y\in\{0, 1\}$ (Bernoulli) the mean of~$y$ is the same |
| as $\Prob[y=1]$ and equals $1/(1+e^{-\eta_i})$, the logistic function of~$\eta_i$. |
| In GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone function |
| called the \emph{link function}: $\eta_i = g(\mu_i)$. The unknown linear combination |
| parameters $\beta_j$ are assumed to be the same for all records. |
| |
| The goal of the regression is to estimate the parameters~$\beta_j$ from the observed |
| data. Once the~$\beta_j$'s are accurately estimated, we can make predictions |
| about~$y$ for a new feature vector~$x$. To do so, compute $\eta$ from~$x$ and use |
| the inverted link function $\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of~$y$; |
| then use the distribution $\Prob[y\mid \mu]$ to make predictions about~$y$. |
Both $g(\mu)$ and $\Prob[y\mid \mu]$ are user-provided. Our GLM script supports
a standard set of distributions and link functions; see below for details.
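To make the prediction procedure concrete, here is a minimal illustrative sketch in
Python (not part of the GLM script; the function name is ours), assuming a log link
so that $g^{-1}(\eta) = e^{\eta}$:

```python
import math

def predict_mean(x, beta, beta0=0.0, inv_link=math.exp):
    """Hypothetical helper: compute eta = beta0 + sum_j beta_j * x_j,
    then the response mean mu = g^{-1}(eta).  The default inverse link
    is exp (log link); pass any other smooth monotone inverse link."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return inv_link(eta)

# With beta = (0.3, -0.1), beta0 = 0.5, and x = (1.0, 2.0): eta = 0.6
mu = predict_mean([1.0, 2.0], beta=[0.3, -0.1], beta0=0.5)
```

Predictions about~$y$ then follow from the chosen distribution $\Prob[y\mid \mu]$;
for example, under the Poisson family the predicted mean count is simply~$\mu$.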
| |
| \smallskip |
| \noindent{\bf Usage} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\it% |
| {\tt{}-f }path/\/{\tt{}GLM.dml} |
| {\tt{} -nvargs} |
| {\tt{} X=}path/file |
| {\tt{} Y=}path/file |
| {\tt{} B=}path/file |
| {\tt{} fmt=}format |
| {\tt{} O=}path/file |
| {\tt{} Log=}path/file |
| {\tt{} dfam=}int |
| {\tt{} vpow=}double |
| {\tt{} link=}int |
| {\tt{} lpow=}double |
| {\tt{} yneg=}double |
| {\tt{} icpt=}int |
| {\tt{} reg=}double |
| {\tt{} tol=}double |
| {\tt{} disp=}double |
| {\tt{} moi=}int |
| {\tt{} mii=}int |
| |
| } |
| |
| \smallskip |
| \noindent{\bf Arguments} |
| \begin{Description} |
| \item[{\tt X}:] |
| Location (on HDFS) to read the matrix of feature vectors; each row constitutes |
| an example. |
| \item[{\tt Y}:] |
| Location to read the response matrix, which may have 1 or 2 columns |
| \item[{\tt B}:] |
| Location to store the estimated regression parameters (the $\beta_j$'s), with the |
| intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available |
| \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) |
| Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; |
| see read/write functions in SystemML Language Reference for details. |
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location to write certain summary statistics described in Table~\ref{table:GLM:stats};
by default they are printed to standard output.
\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
Location to store iteration-specific variables for monitoring and debugging purposes;
see Table~\ref{table:GLM:log} for details.
| \item[{\tt dfam}:] (default:\mbox{ }{\tt 1}) |
| Distribution family code to specify $\Prob[y\mid \mu]$, see Table~\ref{table:commonGLMs}:\\ |
{\tt 1} = power distributions with $\Var(y) = a\mu^q$;
| {\tt 2} = binomial or Bernoulli |
| \item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0}) |
| When {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power |
| dependence of the variance of~$y$ on its mean. In particular, use:\\ |
| {\tt 0.0} = Gaussian, |
| {\tt 1.0} = Poisson, |
| {\tt 2.0} = Gamma, |
| {\tt 3.0} = inverse Gaussian |
| \item[{\tt link}:] (default:\mbox{ }{\tt 0}) |
| Link function code to determine the link function~$\eta = g(\mu)$:\\ |
| {\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\ |
| {\tt 1} = power functions, |
| {\tt 2} = logit, |
| {\tt 3} = probit, |
| {\tt 4} = cloglog, |
| {\tt 5} = cauchit |
| \item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0}) |
| When {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link |
| function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$. Common power links:\\ |
| {\tt -2.0} = $1/\mu^2$, |
| {\tt -1.0} = reciprocal, |
| {\tt 0.0} = log, |
| {\tt 0.5} = sqrt, |
| {\tt 1.0} = identity |
| \item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0}) |
| When {\tt dfam=2} and the response matrix $Y$ has 1~column, |
this specifies the $y$-value used as the Bernoulli ``No'' label.
| All other $y$-values are treated as the ``Yes'' label. |
| For example, {\tt yneg=-1.0} may be used when $y\in\{-1, 1\}$; |
| either {\tt yneg=1.0} or {\tt yneg=2.0} may be used when $y\in\{1, 2\}$. |
| \item[{\tt icpt}:] (default:\mbox{ }{\tt 0}) |
| Intercept and shifting/rescaling of the features in~$X$:\\ |
| {\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\ |
| {\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\ |
| {\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1 |
| \item[{\tt reg}:] (default:\mbox{ }{\tt 0.0}) |
| L2-regularization parameter (lambda) |
| \item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}) |
Tolerance (epsilon,~$\eps$) used in the convergence criterion: we terminate the outer
iterations once the objective improvement falls below roughly this fraction of the
current deviance; see below for the exact criterion
| \item[{\tt disp}:] (default:\mbox{ }{\tt 0.0}) |
| Dispersion parameter, or {\tt 0.0} to estimate it from data |
| \item[{\tt moi}:] (default:\mbox{ }{\tt 200}) |
| Maximum number of outer (Fisher scoring) iterations |
| \item[{\tt mii}:] (default:\mbox{ }{\tt 0}) |
Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
limit is imposed
| \end{Description} |
| |
| |
| \begin{table}[t]\small\centerline{% |
| \begin{tabular}{|ll|} |
| \hline |
| Name & Meaning \\ |
| \hline |
| {\tt TERMINATION\_CODE} & A positive integer indicating success/failure as follows: \\ |
| & $1 = {}$Converged successfully; |
| $2 = {}$Maximum \# of iterations reached; \\ |
| & $3 = {}$Input ({\tt X}, {\tt Y}) out of range; |
| $4 = {}$Distribution/link not supported \\ |
| {\tt BETA\_MIN} & Smallest beta value (regression coefficient), excluding the intercept \\ |
| {\tt BETA\_MIN\_INDEX} & Column index for the smallest beta value \\ |
| {\tt BETA\_MAX} & Largest beta value (regression coefficient), excluding the intercept \\ |
| {\tt BETA\_MAX\_INDEX} & Column index for the largest beta value \\ |
| {\tt INTERCEPT} & Intercept value, or NaN if there is no intercept (if {\tt icpt=0}) \\ |
| {\tt DISPERSION} & Dispersion used to scale deviance, provided in {\tt disp} input argument \\ |
| & or estimated (same as {\tt DISPERSION\_EST}) if {\tt disp} argument is${} \leq 0$ \\ |
| {\tt DISPERSION\_EST} & Dispersion estimated from the dataset \\ |
| {\tt DEVIANCE\_UNSCALED} & Deviance from the saturated model, assuming dispersion${} = 1.0$ \\ |
| {\tt DEVIANCE\_SCALED} & Deviance from the saturated model, scaled by {\tt DISPERSION} value \\ |
| \hline |
| \end{tabular}} |
\caption{Besides~$\beta$, the GLM regression script computes the summary statistics listed above.
They are provided in CSV format, one comma-separated name-value pair per line.}
| \label{table:GLM:stats} |
| \end{table} |
| |
| |
| |
| |
| |
| |
| \begin{table}[t]\small\centerline{% |
| \begin{tabular}{|ll|} |
| \hline |
| Name & Meaning \\ |
| \hline |
| {\tt NUM\_CG\_ITERS} & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\ |
| {\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\ |
| {\tt POINT\_STEP\_NORM} & L2-norm of iteration step from old point ($\beta$-vector) to new point \\ |
| {\tt OBJECTIVE} & The loss function we minimize (negative partial log-likelihood) \\ |
| {\tt OBJ\_DROP\_REAL} & Reduction in the objective during this iteration, actual value \\ |
| {\tt OBJ\_DROP\_PRED} & Reduction in the objective predicted by a quadratic approximation \\ |
| {\tt OBJ\_DROP\_RATIO} & Actual-to-predicted reduction ratio, used to update the trust region \\ |
| {\tt GRADIENT\_NORM} & L2-norm of the loss function gradient (omitted if point is rejected) \\ |
| {\tt LINEAR\_TERM\_MIN} & The minimum value of $X \pxp \beta$, used to check for overflows \\ |
| {\tt LINEAR\_TERM\_MAX} & The maximum value of $X \pxp \beta$, used to check for overflows \\ |
| {\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\ |
| {\tt TRUST\_DELTA} & Updated trust region size, the ``delta'' \\ |
| \hline |
| \end{tabular}} |
| \caption{ |
The {\tt Log} file for GLM regression contains the above \mbox{per-}iteration
variables in CSV format, each line containing a (Name, Iteration\#, Value) triple,
with Iteration\#${}=0$ for the initial values.}
| \label{table:GLM:log} |
| \end{table} |
| |
| \begin{table}[t]\hfil |
| \begin{tabular}{|ccccccc|} |
| \hline |
| \multicolumn{4}{|c}{INPUT PARAMETERS} & Distribution & Link & Cano- \\ |
| {\tt dfam} & {\tt vpow} & {\tt link} & {\tt\ lpow} & family & function & nical?\\ |
| \hline |
| {\tt 1} & {\tt 0.0} & {\tt 1} & {\tt -1.0} & Gaussian & inverse & \\ |
| {\tt 1} & {\tt 0.0} & {\tt 1} & {\tt\ 0.0} & Gaussian & log & \\ |
| {\tt 1} & {\tt 0.0} & {\tt 1} & {\tt\ 1.0} & Gaussian & identity & Yes \\ |
| {\tt 1} & {\tt 1.0} & {\tt 1} & {\tt\ 0.0} & Poisson & log & Yes \\ |
| {\tt 1} & {\tt 1.0} & {\tt 1} & {\tt\ 0.5} & Poisson & sq.root & \\ |
| {\tt 1} & {\tt 1.0} & {\tt 1} & {\tt\ 1.0} & Poisson & identity & \\ |
| {\tt 1} & {\tt 2.0} & {\tt 1} & {\tt -1.0} & Gamma & inverse & Yes \\ |
| {\tt 1} & {\tt 2.0} & {\tt 1} & {\tt\ 0.0} & Gamma & log & \\ |
| {\tt 1} & {\tt 2.0} & {\tt 1} & {\tt\ 1.0} & Gamma & identity & \\ |
| {\tt 1} & {\tt 3.0} & {\tt 1} & {\tt -2.0} & Inverse Gauss & $1/\mu^2$ & Yes \\ |
| {\tt 1} & {\tt 3.0} & {\tt 1} & {\tt -1.0} & Inverse Gauss & inverse & \\ |
| {\tt 1} & {\tt 3.0} & {\tt 1} & {\tt\ 0.0} & Inverse Gauss & log & \\ |
| {\tt 1} & {\tt 3.0} & {\tt 1} & {\tt\ 1.0} & Inverse Gauss & identity & \\ |
| \hline |
| {\tt 2} & {\tt *} & {\tt 1} & {\tt\ 0.0} & Binomial & log & \\ |
| {\tt 2} & {\tt *} & {\tt 1} & {\tt\ 0.5} & Binomial & sq.root & \\ |
| {\tt 2} & {\tt *} & {\tt 2} & {\tt\ *} & Binomial & logit & Yes \\ |
| {\tt 2} & {\tt *} & {\tt 3} & {\tt\ *} & Binomial & probit & \\ |
| {\tt 2} & {\tt *} & {\tt 4} & {\tt\ *} & Binomial & cloglog & \\ |
| {\tt 2} & {\tt *} & {\tt 5} & {\tt\ *} & Binomial & cauchit & \\ |
| \hline |
| \end{tabular}\hfil |
| \caption{Common GLM distribution families and link functions. |
| (Here ``{\tt *}'' stands for ``any value.'')} |
| \label{table:commonGLMs} |
| \end{table} |
| |
| \noindent{\bf Details} |
| \smallskip |
| |
| In GLM, the noise distribution $\Prob[y\mid \mu]$ of the response variable~$y$ |
| given its mean~$\mu$ is restricted to have the \emph{exponential family} form |
| \begin{equation} |
| Y \sim\, \Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a} |
| + c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = \E(Y) = b'(\theta). |
| \label{eqn:GLM} |
| \end{equation} |
| Changing the mean in such a distribution simply multiplies all \mbox{$\Prob[y\mid \mu]$} |
| by~$e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so that they again integrate to~1. |
| Parameter $\theta$ is called \emph{canonical}, and the function $\theta = b'^{\,-1}(\mu)$ |
| that relates it to the mean is called the~\emph{canonical link}; constant~$a$ is called |
\emph{dispersion} and rescales the variance of~$y$. Many common distributions can be put
into this form; see Table~\ref{table:commonGLMs}. The canonical parameter~$\theta$
| is often chosen to coincide with~$\eta$, the linear combination of the regression features; |
| other choices for~$\eta$ are possible too. |
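For a concrete instance of~(\ref{eqn:GLM}), consider the Poisson distribution with
mean~$\mu$: it fits the exponential family form with $\theta = \log\mu$,
$b(\theta) = e^{\theta}$, $a = 1$, and $c(y, a) = -\log(y!)$, since
\begin{equation*}
\Prob[y\mid \mu] \,=\, \frac{e^{-\mu}\mu^{y}}{y!}
\,=\, \exp\big(y\log\mu \,-\, e^{\log\mu} \,-\, \log(y!)\big),
\qquad \mu \,=\, b'(\theta) \,=\, e^{\theta}.
\end{equation*}
Here the canonical link $\theta = b'^{\,-1}(\mu) = \log\mu$ is exactly the log link
listed for the Poisson family in Table~\ref{table:commonGLMs}.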
| |
Rather than specifying the canonical link, GLM distributions are commonly defined
by their variance $\Var(y)$ as a function of the mean~$\mu$. It can be shown
| from Eq.~(\ref{eqn:GLM}) that $\Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$. |
| For example, for the Bernoulli distribution $\Var(y) = \mu(1-\mu)$, for the Poisson |
| distribution \mbox{$\Var(y) = \mu$}, and for the Gaussian distribution |
| $\Var(y) = a\cdot 1 = \sigma^2$. |
| It turns out that for many common distributions $\Var(y) = a\mu^q$, a power function. |
| We support all distributions where $\Var(y) = a\mu^q$, as well as the Bernoulli and |
| the binomial distributions. |
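These variance functions are simple to state in code; the sketch below is illustrative
Python (function names are ours), with the dispersion~$a$ factored out as in
$v(\mu) = \Var(y)/a$:

```python
def variance_function(mu, q):
    """v(mu) = mu**q for the power families (dfam=1):
    q=0 Gaussian, q=1 Poisson, q=2 Gamma, q=3 inverse Gaussian."""
    return mu ** q

def bernoulli_variance(mu):
    """v(mu) = mu * (1 - mu) for the Bernoulli/binomial family (dfam=2)."""
    return mu * (1.0 - mu)
```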
| |
| For distributions with $\Var(y) = a\mu^q$ the canonical link is also a power function, |
| namely $\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose canonical link is |
| $\theta = \log\mu$. We support all power link functions in the form $\eta = \mu^s$, |
| dropping any constant factor, with $\eta = \log\mu$ for $s=0$. The binomial distribution |
| has its own family of link functions, which includes logit (the canonical link), |
| probit, cloglog, and cauchit (see Table~\ref{table:binomial_links}); we support these |
| only for the binomial and Bernoulli distributions. Links and distributions are specified |
| via four input parameters: {\tt dfam}, {\tt vpow}, {\tt link}, and {\tt lpow} (see |
| Table~\ref{table:commonGLMs}). |
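As an illustrative Python sketch of the power link family and its inverse (the GLM
script itself is written in DML; these helper names are ours):

```python
import math

def power_link(mu, s):
    """eta = g(mu) = mu**s (the lpow parameter), with the log link
    eta = log(mu) as the special case s = 0."""
    return math.log(mu) if s == 0.0 else mu ** s

def power_link_inverse(eta, s):
    """mu = g^{-1}(eta): exp for the log link, else eta**(1/s)."""
    return math.exp(eta) if s == 0.0 else eta ** (1.0 / s)

# lpow=0.5 is the square-root link: g(4.0) = 2.0 and g^{-1}(2.0) = 4.0
```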
| |
| \begin{table}[t]\hfil |
| \begin{tabular}{|cc|cc|} |
| \hline |
| Name & Link function & Name & Link function \\ |
| \hline |
Logit & $\displaystyle \eta = \log\big(\mu \,/\, (1 - \mu)\big)^{\mathstrut}$ &
Cloglog & $\displaystyle \eta = \log \big(\!- \log(1 - \mu)\big)^{\mathstrut}$ \\
| Probit & $\displaystyle \mu = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut} |
| \!\!\!\!\! e^{-\frac{t^2}{2}} dt$ & |
Cauchit & $\displaystyle \eta = \tan\big(\pi(\mu - 1/2)\big)$ \\
| \hline |
| \end{tabular}\hfil |
| \caption{The supported non-power link functions for the Bernoulli and the binomial |
| distributions. (Here $\mu$~is the Bernoulli mean.)} |
| \label{table:binomial_links} |
| \end{table} |
| |
| The observed response values are provided to the regression script as matrix~$Y$ |
| having 1 or 2 columns. If a power distribution family is selected ({\tt dfam=1}), |
| matrix $Y$ must have 1~column that provides $y_i$ for each~$x_i$ in the corresponding |
| row of matrix~$X$. When {\tt dfam=2} and $Y$ has 1~column, we assume the Bernoulli |
| distribution for $y_i\in\{y_{\mathrm{neg}}, y_{\mathrm{pos}}\}$ with $y_{\mathrm{neg}}$ |
| from the input parameter {\tt yneg} and with $y_{\mathrm{pos}} \neq y_{\mathrm{neg}}$. |
| When {\tt dfam=2} and $Y$ has 2~columns, we assume the |
| binomial distribution; for each row~$i$ in~$X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide |
| the positive and the negative binomial counts respectively. Internally we convert |
| the 1-column Bernoulli into the 2-column binomial with 0-versus-1 counts. |
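The internal 1-column-to-2-column conversion can be sketched as follows (illustrative
Python; the function name is ours):

```python
def bernoulli_to_binomial(y, yneg=0.0):
    """Convert a 1-column Bernoulli response into 2-column binomial
    counts (positive count, negative count).  A value equal to yneg is
    the "No" label; any other value is the "Yes" label."""
    return [(0, 1) if yi == yneg else (1, 0) for yi in y]

# With y in {-1, 1} and yneg=-1.0:
# bernoulli_to_binomial([1.0, -1.0, 1.0], yneg=-1.0)
```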
| |
| We estimate the regression parameters via L2-regularized negative log-likelihood |
| minimization: |
| \begin{equation*} |
| f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big) |
| \,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min |
| \end{equation*} |
| where $\theta_i$ and $b(\theta_i)$ are from~(\ref{eqn:GLM}); note that $a$ |
| and $c(y, a)$ are constant w.r.t.~$\beta$ and can be ignored here. |
| The canonical parameter $\theta_i$ depends on both $\beta$ and~$x_i$: |
| \begin{equation*} |
| \theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\, |
| \big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right) |
| \end{equation*} |
| The user-provided (via {\tt reg}) regularization coefficient $\lambda\geq 0$ can be used |
| to mitigate overfitting and degeneracy in the data. Note that the intercept is never |
| regularized. |
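To make the objective concrete, here is an illustrative Python sketch for the Poisson
family with its canonical log link, where $\theta_i = \eta_i$ and $b(\theta) = e^{\theta}$;
the intercept $\beta_0$ is not regularized:

```python
import math

def objective(beta, beta0, X, y, lam):
    """f(beta) = -sum_i (y_i*theta_i - b(theta_i)) + (lam/2) * ||beta||^2,
    sketched for Poisson with canonical log link (theta_i = eta_i,
    b(theta) = exp(theta)).  X is a list of feature rows, y the responses."""
    f = 0.0
    for xi, yi in zip(X, y):
        theta = beta0 + sum(b * xij for b, xij in zip(beta, xi))
        f -= yi * theta - math.exp(theta)
    return f + 0.5 * lam * sum(b * b for b in beta)  # beta0 excluded
```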
| |
| Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring approximation |
| to the difference $\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$, |
| recomputed at each iteration: |
| \begin{gather*} |
| \varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z, |
| \,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!\diag(w) X \,+\, \lambda I\\ |
| \textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta, |
| \,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\ |
| \forall\,i = 1\ldots n: \,\,\,\, |
| w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1} |
| \!\!\!\!\!\!,\,\,\,\,\,\,\,\,\, |
| u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1} |
| \!\!\!\!\!\!.\,\,\,\, |
| \end{gather*} |
Here $v(\mu_i)=\Var(y_i)/a$ is the variance of $y_i$ as a function of the mean, and
| $g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative. The Fisher scoring |
| approximation is minimized by trust-region conjugate gradient iterations (called the |
| \emph{inner} iterations, with the Fisher scoring iterations as the \emph{outer} |
| iterations), which approximately solve the following problem: |
| \begin{equation*} |
| 1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\, |
| \|z\|_2 \leq \delta |
| \end{equation*} |
| The conjugate gradient algorithm closely follows Algorithm~7.2 on page~171 |
| of~\cite{Nocedal2006:Optimization}. |
| The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$ |
| and updated as described in~\cite{Nocedal2006:Optimization}. |
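The per-iteration quantities $w$, $u$, $A$, and~$G$ can be sketched as follows; this is
illustrative pure Python over small dense lists (the actual script operates on
distributed matrices in DML, and the intercept column is omitted here for brevity):

```python
def fisher_quadratic(X, y, mu, v, g_prime, lam, beta):
    """Build A = X^T diag(w) X + lam*I and G = -X^T u + lam*beta with
    w_i = 1/(v(mu_i) * g'(mu_i)^2) and
    u_i = (y_i - mu_i)/(v(mu_i) * g'(mu_i)).
    v is the variance function and g_prime the link derivative."""
    n, m = len(X), len(X[0])
    w = [1.0 / (v(mu[i]) * g_prime(mu[i]) ** 2) for i in range(n)]
    u = [(y[i] - mu[i]) / (v(mu[i]) * g_prime(mu[i])) for i in range(n)]
    A = [[sum(X[i][j] * w[i] * X[i][k] for i in range(n))
          + (lam if j == k else 0.0) for k in range(m)] for j in range(m)]
    G = [-sum(X[i][j] * u[i] for i in range(n)) + lam * beta[j]
         for j in range(m)]
    return A, G
```

For the Gaussian family with identity link ($v(\mu) = 1$, $g'(\mu) = 1$) this reduces
to the familiar normal equations of linear regression.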
| The user can specify the maximum number of the outer and the inner iterations with |
| input parameters {\tt moi} and {\tt mii}, respectively. The Fisher scoring algorithm |
| terminates successfully if $2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}\eps$ |
| where $\eps > 0$ is a tolerance supplied by the user via {\tt tol}, and $D_1(\beta)$ is |
| the unit-dispersion deviance estimated as |
| \begin{equation*} |
D_1(\beta) \,\,=\,\, 2 \cdot \big(\log\Prob[Y \mid \!
\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
\,\,-\,\,\log\Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)
| \end{equation*} |
| The deviance estimate is also produced as part of the output. Once the Fisher scoring |
| algorithm terminates, if requested by the user, we estimate the dispersion~$a$ from |
| Eq.~\ref{eqn:GLM} using Pearson residuals |
| \begin{equation} |
| \hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)} |
| \label{eqn:dispersion} |
| \end{equation} |
| and use it to adjust our deviance estimate: $D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$. |
If the input argument {\tt disp} is positive we use its value as~$a$; otherwise we
estimate~$\hat{a}$ as above. Note that in~(\ref{eqn:dispersion}) $m$~counts the intercept
($m \leftarrow m+1$) if it is present.
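Eq.~(\ref{eqn:dispersion}) in an illustrative Python sketch (with $m$ already
incremented for the intercept if one is present; the function name is ours):

```python
def dispersion_estimate(y, mu, v, m):
    """Pearson-residual dispersion estimate:
    a_hat = 1/(n - m) * sum_i (y_i - mu_i)**2 / v(mu_i),
    where v is the variance function and m counts the parameters."""
    n = len(y)
    return sum((yi - mi) ** 2 / v(mi) for yi, mi in zip(y, mu)) / (n - m)
```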
| |
| \smallskip |
| \noindent{\bf Returns} |
| \smallskip |
| |
| The estimated regression parameters (the $\hat{\beta}_j$'s) are populated into |
| a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}'' |
| input argument. What this matrix contains, and its size, depends on the input |
| argument {\tt icpt}, which specifies the user's intercept and rescaling choice: |
| \begin{Description} |
| \item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with |
| $B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$. |
| \item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$ |
| has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$, |
| and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient. |
| \item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to |
| mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of |
| the~$\hat{\beta}_j$'s, one for the original features and another for the |
| shifted/rescaled features. Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with |
| $B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled |
| features, in the above format. Note that $B[\cdot, 2]$ are iteratively estimated |
| and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and |
| rescaling. |
| \end{Description} |
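The complementary shifting and rescaling that maps the $B[\cdot, 2]$ estimates back to
$B[\cdot, 1]$ can be sketched as follows, assuming each feature was standardized as
$x'_j = (x_j - \textrm{mean}_j)/\textrm{std}_j$ (illustrative Python; names are ours):

```python
def unscale_coefficients(beta_scaled, b0_scaled, means, stds):
    """Given coefficients fitted on standardized features, recover those
    for the original features: beta_j = beta'_j / std_j, while the
    intercept absorbs the shifts:
    beta_0 = beta'_0 - sum_j beta'_j * mean_j / std_j."""
    beta = [b / s for b, s in zip(beta_scaled, stds)]
    b0 = b0_scaled - sum(b * mu / s
                         for b, mu, s in zip(beta_scaled, means, stds))
    return beta, b0
```

Both parameterizations yield the same linear term~$\eta$ for every record, so
predictions are unaffected by the choice of column.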
| Our script also estimates the dispersion $\hat{a}$ (or takes it from the user's input) |
| and the deviances $D_1(\hat{\beta})$ and $D_{\hat{a}}(\hat{\beta})$, see |
| Table~\ref{table:GLM:stats} for details. A log file with variables monitoring |
| progress through the iterations can also be made available, see Table~\ref{table:GLM:log}. |
| |
| \smallskip |
| \noindent{\bf Examples} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\tt |
| \hml -f GLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx |
| B=/user/biadmin/B.mtx fmt=csv dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.01 tol=0.00000001 |
| disp=1.0 moi=100 mii=10 O=/user/biadmin/stats.csv Log=/user/biadmin/log.csv |
| |
| } |
| |
| \smallskip |
| \noindent{\bf See Also} |
| \smallskip |
| |
For binary classification problems, consider using L2-SVM or binary logistic
| regression; for multiclass classification, use multiclass~SVM or multinomial logistic |
| regression. For the special cases of linear regression and logistic regression, it |
| may be more efficient to use the corresponding specialized scripts instead of~GLM. |