\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Regression Scoring and Prediction}
\noindent{\bf Description}
\smallskip
Script {\tt GLM-predict.dml} is intended to cover all linear model based regressions,
including linear regression, binomial and multinomial logistic regression, and GLM
regressions (Poisson, gamma, binomial with probit link, etc.). Having just one scoring
script for all these regressions simplifies maintenance and enhancement while ensuring
compatible interpretations for output statistics.
The script performs two functions, prediction and scoring. To perform prediction,
the script takes two matrix inputs: a collection of records $X$ (without the response
attribute) and the estimated regression parameters~$B$, also known as~$\beta$.
To perform scoring, in addition to $X$ and~$B$, the script takes the matrix of actual
response values~$Y$ that are compared to the predictions made with $X$ and~$B$. There are also
other, non-matrix input arguments that specify the model and the output
format; see below for the full list.
We assume that our test/scoring dataset is given by an $n\,{\times}\,m$-matrix $X$ of
numerical feature vectors, where each row~$x_i$ represents one feature vector of one
record; we have \mbox{$\dim x_i = m$}. Each record also includes the response
variable~$y_i$ that may be numerical, single-label categorical, or multi-label categorical.
A single-label categorical $y_i$ is an integer category label, one label per record;
a multi-label $y_i$ is a vector of integer counts, one count for each possible label,
which represents multiple single-label events (observations) for the same~$x_i$. Internally
we convert single-label categoricals into multi-label categoricals by replacing each
label~$l$ with an indicator vector~$(0,\ldots,0,1_l,0,\ldots,0)$. In prediction-only
tasks the actual $y_i$'s are not needed by the script, but they are needed for scoring.
To perform prediction, the script matrix-multiplies $X$ and $B$, adding the intercept
if available, then applies the inverse of the model's link function.
All GLMs assume that the linear combination of the features in~$x_i$ and the betas
in~$B$ determines the means~$\mu_i$ of the~$y_i$'s (in numerical or multi-label
categorical form) with $\dim \mu_i = \dim y_i$. The observed $y_i$ is assumed to follow
a specified GLM family distribution $\Prob[y\mid \mu_i]$ with mean(s)~$\mu_i$:
\begin{equation*}
x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
\end{equation*}
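For instance, with the Poisson family ({\tt dfam=1}, {\tt vpow=1.0}) and the log link
({\tt link=1}, {\tt lpow=0.0}), this chain takes the concrete form
\begin{equation*}
x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
\,\,\,\,\mapsto\,\,\,\, \mu_i = \exp(\eta_i) \,\,\,\,\mapsto \,\,\,\, y_i \sim \mathop{\textrm{Poisson}}(\mu_i)
\end{equation*}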
If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number. Then our script's
output matrix $M$ is the $n\,{\times}\,1$-vector of these means~$\mu_i$.
Note that $\mu_i$ predicts the mean of $y_i$, not the actual~$y_i$. For example,
in the Poisson distribution, the mean is usually fractional, but the actual~$y_i$ is
always integer.
If $y_i$ is categorical, i.e.\ a vector of label counts for record~$i$, then $\mu_i$
is a vector of non-negative real numbers, one number $\mu_{i,l}$ for each label~$l$.
In this case we divide the $\mu_{i,l}$ by their sum $\sum_l \mu_{i,l}$ to obtain
predicted label probabilities~\mbox{$p_{i,l}\in [0, 1]$}. The output matrix $M$ is
the $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the number of
records and $k\,{+}\,1$ is the number of categories\footnote{We use $k+1$ because
there are $k$ non-baseline categories and one baseline category, with regression
parameters $B$ having $k$~columns.}. Note again that we do not predict the labels
themselves, nor their actual counts per record, but we predict the labels' probabilities.
Going from predicted probabilities to predicted labels, in the single-label categorical
case, requires extra information such as the cost of false positive versus
false negative errors. For example, if there are 5 categories and we \emph{accurately}
predicted their probabilities as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the
highest-probability label would be wrong 70\% of the time, whereas picking the
lowest-probability label might be right if, say, it represents a diagnosis of cancer
or another rare and serious outcome. Hence, we keep this step outside the scope of
{\tt GLM-predict.dml} for now.
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}GLM-predict.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} M=}path/file
{\tt{} O=}path/file
{\tt{} dfam=}int
{\tt{} vpow=}double
{\tt{} link=}int
{\tt{} lpow=}double
{\tt{} disp=}double
{\tt{} fmt=}format
}
\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature vectors; each row
constitutes one feature vector (one record)
\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
Location to read the response matrix $Y$ needed for scoring (but optional for prediction),
with the following dimensions: \\
$n \:{\times}\: 1$: acceptable for all distributions ({\tt dfam=1} or {\tt 2} or {\tt 3}) \\
$n \:{\times}\: 2$: for binomial ({\tt dfam=2}) if given by (\#pos, \#neg) counts \\
$n \:{\times}\: (k\,{+}\,1)$: for multinomial ({\tt dfam=3}) if given by category counts
\item[{\tt M}:] (default:\mbox{ }{\tt " "})
Location to write, if requested, the matrix of predicted response means (for {\tt dfam=1}) or
probabilities (for {\tt dfam=2} or {\tt 3}):\\
$n \:{\times}\: 1$: for power-type distributions ({\tt dfam=1}) \\
$n \:{\times}\: 2$: for binomial distribution ({\tt dfam=2}), col\#~2 is the ``No'' probability \\
$n \:{\times}\: (k\,{+}\,1)$: for multinomial logit ({\tt dfam=3}), col\#~$k\,{+}\,1$ is for the baseline
\item[{\tt B}:]
Location to read matrix $B$ of the \mbox{betas}, i.e.\ estimated GLM regression parameters,
with the intercept at row\#~$m\,{+}\,1$ if available:\\
$\dim(B) \,=\, m \:{\times}\: k$: do not add intercept \\
$\dim(B) \,=\, (m\,{+}\,1) \:{\times}\: k$: add intercept as given by the last $B$-row \\
if $k > 1$, use only $B${\tt [, 1]} unless it is Multinomial Logit ({\tt dfam=3})
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location to store the CSV file with the goodness-of-fit statistics defined in
Table~\ref{table:GLMpred:stats}; by default they are printed to standard output
\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
GLM distribution family code to specify the type of distribution $\Prob[y\,|\,\mu]$
that we assume: \\
{\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$, see Table~\ref{table:commonGLMs};\\
{\tt 2} = binomial;
{\tt 3} = multinomial logit
\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
Power for variance defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt dfam}$\,{\neq}\,1$):
when {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
dependence of the variance of~$y$ on its mean. In particular, use:\\
{\tt 0.0} = Gaussian,
{\tt 1.0} = Poisson,
{\tt 2.0} = Gamma,
{\tt 3.0} = inverse Gaussian
\item[{\tt link}:] (default:\mbox{ }{\tt 0})
Link function code to determine the link function~$\eta = g(\mu)$, ignored for
multinomial logit ({\tt dfam=3}):\\
{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
{\tt 1} = power functions,
{\tt 2} = logit,
{\tt 3} = probit,
{\tt 4} = cloglog,
{\tt 5} = cauchit
\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
Power for link function defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt link}$\,{\neq}\,1$):
when {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$. Common power links:\\
{\tt -2.0} = $1/\mu^2$,
{\tt -1.0} = reciprocal,
{\tt 0.0} = log,
{\tt 0.5} = sqrt,
{\tt 1.0} = identity
\item[{\tt disp}:] (default:\mbox{ }{\tt 1.0})
Dispersion value, when available; must be positive
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix {\tt M} file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\end{Description}
\begin{table}[t]\small\centerline{%
\begin{tabular}{|lccl|}
\hline
Name & \hspace{-0.6em}CID\hspace{-0.5em} & \hspace{-0.3em}Disp?\hspace{-0.6em} & Meaning \\
\hline
{\tt LOGLHOOD\_Z} & & + & Log-likelihood $Z$-score (in st.\ dev.'s from the mean) \\
{\tt LOGLHOOD\_Z\_PVAL} & & + & Log-likelihood $Z$-score p-value, two-sided \\
{\tt PEARSON\_X2} & & + & Pearson residual $X^2$-statistic \\
{\tt PEARSON\_X2\_BY\_DF} & & + & Pearson $X^2$ divided by degrees of freedom \\
{\tt PEARSON\_X2\_PVAL} & & + & Pearson $X^2$ p-value \\
{\tt DEVIANCE\_G2} & & + & Deviance from the saturated model $G^2$-statistic \\
{\tt DEVIANCE\_G2\_BY\_DF} & & + & Deviance $G^2$ divided by degrees of freedom \\
{\tt DEVIANCE\_G2\_PVAL} & & + & Deviance $G^2$ p-value \\
{\tt AVG\_TOT\_Y} & + & & $Y$-column average for an individual response value \\
{\tt STDEV\_TOT\_Y} & + & & $Y$-column st.\ dev.\ for an individual response value \\
{\tt AVG\_RES\_Y} & + & & $Y$-column residual average of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
{\tt STDEV\_RES\_Y} & + & & $Y$-column residual st.\ dev.\ of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
{\tt PRED\_STDEV\_RES} & + & + & Model-predicted $Y$-column residual st.\ deviation\\
{\tt PLAIN\_R2} & + & & Plain $R^2$ of $Y$-column residual with bias included \\
{\tt ADJUSTED\_R2} & + & & Adjusted $R^2$ of $Y$-column residual w.\ bias included \\
{\tt PLAIN\_R2\_NOBIAS} & + & & Plain $R^2$ of $Y$-column residual, bias subtracted \\
{\tt ADJUSTED\_R2\_NOBIAS} & + & & Adjusted $R^2$ of $Y$-column residual, bias subtracted \\
\hline
\end{tabular}}
\caption{The above goodness-of-fit statistics are provided in CSV format, one per line, with four
columns: (Name, [CID], [Disp?], Value). The columns are:
``Name'' is the string identifier for the statistic, see the table;
``CID'' is an optional integer value that specifies the $Y$-column index for \mbox{per-}column statistics
(note that a bi-/multinomial one-column {\tt Y}-input is converted into multi-column);
``Disp?'' is an optional Boolean value ({\tt TRUE} or {\tt FALSE}) that tells us
whether or not scaling by the input dispersion parameter {\tt disp} has been applied to this
statistic;
``Value'' is the value of the statistic.}
\label{table:GLMpred:stats}
\end{table}
\noindent{\bf Details}
\smallskip
The output matrix $M$ of predicted means (or probabilities) is computed by matrix-multiplying $X$
with the first column of~$B$ or with the whole~$B$ in the multinomial case, adding the intercept
if available (conceptually, appending an extra column of ones to~$X$); then applying the inverse
of the model's link function. The difference between ``means'' and ``probabilities'' in the
categorical case becomes significant when there are ${\geq}\,2$ observations per record
(with the multi-label records) or when labels such as $-1$ and~$1$ are viewed and averaged
as numerical response values (with the single-label records). To avoid any \mbox{mix-up} or
information loss, we separately return the predicted probability of each category label for each
record.
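\smallskip\noindent For illustration, here is a minimal numpy sketch of this prediction step;
it is not the DML implementation, and the function names and the power-link assumption for
{\tt dfam=1} are ours:
\begin{verbatim}
# Minimal numpy sketch of the prediction step (illustrative, not GLM-predict.dml).
import numpy as np

def predict_means_power(X, B, lpow=1.0, intercept=True):
    # dfam=1 with a power link: eta = X b (+ intercept), mu = eta^(1/lpow),
    # or mu = exp(eta) for the log link (lpow = 0).
    b = B[:, 0]                            # only the first column of B is used
    eta = X @ b[:-1] + b[-1] if intercept else X @ b
    return np.exp(eta) if lpow == 0.0 else eta ** (1.0 / lpow)

def predict_probabilities_multinomial(X, B, intercept=True):
    # dfam=3: k columns of B for the non-baseline categories, baseline has eta = 0.
    eta = X @ B[:-1, :] + B[-1, :] if intercept else X @ B
    eta = np.hstack([eta, np.zeros((eta.shape[0], 1))])
    p = np.exp(eta - eta.max(axis=1, keepdims=True))   # numerically stable softmax
    return p / p.sum(axis=1, keepdims=True)
\end{verbatim}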
When the ``actual'' response values $Y$ are available, the summary statistics are computed
and written out as described in Table~\ref{table:GLMpred:stats}. Below we discuss each of
these statistics in detail. Note that in the categorical case (binomial and multinomial)
$Y$ is internally represented as the matrix of observation counts for each label in each record,
rather than just the label~ID for each record. The input~$Y$ may already be a matrix of counts,
in which case it is used as-is. But if $Y$ is given as a vector of response labels, each
response label is converted into an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$ where~$l$
is the label~ID for this record. All negative (e.g.~$-1$) or zero label~IDs are converted to
the maximum label~ID plus one. The largest label~ID is viewed as the ``baseline'' as explained
in the section on Multinomial Logistic Regression. We assume that there are $k\geq 1$
non-baseline categories and one (last) baseline category.
We also estimate residual variances for each response value; these are not output, but are
used inside the summary statistics, both scaled and unscaled by the input dispersion
parameter {\tt disp}, as described below.
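\smallskip\noindent A minimal numpy sketch of this label-to-counts conversion (illustrative only;
it assumes integer labels with at least one positive label~ID):
\begin{verbatim}
import numpy as np

def labels_to_counts(y):
    # Convert a single-label response vector into an n x (k+1) count matrix.
    y = np.asarray(y, dtype=int).ravel()
    max_label = y.max()                       # largest (positive) label ID
    y = np.where(y <= 0, max_label + 1, y)    # zero/negative IDs -> new last label
    k_plus_1 = y.max()                        # the last label is the baseline
    Y = np.zeros((y.size, k_plus_1))
    Y[np.arange(y.size), y - 1] = 1.0         # one observation per record
    return Y
\end{verbatim}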
\smallskip
{\tt LOGLHOOD\_Z} and {\tt LOGLHOOD\_Z\_PVAL} statistics measure how far the log-likelihood
of~$Y$ deviates from its expected value according to the model. The script implements them
only for the binomial and the multinomial distributions, returning NaN for all other distributions.
Pearson's~$X^2$ and deviance~$G^2$ often perform poorly with bi- and multinomial distributions
due to low cell counts, hence we need this extra goodness-of-fit measure. To compute these
statistics, we use:
\begin{Itemize}
\item the $n\times (k\,{+}\,1)$-matrix~$Y$ of multi-label response counts, in which $y_{i,j}$
is the number of times label~$j$ was observed in record~$i$;
\item the model-estimated probability matrix~$P$ of the same dimensions that satisfies
$\sum_{j=1}^{k+1} p_{i,j} = 1$ for all~$i=1,\ldots,n$ and where $p_{i,j}$ is the model
probability of observing label~$j$ in record~$i$;
\item the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of observations
in record~$i$ (all $N_i = 1$ if each record has only one response label).
\end{Itemize}
We start by computing the multinomial log-likelihood of $Y$ given~$P$ and~$N$, as well as
the expected log-likelihood given a random~$Y$ and the variance of this log-likelihood if
$Y$ indeed follows the proposed distribution:
\begin{align*}
\ell (Y) \,\,&=\,\, \log \Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1} \,y_{i,j}\log p_{i,j} \\
\E_Y \ell (Y) \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j}
\,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\
\Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2
- \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right)
\end{align*}
Then we compute the $Z$-score as the difference between the actual and the expected
log-likelihood $\ell(Y)$ divided by its expected standard deviation, and its two-sided
p-value in the Normal distribution assumption ($\ell(Y)$ should approach normality due
to the Central Limit Theorem):
\begin{equation*}
Z \,=\, \frac {\ell(Y) - \E_Y \ell(Y)}{\sqrt{\Var_Y \ell(Y)}};\quad
\mathop{\textrm{p-value}}(Z) \,=\, \Prob \Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big]
\end{equation*}
A low p-value would indicate ``underfitting'' if $Z\ll 0$ or ``overfitting'' if $Z\gg 0$. Here
``overfitting'' means that higher-probability labels occur more often than their probabilities
suggest.
We also apply the dispersion input ({\tt disp}) to compute the ``scaled'' version of the $Z$-score
and its p-value. Since $\ell(Y)$ is a linear function of~$Y$, multiplying the GLM-predicted
variance of~$Y$ by {\tt disp} results in multiplying $\Var_Y \ell(Y)$ by the same {\tt disp}. This, in turn,
translates into dividing the $Z$-score by the square root of the dispersion:
\begin{equation*}
Z_{\texttt{disp}} \,=\, \big(\ell(Y) \,-\, \E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot\Var_Y \ell(Y)}
\,=\, Z / \sqrt{\texttt{disp}}
\end{equation*}
Finally, we recalculate the p-value with this new $Z$-score.
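\smallskip\noindent As a reference, here is a minimal numpy/scipy sketch of this statistic
(illustrative, not the DML implementation); {\tt Y} is the $n\times(k\,{+}\,1)$ count matrix
and {\tt P} the matching probability matrix:
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def loglhood_z(Y, P, disp=1.0):
    # LOGLHOOD_Z and its two-sided p-value for categorical data.
    N = Y.sum(axis=1)                                   # observations per record
    logP = np.log(P)
    ell   = np.sum(Y * logP)                            # actual log-likelihood
    e_ell = np.sum(N * np.sum(P * logP, axis=1))        # expected log-likelihood
    v_ell = np.sum(N * (np.sum(P * logP ** 2, axis=1)
                        - np.sum(P * logP, axis=1) ** 2))
    z = (ell - e_ell) / np.sqrt(disp * v_ell)           # disp = 1.0 gives LOGLHOOD_Z
    return z, 2.0 * norm.sf(abs(z))
\end{verbatim}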
\smallskip
{\tt PEARSON\_X2}, {\tt PEARSON\_X2\_BY\_DF}, and {\tt PEARSON\_X2\_PVAL}:
Pearson's residual $X^2$-statistic is a commonly used goodness-of-fit measure for linear models~\cite{McCullagh1989:GLM}.
The idea is to measure how well the model-predicted means and variances match the actual behavior
of response values. For each record $i$, we estimate the mean $\mu_i$ and the variance $v_i$
(or $\texttt{disp}\cdot v_i$) and use them to normalize the residual:
$r_i = (y_i - \mu_i) / \sqrt{v_i}$. These normalized residuals are then squared, aggregated
by summation, and tested against an appropriate $\chi^2$~distribution. The computation of~$X^2$
is slightly different for categorical data (bi- and multinomial) than it is for numerical data,
since $y_i$ has multiple correlated dimensions~\cite{McCullagh1989:GLM}:
\begin{equation*}
X^2\,\textrm{(numer.)} \,=\, \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad
X^2\,\textrm{(categ.)} \,=\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i
\hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}
\end{equation*}
The number of degrees of freedom~\#d.f.\ for the $\chi^2$~distribution is $n - m$ for numerical data and
$(n - m)k$ for categorical data, where $k = \mathop{\texttt{ncol}}(Y) - 1$. Given the dispersion
parameter {\tt disp}, the $X^2$ statistic is scaled by division: \mbox{$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$}.
If the dispersion is accurate, $X^2 / \texttt{disp}$ should be close to~\#d.f. In fact, $X^2 / \textrm{\#d.f.}$
over the \emph{training} data is the dispersion estimator used in our {\tt GLM.dml} script,
see~(\ref{eqn:dispersion}). Here we provide $X^2 / \textrm{\#d.f.}$ and $X^2_{\texttt{disp}} / \textrm{\#d.f.}$
as {\tt PEARSON\_X2\_BY\_DF} to enable dispersion comparison between the training data and
the test data.
NOTE: For categorical data, both Pearson's $X^2$ and the deviance $G^2$ are unreliable (i.e.\ do not
approach the $\chi^2$~distribution) unless the predicted means of multi-label counts
$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$ are fairly large: all ${\geq}\,1$ and 80\% are
at least~$5$~\cite{Cochran1954:chisq}. They should not be used for ``one label per record'' categoricals.
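\smallskip\noindent For reference, a numpy/scipy sketch of the categorical $X^2$ computation
(illustrative only, with the dispersion scaling applied):
\begin{verbatim}
import numpy as np
from scipy.stats import chi2

def pearson_x2_categorical(Y, P, m, disp=1.0):
    # Pearson X^2 for an n x (k+1) count matrix Y, probabilities P, m features.
    n, k_plus_1 = Y.shape
    N = Y.sum(axis=1, keepdims=True)
    expected = N * P                              # model-predicted cell counts
    x2 = np.sum((Y - expected) ** 2 / expected) / disp
    df = (n - m) * (k_plus_1 - 1)                 # (n - m) * k degrees of freedom
    return x2, x2 / df, chi2.sf(x2, df)           # X2, X2_BY_DF, X2_PVAL
\end{verbatim}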
\smallskip
{\tt DEVIANCE\_G2}, {\tt DEVIANCE\_G2\_BY\_DF}, and {\tt DEVIANCE\_G2\_PVAL}:
Deviance $G^2$ is the log of the likelihood ratio between the ``saturated'' model and the
linear model being tested for the given dataset, multiplied by two:
\begin{equation}
G^2 \,=\, 2 \,\log \frac{\Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}%
{\Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]}
\label{eqn:GLMpred:deviance}
\end{equation}
The ``saturated'' model sets the mean $\mu_i^{\mathrm{sat}}$ to equal~$y_i$ for every record
(for categorical data, $p^{\mathrm{sat}}_{i,j} = y_{i,j} / N_i$), which represents the ``perfect fit.''
For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at a boundary, by continuity we set
$0 \log 0 = 0$. The GLM~likelihood functions defined in~(\ref{eqn:GLM}) simplify in
ratio~(\ref{eqn:GLMpred:deviance}) because the term $c(y, a)$ is the same in both models
and cancels out.
The log of a likelihood ratio between two nested models, times two, is known to approach
a $\chi^2$ distribution as $n\to\infty$ if both models have fixed parameter spaces.
But this is not the case for the ``saturated'' model: it adds more parameters with each record.
In practice, however, $\chi^2$ distributions are used to compute the p-value of~$G^2$~\cite{McCullagh1989:GLM}.
The number of degrees of freedom~\#d.f.\ and the treatment of dispersion are the same as for
Pearson's~$X^2$, see above.
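\smallskip\noindent Analogously, here is a sketch of the categorical deviance, using the
saturated-model probabilities $y_{i,j}/N_i$ and the $0 \log 0 = 0$ convention (illustrative only):
\begin{verbatim}
import numpy as np
from scipy.stats import chi2

def deviance_g2_categorical(Y, P, m, disp=1.0):
    # G^2 = 2 * sum y_ij * log( y_ij / (N_i * p_ij) ), with 0 * log(0) = 0.
    n, k_plus_1 = Y.shape
    N = Y.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(Y > 0, Y / (N * P), 1.0)   # log(1.0) = 0 where y_ij = 0
    g2 = 2.0 * np.sum(Y * np.log(ratio)) / disp
    df = (n - m) * (k_plus_1 - 1)
    return g2, g2 / df, chi2.sf(g2, df)             # G2, G2_BY_DF, G2_PVAL
\end{verbatim}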
\Paragraph{Column-wise statistics.} The rest of the statistics are computed separately
for each column of~$Y$. As explained above, $Y$~has two or more columns in the bi- and multinomial case,
either at input or after conversion. Moreover, each $y_{i,j}$ in record~$i$ with $N_i \geq 2$ is
counted as $N_i$ separate observations $y_{i,j,l}$ of 0 or~1 (where $l=1,\ldots,N_i$) with
$y_{i,j}$~ones and $N_i-y_{i,j}$ zeros.
For power distributions, including linear regression, $Y$~has only one column and all
$N_i = 1$, so the statistics are computed for all~$Y$ with each record counted once.
Below we denote $N = \sum_{i=1}^n N_i \,\geq n$.
Here is the total average and the residual average (residual bias) of~$y_{i,j,l}$ for each $Y$-column:
\begin{equation*}
\texttt{AVG\_TOT\_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n y_{i,j}; \quad
\texttt{AVG\_RES\_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})
\end{equation*}
Dividing by~$N$ (rather than~$n$) gives the averages for~$y_{i,j,l}$ (rather than~$y_{i,j}$).
The total variance, and the standard deviation, for individual observations~$y_{i,j,l}$ is
estimated from the total variance for response values~$y_{i,j}$ using the independence assumption:
$\Var y_{i,j} = \Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} \Var y_{i,j,l}$.
This allows us to estimate the sum of squares for~$y_{i,j,l}$ via the sum of squares for~$y_{i,j}$:
\begin{equation*}
\texttt{STDEV\_TOT\_Y}_j \,=\,
\Bigg[\frac{1}{N-1} \sum_{i=1}^n \Big( y_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n y_{i'\!,j}\Big)^2\Bigg]^{1/2}
\end{equation*}
Analogously, we estimate the standard deviation of the residual $y_{i,j,l} - \mu_{i,j,l}$:
\begin{equation*}
\texttt{STDEV\_RES\_Y}_j \,=\,
\Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n \Big( y_{i,j} - \mu_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}
\end{equation*}
Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if it does not.
The estimated standard deviations can be compared to the model-predicted residual standard deviation
computed from the predicted means by the GLM variance formula and scaled by the dispersion:
\begin{equation*}
\texttt{PRED\_STDEV\_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}
\end{equation*}
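\smallskip\noindent A numpy sketch of these per-column statistics (illustrative only; {\tt Mu}
denotes the matrix of predicted means $\mu_{i,j}$, {\tt N} the vector of per-record observation
counts $N_i$, and {\tt var\_fn} the GLM variance function, e.g.\ $v(\mu)=\mu$ for Poisson):
\begin{verbatim}
import numpy as np

def column_stats(Y, Mu, N, m_prime):
    # AVG_TOT_Y, AVG_RES_Y, STDEV_TOT_Y, STDEV_RES_Y per Y-column.
    N_tot = N.sum()
    avg_tot = Y.sum(axis=0) / N_tot
    avg_res = (Y - Mu).sum(axis=0) / N_tot
    c_tot = Y - np.outer(N, avg_tot)               # centered response
    c_res = (Y - Mu) - np.outer(N, avg_res)        # centered residual
    stdev_tot = np.sqrt((c_tot ** 2).sum(axis=0) / (N_tot - 1))
    stdev_res = np.sqrt((c_res ** 2).sum(axis=0) / (N_tot - m_prime))
    return avg_tot, avg_res, stdev_tot, stdev_res

def pred_stdev_res(Mu, N, var_fn, disp=1.0):
    # PRED_STDEV_RES: model-predicted residual st. dev. per Y-column.
    return np.sqrt(disp * var_fn(Mu).sum(axis=0) / N.sum())
\end{verbatim}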
We also compute the $R^2$ statistics for each column of~$Y$, see Table~\ref{table:GLMpred:R2} for details.
We compute two versions of~$R^2$: in one version the residual sum-of-squares (RSS) includes any bias in
the residual that might be present (due to the lack of, or inaccuracy in, the intercept); in the other
version of~RSS the bias is subtracted by ``centering'' the residual. In both cases we subtract the bias in the total
sum-of-squares (in the denominator), and $m'$ equals $m$~with the intercept or $m+1$ without the intercept.
\begin{table}[t]\small\centerline{%
\begin{tabular}{|c|c|}
\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares includes the bias contribution:} \\
\hline
\multicolumn{1}{|l|}{\tt PLAIN\_R2${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2${}_j \,\,= {}$} \\
$ \displaystyle 1 -
\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ &
$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}} \,
\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ \\
\hline
\multicolumn{2}{c}{} \\
\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares is centered so that the bias is subtracted:} \\
\hline
\multicolumn{1}{|l|}{\tt PLAIN\_R2\_NOBIAS${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2\_NOBIAS${}_j \,\,= {}$} \\
$ \displaystyle 1 -
\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
\sum\limits_{i'=1}^n (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ &
$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \,
\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
\sum\limits_{i'=1}^n (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ \\
\hline
\end{tabular}}
\caption{The $R^2$ statistics we compute in {\tt GLM-predict.dml}}
\label{table:GLMpred:R2}
\end{table}
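\smallskip\noindent A corresponding numpy sketch of the four $R^2$ variants (illustrative only,
reusing the notation from the column-wise statistics above):
\begin{verbatim}
import numpy as np

def r2_stats(Y, Mu, N, m, m_prime):
    # PLAIN_R2, ADJUSTED_R2, PLAIN_R2_NOBIAS, ADJUSTED_R2_NOBIAS per Y-column.
    N_tot = N.sum()
    avg_y   = Y.sum(axis=0) / N_tot
    avg_res = (Y - Mu).sum(axis=0) / N_tot
    tss        = ((Y - np.outer(N, avg_y)) ** 2).sum(axis=0)        # total SS
    rss_bias   = ((Y - Mu) ** 2).sum(axis=0)                        # RSS with bias
    rss_nobias = (((Y - Mu) - np.outer(N, avg_res)) ** 2).sum(axis=0)
    return (1 - rss_bias / tss,
            1 - (N_tot - 1) / (N_tot - m) * rss_bias / tss,
            1 - rss_nobias / tss,
            1 - (N_tot - 1) / (N_tot - m_prime) * rss_nobias / tss)
\end{verbatim}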
\smallskip
\noindent{\bf Returns}
\smallskip
The matrix of predicted means (if the response is numerical) or probabilities (if the response
is categorical), see ``Description'' subsection above for more information. Given {\tt Y}, we
return some statistics in CSV format as described in Table~\ref{table:GLMpred:stats} and in the
above text.
\smallskip
\noindent{\bf Examples}
\smallskip
Note that in the examples below the value of the {\tt disp} input argument
is set arbitrarily. The correct dispersion value should be computed from the training
data during model estimation, or omitted if unknown (which sets it to~{\tt 1.0}).
\smallskip\noindent
Linear regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0 disp=5.67
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Linear regression example, prediction only (no {\tt Y} given):
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
}\smallskip\noindent
Binomial logistic regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=2 link=2 disp=3.0004464
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Binomial probit regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=2 link=3 disp=3.0004464
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Multinomial logistic regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=3
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Poisson regression with the log link example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=1.0 link=1 lpow=0.0 disp=3.45
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Gamma regression with the inverse (reciprocal) link example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=2.0 link=1 lpow=-1.0 disp=1.99118
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}