\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Regression Scoring and Prediction}
\noindent{\bf Description}
\smallskip
Script {\tt GLM-predict.dml} is intended to cover all linear model based regressions,
including linear regression, binomial and multinomial logistic regression, and GLM
regressions (Poisson, gamma, binomial with probit link, etc.). Having just one scoring
script for all these regressions simplifies maintenance and enhancement while ensuring
compatible interpretations for output statistics.
The script performs two functions, prediction and scoring. To perform prediction,
the script takes two matrix inputs: a collection of records $X$ (without the response
attribute) and the estimated regression parameters~$B$, also known as~$\beta$.
To perform scoring, in addition to $X$ and~$B$, the script takes the matrix of actual
response values~$Y$ that are compared to the predictions made with $X$ and~$B$. There are also
other, non-matrix input arguments that specify the model and the output
format; see below for the full list.
We assume that our test/scoring dataset is given by an $n\,{\times}\,m$-matrix $X$ of
numerical feature vectors, where each row~$x_i$ represents one feature vector of one
record; we have \mbox{$\dim x_i = m$}. Each record also includes the response
variable~$y_i$ that may be numerical, single-label categorical, or multi-label categorical.
A single-label categorical $y_i$ is an integer category label, one label per record;
a multi-label $y_i$ is a vector of integer counts, one count for each possible label,
which represents multiple single-label events (observations) for the same~$x_i$. Internally
we convert single-label categoricals into multi-label categoricals by replacing each
label~$l$ with an indicator vector~$(0,\ldots,0,1_l,0,\ldots,0)$. In prediction-only
tasks the actual $y_i$'s are not needed by the script, but they are needed for scoring.
To perform prediction, the script matrix-multiplies $X$ and $B$, adding the intercept
if available, then applies the inverse of the model's link function.
All GLMs assume that the linear combination of the features in~$x_i$ and the betas
in~$B$ determines the means~$\mu_i$ of the~$y_i$'s (in numerical or multi-label
categorical form) with $\dim \mu_i = \dim y_i$. The observed $y_i$ is assumed to follow
a specified GLM family distribution $\Prob[y\mid \mu_i]$ with mean(s)~$\mu_i$:
\begin{equation*}
x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
\end{equation*}
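For instance, with the Poisson family ({\tt dfam=1}, {\tt vpow=1.0}) and the log link
({\tt link=1}, {\tt lpow=0.0}), this chain takes the concrete form
\begin{equation*}
x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
\,\,\,\,\mapsto\,\,\,\, \mu_i = \exp(\eta_i) \,\,\,\,\mapsto \,\,\,\, y_i \sim \mathop{\textrm{Poisson}}(\mu_i)
\end{equation*}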
If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number. Then our script's
output matrix $M$ is the $n\,{\times}\,1$-vector of these means~$\mu_i$.
Note that $\mu_i$ predicts the mean of $y_i$, not the actual~$y_i$. For example,
in the Poisson distribution, the mean is usually fractional, but the actual~$y_i$ is
always integer.
If $y_i$ is categorical, i.e.\ a vector of label counts for record~$i$, then $\mu_i$
is a vector of non-negative real numbers, one number $\mu_{i,l}$ for each label~$l$.
In this case we divide the $\mu_{i,l}$ by their sum $\sum_l \mu_{i,l}$ to obtain
predicted label probabilities~\mbox{$p_{i,l}\in [0, 1]$}. The output matrix $M$ is
the $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the number of
records and $k\,{+}\,1$ is the number of categories\footnote{We use $k+1$ because
there are $k$ non-baseline categories and one baseline category, with regression
parameters $B$ having $k$~columns.}. Note again that we do not predict the labels
themselves, nor their actual counts per record, but we predict the labels' probabilities.
Going from predicted probabilities to predicted labels, in the single-label categorical
case, requires extra information such as the cost of false positive versus
false negative errors. For example, if there are 5 categories and we \emph{accurately}
predicted their probabilities as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the
highest-probability label would be wrong 70\% of the time, whereas picking the
lowest-probability label might be right if, say, it represents a diagnosis of cancer
or another rare and serious outcome. Hence, we keep this step outside the scope of
{\tt GLM-predict.dml} for now.
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}GLM-predict.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} M=}path/file
{\tt{} O=}path/file
{\tt{} dfam=}int
{\tt{} vpow=}double
{\tt{} link=}int
{\tt{} lpow=}double
{\tt{} disp=}double
{\tt{} fmt=}format
}
\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature vectors; each row
constitutes one feature vector (one record)
\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
Location to read the response matrix $Y$ needed for scoring (but optional for prediction),
with the following dimensions: \\
$n \:{\times}\: 1$: acceptable for all distributions ({\tt dfam=1} or {\tt 2} or {\tt 3}) \\
$n \:{\times}\: 2$: for binomial ({\tt dfam=2}) if given by (\#pos, \#neg) counts \\
$n \:{\times}\: (k\,{+}\,1)$: for multinomial ({\tt dfam=3}) if given by category counts
\item[{\tt M}:] (default:\mbox{ }{\tt " "})
Location to write, if requested, the matrix of predicted response means (for {\tt dfam=1}) or
probabilities (for {\tt dfam=2} or {\tt 3}):\\
$n \:{\times}\: 1$: for power-type distributions ({\tt dfam=1}) \\
$n \:{\times}\: 2$: for binomial distribution ({\tt dfam=2}), col\#~2 is the ``No'' probability \\
$n \:{\times}\: (k\,{+}\,1)$: for multinomial logit ({\tt dfam=3}), col\#~$k\,{+}\,1$ is for the baseline
\item[{\tt B}:]
Location to read matrix $B$ of the \mbox{betas}, i.e.\ estimated GLM regression parameters,
with the intercept at row\#~$m\,{+}\,1$ if available:\\
$\dim(B) \,=\, m \:{\times}\: k$: do not add intercept \\
$\dim(B) \,=\, (m\,{+}\,1) \:{\times}\: k$: add intercept as given by the last $B$-row \\
if $k > 1$, use only $B${\tt [, 1]} unless it is Multinomial Logit ({\tt dfam=3})
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location to store the CSV file with the goodness-of-fit statistics defined in
Table~\ref{table:GLMpred:stats}; by default they are printed to standard output
\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
GLM distribution family code to specify the type of distribution $\Prob[y\,|\,\mu]$
that we assume: \\
{\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$, see Table~\ref{table:commonGLMs};\\
{\tt 2} = binomial;
{\tt 3} = multinomial logit
\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
Power for variance defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt dfam}$\,{\neq}\,1$):
when {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
dependence of the variance of~$y$ on its mean. In particular, use:\\
{\tt 0.0} = Gaussian,
{\tt 1.0} = Poisson,
{\tt 2.0} = Gamma,
{\tt 3.0} = inverse Gaussian
\item[{\tt link}:] (default:\mbox{ }{\tt 0})
Link function code to determine the link function~$\eta = g(\mu)$, ignored for
multinomial logit ({\tt dfam=3}):\\
{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
{\tt 1} = power functions,
{\tt 2} = logit,
{\tt 3} = probit,
{\tt 4} = cloglog,
{\tt 5} = cauchit
\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
Power for link function defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt link}$\,{\neq}\,1$):
when {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$. Common power links:\\
{\tt -2.0} = $1/\mu^2$,
{\tt -1.0} = reciprocal,
{\tt 0.0} = log,
{\tt 0.5} = sqrt,
{\tt 1.0} = identity
\item[{\tt disp}:] (default:\mbox{ }{\tt 1.0})
Dispersion value, when available; must be positive
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix {\tt M} file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\end{Description}
\begin{table}[t]\small\centerline{%
\begin{tabular}{|lccl|}
\hline
Name & \hspace{-0.6em}CID\hspace{-0.5em} & \hspace{-0.3em}Disp?\hspace{-0.6em} & Meaning \\
\hline
{\tt LOGLHOOD\_Z} & & + & Log-likelihood $Z$-score (in st.\ dev.'s from the mean) \\
{\tt LOGLHOOD\_Z\_PVAL} & & + & Log-likelihood $Z$-score p-value, two-sided \\
{\tt PEARSON\_X2} & & + & Pearson residual $X^2$-statistic \\
{\tt PEARSON\_X2\_BY\_DF} & & + & Pearson $X^2$ divided by degrees of freedom \\
{\tt PEARSON\_X2\_PVAL} & & + & Pearson $X^2$ p-value \\
{\tt DEVIANCE\_G2} & & + & Deviance from the saturated model $G^2$-statistic \\
{\tt DEVIANCE\_G2\_BY\_DF} & & + & Deviance $G^2$ divided by degrees of freedom \\
{\tt DEVIANCE\_G2\_PVAL} & & + & Deviance $G^2$ p-value \\
{\tt AVG\_TOT\_Y} & + & & $Y$-column average for an individual response value \\
{\tt STDEV\_TOT\_Y} & + & & $Y$-column st.\ dev.\ for an individual response value \\
{\tt AVG\_RES\_Y} & + & & $Y$-column residual average of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
{\tt STDEV\_RES\_Y} & + & & $Y$-column residual st.\ dev.\ of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
{\tt PRED\_STDEV\_RES} & + & + & Model-predicted $Y$-column residual st.\ deviation\\
{\tt PLAIN\_R2} & + & & Plain $R^2$ of $Y$-column residual with bias included \\
{\tt ADJUSTED\_R2} & + & & Adjusted $R^2$ of $Y$-column residual w.\ bias included \\
{\tt PLAIN\_R2\_NOBIAS} & + & & Plain $R^2$ of $Y$-column residual, bias subtracted \\
{\tt ADJUSTED\_R2\_NOBIAS} & + & & Adjusted $R^2$ of $Y$-column residual, bias subtracted \\
\hline
\end{tabular}}
\caption{The above goodness-of-fit statistics are provided in CSV format, one per line, with four
columns: (Name, [CID], [Disp?], Value). The columns are:
``Name'' is the string identifier for the statistic, see the table;
``CID'' is an optional integer value that specifies the $Y$-column index for \mbox{per-}column statistics
(note that a bi-/multinomial one-column {\tt Y}-input is converted into multi-column);
``Disp?'' is an optional Boolean value ({\tt TRUE} or {\tt FALSE}) that tells us
whether or not scaling by the input dispersion parameter {\tt disp} has been applied to this
statistic;
``Value'' is the value of the statistic.}
\label{table:GLMpred:stats}
\end{table}
\noindent{\bf Details}
\smallskip
The output matrix $M$ of predicted means (or probabilities) is computed by matrix-multiplying $X$
with the first column of~$B$ or with the whole~$B$ in the multinomial case, adding the intercept
if available (conceptually, appending an extra column of ones to~$X$); then applying the inverse
of the model's link function. The difference between ``means'' and ``probabilities'' in the
categorical case becomes significant when there are ${\geq}\,2$ observations per record
(with the multi-label records) or when labels such as $-1$ and~$1$ are viewed and averaged
as numerical response values (with the single-label records). To avoid any \mbox{mix-up} or
information loss, we separately return the predicted probability of each category label for each
record.
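\smallskip\noindent For illustration, here is a minimal numpy sketch of this prediction step;
it is not the DML implementation, and the function names and the power-link assumption for
{\tt dfam=1} are ours:
\begin{verbatim}
# Minimal numpy sketch of the prediction step (illustrative, not GLM-predict.dml).
import numpy as np

def predict_means_power(X, B, lpow=1.0, intercept=True):
    # dfam=1 with a power link: eta = X b (+ intercept), mu = eta^(1/lpow),
    # or mu = exp(eta) for the log link (lpow = 0).
    b = B[:, 0]                            # only the first column of B is used
    eta = X @ b[:-1] + b[-1] if intercept else X @ b
    return np.exp(eta) if lpow == 0.0 else eta ** (1.0 / lpow)

def predict_probabilities_multinomial(X, B, intercept=True):
    # dfam=3: k columns of B for the non-baseline categories, baseline has eta = 0.
    eta = X @ B[:-1, :] + B[-1, :] if intercept else X @ B
    eta = np.hstack([eta, np.zeros((eta.shape[0], 1))])
    p = np.exp(eta - eta.max(axis=1, keepdims=True))   # numerically stable softmax
    return p / p.sum(axis=1, keepdims=True)
\end{verbatim}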
When the ``actual'' response values $Y$ are available, the summary statistics are computed
and written out as described in Table~\ref{table:GLMpred:stats}. Below we discuss each of
these statistics in detail. Note that in the categorical case (binomial and multinomial)
$Y$ is internally represented as the matrix of observation counts for each label in each record,
rather than just the label~ID for each record. The input~$Y$ may already be a matrix of counts,
in which case it is used as-is. But if $Y$ is given as a vector of response labels, each
response label is converted into an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$ where~$l$
is the label~ID for this record. All negative (e.g.~$-1$) or zero label~IDs are converted to
the maximum label~ID plus one. The largest label~ID is viewed as the ``baseline'' as explained
in the section on Multinomial Logistic Regression. We assume that there are $k\geq 1$
non-baseline categories and one (last) baseline category.
We also estimate residual variances for each response value; these are not output, but are
used inside the summary statistics, both scaled and unscaled by the input dispersion
parameter {\tt disp}, as described below.
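\smallskip\noindent A minimal numpy sketch of this label-to-counts conversion (illustrative only;
it assumes integer labels with at least one positive label~ID):
\begin{verbatim}
import numpy as np

def labels_to_counts(y):
    # Convert a single-label response vector into an n x (k+1) count matrix.
    y = np.asarray(y, dtype=int).ravel()
    max_label = y.max()                       # largest (positive) label ID
    y = np.where(y <= 0, max_label + 1, y)    # zero/negative IDs -> new last label
    k_plus_1 = y.max()                        # the last label is the baseline
    Y = np.zeros((y.size, k_plus_1))
    Y[np.arange(y.size), y - 1] = 1.0         # one observation per record
    return Y
\end{verbatim}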
\smallskip
{\tt LOGLHOOD\_Z} and {\tt LOGLHOOD\_Z\_PVAL} statistics measure how far the log-likelihood
of~$Y$ deviates from its expected value according to the model. The script implements them
only for the binomial and the multinomial distributions, returning NaN for all other distributions.
Pearson's~$X^2$ and deviance~$G^2$ often perform poorly with bi- and multinomial distributions
due to low cell counts, hence we need this extra goodness-of-fit measure. To compute these
statistics, we use:
\begin{Itemize}
\item the $n\times (k\,{+}\,1)$-matrix~$Y$ of multi-label response counts, in which $y_{i,j}$
is the number of times label~$j$ was observed in record~$i$;
\item the model-estimated probability matrix~$P$ of the same dimensions that satisfies
$\sum_{j=1}^{k+1} p_{i,j} = 1$ for all~$i=1,\ldots,n$ and where $p_{i,j}$ is the model
probability of observing label~$j$ in record~$i$;
\item the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of observations
in record~$i$ (all $N_i = 1$ if each record has only one response label).
\end{Itemize}
We start by computing the multinomial log-likelihood of $Y$ given~$P$ and~$N$, as well as
the expected log-likelihood given a random~$Y$ and the variance of this log-likelihood if
$Y$ indeed follows the proposed distribution:
\begin{align*}
\ell (Y) \,\,&=\,\, \log \Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1} \,y_{i,j}\log p_{i,j} \\
\E_Y \ell (Y) \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j}
\,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\
\Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2
- \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right)
\end{align*}
Then we compute the $Z$-score as the difference between the actual and the expected
log-likelihood $\ell(Y)$ divided by its expected standard deviation, and its two-sided
p-value in the Normal distribution assumption ($\ell(Y)$ should approach normality due
to the Central Limit Theorem):
\begin{equation*}
Z \,=\, \frac {\ell(Y) - \E_Y \ell(Y)}{\sqrt{\Var_Y \ell(Y)}};\quad
\mathop{\textrm{p-value}}(Z) \,=\, \Prob \Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big]
\end{equation*}
A low p-value would indicate ``underfitting'' if $Z\ll 0$ or ``overfitting'' if $Z\gg 0$. Here
``overfitting'' means that higher-probability labels occur more often than their probabilities
suggest.
We also apply the dispersion input ({\tt disp}) to compute the ``scaled'' version of the $Z$-score
and its p-value. Since $\ell(Y)$ is a linear function of~$Y$, multiplying the GLM-predicted
variance of~$Y$ by {\tt disp} results in multiplying $\Var_Y \ell(Y)$ by the same {\tt disp}. This, in turn,
translates into dividing the $Z$-score by the square root of the dispersion:
\begin{equation*}
Z_{\texttt{disp}} \,=\, \big(\ell(Y) \,-\, \E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot\Var_Y \ell(Y)}
\,=\, Z / \sqrt{\texttt{disp}}
\end{equation*}
Finally, we recalculate the p-value with this new $Z$-score.
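\smallskip\noindent As a reference, here is a minimal numpy/scipy sketch of this statistic
(illustrative, not the DML implementation); {\tt Y} is the $n\times(k\,{+}\,1)$ count matrix
and {\tt P} the matching probability matrix:
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def loglhood_z(Y, P, disp=1.0):
    # LOGLHOOD_Z and its two-sided p-value for categorical data.
    N = Y.sum(axis=1)                                   # observations per record
    logP = np.log(P)
    ell   = np.sum(Y * logP)                            # actual log-likelihood
    e_ell = np.sum(N * np.sum(P * logP, axis=1))        # expected log-likelihood
    v_ell = np.sum(N * (np.sum(P * logP ** 2, axis=1)
                        - np.sum(P * logP, axis=1) ** 2))
    z = (ell - e_ell) / np.sqrt(disp * v_ell)           # disp = 1.0 gives LOGLHOOD_Z
    return z, 2.0 * norm.sf(abs(z))
\end{verbatim}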
\smallskip
{\tt PEARSON\_X2}, {\tt PEARSON\_X2\_BY\_DF}, and {\tt PEARSON\_X2\_PVAL}:
Pearson's residual $X^2$-statistic is a commonly used goodness-of-fit measure for linear models~\cite{McCullagh1989:GLM}.
The idea is to measure how well the model-predicted means and variances match the actual behavior
of response values. For each record $i$, we estimate the mean $\mu_i$ and the variance $v_i$
(or $\texttt{disp}\cdot v_i$) and use them to normalize the residual:
$r_i = (y_i - \mu_i) / \sqrt{v_i}$. These normalized residuals are then squared, aggregated
by summation, and tested against an appropriate $\chi^2$~distribution. The computation of~$X^2$
is slightly different for categorical data (bi- and multinomial) than it is for numerical data,
since $y_i$ has multiple correlated dimensions~\cite{McCullagh1989:GLM}:
\begin{equation*}
X^2\,\textrm{(numer.)} \,=\, \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad
X^2\,\textrm{(categ.)} \,=\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i
\hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}
\end{equation*}
The number of degrees of freedom~\#d.f.\ for the $\chi^2$~distribution is $n - m$ for numerical data and
$(n - m)k$ for categorical data, where $k = \mathop{\texttt{ncol}}(Y) - 1$. Given the dispersion
parameter {\tt disp}, the $X^2$ statistic is scaled by division: \mbox{$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$}.
If the dispersion is accurate, $X^2 / \texttt{disp}$ should be close to~\#d.f. In fact, $X^2 / \textrm{\#d.f.}$
over the \emph{training} data is the dispersion estimator used in our {\tt GLM.dml} script,
see~(\ref{eqn:dispersion}). Here we provide $X^2 / \textrm{\#d.f.}$ and $X^2_{\texttt{disp}} / \textrm{\#d.f.}$
as {\tt PEARSON\_X2\_BY\_DF} to enable dispersion comparison between the training data and
the test data.
NOTE: For categorical data, both Pearson's $X^2$ and the deviance $G^2$ are unreliable (i.e.\ do not
approach the $\chi^2$~distribution) unless the predicted means of multi-label counts
$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$ are fairly large: all ${\geq}\,1$ and 80\% are
at least~$5$~\cite{Cochran1954:chisq}. They should not be used for ``one label per record'' categoricals.
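\smallskip\noindent For reference, a numpy/scipy sketch of the categorical $X^2$ computation
(illustrative only, with the dispersion scaling applied):
\begin{verbatim}
import numpy as np
from scipy.stats import chi2

def pearson_x2_categorical(Y, P, m, disp=1.0):
    # Pearson X^2 for an n x (k+1) count matrix Y, probabilities P, m features.
    n, k_plus_1 = Y.shape
    N = Y.sum(axis=1, keepdims=True)
    expected = N * P                              # model-predicted cell counts
    x2 = np.sum((Y - expected) ** 2 / expected) / disp
    df = (n - m) * (k_plus_1 - 1)                 # (n - m) * k degrees of freedom
    return x2, x2 / df, chi2.sf(x2, df)           # X2, X2_BY_DF, X2_PVAL
\end{verbatim}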
\smallskip
{\tt DEVIANCE\_G2}, {\tt DEVIANCE\_G2\_BY\_DF}, and {\tt DEVIANCE\_G2\_PVAL}:
Deviance $G^2$ is the log of the likelihood ratio between the ``saturated'' model and the
linear model being tested for the given dataset, multiplied by two:
\begin{equation}
G^2 \,=\, 2 \,\log \frac{\Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}%
{\Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]}
\label{eqn:GLMpred:deviance}
\end{equation}
The ``saturated'' model sets the mean $\mu_i^{\mathrm{sat}}$ to equal~$y_i$ for every record
(for categorical data, $p^{\mathrm{sat}}_{i,j} = y_{i,j} / N_i$), which represents the ``perfect fit.''
For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at a boundary, by continuity we set
$0 \log 0 = 0$. The GLM~likelihood functions defined in~(\ref{eqn:GLM}) simplify in
ratio~(\ref{eqn:GLMpred:deviance}) because the term $c(y, a)$ is the same in both models
and cancels out.
The log of a likelihood ratio between two nested models, times two, is known to approach
a $\chi^2$ distribution as $n\to\infty$ if both models have fixed parameter spaces.
But this is not the case for the ``saturated'' model: it adds more parameters with each record.
In practice, however, $\chi^2$ distributions are used to compute the p-value of~$G^2$~\cite{McCullagh1989:GLM}.
The number of degrees of freedom~\#d.f.\ and the treatment of dispersion are the same as for
Pearson's~$X^2$, see above.
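\smallskip\noindent Analogously, here is a sketch of the categorical deviance, using the
saturated-model probabilities $y_{i,j}/N_i$ and the $0 \log 0 = 0$ convention (illustrative only):
\begin{verbatim}
import numpy as np
from scipy.stats import chi2

def deviance_g2_categorical(Y, P, m, disp=1.0):
    # G^2 = 2 * sum y_ij * log( y_ij / (N_i * p_ij) ), with 0 * log(0) = 0.
    n, k_plus_1 = Y.shape
    N = Y.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(Y > 0, Y / (N * P), 1.0)   # log(1.0) = 0 where y_ij = 0
    g2 = 2.0 * np.sum(Y * np.log(ratio)) / disp
    df = (n - m) * (k_plus_1 - 1)
    return g2, g2 / df, chi2.sf(g2, df)             # G2, G2_BY_DF, G2_PVAL
\end{verbatim}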
\Paragraph{Column-wise statistics.} The rest of the statistics are computed separately
for each column of~$Y$. As explained above, $Y$~has two or more columns in the bi- and multinomial case,
either at input or after conversion. Moreover, each $y_{i,j}$ in record~$i$ with $N_i \geq 2$ is
counted as $N_i$ separate observations $y_{i,j,l}$ of 0 or~1 (where $l=1,\ldots,N_i$) with
$y_{i,j}$~ones and $N_i-y_{i,j}$ zeros.
For power distributions, including linear regression, $Y$~has only one column and all
$N_i = 1$, so the statistics are computed for all~$Y$ with each record counted once.
Below we denote $N = \sum_{i=1}^n N_i \,\geq n$.
Here is the total average and the residual average (residual bias) of~$y_{i,j,l}$ for each $Y$-column:
\begin{equation*}
\texttt{AVG\_TOT\_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n y_{i,j}; \quad
\texttt{AVG\_RES\_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})
\end{equation*}
Dividing by~$N$ (rather than~$n$) gives the averages for~$y_{i,j,l}$ (rather than~$y_{i,j}$).
The total variance, and the standard deviation, for individual observations~$y_{i,j,l}$ is
estimated from the total variance for response values~$y_{i,j}$ using the independence assumption:
$\Var y_{i,j} = \Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} \Var y_{i,j,l}$.
This allows us to estimate the sum of squares for~$y_{i,j,l}$ via the sum of squares for~$y_{i,j}$:
\begin{equation*}
\texttt{STDEV\_TOT\_Y}_j \,=\,
\Bigg[\frac{1}{N-1} \sum_{i=1}^n \Big( y_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n y_{i'\!,j}\Big)^2\Bigg]^{1/2}
\end{equation*}
Analogously, we estimate the standard deviation of the residual $y_{i,j,l} - \mu_{i,j,l}$:
\begin{equation*}
\texttt{STDEV\_RES\_Y}_j \,=\,
\Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n \Big( y_{i,j} - \mu_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}
\end{equation*}
Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if it does not.
The estimated standard deviations can be compared to the model-predicted residual standard deviation
computed from the predicted means by the GLM variance formula and scaled by the dispersion:
\begin{equation*}
\texttt{PRED\_STDEV\_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}
\end{equation*}
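\smallskip\noindent A numpy sketch of these per-column statistics (illustrative only; {\tt Mu}
denotes the matrix of predicted means $\mu_{i,j}$, {\tt N} the vector of per-record observation
counts $N_i$, and {\tt var\_fn} the GLM variance function, e.g.\ $v(\mu)=\mu$ for Poisson):
\begin{verbatim}
import numpy as np

def column_stats(Y, Mu, N, m_prime):
    # AVG_TOT_Y, AVG_RES_Y, STDEV_TOT_Y, STDEV_RES_Y per Y-column.
    N_tot = N.sum()
    avg_tot = Y.sum(axis=0) / N_tot
    avg_res = (Y - Mu).sum(axis=0) / N_tot
    c_tot = Y - np.outer(N, avg_tot)               # centered response
    c_res = (Y - Mu) - np.outer(N, avg_res)        # centered residual
    stdev_tot = np.sqrt((c_tot ** 2).sum(axis=0) / (N_tot - 1))
    stdev_res = np.sqrt((c_res ** 2).sum(axis=0) / (N_tot - m_prime))
    return avg_tot, avg_res, stdev_tot, stdev_res

def pred_stdev_res(Mu, N, var_fn, disp=1.0):
    # PRED_STDEV_RES: model-predicted residual st. dev. per Y-column.
    return np.sqrt(disp * var_fn(Mu).sum(axis=0) / N.sum())
\end{verbatim}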
We also compute the $R^2$ statistics for each column of~$Y$, see Table~\ref{table:GLMpred:R2} for details.
We compute two versions of~$R^2$: in one version the residual sum-of-squares (RSS) includes any bias in
the residual that might be present (due to the lack of, or inaccuracy in, the intercept); in the other
version of~RSS the bias is subtracted by ``centering'' the residual. In both cases we subtract the bias in the total
sum-of-squares (in the denominator), and $m'$ equals $m$~with the intercept or $m+1$ without the intercept.
\begin{table}[t]\small\centerline{%
\begin{tabular}{|c|c|}
\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares includes the bias contribution:} \\
\hline
\multicolumn{1}{|l|}{\tt PLAIN\_R2${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2${}_j \,\,= {}$} \\
$ \displaystyle 1 -
\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ &
$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}} \,
\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ \\
\hline
\multicolumn{2}{c}{} \\
\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares is centered so that the bias is subtracted:} \\
\hline
\multicolumn{1}{|l|}{\tt PLAIN\_R2\_NOBIAS${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2\_NOBIAS${}_j \,\,= {}$} \\
$ \displaystyle 1 -
\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
\sum\limits_{i'=1}^n (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ &
$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \,
\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
\sum\limits_{i'=1}^n (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ \\
\hline
\end{tabular}}
\caption{The $R^2$ statistics we compute in {\tt GLM-predict.dml}}
\label{table:GLMpred:R2}
\end{table}
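\smallskip\noindent A corresponding numpy sketch of the four $R^2$ variants (illustrative only,
reusing the notation from the column-wise statistics above):
\begin{verbatim}
import numpy as np

def r2_stats(Y, Mu, N, m, m_prime):
    # PLAIN_R2, ADJUSTED_R2, PLAIN_R2_NOBIAS, ADJUSTED_R2_NOBIAS per Y-column.
    N_tot = N.sum()
    avg_y   = Y.sum(axis=0) / N_tot
    avg_res = (Y - Mu).sum(axis=0) / N_tot
    tss        = ((Y - np.outer(N, avg_y)) ** 2).sum(axis=0)        # total SS
    rss_bias   = ((Y - Mu) ** 2).sum(axis=0)                        # RSS with bias
    rss_nobias = (((Y - Mu) - np.outer(N, avg_res)) ** 2).sum(axis=0)
    return (1 - rss_bias / tss,
            1 - (N_tot - 1) / (N_tot - m) * rss_bias / tss,
            1 - rss_nobias / tss,
            1 - (N_tot - 1) / (N_tot - m_prime) * rss_nobias / tss)
\end{verbatim}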
\smallskip
\noindent{\bf Returns}
\smallskip
The matrix of predicted means (if the response is numerical) or probabilities (if the response
is categorical), see ``Description'' subsection above for more information. Given {\tt Y}, we
return some statistics in CSV format as described in Table~\ref{table:GLMpred:stats} and in the
above text.
\smallskip
\noindent{\bf Examples}
\smallskip
Note that in the examples below the value of the {\tt disp} input argument
is set arbitrarily. The correct dispersion value should be computed from the training
data during model estimation, or omitted if unknown (which sets it to~{\tt 1.0}).
\smallskip\noindent
Linear regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0 disp=5.67
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Linear regression example, prediction only (no {\tt Y} given):
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
}\smallskip\noindent
Binomial logistic regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=2 link=2 disp=3.0004464
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Binomial probit regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=2 link=3 disp=3.0004464
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Multinomial logistic regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=3
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Poisson regression with the log link example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=1.0 link=1 lpow=0.0 disp=3.45
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Gamma regression with the inverse (reciprocal) link example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=2.0 link=1 lpow=-1.0 disp=1.99118
X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}