\begin{comment}

 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.

\end{comment}

\subsection{Regression Scoring and Prediction}

\noindent{\bf Description}
\smallskip

Script {\tt GLM-predict.dml} is intended to cover all linear model based regressions,
including linear regression, binomial and multinomial logistic regression, and GLM
regressions (Poisson, gamma, binomial with probit link, etc.).  Having just one scoring
script for all these regressions simplifies maintenance and enhancement while ensuring
compatible interpretations for output statistics.

The script performs two functions, prediction and scoring.  To perform prediction,
the script takes two matrix inputs: a collection of records $X$ (without the response
attribute) and the estimated regression parameters~$B$, also known as~$\beta$.
To perform scoring, in addition to $X$ and~$B$, the script takes the matrix of actual
response values~$Y$ that are compared to the predictions made with $X$ and~$B$.  Of course,
there are other, non-matrix, input arguments that specify the model and the output
format; see below for the full list.

We assume that our test/scoring dataset is given by $n\,{\times}\,m$-matrix $X$ of
numerical feature vectors, where each row~$x_i$ represents one feature vector of one
record; we have \mbox{$\dim x_i = m$}.  Each record also includes the response
variable~$y_i$ that may be numerical, single-label categorical, or multi-label categorical.
A single-label categorical $y_i$ is an integer category label, one label per record;
a multi-label $y_i$ is a vector of integer counts, one count for each possible label,
which represents multiple single-label events (observations) for the same~$x_i$.  Internally
we convert single-label categoricals into multi-label categoricals by replacing each
label~$l$ with an indicator vector~$(0,\ldots,0,1_l,0,\ldots,0)$.  In prediction-only
tasks the actual $y_i$'s are not needed by the script, but they are needed for scoring.

To perform prediction, the script matrix-multiplies $X$ and $B$, adding the intercept
if available, then applies the inverse of the model's link function.
All GLMs assume that the linear combination of the features in~$x_i$ and the betas
in~$B$ determines the means~$\mu_i$ of the~$y_i$'s (in numerical or multi-label
categorical form) with $\dim \mu_i = \dim y_i$.  The observed $y_i$ is assumed to follow
a specified GLM family distribution $\Prob[y\mid \mu_i]$ with mean(s)~$\mu_i$:
\begin{equation*}
x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
\end{equation*}
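To make the chain $x_i \mapsto \eta_i \mapsto \mu_i$ concrete, here is a small pure-Python sketch for a Poisson GLM with the log link, whose inverse is $\exp$.  The data and coefficients below are made up for illustration only; the actual computation is, of course, performed in DML.

```python
import math

# Hypothetical data: three records with m = 2 features each
X = [[1.0, 2.0],
     [0.5, 1.5],
     [2.0, 0.0]]
beta = [0.3, 0.2]   # one beta per feature
beta0 = 0.1         # intercept

def predict_means(X, beta, beta0, inv_link=math.exp):
    """mu_i = g^{-1}(beta0 + sum_j beta_j * x_ij); exp is the inverse of the log link."""
    return [inv_link(beta0 + sum(b * v for b, v in zip(beta, row)))
            for row in X]

M = predict_means(X, beta, beta0)  # n x 1 vector of predicted means mu_i
```

For the identity link one would pass `inv_link=lambda e: e` instead; the structure of the computation stays the same.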

If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number.  Then our script's
output matrix $M$ is the $n\,{\times}\,1$-vector of these means~$\mu_i$.
Note that $\mu_i$ predicts the mean of $y_i$, not the actual~$y_i$.  For example,
in the Poisson distribution the mean is usually fractional, but the actual~$y_i$ is
always an integer.

If $y_i$ is categorical, i.e.\ a vector of label counts for record~$i$, then $\mu_i$
is a vector of non-negative real numbers, one number $\mu_{i,l}$ per each label~$l$.
In this case we divide the $\mu_{i,l}$ by their sum $\sum_l \mu_{i,l}$ to obtain
predicted label probabilities~\mbox{$p_{i,l}\in [0, 1]$}.  The output matrix $M$ is
the $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the number of
records and $k\,{+}\,1$ is the number of categories\footnote{We use $k+1$ because
there are $k$ non-baseline categories and one baseline category, with regression
parameters $B$ having $k$~columns.}.  Note again that we do not predict the labels
themselves, nor their actual counts per record; we predict only the labels' probabilities.
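In the multinomial logit case, the $k$ columns of $B$ yield $k$ linear predictors per record, and the baseline category contributes $\exp(0)=1$ before normalization.  A pure-Python sketch with hypothetical numbers (intercept omitted for brevity; the real computation is in DML):

```python
import math

def category_probs(x, B):
    """Probabilities of k non-baseline categories plus the baseline (last entry).
    B is an m x k matrix of betas, given as a list of rows; no intercept here."""
    k = len(B[0])
    etas = [sum(B[j][c] * x[j] for j in range(len(B))) for c in range(k)]
    unnorm = [math.exp(e) for e in etas] + [1.0]  # baseline term: exp(0) = 1
    total = sum(unnorm)
    return [u / total for u in unnorm]            # k + 1 probabilities, sum to 1

p = category_probs([1.0, 2.0], [[0.2, -0.1],
                                [0.4, 0.3]])
```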

Going from predicted probabilities to predicted labels, in the single-label categorical
case, requires extra information such as the cost of false positive versus
false negative errors.  For example, if there are 5 categories and we \emph{accurately}
predicted their probabilities as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the
highest-probability label would be wrong 70\% of the time, whereas picking the
lowest-probability label might be right if, say, it represents a diagnosis of cancer
or another rare and serious outcome.  Hence, we keep this step outside the scope of
{\tt GLM-predict.dml} for now.

\smallskip
\noindent{\bf Usage}
\smallskip

{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}GLM-predict.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} M=}path/file
{\tt{} O=}path/file
{\tt{} dfam=}int
{\tt{} vpow=}double
{\tt{} link=}int
{\tt{} lpow=}double
{\tt{} disp=}double
{\tt{} fmt=}format
}

\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature vectors, each row
of which constitutes one feature vector (one record)
\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
Location to read the response matrix $Y$ needed for scoring (but optional for prediction),
with the following dimensions: \\
$n \:{\times}\: 1$: acceptable for all distributions ({\tt dfam=1} or {\tt 2} or {\tt 3}) \\
$n \:{\times}\: 2$: for binomial ({\tt dfam=2}) if given by (\#pos, \#neg) counts \\
$n \:{\times}\: k\,{+}\,1$: for multinomial ({\tt dfam=3}) if given by category counts
\item[{\tt M}:] (default:\mbox{ }{\tt " "})
Location to write, if requested, the matrix of predicted response means (for {\tt dfam=1}) or
probabilities (for {\tt dfam=2} or {\tt 3}):\\
$n \:{\times}\: 1$: for power-type distributions ({\tt dfam=1}) \\
$n \:{\times}\: 2$: for binomial distribution ({\tt dfam=2}), col\#~2 is the ``No'' probability \\
$n \:{\times}\: k\,{+}\,1$: for multinomial logit ({\tt dfam=3}), col\#~$k\,{+}\,1$ is for the baseline
\item[{\tt B}:]
Location to read matrix $B$ of the \mbox{betas}, i.e.\ estimated GLM regression parameters,
with the intercept at row\#~$m\,{+}\,1$ if available:\\
$\dim(B) \,=\, m \:{\times}\: k$: do not add intercept \\
$\dim(B) \,=\, m\,{+}\,1 \:{\times}\: k$: add intercept as given by the last $B$-row \\
if $k > 1$, use only $B${\tt [, 1]} unless it is multinomial logit ({\tt dfam=3})
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location to store the CSV file with goodness-of-fit statistics defined in
Table~\ref{table:GLMpred:stats}; the default is to print them to the standard output
\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
GLM distribution family code to specify the type of distribution $\Prob[y\,|\,\mu]$
that we assume: \\
{\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$, see Table~\ref{table:commonGLMs};\\
{\tt 2} = binomial;
{\tt 3} = multinomial logit
\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
Power for variance defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt dfam}$\,{\neq}\,1$):
when {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
dependence of the variance of~$y$ on its mean.  In particular, use:\\
{\tt 0.0} = Gaussian,
{\tt 1.0} = Poisson,
{\tt 2.0} = Gamma,
{\tt 3.0} = inverse Gaussian
\item[{\tt link}:] (default:\mbox{ }{\tt 0})
Link function code to determine the link function~$\eta = g(\mu)$, ignored for
multinomial logit ({\tt dfam=3}):\\
{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
{\tt 1} = power functions,
{\tt 2} = logit,
{\tt 3} = probit,
{\tt 4} = cloglog,
{\tt 5} = cauchit
\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
Power for link function defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt link}$\,{\neq}\,1$):
when {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$.  Common power links:\\
{\tt -2.0} = $1/\mu^2$,
{\tt -1.0} = reciprocal,
{\tt 0.0} = log,
{\tt 0.5} = sqrt,
{\tt 1.0} = identity
\item[{\tt disp}:] (default:\mbox{ }{\tt 1.0})
Dispersion value, when available; must be positive
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix {\tt M} file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in the SystemML Language Reference for details.
\end{Description}

\begin{table}[t]\small\centerline{%
\begin{tabular}{|lccl|}
\hline
Name & \hspace{-0.6em}CID\hspace{-0.5em} & \hspace{-0.3em}Disp?\hspace{-0.6em} & Meaning \\
\hline
{\tt LOGLHOOD\_Z}          &   & + & Log-likelihood $Z$-score (in st.\ dev.'s from the mean) \\
{\tt LOGLHOOD\_Z\_PVAL}    &   & + & Log-likelihood $Z$-score p-value, two-sided \\
{\tt PEARSON\_X2}          &   & + & Pearson residual $X^2$-statistic \\
{\tt PEARSON\_X2\_BY\_DF}  &   & + & Pearson $X^2$ divided by degrees of freedom \\
{\tt PEARSON\_X2\_PVAL}    &   & + & Pearson $X^2$ p-value \\
{\tt DEVIANCE\_G2}         &   & + & Deviance from the saturated model $G^2$-statistic \\
{\tt DEVIANCE\_G2\_BY\_DF} &   & + & Deviance $G^2$ divided by degrees of freedom \\
{\tt DEVIANCE\_G2\_PVAL}   &   & + & Deviance $G^2$ p-value \\
{\tt AVG\_TOT\_Y}          & + &   & $Y$-column average for an individual response value \\
{\tt STDEV\_TOT\_Y}        & + &   & $Y$-column st.\ dev.\ for an individual response value \\
{\tt AVG\_RES\_Y}          & + &   & $Y$-column residual average of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
{\tt STDEV\_RES\_Y}        & + &   & $Y$-column residual st.\ dev.\ of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
{\tt PRED\_STDEV\_RES}     & + & + & Model-predicted $Y$-column residual st.\ deviation \\
{\tt PLAIN\_R2}            & + &   & Plain $R^2$ of $Y$-column residual with bias included \\
{\tt ADJUSTED\_R2}         & + &   & Adjusted $R^2$ of $Y$-column residual with bias included \\
{\tt PLAIN\_R2\_NOBIAS}    & + &   & Plain $R^2$ of $Y$-column residual, bias subtracted \\
{\tt ADJUSTED\_R2\_NOBIAS} & + &   & Adjusted $R^2$ of $Y$-column residual, bias subtracted \\
\hline
\end{tabular}}
\caption{The above goodness-of-fit statistics are provided in CSV format, one per line, with four
columns: (Name, [CID], [Disp?], Value).  The columns are:
``Name'' is the string identifier for the statistic, see the table;
``CID'' is an optional integer value that specifies the $Y$-column index for \mbox{per-}column statistics
(note that a bi-/multinomial one-column {\tt Y}-input is converted into multi-column);
``Disp?'' is an optional Boolean value ({\tt TRUE} or {\tt FALSE}) that tells us
whether or not scaling by the input dispersion parameter {\tt disp} has been applied to this
statistic;
``Value'' is the value of the statistic.}
\label{table:GLMpred:stats}
\end{table}

\noindent{\bf Details}
\smallskip

The output matrix $M$ of predicted means (or probabilities) is computed by matrix-multiplying $X$
with the first column of~$B$, or with the whole~$B$ in the multinomial case, adding the intercept
if available (conceptually, appending an extra column of ones to~$X$), and then applying the inverse
of the model's link function.  The difference between ``means'' and ``probabilities'' in the
categorical case becomes significant when there are ${\geq}\,2$ observations per record
(with the multi-label records) or when the labels such as $-1$ and~$1$ are viewed and averaged
as numerical response values (with the single-label records).  To avoid any \mbox{mix-up} or
information loss, we separately return the predicted probability of each category label for each
record.

When the ``actual'' response values $Y$ are available, the summary statistics are computed
and written out as described in Table~\ref{table:GLMpred:stats}.  Below we discuss each of
these statistics in detail.  Note that in the categorical case (binomial and multinomial)
$Y$ is internally represented as the matrix of observation counts for each label in each record,
rather than just the label~ID for each record.  The input~$Y$ may already be a matrix of counts,
in which case it is used as-is.  But if $Y$ is given as a vector of response labels, each
response label is converted into an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$ where~$l$
is the label~ID for this record.  All negative (e.g.~$-1$) or zero label~IDs are converted to
the $1 + {}$maximum label~ID.  The largest label~ID is viewed as the ``baseline'' as explained
in the section on Multinomial Logistic Regression.  We assume that there are $k\geq 1$
non-baseline categories and one (last) baseline category.
We also estimate residual variances for each response value; we do not output them,
but use them only inside the summary statistics, scaled and unscaled by the input dispersion
parameter {\tt disp}, as described below.

\smallskip

{\tt LOGLHOOD\_Z} and {\tt LOGLHOOD\_Z\_PVAL} statistics measure how far the log-likelihood
of~$Y$ deviates from its expected value according to the model.  The script implements them
only for the binomial and the multinomial distributions, returning NaN for all other distributions.
Pearson's~$X^2$ and deviance~$G^2$ often perform poorly with bi- and multinomial distributions
due to low cell counts, hence we need this extra goodness-of-fit measure.  To compute these
statistics, we use:
\begin{Itemize}
\item the $n\times (k\,{+}\,1)$-matrix~$Y$ of multi-label response counts, in which $y_{i,j}$
is the number of times label~$j$ was observed in record~$i$;
\item the model-estimated probability matrix~$P$ of the same dimensions that satisfies
$\sum_{j=1}^{k+1} p_{i,j} = 1$ for all~$i=1,\ldots,n$ and where $p_{i,j}$ is the model
probability of observing label~$j$ in record~$i$;
\item the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of observations
in record~$i$ (all $N_i = 1$ if each record has only one response label).
\end{Itemize}

We start by computing the multinomial log-likelihood of $Y$ given~$P$ and~$N$, as well as
the expected log-likelihood given a random~$Y$ and the variance of this log-likelihood if
$Y$ indeed follows the proposed distribution:
\begin{align*}
\ell (Y) \,\,&=\,\, \log \Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1} \,y_{i,j}\log p_{i,j} \\
\E_Y \ell (Y) \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j}
\,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\
\Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2
- \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right)
\end{align*}
Then we compute the $Z$-score as the difference between the actual and the expected
log-likelihood $\ell(Y)$ divided by its expected standard deviation, and its two-sided
p-value under the Normal distribution assumption ($\ell(Y)$ should approach normality due
to the Central Limit Theorem):
\begin{equation*}
Z \,=\, \frac {\ell(Y) - \E_Y \ell(Y)}{\sqrt{\Var_Y \ell(Y)}};\quad
\mathop{\textrm{p-value}}(Z) \,=\, \Prob \Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big]
\end{equation*}

A low p-value would indicate ``underfitting'' if $Z\ll 0$ or ``overfitting'' if $Z\gg 0$.  Here
``overfitting'' means that higher-probability labels occur more often than their probabilities
suggest.

We also apply the dispersion input ({\tt disp}) to compute the ``scaled'' version of the $Z$-score
and its p-value.  Since $\ell(Y)$ is a linear function of~$Y$, multiplying the GLM-predicted
variance of~$Y$ by {\tt disp} results in multiplying $\Var_Y \ell(Y)$ by the same {\tt disp}.  This, in turn,
translates into dividing the $Z$-score by the square root of the dispersion:
\begin{equation*}
Z_{\texttt{disp}} \,=\, \big(\ell(Y) \,-\, \E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot\Var_Y \ell(Y)}
\,=\, Z / \sqrt{\texttt{disp}}
\end{equation*}
Finally, we recalculate the p-value with this new $Z$-score.
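The $Z$-score computation translates almost line by line into code.  Below is a pure-Python sketch (illustrative, not the DML implementation) that evaluates $\ell(Y)$, its expectation and variance, and the dispersion-scaled $Z$ with its two-sided p-value via the complementary error function; the input matrices are hypothetical.

```python
import math

def loglhood_z(Y, P, N, disp=1.0):
    """Z-score of the multinomial log-likelihood of Y given P and N, plus
    its two-sided p-value under the Normal(0,1) approximation."""
    ell = sum(y * math.log(p)
              for yr, pr in zip(Y, P) for y, p in zip(yr, pr) if y > 0)
    e_ell = sum(n * sum(p * math.log(p) for p in pr)
                for pr, n in zip(P, N))
    var_ell = sum(n * (sum(p * math.log(p) ** 2 for p in pr)
                       - sum(p * math.log(p) for p in pr) ** 2)
                  for pr, n in zip(P, N))
    z = (ell - e_ell) / math.sqrt(disp * var_ell)
    pval = math.erfc(abs(z) / math.sqrt(2.0))  # P[|Normal(0,1)| > |z|]
    return z, pval

z, pval = loglhood_z(Y=[[2, 1], [0, 3]],
                     P=[[0.6, 0.4], [0.3, 0.7]],
                     N=[3, 3])
```

Note how passing `disp=4.0` would halve the returned `z`, matching $Z_{\texttt{disp}} = Z/\sqrt{\texttt{disp}}$.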

\smallskip

{\tt PEARSON\_X2}, {\tt PEARSON\_X2\_BY\_DF}, and {\tt PEARSON\_X2\_PVAL}:
Pearson's residual $X^2$-statistic is a commonly used goodness-of-fit measure for linear models~\cite{McCullagh1989:GLM}.
The idea is to measure how well the model-predicted means and variances match the actual behavior
of response values.  For each record $i$, we estimate the mean $\mu_i$ and the variance $v_i$
(or $\texttt{disp}\cdot v_i$) and use them to normalize the residual:
$r_i = (y_i - \mu_i) / \sqrt{v_i}$.  These normalized residuals are then squared, aggregated
by summation, and tested against an appropriate $\chi^2$~distribution.  The computation of~$X^2$
is slightly different for categorical data (bi- and multinomial) than it is for numerical data,
since $y_i$ has multiple correlated dimensions~\cite{McCullagh1989:GLM}:
\begin{equation*}
X^2\,\textrm{(numer.)} \,=\, \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad
X^2\,\textrm{(categ.)} \,=\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i
\hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}
\end{equation*}
The number of degrees of freedom~\#d.f.\ for the $\chi^2$~distribution is $n - m$ for numerical data and
$(n - m)k$ for categorical data, where $k = \mathop{\texttt{ncol}}(Y) - 1$.  Given the dispersion
parameter {\tt disp}, the $X^2$ statistic is scaled by division: \mbox{$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$}.
If the dispersion is accurate, $X^2 / \texttt{disp}$ should be close to~\#d.f.  In fact, $X^2 / \textrm{\#d.f.}$
over the \emph{training} data is the dispersion estimator used in our {\tt GLM.dml} script,
see~(\ref{eqn:dispersion}).  Here we provide $X^2 / \textrm{\#d.f.}$ and $X^2_{\texttt{disp}} / \textrm{\#d.f.}$
as {\tt PEARSON\_X2\_BY\_DF} to enable dispersion comparison between the training data and
the test data.
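As a sanity check of the categorical formula, here is a pure-Python sketch (illustrative only, with made-up counts) of $X^2$ and its dispersion-scaled variant:

```python
def pearson_x2_categ(Y, P, N, disp=1.0):
    """X^2 = sum_i sum_j (y_ij - N_i p_ij)^2 / (N_i p_ij), divided by disp."""
    x2 = sum((y - n * p) ** 2 / (n * p)
             for yr, pr, n in zip(Y, P, N)
             for y, p in zip(yr, pr))
    return x2 / disp

# Record 2 fits perfectly (y_ij == N_i * p_ij) and contributes 0;
# only record 1 contributes to the statistic.
x2 = pearson_x2_categ(Y=[[2, 1], [3, 1]],
                      P=[[0.5, 0.5], [0.75, 0.25]],
                      N=[3, 4])
```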

NOTE: For categorical data, both Pearson's $X^2$ and the deviance $G^2$ are unreliable (i.e.\ do not
approach the $\chi^2$~distribution) unless the predicted means of multi-label counts
$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$ are fairly large: all ${\geq}\,1$ and 80\% are
at least~$5$~\cite{Cochran1954:chisq}.  They should not be used for ``one label per record'' categoricals.

\smallskip

{\tt DEVIANCE\_G2}, {\tt DEVIANCE\_G2\_BY\_DF}, and {\tt DEVIANCE\_G2\_PVAL}:
Deviance $G^2$ is the log of the likelihood ratio between the ``saturated'' model and the
linear model being tested for the given dataset, multiplied by two:
\begin{equation}
G^2 \,=\, 2 \,\log \frac{\Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}%
{\Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]}
\label{eqn:GLMpred:deviance}
\end{equation}
The ``saturated'' model sets the mean $\mu_i^{\mathrm{sat}}$ to equal~$y_i$ for every record
(for categorical data, $p^{\mathrm{sat}}_{i,j} = y_{i,j} / N_i$), which represents the ``perfect fit.''
For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at a boundary, by continuity we set
$0 \log 0 = 0$.  The GLM~likelihood functions defined in~(\ref{eqn:GLM}) become simplified
in ratio~(\ref{eqn:GLMpred:deviance}) because the term $c(y, a)$, being the same in both models,
cancels out.

The log of a likelihood ratio between two nested models, times two, is known to approach
a $\chi^2$ distribution as $n\to\infty$ if both models have fixed parameter spaces.
But this is not the case for the ``saturated'' model: it adds more parameters with each record.
In practice, however, $\chi^2$ distributions are used to compute the p-value of~$G^2$~\cite{McCullagh1989:GLM}.
The number of degrees of freedom~\#d.f.\ and the treatment of dispersion are the same as for
Pearson's~$X^2$, see above.
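For categorical data, substituting $p^{\mathrm{sat}}_{i,j} = y_{i,j}/N_i$ into the likelihood ratio reduces the deviance to $G^2 = 2\sum_{i,j} y_{i,j}\log\big(y_{i,j}/(N_i\hspace{0.5pt}p_{i,j})\big)$ with the $0\log 0 = 0$ convention.  A pure-Python sketch of this reduced form (illustrative, not the DML code):

```python
import math

def deviance_g2_categ(Y, P, N):
    """G^2 = 2 * sum_ij y_ij * log(y_ij / (N_i * p_ij)), with 0 * log(0) = 0."""
    g2 = 0.0
    for yr, pr, n in zip(Y, P, N):
        for y, p in zip(yr, pr):
            if y > 0:               # boundary convention: 0 * log(0) = 0
                g2 += y * math.log(y / (n * p))
    return 2.0 * g2

# The saturated model itself (p_ij = y_ij / N_i) gives G^2 = 0
g2 = deviance_g2_categ(Y=[[2, 1]], P=[[2 / 3, 1 / 3]], N=[3])
```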

\Paragraph{Column-wise statistics.} The rest of the statistics are computed separately
for each column of~$Y$.  As explained above, $Y$~has two or more columns in the bi- and multinomial cases,
either at input or after conversion.  Moreover, each $y_{i,j}$ in record~$i$ with $N_i \geq 2$ is
counted as $N_i$ separate observations $y_{i,j,l}$ of 0 or~1 (where $l=1,\ldots,N_i$) with
$y_{i,j}$~ones and $N_i-y_{i,j}$ zeros.
For power distributions, including linear regression, $Y$~has only one column and all
$N_i = 1$, so the statistics are computed for all of~$Y$ with each record counted once.
Below we denote $N = \sum_{i=1}^n N_i \,\geq\, n$.
Here are the total average and the residual average (residual bias) of~$y_{i,j,l}$ for each $Y$-column:
\begin{equation*}
\texttt{AVG\_TOT\_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n y_{i,j}; \quad
\texttt{AVG\_RES\_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})
\end{equation*}
Dividing by~$N$ (rather than~$n$) gives the averages for~$y_{i,j,l}$ (rather than~$y_{i,j}$).

The total variance, and the standard deviation, for individual observations~$y_{i,j,l}$ is
estimated from the total variance for response values~$y_{i,j}$ using the independence assumption:
$\Var y_{i,j} = \Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} \Var y_{i,j,l}$.
This allows us to estimate the sum of squares for~$y_{i,j,l}$ via the sum of squares for~$y_{i,j}$:
\begin{equation*}
\texttt{STDEV\_TOT\_Y}_j \,=\,
\Bigg[\frac{1}{N-1} \sum_{i=1}^n \Big( y_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n y_{i'\!,j}\Big)^2\Bigg]^{1/2}
\end{equation*}
Analogously, we estimate the standard deviation of the residual $y_{i,j,l} - \mu_{i,j,l}$:
\begin{equation*}
\texttt{STDEV\_RES\_Y}_j \,=\,
\Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n \Big( y_{i,j} - \mu_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}
\end{equation*}
Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if it does not.
The estimated standard deviations can be compared to the model-predicted residual standard deviation
computed from the predicted means by the GLM variance formula and scaled by the dispersion:
\begin{equation*}
\texttt{PRED\_STDEV\_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}
\end{equation*}
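Putting the column-wise formulas together, here is a pure-Python sketch for one $Y$-column (hypothetical inputs; {\tt STDEV\_RES\_Y} and {\tt PRED\_STDEV\_RES} follow the same pattern with the residuals centered and the GLM variance function plugged in, respectively):

```python
import math

def column_stats(y, mu, counts):
    """AVG_TOT_Y, AVG_RES_Y, and STDEV_TOT_Y for one Y-column.
    y[i], mu[i]: response total and predicted mean of record i;
    counts[i] = N_i, the number of observations in record i."""
    N = sum(counts)
    tot = sum(y)
    avg_tot = tot / N
    avg_res = sum(yi - mi for yi, mi in zip(y, mu)) / N
    stdev_tot = math.sqrt(sum((yi - Ni / N * tot) ** 2
                              for yi, Ni in zip(y, counts)) / (N - 1))
    return avg_tot, avg_res, stdev_tot

# With all N_i = 1 these reduce to the ordinary sample mean and st. dev.
avg_tot, avg_res, stdev_tot = column_stats([1.0, 2.0, 3.0],
                                           [1.0, 2.0, 3.0],
                                           [1, 1, 1])
```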

We also compute the $R^2$ statistics for each column of~$Y$; see Table~\ref{table:GLMpred:R2} for details.
We compute two versions of~$R^2$: in one version the residual sum-of-squares (RSS) includes any bias in
the residual that might be present (due to the lack of, or inaccuracy in, the intercept); in the other
version of~RSS the bias is subtracted by ``centering'' the residual.  In both cases we subtract the bias in the total
sum-of-squares (in the denominator), and $m'$ equals $m$~with the intercept or $m+1$ without the intercept.
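For instance, {\tt PLAIN\_R2} from Table~\ref{table:GLMpred:R2} can be sketched in pure Python as follows (hypothetical data; the adjusted variants add the $\frac{N-1}{N-m}$ factor, and the no-bias variants center the residual inside RSS):

```python
def plain_r2(y, mu, counts):
    """PLAIN_R2: 1 - RSS/TSS, with any residual bias kept inside RSS.
    TSS is always centered by subtracting N_i/N times the column total."""
    N = sum(counts)
    tot = sum(y)
    rss = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    tss = sum((yi - Ni / N * tot) ** 2 for yi, Ni in zip(y, counts))
    return 1.0 - rss / tss

r2_perfect = plain_r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1, 1, 1])  # exact fit
r2_rough = plain_r2([1.0, 2.0, 3.0], [1.5, 2.0, 2.5], [1, 1, 1])
```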

\begin{table}[t]\small\centerline{%
\begin{tabular}{|c|c|}
\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares includes the bias contribution:} \\
\hline
\multicolumn{1}{|l|}{\tt PLAIN\_R2${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2${}_j \,\,= {}$} \\
$ \displaystyle 1 -
\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ &
$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}} \,
\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ \\
\hline
\multicolumn{2}{c}{} \\
\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares is centered so that the bias is subtracted:} \\
\hline
\multicolumn{1}{|l|}{\tt PLAIN\_R2\_NOBIAS${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2\_NOBIAS${}_j \,\,= {}$} \\
$ \displaystyle 1 -
\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
\sum\limits_{i'=1}^n (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ &
$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \,
\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
\sum\limits_{i'=1}^n (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ \\
\hline
\end{tabular}}
\caption{The $R^2$ statistics we compute in {\tt GLM-predict.dml}}
\label{table:GLMpred:R2}
\end{table}

\smallskip
\noindent{\bf Returns}
\smallskip

The matrix of predicted means (if the response is numerical) or probabilities (if the response
is categorical); see the ``Description'' subsection above for more information.  Given {\tt Y}, we
return some statistics in CSV format as described in Table~\ref{table:GLMpred:stats} and in the
text above.

\smallskip
\noindent{\bf Examples}
\smallskip

Note that in the examples below the value for the {\tt disp} input argument
is set arbitrarily.  The correct dispersion value should be computed from the training
data during model estimation, or omitted if unknown (which sets it to~{\tt 1.0}).

\smallskip\noindent
Linear regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0 disp=5.67
  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Linear regression example, prediction only (no {\tt Y} given):
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0
  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
}\smallskip\noindent
Binomial logistic regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=2 link=2 disp=3.0004464
  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Binomial probit regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=2 link=3 disp=3.0004464
  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Multinomial logistic regression example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=3
  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Poisson regression with the log link example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=1.0 link=1 lpow=0.0 disp=3.45
  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}\smallskip\noindent
Gamma regression with the inverse (reciprocal) link example:
\par\hangindent=\parindent\noindent{\tt
\hml -f GLM-predict.dml -nvargs dfam=1 vpow=2.0 link=1 lpow=-1.0 disp=1.99118
  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}