| \begin{comment} |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| \end{comment} |
| |
| \subsection{Regression Scoring and Prediction} |
| |
| \noindent{\bf Description} |
| \smallskip |
| |
| Script {\tt GLM-predict.dml} is intended to cover all linear model based regressions, |
| including linear regression, binomial and multinomial logistic regression, and GLM |
| regressions (Poisson, gamma, binomial with probit link~etc.). Having just one scoring |
| script for all these regressions simplifies maintenance and enhancement while ensuring |
| compatible interpretations for output statistics. |
| |
| The script performs two functions, prediction and scoring. To perform prediction, |
| the script takes two matrix inputs: a collection of records $X$ (without the response |
| attribute) and the estimated regression parameters~$B$, also known as~$\beta$. |
| To perform scoring, in addition to $X$ and~$B$, the script takes the matrix of actual |
| response values~$Y$ that are compared to the predictions made with $X$ and~$B$. Of course |
| there are other, non-matrix, input arguments that specify the model and the output |
| format, see below for the full list. |
| |
| We assume that our test/scoring dataset is given by $n\,{\times}\,m$-matrix $X$ of |
| numerical feature vectors, where each row~$x_i$ represents one feature vector of one |
| record; we have \mbox{$\dim x_i = m$}. Each record also includes the response |
| variable~$y_i$ that may be numerical, single-label categorical, or multi-label categorical. |
| A single-label categorical $y_i$ is an integer category label, one label per record; |
| a multi-label $y_i$ is a vector of integer counts, one count for each possible label, |
| which represents multiple single-label events (observations) for the same~$x_i$. Internally |
| we convert single-label categoricals into multi-label categoricals by replacing each |
| label~$l$ with an indicator vector~$(0,\ldots,0,1_l,0,\ldots,0)$. In prediction-only |
| tasks the actual $y_i$'s are not needed to the script, but they are needed for scoring. |
| |
| To perform prediction, the script matrix-multiplies $X$ and $B$, adding the intercept |
| if available, then applies the inverse of the model's link function. |
| All GLMs assume that the linear combination of the features in~$x_i$ and the betas |
| in~$B$ determines the means~$\mu_i$ of the~$y_i$'s (in numerical or multi-label |
| categorical form) with $\dim \mu_i = \dim y_i$. The observed $y_i$ is assumed to follow |
| a specified GLM family distribution $\Prob[y\mid \mu_i]$ with mean(s)~$\mu_i$: |
| \begin{equation*} |
| x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} |
| \,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i] |
| \end{equation*} |
| If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number. Then our script's |
| output matrix $M$ is the $n\,{\times}\,1$-vector of these means~$\mu_i$. |
| Note that $\mu_i$ predicts the mean of $y_i$, not the actual~$y_i$. For example, |
| in Poisson distribution, the mean is usually fractional, but the actual~$y_i$ is |
| always integer. |
| |
| If $y_i$ is categorical, i.e.\ a vector of label counts for record~$i$, then $\mu_i$ |
| is a vector of non-negative real numbers, one number $\mu_{i,l}$ per each label~$l$. |
| In this case we divide the $\mu_{i,l}$ by their sum $\sum_l \mu_{i,l}$ to obtain |
| predicted label probabilities~\mbox{$p_{i,l}\in [0, 1]$}. The output matrix $M$ is |
| the $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the number of |
| records and $k\,{+}\,1$ is the number of categories\footnote{We use $k+1$ because |
| there are $k$ non-baseline categories and one baseline category, with regression |
| parameters $B$ having $k$~columns.}. Note again that we do not predict the labels |
| themselves, nor their actual counts per record, but we predict the labels' probabilities. |
| |
| Going from predicted probabilities to predicted labels, in the single-label categorical |
| case, requires extra information such as the cost of false positive versus |
| false negative errors. For example, if there are 5 categories and we \emph{accurately} |
| predicted their probabilities as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the |
| highest-probability label would be wrong 70\% of the time, whereas picking the |
| lowest-probability label might be right if, say, it represents a diagnosis of cancer |
| or another rare and serious outcome. Hence, we keep this step outside the scope of |
| {\tt GLM-predict.dml} for now. |
| |
| \smallskip |
| \noindent{\bf Usage} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\it% |
| {\tt{}-f }path/\/{\tt{}GLM-predict.dml} |
| {\tt{} -nvargs} |
| {\tt{} X=}path/file |
| {\tt{} Y=}path/file |
| {\tt{} B=}path/file |
| {\tt{} M=}path/file |
| {\tt{} O=}path/file |
| {\tt{} dfam=}int |
| {\tt{} vpow=}double |
| {\tt{} link=}int |
| {\tt{} lpow=}double |
| {\tt{} disp=}double |
| {\tt{} fmt=}format |
| |
| } |
| |
| \smallskip |
| \noindent{\bf Arguments} |
| \begin{Description} |
| \item[{\tt X}:] |
| Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature vectors, each row |
| constitutes one feature vector (one record) |
| \item[{\tt Y}:] (default:\mbox{ }{\tt " "}) |
| Location to read the response matrix $Y$ needed for scoring (but optional for prediction), |
| with the following dimensions: \\ |
| $n \:{\times}\: 1$: acceptable for all distributions ({\tt dfam=1} or {\tt 2} or {\tt 3}) \\ |
| $n \:{\times}\: 2$: for binomial ({\tt dfam=2}) if given by (\#pos, \#neg) counts \\ |
| $n \:{\times}\: k\,{+}\,1$: for multinomial ({\tt dfam=3}) if given by category counts |
| \item[{\tt M}:] (default:\mbox{ }{\tt " "}) |
| Location to write, if requested, the matrix of predicted response means (for {\tt dfam=1}) or |
| probabilities (for {\tt dfam=2} or {\tt 3}):\\ |
| $n \:{\times}\: 1$: for power-type distributions ({\tt dfam=1}) \\ |
| $n \:{\times}\: 2$: for binomial distribution ({\tt dfam=2}), col\#~2 is the ``No'' probability \\ |
| $n \:{\times}\: k\,{+}\,1$: for multinomial logit ({\tt dfam=3}), col\#~$k\,{+}\,1$ is for the baseline |
| \item[{\tt B}:] |
| Location to read matrix $B$ of the \mbox{betas}, i.e.\ estimated GLM regression parameters, |
| with the intercept at row\#~$m\,{+}\,1$ if available:\\ |
| $\dim(B) \,=\, m \:{\times}\: k$: do not add intercept \\ |
| $\dim(B) \,=\, m\,{+}\,1 \:{\times}\: k$: add intercept as given by the last $B$-row \\ |
| if $k > 1$, use only $B${\tt [, 1]} unless it is Multinomial Logit ({\tt dfam=3}) |
| \item[{\tt O}:] (default:\mbox{ }{\tt " "}) |
| Location to store the CSV-file with goodness-of-fit statistics defined in |
| Table~\ref{table:GLMpred:stats}, the default is to print them to the standard output |
| \item[{\tt dfam}:] (default:\mbox{ }{\tt 1}) |
| GLM distribution family code to specify the type of distribution $\Prob[y\,|\,\mu]$ |
| that we assume: \\ |
| {\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$, see Table~\ref{table:commonGLMs};\\ |
| {\tt 2} = binomial; |
| {\tt 3} = multinomial logit |
| \item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0}) |
| Power for variance defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt dfam}$\,{\neq}\,1$): |
| when {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power |
| dependence of the variance of~$y$ on its mean. In particular, use:\\ |
| {\tt 0.0} = Gaussian, |
| {\tt 1.0} = Poisson, |
| {\tt 2.0} = Gamma, |
| {\tt 3.0} = inverse Gaussian |
| \item[{\tt link}:] (default:\mbox{ }{\tt 0}) |
| Link function code to determine the link function~$\eta = g(\mu)$, ignored for |
| multinomial logit ({\tt dfam=3}):\\ |
| {\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\ |
| {\tt 1} = power functions, |
| {\tt 2} = logit, |
| {\tt 3} = probit, |
| {\tt 4} = cloglog, |
| {\tt 5} = cauchit |
| \item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0}) |
| Power for link function defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt link}$\,{\neq}\,1$): |
| when {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link |
| function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$. Common power links:\\ |
| {\tt -2.0} = $1/\mu^2$, |
| {\tt -1.0} = reciprocal, |
| {\tt 0.0} = log, |
| {\tt 0.5} = sqrt, |
| {\tt 1.0} = identity |
| \item[{\tt disp}:] (default:\mbox{ }{\tt 1.0}) |
| Dispersion value, when available; must be positive |
| \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) |
| Matrix {\tt M} file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; |
| see read/write functions in SystemML Language Reference for details. |
| \end{Description} |
| |
| |
| \begin{table}[t]\small\centerline{% |
| \begin{tabular}{|lccl|} |
| \hline |
| Name & \hspace{-0.6em}CID\hspace{-0.5em} & \hspace{-0.3em}Disp?\hspace{-0.6em} & Meaning \\ |
| \hline |
| {\tt LOGLHOOD\_Z} & & + & Log-likelihood $Z$-score (in st.\ dev.'s from the mean) \\ |
| {\tt LOGLHOOD\_Z\_PVAL} & & + & Log-likelihood $Z$-score p-value, two-sided \\ |
| {\tt PEARSON\_X2} & & + & Pearson residual $X^2$-statistic \\ |
| {\tt PEARSON\_X2\_BY\_DF} & & + & Pearson $X^2$ divided by degrees of freedom \\ |
| {\tt PEARSON\_X2\_PVAL} & & + & Pearson $X^2$ p-value \\ |
| {\tt DEVIANCE\_G2} & & + & Deviance from the saturated model $G^2$-statistic \\ |
| {\tt DEVIANCE\_G2\_BY\_DF} & & + & Deviance $G^2$ divided by degrees of freedom \\ |
| {\tt DEVIANCE\_G2\_PVAL} & & + & Deviance $G^2$ p-value \\ |
| {\tt AVG\_TOT\_Y} & + & & $Y$-column average for an individual response value \\ |
| {\tt STDEV\_TOT\_Y} & + & & $Y$-column st.\ dev.\ for an individual response value \\ |
| {\tt AVG\_RES\_Y} & + & & $Y$-column residual average of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\ |
| {\tt STDEV\_RES\_Y} & + & & $Y$-column residual st.\ dev.\ of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\ |
| {\tt PRED\_STDEV\_RES} & + & + & Model-predicted $Y$-column residual st.\ deviation\\ |
| {\tt PLAIN\_R2} & + & & Plain $R^2$ of $Y$-column residual with bias included \\ |
| {\tt ADJUSTED\_R2} & + & & Adjusted $R^2$ of $Y$-column residual w.\ bias included \\ |
| {\tt PLAIN\_R2\_NOBIAS} & + & & Plain $R^2$ of $Y$-column residual, bias subtracted \\ |
| {\tt ADJUSTED\_R2\_NOBIAS} & + & & Adjusted $R^2$ of $Y$-column residual, bias subtracted \\ |
| \hline |
| \end{tabular}} |
| \caption{The above goodness-of-fit statistics are provided in CSV format, one per each line, with four |
| columns: (Name, [CID], [Disp?], Value). The columns are: |
| ``Name'' is the string identifier for the statistic, see the table; |
| ``CID'' is an optional integer value that specifies the $Y$-column index for \mbox{per-}column statistics |
| (note that a bi-/multinomial one-column {\tt Y}-input is converted into multi-column); |
| ``Disp?'' is an optional Boolean value ({\tt TRUE} or {\tt FALSE}) that tells us |
| whether or not scaling by the input dispersion parameter {\tt disp} has been applied to this |
| statistic; |
| ``Value'' is the value of the statistic.} |
| \label{table:GLMpred:stats} |
| \end{table} |
| |
| \noindent{\bf Details} |
| \smallskip |
| |
| The output matrix $M$ of predicted means (or probabilities) is computed by matrix-multiplying $X$ |
| with the first column of~$B$ or with the whole~$B$ in the multinomial case, adding the intercept |
| if available (conceptually, appending an extra column of ones to~$X$); then applying the inverse |
| of the model's link function. The difference between ``means'' and ``probabilities'' in the |
| categorical case becomes significant when there are ${\geq}\,2$ observations per record |
| (with the multi-label records) or when the labels such as $-1$ and~$1$ are viewed and averaged |
| as numerical response values (with the single-label records). To avoid any \mbox{mix-up} or |
| information loss, we separately return the predicted probability of each category label for each |
| record. |
| |
| When the ``actual'' response values $Y$ are available, the summary statistics are computed |
| and written out as described in Table~\ref{table:GLMpred:stats}. Below we discuss each of |
| these statistics in detail. Note that in the categorical case (binomial and multinomial) |
| $Y$ is internally represented as the matrix of observation counts for each label in each record, |
| rather than just the label~ID for each record. The input~$Y$ may already be a matrix of counts, |
| in which case it is used as-is. But if $Y$ is given as a vector of response labels, each |
| response label is converted into an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$ where~$l$ |
| is the label~ID for this record. All negative (e.g.~$-1$) or zero label~IDs are converted to |
| the $1 + {}$maximum label~ID. The largest label~ID is viewed as the ``baseline'' as explained |
| in the section on Multinomial Logistic Regression. We assume that there are $k\geq 1$ |
| non-baseline categories and one (last) baseline category. |
| |
| We also estimate residual variances for each response value, although we do not output them, |
| but use them only inside the summary statistics, scaled and unscaled by the input dispersion |
| parameter {\tt disp}, as described below. |
| |
| \smallskip |
| {\tt LOGLHOOD\_Z} and {\tt LOGLHOOD\_Z\_PVAL} statistics measure how far the log-likelihood |
| of~$Y$ deviates from its expected value according to the model. The script implements them |
| only for the binomial and the multinomial distributions, returning NaN for all other distributions. |
| Pearson's~$X^2$ and deviance~$G^2$ often perform poorly with bi- and multinomial distributions |
| due to low cell counts, hence we need this extra goodness-of-fit measure. To compute these |
| statistics, we use: |
| \begin{Itemize} |
| \item the $n\times (k\,{+}\,1)$-matrix~$Y$ of multi-label response counts, in which $y_{i,j}$ |
| is the number of times label~$j$ was observed in record~$i$; |
| \item the model-estimated probability matrix~$P$ of the same dimensions that satisfies |
| $\sum_{j=1}^{k+1} p_{i,j} = 1$ for all~$i=1,\ldots,n$ and where $p_{i,j}$ is the model |
| probability of observing label~$j$ in record~$i$; |
| \item the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of observations |
| in record~$i$ (all $N_i = 1$ if each record has only one response label). |
| \end{Itemize} |
| We start by computing the multinomial log-likelihood of $Y$ given~$P$ and~$N$, as well as |
| the expected log-likelihood given a random~$Y$ and the variance of this log-likelihood if |
| $Y$ indeed follows the proposed distribution: |
| \begin{align*} |
| \ell (Y) \,\,&=\,\, \log \Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1} \,y_{i,j}\log p_{i,j} \\ |
| \E_Y \ell (Y) \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j} |
| \,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\ |
| \Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2 |
| - \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right) |
| \end{align*} |
| Then we compute the $Z$-score as the difference between the actual and the expected |
| log-likelihood $\ell(Y)$ divided by its expected standard deviation, and its two-sided |
| p-value in the Normal distribution assumption ($\ell(Y)$ should approach normality due |
| to the Central Limit Theorem): |
| \begin{equation*} |
| Z \,=\, \frac {\ell(Y) - \E_Y \ell(Y)}{\sqrt{\Var_Y \ell(Y)}};\quad |
| \mathop{\textrm{p-value}}(Z) \,=\, \Prob \Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big] |
| \end{equation*} |
| A low p-value would indicate ``underfitting'' if $Z\ll 0$ or ``overfitting'' if $Z\gg 0$. Here |
| ``overfitting'' means that higher-probability labels occur more often than their probabilities |
| suggest. |
| |
| We also apply the dispersion input ({\tt disp}) to compute the ``scaled'' version of the $Z$-score |
| and its p-value. Since $\ell(Y)$ is a linear function of~$Y$, multiplying the GLM-predicted |
| variance of~$Y$ by {\tt disp} results in multiplying $\Var_Y \ell(Y)$ by the same {\tt disp}. This, in turn, |
| translates into dividing the $Z$-score by the square root of the dispersion: |
| \begin{equation*} |
| Z_{\texttt{disp}} \,=\, \big(\ell(Y) \,-\, \E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot\Var_Y \ell(Y)} |
| \,=\, Z / \sqrt{\texttt{disp}} |
| \end{equation*} |
| Finally, we recalculate the p-value with this new $Z$-score. |
| |
| \smallskip |
| {\tt PEARSON\_X2}, {\tt PEARSON\_X2\_BY\_DF}, and {\tt PEARSON\_X2\_PVAL}: |
| Pearson's residual $X^2$-statistic is a commonly used goodness-of-fit measure for linear models~\cite{McCullagh1989:GLM}. |
| The idea is to measure how well the model-predicted means and variances match the actual behavior |
| of response values. For each record $i$, we estimate the mean $\mu_i$ and the variance $v_i$ |
| (or $\texttt{disp}\cdot v_i$) and use them to normalize the residual: |
| $r_i = (y_i - \mu_i) / \sqrt{v_i}$. These normalized residuals are then squared, aggregated |
| by summation, and tested against an appropriate $\chi^2$~distribution. The computation of~$X^2$ |
| is slightly different for categorical data (bi- and multinomial) than it is for numerical data, |
| since $y_i$ has multiple correlated dimensions~\cite{McCullagh1989:GLM}: |
| \begin{equation*} |
| X^2\,\textrm{(numer.)} \,=\, \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad |
| X^2\,\textrm{(categ.)} \,=\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i |
| \hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}} |
| \end{equation*} |
| The number of degrees of freedom~\#d.f.\ for the $\chi^2$~distribution is $n - m$ for numerical data and |
| $(n - m)k$ for categorical data, where $k = \mathop{\texttt{ncol}}(Y) - 1$. Given the dispersion |
| parameter {\tt disp}, the $X^2$ statistic is scaled by division: \mbox{$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$}. |
| If the dispersion is accurate, $X^2 / \texttt{disp}$ should be close to~\#d.f. In fact, $X^2 / \textrm{\#d.f.}$ |
| over the \emph{training} data is the dispersion estimator used in our {\tt GLM.dml} script, |
| see~(\ref{eqn:dispersion}). Here we provide $X^2 / \textrm{\#d.f.}$ and $X^2_{\texttt{disp}} / \textrm{\#d.f.}$ |
| as {\tt PEARSON\_X2\_BY\_DF} to enable dispersion comparison between the training data and |
| the test data. |
| |
| NOTE: For categorical data, both Pearson's $X^2$ and the deviance $G^2$ are unreliable (i.e.\ do not |
| approach the $\chi^2$~distribution) unless the predicted means of multi-label counts |
| $\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$ are fairly large: all ${\geq}\,1$ and 80\% are |
| at least~$5$~\cite{Cochran1954:chisq}. They should not be used for ``one label per record'' categoricals. |
| |
| \smallskip |
| {\tt DEVIANCE\_G2}, {\tt DEVIANCE\_G2\_BY\_DF}, and {\tt DEVIANCE\_G2\_PVAL}: |
| Deviance $G^2$ is the log of the likelihood ratio between the ``saturated'' model and the |
| linear model being tested for the given dataset, multiplied by two: |
| \begin{equation} |
| G^2 \,=\, 2 \,\log \frac{\Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}% |
| {\Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]} |
| \label{eqn:GLMpred:deviance} |
| \end{equation} |
| The ``saturated'' model sets the mean $\mu_i^{\mathrm{sat}}$ to equal~$y_i$ for every record |
| (for categorical data, $p^{\mathrm{sat}}_{i,j} = y_{i,j} / N_i$), which represents the ``perfect fit.'' |
| For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at a boundary, by continuity we set |
| $0 \log 0 = 0$. The GLM~likelihood functions defined in~(\ref{eqn:GLM}) become simplified |
| in ratio~(\ref{eqn:GLMpred:deviance}) due to canceling out the term $c(y, a)$ since it is the same |
| in both models. |
| |
| The log of a likelihood ratio between two nested models, times two, is known to approach |
| a $\chi^2$ distribution as $n\to\infty$ if both models have fixed parameter spaces. |
| But this is not the case for the ``saturated'' model: it adds more parameters with each record. |
| In practice, however, $\chi^2$ distributions are used to compute the p-value of~$G^2$~\cite{McCullagh1989:GLM}. |
| The number of degrees of freedom~\#d.f.\ and the treatment of dispersion are the same as for |
| Pearson's~$X^2$, see above. |
| |
| \Paragraph{Column-wise statistics.} The rest of the statistics are computed separately |
| for each column of~$Y$. As explained above, $Y$~has two or more columns in bi- and multinomial case, |
| either at input or after conversion. Moreover, each $y_{i,j}$ in record~$i$ with $N_i \geq 2$ is |
| counted as $N_i$ separate observations $y_{i,j,l}$ of 0 or~1 (where $l=1,\ldots,N_i$) with |
| $y_{i,j}$~ones and $N_i-y_{i,j}$ zeros. |
| For power distributions, including linear regression, $Y$~has only one column and all |
| $N_i = 1$, so the statistics are computed for all~$Y$ with each record counted once. |
| Below we denote $N = \sum_{i=1}^n N_i \,\geq n$. |
| Here is the total average and the residual average (residual bias) of~$y_{i,j,l}$ for each $Y$-column: |
| \begin{equation*} |
| \texttt{AVG\_TOT\_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n y_{i,j}; \quad |
| \texttt{AVG\_RES\_Y}_j \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j}) |
| \end{equation*} |
| Dividing by~$N$ (rather than~$n$) gives the averages for~$y_{i,j,l}$ (rather than~$y_{i,j}$). |
| The total variance, and the standard deviation, for individual observations~$y_{i,j,l}$ is |
| estimated from the total variance for response values~$y_{i,j}$ using independence assumption: |
| $\Var y_{i,j} = \Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} \Var y_{i,j,l}$. |
| This allows us to estimate the sum of squares for~$y_{i,j,l}$ via the sum of squares for~$y_{i,j}$: |
| \begin{equation*} |
| \texttt{STDEV\_TOT\_Y}_j \,=\, |
| \Bigg[\frac{1}{N-1} \sum_{i=1}^n \Big( y_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n y_{i'\!,j}\Big)^2\Bigg]^{1/2} |
| \end{equation*} |
| Analogously, we estimate the standard deviation of the residual $y_{i,j,l} - \mu_{i,j,l}$: |
| \begin{equation*} |
| \texttt{STDEV\_RES\_Y}_j \,=\, |
| \Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n \Big( y_{i,j} - \mu_{i,j} - \frac{N_i}{N} \sum_{i'=1}^n (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2} |
| \end{equation*} |
| Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if it does not. |
| The estimated standard deviations can be compared to the model-predicted residual standard deviation |
| computed from the predicted means by the GLM variance formula and scaled by the dispersion: |
| \begin{equation*} |
| \texttt{PRED\_STDEV\_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2} |
| \end{equation*} |
| We also compute the $R^2$ statistics for each column of~$Y$, see Table~\ref{table:GLMpred:R2} for details. |
| We compute two versions of~$R^2$: in one version the residual sum-of-squares (RSS) includes any bias in |
| the residual that might be present (due to the lack of, or inaccuracy in, the intercept); in the other |
| version of~RSS the bias is subtracted by ``centering'' the residual. In both cases we subtract the bias in the total |
| sum-of-squares (in the denominator), and $m'$ equals $m$~with the intercept or $m+1$ without the intercept. |
| |
| \begin{table}[t]\small\centerline{% |
| \begin{tabular}{|c|c|} |
| \multicolumn{2}{c}{$R^2$ where the residual sum-of-squares includes the bias contribution:} \\ |
| \hline |
| \multicolumn{1}{|l|}{\tt PLAIN\_R2${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2${}_j \,\,= {}$} \\ |
| $ \displaystyle 1 - |
| \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}% |
| {\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ & |
| $ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}} \, |
| \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}% |
| {\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ \\ |
| \hline |
| \multicolumn{2}{c}{} \\ |
| \multicolumn{2}{c}{$R^2$ where the residual sum-of-squares is centered so that the bias is subtracted:} \\ |
| \hline |
| \multicolumn{1}{|l|}{\tt PLAIN\_R2\_NOBIAS${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2\_NOBIAS${}_j \,\,= {}$} \\ |
| $ \displaystyle 1 - |
| \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! |
| \sum\limits_{i'=1}^n (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}% |
| {\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ & |
| $ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, |
| \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! |
| \sum\limits_{i'=1}^n (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}% |
| {\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n \! y_{i'\!,j} \Big)^{\! 2}} $ \\ |
| \hline |
| \end{tabular}} |
| \caption{The $R^2$ statistics we compute in {\tt GLM-predict.dml}} |
| \label{table:GLMpred:R2} |
| \end{table} |
| |
| \smallskip |
| \noindent{\bf Returns} |
| \smallskip |
| |
| The matrix of predicted means (if the response is numerical) or probabilities (if the response |
| is categorical), see ``Description'' subsection above for more information. Given {\tt Y}, we |
| return some statistics in CSV format as described in Table~\ref{table:GLMpred:stats} and in the |
| above text. |
| |
| \smallskip |
| \noindent{\bf Examples} |
| \smallskip |
| |
| Note that in the examples below the value for ``{\tt disp}'' input argument |
| is set arbitrarily. The correct dispersion value should be computed from the training |
| data during model estimation, or omitted if unknown (which sets it to~{\tt 1.0}). |
| |
| \smallskip\noindent |
| Linear regression example: |
| \par\hangindent=\parindent\noindent{\tt |
| \hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0 disp=5.67 |
| X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv |
| Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv |
| |
| }\smallskip\noindent |
| Linear regression example, prediction only (no {\tt Y} given): |
| \par\hangindent=\parindent\noindent{\tt |
| \hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0 |
| X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv |
| |
| }\smallskip\noindent |
| Binomial logistic regression example: |
| \par\hangindent=\parindent\noindent{\tt |
| \hml -f GLM-predict.dml -nvargs dfam=2 link=2 disp=3.0004464 |
| X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv |
| Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv |
| |
| }\smallskip\noindent |
| Binomial probit regression example: |
| \par\hangindent=\parindent\noindent{\tt |
| \hml -f GLM-predict.dml -nvargs dfam=2 link=3 disp=3.0004464 |
| X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv |
| Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv |
| |
| }\smallskip\noindent |
| Multinomial logistic regression example: |
| \par\hangindent=\parindent\noindent{\tt |
| \hml -f GLM-predict.dml -nvargs dfam=3 |
| X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv |
| Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv |
| |
| }\smallskip\noindent |
| Poisson regression with the log link example: |
| \par\hangindent=\parindent\noindent{\tt |
| \hml -f GLM-predict.dml -nvargs dfam=1 vpow=1.0 link=1 lpow=0.0 disp=3.45 |
| X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv |
| Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv |
| |
| }\smallskip\noindent |
| Gamma regression with the inverse (reciprocal) link example: |
| \par\hangindent=\parindent\noindent{\tt |
| \hml -f GLM-predict.dml -nvargs dfam=1 vpow=2.0 link=1 lpow=-1.0 disp=1.99118 |
| X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv |
| Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv |
| |
| } |