apache / systemds / refs/heads/branch-0.11 / . / docs / Algorithms Reference / DescriptiveStratStats.tex

\begin{comment} | |

Licensed to the Apache Software Foundation (ASF) under one | |

or more contributor license agreements. See the NOTICE file | |

distributed with this work for additional information | |

regarding copyright ownership. The ASF licenses this file | |

to you under the Apache License, Version 2.0 (the | |

"License"); you may not use this file except in compliance | |

with the License. You may obtain a copy of the License at | |

http://www.apache.org/licenses/LICENSE-2.0 | |

Unless required by applicable law or agreed to in writing, | |

software distributed under the License is distributed on an | |

"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |

KIND, either express or implied. See the License for the | |

specific language governing permissions and limitations | |

under the License. | |

\end{comment} | |

\subsection{Stratified Bivariate Statistics} | |

\noindent{\bf Description} | |

\smallskip | |

The {\tt stratstats.dml} script computes common bivariate statistics, such | |

as correlation, slope, and their p-value, in parallel for many pairs of input | |

variables in the presence of a confounding categorical variable. The values | |

of this confounding variable group the records into strata (subpopulations), | |

in which all bivariate pairs are assumed free of confounding. The script | |

uses the same data model as in one-way analysis of covariance (ANCOVA), with | |

strata representing population samples. It also outputs univariate stratified | |

and bivariate unstratified statistics. | |

\begin{table}[t]\hfil | |

\begin{tabular}{|l|ll|ll|ll||ll|} | |

\hline | |

Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} & | |

\multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\ | |

Customers, millions & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\ | |

\hline | |

Promotion (0 or 1) & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ | |

Avg.\ sales per 1000 & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\ | |

\hline | |

\end{tabular}\hfil | |

\caption{Stratification example: the effect of the promotion on average sales | |

becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.} | |

\label{table:stratexample} | |

\end{table} | |

To see how data stratification mitigates confounding, consider an (artificial) | |

example in Table~\ref{table:stratexample}. A highly seasonal retail item | |

was marketed with and without a promotion over the final 3~months of the year. | |

In each month the sale was more likely with the promotion than without it. | |

But during the peak holiday season, when shoppers came in greater numbers and | |

bought the item more often, the promotion was less frequently used. As a result, | |

if the 4-th quarter data is pooled together, the promotion's effect becomes | |

reversed and magnified. Stratifying by month restores the positive correlation. | |

The script computes its statistics in parallel over all possible pairs from two | |

specified sets of covariates. The 1-st covariate is a column in input matrix~$X$ | |

and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may | |

be the same or different. The columns of interest are given by their index numbers | |

in special matrices. The stratum column, specified in its own matrix, is the same | |

for all covariate pairs. | |

Both covariates in each pair must be numerical, with the 2-nd covariate normally | |

distributed given the 1-st covariate (see~Details). Missing covariate values or | |

strata are represented by~``NaN''. Records with NaN's are selectively omitted | |

wherever their NaN's are material to the output statistic. | |

\smallskip | |

\pagebreak[3] | |

\noindent{\bf Usage} | |

\smallskip | |

{\hangindent=\parindent\noindent\it% | |

{\tt{}-f }path/\/{\tt{}stratstats.dml} | |

{\tt{} -nvargs} | |

{\tt{} X=}path/file | |

{\tt{} Xcid=}path/file | |

{\tt{} Y=}path/file | |

{\tt{} Ycid=}path/file | |

{\tt{} S=}path/file | |

{\tt{} Scid=}int | |

{\tt{} O=}path/file | |

{\tt{} fmt=}format | |

} | |

\smallskip | |

\noindent{\bf Arguments} | |

\begin{Description} | |

\item[{\tt X}:] | |

Location (on HDFS) to read matrix $X$ whose columns we want to use as | |

the 1-st covariate (i.e.~as the feature variable) | |

\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "}) | |

Location to read the single-row matrix that lists all index numbers | |

of the $X$-columns used as the 1-st covariate; the default value means | |

``use all $X$-columns'' | |

\item[{\tt Y}:] (default:\mbox{ }{\tt " "}) | |

Location to read matrix $Y$ whose columns we want to use as the 2-nd | |

covariate (i.e.~as the response variable); the default value means | |

``use $X$ in place of~$Y$'' | |

\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "}) | |

Location to read the single-row matrix that lists all index numbers | |

of the $Y$-columns used as the 2-nd covariate; the default value means | |

``use all $Y$-columns'' | |

\item[{\tt S}:] (default:\mbox{ }{\tt " "}) | |

Location to read matrix $S$ that has the stratum column. | |

Note: the stratum column must contain small positive integers; all fractional | |

values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as | |

missing. The default value for {\tt S} means ``use $X$ in place of~$S$'' | |

\item[{\tt Scid}:] (default:\mbox{ }{\tt 1}) | |

The index number of the stratum column in~$S$ | |

\item[{\tt O}:] | |

Location to store the output matrix defined in Table~\ref{table:stratoutput} | |

\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) | |

Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; | |

see read/write functions in SystemML Language Reference for details. | |

\end{Description} | |

\begin{table}[t]\small\hfil | |

\begin{tabular}{|rcl|rcl|} | |

\hline | |

& Col.\# & Meaning & & Col.\# & Meaning \\ | |

\hline | |

\multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em} | |

& 01 & $X$-column number & | |

\multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em} | |

& 11 & $Y$-column number \\ | |

& 02 & presence count for $x$ & | |

& 12 & presence count for $y$ \\ | |

& 03 & global mean $(x)$ & | |

& 13 & global mean $(y)$ \\ | |

& 04 & global std.\ dev. $(x)$ & | |

& 14 & global std.\ dev. $(y)$ \\ | |

& 05 & stratified std.\ dev. $(x)$ & | |

& 15 & stratified std.\ dev. $(y)$ \\ | |

& 06 & $R^2$ for $x \sim {}$strata & | |

& 16 & $R^2$ for $y \sim {}$strata \\ | |

& 07 & adjusted $R^2$ for $x \sim {}$strata & | |

& 17 & adjusted $R^2$ for $y \sim {}$strata \\ | |

& 08 & p-value, $x \sim {}$strata & | |

& 18 & p-value, $y \sim {}$strata \\ | |

& 09--10 & reserved & | |

& 19--20 & reserved \\ | |

\hline | |

\multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em} | |

& 21 & presence count $(x, y)$ & | |

\multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em} | |

& 31 & presence count $(x, y, s)$ \\ | |

& 22 & regression slope & | |

& 32 & regression slope \\ | |

& 23 & regres.\ slope std.\ dev. & | |

& 33 & regres.\ slope std.\ dev. \\ | |

& 24 & correlation${} = \pm\sqrt{R^2}$ & | |

& 34 & correlation${} = \pm\sqrt{R^2}$ \\ | |

& 25 & residual std.\ dev. & | |

& 35 & residual std.\ dev. \\ | |

& 26 & $R^2$ in $y$ due to $x$ & | |

& 36 & $R^2$ in $y$ due to $x$ \\ | |

& 27 & adjusted $R^2$ in $y$ due to $x$ & | |

& 37 & adjusted $R^2$ in $y$ due to $x$ \\ | |

& 28 & p-value for ``slope = 0'' & | |

& 38 & p-value for ``slope = 0'' \\ | |

& 29 & reserved & | |

& 39 & \# strata with ${\geq}\,2$ count \\ | |

& 30 & reserved & | |

& 40 & reserved \\ | |

\hline | |

\end{tabular}\hfil | |

\caption{The {\tt stratstats.dml} output matrix has one row per each distinct | |

pair of 1-st and 2-nd covariates, and 40 columns with the statistics described | |

here.} | |

\label{table:stratoutput} | |

\end{table} | |

\noindent{\bf Details} | |

\smallskip | |

Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is | |

a stratum number and $(x, y)$ are two numerical covariates. We want to analyze | |

conditional linear relationship between $y$ and $x$ conditioned by~$i$. | |

Note that $x$, but not~$y$, may represent a categorical variable if we assign a | |

numerical value to each category, for example 0 and 1 for two categories. | |

We assume a linear regression model for~$y$: | |

\begin{equation} | |

y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\, | |

\eps_{i,j} \sim \Normal(0, \sigma^2) | |

\label{eqn:stratlinmodel} | |

\end{equation} | |

Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number | |

in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$. | |

The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata. | |

When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in | |

stratum~$i$ as | |

\begin{equation*} | |

\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad | |

\bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i | |

\end{equation*} | |

If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$, | |

which gives the prediction error sum-of-squares of | |

\begin{equation} | |

\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2 | |

\,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y | |

\label{eqn:stratsumsq} | |

\end{equation} | |

where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample | |

estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as | |

\begin{align*} | |

V_x \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad | |

V_y \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\ | |

V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big) | |

\end{align*} | |

They are stratified because we compute the sample (co-)variances in each stratum~$i$ | |

separately, then combine by summation. The stratified estimates for $\Var(X)$ and $\Var(Y)$ | |

tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$ | |

and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$ | |

than the global means. The stratified variance estimates the uncertainty in $x_{i,j}$ | |

and~$y_{i,j}$ given their stratum~$i$. | |

Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq}) | |

gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$}, | |

with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS): | |

\begin{equation*} | |

\mathrm{RSS} \,\,=\, \, | |

\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - | |

\hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2 | |

\,\,=\,\, V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big) | |

\end{equation*} | |

The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction | |

of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear | |

regression model~(\ref{eqn:stratlinmodel}). We define \emph{stratified correlation} as the | |

square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$. We also use RSS to estimate | |

the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error | |

of $y_{i,j}$ given $x_{i,j}$ and the stratum: | |

\begin{equation*} | |

\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}}; | |

\,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y}; | |

\,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\, | |

\Big(n = \sum_{i=1}^k n_i\Big) | |

\end{equation*} | |

The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are | |

obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares, | |

measured by the decrease from $V_y$ to~RSS. | |

The $F$-statistic is the ratio of the ``explained'' sum-of-squares | |

to the residual sum-of-squares, divided by their corresponding degrees of freedom. | |

There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that | |

to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree | |

of freedom: | |

\begin{equation*} | |

F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)} | |

\,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad | |

t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}. | |

\end{equation*} | |

The $t$-statistic is simply the square root of the $F$-statistic with the appropriate | |

choice of sign. If the null hypothesis and the linear model are both true, the $t$-statistic | |

has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom. We can | |

also compute it if we divide $\hat{\beta}$ by its estimated standard deviation: | |

\begin{equation*} | |

\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad | |

t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \beta \,/\, \stdev(\hat{\beta})_{\mathrm{est}} | |

\end{equation*} | |

The standard deviation estimate for~$\beta$ is included in {\tt stratstats.dml} output. | |

\smallskip | |

\noindent{\bf Returns} | |

\smallskip | |

The output matrix format is defined in Table~\ref{table:stratoutput}. | |

\smallskip | |

\noindent{\bf Examples} | |

\smallskip | |

{\hangindent=\parindent\noindent\tt | |

\hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx | |

Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2 | |

O=/user/biadmin/Out.mtx fmt=csv | |

} | |

{\hangindent=\parindent\noindent\tt | |

\hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx | |

Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx | |

} | |

%\smallskip | |

%\noindent{\bf See Also} | |

%\smallskip | |

% | |

%For non-stratified bivariate statistics with a wider variety of input data types | |

%and statistical tests, see \ldots. For general linear regression, see | |

%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}. For logistic regression, appropriate | |

%when the response variable is categorical, see {\tt MultiLogReg.dml}. | |