 \begin{comment} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. \end{comment} \subsection{Stratified Bivariate Statistics} \noindent{\bf Description} \smallskip The {\tt stratstats.dml} script computes common bivariate statistics, such as correlation, slope, and their p-value, in parallel for many pairs of input variables in the presence of a confounding categorical variable. The values of this confounding variable group the records into strata (subpopulations), in which all bivariate pairs are assumed free of confounding. The script uses the same data model as in one-way analysis of covariance (ANCOVA), with strata representing population samples. It also outputs univariate stratified and bivariate unstratified statistics. 
\begin{table}[t]\hfil
\begin{tabular}{|l|ll|ll|ll||ll|}
\hline
Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} & \multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\
Customers, millions & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\
\hline
Promotion (0 or 1) & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
Avg.\ sales per 1000 & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\
\hline
\end{tabular}\hfil
\caption{Stratification example: the effect of the promotion on average sales becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.}
\label{table:stratexample}
\end{table}

To see how data stratification mitigates confounding, consider the (artificial) example in Table~\ref{table:stratexample}. A highly seasonal retail item was marketed with and without a promotion over the final 3~months of the year. In each month, a sale was more likely with the promotion than without it. But during the peak holiday season, when shoppers came in greater numbers and bought the item more often, the promotion was used less frequently. As a result, if the 4-th quarter data is pooled together, the promotion's effect becomes reversed and magnified. Stratifying by month restores the positive correlation.

The script computes its statistics in parallel over all possible pairs from two specified sets of covariates. The 1-st covariate is a column in input matrix~$X$ and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may be the same or different. The columns of interest are given by their index numbers in special matrices. The stratum column, specified in its own matrix, is the same for all covariate pairs.

Both covariates in each pair must be numerical, with the 2-nd covariate normally distributed given the 1-st covariate (see~Details). Missing covariate values or strata are represented by~``NaN''. Records with NaN's are selectively omitted wherever their NaN's are material to the output statistic.
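
Returning to the example in Table~\ref{table:stratexample}, the pooled Oct$\,$--$\,$Dec figures can be checked by hand as customer-weighted averages of the monthly figures. The worked arithmetic below uses only the numbers from the table:
\begin{align*}
\textrm{no promotion:}\quad & \frac{0.6\cdot 0.4 \,+\, 1.4\cdot 0.9 \,+\, 3.0\cdot 2.5}{0.6 + 1.4 + 3.0} \,=\, \frac{9.0}{5.0} \,=\, 1.8 \\
\textrm{promotion:}\quad & \frac{1.4\cdot 0.5 \,+\, 0.6\cdot 1.0 \,+\, 1.0\cdot 2.6}{1.4 + 0.6 + 1.0} \,=\, \frac{3.9}{3.0} \,=\, 1.3
\end{align*}
Within every month the promotion adds $0.1$ to the average sales, yet the pooled difference is $1.3 - 1.8 = -0.5$, because most no-promotion customers fall in the high-sales month of December.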
\smallskip
\pagebreak[3]

\noindent{\bf Usage}
\smallskip

{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}stratstats.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Xcid=}path/file
{\tt{} Y=}path/file
{\tt{} Ycid=}path/file
{\tt{} S=}path/file
{\tt{} Scid=}int
{\tt{} O=}path/file
{\tt{} fmt=}format

}

\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read matrix $X$ whose columns we want to use as the 1-st covariate (i.e.~as the feature variable)
\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "})
Location to read the single-row matrix that lists all index numbers of the $X$-columns used as the 1-st covariate; the default value means ``use all $X$-columns''
\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
Location to read matrix $Y$ whose columns we want to use as the 2-nd covariate (i.e.~as the response variable); the default value means ``use $X$ in place of~$Y$''
\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "})
Location to read the single-row matrix that lists all index numbers of the $Y$-columns used as the 2-nd covariate; the default value means ``use all $Y$-columns''
\item[{\tt S}:] (default:\mbox{ }{\tt " "})
Location to read matrix $S$ that has the stratum column. Note: the stratum column must contain small positive integers; all fractional values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as missing. The default value for {\tt S} means ``use $X$ in place of~$S$''
\item[{\tt Scid}:] (default:\mbox{ }{\tt 1})
The index number of the stratum column in~$S$
\item[{\tt O}:]
Location to store the output matrix defined in Table~\ref{table:stratoutput}
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; see read/write functions in SystemML Language Reference for details.
\end{Description} \begin{table}[t]\small\hfil \begin{tabular}{|rcl|rcl|} \hline & Col.\# & Meaning & & Col.\# & Meaning \\ \hline \multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em} & 01 & $X$-column number & \multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em} & 11 & $Y$-column number \\ & 02 & presence count for $x$ & & 12 & presence count for $y$ \\ & 03 & global mean $(x)$ & & 13 & global mean $(y)$ \\ & 04 & global std.\ dev. $(x)$ & & 14 & global std.\ dev. $(y)$ \\ & 05 & stratified std.\ dev. $(x)$ & & 15 & stratified std.\ dev. $(y)$ \\ & 06 & $R^2$ for $x \sim {}$strata & & 16 & $R^2$ for $y \sim {}$strata \\ & 07 & adjusted $R^2$ for $x \sim {}$strata & & 17 & adjusted $R^2$ for $y \sim {}$strata \\ & 08 & p-value, $x \sim {}$strata & & 18 & p-value, $y \sim {}$strata \\ & 09--10 & reserved & & 19--20 & reserved \\ \hline \multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em} & 21 & presence count $(x, y)$ & \multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em} & 31 & presence count $(x, y, s)$ \\ & 22 & regression slope & & 32 & regression slope \\ & 23 & regres.\ slope std.\ dev. & & 33 & regres.\ slope std.\ dev. \\ & 24 & correlation${} = \pm\sqrt{R^2}$ & & 34 & correlation${} = \pm\sqrt{R^2}$ \\ & 25 & residual std.\ dev. & & 35 & residual std.\ dev. 
\\
& 26 & $R^2$ in $y$ due to $x$ & & 36 & $R^2$ in $y$ due to $x$ \\
& 27 & adjusted $R^2$ in $y$ due to $x$ & & 37 & adjusted $R^2$ in $y$ due to $x$ \\
& 28 & p-value for ``slope = 0'' & & 38 & p-value for ``slope = 0'' \\
& 29 & reserved & & 39 & \# strata with ${\geq}\,2$ count \\
& 30 & reserved & & 40 & reserved \\
\hline
\end{tabular}\hfil
\caption{The {\tt stratstats.dml} output matrix has one row for each distinct pair of 1-st and 2-nd covariates, and 40 columns with the statistics described here.}
\label{table:stratoutput}
\end{table}

\noindent{\bf Details}
\smallskip

Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is a stratum number and $(x, y)$ are two numerical covariates. We want to analyze the linear relationship between $y$ and $x$ conditioned on the stratum~$i$. Note that $x$, but not~$y$, may represent a categorical variable if we assign a numerical value to each category, for example 0 and 1 for two categories.

We assume a linear regression model for~$y$:
\begin{equation}
y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\, \eps_{i,j} \sim \Normal(0, \sigma^2)
\label{eqn:stratlinmodel}
\end{equation}
Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$. The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata.
When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in stratum~$i$ as
\begin{equation*}
\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad \bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i
\end{equation*}
If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$, which gives the prediction error sum-of-squares of
\begin{equation}
\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2 \,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y
\label{eqn:stratsumsq}
\end{equation}
where $V_x$, $V_y$, and $V_{x, y}$ are, respectively, the ``stratified'' sample estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as
\begin{align*}
V_x \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad V_y \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\
V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big)
\end{align*}
They are stratified because we compute the sample (co-)variances in each stratum~$i$ separately, then combine by summation. The stratified estimates for $\Var(x)$ and $\Var(y)$ tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$ and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$ than the global means. The stratified variance estimates the uncertainty in $x_{i,j}$ and~$y_{i,j}$ given their stratum~$i$.
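
For reference, the identity~(\ref{eqn:stratsumsq}) can be verified by regrouping each residual around its stratum means and expanding the square; the step below uses only the definitions of $V_x$, $V_y$, and $V_{x,y}$ given above:
\begin{align*}
\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big((y_{i,j} - \bar{y}_i) - \beta (x_{i,j} - \bar{x}_i)\big)^2
\,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2
\,-\, 2\beta \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big) \\
&\quad\,+\, \beta^2 \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2
\,\,=\,\, V_y \,-\, 2\beta\, V_{x,y} \,+\, \beta^{2\,} V_x
\end{align*}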

Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq}) gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$}, with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS):
\begin{equation*}
\mathrm{RSS} \,\,=\, \, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2 \,\,=\,\, V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big)
\end{equation*}
The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear regression model~(\ref{eqn:stratlinmodel}). We define \emph{stratified correlation} as the square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$. We also use RSS to estimate the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error of $y_{i,j}$ given $x_{i,j}$ and the stratum:
\begin{equation*}
\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}}; \,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y}; \,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\, \Big(n = \sum_{i=1}^k n_i\Big)
\end{equation*}

The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares, measured by the decrease from $V_y$ to~RSS. The $F$-statistic is the ratio of the ``explained'' sum-of-squares to the residual sum-of-squares, divided by their corresponding degrees of freedom.
There are $n\,{-}\,k$ degrees of freedom for~$V_y$; estimating the parameter $\beta$ reduces that to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree of freedom:
\begin{equation*}
F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)} \,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}.
\end{equation*}
The $t$-statistic is simply the square root of the $F$-statistic with the appropriate choice of sign. If the null hypothesis and the linear model are both true, the $t$-statistic has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom. We can also compute it if we divide $\hat{\beta}$ by its estimated standard deviation:
\begin{equation*}
\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \hat{\beta} \,/\, \stdev(\hat{\beta})_{\mathrm{est}}
\end{equation*}
The standard deviation estimate for~$\hat{\beta}$ is included in {\tt stratstats.dml} output.

\smallskip
\noindent{\bf Returns}
\smallskip

The output matrix format is defined in Table~\ref{table:stratoutput}.

\smallskip
\noindent{\bf Examples}
\smallskip

{\hangindent=\parindent\noindent\tt
\hml -f stratstats.dml -nvargs
X=/user/biadmin/X.mtx
Xcid=/user/biadmin/Xcid.mtx
Y=/user/biadmin/Y.mtx
Ycid=/user/biadmin/Ycid.mtx
S=/user/biadmin/S.mtx
Scid=2
O=/user/biadmin/Out.mtx
fmt=csv

}
{\hangindent=\parindent\noindent\tt
\hml -f stratstats.dml -nvargs
X=/user/biadmin/Data.mtx
Xcid=/user/biadmin/Xcid.mtx
Ycid=/user/biadmin/Ycid.mtx
Scid=7
O=/user/biadmin/Out.mtx

}

%\smallskip
%\noindent{\bf See Also}
%\smallskip
%
%For non-stratified bivariate statistics with a wider variety of input data types
%and statistical tests, see \ldots. For general linear regression, see
%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}. For logistic regression, appropriate
%when the response variable is categorical, see {\tt MultiLogReg.dml}.
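
The stratified estimates from the Details can be cross-checked on a small data set outside of SystemML. The following is a minimal Python sketch of those formulas for a single covariate pair, not the implementation used by {\tt stratstats.dml}. The function name {\tt stratified\_stats}, the record layout, and the toy data are all hypothetical. The sketch assumes there are no missing values, counts every non-empty stratum in $k$, and stops at the $F$- and $t$-statistics without computing p-values.

```python
import math
from collections import defaultdict


def stratified_stats(records):
    """Stratified slope, correlation, and test statistics for one (x, y) pair.

    `records` is an iterable of (stratum, x, y) tuples. This is a hypothetical
    helper: it assumes no NaN values and counts every non-empty stratum in k.
    """
    groups = defaultdict(list)
    for stratum, x, y in records:
        groups[stratum].append((x, y))

    v_x = v_y = v_xy = 0.0
    n = 0
    for pairs in groups.values():
        n_i = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n_i
        mean_y = sum(y for _, y in pairs) / n_i
        # Accumulate deviations from the stratum means, then sum over strata.
        for x, y in pairs:
            v_x += (x - mean_x) ** 2
            v_y += (y - mean_y) ** 2
            v_xy += (x - mean_x) * (y - mean_y)
        n += n_i

    k = len(groups)
    dof = n - k - 1                      # residual degrees of freedom
    beta = v_xy / v_x                    # regression slope estimate
    r = v_xy / math.sqrt(v_x * v_y)      # stratified correlation
    r2 = r * r
    rss = v_y * (1.0 - r2)               # residual sum-of-squares
    sigma = math.sqrt(rss / dof)         # residual standard deviation
    stdev_beta = sigma / math.sqrt(v_x)  # estimated std. dev. of the slope
    return {
        "beta": beta, "r": r, "r2": r2, "sigma": sigma,
        "stdev_beta": stdev_beta,
        "f": r2 * dof / (1.0 - r2),
        "t": beta / stdev_beta,
    }


# Toy data: two strata with different intercepts and a common slope near 2.
demo = [(1, 0, 1), (1, 1, 2), (1, 2, 6),
        (2, 0, 10), (2, 1, 13), (2, 2, 13)]
result = stratified_stats(demo)
print(result)
```

For the toy data, $V_x = 4$, $V_y = 20$, and $V_{x,y} = 8$, so the formulas give $\hat{\beta} = 2$, $\hat{R}^2 = 0.8$, and $F = 12$ with $n - k - 1 = 3$ degrees of freedom.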