docs/Algorithms Reference/DescriptiveStratStats.tex - systemds - Git at Google

 \begin{comment}

  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.

 \end{comment}

 \subsection{Stratified Bivariate Statistics}

 \noindent{\bf Description}
 \smallskip

 The {\tt stratstats.dml} script computes common bivariate statistics, such
 as correlation, slope, and their p-value, in parallel for many pairs of input
 variables in the presence of a confounding categorical variable.  The values
 of this confounding variable group the records into strata (subpopulations),
 in which all bivariate pairs are assumed free of confounding.  The script
 uses the same data model as in one-way analysis of covariance (ANCOVA), with
 strata representing population samples.  It also outputs univariate stratified
 and bivariate unstratified statistics.

 \begin{table}[t]\hfil
 \begin{tabular}{|l|ll|ll|ll||ll|}
 \hline
 Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} &
     \multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\
 Customers, millions    & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\
 \hline
 Promotion (0 or 1)     & 0   & 1   & 0   & 1   & 0   & 1   & 0   & 1   \\
 Avg.\ sales per 1000   & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\
 \hline
 \end{tabular}\hfil
 \caption{Stratification example: the effect of the promotion on average sales
 becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.}
 \label{table:stratexample}
 \end{table}

 To see how data stratification mitigates confounding, consider an (artificial)
 example in Table~\ref{table:stratexample}.  A highly seasonal retail item
 was marketed with and without a promotion over the final 3~months of the year.
 In each month the sale was more likely with the promotion than without it.
 But during the peak holiday season, when shoppers came in greater numbers and
 bought the item more often, the promotion was less frequently used.  As a result,
 if the 4-th quarter data is pooled together, the promotion's effect becomes
 reversed and magnified.  Stratifying by month restores the positive correlation.

 The script computes its statistics in parallel over all possible pairs from two
 specified sets of covariates.  The 1-st covariate is a column in input matrix~$X$
 and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may
 be the same or different.  The columns of interest are given by their index numbers
 in special matrices.  The stratum column, specified in its own matrix, is the same
 for all covariate pairs.

 Both covariates in each pair must be numerical, with the 2-nd covariate normally
 distributed given the 1-st covariate (see~Details).  Missing covariate values or
 strata are represented by~``NaN''.  Records with NaN's are selectively omitted
 wherever their NaN's are material to the output statistic.

 \smallskip
 \pagebreak[3]

 \noindent{\bf Usage}
 \smallskip

 {\hangindent=\parindent\noindent\it%
 {\tt{}-f }path/\/{\tt{}stratstats.dml}
 {\tt{} -nvargs}
 {\tt{} X=}path/file
 {\tt{} Xcid=}path/file
 {\tt{} Y=}path/file
 {\tt{} Ycid=}path/file
 {\tt{} S=}path/file
 {\tt{} Scid=}int
 {\tt{} O=}path/file
 {\tt{} fmt=}format

 }


 \smallskip
 \noindent{\bf Arguments}
 \begin{Description}
 \item[{\tt X}:]
 Location (on HDFS) to read matrix $X$ whose columns we want to use as
 the 1-st covariate (i.e.~as the feature variable)
 \item[{\tt Xcid}:] (default:\mbox{ }{\tt " "})
 Location to read the single-row matrix that lists all index numbers
 of the $X$-columns used as the 1-st covariate; the default value means
 ``use all $X$-columns''
 \item[{\tt Y}:] (default:\mbox{ }{\tt " "})
 Location to read matrix $Y$ whose columns we want to use as the 2-nd
 covariate (i.e.~as the response variable); the default value means
 ``use $X$ in place of~$Y$''
 \item[{\tt Ycid}:] (default:\mbox{ }{\tt " "})
 Location to read the single-row matrix that lists all index numbers
 of the $Y$-columns used as the 2-nd covariate; the default value means
 ``use all $Y$-columns''
 \item[{\tt S}:] (default:\mbox{ }{\tt " "})
 Location to read matrix $S$ that has the stratum column.
 Note: the stratum column must contain small positive integers; all fractional
 values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as
 missing.  The default value for {\tt S} means ``use $X$ in place of~$S$''
 \item[{\tt Scid}:] (default:\mbox{ }{\tt 1})
 The index number of the stratum column in~$S$
 \item[{\tt O}:]
 Location to store the output matrix defined in Table~\ref{table:stratoutput}
 \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
 Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
 see read/write functions in SystemML Language Reference for details.
 \end{Description}


 \begin{table}[t]\small\hfil
 \begin{tabular}{|rcl|rcl|}
 \hline
 & Col.\# & Meaning & & Col.\# & Meaning \\
 \hline
 \multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em}
 & 01     & $X$-column number                &
 \multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em}
 & 11     & $Y$-column number                \\
 & 02     & presence count for $x$           &
 & 12     & presence count for $y$           \\
 & 03     & global mean $(x)$                &
 & 13     & global mean $(y)$                \\
 & 04     & global std.\ dev. $(x)$          &
 & 14     & global std.\ dev. $(y)$          \\
 & 05     & stratified std.\ dev. $(x)$      &
 & 15     & stratified std.\ dev. $(y)$      \\
 & 06     & $R^2$ for $x \sim {}$strata      &
 & 16     & $R^2$ for $y \sim {}$strata      \\
 & 07     & adjusted $R^2$ for $x \sim {}$strata      &
 & 17     & adjusted $R^2$ for $y \sim {}$strata      \\
 & 08     & p-value, $x \sim {}$strata       &
 & 18     & p-value, $y \sim {}$strata       \\
 & 09--10 & reserved                         &
 & 19--20 & reserved                         \\
 \hline
 \multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em}
 & 21     & presence count $(x, y)$          &
 \multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em}
 & 31     & presence count $(x, y, s)$       \\
 & 22     & regression slope                 &
 & 32     & regression slope                 \\
 & 23     & regres.\ slope std.\ dev.        &
 & 33     & regres.\ slope std.\ dev.        \\
 & 24     & correlation${} = \pm\sqrt{R^2}$  &
 & 34     & correlation${} = \pm\sqrt{R^2}$  \\
 & 25     & residual std.\ dev.              &
 & 35     & residual std.\ dev.              \\
 & 26     & $R^2$ in $y$ due to $x$          &
 & 36     & $R^2$ in $y$ due to $x$          \\
 & 27     & adjusted $R^2$ in $y$ due to $x$ &
 & 37     & adjusted $R^2$ in $y$ due to $x$ \\
 & 28     & p-value for ``slope = 0''        &
 & 38     & p-value for ``slope = 0''        \\
 & 29     & reserved                         &
 & 39     & \# strata with ${\geq}\,2$ count \\
 & 30     & reserved                         &
 & 40     & reserved                         \\
 \hline
 \end{tabular}\hfil
 \caption{The {\tt stratstats.dml} output matrix has one row per each distinct
 pair of 1-st and 2-nd covariates, and 40 columns with the statistics described
 here.}
 \label{table:stratoutput}
 \end{table}


 \noindent{\bf Details}
 \smallskip

 Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is
 a stratum number and $(x, y)$ are two numerical covariates.  We want to analyze
 conditional linear relationship between $y$ and $x$ conditioned by~$i$.
 Note that $x$, but not~$y$, may represent a categorical variable if we assign a
 numerical value to each category, for example 0 and 1 for two categories.

 We assume a linear regression model for~$y$:
 \begin{equation}
 y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\,
 \eps_{i,j} \sim \Normal(0, \sigma^2)
 \label{eqn:stratlinmodel}
 \end{equation}
 Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number
 in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$.
 The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata.
 When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in
 stratum~$i$ as
 \begin{equation*}
 \bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad
 \bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i
 \end{equation*}
 If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$,
 which gives the prediction error sum-of-squares of
 \begin{equation}
 \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2
 \,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y
 \label{eqn:stratsumsq}
 \end{equation}
 where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample
 estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as
 \begin{align*}
 V_x     \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad
 V_y     \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\
 V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big)
 \end{align*}
 They are stratified because we compute the sample (co-)variances in each stratum~$i$
 separately, then combine by summation.  The stratified estimates for $\Var(X)$ and $\Var(Y)$
 tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$
 and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$
 than the global means.  The stratified variance estimates the uncertainty in $x_{i,j}$
 and~$y_{i,j}$ given their stratum~$i$.

 Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq})
 gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$},
 with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS):
 \begin{equation*}
 \mathrm{RSS} \,\,=\, \,
 \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} -
 \hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2
 \,\,=\,\,  V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big)
 \end{equation*}
 The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction
 of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear
 regression model~(\ref{eqn:stratlinmodel}).  We define \emph{stratified correlation} as the
 square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$.  We also use RSS to estimate
 the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error
 of $y_{i,j}$ given $x_{i,j}$ and the stratum:
 \begin{equation*}
 \hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}};
 \,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y};
 \,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\,
 \Big(n = \sum_{i=1}^k n_i\Big)
 \end{equation*}

 The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are
 obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares,
 measured by the decrease from $V_y$ to~RSS.
 The $F$-statistic is the ratio of the ``explained'' sum-of-squares
 to the residual sum-of-squares, divided by their corresponding degrees of freedom.
 There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that
 to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree
 of freedom:
 \begin{equation*}
 F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)}
 \,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad
 t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}.
 \end{equation*}
 The $t$-statistic is simply the square root of the $F$-statistic with the appropriate
 choice of sign.  If the null hypothesis and the linear model are both true, the $t$-statistic
 has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom.  We can
 also compute it if we divide $\hat{\beta}$ by its estimated standard deviation:
 \begin{equation*}
 \stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad
 t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \beta \,/\, \stdev(\hat{\beta})_{\mathrm{est}}
 \end{equation*}
 The standard deviation estimate for~$\beta$ is included in {\tt stratstats.dml} output.

 \smallskip
 \noindent{\bf Returns}
 \smallskip

 The output matrix format is defined in Table~\ref{table:stratoutput}.

 \smallskip
 \noindent{\bf Examples}
 \smallskip

 {\hangindent=\parindent\noindent\tt
 \hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx
   Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2
   O=/user/biadmin/Out.mtx fmt=csv

 }
 {\hangindent=\parindent\noindent\tt
 \hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx
   Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx

 }

 %\smallskip
 %\noindent{\bf See Also}
 %\smallskip
 %
 %For non-stratified bivariate statistics with a wider variety of input data types
 %and statistical tests, see \ldots.  For general linear regression, see
 %{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}.  For logistic regression, appropriate
 %when the response variable is categorical, see {\tt MultiLogReg.dml}.
	\begin{comment}

	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.

	\end{comment}

	\subsection{Stratified Bivariate Statistics}

	\noindent{\bf Description}
	\smallskip

	The {\tt stratstats.dml} script computes common bivariate statistics, such
	as correlation, slope, and their p-value, in parallel for many pairs of input
	variables in the presence of a confounding categorical variable. The values
	of this confounding variable group the records into strata (subpopulations),
	in which all bivariate pairs are assumed free of confounding. The script
	uses the same data model as in one-way analysis of covariance (ANCOVA), with
	strata representing population samples. It also outputs univariate stratified
	and bivariate unstratified statistics.

	\begin{table}[t]\hfil
	\begin{tabular}{\|l\|ll\|ll\|ll\|\|ll\|}
	\hline
	Month of the year & \multicolumn{2}{l\|}{October} & \multicolumn{2}{l\|}{November} &
	\multicolumn{2}{l\|\|}{December} & \multicolumn{2}{c\|}{Oct$\,$--$\,$Dec} \\
	Customers, millions & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\
	\hline
	Promotion (0 or 1) & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
	Avg.\ sales per 1000 & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\
	\hline
	\end{tabular}\hfil
	\caption{Stratification example: the effect of the promotion on average sales
	becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.}
	\label{table:stratexample}
	\end{table}

	To see how data stratification mitigates confounding, consider an (artificial)
	example in Table~\ref{table:stratexample}. A highly seasonal retail item
	was marketed with and without a promotion over the final 3~months of the year.
	In each month the sale was more likely with the promotion than without it.
	But during the peak holiday season, when shoppers came in greater numbers and
	bought the item more often, the promotion was less frequently used. As a result,
	if the 4-th quarter data is pooled together, the promotion's effect becomes
	reversed and magnified. Stratifying by month restores the positive correlation.

	The script computes its statistics in parallel over all possible pairs from two
	specified sets of covariates. The 1-st covariate is a column in input matrix~$X$
	and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may
	be the same or different. The columns of interest are given by their index numbers
	in special matrices. The stratum column, specified in its own matrix, is the same
	for all covariate pairs.

	Both covariates in each pair must be numerical, with the 2-nd covariate normally
	distributed given the 1-st covariate (see~Details). Missing covariate values or
	strata are represented by~``NaN''. Records with NaN's are selectively omitted
	wherever their NaN's are material to the output statistic.

	\smallskip
	\pagebreak[3]

	\noindent{\bf Usage}
	\smallskip

	{\hangindent=\parindent\noindent\it%
	{\tt{}-f }path/\/{\tt{}stratstats.dml}
	{\tt{} -nvargs}
	{\tt{} X=}path/file
	{\tt{} Xcid=}path/file
	{\tt{} Y=}path/file
	{\tt{} Ycid=}path/file
	{\tt{} S=}path/file
	{\tt{} Scid=}int
	{\tt{} O=}path/file
	{\tt{} fmt=}format

	}


	\smallskip
	\noindent{\bf Arguments}
	\begin{Description}
	\item[{\tt X}:]
	Location (on HDFS) to read matrix $X$ whose columns we want to use as
	the 1-st covariate (i.e.~as the feature variable)
	\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "})
	Location to read the single-row matrix that lists all index numbers
	of the $X$-columns used as the 1-st covariate; the default value means
	``use all $X$-columns''
	\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
	Location to read matrix $Y$ whose columns we want to use as the 2-nd
	covariate (i.e.~as the response variable); the default value means
	``use $X$ in place of~$Y$''
	\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "})
	Location to read the single-row matrix that lists all index numbers
	of the $Y$-columns used as the 2-nd covariate; the default value means
	``use all $Y$-columns''
	\item[{\tt S}:] (default:\mbox{ }{\tt " "})
	Location to read matrix $S$ that has the stratum column.
	Note: the stratum column must contain small positive integers; all fractional
	values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as
	missing. The default value for {\tt S} means ``use $X$ in place of~$S$''
	\item[{\tt Scid}:] (default:\mbox{ }{\tt 1})
	The index number of the stratum column in~$S$
	\item[{\tt O}:]
	Location to store the output matrix defined in Table~\ref{table:stratoutput}
	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
	see read/write functions in SystemML Language Reference for details.
	\end{Description}


	\begin{table}[t]\small\hfil
	\begin{tabular}{\|rcl\|rcl\|}
	\hline
	& Col.\# & Meaning & & Col.\# & Meaning \\
	\hline
	\multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em}
	& 01 & $X$-column number &
	\multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em}
	& 11 & $Y$-column number \\
	& 02 & presence count for $x$ &
	& 12 & presence count for $y$ \\
	& 03 & global mean $(x)$ &
	& 13 & global mean $(y)$ \\
	& 04 & global std.\ dev. $(x)$ &
	& 14 & global std.\ dev. $(y)$ \\
	& 05 & stratified std.\ dev. $(x)$ &
	& 15 & stratified std.\ dev. $(y)$ \\
	& 06 & $R^2$ for $x \sim {}$strata &
	& 16 & $R^2$ for $y \sim {}$strata \\
	& 07 & adjusted $R^2$ for $x \sim {}$strata &
	& 17 & adjusted $R^2$ for $y \sim {}$strata \\
	& 08 & p-value, $x \sim {}$strata &
	& 18 & p-value, $y \sim {}$strata \\
	& 09--10 & reserved &
	& 19--20 & reserved \\
	\hline
	\multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em}
	& 21 & presence count $(x, y)$ &
	\multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em}
	& 31 & presence count $(x, y, s)$ \\
	& 22 & regression slope &
	& 32 & regression slope \\
	& 23 & regres.\ slope std.\ dev. &
	& 33 & regres.\ slope std.\ dev. \\
	& 24 & correlation${} = \pm\sqrt{R^2}$ &
	& 34 & correlation${} = \pm\sqrt{R^2}$ \\
	& 25 & residual std.\ dev. &
	& 35 & residual std.\ dev. \\
	& 26 & $R^2$ in $y$ due to $x$ &
	& 36 & $R^2$ in $y$ due to $x$ \\
	& 27 & adjusted $R^2$ in $y$ due to $x$ &
	& 37 & adjusted $R^2$ in $y$ due to $x$ \\
	& 28 & p-value for ``slope = 0'' &
	& 38 & p-value for ``slope = 0'' \\
	& 29 & reserved &
	& 39 & \# strata with ${\geq}\,2$ count \\
	& 30 & reserved &
	& 40 & reserved \\
	\hline
	\end{tabular}\hfil
	\caption{The {\tt stratstats.dml} output matrix has one row per each distinct
	pair of 1-st and 2-nd covariates, and 40 columns with the statistics described
	here.}
	\label{table:stratoutput}
	\end{table}




	\noindent{\bf Details}
	\smallskip

	Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is
	a stratum number and $(x, y)$ are two numerical covariates. We want to analyze
	conditional linear relationship between $y$ and $x$ conditioned by~$i$.
	Note that $x$, but not~$y$, may represent a categorical variable if we assign a
	numerical value to each category, for example 0 and 1 for two categories.

	We assume a linear regression model for~$y$:
	\begin{equation}
	y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\,
	\eps_{i,j} \sim \Normal(0, \sigma^2)
	\label{eqn:stratlinmodel}
	\end{equation}
	Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number
	in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$.
	The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata.
	When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in
	stratum~$i$ as
	\begin{equation*}
	\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad
	\bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i
	\end{equation*}
	If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$,
	which gives the prediction error sum-of-squares of
	\begin{equation}
	\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2
	\,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y
	\label{eqn:stratsumsq}
	\end{equation}
	where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample
	estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as
	\begin{align*}
	V_x \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad
	V_y \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\
	V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big)
	\end{align*}
	They are stratified because we compute the sample (co-)variances in each stratum~$i$
	separately, then combine by summation. The stratified estimates for $\Var(X)$ and $\Var(Y)$
	tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$
	and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$
	than the global means. The stratified variance estimates the uncertainty in $x_{i,j}$
	and~$y_{i,j}$ given their stratum~$i$.

	Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq})
	gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$},
	with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS):
	\begin{equation*}
	\mathrm{RSS} \,\,=\, \,
	\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} -
	\hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2
	\,\,=\,\, V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big)
	\end{equation*}
	The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction
	of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear
	regression model~(\ref{eqn:stratlinmodel}). We define \emph{stratified correlation} as the
	square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$. We also use RSS to estimate
	the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error
	of $y_{i,j}$ given $x_{i,j}$ and the stratum:
	\begin{equation*}
	\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}};
	\,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y};
	\,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\,
	\Big(n = \sum_{i=1}^k n_i\Big)
	\end{equation*}

	The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are
	obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares,
	measured by the decrease from $V_y$ to~RSS.
	The $F$-statistic is the ratio of the ``explained'' sum-of-squares
	to the residual sum-of-squares, divided by their corresponding degrees of freedom.
	There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that
	to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree
	of freedom:
	\begin{equation*}
	F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)}
	\,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad
	t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}.
	\end{equation*}
	The $t$-statistic is simply the square root of the $F$-statistic with the appropriate
	choice of sign. If the null hypothesis and the linear model are both true, the $t$-statistic
	has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom. We can
	also compute it if we divide $\hat{\beta}$ by its estimated standard deviation:
	\begin{equation*}
	\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad
	t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \beta \,/\, \stdev(\hat{\beta})_{\mathrm{est}}
	\end{equation*}
	The standard deviation estimate for~$\beta$ is included in {\tt stratstats.dml} output.

	\smallskip
	\noindent{\bf Returns}
	\smallskip

	The output matrix format is defined in Table~\ref{table:stratoutput}.

	\smallskip
	\noindent{\bf Examples}
	\smallskip

	{\hangindent=\parindent\noindent\tt
	\hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx
	Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2
	O=/user/biadmin/Out.mtx fmt=csv

	}
	{\hangindent=\parindent\noindent\tt
	\hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx
	Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx

	}

	%\smallskip
	%\noindent{\bf See Also}
	%\smallskip
	%
	%For non-stratified bivariate statistics with a wider variety of input data types
	%and statistical tests, see \ldots. For general linear regression, see
	%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}. For logistic regression, appropriate
	%when the response variable is categorical, see {\tt MultiLogReg.dml}.