
 \begin{comment}

  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.

 \end{comment}

 \subsection{Bivariate Statistics}

 \noindent{\bf Description}
 \smallskip

 Bivariate statistics quantitatively describe the association between two
 features in a sample, for example by testing their statistical (in-)dependence
 or by measuring how accurately one feature predicts the other.
 The \BivarScriptName{} script computes common bivariate statistics,
 such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs
 of data features.  For a given dataset matrix, script \BivarScriptName{} computes
 certain bivariate statistics for the given feature (column) pairs in the
 matrix.  The feature types govern the exact set of statistics computed for that pair.
 For example, \NameStatR{} can only be computed for two quantitative (scale)
 features such as `Height' and `Temperature'; linear correlation is not
 meaningful for categorical features such as `Hair Color'.


 \smallskip
 \noindent{\bf Usage}
 \smallskip

 {\hangindent=\parindent\noindent\it%\tolerance=0
 {\tt{}-f }path/\/\BivarScriptName{}
 {\tt{} -nvargs}
 {\tt{} X=}path/file
 {\tt{} index1=}path/file
 {\tt{} index2=}path/file
 {\tt{} types1=}path/file
 {\tt{} types2=}path/file
 {\tt{} OUTDIR=}path
 % {\tt{} fmt=}format

 }


 \smallskip
 \noindent{\bf Arguments}
 \begin{Description}
 \item[{\tt X}:]
 Location (on HDFS) to read the data matrix $X$ whose columns are the features
 that we want to compare and correlate with bivariate statistics.
 \item[{\tt index1}:] % (default:\mbox{ }{\tt " "})
 Location (on HDFS) to read the single-row matrix that lists the column indices
 of the \emph{first-argument} features in pairwise statistics.
 Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the
 index $k$ of column \texttt{X[,$\,k$]} in the data matrix
 whose bivariate statistics need to be computed.
 % The default value means ``use all $X$-columns from the first to the last.''
 \item[{\tt index2}:] % (default:\mbox{ }{\tt " "})
 Location (on HDFS) to read the single-row matrix that lists the column indices
 of the \emph{second-argument} features in pairwise statistics.
 Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the
 index $l$ of column \texttt{X[,$\,l$]} in the data matrix
 whose bivariate statistics need to be computed.
 % The default value means ``use all $X$-columns from the first to the last.''
 \item[{\tt types1}:] % (default:\mbox{ }{\tt " "})
 Location (on HDFS) to read the single-row matrix that lists the \emph{types}
 of the \emph{first-argument} features in pairwise statistics.
 Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type
 of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$
 entry in the {\tt index1} matrix.  Feature types must be encoded by
 integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
 % The default value means ``treat all referenced $X$-columns as scale.''
 \item[{\tt types2}:] % (default:\mbox{ }{\tt " "})
 Location (on HDFS) to read the single-row matrix that lists the \emph{types}
 of the \emph{second-argument} features in pairwise statistics.
 Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type
 of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$
 entry in the {\tt index2} matrix.  Feature types must be encoded by
 integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
 % The default value means ``treat all referenced $X$-columns as scale.''
 \item[{\tt OUTDIR}:]
 Location path (on HDFS) where the output matrices with computed bivariate
 statistics will be stored.  The matrices' file names and format are defined
 in Table~\ref{table:bivars}.
 % \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
 % Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
 % see read/write functions in SystemML Language Reference for details.
 \end{Description}

 \begin{table}[t]\hfil
 \begin{tabular}{|lll|}
 \hline\rule{0pt}{12pt}%
 Output File / Matrix        & Row$\,$\# & Name of Statistic   \\[2pt]
 \hline\hline\rule{0pt}{12pt}%
 \emph{All Files}            &     1     & 1-st feature column \\
 \rule{1em}{0pt}"            &     2     & 2-nd feature column \\[2pt]
 \hline\rule{0pt}{12pt}%
 bivar.scale.scale.stats     &     3     & \NameStatR          \\[2pt]
 \hline\rule{0pt}{12pt}%
 bivar.nominal.nominal.stats &     3     & \NameStatChi        \\
 \rule{1em}{0pt}"            &     4     & Degrees of freedom  \\
 \rule{1em}{0pt}"            &     5     & \NameStatPChi       \\
 \rule{1em}{0pt}"            &     6     & \NameStatV          \\[2pt]
 \hline\rule{0pt}{12pt}%
 bivar.nominal.scale.stats   &     3     & \NameStatEta        \\
 \rule{1em}{0pt}"            &     4     & \NameStatF          \\[2pt]
 \hline\rule{0pt}{12pt}%
 bivar.ordinal.ordinal.stats &     3     & \NameStatRho        \\[2pt]
 \hline
 \end{tabular}\hfil
 \caption{%
 The output matrices of \BivarScriptName{} have one row per bivariate
 statistic and one column per pair of input features.  This table lists
 the meaning of each matrix and each row.%
 % Signs ``+'' show applicability to scale or/and to categorical features.
 }
 \label{table:bivars}
 \end{table}


 \pagebreak[2]

 \noindent{\bf Details}
 \smallskip

 Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent
 the features and whose rows represent the records of a data sample.
 Given \texttt{X}, the script computes certain relevant bivariate statistics
 for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}.
 Command-line parameters \texttt{index1} and \texttt{index2} specify the files with
 column pairs of interest to the user.  Namely, the file given by \texttt{index1}
 contains the vector of the 1st-attribute column indices and the file given
 by \texttt{index2} has the vector of the 2nd-attribute column indices, with
 ``1st'' and ``2nd'' referring to their places in bivariate statistics.
 Note that both \texttt{index1} and \texttt{index2} files should contain a 1-row matrix
 of positive integers.

 The bivariate statistics to be computed depend on the \emph{types}, or
 \emph{measurement levels}, of the two columns.
 The types for each pair are provided in the files whose locations are specified by
 \texttt{types1} and \texttt{types2} command-line parameters.
 These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and
 the 2nd-attribute column types in the same order as their indices in the
 \texttt{index1} and \texttt{index2} files.  The types must be provided as per
 the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
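
 As an illustration with hypothetical values (not taken from the script's
 test data), suppose \texttt{X} has four columns and the input files contain
 \begin{equation*}
 \texttt{index1} = (1,\, 3), \quad \texttt{types1} = (1,\, 2), \qquad
 \texttt{index2} = (2,\, 4), \quad \texttt{types2} = (1,\, 2)
 \end{equation*}
 Then \texttt{X[,$\,1$]} and \texttt{X[,$\,2$]} are declared as scale features,
 while \texttt{X[,$\,3$]} and \texttt{X[,$\,4$]} are declared as nominal features.
 Assuming that every first-argument feature is paired with every second-argument
 feature, the pair $(1, 2)$ would be scale-vs-scale, the pair $(3, 4)$ would be
 nominal-vs-nominal, and the mixed pairs $(1, 4)$ and $(3, 2)$ would each combine
 one scale and one nominal feature.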

 The script organizes its results into (potentially) four output matrices, one per
 type combination.  The types of bivariate statistics are defined using the types
 of the columns that were used for their arguments, with ``ordinal'' sometimes
 retrogressing to ``nominal.''  Table~\ref{table:bivars} describes what each row
 in each output matrix contains.  In particular, the script includes the following
 statistics:
 \begin{Itemize}
 \item For a pair of scale (quantitative) columns, \NameStatR;
 \item For a pair of nominal columns (with finite-sized, fixed, unordered domains),
 the \NameStatChi{} and its p-value;
 \item For a pair of one scale column and one nominal column, \NameStatF{};
 \item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho.
 \end{Itemize}
 Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the
 column indices of the features involved in each statistic.
 Moreover, if the output matrix does not contain
 a value in a certain cell then it should be interpreted as a~$0$
 (sparse matrix representation).

 Below we list all bivariate statistics computed by script \BivarScriptName.
 The statistics are collected into several groups by the type of their input
 features.  We refer to the two input features as $v_1$ and $v_2$ unless
 specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$,
 where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size.


 \paragraph{Scale-vs-scale statistics.}
 Sample statistics that describe association between two quantitative (scale) features.
 A scale feature has numerical values, with the natural ordering relation.
 \begin{Description}
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it\NameStatR]:
 A measure of linear dependence between two numerical features:
 \begin{equation*}
 r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}}
 \,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}%
 {\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}}
 \end{equation*}
 Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value
 pairs $(v_{1,i}, v_{2,i})$ lie on the same line.  Correlation near~0 means that a line is not a good
 way to represent the dependence between the two features; however, this does not imply independence.
 The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to
 linearly increase (decrease) when the other feature increases.  Nonlinear association, if present,
 may disobey this sign.
 \NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$
 to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$.

 Suppose that we use simple linear regression to represent one feature given the other, say
 represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$
 to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$.
 Then the best error equals
 \begin{equation*}
 \min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\,
 (1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2
 \end{equation*}
 In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to
 the total sum of squares.  Hence, $r^2$ is an accuracy measure of the linear regression.
 \end{Description}
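
 As a small worked example with made-up numbers, take $n = 5$ value pairs with
 $v_1 = (1, 2, 3, 4, 5)$ and $v_2 = (2, 4, 5, 4, 5)$, so that $\bar{v}_1 = 3$ and
 $\bar{v}_2 = 4$.  The centered values are $(-2, -1, 0, 1, 2)$ and $(-2, 0, 1, 0, 1)$, hence
 \begin{equation*}
 r \,\,=\,\, \frac{4 + 0 + 0 + 0 + 2}{\sqrt{10 \cdot 6}} \,\,=\,\, \frac{6}{\sqrt{60}} \,\,\approx\,\, 0.775,
 \qquad r^2 = 0.6
 \end{equation*}
 The least-squares line for these numbers is $v_2 \approx 2.2 + 0.6\, v_1$, with residuals
 $(-0.8,\, 0.6,\, 1.0,\, -0.6,\, -0.2)$ whose squares sum to $2.4 = (1 - 0.6) \cdot 6$,
 that is, $(1 - r^2)$ times the total sum of squares of~$v_2$.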


 \paragraph{Nominal-vs-nominal statistics.}
 Sample statistics that describe association between two nominal categorical features.
 Both features' value domains are encoded with positive integers in arbitrary order:
 nominal features do not order their value domains.
 \begin{Description}
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it\NameStatChi]:
 A measure of how much the frequencies of value pairs of two categorical features deviate from
 statistical independence.  Under independence, the probability of every value pair must equal
 the product of probabilities of each value in the pair:
 $\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$.  But we do not know these (hypothesized) probabilities;
 we only know the sample frequency counts.  Let $n_{a,b}$ be the frequency count of pair
 $(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone.  Under
 independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due
 to sample randomness, yet it is unlikely to be too far from~0.  For some pairs $(a,b)$ it may
 deviate from~0 farther than for other pairs.  \NameStatChi{}~is an aggregate measure that
 combines squares of these differences across all value pairs:
 \begin{equation*}
 \chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2
 \,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}
 \end{equation*}
 where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are
 the \emph{expected} frequencies for all pairs~$(a,b)$.  Under independence (plus other standard
 assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for
 statistical tests for independence, see~\emph{\NameStatPChi} for details.  Note that \NameStatChi{}
 does \emph{not} measure the strength of dependence: even very weak dependence may result in a
 significant deviation from independence if the counts are large enough.  Use~\NameStatV{} instead
 to measure the strength of dependence.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Degrees of freedom]:
 An integer parameter required for the interpretation of~\NameStatChi{} measure.  Under independence
 (plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the
 sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is
 this integer parameter.  For a pair of categorical features such that the $1^{\textrm{st}}$~feature
 has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees
 of freedom is $d = (k_1 - 1)(k_2 - 1)$.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it\NameStatPChi]:
 A measure of how likely we would observe the current frequencies of value pairs of two categorical
 features assuming their statistical independence.  More precisely, it computes the probability that
 the sum of $d$~squares of independent normal random variables with mean~0 and variance~1
 (called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large
 as the current sample \NameStatChi.  The $d$ parameter is \emph{degrees of freedom}, see above.
 Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the
 $\chi^2$~distribution and is unlikely to land very far into its tail.  On the other hand, if the
 two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$
 and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample.
 \NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi:
 \begin{equation*}
 P\,\,=\,\, \Prob\big[\xi \geq \textrm{\NameStatChi} \,\,\big|\,\, \xi \sim \textrm{the $\chi^2$ distribution}\big]
 \end{equation*}
 Like any probability, $P$ ranges between 0 and~1.  If $P\leq 0.05$, the dependence between the two
 features may be considered statistically significant (i.e.\ their independence is considered
 statistically ruled out).  For highly dependent features, it is not unusual to have $P\leq 10^{-20}$,
 in which case our script will simply return $P = 0$.  Independent features should have
 their $P\geq 0.05$ in about 95\% of cases.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it\NameStatV]:
 A measure for the strength of association, i.e.\ of statistical dependence, between two categorical
 features, conceptually similar to \NameStatR.  It divides the observed~\NameStatChi{} by the maximum
 possible~$\chi^2_{\textrm{max}}$ given $n$ and the number $k_1, k_2$~of categories in each feature,
 then takes the square root.  Thus, \NameStatV{} ranges from 0 to~1,
 where 0 implies no association and 1 implies the maximum possible association (one-to-one
 correspondence) between the two features.  See \emph{\NameStatChi} for the computation of~$\chi^2$;
 its maximum${} = {}$%
 $n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature
 has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV},
 so
 \begin{equation*}
 \textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}}
 \end{equation*}
 As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases,
 \NameStatV{} goes towards~1 (slowly) as the dependence increases.  Both \NameStatChi{} and
 \NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the
 ratio.
 \end{Description}
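
 As a worked example with made-up counts, consider two binary features ($k_1 = k_2 = 2$)
 observed on $n = 100$ records with pair counts $n_{1,1} = 30$, $n_{1,2} = 20$,
 $n_{2,1} = 10$, $n_{2,2} = 40$.  The first feature's value counts are $(50, 50)$ and the
 second feature's are $(40, 60)$, so the expected counts are $E_{1,1} = E_{2,1} = 20$ and
 $E_{1,2} = E_{2,2} = 30$, and every difference $O_{a,b} - E_{a,b}$ equals ${\pm}10$:
 \begin{equation*}
 \chi^2 = \frac{10^2}{20} + \frac{10^2}{30} + \frac{10^2}{20} + \frac{10^2}{30} \approx 16.67,
 \qquad d = (2 - 1)(2 - 1) = 1, \qquad
 \textrm{\NameStatV} = \sqrt{\frac{16.67}{100 \cdot 1}} \approx 0.41
 \end{equation*}
 The corresponding tail probability of the $\chi^2$~distribution with 1~degree of freedom
 is $P \approx 4.5 \cdot 10^{-5}$, so independence would be rejected at the $0.05$ level even
 though the strength of association is only moderate.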


 \paragraph{Nominal-vs-scale statistics.}
 Sample statistics that describe association between a categorical feature
 (order ignored) and a quantitative (scale) feature.
 The values of the categorical feature must be coded as positive integers.
 \begin{Description}
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it\NameStatEta]:
 A measure for the strength of association (statistical dependence) between a nominal feature
 and a scale feature, conceptually similar to \NameStatR.  Ranges from 0 to~1, approaching 0
 when there is no association and approaching 1 when there is a strong association.
 The nominal feature, treated as the independent variable, is assumed to have relatively few
 possible values, all with large frequency counts.  The scale feature is treated as the dependent
 variable.  Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have:
 \begin{equation*}
 \eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
 \,\,\,\,\textrm{where}\,\,\,\,
 \hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n
 \,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\!
 \end{equation*}
 and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean.  Value $\hat{y}[x]$ is the average
 of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor''
 of $y$ given~$x$.  Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error
 sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$.
 Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the
 ``R-squared'' statistic measures the accuracy of linear regression.  Our output $\eta$
 is the square root of~$\eta^2$.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it\NameStatF]:
 A measure of how much the values of the scale feature, denoted here by~$y$,
 deviate from statistical independence of the nominal feature, denoted by~$x$.
 The same measure appears in the one-way analysis of vari\-ance (ANOVA).
 Like \NameStatChi, \NameStatF{} is used to test the hypothesis that
 $y$~is independent of~$x$, given the following assumptions:
 \begin{Itemize}
 \item The scale feature $y$ has an approximately normal distribution whose mean
 may depend only on~$x$ and whose variance is the same for all~$x$;
 \item The nominal feature $x$ has a relatively small value domain with large
 frequency counts, and the $x_i$-values are treated as fixed (non-random);
 \item All records are sampled independently of each other.
 \end{Itemize}
 To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$
 among all records where $x_i = x$.  These $\hat{y}[x]$ can be viewed as
 ``predictors'' of $y$ given~$x$; if $y$ is independent of~$x$, they should
 ``predict'' only the global mean~$\bar{y}$.  Then we form two sums-of-squares:
 \begin{Itemize}
 \item \emph{Residual} sum-of-squares of the ``predictor'' accuracy: $y_i - \hat{y}[x_i]$;
 \item \emph{Explained} sum-of-squares of the ``predictor'' variability: $\hat{y}[x_i] - \bar{y}$.
 \end{Itemize}
 \NameStatF{} is the ratio of the explained sum-of-squares to
 the residual sum-of-squares, each divided by their corresponding degrees
 of freedom:
 \begin{equation*}
 F \,\,=\,\,
 \frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}%
 {\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\,
 \frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2}
 \end{equation*}
 Here $k$ is the domain size of the nominal feature~$x$.  The $k$ ``predictors''
 lose 1~degree of freedom due to their linear dependence with~$\bar{y}$; similarly,
 the $n$~values $y_i$ lose $k$~degrees of freedom due to the ``predictors''.

 The statistic can test whether the hypothesis that $y$ is independent of $x$ is reasonable;
 more generally (with relaxed normality assumptions) it can test the hypothesis that
 \emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$.
 Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution.
 But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{}
 becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far
 into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample.
 \end{Description}
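
 As a worked example with made-up numbers, let $n = 6$ records have
 $x = (1, 1, 1, 2, 2, 2)$ and $y = (1, 2, 3, 4, 5, 6)$, so that $k = 2$, $\hat{y}[1] = 2$,
 $\hat{y}[2] = 5$, and $\bar{y} = 3.5$.  The residual sum-of-squares is $4$, the explained
 sum-of-squares is $3 \cdot 1.5^2 + 3 \cdot 1.5^2 = 13.5$, and the total sum-of-squares is $17.5$:
 \begin{equation*}
 \eta^2 = 1 - \frac{4}{17.5} \approx 0.771, \qquad \eta \approx 0.878, \qquad
 F = \frac{13.5 \,/\, (2 - 1)}{4 \,/\, (6 - 2)} = 13.5
 = \frac{6 - 2}{2 - 1} \cdot \frac{\eta^2}{1 - \eta^2}
 \end{equation*}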


 \paragraph{Ordinal-vs-ordinal statistics.}
 Sample statistics that describe association between two ordinal categorical features.
 Both features' value domains are encoded with positive integers, so that the natural
 order of the integers coincides with the order in each value domain.
 \begin{Description}
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it\NameStatRho]:
 A measure for the strength of association (statistical dependence) between
 two ordinal features, conceptually similar to \NameStatR.  Specifically, it is \NameStatR{}
 applied to the feature vectors in which all values are replaced by their ranks, i.e.\
 their positions if the vector is sorted.  The ranks of identical (duplicate) values
 are replaced with their average rank.  For example, in vector
 $(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted
 order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average
 rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$.

 Our implementation of \NameStatRho{} is geared towards features having small value domains
 and large counts for the values.  Given the two input vectors, we form a contingency table $T$
 of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$
 and~$f_2$.  Here in $T_{i,j}$, $f_{1,i}$, $f_{2,j}$ indices $i$ and~$j$ refer to the
 order-preserving integer encoding of the feature values.
 We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks:
 $r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$.
 Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their
 covariance weighted by~$T$, before applying the standard formula for \NameStatR:
 \begin{equation*}
 \rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}}
 \,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}%
 {\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}}
 \end{equation*}
 where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, analogously for~$\bar{r}_2$.
 The value of $\rho$ lies between $-1$ and $+1$, with sign indicating the prevalent direction
 of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease)
 when the other feature increases.  The correlation reaches $+1$ ($-1$) when one feature is a
 strictly increasing (decreasing) function of the other.
 \end{Description}
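
 As a worked example with made-up counts, let the first feature take values in $\{1, 2, 3\}$
 and the second in $\{1, 2\}$, with $n = 6$ and contingency table rows
 $T_{1,\cdot} = (2, 0)$, $T_{2,\cdot} = (1, 1)$, $T_{3,\cdot} = (0, 2)$.  Then $f_1 = (2, 2, 2)$ and
 $f_2 = (3, 3)$, the average ranks are $r_1 = (1.5,\, 3.5,\, 5.5)$ and $r_2 = (2,\, 5)$, and
 $\bar{r}_1 = \bar{r}_2 = 3.5$.  The terms with $i = 2$ vanish because $r_{1,2} = \bar{r}_1$, which gives
 \begin{equation*}
 \rho \,\,=\,\, \frac{2 \cdot (-2)(-1.5) + 2 \cdot (2)(1.5)}%
 {\sqrt{(2 \cdot 4 + 2 \cdot 4) \cdot (3 \cdot 2.25 + 3 \cdot 2.25)}}
 \,\,=\,\, \frac{12}{\sqrt{16 \cdot 13.5}} \,\,\approx\,\, 0.816
 \end{equation*}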


 \smallskip
 \noindent{\bf Returns}
 \smallskip

 A collection of (potentially) 4 matrices.  Each matrix contains bivariate statistics that
 resulted from a different combination of feature types.  There is one matrix for scale-scale
 statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}),
 one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics
 (includes \NameStatRho).  If any of these matrices is not produced, then no pair of columns required
 the corresponding type combination.  See Table~\ref{table:bivars} for the matrix naming and
 format details.


 \smallskip
 \pagebreak[2]

 \noindent{\bf Examples}
 \smallskip

 {\hangindent=\parindent\noindent\tt
 \hml -f \BivarScriptName{} -nvargs
 X=/user/biadmin/X.mtx
 index1=/user/biadmin/S1.mtx
 index2=/user/biadmin/S2.mtx
 types1=/user/biadmin/K1.mtx
 types2=/user/biadmin/K2.mtx
 OUTDIR=/user/biadmin/stats.mtx

 }