\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Bivariate Statistics}
\noindent{\bf Description}
\smallskip
Bivariate statistics quantitatively describe the association between
two features in a sample, for example by testing their statistical (in-)dependence
or by measuring how accurately one feature predicts the other.
The \BivarScriptName{} script computes common bivariate statistics,
such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs
of data features. Given a data matrix, the script computes
bivariate statistics for the specified feature (column) pairs in the
matrix. The types of the two features govern the exact set of statistics computed for each pair.
For example, \NameStatR{} can only be computed on two quantitative (scale)
features like `Height' and `Temperature'.
It does not make sense to compute the linear correlation of categorical attributes
such as `Hair Color'.
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%\tolerance=0
{\tt{}-f }path/\/\BivarScriptName{}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} index1=}path/file
{\tt{} index2=}path/file
{\tt{} types1=}path/file
{\tt{} types2=}path/file
{\tt{} OUTDIR=}path
% {\tt{} fmt=}format
}
\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the data matrix $X$ whose columns are the features
that we want to compare and correlate with bivariate statistics.
\item[{\tt index1}:] % (default:\mbox{ }{\tt " "})
Location (on HDFS) to read the single-row matrix that lists the column indices
of the \emph{first-argument} features in pairwise statistics.
Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the
index $k$ of column \texttt{X[,$\,k$]} in the data matrix
whose bivariate statistics need to be computed.
% The default value means ``use all $X$-columns from the first to the last.''
\item[{\tt index2}:] % (default:\mbox{ }{\tt " "})
Location (on HDFS) to read the single-row matrix that lists the column indices
of the \emph{second-argument} features in pairwise statistics.
Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the
index $l$ of column \texttt{X[,$\,l$]} in the data matrix
whose bivariate statistics need to be computed.
% The default value means ``use all $X$-columns from the first to the last.''
\item[{\tt types1}:] % (default:\mbox{ }{\tt " "})
Location (on HDFS) to read the single-row matrix that lists the \emph{types}
of the \emph{first-argument} features in pairwise statistics.
Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type
of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$
entry in the {\tt index1} matrix. Feature types must be encoded by
integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
% The default value means ``treat all referenced $X$-columns as scale.''
\item[{\tt types2}:] % (default:\mbox{ }{\tt " "})
Location (on HDFS) to read the single-row matrix that lists the \emph{types}
of the \emph{second-argument} features in pairwise statistics.
Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type
of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$
entry in the {\tt index2} matrix. Feature types must be encoded by
integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
% The default value means ``treat all referenced $X$-columns as scale.''
\item[{\tt OUTDIR}:]
Location path (on HDFS) where the output matrices with computed bivariate
statistics will be stored. The matrices' file names and format are defined
in Table~\ref{table:bivars}.
% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
% see read/write functions in SystemML Language Reference for details.
\end{Description}
\begin{table}[t]\hfil
\begin{tabular}{|lll|}
\hline\rule{0pt}{12pt}%
Output File / Matrix & Row$\,$\# & Name of Statistic \\[2pt]
\hline\hline\rule{0pt}{12pt}%
\emph{All Files} & 1 & Column index of the 1st feature \\
\rule{1em}{0pt}" & 2 & Column index of the 2nd feature \\[2pt]
\hline\rule{0pt}{12pt}%
bivar.scale.scale.stats & 3 & \NameStatR \\[2pt]
\hline\rule{0pt}{12pt}%
bivar.nominal.nominal.stats & 3 & \NameStatChi \\
\rule{1em}{0pt}" & 4 & Degrees of freedom \\
\rule{1em}{0pt}" & 5 & \NameStatPChi \\
\rule{1em}{0pt}" & 6 & \NameStatV \\[2pt]
\hline\rule{0pt}{12pt}%
bivar.nominal.scale.stats & 3 & \NameStatEta \\
\rule{1em}{0pt}" & 4 & \NameStatF \\[2pt]
\hline\rule{0pt}{12pt}%
bivar.ordinal.ordinal.stats & 3 & \NameStatRho \\[2pt]
\hline
\end{tabular}\hfil
\caption{%
The output matrices of \BivarScriptName{} have one row per bivariate
statistic and one column per pair of input features. This table lists
the meaning of each matrix and each row.%
% Signs ``+'' show applicability to scale or/and to categorical features.
}
\label{table:bivars}
\end{table}
\pagebreak[2]
\noindent{\bf Details}
\smallskip
Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent
the features and whose rows represent the records of a data sample.
Given \texttt{X}, the script computes certain relevant bivariate statistics
for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}.
Command-line parameters \texttt{index1} and \texttt{index2} specify the files with
column pairs of interest to the user. Namely, the file given by \texttt{index1}
contains the vector of the 1st-attribute column indices and the file given
by \texttt{index2} has the vector of the 2nd-attribute column indices, with
``1st'' and ``2nd'' referring to their places in bivariate statistics.
Note that both \texttt{index1} and \texttt{index2} files should contain a 1-row matrix
of positive integers.
The bivariate statistics to be computed depend on the \emph{types}, or
\emph{measurement levels}, of the two columns.
The types for each pair are provided in the files whose locations are specified by
\texttt{types1} and \texttt{types2} command-line parameters.
These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and
the 2nd-attribute column types in the same order as their indices in the
\texttt{index1} and \texttt{index2} files. The types must be provided as per
the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
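For illustration only (the column indices below are hypothetical and not taken from any
particular dataset), suppose the \texttt{index1} file holds the 1-row matrix $(1,\,4,\,7)$.
A matching \texttt{types1} file would then hold one type code per listed index, in the same order:
\begin{equation*}
\texttt{index1} \,=\, \big(\,1 \;\;\; 4 \;\;\; 7\,\big), \qquad
\texttt{types1} \,=\, \big(\,1 \;\;\; 2 \;\;\; 3\,\big)
\end{equation*}
Under the encoding above, this declares column \texttt{X[,$\,1$]} as scale,
column \texttt{X[,$\,4$]} as nominal, and column \texttt{X[,$\,7$]} as ordinal.
The \texttt{index2} and \texttt{types2} files are aligned in the same way.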
The script organizes its results into (potentially) four output matrices, one per
type combination. The types of bivariate statistics are defined using the types
of the columns that were used for their arguments, with ``ordinal'' sometimes
falling back to ``nominal.'' Table~\ref{table:bivars} describes what each row
in each output matrix contains. In particular, the script includes the following
statistics:
\begin{Itemize}
\item For a pair of scale (quantitative) columns, \NameStatR;
\item For a pair of nominal columns (with finite-sized, fixed, unordered domains),
the \NameStatChi{} and its p-value;
\item For a pair of one scale column and one nominal column, \NameStatF{};
\item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho.
\end{Itemize}
Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the
column indices of the features involved in each statistic.
Moreover, if the output matrix does not contain
a value in a certain cell then it should be interpreted as a~$0$
(sparse matrix representation).
Below we list all bivariate statistics computed by script \BivarScriptName.
The statistics are collected into several groups by the type of their input
features. We refer to the two input features as $v_1$ and $v_2$ unless
specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$,
where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size.
\paragraph{Scale-vs-scale statistics.}
Sample statistics that describe association between two quantitative (scale) features.
A scale feature has numerical values, with the natural ordering relation.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it\NameStatR]:
A measure of linear dependence between two numerical features:
\begin{equation*}
r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}}
\,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}%
{\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}}
\end{equation*}
Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value
pairs $(v_{1,i}, v_{2,i})$ lie on the same line. Correlation near~0 means that a line is not a good
way to represent the dependence between the two features; however, this does not imply independence.
The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to
linearly increase (decrease) when the other feature increases. Nonlinear association, if present,
may disobey this sign.
\NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$
to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$.
Suppose that we use simple linear regression to represent one feature given the other, say
represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$
to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$.
Then the best error equals
\begin{equation*}
\min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\,
(1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2
\end{equation*}
In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to
the total sum of squares. Hence, $r^2$ is an accuracy measure of the linear regression.
\end{Description}
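As a worked illustration of \NameStatR{} (the data are hypothetical, chosen only to keep
the arithmetic short), take $n = 5$ value pairs with $v_1 = (1, 2, 3, 4, 5)$ and
$v_2 = (2, 4, 5, 4, 5)$, so that $\bar{v}_1 = 3$ and $\bar{v}_2 = 4$. Then
\begin{equation*}
\sum_{i=1}^5 (v_{1,i} - \bar{v}_1)(v_{2,i} - \bar{v}_2) = 6, \quad
\sum_{i=1}^5 (v_{1,i} - \bar{v}_1)^2 = 10, \quad
\sum_{i=1}^5 (v_{2,i} - \bar{v}_2)^2 = 6, \quad
r = \frac{6}{\sqrt{10 \cdot 6}} \approx 0.775
\end{equation*}
Here $r^2 = 0.6$. The least-squares line for these pairs is $v_2 \approx 2.2 + 0.6\, v_1$,
and its residual sum of squares is $2.4 = (1 - 0.6)\cdot 6$, in agreement with the
identity relating $r^2$ to the regression error.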
\paragraph{Nominal-vs-nominal statistics.}
Sample statistics that describe association between two nominal categorical features.
Both features' value domains are encoded with positive integers in arbitrary order:
nominal features do not order their value domains.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it\NameStatChi]:
A measure of how much the frequencies of value pairs of two categorical features deviate from
statistical independence. Under independence, the probability of every value pair must equal
the product of probabilities of each value in the pair:
$\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$. But we do not know these (hypothesized) probabilities;
we only know the sample frequency counts. Let $n_{a,b}$ be the frequency count of pair
$(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone. Under
independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due
to sample randomness, yet it is unlikely to be too far from~0. For some pairs $(a,b)$ it may
deviate from~0 farther than for other pairs. \NameStatChi{}~is an aggregate measure that
combines squares of these differences across all value pairs:
\begin{equation*}
\chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2
\,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}
\end{equation*}
where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are
the \emph{expected} frequencies for all pairs~$(a,b)$. Under independence (plus other standard
assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for
statistical tests for independence, see~\emph{\NameStatPChi} for details. Note that \NameStatChi{}
does \emph{not} measure the strength of dependence: even very weak dependence may result in a
significant deviation from independence if the counts are large enough. Use~\NameStatV{} instead
to measure the strength of dependence.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Degrees of freedom]:
An integer parameter required for the interpretation of~\NameStatChi{} measure. Under independence
(plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the
sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is
this integer parameter. For a pair of categorical features such that the $1^{\textrm{st}}$~feature
has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees
of freedom is $d = (k_1 - 1)(k_2 - 1)$.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it\NameStatPChi]:
A measure of how likely we are to observe frequencies of value pairs at least as extreme as the
current ones for two categorical features, assuming their statistical independence. More precisely, it computes the probability that
the sum of $d$~squares of independent normal random variables with mean~0 and variance~1
(called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large
as the current sample \NameStatChi. The $d$ parameter is \emph{degrees of freedom}, see above.
Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the
$\chi^2$~distribution and is unlikely to land very far into its tail. On the other hand, if the
two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$
and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample.
\NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi:
\begin{equation*}
P\,\,=\,\, \Prob\big[r \geq \textrm{\NameStatChi} \,\,\big|\,\, r \sim \textrm{the $\chi^2$ distribution}\big]
\end{equation*}
As any probability, $P$ ranges between 0 and~1. If $P\leq 0.05$, the dependence between the two
features may be considered statistically significant (i.e.\ their independence is considered
statistically ruled out). For highly dependent features, it is not unusual to have $P\leq 10^{-20}$,
in which case our script will simply return $P = 0$. Independent features should have
their $P\geq 0.05$ in about 95\% of cases.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it\NameStatV]:
A measure for the strength of association, i.e.\ of statistical dependence, between two categorical
features, conceptually similar to \NameStatR. It divides the observed~\NameStatChi{} by the maximum
possible~$\chi^2_{\textrm{max}}$ given $n$ and the number $k_1, k_2$~of categories in each feature,
then takes the square root. Thus, \NameStatV{} ranges from 0 to~1,
where 0 implies no association and 1 implies the maximum possible association (one-to-one
correspondence) between the two features. See \emph{\NameStatChi} for the computation of~$\chi^2$;
its maximum${} = {}$%
$n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature
has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV},
so
\begin{equation*}
\textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}}
\end{equation*}
As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases,
\NameStatV{} goes towards~1 (slowly) as the dependence increases. Both \NameStatChi{} and
\NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the
ratio.
\end{Description}
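The nominal-vs-nominal statistics can be traced on a small contingency table. The counts
below are hypothetical and serve only as a sketch: the first feature has $k_1 = 2$ categories,
the second has $k_2 = 3$, and $n = 100$. The row totals are $50$ and $50$, the column totals are
$30$, $30$, and $40$, which gives the expected frequencies $E_{a,b} = n_a n_b / n$:
\begin{equation*}
\big(O_{a,b}\big) \,=\, \left(\begin{array}{ccc} 20 & 10 & 20 \\ 10 & 20 & 20 \end{array}\right),
\qquad
\big(E_{a,b}\big) \,=\, \left(\begin{array}{ccc} 15 & 15 & 20 \\ 15 & 15 & 20 \end{array}\right)
\end{equation*}
Four cells deviate from their expected count by~$5$ and two cells do not deviate, hence
\begin{equation*}
\chi^2 \,=\, 4 \cdot \frac{5^2}{15} \,=\, \frac{20}{3} \,\approx\, 6.67, \qquad
d \,=\, (2 - 1)(3 - 1) \,=\, 2, \qquad
V \,=\, \sqrt{\frac{20/3}{100 \cdot \min\{1, 2\}}} \,\approx\, 0.258
\end{equation*}
For $d = 2$ the tail probability of the $\chi^2$~distribution has the closed form
$e^{-\chi^2/2}$, so here $P = e^{-10/3} \approx 0.036$. In this sketch the P-value falls
below $0.05$ while the strength of association stays modest, which illustrates why the two
measures should be read together.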
\paragraph{Nominal-vs-scale statistics.}
Sample statistics that describe association between a categorical feature
(order ignored) and a quantitative (scale) feature.
The values of the categorical feature must be coded as positive integers.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it\NameStatEta]:
A measure for the strength of association (statistical dependence) between a nominal feature
and a scale feature, conceptually similar to \NameStatR. Ranges from 0 to~1, approaching 0
when there is no association and approaching 1 when there is a strong association.
The nominal feature, treated as the independent variable, is assumed to have relatively few
possible values, all with large frequency counts. The scale feature is treated as the dependent
variable. Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have:
\begin{equation*}
\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\,\,\,\,\textrm{where}\,\,\,\,
\hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n
\,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\!
\end{equation*}
and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean. Value $\hat{y}[x]$ is the average
of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor''
of $y$ given~$x$. Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error
sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$.
Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the
``R-squared'' statistic measures the accuracy of linear regression. Our output $\eta$
is the square root of~$\eta^2$.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it\NameStatF]:
A measure of how much the values of the scale feature, denoted here by~$y$,
deviate from statistical independence on the nominal feature, denoted by~$x$.
The same measure appears in the one-way analysis of vari\-ance (ANOVA).
Like \NameStatChi, \NameStatF{} is used to test the hypothesis that
$y$~is independent of~$x$, given the following assumptions:
\begin{Itemize}
\item The scale feature $y$ has an approximately normal distribution whose mean
may depend only on~$x$ and whose variance is the same for all~$x$;
\item The nominal feature $x$ has a relatively small value domain with large
frequency counts, and the $x_i$-values are treated as fixed (non-random);
\item All records are sampled independently of each other.
\end{Itemize}
To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$
among all records where $x_i = x$. These $\hat{y}[x]$ can be viewed as
``predictors'' of $y$ given~$x$; if $y$ is independent of~$x$, they should
``predict'' only the global mean~$\bar{y}$. Then we form two sums-of-squares:
\begin{Itemize}
\item \emph{Residual} sum-of-squares, measuring the ``predictor'' accuracy: $\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2$;
\item \emph{Explained} sum-of-squares, measuring the ``predictor'' variability: $\sum_{i=1}^{n} \big(\hat{y}[x_i] - \bar{y}\big)^2$.
\end{Itemize}
\NameStatF{} is the ratio of the explained sum-of-squares to
the residual sum-of-squares, each divided by their corresponding degrees
of freedom:
\begin{equation*}
F \,\,=\,\,
\frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}%
{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\,
\frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2}
\end{equation*}
Here $k$ is the domain size of the nominal feature~$x$. The $k$ ``predictors''
lose 1~degree of freedom due to their linear dependence with~$\bar{y}$; similarly,
the $n$ values~$y_i$ lose $k$~degrees of freedom due to the ``predictors''.
The statistic can test whether the hypothesis that $y$ is independent of $x$ is reasonable;
more generally (with relaxed normality assumptions) it can test the hypothesis that
\emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$.
Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution.
But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{}
becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far
into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample.
\end{Description}
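A small worked example may help connect \NameStatEta{} and \NameStatF{}. The records are
hypothetical: $n = 6$, the nominal feature is $x = (1, 1, 2, 2, 3, 3)$ with $k = 3$ categories,
and the scale feature is $y = (1, 3, 4, 6, 8, 8)$. The per-category averages are
$\hat{y}[1] = 2$, $\hat{y}[2] = 5$, $\hat{y}[3] = 8$, and the global mean is $\bar{y} = 5$. Then
\begin{equation*}
\sum_{i=1}^{6} \big(y_i - \hat{y}[x_i]\big)^2 = 4, \qquad
\sum_{x} \mathop{\mathrm{freq}}(x)\,\big(\hat{y}[x] - \bar{y}\big)^2 = 36, \qquad
\sum_{i=1}^{6} (y_i - \bar{y})^2 = 40
\end{equation*}
so the residual and explained sums-of-squares add up to the total. It follows that
\begin{equation*}
\eta^2 = 1 - \frac{4}{40} = 0.9, \qquad \eta \approx 0.949, \qquad
F = \frac{36 \,/\, (3 - 1)}{4 \,/\, (6 - 3)} = 13.5
= \frac{6 - 3}{3 - 1} \cdot \frac{0.9}{1 - 0.9}
\end{equation*}
With so few records per category the distributional assumptions listed for \NameStatF{}
are not realistic; the numbers only illustrate the arithmetic.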
\paragraph{Ordinal-vs-ordinal statistics.}
Sample statistics that describe association between two ordinal categorical features.
Both features' value domains are encoded with positive integers, so that the natural
order of the integers coincides with the order in each value domain.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it\NameStatRho]:
A measure for the strength of association (statistical dependence) between
two ordinal features, conceptually similar to \NameStatR. Specifically, it is \NameStatR{}
applied to the feature vectors in which all values are replaced by their ranks, i.e.\
their positions if the vector is sorted. The ranks of identical (duplicate) values
are replaced with their average rank. For example, in vector
$(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted
order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average
rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$.
Our implementation of \NameStatRho{} is geared towards features having small value domains
and large counts for the values. Given the two input vectors, we form a contingency table $T$
of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$
and~$f_2$. Here in $T_{i,j}$, $f_{1,i}$, $f_{2,j}$ indices $i$ and~$j$ refer to the
order-preserving integer encoding of the feature values.
We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks:
$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$.
Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their
covariance weighted by~$T$, before applying the standard formula for \NameStatR:
\begin{equation*}
\rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}}
\,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}%
{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}}
\end{equation*}
where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, analogously for~$\bar{r}_2$.
The value of $\rho$ lies between $-1$ and $+1$, with sign indicating the prevalent direction
of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease)
when the other feature increases. The correlation reaches~${\pm}1$ when the two features are
strictly monotonically related (increasing for~$+1$, decreasing for~$-1$).
\end{Description}
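The contingency-table computation of \NameStatRho{} can be sketched on a hypothetical
example with $n = 6$ records, where the first ordinal feature has three levels and the
second has two:
\begin{equation*}
T \,=\, \left(\begin{array}{cc} 2 & 0 \\ 1 & 1 \\ 0 & 2 \end{array}\right), \qquad
f_1 = (2, 2, 2), \qquad f_2 = (3, 3)
\end{equation*}
The prefix-sum formula gives the average ranks $r_1 = (1.5,\, 3.5,\, 5.5)$ and
$r_2 = (2,\, 5)$, with $\bar{r}_1 = \bar{r}_2 = 3.5$. The rank deviations are
$(-2,\, 0,\, 2)$ and $(-1.5,\, 1.5)$, therefore
\begin{equation*}
\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1)(r_{2,j} - \bar{r}_2) = 12, \quad
\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^2 = 16, \quad
\sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^2 = 13.5, \quad
\rho = \frac{12}{\sqrt{16 \cdot 13.5}} \approx 0.816
\end{equation*}
Only the two corner cells $T_{1,1}$ and $T_{3,2}$ contribute to the covariance sum here,
because the middle level of the first feature has zero rank deviation.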
\smallskip
\noindent{\bf Returns}
\smallskip
A collection of (potentially) 4 matrices. Each matrix contains bivariate statistics that
resulted from a different combination of feature types. There is one matrix for scale-scale
statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}),
one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics
(includes \NameStatRho). If any of these matrices is not produced, then no pair of columns required
the corresponding type combination. See Table~\ref{table:bivars} for the matrix naming and
format details.
\smallskip
\pagebreak[2]
\noindent{\bf Examples}
\smallskip
{\hangindent=\parindent\noindent\tt
\hml -f \BivarScriptName{} -nvargs
X=/user/biadmin/X.mtx
index1=/user/biadmin/S1.mtx
index2=/user/biadmin/S2.mtx
types1=/user/biadmin/K1.mtx
types2=/user/biadmin/K2.mtx
OUTDIR=/user/biadmin/stats.mtx
}