| \begin{comment} |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| \end{comment} |
| |
| \subsection{Bivariate Statistics} |
| |
| \noindent{\bf Description} |
| \smallskip |
| |
| Bivariate statistics are used to quantitatively describe the association between |
| two features, such as test their statistical (in-)dependence or measure |
| the accuracy of one data feature predicting the other feature, in a sample. |
| The \BivarScriptName{} script computes common bivariate statistics, |
| such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs |
| of data features. For a given dataset matrix, script \BivarScriptName{} computes |
| certain bivariate statistics for the given feature (column) pairs in the |
| matrix. The feature types govern the exact set of statistics computed for that pair. |
| For example, \NameStatR{} can only be computed on two quantitative (scale) |
| features like `Height' and `Temperature'. |
| It does not make sense to compute the linear correlation of two categorical attributes |
| like `Hair Color'. |
| |
| |
| \smallskip |
| \noindent{\bf Usage} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\it%\tolerance=0 |
| {\tt{}-f }path/\/\BivarScriptName{} |
| {\tt{} -nvargs} |
| {\tt{} X=}path/file |
| {\tt{} index1=}path/file |
| {\tt{} index2=}path/file |
| {\tt{} types1=}path/file |
| {\tt{} types2=}path/file |
| {\tt{} OUTDIR=}path |
| % {\tt{} fmt=}format |
| |
| } |
| |
| |
| \smallskip |
| \noindent{\bf Arguments} |
| \begin{Description} |
| \item[{\tt X}:] |
| Location (on HDFS) to read the data matrix $X$ whose columns are the features |
| that we want to compare and correlate with bivariate statistics. |
| \item[{\tt index1}:] % (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to read the single-row matrix that lists the column indices |
| of the \emph{first-argument} features in pairwise statistics. |
| Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the |
| index $k$ of column \texttt{X[,$\,k$]} in the data matrix |
| whose bivariate statistics need to be computed. |
| % The default value means ``use all $X$-columns from the first to the last.'' |
| \item[{\tt index2}:] % (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to read the single-row matrix that lists the column indices |
| of the \emph{second-argument} features in pairwise statistics. |
| Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the |
| index $l$ of column \texttt{X[,$\,l$]} in the data matrix |
| whose bivariate statistics need to be computed. |
| % The default value means ``use all $X$-columns from the first to the last.'' |
| \item[{\tt types1}:] % (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to read the single-row matrix that lists the \emph{types} |
| of the \emph{first-argument} features in pairwise statistics. |
| Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type |
| of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$ |
| entry in the {\tt index1} matrix. Feature types must be encoded by |
| integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. |
| % The default value means ``treat all referenced $X$-columns as scale.'' |
| \item[{\tt types2}:] % (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to read the single-row matrix that lists the \emph{types} |
| of the \emph{second-argument} features in pairwise statistics. |
| Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type |
| of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$ |
| entry in the {\tt index2} matrix. Feature types must be encoded by |
| integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. |
| % The default value means ``treat all referenced $X$-columns as scale.'' |
| \item[{\tt OUTDIR}:] |
| Location path (on HDFS) where the output matrices with computed bivariate |
| statistics will be stored. The matrices' file names and format are defined |
| in Table~\ref{table:bivars}. |
| % \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) |
| % Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; |
| % see read/write functions in SystemML Language Reference for details. |
| \end{Description} |
| |
| \begin{table}[t]\hfil |
| \begin{tabular}{|lll|} |
| \hline\rule{0pt}{12pt}% |
| Ouput File / Matrix & Row$\,$\# & Name of Statistic \\[2pt] |
| \hline\hline\rule{0pt}{12pt}% |
| \emph{All Files} & 1 & 1-st feature column \\ |
| \rule{1em}{0pt}" & 2 & 2-nd feature column \\[2pt] |
| \hline\rule{0pt}{12pt}% |
| bivar.scale.scale.stats & 3 & \NameStatR \\[2pt] |
| \hline\rule{0pt}{12pt}% |
| bivar.nominal.nominal.stats & 3 & \NameStatChi \\ |
| \rule{1em}{0pt}" & 4 & Degrees of freedom \\ |
| \rule{1em}{0pt}" & 5 & \NameStatPChi \\ |
| \rule{1em}{0pt}" & 6 & \NameStatV \\[2pt] |
| \hline\rule{0pt}{12pt}% |
| bivar.nominal.scale.stats & 3 & \NameStatEta \\ |
| \rule{1em}{0pt}" & 4 & \NameStatF \\[2pt] |
| \hline\rule{0pt}{12pt}% |
| bivar.ordinal.ordinal.stats & 3 & \NameStatRho \\[2pt] |
| \hline |
| \end{tabular}\hfil |
| \caption{% |
| The output matrices of \BivarScriptName{} have one row per one bivariate |
| statistic and one column per one pair of input features. This table lists |
| the meaning of each matrix and each row.% |
| % Signs ``+'' show applicability to scale or/and to categorical features. |
| } |
| \label{table:bivars} |
| \end{table} |
| |
| |
| |
| \pagebreak[2] |
| |
| \noindent{\bf Details} |
| \smallskip |
| |
| Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent |
| the features and whose rows represent the records of a data sample. |
| Given \texttt{X}, the script computes certain relevant bivariate statistics |
| for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}. |
| Command-line parameters \texttt{index1} and \texttt{index2} specify the files with |
| column pairs of interest to the user. Namely, the file given by \texttt{index1} |
| contains the vector of the 1st-attribute column indices and the file given |
| by \texttt{index2} has the vector of the 2nd-attribute column indices, with |
| ``1st'' and ``2nd'' referring to their places in bivariate statistics. |
| Note that both \texttt{index1} and \texttt{index2} files should contain a 1-row matrix |
| of positive integers. |
| |
| The bivariate statistics to be computed depend on the \emph{types}, or |
| \emph{measurement levels}, of the two columns. |
| The types for each pair are provided in the files whose locations are specified by |
| \texttt{types1} and \texttt{types2} command-line parameters. |
| These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and |
| the 2nd-attribute column types in the same order as their indices in the |
| \texttt{index1} and \texttt{index2} files. The types must be provided as per |
| the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. |
| |
| The script orgainizes its results into (potentially) four output matrices, one per |
| each type combination. The types of bivariate statistics are defined using the types |
| of the columns that were used for their arguments, with ``ordinal'' sometimes |
| retrogressing to ``nominal.'' Table~\ref{table:bivars} describes what each column |
| in each output matrix contains. In particular, the script includes the following |
| statistics: |
| \begin{Itemize} |
| \item For a pair of scale (quantitative) columns, \NameStatR; |
| \item For a pair of nominal columns (with finite-sized, fixed, unordered domains), |
| the \NameStatChi{} and its p-value; |
| \item For a pair of one scale column and one nominal column, \NameStatF{}; |
| \item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho. |
| \end{Itemize} |
| Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the |
| column indices of the features involved in each statistic. |
| Moreover, if the output matrix does not contain |
| a value in a certain cell then it should be interpreted as a~$0$ |
| (sparse matrix representation). |
| |
| Below we list all bivariate statistics computed by script \BivarScriptName. |
| The statistics are collected into several groups by the type of their input |
| features. We refer to the two input features as $v_1$ and $v_2$ unless |
| specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$, |
| where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size. |
| |
| |
| \paragraph{Scale-vs-scale statistics.} |
| Sample statistics that describe association between two quantitative (scale) features. |
| A scale feature has numerical values, with the natural ordering relation. |
| \begin{Description} |
| %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% |
| \item[\it\NameStatR]: |
| A measure of linear dependence between two numerical features: |
| \begin{equation*} |
| r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}} |
| \,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}% |
| {\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}} |
| \end{equation*} |
| Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value |
| pairs $(v_{1,i}, v_{2,i})$ lie on the same line. Correlation near~0 means that a line is not a good |
| way to represent the dependence between the two features; however, this does not imply independence. |
| The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to |
| linearly increase (decrease) when the other feature increases. Nonlinear association, if present, |
| may disobey this sign. |
| \NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$ |
| to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$. |
| |
| Suppose that we use simple linear regression to represent one feature given the other, say |
| represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$ |
| to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$. |
| Then the best error equals |
| \begin{equation*} |
| \min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\, |
| (1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2 |
| \end{equation*} |
| In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to |
| the total sum of squares. Hence, $r^2$ is an accuracy measure of the linear regression. |
| \end{Description} |
| |
| |
| \paragraph{Nominal-vs-nominal statistics.} |
| Sample statistics that describe association between two nominal categorical features. |
| Both features' value domains are encoded with positive integers in arbitrary order: |
| nominal features do not order their value domains. |
| \begin{Description} |
| %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% |
| \item[\it\NameStatChi]: |
| A measure of how much the frequencies of value pairs of two categorical features deviate from |
| statistical independence. Under independence, the probability of every value pair must equal |
| the product of probabilities of each value in the pair: |
| $\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$. But we do not know these (hypothesized) probabilities; |
| we only know the sample frequency counts. Let $n_{a,b}$ be the frequency count of pair |
| $(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone. Under |
| independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due |
| to sample randomness, yet it is unlikely to be too far from~0. For some pairs $(a,b)$ it may |
| deviate from~0 farther than for other pairs. \NameStatChi{}~is an aggregate measure that |
| combines squares of these differences across all value pairs: |
| \begin{equation*} |
| \chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2 |
| \,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}} |
| \end{equation*} |
| where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are |
| the \emph{expected} frequencies for all pairs~$(a,b)$. Under independence (plus other standard |
| assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for |
| statistical tests for independence, see~\emph{\NameStatPChi} for details. Note that \NameStatChi{} |
| does \emph{not} measure the strength of dependence: even very weak dependence may result in a |
| significant deviation from independence if the counts are large enough. Use~\NameStatV{} instead |
| to measure the strength of dependence. |
| %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% |
| \item[\it Degrees of freedom]: |
| An integer parameter required for the interpretation of~\NameStatChi{} measure. Under independence |
| (plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the |
| sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is |
| this integer parameter. For a pair of categorical features such that the $1^{\textrm{st}}$~feature |
| has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees |
| of freedom is $d = (k_1 - 1)(k_2 - 1)$. |
| %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% |
| \item[\it\NameStatPChi]: |
| A measure of how likely we would observe the current frequencies of value pairs of two categorical |
| features assuming their statistical independence. More precisely, it computes the probability that |
| the sum of $d$~squares of independent normal random variables with mean~0 and variance~1 |
| (called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large |
| as the current sample \NameStatChi. The $d$ parameter is \emph{degrees of freedom}, see above. |
| Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the |
| $\chi^2$~distribution and is unlikely to land very far into its tail. On the other hand, if the |
| two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$ |
| and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample. |
| \NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi: |
| \begin{equation*} |
| P\,\,=\,\, \Prob\big[r \geq \textrm{\NameStatChi} \,\,\big|\,\, r \sim \textrm{the $\chi^2$ distribution}\big] |
| \end{equation*} |
| As any probability, $P$ ranges between 0 and~1. If $P\leq 0.05$, the dependence between the two |
| features may be considered statistically significant (i.e.\ their independence is considered |
| statistically ruled out). For highly dependent features, it is not unusual to have $P\leq 10^{-20}$ |
| or less, in which case our script will simply return $P = 0$. Independent features should have |
| their $P\geq 0.05$ in about 95\% cases. |
| %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% |
| \item[\it\NameStatV]: |
| A measure for the strength of association, i.e.\ of statistical dependence, between two categorical |
| features, conceptually similar to \NameStatR. It divides the observed~\NameStatChi{} by the maximum |
| possible~$\chi^2_{\textrm{max}}$ given $n$ and the number $k_1, k_2$~of categories in each feature, |
| then takes the square root. Thus, \NameStatV{} ranges from 0 to~1, |
| where 0 implies no association and 1 implies the maximum possible association (one-to-one |
| correspondence) between the two features. See \emph{\NameStatChi} for the computation of~$\chi^2$; |
| its maximum${} = {}$% |
| $n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature |
| has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV}, |
| so |
| \begin{equation*} |
| \textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}} |
| \end{equation*} |
| As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases, |
| \NameStatV{} goes towards~1 (slowly) as the dependence increases. Both \NameStatChi{} and |
| \NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the |
| ratio. |
| \end{Description} |
| |
| |
| \paragraph{Nominal-vs-scale statistics.} |
| Sample statistics that describe association between a categorical feature |
| (order ignored) and a quantitative (scale) feature. |
| The values of the categorical feature must be coded as positive integers. |
| \begin{Description} |
| %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% |
| \item[\it\NameStatEta]: |
| A measure for the strength of association (statistical dependence) between a nominal feature |
| and a scale feature, conceptually similar to \NameStatR. Ranges from 0 to~1, approaching 0 |
| when there is no association and approaching 1 when there is a strong association. |
| The nominal feature, treated as the independent variable, is assumed to have relatively few |
| possible values, all with large frequency counts. The scale feature is treated as the dependent |
| variable. Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have: |
| \begin{equation*} |
| \eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, |
| \,\,\,\,\textrm{where}\,\,\,\, |
| \hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n |
| \,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\! |
| \end{equation*} |
| and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean. Value $\hat{y}[x]$ is the average |
| of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor'' |
| of $y$ given~$x$. Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error |
| sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$. |
| Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the |
| ``R-squared'' statistic measures the accuracy of linear regression. Our output $\eta$ |
| is the square root of~$\eta^2$. |
| %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% |
| \item[\it\NameStatF]: |
| A measure of how much the values of the scale feature, denoted here by~$y$, |
| deviate from statistical independence on the nominal feature, denoted by~$x$. |
| The same measure appears in the one-way analysis of vari\-ance (ANOVA). |
| Like \NameStatChi, \NameStatF{} is used to test the hypothesis that |
| $y$~is independent from~$x$, given the following assumptions: |
| \begin{Itemize} |
| \item The scale feature $y$ has approximately normal distribution whose mean |
| may depend only on~$x$ and variance is the same for all~$x$; |
| \item The nominal feature $x$ has relatively small value domain with large |
| frequency counts, the $x_i$-values are treated as fixed (non-random); |
| \item All records are sampled independently of each other. |
| \end{Itemize} |
| To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$ |
| among all records where $x_i = x$. These $\hat{y}[x]$ can be viewed as |
| ``predictors'' of $y$ given~$x$; if $y$ is independent on~$x$, they should |
| ``predict'' only the global mean~$\bar{y}$. Then we form two sums-of-squares: |
| \begin{Itemize} |
| \item \emph{Residual} sum-of-squares of the ``predictor'' accuracy: $y_i - \hat{y}[x_i]$; |
| \item \emph{Explained} sum-of-squares of the ``predictor'' variability: $\hat{y}[x_i] - \bar{y}$. |
| \end{Itemize} |
| \NameStatF{} is the ratio of the explained sum-of-squares to |
| the residual sum-of-squares, each divided by their corresponding degrees |
| of freedom: |
| \begin{equation*} |
| F \,\,=\,\, |
| \frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}% |
| {\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\, |
| \frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2} |
| \end{equation*} |
| Here $k$ is the domain size of the nominal feature~$x$. The $k$ ``predictors'' |
| lose 1~freedom due to their linear dependence with~$\bar{y}$; similarly, |
| the $n$~$y_i$-s lose $k$~freedoms due to the ``predictors''. |
| |
| The statistic can test if the independence hypothesis of $y$ from $x$ is reasonable; |
| more generally (with relaxed normality assumptions) it can test the hypothesis that |
| \emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$. |
| Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution. |
| But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{} |
| becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far |
| into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample. |
| \end{Description} |
| |
| |
| \paragraph{Ordinal-vs-ordinal statistics.} |
| Sample statistics that describe association between two ordinal categorical features. |
| Both features' value domains are encoded with positive integers, so that the natural |
| order of the integers coincides with the order in each value domain. |
| \begin{Description} |
| %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% |
| \item[\it\NameStatRho]: |
| A measure for the strength of association (statistical dependence) between |
| two ordinal features, conceptually similar to \NameStatR. Specifically, it is \NameStatR{} |
| applied to the feature vectors in which all values are replaced by their ranks, i.e.\ |
| their positions if the vector is sorted. The ranks of identical (duplicate) values |
| are replaced with their average rank. For example, in vector |
| $(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted |
| order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average |
| rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$. |
| |
| Our implementation of \NameStatRho{} is geared towards features having small value domains |
| and large counts for the values. Given the two input vectors, we form a contingency table $T$ |
| of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$ |
| and~$f_2$. Here in $T_{i,j}$, $f_{1,i}$, $f_{2,j}$ indices $i$ and~$j$ refer to the |
| order-preserving integer encoding of the feature values. |
| We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks: |
| $r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$. |
| Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their |
| covariance weighted by~$T$, before applying the standard formula for \NameStatR: |
| \begin{equation*} |
| \rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}} |
| \,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}% |
| {\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}} |
| \end{equation*} |
| where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, analogously for~$\bar{r}_2$. |
| The value of $\rho$ lies between $-1$ and $+1$, with sign indicating the prevalent direction |
| of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease) |
| when the other feature increases. The correlation becomes~1 when the two features are |
| monotonically related. |
| \end{Description} |
| |
| |
| \smallskip |
| \noindent{\bf Returns} |
| \smallskip |
| |
| A collection of (potentially) 4 matrices. Each matrix contains bivariate statistics that |
| resulted from a different combination of feature types. There is one matrix for scale-scale |
| statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}), |
| one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics |
| (includes \NameStatRho). If any of these matrices is not produced, then no pair of columns required |
| the corresponding type combination. See Table~\ref{table:bivars} for the matrix naming and |
| format details. |
| |
| |
| \smallskip |
| \pagebreak[2] |
| |
| \noindent{\bf Examples} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\tt |
| \hml -f \BivarScriptName{} -nvargs |
| X=/user/biadmin/X.mtx |
| index1=/user/biadmin/S1.mtx |
| index2=/user/biadmin/S2.mtx |
| types1=/user/biadmin/K1.mtx |
| types2=/user/biadmin/K2.mtx |
| OUTDIR=/user/biadmin/stats.mtx |
| |
| } |
| |