
\begin{comment}

 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements. See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership. The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied. See the License for the
 specific language governing permissions and limitations
 under the License.

\end{comment}

\subsection{Univariate Statistics}

\noindent{\bf Description}
\smallskip

\emph{Univariate statistics} are the simplest form of descriptive statistics in data
analysis. They quantitatively describe the main characteristics of each
feature in the data. For a given dataset matrix, script \UnivarScriptName{} computes
certain univariate statistics for each feature column in the
matrix. The feature type governs the exact set of statistics computed for that feature.
For example, the statistic \emph{mean} can only be computed on a quantitative (scale)
feature such as `Height' or `Temperature'. It does not make sense to compute the mean
of a categorical attribute such as `Hair Color'.

\smallskip
\noindent{\bf Usage}
\smallskip

{\hangindent=\parindent\noindent\it%\tolerance=0
{\tt{}-f } \UnivarScriptName{}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} TYPES=}path/file
{\tt{} STATS=}path/file
% {\tt{} fmt=}format

}

\medskip
\pagebreak[2]

\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the data matrix $X$ whose columns we want to
analyze as the features.
\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "})
Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$
column-cell contains the type of the $i^{\textrm{th}}$ feature column
\texttt{X[,$\,i$]} in the data matrix. Feature types must be encoded by
integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
% The default value means ``treat all $X$-columns as scale.''
\item[{\tt STATS}:]
Location (on HDFS) where the output matrix of computed statistics
will be stored. The format of the output matrix is defined by
Table~\ref{table:univars}.
% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
% see read/write functions in SystemML Language Reference for details.
\end{Description}
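
\smallskip
\noindent
As a small illustration of the \texttt{TYPES} encoding, consider a hypothetical
dataset (invented here for illustration only) in which \texttt{X} has three feature
columns: a height (scale), a hair color (nominal), and a satisfaction rating (ordinal),
in that order. Under the convention above, the single-row types matrix would be
\begin{equation*}
\left(\begin{array}{ccc} 1 & 2 & 3 \end{array}\right)
\end{equation*}
so that the first column receives the scale statistics, while the second and third
columns receive the categorical statistics listed in Table~\ref{table:univars}.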

\begin{table}[t]\hfil
\begin{tabular}{|rl|c|c|}
\hline
\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\
                            &                            & Scale & Categ.\\
\hline
\OutputRowIDMinimum         & Minimum                    &   +   &       \\
\OutputRowIDMaximum         & Maximum                    &   +   &       \\
\OutputRowIDRange           & Range                      &   +   &       \\
\OutputRowIDMean            & Mean                       &   +   &       \\
\OutputRowIDVariance        & Variance                   &   +   &       \\
\OutputRowIDStDeviation     & Standard deviation         &   +   &       \\
\OutputRowIDStErrorMean     & Standard error of mean     &   +   &       \\
\OutputRowIDCoeffVar        & Coefficient of variation   &   +   &       \\
\OutputRowIDSkewness        & Skewness                   &   +   &       \\
\OutputRowIDKurtosis        & Kurtosis                   &   +   &       \\
\OutputRowIDStErrorSkewness & Standard error of skewness &   +   &       \\
\OutputRowIDStErrorCurtosis & Standard error of kurtosis &   +   &       \\
\OutputRowIDMedian          & Median                     &   +   &       \\
\OutputRowIDIQMean          & Interquartile mean         &   +   &       \\
\OutputRowIDNumCategories   & Number of categories       &       &   +   \\
\OutputRowIDMode            & Mode                       &       &   +   \\
\OutputRowIDNumModes        & Number of modes            &       &   +   \\
\hline
\end{tabular}\hfil

\caption{The output matrix of \UnivarScriptName{} has one row per
univariate statistic and one column per input feature. This table lists
the meaning of each row. Signs ``+'' show applicability to scale and/or
to categorical features.}
\label{table:univars}
\end{table}

\pagebreak[1]

\smallskip
\noindent{\bf Details}
\smallskip

Given an input matrix \texttt{X}, this script computes the set of all
relevant univariate statistics for each feature column \texttt{X[,$\,i$]}
in~\texttt{X}. The list of statistics to be computed depends on the
\emph{type}, or \emph{measurement level}, of each column.
The \texttt{TYPES} command-line argument points to a vector containing
the types of all columns. The types must be provided as per the following
convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.

Below we list all univariate statistics computed by script \UnivarScriptName{}.
The statistics are collected by relevance into several groups, namely: central
tendency, dispersion, shape, and categorical measures. The first three groups
contain statistics computed for a quantitative (also known as: numerical, scale,
or continuous) feature; the last group contains the statistics for a categorical
(either nominal or ordinal) feature.

Let~$n$ be the number of data records (rows) with feature values.
In what follows we fix a column index \texttt{idx} and consider
sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}.
Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}
in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$.
Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order,
preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.

\paragraph{Central tendency measures.}
Sample statistics that describe the location of the quantitative (scale) feature distribution
and represent it with a single value.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Mean]
\OutputRowText{\OutputRowIDMean}
The arithmetic average over a sample of a quantitative feature.
Computed as the ratio between the sum of values and the number of values:
$\left(\sum_{i=1}^n v_i\right)\!/n$.
Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals~5.2.

Note that the mean is significantly affected by extreme values in the sample
and may be misleading as a central tendency measure if the feature varies on
an exponential scale. For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$
is 22.222, greater than all the sample values except the~largest.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{figure}[t]
\setlength{\unitlength}{10pt}
\begin{picture}(33,12)
\put( 6.2, 0.0){\small 2.2}
\put(10.2, 0.0){\small 3.2}
\put(12.2, 0.0){\small 3.7}
\put(15.0, 0.0){\small 4.4}
\put(18.6, 0.0){\small 5.3}
\put(20.2, 0.0){\small 5.7}
\put(21.75,0.0){\small 6.1}
\put(23.05,0.0){\small 6.4}
\put(26.2, 0.0){\small 7.2}
\put(28.6, 0.0){\small 7.8}
\put( 0.5, 0.7){\small 0.0}
\put( 0.1, 3.2){\small 0.25}
\put( 0.5, 5.7){\small 0.5}
\put( 0.1, 8.2){\small 0.75}
\put( 0.5,10.7){\small 1.0}
\linethickness{1.5pt}
\put( 2.0, 1.0){\line(1,0){4.8}}
\put( 6.8, 1.0){\line(0,1){1.0}}
\put( 6.8, 2.0){\line(1,0){4.0}}
\put(10.8, 2.0){\line(0,1){1.0}}
\put(10.8, 3.0){\line(1,0){2.0}}
\put(12.8, 3.0){\line(0,1){1.0}}
\put(12.8, 4.0){\line(1,0){2.8}}
\put(15.6, 4.0){\line(0,1){1.0}}
\put(15.6, 5.0){\line(1,0){3.6}}
\put(19.2, 5.0){\line(0,1){1.0}}
\put(19.2, 6.0){\line(1,0){1.6}}
\put(20.8, 6.0){\line(0,1){1.0}}
\put(20.8, 7.0){\line(1,0){1.6}}
\put(22.4, 7.0){\line(0,1){1.0}}
\put(22.4, 8.0){\line(1,0){1.2}}
\put(23.6, 8.0){\line(0,1){1.0}}
\put(23.6, 9.0){\line(1,0){3.2}}
\put(26.8, 9.0){\line(0,1){1.0}}
\put(26.8,10.0){\line(1,0){2.4}}
\put(29.2,10.0){\line(0,1){1.0}}
\put(29.2,11.0){\line(1,0){4.8}}
\linethickness{0.3pt}
\put( 6.8, 1.0){\circle*{0.3}}
\put(10.8, 1.0){\circle*{0.3}}
\put(12.8, 1.0){\circle*{0.3}}
\put(15.6, 1.0){\circle*{0.3}}
\put(19.2, 1.0){\circle*{0.3}}
\put(20.8, 1.0){\circle*{0.3}}
\put(22.4, 1.0){\circle*{0.3}}
\put(23.6, 1.0){\circle*{0.3}}
\put(26.8, 1.0){\circle*{0.3}}
\put(29.2, 1.0){\circle*{0.3}}
\put( 6.8, 1.0){\vector(1,0){27.2}}
\put( 2.0, 1.0){\vector(0,1){10.8}}
\put( 2.0, 3.5){\line(1,0){10.8}}
\put( 2.0, 6.0){\line(1,0){17.2}}
\put( 2.0, 8.5){\line(1,0){21.6}}
\put( 2.0,11.0){\line(1,0){27.2}}
\put(12.8, 1.0){\line(0,1){2.0}}
\put(19.2, 1.0){\line(0,1){5.0}}
\put(20.0, 1.0){\line(0,1){5.0}}
\put(23.6, 1.0){\line(0,1){7.0}}
\put( 9.0, 4.0){\line(1,0){3.8}}
\put( 9.2, 2.7){\vector(0,1){0.8}}
\put( 9.2, 4.8){\vector(0,-1){0.8}}
\put(19.4, 8.0){\line(1,0){3.0}}
\put(19.6, 7.2){\vector(0,1){0.8}}
\put(19.6, 9.3){\vector(0,-1){0.8}}
\put(13.0, 2.2){\small $q_{25\%}$}
\put(17.3, 2.2){\small $q_{50\%}$}
\put(23.8, 2.2){\small $q_{75\%}$}
\put(20.15,3.5){\small $\mu$}
\put( 8.0, 3.75){\small $\phi_1$}
\put(18.35,7.8){\small $\phi_2$}
\end{picture}
\caption{The computation of quartiles, median, and interquartile mean from the
empirical distribution function of the 10-point
sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$. Each vertical step in
the graph has height~$1{/}n = 0.1$. Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote
the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles respectively;
value~$\mu$ denotes the median. Values $\phi_1$ and $\phi_2$ show the partial contribution
of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ to the interquartile mean.}
\label{fig:example_quartiles}
\end{figure}

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Median]
\OutputRowText{\OutputRowIDMedian}
The ``middle'' value that separates the higher half of the sample values
(in sorted order) from the lower half.
To compute the median, we sort the sample in increasing order, preserving
duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$,
same as the $50^{\textrm{th}}$~percentile of the sample.
If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$,
so we compute the median as the mean of these two values.
(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$,
not as the median.) Example: the median of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}.

Unlike the mean, the median is not sensitive to extreme values in the sample,
i.e.\ it is robust to outliers. It works better as a measure of central tendency
for heavy-tailed distributions and features that vary on an exponential scale.
However, the median is sensitive to small sample size.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Interquartile mean]
\OutputRowText{\OutputRowIDIQMean}
For a sample of a quantitative feature, this is
the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile
and less than or equal to the $3^{\textrm{rd}}$ quartile.
In other words, it is a ``truncated mean'' where the lowest 25$\%$ and
the highest 25$\%$ of the sorted values are omitted in its computation.
The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
quartiles themselves, contribute to this mean only partially.
This measure is occasionally used as the ``robust'' version of the mean
that is less sensitive to the extreme values.

To compute the measure, we sort the sample in increasing order,
preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index
and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index,
then compute the following weighted mean:
\begin{equation*}
\frac{1}{3{/}4 - 1{/}4} \left[
\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+
\sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i
\,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right]
\end{equation*}
In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the
two quartiles themselves enter the sum with reduced weights. The weights are proportional
to the vertical steps in the empirical distribution function of the sample, see
Figure~\ref{fig:example_quartiles} for an illustration.
Example: the interquartile mean of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum
$0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$,
which equals~5.31.
\end{Description}
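
\smallskip
\noindent
As a worked check of the weighted-mean formula for the interquartile mean, the
10-point sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ can be
expanded term by term. This is only an illustrative calculation; the shorthand
``IQM'' is introduced here for convenience. With $n = 10$ we have
$j = \lceil 10{/}4 \rceil = 3$ and $k = \lceil 30{/}4 \rceil = 8$, so that
\begin{equation*}
\begin{array}{rcl}
\textrm{IQM} & = & \displaystyle 2\left[\left(\frac{3}{10} - \frac{1}{4}\right) 3.7
\,+\, \frac{1}{10}\,(4.4 + 5.3 + 5.7 + 6.1)
\,+\, \left(\frac{3}{4} - \frac{7}{10}\right) 6.4\right] \\[6pt]
& = & 2\,\left[\,0.185 + 2.15 + 0.32\,\right] \;=\; 5.31
\end{array}
\end{equation*}
Before the factor of~2, each border value carries weight $0.05$, half the weight
$0.1$ of an interior value, which is why the two quartiles contribute only partially.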

\paragraph{Dispersion measures.}
Statistics that describe the amount of variation or spread in a quantitative
(scale) data feature.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Variance]
\OutputRowText{\OutputRowIDVariance}
A measure of dispersion, or spread, of sample values around their mean,
expressed in units that are the square of those of the feature itself.
Computed as the sum of squared differences between the values
in the sample and their mean, divided by one less than the number of
values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where
$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
Example: the variance of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
Note that at least two values ($n\geq 2$) are required to avoid division
by zero. Sample variance is sensitive to outliers, even more than the mean.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Standard deviation]
\OutputRowText{\OutputRowIDStDeviation}
A measure of dispersion around the mean, the square root of variance.
Computed by taking the square root of the sample variance;
see \emph{Variance} above on computing the variance.
Example: the standard deviation of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
At least two values are required to avoid division by zero.
Note that standard deviation is sensitive to outliers.

Standard deviation is used in conjunction with the mean to determine
an interval containing a given percentage of the feature values,
assuming the normal distribution. In a large sample from a normal
distribution, around 68\% of the cases fall within one standard
deviation and around 95\% of cases fall within two standard deviations
of the mean. For example, if the mean age is 45 with a standard deviation
of 10, around 95\% of the cases would be between 25 and 65 in a normal
distribution.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Coefficient of variation]
\OutputRowText{\OutputRowIDCoeffVar}
The ratio of the standard deviation to the mean, i.e.\ the
\emph{relative} standard deviation, of a quantitative feature sample.
Computed by dividing the sample \emph{standard deviation} by the
sample \emph{mean}, see above for their computation details.
Example: the coefficient of variation for sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals 1.8$\,{/}\,$5.2~${\approx}$~0.346.

This metric is used primarily with non-negative features such as
financial or population data. It is sensitive to outliers.
Note: zero mean causes division by zero, returning infinity or \texttt{NaN}.
At least two values (records) are required to compute the standard deviation.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Minimum]
\OutputRowText{\OutputRowIDMinimum}
The smallest value of a quantitative sample, computed as $\min v = v^s_1$.
Example: the minimum of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals~2.2.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Maximum]
\OutputRowText{\OutputRowIDMaximum}
The largest value of a quantitative sample, computed as $\max v = v^s_n$.
Example: the maximum of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals~7.8.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Range]
\OutputRowText{\OutputRowIDRange}
The difference between the largest and the smallest value of a quantitative
sample, computed as $\max v - \min v = v^s_n - v^s_1$.
It provides information about the overall spread of the sample values.
Example: the range of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals 7.8$\,{-}\,$2.2~${=}$~5.6.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Standard error of the mean]
\OutputRowText{\OutputRowIDStErrorMean}
A measure of how much the value of the sample mean may vary from sample
to sample taken from the same (hypothesized) distribution of the feature.
It helps to roughly bound the distribution mean, i.e.\
the limit of the sample mean as the sample size tends to infinity.
Under certain assumptions (e.g.\ normality and large sample), the difference
between the distribution mean and the sample mean is unlikely to exceed
2~standard errors.

The measure is computed by dividing the sample standard deviation
by the square root of the number of values~$n$; see \emph{standard deviation}
for its computation details. Ensure $n\,{\geq}\,2$ to avoid division by~0.
Example: for sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
with the mean of~5.2 the standard error of the mean
equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569.

Note that the standard error itself is subject to sample randomness.
Its accuracy as an error estimator may be low if the sample is small
or \mbox{non-i.i.d.}, if there are outliers, or if the distribution has
heavy tails.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% | |

% \item[\it Quartiles] | |

% \OutputRowText{\OutputRowIDQuartiles} | |

% %%% dsDefn %%%% | |

% The values of a quantitative feature | |

% that divide an ordered/sorted set of data records into four equal-size groups. | |

% The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits | |

% the sorted data into the lowest $25\%$ and the highest~$75\%$. In other words, | |

% it is the middle value between the minimum and the median. The $2^{\textrm{nd}}$ | |

% quartile is the median itself, the value that separates the higher half of | |

% the data (in the sorted order) from the lower half. Finally, the $3^{\textrm{rd}}$ | |

% quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into | |

% lowest $75\%$ and highest~$25\%$.\par | |

% %%% dsComp %%%% | |

% To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values | |

% we sort it in the increasing order, preserving duplicates, then return | |

% \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]} | |

% where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$. | |

% When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted | |

% to equal the mean of two middle values | |

% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and | |

% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$. | |

% %%% dsWarn %%%% | |

% We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values. | |

% %%% dsExmpl %%% | |

% \textbf{Example(s).} | |

\end{Description}
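
\smallskip
\noindent
The variance, standard deviation, and standard error of the mean quoted for the
10-point sample all follow from a single sum of squares. As an illustrative
calculation with $\bar{v} = 5.2$, the deviations $v_i - \bar{v}$ are
$-3.0$, $-2.0$, $-1.5$, $-0.8$, $0.1$, $0.5$, $0.9$, $1.2$, $2.0$, $2.6$, and
\begin{equation*}
\sum_{i=1}^{10} (v_i - \bar{v})^2 = 29.16, \qquad
\frac{29.16}{10 - 1} = 3.24, \qquad
\sqrt{3.24} = 1.8, \qquad
\frac{1.8}{\sqrt{10}} \approx 0.569
\end{equation*}
giving the variance, the standard deviation, and the standard error of the mean in turn.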

\paragraph{Shape measures.}
Statistics that describe the shape and symmetry of the quantitative (scale)
feature distribution estimated from a sample of its values.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Skewness]
\OutputRowText{\OutputRowIDSkewness}
It measures how symmetrically the values of a feature are spread out
around the mean. A significant positive skewness implies a longer (or fatter)
right tail, i.e.\ feature values tend to lie farther away from the mean on the
right side. A significant negative skewness implies a longer (or fatter) left
tail. The normal distribution is symmetric and has a skewness value of~0;
however, its sample skewness is likely to be nonzero, just close to zero.
As a guideline, a skewness value more than twice its standard error is taken
to indicate a departure from symmetry.

Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
of the standard deviation. We estimate the $3^{\textrm{rd}}$~central moment as
the sum of cubed differences between the values in the feature column and their
sample mean, divided by the number of values:
$\sum_{i=1}^n (v_i - \bar{v})^3 / n$
where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
The standard deviation is computed
as described above in \emph{standard deviation}. To avoid division by~0,
at least two different sample values are required. Example: for sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
with the mean of~5.2 and the standard deviation of~1.8
skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
Note: skewness is sensitive to outliers.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Standard error of skewness]
\OutputRowText{\OutputRowIDStErrorSkewness}
A measure of how much the sample skewness may vary from sample to sample,
assuming that the feature is normally distributed, which makes its
distribution skewness equal~0.
Given the number~$n$ of sample values, the standard error is computed as
\begin{equation*}
\sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
\end{equation*}
This measure can tell us, for example:
\begin{Itemize}
\item If the sample skewness lands within two standard errors from~0, its
positive or negative sign is not significant and may just be accidental.
\item If the sample skewness lands outside this interval, the feature
is unlikely to be normally distributed.
\end{Itemize}
At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
Note that the standard error is inaccurate if the feature distribution is
far from normal or if the number of sample values is small.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Kurtosis]
\OutputRowText{\OutputRowIDKurtosis}
As a distribution parameter, kurtosis is a measure of the extent to which
feature values cluster around a central point. In other words, it quantifies
``peakedness'' of the distribution: how tall and sharp the central peak is
relative to a standard bell curve.

Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
to a normal distribution:
\begin{Itemize}
\item observations cluster more about the center (peak-shaped),
\item the tails are thinner at non-extreme values,
\item the tails are thicker at extreme values.
\end{Itemize}
Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
to a normal distribution:
\begin{Itemize}
\item observations cluster less about the center (box-shaped),
\item the tails are thicker at non-extreme values,
\item the tails are thinner at extreme values.
\end{Itemize}
Kurtosis of a normal distribution is zero; however, the sample kurtosis
(computed here) is likely to deviate from zero.

Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
by the $4^{\textrm{th}}$~power of the standard deviation, minus~3.
We estimate the $4^{\textrm{th}}$~central moment as the sum of the
$4^{\textrm{th}}$~powers of differences between the values in the feature column
and their sample mean, divided by the number of values:
$\sum_{i=1}^n (v_i - \bar{v})^4 / n$
where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
The standard deviation is computed as described above, see \emph{standard deviation}.

Note that kurtosis is sensitive to outliers, and requires at least two different
sample values. Example: for sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
with the mean of~5.2 and the standard deviation of~1.8,
sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Standard error of kurtosis]
\OutputRowText{\OutputRowIDStErrorCurtosis}
A measure of how much the sample kurtosis may vary from sample to sample,
assuming that the feature is normally distributed, which makes its
distribution kurtosis equal~0.
Given the number~$n$ of sample values, the standard error is computed as
\begin{equation*}
\sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
\end{equation*}
This measure can tell us, for example:
\begin{Itemize}
\item If the sample kurtosis lands within two standard errors from~0, its
positive or negative sign is not significant and may just be accidental.
\item If the sample kurtosis lands outside this interval, the feature
is unlikely to be normally distributed.
\end{Itemize}
At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
Note that the standard error is inaccurate if the feature distribution is
far from normal or if the number of sample values is small.
\end{Description}
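
\smallskip
\noindent
To illustrate how the two standard errors are used, both formulas can be evaluated
for the 10-point sample ($n = 10$). The rounded values below are a hand calculation
given for illustration, not output produced by the script:
\begin{equation*}
\sqrt{\frac{6\cdot 10\cdot 9}{8\cdot 11\cdot 13}} = \sqrt{\frac{540}{1144}} \approx 0.687,
\qquad
\sqrt{\frac{24\cdot 10\cdot 9^2}{7\cdot 8\cdot 13\cdot 15}} = \sqrt{\frac{19440}{10920}} \approx 1.334
\end{equation*}
The sample skewness of about $-0.184$ lies within $\pm\, 2 \times 0.687$ of~0, and the
sample kurtosis of about $-1.41$ lies within $\pm\, 2 \times 1.334$ of~0, so by the
two-standard-error guideline neither statistic indicates a departure from normality.
With only 10~values, however, these error estimates are themselves rough.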

\paragraph{Categorical measures.} Statistics that describe the sample of
a categorical feature, either nominal or ordinal. We represent all
categories by integers from~1 to the number of categories; we call
these integers \emph{category~IDs}.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Number of categories]
\OutputRowText{\OutputRowIDNumCategories}
The maximum category~ID that occurs in the sample. Note that some
categories with~IDs \emph{smaller} than this maximum~ID may have
no~occurrences in the sample, without reducing the number of categories.
However, any categories with~IDs \emph{larger} than the maximum~ID with
no occurrences in the sample will not be counted.
Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
the number of categories is reported as~8. Category~IDs 2 and~6, which have
zero occurrences, are still counted; but if there is a category with
ID${}=9$ and zero occurrences, it is not counted.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Mode]
\OutputRowText{\OutputRowIDMode}
The most frequently occurring category value.
If several values share the greatest frequency of occurrence, then each
of them is a mode; but here we report only the smallest of these modes.
Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
the modes are 3 and~7, with 3 reported.

Computed by counting the number of occurrences for each category,
then taking the smallest category~ID that has the maximum count.
Note that the sample modes may be different from the distribution modes,
i.e.\ the categories whose (hypothesized) underlying probability is the
maximum over all categories.

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Number of modes]
\OutputRowText{\OutputRowIDNumModes}
The number of category values that each have the largest frequency
count in the sample.
Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
there are two category IDs (3 and~7) that occur the maximum count of 4~times;
hence, we return~2.

Computed by counting the number of occurrences for each category,
then counting how many categories have the maximum count.
Note that the sample modes may be different from the distribution modes,
i.e.\ the categories whose (hypothesized) underlying probability is the
maximum over all categories.
\end{Description}
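
\smallskip
\noindent
All three categorical measures can be read off a single frequency table. As an
illustration, the counts for the sample
$\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$ are:
\begin{center}
\begin{tabular}{|l|cccccccc|}
\hline
Category ID & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\
\hline
Count       & 1 & 0 & 4 & 2 & 1 & 0 & 4 & 3 \\
\hline
\end{tabular}
\end{center}
The largest ID with a nonzero count is~8, which gives the number of categories.
The maximum count of~4 is attained by IDs 3 and~7, so the number of modes is~2,
and the smaller of the two, 3, is reported as the mode.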

\smallskip
\noindent{\bf Returns}
\smallskip

The output matrix containing all computed statistics has $17$~rows and
as many columns as the input matrix~\texttt{X}. Each row corresponds to
a particular statistic, according to the convention specified in
Table~\ref{table:univars}. The first $14$~statistics apply to
\emph{scale} columns, and the last $3$~statistics apply to categorical,
i.e.\ nominal and ordinal, columns.

\pagebreak[2]

\smallskip
\noindent{\bf Examples}
\smallskip

{\hangindent=\parindent\noindent\tt
\hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
  TYPES=/user/biadmin/types.mtx
  STATS=/user/biadmin/stats.mtx

}