\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Univariate Statistics}
\noindent{\bf Description}
\smallskip
\emph{Univariate statistics} are the simplest form of descriptive statistics in data
analysis. They are used to quantitatively describe the main characteristics of each
feature in the data. For a given dataset matrix, script \UnivarScriptName{} computes
certain univariate statistics for each feature column in the
matrix. The feature type governs the exact set of statistics computed for that feature.
For example, the statistic \emph{mean} can only be computed on a quantitative (scale)
feature such as `Height' or `Temperature'; it does not make sense to compute the mean
of a categorical attribute such as `Hair Color'.
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%\tolerance=0
{\tt{}-f } \UnivarScriptName{}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} TYPES=}path/file
{\tt{} STATS=}path/file
% {\tt{} fmt=}format
}
\medskip
\pagebreak[2]
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the data matrix $X$ whose columns we want to
analyze as the features.
\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "})
Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$
column-cell contains the type of the $i^{\textrm{th}}$ feature column
\texttt{X[,$\,i$]} in the data matrix. Feature types must be encoded by
integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
% The default value means ``treat all $X$-columns as scale.''
\item[{\tt STATS}:]
Location (on HDFS) where the output matrix of computed statistics
will be stored. The format of the output matrix is defined by
Table~\ref{table:univars}.
% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
% see read/write functions in SystemML Language Reference for details.
\end{Description}
\begin{table}[t]\hfil
\begin{tabular}{|rl|c|c|}
\hline
\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\
& & Scale & Categ.\\
\hline
\OutputRowIDMinimum & Minimum & + & \\
\OutputRowIDMaximum & Maximum & + & \\
\OutputRowIDRange & Range & + & \\
\OutputRowIDMean & Mean & + & \\
\OutputRowIDVariance & Variance & + & \\
\OutputRowIDStDeviation & Standard deviation & + & \\
\OutputRowIDStErrorMean & Standard error of mean & + & \\
\OutputRowIDCoeffVar & Coefficient of variation & + & \\
\OutputRowIDSkewness & Skewness & + & \\
\OutputRowIDKurtosis & Kurtosis & + & \\
\OutputRowIDStErrorSkewness & Standard error of skewness & + & \\
\OutputRowIDStErrorCurtosis & Standard error of kurtosis & + & \\
\OutputRowIDMedian & Median & + & \\
\OutputRowIDIQMean & Interquartile mean & + & \\
\OutputRowIDNumCategories & Number of categories & & + \\
\OutputRowIDMode & Mode & & + \\
\OutputRowIDNumModes & Number of modes & & + \\
\hline
\end{tabular}\hfil
\caption{The output matrix of \UnivarScriptName{} has one row per
univariate statistic and one column per input feature. This table lists
the meaning of each row. Signs ``+'' show applicability to scale and/or
to categorical features.}
\label{table:univars}
\end{table}
\pagebreak[1]
\smallskip
\noindent{\bf Details}
\smallskip
Given an input matrix \texttt{X}, this script computes the set of all
relevant univariate statistics for each feature column \texttt{X[,$\,i$]}
in~\texttt{X}. The list of statistics to be computed depends on the
\emph{type}, or \emph{measurement level}, of each column.
The \texttt{TYPES} command-line argument points to a vector containing
the types of all columns. The types must be provided as per the following
convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
Below we list all univariate statistics computed by script \UnivarScriptName.
The statistics are collected by relevance into several groups, namely: central
tendency, dispersion, shape, and categorical measures. The first three groups
contain statistics computed for a quantitative (also known as: numerical, scale,
or continuous) feature; the last group contains the statistics for a categorical
(either nominal or ordinal) feature.
Let~$n$ be the number of data records (rows) with feature values.
In what follows we fix a column index \texttt{idx} and consider
sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}.
Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}
in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$.
Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order,
preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
\paragraph{Central tendency measures.}
Sample statistics that describe the location of a quantitative (scale) feature distribution
by representing it with a single value.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Mean]
\OutputRowText{\OutputRowIDMean}
The arithmetic average over a sample of a quantitative feature.
Computed as the ratio between the sum of values and the number of values:
$\left(\sum_{i=1}^n v_i\right)\!/n$.
Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals~5.2.
Note that the mean is significantly affected by extreme values in the sample
and may be misleading as a central tendency measure if the feature varies on
exponential scale. For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$
is 22.222, greater than all the sample values except the~largest.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\setlength{\unitlength}{10pt}
\begin{picture}(33,12)
\put( 6.2, 0.0){\small 2.2}
\put(10.2, 0.0){\small 3.2}
\put(12.2, 0.0){\small 3.7}
\put(15.0, 0.0){\small 4.4}
\put(18.6, 0.0){\small 5.3}
\put(20.2, 0.0){\small 5.7}
\put(21.75,0.0){\small 6.1}
\put(23.05,0.0){\small 6.4}
\put(26.2, 0.0){\small 7.2}
\put(28.6, 0.0){\small 7.8}
\put( 0.5, 0.7){\small 0.0}
\put( 0.1, 3.2){\small 0.25}
\put( 0.5, 5.7){\small 0.5}
\put( 0.1, 8.2){\small 0.75}
\put( 0.5,10.7){\small 1.0}
\linethickness{1.5pt}
\put( 2.0, 1.0){\line(1,0){4.8}}
\put( 6.8, 1.0){\line(0,1){1.0}}
\put( 6.8, 2.0){\line(1,0){4.0}}
\put(10.8, 2.0){\line(0,1){1.0}}
\put(10.8, 3.0){\line(1,0){2.0}}
\put(12.8, 3.0){\line(0,1){1.0}}
\put(12.8, 4.0){\line(1,0){2.8}}
\put(15.6, 4.0){\line(0,1){1.0}}
\put(15.6, 5.0){\line(1,0){3.6}}
\put(19.2, 5.0){\line(0,1){1.0}}
\put(19.2, 6.0){\line(1,0){1.6}}
\put(20.8, 6.0){\line(0,1){1.0}}
\put(20.8, 7.0){\line(1,0){1.6}}
\put(22.4, 7.0){\line(0,1){1.0}}
\put(22.4, 8.0){\line(1,0){1.2}}
\put(23.6, 8.0){\line(0,1){1.0}}
\put(23.6, 9.0){\line(1,0){3.2}}
\put(26.8, 9.0){\line(0,1){1.0}}
\put(26.8,10.0){\line(1,0){2.4}}
\put(29.2,10.0){\line(0,1){1.0}}
\put(29.2,11.0){\line(1,0){4.8}}
\linethickness{0.3pt}
\put( 6.8, 1.0){\circle*{0.3}}
\put(10.8, 1.0){\circle*{0.3}}
\put(12.8, 1.0){\circle*{0.3}}
\put(15.6, 1.0){\circle*{0.3}}
\put(19.2, 1.0){\circle*{0.3}}
\put(20.8, 1.0){\circle*{0.3}}
\put(22.4, 1.0){\circle*{0.3}}
\put(23.6, 1.0){\circle*{0.3}}
\put(26.8, 1.0){\circle*{0.3}}
\put(29.2, 1.0){\circle*{0.3}}
\put( 6.8, 1.0){\vector(1,0){27.2}}
\put( 2.0, 1.0){\vector(0,1){10.8}}
\put( 2.0, 3.5){\line(1,0){10.8}}
\put( 2.0, 6.0){\line(1,0){17.2}}
\put( 2.0, 8.5){\line(1,0){21.6}}
\put( 2.0,11.0){\line(1,0){27.2}}
\put(12.8, 1.0){\line(0,1){2.0}}
\put(19.2, 1.0){\line(0,1){5.0}}
\put(20.0, 1.0){\line(0,1){5.0}}
\put(23.6, 1.0){\line(0,1){7.0}}
\put( 9.0, 4.0){\line(1,0){3.8}}
\put( 9.2, 2.7){\vector(0,1){0.8}}
\put( 9.2, 4.8){\vector(0,-1){0.8}}
\put(19.4, 8.0){\line(1,0){3.0}}
\put(19.6, 7.2){\vector(0,1){0.8}}
\put(19.6, 9.3){\vector(0,-1){0.8}}
\put(13.0, 2.2){\small $q_{25\%}$}
\put(17.3, 2.2){\small $q_{50\%}$}
\put(23.8, 2.2){\small $q_{75\%}$}
\put(20.15,3.5){\small $\mu$}
\put( 8.0, 3.75){\small $\phi_1$}
\put(18.35,7.8){\small $\phi_2$}
\end{picture}
\caption{The computation of quartiles, median, and interquartile mean from the
empirical distribution function of the 10-point
sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$. Each vertical step in
the graph has height~$1{/}n = 0.1$. Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote
the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles, respectively;
value~$\mu$ denotes the median. Values $\phi_1$ and $\phi_2$ show the partial contributions
of the border points (quartiles) $v_3=3.7$ and $v_8=6.4$ to the interquartile mean.}
\label{fig:example_quartiles}
\end{figure}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Median]
\OutputRowText{\OutputRowIDMedian}
The ``middle'' value that separates the higher half of the sample values
(in sorted order) from the lower half.
To compute the median, we sort the sample in increasing order, preserving
duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$,
same as the $50^{\textrm{th}}$~percentile of the sample.
If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$,
so we compute the median as the mean of these two values.
(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$,
not as the median.) Example: the median of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}.
Unlike the mean, the median is not sensitive to extreme values in the sample,
i.e.\ it is robust to outliers. It works better as a measure of central tendency
for heavy-tailed distributions and features that vary on exponential scale.
However, the median is sensitive to small sample size.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Interquartile mean]
\OutputRowText{\OutputRowIDIQMean}
For a sample of a quantitative feature, this is
the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile
and less than or equal to the $3^{\textrm{rd}}$ quartile.
In other words, it is a ``truncated mean'' where the lowest 25$\%$ and
the highest 25$\%$ of the sorted values are omitted in its computation.
The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
quartiles themselves, contribute to this mean only partially.
This measure is occasionally used as the ``robust'' version of the mean
that is less sensitive to the extreme values.
To compute the measure, we sort the sample in increasing order,
preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index
and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index,
then compute the following weighted mean:
\begin{equation*}
\frac{1}{3{/}4 - 1{/}4} \left[
\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+
\sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i
\,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right]
\end{equation*}
In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the
two quartiles themselves enter the sum with reduced weights. The weights are proportional
to the vertical steps in the empirical distribution function of the sample, see
Figure~\ref{fig:example_quartiles} for an illustration.
Example: the interquartile mean of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum
$0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$,
which equals~5.31.
\end{Description}
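The three central tendency measures above can be sketched in a few lines of plain Python.
This is an illustrative sketch only, not the implementation used by \UnivarScriptName{};
the function names \texttt{mean}, \texttt{median}, and \texttt{interquartile\_mean} are
hypothetical, and the interquartile mean assumes $n \geq 2$ so that the two quartile
indices differ.

```python
def mean(values):
    """Arithmetic average: sum of values divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted sample; mean of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def interquartile_mean(values):
    """Weighted mean between the 1st and 3rd quartiles, following the formula above.

    Assumes n >= 2 (an assumption of this sketch, so that j < k).
    """
    s = sorted(values)
    n = len(s)
    if n < 2:
        raise ValueError("this sketch requires at least two values")
    j = -((-n) // 4)       # ceil(n / 4), 1-based index of the 1st quartile
    k = -((-3 * n) // 4)   # ceil(3n / 4), 1-based index of the 3rd quartile
    total = (j / n - 0.25) * s[j - 1]                      # partial weight of v^s_j
    total += sum(s[i - 1] for i in range(j + 1, k)) / n    # full weight 1/n for j < i < k
    total += (0.75 - (k - 1) / n) * s[k - 1]               # partial weight of v^s_k
    return total / (0.75 - 0.25)
```

Applied to the 10-point sample used throughout this section, these functions reproduce,
up to floating-point rounding, the values 5.2, 5.5, and 5.31 quoted above.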
\paragraph{Dispersion measures.}
Statistics that describe the amount of variation or spread in a quantitative
(scale) data feature.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Variance]
\OutputRowText{\OutputRowIDVariance}
A measure of the dispersion, or spread, of sample values around their mean,
expressed in units that are the square of those of the feature itself.
Computed as the sum of squared differences between the values
in the sample and their mean, divided by one less than the number of
values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where
$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
Example: the variance of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
Note that at least two values ($n\geq 2$) are required to avoid division
by zero. Sample variance is sensitive to outliers, even more than the mean.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Standard deviation]
\OutputRowText{\OutputRowIDStDeviation}
A measure of dispersion around the mean, the square root of variance.
Computed by taking the square root of the sample variance;
see \emph{Variance} above on computing the variance.
Example: the standard deviation of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
At least two values are required to avoid division by zero.
Note that standard deviation is sensitive to outliers.
Standard deviation is used in conjunction with the mean to determine
an interval containing a given percentage of the feature values,
assuming the normal distribution. In a large sample from a normal
distribution, around 68\% of the cases fall within one standard
deviation and around 95\% of cases fall within two standard deviations
of the mean. For example, if the mean age is 45 with a standard deviation
of 10, around 95\% of the cases would be between 25 and 65 in a normal
distribution.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Coefficient of variation]
\OutputRowText{\OutputRowIDCoeffVar}
The ratio of the standard deviation to the mean, i.e.\ the
\emph{relative} standard deviation, of a quantitative feature sample.
Computed by dividing the sample \emph{standard deviation} by the
sample \emph{mean}, see above for their computation details.
Example: the coefficient of variation for sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals 1.8$\,{/}\,$5.2~${\approx}$~0.346.
This metric is used primarily with non-negative features such as
financial or population data. It is sensitive to outliers.
Note: zero mean causes division by zero, returning infinity or \texttt{NaN}.
At least two values (records) are required to compute the standard deviation.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Minimum]
\OutputRowText{\OutputRowIDMinimum}
The smallest value of a quantitative sample, computed as $\min v = v^s_1$.
Example: the minimum of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals~2.2.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Maximum]
\OutputRowText{\OutputRowIDMaximum}
The largest value of a quantitative sample, computed as $\max v = v^s_n$.
Example: the maximum of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals~7.8.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Range]
\OutputRowText{\OutputRowIDRange}
The difference between the largest and the smallest value of a quantitative
sample, computed as $\max v - \min v = v^s_n - v^s_1$.
It provides information about the overall spread of the sample values.
Example: the range of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals 7.8$\,{-}\,$2.2~${=}$~5.6.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Standard error of the mean]
\OutputRowText{\OutputRowIDStErrorMean}
A measure of how much the value of the sample mean may vary from sample
to sample taken from the same (hypothesized) distribution of the feature.
It helps to roughly bound the distribution mean, i.e.\
the limit of the sample mean as the sample size tends to infinity.
Under certain assumptions (e.g.\ normality and large sample), the difference
between the distribution mean and the sample mean is unlikely to exceed
2~standard errors.
The measure is computed by dividing the sample standard deviation
by the square root of the number of values~$n$; see \emph{standard deviation}
for its computation details. Ensure $n\,{\geq}\,2$ to avoid division by~0.
Example: for sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
with the mean of~5.2 the standard error of the mean
equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569.
Note that the standard error itself is subject to sample randomness.
Its accuracy as an error estimator may be low if the sample is small
or \mbox{non-i.i.d.}, if there are outliers, or if the distribution has
heavy tails.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
% \item[\it Quartiles]
% \OutputRowText{\OutputRowIDQuartiles}
% %%% dsDefn %%%%
% The values of a quantitative feature
% that divide an ordered/sorted set of data records into four equal-size groups.
% The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits
% the sorted data into the lowest $25\%$ and the highest~$75\%$. In other words,
% it is the middle value between the minimum and the median. The $2^{\textrm{nd}}$
% quartile is the median itself, the value that separates the higher half of
% the data (in the sorted order) from the lower half. Finally, the $3^{\textrm{rd}}$
% quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into
% lowest $75\%$ and highest~$25\%$.\par
% %%% dsComp %%%%
% To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values
% we sort it in the increasing order, preserving duplicates, then return
% \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]}
% where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$.
% When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted
% to equal the mean of two middle values
% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and
% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$.
% %%% dsWarn %%%%
% We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values.
% %%% dsExmpl %%%
% \textbf{Example(s).}
\end{Description}
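The dispersion measures above all derive from the sample mean, the sample variance, and
the extreme values, so they can be computed together. The following Python sketch is
illustrative only and is not the code of \UnivarScriptName{}; the function name
\texttt{dispersion\_stats} and the dictionary keys are hypothetical, and returning
\texttt{NaN} for a zero mean is an assumption of this sketch.

```python
import math


def dispersion_stats(values):
    """Return the dispersion measures of a quantitative sample as a dict."""
    n = len(values)
    if n < 2:
        # The variance divides by n - 1, so at least two values are needed.
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std_dev = math.sqrt(variance)
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "variance": variance,
        "std_dev": std_dev,
        "std_error_mean": std_dev / math.sqrt(n),
        # Assumption of this sketch: report NaN when the mean is zero.
        "coeff_var": std_dev / mean if mean != 0 else float("nan"),
    }
```

For the 10-point sample used in the examples above, the sum of squared deviations from
the mean is 29.16, which gives a variance of $29.16\,{/}\,9 = 3.24$ and a standard
deviation of~1.8.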
\paragraph{Shape measures.}
Statistics that describe the shape and symmetry of the quantitative (scale)
feature distribution estimated from a sample of its values.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Skewness]
\OutputRowText{\OutputRowIDSkewness}
It measures how symmetrically the values of a feature are spread out
around the mean. A significant positive skewness implies a longer (or fatter)
right tail, i.e.\ feature values tend to lie farther away from the mean on the
right side. A significant negative skewness implies a longer (or fatter) left
tail. The normal distribution is symmetric and has a skewness value of~0;
however, its sample skewness is likely to be nonzero, just close to zero.
As a guideline, a skewness value more than twice its standard error is taken
to indicate a departure from symmetry.
Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
of the standard deviation. We estimate the $3^{\textrm{rd}}$~central moment as
the sum of cubed differences between the values in the feature column and their
sample mean, divided by the number of values:
$\sum_{i=1}^n (v_i - \bar{v})^3 / n$
where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
The standard deviation is computed
as described above in \emph{standard deviation}. To avoid division by~0,
at least two different sample values are required. Example: for sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
with the mean of~5.2 and the standard deviation of~1.8
skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
Note: skewness is sensitive to outliers.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Standard error of skewness]
\OutputRowText{\OutputRowIDStErrorSkewness}
A measure of how much the sample skewness may vary from sample to sample,
assuming that the feature is normally distributed, which makes its
distribution skewness equal~0.
Given the number~$n$ of sample values, the standard error is computed as
\begin{equation*}
\sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
\end{equation*}
This measure can tell us, for example:
\begin{Itemize}
\item If the sample skewness lands within two standard errors from~0, its
positive or negative sign is not significant and may be accidental.
\item If the sample skewness lands outside this interval, the feature
is unlikely to be normally distributed.
\end{Itemize}
At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
Note that the standard error is inaccurate if the feature distribution is
far from normal or if the number of samples is small.
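As a worked instance of the formula above (an illustration, not output of the script):
for a sample of $n = 10$ values, such as the 10-point sample used in the other examples,
the standard error evaluates to
\begin{equation*}
\sqrt{\frac{6\cdot 10\cdot 9}{8\cdot 11\cdot 13}} \,=\, \sqrt{\frac{540}{1144}} \,\approx\, 0.687
\end{equation*}
The sample skewness of about $-0.184$ obtained for that sample lies well within two
standard errors ($\pm 1.374$) of~0, so its negative sign is not significant.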
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Kurtosis]
\OutputRowText{\OutputRowIDKurtosis}
As a distribution parameter, kurtosis is a measure of the extent to which
feature values cluster around a central point. In other words, it quantifies
``peakedness'' of the distribution: how tall and sharp the central peak is
relative to a standard bell curve.
Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
to a normal distribution:
\begin{Itemize}
\item observations cluster more about the center (peak-shaped),
\item the tails are thinner at non-extreme values,
\item the tails are thicker at extreme values.
\end{Itemize}
Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
to a normal distribution:
\begin{Itemize}
\item observations cluster less about the center (box-shaped),
\item the tails are thicker at non-extreme values,
\item the tails are thinner at extreme values.
\end{Itemize}
Kurtosis of a normal distribution is zero; however, the sample kurtosis
(computed here) is likely to deviate from zero.
Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
by the $4^{\textrm{th}}$~power of the standard deviation, minus~3.
We estimate the $4^{\textrm{th}}$~central moment as the sum of the
$4^{\textrm{th}}$~powers of differences between the values in the feature column
and their sample mean, divided by the number of values:
$\sum_{i=1}^n (v_i - \bar{v})^4 / n$
where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
The standard deviation is computed as described above, see \emph{standard deviation}.
Note that kurtosis is sensitive to outliers, and requires at least two different
sample values. Example: for sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
with the mean of~5.2 and the standard deviation of~1.8,
sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Standard error of kurtosis]
\OutputRowText{\OutputRowIDStErrorCurtosis}
A measure of how much the sample kurtosis may vary from sample to sample,
assuming that the feature is normally distributed, which makes its
distribution kurtosis equal~0.
Given the number~$n$ of sample values, the standard error is computed as
\begin{equation*}
\sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
\end{equation*}
This measure can tell us, for example:
\begin{Itemize}
\item If the sample kurtosis lands within two standard errors from~0, its
positive or negative sign is not significant and may be accidental.
\item If the sample kurtosis lands outside this interval, the feature
is unlikely to be normally distributed.
\end{Itemize}
At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
Note that the standard error is inaccurate if the feature distribution is
far from normal or if the number of samples is small.
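As a worked instance of the formula above (an illustration, not output of the script):
for a sample of $n = 10$ values, such as the 10-point sample used in the other examples,
the standard error evaluates to
\begin{equation*}
\sqrt{\frac{24\cdot 10\cdot 9^2}{7\cdot 8\cdot 13\cdot 15}} \,=\, \sqrt{\frac{19440}{10920}} \,\approx\, 1.334
\end{equation*}
The sample kurtosis of about $-1.41$ obtained for that sample lies within two standard
errors ($\pm 2.668$) of~0, so at this sample size it does not contradict normality.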
\end{Description}
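The shape measures and their standard errors can be computed together from the
$3^{\textrm{rd}}$ and $4^{\textrm{th}}$ central moments. The Python sketch below follows
the formulas given above; it is illustrative only and is not the code of
\UnivarScriptName{}. The function name \texttt{shape\_stats} is hypothetical, and the
sketch requires $n \geq 4$ and a nonzero standard deviation so that every division is
well defined.

```python
import math


def shape_stats(values):
    """Return (skewness, kurtosis, se_skewness, se_kurtosis) for a quantitative sample."""
    n = len(values)
    if n < 4:
        # The standard error of kurtosis has the factor (n - 3) in its denominator.
        raise ValueError("at least four values are required")
    mean = sum(values) / n
    # Sample standard deviation with the n - 1 denominator, as in the text.
    std_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if std_dev == 0:
        raise ValueError("at least two different values are required")
    # Central moments are divided by n, as in the text.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / std_dev ** 3
    kurtosis = m4 / std_dev ** 4 - 3
    se_skewness = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurtosis = math.sqrt(
        24 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5))
    )
    return skewness, kurtosis, se_skewness, se_kurtosis
```

A caller can then compare each measure against twice its standard error, following the
guideline stated above, to judge whether a departure from normality is indicated.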
\paragraph{Categorical measures.} Statistics that describe the sample of
a categorical feature, either nominal or ordinal. We represent all
categories by integers from~1 to the number of categories; we call
these integers \emph{category~IDs}.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Number of categories]
\OutputRowText{\OutputRowIDNumCategories}
The maximum category~ID that occurs in the sample. Note that some
categories with~IDs \emph{smaller} than this maximum~ID may have
no~occurrences in the sample, without reducing the number of categories.
However, any categories with~IDs \emph{larger} than the maximum~ID with
no occurrences in the sample will not be counted.
Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
the number of categories is reported as~8. Category~IDs 2 and~6, which have
zero occurrences, are still counted; but if there is a category with
ID${}=9$ and zero occurrences, it is not counted.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Mode]
\OutputRowText{\OutputRowIDMode}
The most frequently occurring category value.
If several values share the greatest frequency of occurrence, then each
of them is a mode; but here we report only the smallest of these modes.
Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
the modes are 3 and~7, with 3 reported.
Computed by counting the number of occurrences for each category,
then taking the smallest category~ID that has the maximum count.
Note that the sample modes may be different from the distribution modes,
i.e.\ the categories whose (hypothesized) underlying probability is the
maximum over all categories.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Number of modes]
\OutputRowText{\OutputRowIDNumModes}
The number of category values that each have the largest frequency
count in the sample.
Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
there are two category IDs (3 and~7) that occur the maximum count of 4~times;
hence, we return~2.
Computed by counting the number of occurrences for each category,
then counting how many categories have the maximum count.
Note that the sample modes may be different from the distribution modes,
i.e.\ the categories whose (hypothesized) underlying probability is the
maximum over all categories.
\end{Description}
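All three categorical measures follow from a single pass that counts the occurrences of
each category~ID. The Python sketch below illustrates this; it is not the code of
\UnivarScriptName{}, and the function name \texttt{categorical\_stats} and the dictionary
keys are hypothetical. It assumes a non-empty sample of positive integer category~IDs.

```python
from collections import Counter


def categorical_stats(category_ids):
    """Return the number of categories, the reported mode, and the number of modes."""
    counts = Counter(category_ids)          # occurrences per category ID
    max_count = max(counts.values())        # the largest frequency count
    modes = sorted(c for c, k in counts.items() if k == max_count)
    return {
        # The maximum ID that occurs; smaller IDs with zero occurrences still count.
        "num_categories": max(category_ids),
        # Only the smallest of the modes is reported.
        "mode": modes[0],
        "num_modes": len(modes),
    }
```

For the sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$ used above, this sketch
returns 8~categories, mode~3, and 2~modes, matching the examples in the text.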
\smallskip
\noindent{\bf Returns}
\smallskip
The output matrix containing all computed statistics has $17$~rows and
as many columns as the input matrix~\texttt{X}. Each row corresponds to
a particular statistic, according to the convention specified in
Table~\ref{table:univars}. The first $14$~statistics apply to
\emph{scale} columns, and the last $3$~statistics apply to categorical,
i.e.\ nominal and ordinal, columns.
\pagebreak[2]
\smallskip
\noindent{\bf Examples}
\smallskip
{\hangindent=\parindent\noindent\tt
\hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
TYPES=/user/biadmin/types.mtx
STATS=/user/biadmin/stats.mtx
}