
 \begin{comment}

  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.

 \end{comment}

 \subsection{Univariate Statistics}

 \noindent{\bf Description}
 \smallskip

 \emph{Univariate statistics} are the simplest form of descriptive statistics in data
 analysis.  They are used to quantitatively describe the main characteristics of each
 feature in the data.  For a given dataset matrix, script \UnivarScriptName{} computes
 certain univariate statistics for each feature column in the
 matrix.  The feature type governs the exact set of statistics computed for that feature.
 For example, the \emph{mean} can be computed only for a quantitative (scale)
 feature such as `Height' or `Temperature'; it is not meaningful for a categorical
 feature such as `Hair Color'.


 \smallskip
 \noindent{\bf Usage}
 \smallskip

 {\hangindent=\parindent\noindent\it%\tolerance=0
 {\tt{}-f } \UnivarScriptName{}
 {\tt{} -nvargs}
 {\tt{} X=}path/file
 {\tt{} TYPES=}path/file
 {\tt{} STATS=}path/file
 % {\tt{} fmt=}format

 }


 \medskip
 \pagebreak[2]
 \noindent{\bf Arguments}
 \begin{Description}
 \item[{\tt X}:]
 Location (on HDFS) to read the data matrix $X$ whose columns we want to
 analyze as the features.
 \item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "})
 Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$
 column-cell contains the type of the $i^{\textrm{th}}$ feature column
 \texttt{X[,$\,i$]} in the data matrix.  Feature types must be encoded by
 integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
 % The default value means ``treat all $X$-columns as scale.''
 \item[{\tt STATS}:]
 Location (on HDFS) where the output matrix of computed statistics
 will be stored.  The format of the output matrix is defined by
 Table~\ref{table:univars}.
 % \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
 % Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
 % see read/write functions in SystemML Language Reference for details.
 \end{Description}

 \begin{table}[t]\hfil
 \begin{tabular}{|rl|c|c|}
 \hline
 \multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\
                             &                            & Scale & Categ.\\
 \hline
 \OutputRowIDMinimum         & Minimum                    &   +   &       \\
 \OutputRowIDMaximum         & Maximum                    &   +   &       \\
 \OutputRowIDRange           & Range                      &   +   &       \\
 \OutputRowIDMean            & Mean                       &   +   &       \\
 \OutputRowIDVariance        & Variance                   &   +   &       \\
 \OutputRowIDStDeviation     & Standard deviation         &   +   &       \\
 \OutputRowIDStErrorMean     & Standard error of mean     &   +   &       \\
 \OutputRowIDCoeffVar        & Coefficient of variation   &   +   &       \\
 \OutputRowIDSkewness        & Skewness                   &   +   &       \\
 \OutputRowIDKurtosis        & Kurtosis                   &   +   &       \\
 \OutputRowIDStErrorSkewness & Standard error of skewness &   +   &       \\
 \OutputRowIDStErrorCurtosis & Standard error of kurtosis &   +   &       \\
 \OutputRowIDMedian          & Median                     &   +   &       \\
 \OutputRowIDIQMean          & Interquartile mean         &   +   &       \\
 \OutputRowIDNumCategories   & Number of categories       &       &   +   \\
 \OutputRowIDMode            & Mode                       &       &   +   \\
 \OutputRowIDNumModes        & Number of modes            &       &   +   \\
 \hline
 \end{tabular}\hfil
 \caption{The output matrix of \UnivarScriptName{} has one row per
 univariate statistic and one column per input feature.  This table lists
 the meaning of each row.  A ``+'' sign marks applicability to scale and/or
 to categorical features.}
 \label{table:univars}
 \end{table}


 \pagebreak[1]

 \smallskip
 \noindent{\bf Details}
 \smallskip

 Given an input matrix \texttt{X}, this script computes the set of all
 relevant univariate statistics for each feature column \texttt{X[,$\,i$]}
 in~\texttt{X}.  The list of statistics to be computed depends on the
 \emph{type}, or \emph{measurement level}, of each column.
 The \texttt{TYPES} command-line argument points to a single-row matrix containing
 the types of all columns.  The types must be encoded using the following
 convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
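
 As a hypothetical illustration (the feature names are assumptions made only for this
 example), suppose \texttt{X} has three columns holding `Height', `Hair Color', and an
 ordinal satisfaction rating.  The single-row matrix referenced by \texttt{TYPES}
 would then be
 \begin{equation*}
 \left(\begin{array}{ccc} 1 & 2 & 3 \end{array}\right)
 \end{equation*}
 so that the first column is treated as scale, the second as nominal, and the
 third as ordinal.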

 Below we list all univariate statistics computed by script \UnivarScriptName.
 The statistics are collected by relevance into several groups, namely: central
 tendency, dispersion, shape, and categorical measures.  The first three groups
 contain statistics computed for a quantitative (also known as: numerical, scale,
 or continuous) feature; the last group contains the statistics for a categorical
 (either nominal or ordinal) feature.

 Let~$n$ be the number of data records (rows) with feature values.
 In what follows we fix a column index \texttt{idx} and consider
 sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}.
 Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}
 in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$.
 Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order,
 preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.

 \paragraph{Central tendency measures.}
 Sample statistics that describe the location of a quantitative (scale) feature
 distribution by representing it with a single value.
 \begin{Description}
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Mean]
 \OutputRowText{\OutputRowIDMean}
 The arithmetic average over a sample of a quantitative feature.
 Computed as the ratio between the sum of values and the number of values:
 $\left(\sum_{i=1}^n v_i\right)\!/n$.
 Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 equals~5.2.

 Note that the mean is significantly affected by extreme values in the sample
 and may be misleading as a central tendency measure if the feature varies on
 exponential scale.  For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$
 is 22.222, greater than all the sample values except the~largest.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 \begin{figure}[t]
 \setlength{\unitlength}{10pt}
 \begin{picture}(33,12)
 \put( 6.2, 0.0){\small 2.2}
 \put(10.2, 0.0){\small 3.2}
 \put(12.2, 0.0){\small 3.7}
 \put(15.0, 0.0){\small 4.4}
 \put(18.6, 0.0){\small 5.3}
 \put(20.2, 0.0){\small 5.7}
 \put(21.75,0.0){\small 6.1}
 \put(23.05,0.0){\small 6.4}
 \put(26.2, 0.0){\small 7.2}
 \put(28.6, 0.0){\small 7.8}
 \put( 0.5, 0.7){\small 0.0}
 \put( 0.1, 3.2){\small 0.25}
 \put( 0.5, 5.7){\small 0.5}
 \put( 0.1, 8.2){\small 0.75}
 \put( 0.5,10.7){\small 1.0}
 \linethickness{1.5pt}
 \put( 2.0, 1.0){\line(1,0){4.8}}
 \put( 6.8, 1.0){\line(0,1){1.0}}
 \put( 6.8, 2.0){\line(1,0){4.0}}
 \put(10.8, 2.0){\line(0,1){1.0}}
 \put(10.8, 3.0){\line(1,0){2.0}}
 \put(12.8, 3.0){\line(0,1){1.0}}
 \put(12.8, 4.0){\line(1,0){2.8}}
 \put(15.6, 4.0){\line(0,1){1.0}}
 \put(15.6, 5.0){\line(1,0){3.6}}
 \put(19.2, 5.0){\line(0,1){1.0}}
 \put(19.2, 6.0){\line(1,0){1.6}}
 \put(20.8, 6.0){\line(0,1){1.0}}
 \put(20.8, 7.0){\line(1,0){1.6}}
 \put(22.4, 7.0){\line(0,1){1.0}}
 \put(22.4, 8.0){\line(1,0){1.2}}
 \put(23.6, 8.0){\line(0,1){1.0}}
 \put(23.6, 9.0){\line(1,0){3.2}}
 \put(26.8, 9.0){\line(0,1){1.0}}
 \put(26.8,10.0){\line(1,0){2.4}}
 \put(29.2,10.0){\line(0,1){1.0}}
 \put(29.2,11.0){\line(1,0){4.8}}
 \linethickness{0.3pt}
 \put( 6.8, 1.0){\circle*{0.3}}
 \put(10.8, 1.0){\circle*{0.3}}
 \put(12.8, 1.0){\circle*{0.3}}
 \put(15.6, 1.0){\circle*{0.3}}
 \put(19.2, 1.0){\circle*{0.3}}
 \put(20.8, 1.0){\circle*{0.3}}
 \put(22.4, 1.0){\circle*{0.3}}
 \put(23.6, 1.0){\circle*{0.3}}
 \put(26.8, 1.0){\circle*{0.3}}
 \put(29.2, 1.0){\circle*{0.3}}
 \put( 6.8, 1.0){\vector(1,0){27.2}}
 \put( 2.0, 1.0){\vector(0,1){10.8}}
 \put( 2.0, 3.5){\line(1,0){10.8}}
 \put( 2.0, 6.0){\line(1,0){17.2}}
 \put( 2.0, 8.5){\line(1,0){21.6}}
 \put( 2.0,11.0){\line(1,0){27.2}}
 \put(12.8, 1.0){\line(0,1){2.0}}
 \put(19.2, 1.0){\line(0,1){5.0}}
 \put(20.0, 1.0){\line(0,1){5.0}}
 \put(23.6, 1.0){\line(0,1){7.0}}
 \put( 9.0, 4.0){\line(1,0){3.8}}
 \put( 9.2, 2.7){\vector(0,1){0.8}}
 \put( 9.2, 4.8){\vector(0,-1){0.8}}
 \put(19.4, 8.0){\line(1,0){3.0}}
 \put(19.6, 7.2){\vector(0,1){0.8}}
 \put(19.6, 9.3){\vector(0,-1){0.8}}
 \put(13.0, 2.2){\small $q_{25\%}$}
 \put(17.3, 2.2){\small $q_{50\%}$}
 \put(23.8, 2.2){\small $q_{75\%}$}
 \put(20.15,3.5){\small $\mu$}
 \put( 8.0, 3.75){\small $\phi_1$}
 \put(18.35,7.8){\small $\phi_2$}
 \end{picture}
 \caption{The computation of quartiles, median, and interquartile mean from the
 empirical distribution function of the 10-point
 sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$.  Each vertical step in
 the graph has height~$1{/}n = 0.1$.  Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote
 the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles, respectively;
 value~$\mu$ denotes the median.  Values $\phi_1$ and $\phi_2$ show the partial contributions
 of the border points (quartiles) $v_3=3.7$ and $v_8=6.4$ to the interquartile mean.}
 \label{fig:example_quartiles}
 \end{figure}

 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Median]
 \OutputRowText{\OutputRowIDMedian}
 The ``middle'' value that separates the higher half of the sample values
 (in a sorted order) from the lower half.
 To compute the median, we sort the sample in increasing order, preserving
 duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
 If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$,
 same as the $50^{\textrm{th}}$~percentile of the sample.
 If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$,
 so we compute the median as the mean of these two values.
 (For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$,
 not as the median.)  Example: the median of sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}.

 Unlike the mean, the median is not sensitive to extreme values in the sample,
 i.e.\ it is robust to outliers.  It works better as a measure of central tendency
 for heavy-tailed distributions and features that vary on exponential scale.
 However, the median is sensitive to small sample size.
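
 As a small worked illustration of this robustness, consider again the five-point sample
 $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$ from the \emph{Mean} entry.  Here $n = 5$ is odd,
 so the median is $v^s_3 = 1.0$, while the mean is 22.222.  If the largest value
 is replaced by 1000.0 (a hypothetical outlier introduced only for this example), then
 \begin{equation*}
 \textrm{mean} = \frac{0.01 + 0.1 + 1.0 + 10.0 + 1000.0}{5} = 202.222,
 \qquad \textrm{median} = v^s_3 = 1.0
 \end{equation*}
 i.e.\ the mean grows roughly ninefold while the median does not change.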
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Interquartile mean]
 \OutputRowText{\OutputRowIDIQMean}
 For a sample of a quantitative feature, this is
 the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile
 and less than or equal to the $3^{\textrm{rd}}$ quartile.
 In other words, it is a ``truncated mean'' where the lowest 25$\%$ and
 the highest 25$\%$ of the sorted values are omitted in its computation.
 The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
 quartiles themselves, contribute to this mean only partially.
 This measure is occasionally used as the ``robust'' version of the mean
 that is less sensitive to the extreme values.

 To compute the measure, we sort the sample in increasing order,
 preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
 We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index
 and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index,
 then compute the following weighted mean:
 \begin{equation*}
 \frac{1}{3{/}4 - 1{/}4} \left[
 \left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+
 \sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i
 \,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right]
 \end{equation*}
 In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
 quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the
 two quartiles themselves enter the sum with reduced weights.  The weights are proportional
 to the vertical steps in the empirical distribution function of the sample, see
 Figure~\ref{fig:example_quartiles} for an illustration.
 Example: the interquartile mean of sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum
 $0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$,
 which equals~5.31.
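
 The arithmetic behind this example can be traced step by step; the following is
 only a restatement of the formula above for $n = 10$, not additional script output.
 The quartile indices are $j = \lceil 10{/}4 \rceil = 3$ and $k = \lceil 30{/}4 \rceil = 8$,
 so the border values are $v^s_3 = 3.7$ and $v^s_8 = 6.4$, with weights
 \begin{equation*}
 \frac{1}{1{/}2}\left(\frac{3}{10} - \frac{1}{4}\right) = 0.1, \qquad
 \frac{1}{1{/}2}\left(\frac{3}{4} - \frac{7}{10}\right) = 0.1
 \end{equation*}
 while each interior value $v^s_4, \ldots, v^s_7$ has weight $2{/}n = 0.2$.
 The weights sum to $0.1 + 0.1 + 4 \cdot 0.2 = 1$, and
 \begin{equation*}
 0.1\,(3.7 + 6.4) + 0.2\,(4.4 + 5.3 + 5.7 + 6.1) = 1.01 + 4.30 = 5.31
 \end{equation*}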
 \end{Description}


 \paragraph{Dispersion measures.}
 Statistics that describe the amount of variation or spread in a quantitative
 (scale) data feature.
 \begin{Description}
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Variance]
 \OutputRowText{\OutputRowIDVariance}
 A measure of dispersion, or spread, of sample values around their mean,
 expressed in units that are the square of those of the feature itself.
 Computed as the sum of squared differences between the values
 in the sample and their mean, divided by one less than the number of
 values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where
 $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
 Example: the variance of sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
 Note that at least two values ($n\geq 2$) are required to avoid division
 by zero.  Sample variance is sensitive to outliers, even more than the mean.
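
 For reference, the intermediate arithmetic of this example can be written out as
 follows; this is simply a worked check of the formula above.  With $\bar{v} = 5.2$,
 the deviations $v_i - \bar{v}$ are
 $-3.0$, $-2.0$, $-1.5$, $-0.8$, $0.1$, $0.5$, $0.9$, $1.2$, $2.0$, $2.6$, so
 \begin{equation*}
 \begin{aligned}
 \sum_{i=1}^{10} (v_i - \bar{v})^2
 &= 9 + 4 + 2.25 + 0.64 + 0.01 + 0.25 + 0.81 + 1.44 + 4 + 6.76 = 29.16, \\
 \frac{29.16}{10 - 1} &= 3.24
 \end{aligned}
 \end{equation*}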
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Standard deviation]
 \OutputRowText{\OutputRowIDStDeviation}
 A measure of dispersion around the mean, expressed in the same units as
 the feature itself.  Computed by taking the square root of the sample variance;
 see \emph{Variance} above on computing the variance.
 Example: the standard deviation of sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
 At least two values are required to avoid division by zero.
 Note that standard deviation is sensitive to outliers.

 Standard deviation is used in conjunction with the mean to determine
 an interval containing a given percentage of the feature values,
 assuming the normal distribution.  In a large sample from a normal
 distribution, around 68\% of the cases fall within one standard
 deviation and around 95\% of cases fall within two standard deviations
 of the mean.  For example, if the mean age is 45 with a standard deviation
 of 10, around 95\% of the cases would be between 25 and 65 in a normal
 distribution.
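
 As a rough illustration only, the same intervals can be formed for the running 10-point
 sample with mean~5.2 and standard deviation~1.8, although this sample is neither large
 nor known to be normal:
 \begin{equation*}
 5.2 \pm 1.8 = [3.4,\, 7.0], \qquad 5.2 \pm 2 \cdot 1.8 = [1.6,\, 8.8]
 \end{equation*}
 The first interval contains 6 of the 10 values (3.7, 4.4, 5.3, 5.7, 6.1, 6.4) and the
 second contains all~10, so the observed fractions (60\% and 100\%) only loosely track
 the nominal 68\% and 95\% in a sample this small.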
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Coefficient of variation]
 \OutputRowText{\OutputRowIDCoeffVar}
 The ratio of the standard deviation to the mean, i.e.\ the
 \emph{relative} standard deviation, of a quantitative feature sample.
 Computed by dividing the sample \emph{standard deviation} by the
 sample \emph{mean}, see above for their computation details.
 Example: the coefficient of variation for sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 equals 1.8$\,{/}\,$5.2~${\approx}$~0.346.

 This metric is used primarily with non-negative features such as
 financial or population data.  It is sensitive to outliers.
 Note: zero mean causes division by zero, returning infinity or \texttt{NaN}.
 At least two values (records) are required to compute the standard deviation.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Minimum]
 \OutputRowText{\OutputRowIDMinimum}
 The smallest value of a quantitative sample, computed as $\min v = v^s_1$.
 Example: the minimum of sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 equals~2.2.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Maximum]
 \OutputRowText{\OutputRowIDMaximum}
 The largest value of a quantitative sample, computed as $\max v = v^s_n$.
 Example: the maximum of sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 equals~7.8.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Range]
 \OutputRowText{\OutputRowIDRange}
 The difference between the largest and the smallest value of a quantitative
 sample, computed as $\max v - \min v = v^s_n - v^s_1$.
 It provides information about the overall spread of the sample values.
 Example: the range of sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 equals 7.8$\,{-}\,$2.2~${=}$~5.6.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Standard error of the mean]
 \OutputRowText{\OutputRowIDStErrorMean}
 A measure of how much the value of the sample mean may vary from sample
 to sample taken from the same (hypothesized) distribution of the feature.
 It helps to roughly bound the distribution mean, i.e.\
 the limit of the sample mean as the sample size tends to infinity.
 Under certain assumptions (e.g.\ normality and large sample), the difference
 between the distribution mean and the sample mean is unlikely to exceed
 2~standard errors.

 The measure is computed by dividing the sample standard deviation
 by the square root of the number of values~$n$; see \emph{standard deviation}
 for its computation details.  Ensure $n\,{\geq}\,2$ to avoid division by~0.
 Example: for sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 with the mean of~5.2 the standard error of the mean
 equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569.

 Note that the standard error itself is subject to sample randomness.
 Its accuracy as an error estimator may be low if the sample is small
 or \mbox{non-i.i.d.}, if there are outliers, or if the distribution has
 heavy tails.
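
 To make the ``2~standard errors'' guideline concrete for the example above (a sketch
 that takes the stated assumptions for granted, which a 10-point sample does not guarantee):
 \begin{equation*}
 5.2 \pm 2 \cdot 0.569 \approx [4.06,\, 6.34]
 \end{equation*}
 Under those assumptions, the distribution mean would be unlikely to fall outside this interval.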
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 % \item[\it Quartiles]
 % \OutputRowText{\OutputRowIDQuartiles}
 % %%% dsDefn %%%%
 % The values of a quantitative feature
 % that divide an ordered/sorted set of data records into four equal-size groups.
 % The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits
 % the sorted data into the lowest $25\%$ and the highest~$75\%$.  In other words,
 % it is the middle value between the minimum and the median.  The $2^{\textrm{nd}}$
 % quartile is the median itself, the value that separates the higher half of
 % the data (in the sorted order) from the lower half.  Finally, the $3^{\textrm{rd}}$
 % quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into
 % lowest $75\%$ and highest~$25\%$.\par
 % %%% dsComp %%%%
 % To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values
 % we sort it in the increasing order, preserving duplicates, then return
 % \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]}
 % where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$.
 % When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted
 % to equal the mean of two middle values
 % $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and
 % $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$.
 % %%% dsWarn %%%%
 % We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values.
 % %%% dsExmpl %%%
 % \textbf{Example(s).}
 \end{Description}


 \paragraph{Shape measures.}
 Statistics that describe the shape and symmetry of the quantitative (scale)
 feature distribution estimated from a sample of its values.
 \begin{Description}
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Skewness]
 \OutputRowText{\OutputRowIDSkewness}
 It measures how symmetrically the values of a feature are spread out
 around the mean.  A significant positive skewness implies a longer (or fatter)
 right tail, i.e.\ feature values tend to lie farther away from the mean on the
 right side.  A significant negative skewness implies a longer (or fatter) left
 tail.  The normal distribution is symmetric and has a skewness value of~0;
 however, its sample skewness is likely to be nonzero, just close to zero.
 As a guideline, a skewness value more than twice its standard error is taken
 to indicate a departure from symmetry.

 Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
 of the standard deviation.  We estimate the $3^{\textrm{rd}}$~central moment as
 the sum of cubed differences between the values in the feature column and their
 sample mean, divided by the number of values:
 $\sum_{i=1}^n (v_i - \bar{v})^3 / n$
 where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
 The standard deviation is computed
 as described above in \emph{standard deviation}.  To avoid division by~0,
 at least two different sample values are required.  Example: for sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 with the mean of~5.2 and the standard deviation of~1.8
 skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
 Note: skewness is sensitive to outliers.
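
 The value $-1.0728$ used in this example can be checked by hand; the following merely
 expands the $3^{\textrm{rd}}$~central moment estimate for the sample.  With deviations
 $-3.0$, $-2.0$, $-1.5$, $-0.8$, $0.1$, $0.5$, $0.9$, $1.2$, $2.0$, $2.6$ from the mean,
 \begin{equation*}
 \begin{aligned}
 \sum_{i=1}^{10} (v_i - \bar{v})^3
 &= -27 - 8 - 3.375 - 0.512 + 0.001 + 0.125 + 0.729 + 1.728 + 8 + 17.576 \\
 &= -10.728, \\
 \frac{-10.728}{10} &= -1.0728, \qquad \frac{-1.0728}{1.8^3} = \frac{-1.0728}{5.832} \approx -0.184
 \end{aligned}
 \end{equation*}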
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Standard error of skewness]
 \OutputRowText{\OutputRowIDStErrorSkewness}
 A measure of how much the sample skewness may vary from sample to sample,
 assuming that the feature is normally distributed, which makes its
 distribution skewness equal~0.
 Given the number~$n$ of sample values, the standard error is computed as
 \begin{equation*}
 \sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
 \end{equation*}
 This measure can tell us, for example:
 \begin{Itemize}
 \item If the sample skewness lands within two standard errors from~0, its
 sign is not statistically significant and may be accidental.
 \item If the sample skewness lands outside this interval, the feature
 is unlikely to be normally distributed.
 \end{Itemize}
 At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
 Note that the standard error is inaccurate if the feature distribution is
 far from normal or if the number of samples is small.
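
 As a worked illustration for the running 10-point sample (bearing in mind the caveat
 above that $n = 10$ is small), substituting $n = 10$ into the formula gives
 \begin{equation*}
 \sqrt{\frac{6 \cdot 10 \cdot 9}{8 \cdot 11 \cdot 13}} = \sqrt{\frac{540}{1144}} \approx 0.687
 \end{equation*}
 The sample skewness of about $-0.184$ from the \emph{Skewness} entry lies well within
 two standard errors ($2 \cdot 0.687 \approx 1.374$) of~0, so by the first rule above
 its negative sign is not significant.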
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Kurtosis]
 \OutputRowText{\OutputRowIDKurtosis}
 As a distribution parameter, kurtosis is a measure of the extent to which
 feature values cluster around a central point.  In other words, it quantifies
 ``peakedness'' of the distribution: how tall and sharp the central peak is
 relative to a standard bell curve.

 Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
 to a normal distribution:
 \begin{Itemize}
 \item observations cluster more about the center (peak-shaped),
 \item the tails are thinner at non-extreme values,
 \item the tails are thicker at extreme values.
 \end{Itemize}
 Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
 to a normal distribution:
 \begin{Itemize}
 \item observations cluster less about the center (box-shaped),
 \item the tails are thicker at non-extreme values,
 \item the tails are thinner at extreme values.
 \end{Itemize}
 Kurtosis of a normal distribution is zero; however, the sample kurtosis
 (computed here) is likely to deviate from zero.

 Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
 by the $4^{\textrm{th}}$~power of the standard deviation, minus~3.
 We estimate the $4^{\textrm{th}}$~central moment as the sum of the
 $4^{\textrm{th}}$~powers of differences between the values in the feature column
 and their sample mean, divided by the number of values:
 $\sum_{i=1}^n (v_i - \bar{v})^4 / n$
 where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
 The standard deviation is computed as described above, see \emph{standard deviation}.

 Note that kurtosis is sensitive to outliers, and requires at least two different
 sample values.  Example: for sample
 $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
 with the mean of~5.2 and the standard deviation of~1.8,
 sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.
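
 The value $16.6962$ used in this example can be checked by hand; the following merely
 expands the $4^{\textrm{th}}$~central moment estimate for the sample.  With deviations
 $-3.0$, $-2.0$, $-1.5$, $-0.8$, $0.1$, $0.5$, $0.9$, $1.2$, $2.0$, $2.6$ from the mean,
 \begin{equation*}
 \begin{aligned}
 \sum_{i=1}^{10} (v_i - \bar{v})^4
 &= 81 + 16 + 5.0625 + 0.4096 + 0.0001 + 0.0625 + 0.6561 + 2.0736 + 16 + 45.6976 \\
 &= 166.962, \\
 \frac{166.962}{10} &= 16.6962, \qquad
 \frac{16.6962}{1.8^4} - 3 = \frac{16.6962}{10.4976} - 3 \approx -1.41
 \end{aligned}
 \end{equation*}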
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Standard error of kurtosis]
 \OutputRowText{\OutputRowIDStErrorCurtosis}
 A measure of how much the sample kurtosis may vary from sample to sample,
 assuming that the feature is normally distributed, which makes its
 distribution kurtosis equal~0.
 Given the number~$n$ of sample values, the standard error is computed as
 \begin{equation*}
 \sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
 \end{equation*}
 This measure can tell us, for example:
 \begin{Itemize}
 \item If the sample kurtosis lands within two standard errors from~0, its
 sign is not statistically significant and may be accidental.
 \item If the sample kurtosis lands outside this interval, the feature
 is unlikely to be normally distributed.
 \end{Itemize}
 At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
 Note that the standard error is inaccurate if the feature distribution is
 far from normal or if the number of samples is small.
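
 As a worked illustration for the running 10-point sample (bearing in mind the caveat
 above that $n = 10$ is small), substituting $n = 10$ into the formula gives
 \begin{equation*}
 \sqrt{\frac{24 \cdot 10 \cdot 9^2}{7 \cdot 8 \cdot 13 \cdot 15}}
 = \sqrt{\frac{19440}{10920}} \approx 1.334
 \end{equation*}
 The sample kurtosis of about $-1.41$ from the \emph{Kurtosis} entry lies within
 two standard errors ($2 \cdot 1.334 \approx 2.668$) of~0, so by the first rule above
 its negative sign is not significant.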
 \end{Description}


 \paragraph{Categorical measures.}  Statistics that describe the sample of
 a categorical feature, either nominal or ordinal.  We represent all
 categories by integers from~1 to the number of categories; we call
 these integers \emph{category~IDs}.
 \begin{Description}
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Number of categories]
 \OutputRowText{\OutputRowIDNumCategories}
 The maximum category~ID that occurs in the sample.  Note that some
 categories with~IDs \emph{smaller} than this maximum~ID may have
 no~occurrences in the sample, without reducing the number of categories.
 However, any categories with~IDs \emph{larger} than the maximum~ID with
 no occurrences in the sample will not be counted.
 Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
 the number of categories is reported as~8.  Category~IDs 2 and~6, which have
 zero occurrences, are still counted; but if there is a category with
 ID${}=9$ and zero occurrences, it is not counted.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Mode]
 \OutputRowText{\OutputRowIDMode}
 The most frequently occurring category value.
 If several values share the greatest frequency of occurrence, then each
 of them is a mode; but here we report only the smallest of these modes.
 Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
 the modes are 3 and~7, with 3 reported.

 Computed by counting the number of occurrences for each category,
 then taking the smallest category~ID that has the maximum count.
 Note that the sample modes may be different from the distribution modes,
 i.e.\ the categories whose (hypothesized) underlying probability is the
 maximum over all categories.
 %%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
 \item[\it Number of modes]
 \OutputRowText{\OutputRowIDNumModes}
 The number of category values that each have the largest frequency
 count in the sample.
 Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
 there are two category IDs (3 and~7) that each occur 4~times, the maximum count;
 hence, we return~2.

 Computed by counting the number of occurrences for each category,
 then counting how many categories have the maximum count.
 Note that the sample modes may be different from the distribution modes,
 i.e.\ the categories whose (hypothesized) underlying probability is the
 maximum over all categories.
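
 All three categorical statistics for the example sample can be read off a frequency
 table.  The table below is a hand-built illustration; the script's internal
 representation may differ.
 \begin{center}
 \begin{tabular}{|l|c|c|c|c|c|c|c|c|}
 \hline
 Category~ID & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\
 \hline
 Count       & 1 & 0 & 4 & 2 & 1 & 0 & 4 & 3 \\
 \hline
 \end{tabular}
 \end{center}
 The largest ID with a nonzero count is~8 (number of categories), the maximum count~4
 is attained by IDs 3 and~7 (number of modes${}=2$), and the smaller of these,~3,
 is reported as the mode.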
 \end{Description}


 \smallskip
 \noindent{\bf Returns}
 \smallskip

 The output matrix containing all computed statistics has $17$~rows and
 as many columns as the input matrix~\texttt{X}.  Each row corresponds to
 a particular statistic, according to the convention specified in
 Table~\ref{table:univars}.  The first $14$~statistics are applicable for
 \emph{scale} columns, and the last $3$~statistics are applicable for categorical,
 i.e.\ nominal and ordinal, columns.


 \pagebreak[2]

 \smallskip
 \noindent{\bf Examples}
 \smallskip

 {\hangindent=\parindent\noindent\tt
 \hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
   TYPES=/user/biadmin/types.mtx
   STATS=/user/biadmin/stats.mtx

 }