\begin{comment}

 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements. See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership. The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied. See the License for the
 specific language governing permissions and limitations
 under the License.

\end{comment}

\subsection{Univariate Statistics}

\noindent{\bf Description}
\smallskip

\emph{Univariate statistics} are the simplest form of descriptive statistics in data analysis.
They are used to quantitatively describe the main characteristics of each feature in the data.
For a given dataset matrix, script \UnivarScriptName{} computes certain univariate statistics
for each feature column in the matrix. The feature type governs the exact set of statistics
computed for that feature. For example, the statistic \emph{mean} can only be computed on a
quantitative (scale) feature like ``Height'' and ``Temperature''. It does not make sense to
compute the mean of a categorical attribute like ``Hair Color''.

\smallskip
\noindent{\bf Usage}
\smallskip

{\hangindent=\parindent\noindent\it%\tolerance=0
{\tt{}-f } \UnivarScriptName{}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} TYPES=}path/file
{\tt{} STATS=}path/file
% {\tt{} fmt=}format

}

\medskip
\pagebreak[2]
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the data matrix $X$ whose columns we want to
analyze as the features.
\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "})
Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$
column-cell contains the type of the $i^{\textrm{th}}$ feature column
\texttt{X[,$\,i$]} in the data matrix. Feature types must be encoded by
integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
% The default value means ``treat all $X$-columns as scale.''
\item[{\tt STATS}:]
Location (on HDFS) where the output matrix of computed statistics
will be stored. The format of the output matrix is defined by
Table~\ref{table:univars}.
% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
% see read/write functions in SystemML Language Reference for details.
\end{Description}

\begin{table}[t]\hfil
\begin{tabular}{|rl|c|c|}
\hline
\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\
                            &                             & Scale & Categ.\\
\hline
\OutputRowIDMinimum         & Minimum                     & + &   \\
\OutputRowIDMaximum         & Maximum                     & + &   \\
\OutputRowIDRange           & Range                       & + &   \\
\OutputRowIDMean            & Mean                        & + &   \\
\OutputRowIDVariance        & Variance                    & + &   \\
\OutputRowIDStDeviation     & Standard deviation          & + &   \\
\OutputRowIDStErrorMean     & Standard error of mean      & + &   \\
\OutputRowIDCoeffVar        & Coefficient of variation    & + &   \\
\OutputRowIDSkewness        & Skewness                    & + &   \\
\OutputRowIDKurtosis        & Kurtosis                    & + &   \\
\OutputRowIDStErrorSkewness & Standard error of skewness  & + &   \\
\OutputRowIDStErrorCurtosis & Standard error of kurtosis  & + &   \\
\OutputRowIDMedian          & Median                      & + &   \\
\OutputRowIDIQMean          & Inter quartile mean         & + &   \\
\OutputRowIDNumCategories   & Number of categories        &   & + \\
\OutputRowIDMode            & Mode                        &   & + \\
\OutputRowIDNumModes        & Number of modes             &   & + \\
\hline
\end{tabular}\hfil
\caption{The output matrix of \UnivarScriptName{} has one row per univariate
statistic and one column per input feature. This table lists the meaning of each row.
Signs ``+'' show applicability to scale and/or to categorical features.}
\label{table:univars}
\end{table}

\pagebreak[1]

\smallskip
\noindent{\bf Details}
\smallskip

Given an input matrix \texttt{X}, this script computes the set of all relevant
univariate statistics for each feature column \texttt{X[,$\,i$]} in~\texttt{X}.
The list of statistics to be computed depends on the \emph{type}, or
\emph{measurement level}, of each column.
The \texttt{TYPES} command-line argument points to a vector containing the types
of all columns. The types must be provided as per the following convention:
$1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.

Below we list all univariate statistics computed by script \UnivarScriptName.
The statistics are grouped by relevance into four groups: central tendency,
dispersion, shape, and categorical measures. The first three groups contain
statistics computed for a quantitative (also known as: numerical, scale, or
continuous) feature; the last group contains the statistics for a categorical
(either nominal or ordinal) feature.

Let~$n$ be the number of data records (rows) with feature values.
In what follows we fix a column index \texttt{idx} and consider sample statistics
of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}.
Let $v = (v_1, v_2, \ldots, v_n)$ be the values of
\texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]} in their original unsorted order:
$v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$.
Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in sorted order,
preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.

\paragraph{Central tendency measures.}
Sample statistics that describe the location of the quantitative (scale) feature
distribution and represent it with a single value.
\begin{Description}
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Mean]
\OutputRowText{\OutputRowIDMean}
The arithmetic average over a sample of a quantitative feature.
Computed as the ratio between the sum of values and the number of values:
$\left(\sum_{i=1}^n v_i\right)\!/n$.
Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals~5.2.

Note that the mean is significantly affected by extreme values in the sample
and may be misleading as a central tendency measure if the feature varies on
an exponential scale. For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$
is 22.222, greater than all the sample values except the~largest.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{figure}[t]
\setlength{\unitlength}{10pt}
\begin{picture}(33,12)
\put( 6.2, 0.0){\small 2.2}
\put(10.2, 0.0){\small 3.2}
\put(12.2, 0.0){\small 3.7}
\put(15.0, 0.0){\small 4.4}
\put(18.6, 0.0){\small 5.3}
\put(20.2, 0.0){\small 5.7}
\put(21.75,0.0){\small 6.1}
\put(23.05,0.0){\small 6.4}
\put(26.2, 0.0){\small 7.2}
\put(28.6, 0.0){\small 7.8}
\put( 0.5, 0.7){\small 0.0}
\put( 0.1, 3.2){\small 0.25}
\put( 0.5, 5.7){\small 0.5}
\put( 0.1, 8.2){\small 0.75}
\put( 0.5,10.7){\small 1.0}
\linethickness{1.5pt}
\put( 2.0, 1.0){\line(1,0){4.8}}
\put( 6.8, 1.0){\line(0,1){1.0}}
\put( 6.8, 2.0){\line(1,0){4.0}}
\put(10.8, 2.0){\line(0,1){1.0}}
\put(10.8, 3.0){\line(1,0){2.0}}
\put(12.8, 3.0){\line(0,1){1.0}}
\put(12.8, 4.0){\line(1,0){2.8}}
\put(15.6, 4.0){\line(0,1){1.0}}
\put(15.6, 5.0){\line(1,0){3.6}}
\put(19.2, 5.0){\line(0,1){1.0}}
\put(19.2, 6.0){\line(1,0){1.6}}
\put(20.8, 6.0){\line(0,1){1.0}}
\put(20.8, 7.0){\line(1,0){1.6}}
\put(22.4, 7.0){\line(0,1){1.0}}
\put(22.4, 8.0){\line(1,0){1.2}}
\put(23.6, 8.0){\line(0,1){1.0}}
\put(23.6, 9.0){\line(1,0){3.2}}
\put(26.8, 9.0){\line(0,1){1.0}}
\put(26.8,10.0){\line(1,0){2.4}}
\put(29.2,10.0){\line(0,1){1.0}}
\put(29.2,11.0){\line(1,0){4.8}}
\linethickness{0.3pt}
\put( 6.8, 1.0){\circle*{0.3}}
\put(10.8, 1.0){\circle*{0.3}}
\put(12.8, 1.0){\circle*{0.3}}
\put(15.6, 1.0){\circle*{0.3}}
\put(19.2, 1.0){\circle*{0.3}}
\put(20.8, 1.0){\circle*{0.3}}
\put(22.4, 1.0){\circle*{0.3}}
\put(23.6, 1.0){\circle*{0.3}}
\put(26.8, 1.0){\circle*{0.3}}
\put(29.2, 1.0){\circle*{0.3}}
\put( 6.8, 1.0){\vector(1,0){27.2}}
\put( 2.0, 1.0){\vector(0,1){10.8}}
\put( 2.0, 3.5){\line(1,0){10.8}}
\put( 2.0, 6.0){\line(1,0){17.2}}
\put( 2.0, 8.5){\line(1,0){21.6}}
\put( 2.0,11.0){\line(1,0){27.2}}
\put(12.8, 1.0){\line(0,1){2.0}}
\put(19.2, 1.0){\line(0,1){5.0}}
\put(20.0, 1.0){\line(0,1){5.0}}
\put(23.6, 1.0){\line(0,1){7.0}}
\put( 9.0, 4.0){\line(1,0){3.8}}
\put( 9.2, 2.7){\vector(0,1){0.8}}
\put( 9.2, 4.8){\vector(0,-1){0.8}}
\put(19.4, 8.0){\line(1,0){3.0}}
\put(19.6, 7.2){\vector(0,1){0.8}}
\put(19.6, 9.3){\vector(0,-1){0.8}}
\put(13.0, 2.2){\small $q_{25\%}$}
\put(17.3, 2.2){\small $q_{50\%}$}
\put(23.8, 2.2){\small $q_{75\%}$}
\put(20.15,3.5){\small $\mu$}
\put( 8.0, 3.75){\small $\phi_1$}
\put(18.35,7.8){\small $\phi_2$}
\end{picture}
\caption{The computation of quartiles, median, and interquartile mean from the
empirical distribution function of the 10-point sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$.
Each vertical step in the graph has height~$1{/}n = 0.1$.
Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote the $1^{\textrm{st}}$,
$2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles, respectively;
value~$\mu$ denotes the median. Values $\phi_1$ and $\phi_2$ show the partial
contribution of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ to the
interquartile mean.}
\label{fig:example_quartiles}
\end{figure}

%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Median]
\OutputRowText{\OutputRowIDMedian}
The ``middle'' value that separates the higher half of the sample values
(in a sorted order) from the lower half.
To compute the median, we sort the sample in increasing order, preserving
duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$,
same as the $50^{\textrm{th}}$~percentile of the sample.
If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$,
so we compute the median as the mean of these two values.
(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$,
not as the median.) Example: the median of sample
$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}.

Unlike the mean, the median is not sensitive to extreme values in the sample,
i.e.\ it is robust to outliers. It works better as a measure of central tendency
for heavy-tailed distributions and features that vary on an exponential scale.
However, the median is sensitive to small sample size.
%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
\item[\it Interquartile mean]
\OutputRowText{\OutputRowIDIQMean}
For a sample of a quantitative feature, this is
the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile
and less than or equal to the $3^{\textrm{rd}}$ quartile.
In other words, it is a ``truncated mean'' where the lowest 25$\%$ and
the highest 25$\%$ of the sorted values are omitted in its computation.
The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
quartiles themselves, contribute to this mean only partially.
This measure is occasionally used as the ``robust'' version of the mean
that is less sensitive to the extreme values.

To compute the measure, we sort the sample in increasing order,
preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index
and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index,
then compute the following weighted mean:
\begin{equation*}
\frac{1}{3{/}4 - 1{/}4} \left[
\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+ \sum_{j