| \begin{comment} |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| \end{comment} |
| |
| \newcommand{\UnivarScriptName}{\texttt{Univar-Stats.dml}} |
| \newcommand{\BivarScriptName}{\texttt{bivar-stats.dml}} |
| |
| \newcommand{\OutputRowIDMinimum}{1} |
| \newcommand{\OutputRowIDMaximum}{2} |
| \newcommand{\OutputRowIDRange}{3} |
| \newcommand{\OutputRowIDMean}{4} |
| \newcommand{\OutputRowIDVariance}{5} |
| \newcommand{\OutputRowIDStDeviation}{6} |
| \newcommand{\OutputRowIDStErrorMean}{7} |
| \newcommand{\OutputRowIDCoeffVar}{8} |
| \newcommand{\OutputRowIDQuartiles}{?, 13, ?} |
| \newcommand{\OutputRowIDMedian}{13} |
| \newcommand{\OutputRowIDIQMean}{14} |
| \newcommand{\OutputRowIDSkewness}{9} |
| \newcommand{\OutputRowIDKurtosis}{10} |
| \newcommand{\OutputRowIDStErrorSkewness}{11} |
| \newcommand{\OutputRowIDStErrorCurtosis}{12} |
| \newcommand{\OutputRowIDNumCategories}{15} |
| \newcommand{\OutputRowIDMode}{16} |
| \newcommand{\OutputRowIDNumModes}{17} |
| \newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}} |
| |
| \newcommand{\NameStatR}{Pearson's correlation coefficient} |
| \newcommand{\NameStatChi}{Pearson's~$\chi^2$} |
| \newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$} |
| \newcommand{\NameStatV}{Cram\'er's~$V$} |
| \newcommand{\NameStatEta}{Eta statistic} |
| \newcommand{\NameStatF}{$F$~statistic} |
| \newcommand{\NameStatRho}{Spearman's rank correlation coefficient} |
| |
| Descriptive statistics are used to quantitatively describe the main characteristics of the data. |
| They provide meaningful summaries computed over different observations or data records |
| collected in a study. These summaries typically form the basis of the initial data exploration |
| as part of a more extensive statistical analysis. Such a quantitative analysis assumes that |
| every variable (also known as an attribute, feature, or column) in the data has a specific |
| \emph{level of measurement}~\cite{Stevens1946:scales}. |
| |
| The measurement level of a variable, often called the {\bf variable type}, can be either |
| \emph{scale} or \emph{categorical}. A \emph{scale} variable represents the data measured on |
| an interval scale or ratio scale. Examples of scale variables include `Height', `Weight', |
| `Salary', and `Temperature'. Scale variables are also referred to as \emph{quantitative} |
| or \emph{continuous} variables. In contrast, a \emph{categorical} variable has a fixed, |
| limited number of distinct values or categories. Examples of categorical variables |
| include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'. |
| Categorical variables can further be classified into two types, \emph{nominal} and |
| \emph{ordinal}, depending on whether the categories in the variable can be ordered via an |
| intrinsic ranking. For example, there is no meaningful ranking among the distinct values of |
| the `Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from |
| highly dissatisfied to highly satisfied. |
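
As a minimal illustration of the nominal versus ordinal distinction, the following Python sketch recodes one variable of each kind as positive integers from 1 to the number of categories. The helper \texttt{encode\_categories} and the sample values are hypothetical and are not part of the scripts described here. For the ordinal variable, the category order is passed explicitly so that the integer order agrees with the intrinsic ranking.

```python
def encode_categories(values, ordered_levels=None):
    """Map category labels to integer codes 1..K.

    Hypothetical helper for illustration only. For a nominal variable the
    codes follow sorted label order, which carries no meaning. For an ordinal
    variable, pass ordered_levels (lowest to highest) so that the integer
    order coincides with the category ranking.
    """
    if ordered_levels is not None:
        levels = list(ordered_levels)
    else:
        levels = sorted(set(values))
    code_of = {label: code for code, label in enumerate(levels, start=1)}
    return [code_of[v] for v in values], code_of


# Nominal: any assignment of codes is as good as any other.
hair = ["brown", "black", "red", "brown"]
hair_codes, hair_map = encode_categories(hair)

# Ordinal: the codes must respect the ranking of the categories.
satisfaction = ["satisfied", "highly dissatisfied", "neutral", "highly satisfied"]
ranking = [
    "highly dissatisfied",
    "dissatisfied",
    "neutral",
    "satisfied",
    "highly satisfied",
]
sat_codes, sat_map = encode_categories(satisfaction, ordered_levels=ranking)

print(hair_codes)
print(sat_codes)
```

With the ordinal coding, comparing two codes is the same as comparing the two satisfaction levels, whereas comparing two hair color codes has no interpretation.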
| |
| The input dataset for descriptive statistics is provided in the form of a matrix, whose |
| rows are the records (data points) and whose columns are the features (i.e.~variables). |
| Some scripts allow this matrix to be vertically split into two or three matrices. Descriptive |
| statistics are computed over the specified features (columns) in the matrix. Which |
| statistics are computed depends on the types of the features. It is important to keep |
| in mind the following caveats and restrictions: |
| \begin{Enumerate} |
| \item Given a finite set of data records, i.e.~a \emph{sample}, we take their feature |
| values and compute their \emph{sample statistics}. These statistics |
| will vary from sample to sample even if the underlying distribution of feature values |
| remains the same. Sample statistics are accurate for the given sample only. |
| If the goal is to estimate the \emph{distribution statistics} that are parameters of |
| the (hypothesized) underlying distribution of the features, the corresponding sample |
| statistics may sometimes be used as approximations, but their accuracy will vary. |
| \item In particular, the accuracy of the estimated distribution statistics will be low |
| if the number of values in the sample is small. That is, for small samples, the computed |
| statistics may depend on the randomness of the individual sample values more than on |
| the underlying distribution of the features. |
| \item The accuracy will also be low if the sample records cannot be assumed mutually |
| independent and identically distributed (i.i.d.), that is, sampled at random from the |
| same underlying distribution. In practice, feature values in one record often depend |
| on other features and other records, including unknown ones. |
| \item Most of the computed statistics will have low estimation accuracy in the presence of |
| extreme values (outliers) or if the underlying distribution has heavy tails, for example, |
| if it obeys a power law. However, a few of the computed statistics, such as the median and |
| \NameStatRho{}, are \emph{robust} to outliers. |
| \item Some sample statistics are reported with their \emph{sample standard errors} |
| in an attempt to quantify their accuracy as distribution parameter estimators. But these |
| sample standard errors, in turn, only estimate the underlying distribution's standard |
| errors and will have low accuracy for samples that are small, \mbox{non-i.i.d.}, or contain |
| outliers, and for heavy-tailed distributions. |
| \item We assume that the quantitative (scale) feature columns do not contain missing |
| values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise |
| specified. We assume that each categorical feature column contains positive integers |
| from 1 to the number of categories; for ordinal features, the natural order on |
| the integers should coincide with the order on the categories. |
| \end{Enumerate} |
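
The caveats about sample-to-sample variation, small samples, and outliers can be illustrated with a short Python sketch that uses only the standard library. The population parameters, sample sizes, random seed, and outlier value below are arbitrary assumptions chosen for illustration, and nothing in the sketch is part of the scripts described in this section.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed, chosen arbitrarily, for reproducibility


def sample_means(sample_size, num_samples=200):
    """Draw repeated i.i.d. samples from Normal(50, 10) and return their means."""
    return [
        statistics.fmean(rng.gauss(50.0, 10.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


# Sample statistics vary from sample to sample; the spread shrinks as n grows.
spread_small = statistics.stdev(sample_means(sample_size=5))
spread_large = statistics.stdev(sample_means(sample_size=500))
print(f"spread of sample means, n=5:   {spread_small:.2f}")
print(f"spread of sample means, n=500: {spread_large:.2f}")

# A single extreme value shifts the mean far more than the median.
clean = [48.0, 49.0, 50.0, 51.0, 52.0]
with_outlier = clean[:-1] + [5000.0]
mean_shift = abs(statistics.fmean(with_outlier) - statistics.fmean(clean))
median_shift = abs(statistics.median(with_outlier) - statistics.median(clean))
print(f"mean shift: {mean_shift:.1f}, median shift: {median_shift:.1f}")
```

The sample mean of five values is a much noisier estimate of the distribution mean than the sample mean of five hundred values. Replacing one value with an extreme one moves the mean by a large amount and leaves the median of this particular sample unchanged.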
| |
| \input{DescriptiveUnivarStats} |
| |
| \input{DescriptiveBivarStats} |
| |
| \input{DescriptiveStratStats} |