docs/Algorithms Reference/DescriptiveStats.tex - systemds - Git at Google

 \begin{comment}

  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.

 \end{comment}

 \newcommand{\UnivarScriptName}{\texttt{\tt Univar-Stats.dml}}
 \newcommand{\BivarScriptName}{\texttt{\tt bivar-stats.dml}}

 \newcommand{\OutputRowIDMinimum}{1}
 \newcommand{\OutputRowIDMaximum}{2}
 \newcommand{\OutputRowIDRange}{3}
 \newcommand{\OutputRowIDMean}{4}
 \newcommand{\OutputRowIDVariance}{5}
 \newcommand{\OutputRowIDStDeviation}{6}
 \newcommand{\OutputRowIDStErrorMean}{7}
 \newcommand{\OutputRowIDCoeffVar}{8}
 \newcommand{\OutputRowIDQuartiles}{?, 13, ?}
 \newcommand{\OutputRowIDMedian}{13}
 \newcommand{\OutputRowIDIQMean}{14}
 \newcommand{\OutputRowIDSkewness}{9}
 \newcommand{\OutputRowIDKurtosis}{10}
 \newcommand{\OutputRowIDStErrorSkewness}{11}
 \newcommand{\OutputRowIDStErrorCurtosis}{12}
 \newcommand{\OutputRowIDNumCategories}{15}
 \newcommand{\OutputRowIDMode}{16}
 \newcommand{\OutputRowIDNumModes}{17}
 \newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}}

 \newcommand{\NameStatR}{Pearson's correlation coefficient}
 \newcommand{\NameStatChi}{Pearson's~$\chi^2$}
 \newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$}
 \newcommand{\NameStatV}{Cram\'er's~$V$}
 \newcommand{\NameStatEta}{Eta statistic}
 \newcommand{\NameStatF}{$F$~statistic}
 \newcommand{\NameStatRho}{Spearman's rank correlation coefficient}

 Descriptive statistics are used to quantitatively describe the main characteristics of the data.
 They provide meaningful summaries computed over different observations or data records
 collected in a study.  These summaries typically form the basis of the initial data exploration
 as part of a more extensive statistical analysis.  Such a quantitative analysis assumes that
 every variable (also known as, attribute, feature, or column) in the data has a specific
 \emph{level of measurement}~\cite{Stevens1946:scales}.

 The measurement level of a variable, often called as {\bf variable type}, can either be
 \emph{scale} or \emph{categorical}.  A \emph{scale} variable represents the data measured on
 an interval scale or ratio scale.  Examples of scale variables include `Height', `Weight',
 `Salary', and `Temperature'.  Scale variables are also referred to as \emph{quantitative}
 or \emph{continuous} variables.  In contrast, a \emph{categorical} variable has a fixed
 limited number of distinct values or categories.  Examples of categorical variables
 include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'.
 Categorical variables can further be classified into two types, \emph{nominal} and
 \emph{ordinal}, depending on whether the categories in the variable can be ordered via an
 intrinsic ranking.  For example, there is no meaningful ranking among distinct values in
 `Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from
 highly dissatisfied to highly satisfied.

 The input dataset for descriptive statistics is provided in the form of a matrix, whose
 rows are the records (data points) and whose columns are the features (i.e.~variables).
 Some scripts allow this matrix to be vertically split into two or three matrices.  Descriptive
 statistics are computed over the specified features (columns) in the matrix.  Which
 statistics are computed depends on the types of the features.  It is important to keep
 in mind the following caveats and restrictions:
 \begin{Enumerate}
 \item  Given a finite set of data records, i.e.~a \emph{sample}, we take their feature
 values and compute their \emph{sample statistics}.  These statistics
 will vary from sample to sample even if the underlying distribution of feature values
 remains the same.  Sample statistics are accurate for the given sample only.
 If the goal is to estimate the \emph{distribution statistics} that are parameters of
 the (hypothesized) underlying distribution of the features, the corresponding sample
 statistics may sometimes be used as approximations, but their accuracy will vary.
 \item  In particular, the accuracy of the estimated distribution statistics will be low
 if the number of values in the sample is small.  That is, for small samples, the computed
 statistics may depend on the randomness of the individual sample values more than on
 the underlying distribution of the features.
 \item  The accuracy will also be low if the sample records cannot be assumed mutually
 independent and identically distributed (i.i.d.), that is, sampled at random from the
 same underlying distribution.  In practice, feature values in one record often depend
 on other features and other records, including unknown ones.
 \item  Most of the computed statistics will have low estimation accuracy in the presence of
 extreme values (outliers) or if the underlying distribution has heavy tails, for example
 obeys a power law.  However, a few of the computed statistics, such as the median and
 \NameStatRho{}, are \emph{robust} to outliers.
 \item  Some sample statistics are reported with their \emph{sample standard errors}
 in an attempt to quantify their accuracy as distribution parameter estimators.  But these
 sample standard errors, in turn, only estimate the underlying distribution's standard
 errors and will have low accuracy for small or \mbox{non-i.i.d.} samples, outliers in samples,
 or heavy-tailed distributions.
 \item  We assume that the quantitative (scale) feature columns do not contain missing
 values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise
 specified.  We assume that each categorical feature column contains positive integers
 from 1 to the number of categories; for ordinal features, the natural order on
 the integers should coincide with the order on the categories.
 \end{Enumerate}

 \input{DescriptiveUnivarStats}

 \input{DescriptiveBivarStats}

 \input{DescriptiveStratStats}
	\begin{comment}

	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.

	\end{comment}

	\newcommand{\UnivarScriptName}{\texttt{\tt Univar-Stats.dml}}
	\newcommand{\BivarScriptName}{\texttt{\tt bivar-stats.dml}}

	\newcommand{\OutputRowIDMinimum}{1}
	\newcommand{\OutputRowIDMaximum}{2}
	\newcommand{\OutputRowIDRange}{3}
	\newcommand{\OutputRowIDMean}{4}
	\newcommand{\OutputRowIDVariance}{5}
	\newcommand{\OutputRowIDStDeviation}{6}
	\newcommand{\OutputRowIDStErrorMean}{7}
	\newcommand{\OutputRowIDCoeffVar}{8}
	\newcommand{\OutputRowIDQuartiles}{?, 13, ?}
	\newcommand{\OutputRowIDMedian}{13}
	\newcommand{\OutputRowIDIQMean}{14}
	\newcommand{\OutputRowIDSkewness}{9}
	\newcommand{\OutputRowIDKurtosis}{10}
	\newcommand{\OutputRowIDStErrorSkewness}{11}
	\newcommand{\OutputRowIDStErrorCurtosis}{12}
	\newcommand{\OutputRowIDNumCategories}{15}
	\newcommand{\OutputRowIDMode}{16}
	\newcommand{\OutputRowIDNumModes}{17}
	\newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}}

	\newcommand{\NameStatR}{Pearson's correlation coefficient}
	\newcommand{\NameStatChi}{Pearson's~$\chi^2$}
	\newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$}
	\newcommand{\NameStatV}{Cram\'er's~$V$}
	\newcommand{\NameStatEta}{Eta statistic}
	\newcommand{\NameStatF}{$F$~statistic}
	\newcommand{\NameStatRho}{Spearman's rank correlation coefficient}

	Descriptive statistics are used to quantitatively describe the main characteristics of the data.
	They provide meaningful summaries computed over different observations or data records
	collected in a study. These summaries typically form the basis of the initial data exploration
	as part of a more extensive statistical analysis. Such a quantitative analysis assumes that
	every variable (also known as, attribute, feature, or column) in the data has a specific
	\emph{level of measurement}~\cite{Stevens1946:scales}.

	The measurement level of a variable, often called as {\bf variable type}, can either be
	\emph{scale} or \emph{categorical}. A \emph{scale} variable represents the data measured on
	an interval scale or ratio scale. Examples of scale variables include `Height', `Weight',
	`Salary', and `Temperature'. Scale variables are also referred to as \emph{quantitative}
	or \emph{continuous} variables. In contrast, a \emph{categorical} variable has a fixed
	limited number of distinct values or categories. Examples of categorical variables
	include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'.
	Categorical variables can further be classified into two types, \emph{nominal} and
	\emph{ordinal}, depending on whether the categories in the variable can be ordered via an
	intrinsic ranking. For example, there is no meaningful ranking among distinct values in
	`Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from
	highly dissatisfied to highly satisfied.

	The input dataset for descriptive statistics is provided in the form of a matrix, whose
	rows are the records (data points) and whose columns are the features (i.e.~variables).
	Some scripts allow this matrix to be vertically split into two or three matrices. Descriptive
	statistics are computed over the specified features (columns) in the matrix. Which
	statistics are computed depends on the types of the features. It is important to keep
	in mind the following caveats and restrictions:
	\begin{Enumerate}
	\item Given a finite set of data records, i.e.~a \emph{sample}, we take their feature
	values and compute their \emph{sample statistics}. These statistics
	will vary from sample to sample even if the underlying distribution of feature values
	remains the same. Sample statistics are accurate for the given sample only.
	If the goal is to estimate the \emph{distribution statistics} that are parameters of
	the (hypothesized) underlying distribution of the features, the corresponding sample
	statistics may sometimes be used as approximations, but their accuracy will vary.
	\item In particular, the accuracy of the estimated distribution statistics will be low
	if the number of values in the sample is small. That is, for small samples, the computed
	statistics may depend on the randomness of the individual sample values more than on
	the underlying distribution of the features.
	\item The accuracy will also be low if the sample records cannot be assumed mutually
	independent and identically distributed (i.i.d.), that is, sampled at random from the
	same underlying distribution. In practice, feature values in one record often depend
	on other features and other records, including unknown ones.
	\item Most of the computed statistics will have low estimation accuracy in the presence of
	extreme values (outliers) or if the underlying distribution has heavy tails, for example
	obeys a power law. However, a few of the computed statistics, such as the median and
	\NameStatRho{}, are \emph{robust} to outliers.
	\item Some sample statistics are reported with their \emph{sample standard errors}
	in an attempt to quantify their accuracy as distribution parameter estimators. But these
	sample standard errors, in turn, only estimate the underlying distribution's standard
	errors and will have low accuracy for small or \mbox{non-i.i.d.} samples, outliers in samples,
	or heavy-tailed distributions.
	\item We assume that the quantitative (scale) feature columns do not contain missing
	values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise
	specified. We assume that each categorical feature column contains positive integers
	from 1 to the number of categories; for ordinal features, the natural order on
	the integers should coincide with the order on the categories.
	\end{Enumerate}

	\input{DescriptiveUnivarStats}

	\input{DescriptiveBivarStats}

	\input{DescriptiveStratStats}