% When using TeXShop on the Mac, let it know the root document. The following must be one of the first 20 lines.
% !TEX root = ../design.tex
\chapter{Sampling}
\begin{moduleinfo}
\item[Author] \href{mailto:Florian.Schoppmann@emc.com}{Florian Schoppmann}
\item[History]
\begin{modulehistory}
\item[v0.5] Initial revision
\end{modulehistory}
\end{moduleinfo}
\section{Sampling without Replacement} \label{sec:SampingWOReplacement}
Given a list of known size $n$ through which we can iterate with arbitrary increments, sampling $m$ elements without replacement can be implemented in time $O(m)$, i.e., proportional to the sample size and independent of the list size \cite{V84a}. Even if the list size $n$ is unknown in advance, sampling can still be implemented in time $O(m(1 + \log \frac nm))$ \cite{V85a}.
While being able to iterate through a list with arbitrary increments might seem like a very modest requirement, it is still not always an option in real-world databases (e.g., in PostgreSQL). It is therefore important to also consider more constrained algorithms.
\subsection{Probabilistic Sampling}
Probabilistic sampling selects each element in the list with probability $p$. Hence, the sample size is a random variable with Binomial distribution $B(n, p)$ and expectation $np$. The standard deviation is $\sqrt{np(1 - p)}$, i.e., approximately $\sqrt{np}$ if $p$ is small. In many applications, however, a fixed sample size $m$ is needed. In this case, we can choose $p$ slightly larger than $m/n$, so that with high probability at least $m$ items are selected; items in excess of $m$ are then discarded.
\subsubsection{Formal Description}
In the following, we discuss how to choose $p$ so that with high probability at least $m$ elements are sampled, but also not ``much'' more than $m$ (in fact, only $O(\sqrt m)$ more in expectation).
In mathematical terms: What is a lower bound on the probability $p$ so that for a random variable $X \sim B(n,p)$ we have that $\Pr[X < m] \leq \epsilon$? We use the Chernoff bound for a fairly good estimate. It says
%
\begin{align*}
\Pr[X < (1 - \delta) \cdot \mu] \leq \exp\left( \frac{-\delta^2}{2} \cdot \mu \right)
\SiM,
\end{align*}
where $\mu = np$ is the expectation of $X$, and $\delta \geq 0$. We set $m = (1 - \delta) \cdot \mu$, or equivalently $\delta = \frac{\mu - m}{\mu}$.
%
This yields
\begin{align}
\Pr[X < m] \leq \exp\left( \frac{-(\mu - m)^2}{2 \mu} \right)
\SiM. \label{eq:SamplingWOReplacent:GP:2}
\end{align}
%
We want the right-hand side of \eqref{eq:SamplingWOReplacent:GP:2} to be bounded from above by $\epsilon$. Taking logarithms on both sides yields the quadratic inequality $\mu^2 - 2(m - \ln(\epsilon)) \mu + m^2 \geq 0$, whose larger root gives
\begin{align*}
\mu \geq m - \ln(\epsilon) + \sqrt{\ln^2(\epsilon) - 2 m \ln(\epsilon)}
\SiM.
\end{align*}
Since $p = \mu / n$, this immediately translates into a lower bound for $p$. For instance, suppose we require $\epsilon = 10^{-6}$, i.e., we want the probability of our sample being too small to be less than one in a million. $\ln(10^{-6}) \approx -13.8$, so we could choose
\begin{align*}
p \geq \frac{m + 14 + \sqrt{196 + 28m}}{n}
\SiM.
\end{align*}
Note that the bound on $\mu$ does not depend on $n$. So in expectation, only $O(m + \sqrt m)$ items are selected. At the same time, at least $m$ items are selected with very high probability.
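As a concrete example: for $m = 1000$ and $\epsilon = 10^{-6}$, the bound becomes
\begin{align*}
	\mu \geq 1000 + 14 + \sqrt{196 + 28\,000} \approx 1182
	\SiM,
\end{align*}
so for $n = 10^6$ we would choose $p \geq 1182/10^6 \approx 0.00119$, and only roughly $182 = O(\sqrt m)$ items beyond the required $m$ are selected in expectation.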
\subsubsection{Implementation in SQL}
In real-world DBMSs, probabilistic sampling has the advantage that it is trivially data-parallel. Discarding excess items can be done using the well-known \texttt{ORDER BY random() LIMIT} idiom. Tests show that PostgreSQL sorts very efficiently (today's CPUs can easily sort 1 million numbers in less than a couple hundred milliseconds), so the sorting cost is barely measurable for sample sizes up to several million. Since \texttt{ORDER BY random() LIMIT} is an often-used idiom, there is also hope that advanced optimizers will give it special treatment. Putting this together, in order to sample $m$ rows uniformly at random, we write:
\begin{sql}
SELECT * FROM list WHERE random() < p ORDER BY random() LIMIT m
\end{sql}
If necessary, checks can be added that indeed $m$ rows have been selected.
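For illustration, a concrete instantiation (a sketch, assuming a table \texttt{list} with $n = 10^6$ rows and a desired sample size $m = 1000$, so that $p \approx 0.00119$ by the bound above):
\begin{sql}
-- Sample m = 1000 rows uniformly at random from roughly 10^6 rows;
-- p = 0.00119 makes an undersized sample less likely than 1 in 10^6
SELECT * FROM list WHERE random() < 0.00119 ORDER BY random() LIMIT 1000;
\end{sql}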
\subsection{Generating a Random Variate According to a Discrete Probability Distribution}
In practice, probability distributions are often induced by weights (that are not necessarily normalized to add up to 1). The following algorithm is a special case of the ``unequal probability sampling plan'' proposed by \textcite{C82a}. Its idea is very similar to reservoir sampling \cite{MB83a}.
\subsubsection{Formal Description}
\begin{algorithm}[WeightedSample$(A, w)$] \label{alg:WeightedSample}
\alginput{Finite set $A$, Mapping $w$ of each element $a \in A$ to its weight $w[a] \geq 0$}
\algoutput{Random element $\Sample \in A$ sampled according to distribution induced by $w$}
\begin{algorithmic}[1]
\State $W \set 0$
\For{$a \in A$}
\State $W \set W + w[a]$ \label{alg:WeightedSample:UpdateWeight}
\With{probability $\frac{w[a]}{W}$} \label{alg:WeightedSample:Prob}
\State $\Sample \set a$ \label{alg:WeightedSample:SetSample}
\EndWith
\EndFor
\end{algorithmic}
\end{algorithm}
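For testing purposes, the distribution induced by $w$ can also be sampled non-incrementally in plain SQL via the inverse-CDF method. The following is a sketch, assuming a hypothetical table \texttt{list(value, weight)} with unique values; note that this is an alternative to Algorithm~\ref{alg:WeightedSample}, not the implementation described below:
\begin{sql}
-- Draw a single u ~ U[0,1), then select the first row whose cumulative
-- weight reaches u times the total weight
WITH r AS (SELECT random() AS u),
cdf AS (
    SELECT value,
           sum(weight) OVER (ORDER BY value) AS cum_weight,
           sum(weight) OVER ()               AS total_weight
    FROM list
)
SELECT value
FROM cdf, r
WHERE cum_weight >= u * total_weight
ORDER BY cum_weight
LIMIT 1;
\end{sql}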
\begin{description}
\item[Runtime] $O(n)$, single-pass streaming algorithm
\item[Space] $O(1)$, constant memory
\item[Correctness]
Let $a_1, \dots, a_n$ be the order in which the algorithm processes the elements. Denote by $\Sample_t$ the value of $\Sample$ at the end of iteration $t \in \Nupto n$. We prove by induction over $t$ that it holds for all $i \in \Nupto t$ that $\Pr[\Sample_t = a_i] = \frac{w[a_i]}{W_t}$ where $W_t := \sum_{j=1}^t w[a_j]$.
The base case $t = 1$ holds immediately by lines~\ref{alg:WeightedSample:Prob}--\ref{alg:WeightedSample:SetSample}. To see the induction step $t - 1 \to t$, note that $\Pr[\Sample_t = a_t] = \frac{w[a_t]}{W_t}$ (again by lines~\ref{alg:WeightedSample:Prob}--\ref{alg:WeightedSample:SetSample}) and that for all $i \in \Nupto{t-1}$
\begin{align*}
\Pr[\Sample_t = a_i]
= \Pr[\Sample_t \neq a_t] \cdot \Pr[\Sample_{t-1} = a_i]
\stackrel{\text{IH}}{=}
\left(1 - \frac{w[a_t]}{W_t}\right) \cdot \frac{w[a_i]}{W_{t-1}}
= \frac{w[a_i]}{W_t}
\SiM.
\end{align*}
\item[Scalability]
The algorithm can easily be transformed into a divide-and-conquer algorithm, as shown in the following.
\end{description}
\begin{algorithm}[RecursiveWeightedSample$(A_1, A_2, w)$] \label{alg:RecursiveWeightedSample}
\alginput{Disjoint finite sets $A_1, A_2$, Mapping $w$ of each element $a \in A_1 \cup A_2$ to its weight $w[a] \geq 0$}
\algoutput{Random element $\Sample \in A_1 \cup A_2$ sampled according to distribution induced by $w$}
\begin{algorithmic}[1]
\State $\tilde A \set \emptyset$
\For{$i = 1,2$}
\State $\Sample_i \set \texttt{WeightedSample}(A_i, w)$
\State $\tilde A \set \tilde A \cup \{ \Sample_i \}$
\State $\tilde w[\Sample_i] \set \sum_{a \in A_i} w[a]$
\EndFor
\State $\Sample \set \texttt{WeightedSample}(\tilde A, \tilde w)$
\end{algorithmic}
\end{algorithm}
\paragraph{Correctness}
Define $W_i := \sum_{a \in A_i} w[a]$ for $i = 1,2$, and let $W := W_1 + W_2$. Let $a \in A_i$ be arbitrary. Then $\Pr[\Sample = a] = \Pr[\Sample_i = a] \cdot \Pr[\Sample \in A_i] = \frac{w[a]}{W_i} \cdot \frac{W_i}{W} = \frac{w[a]}{W}.$
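As a quick numerical check with hypothetical weights: for $A_1 = \{a_1, a_2\}$ and $A_2 = \{a_3\}$ with weights $1, 2, 3$, we get
\begin{align*}
	\Pr[\Sample = a_1] = \frac{1}{3} \cdot \frac{3}{6} = \frac{1}{6} = \frac{w[a_1]}{W}
	\SiM.
\end{align*}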
\paragraph{Numerical Considerations}
\begin{itemize}
\item When Algorithm~\ref{alg:WeightedSample} is used for large sets $A$, line~\ref{alg:WeightedSample:UpdateWeight} will eventually add two numbers of very different magnitude. Compensated summation should be used \cite{ORO05a}; see the sketch after this list.
\end{itemize}
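For reference, the standard compensated (Kahan) summation would replace the update in line~\ref{alg:WeightedSample:UpdateWeight} by the following steps, where the compensation term $c$ is initialized to $0$ alongside $W$:
\begin{align*}
	y \gets w[a] - c, \qquad
	t \gets W + y, \qquad
	c \gets (t - W) - y, \qquad
	W \gets t
	\SiM.
\end{align*}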
\subsubsection{Implementation as User-Defined Aggregate}
Algorithm~\ref{alg:WeightedSample} is implemented as the user-defined aggregate \symlabel{weighted\_sample}{sym:weighted_sample}. Data-parallelism is implemented as in Algorithm~\ref{alg:RecursiveWeightedSample}.
\paragraph{Input} The aggregate expects the following arguments:
\begin{center}
\begin{tabularx}{\linewidth}{lXl}
\toprule%
\textbf{Column} & \textbf{Description} & \textbf{Type}
\\\otoprule
\texttt{value} &
Row identifier, each row corresponds to an $a \in A$. There is no need to enforce uniqueness. If a value occurs multiple times, the probability of sampling this value is proportional to the sum of its weights. &
\textit{generic}
\\\midrule
\texttt{weight} &
Weight for the row, corresponds to $w[a]$ &
floating-point
\\\bottomrule
\end{tabularx}
\end{center}
%
While it would be desirable to define a user-defined aggregate with a first argument of generic type, this would require a generic transition type (see below). Unfortunately, this is currently not supported in PostgreSQL and Greenplum. However, the internal C++ accumulator type is a generic template class, i.e., only the SQL interface contains redundancies.
\paragraph{Output}
Value of column \texttt{value} in the row that was selected.
\paragraph{Components}
\begin{itemize}
\item Transition State:
\begin{center}
\begin{tabularx}{\linewidth}{lXl}
\toprule%
\textbf{Field Name} & \textbf{Description} & \textbf{Type}
\\\otoprule
\texttt{weight\_sum} &
corresponds to $W$ in Algorithm~\ref{alg:WeightedSample} &
floating-point
\\\midrule
\texttt{sample} &
corresponds to $\Sample$ in Algorithm~\ref{alg:WeightedSample}, takes value of column \texttt{value} &
\textit{generic}
\\\bottomrule
\end{tabularx}
\end{center}
\item Transition Function (\texttt{state, value, weight}): Lines~\ref{alg:WeightedSample:UpdateWeight}--\ref{alg:WeightedSample:SetSample} of Algorithm~\ref{alg:WeightedSample}
\item Merge Function (\texttt{state1, state2}): It is enough to call the transition function with arguments (\texttt{state1, state2.sample, state2.weight\_sum})
\end{itemize}
\paragraph{Tool Set}
While the user-defined aggregate is simple enough to be implemented in plain SQL, we choose a C++ implementation for performance. \todo{In the future, we want to use compensated summation. (Not documented yet.)}
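As a usage sketch (assuming the aggregate is accessible as \texttt{weighted\_sample} and given the hypothetical table \texttt{list(value, weight)} from before; the installed schema and exact signature may differ):
\begin{sql}
-- Draw one value, with probability proportional to its weight
SELECT weighted_sample(value, weight) FROM list;
\end{sql}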
\section{Sampling with Replacement} % (fold)
\label{sec:SamplingWithReplacement}
In Postgres, with the help of the \texttt{row\_number()} window function, we can achieve sampling with replacement by joining the input table with a list of randomly generated row numbers. In the snippets below, \texttt{\$size} denotes the number of rows of \texttt{\$input\_table}, so the sample has the same size as the input (as in bootstrapping); to draw a sample of a different size $m$, the argument of \texttt{generate\_series} would be $m$ instead.
\lstset{numbers=none}
\subsection{Assign Row Numbers for Input Table}
\begin{sql}
SELECT (row_number() OVER ())::integer AS rn, * FROM $input_table;
\end{sql}
\subsection{Generate a Random Sample of Row Numbers}
\begin{sql}
SELECT ceiling((1 - random()) * $size)::integer AS rn FROM generate_series(1, $size) s;
\end{sql}
\subsection{Join to Generate Sample}
\begin{sql}
SELECT *
FROM
(
SELECT (row_number() OVER ())::integer AS rn, * FROM $input_table
) src
JOIN
(
SELECT ceiling((1 - random()) * $size)::integer AS rn FROM generate_series(1, $size) s
) sample_rn
USING (rn);
\end{sql}
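Equivalently, the table size can be computed dynamically rather than substituted by hand (a sketch, assuming a hypothetical table \texttt{list} and a bootstrap sample of the same size as the table):
\begin{sql}
WITH c AS (SELECT count(*)::integer AS n FROM list)
SELECT *
FROM
(
    SELECT (row_number() OVER ())::integer AS rn, * FROM list
) src
JOIN
(
    -- n rows, each a uniform random row number in 1..n
    SELECT ceiling((1 - random()) * c.n)::integer AS rn
    FROM c, generate_series(1, (SELECT n FROM c)) s
) sample_rn
USING (rn);
\end{sql}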
\lstset{numbers=left}