% Licensed to the Apache Software Foundation (ASF) under one
% or more contributor license agreements. See the NOTICE file
% distributed with this work for additional information
% regarding copyright ownership. The ASF licenses this file
% to you under the Apache License, Version 2.0 (the
% "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
% http://www.apache.org/licenses/LICENSE-2.0
% Unless required by applicable law or agreed to in writing,
% software distributed under the License is distributed on an
% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
% KIND, either express or implied. See the License for the
% specific language governing permissions and limitations
% under the License.
% !TEX root = ../design.tex
\chapter[k Nearest Neighbors]{k Nearest Neighbors}
\begin{moduleinfo}
\item[Authors] \href{mailto:okislal@pivotal.io}{Orhan Kislal}
\item[History]
\begin{modulehistory}
\item[v0.1] Initial version: knn and kd-tree.
\end{modulehistory}
\end{moduleinfo}
% Abstract. What is the problem we want to solve?
\section{Introduction} % (fold)
\label{sec:knn_introduction}
\emph{Some notes and figures in this section are borrowed from \cite{medium_knn} and \cite{point_knn}}.
K-nearest neighbors (KNN) is one of the most commonly used learning
algorithms. The goal of KNN is to find a number (k) of training data points
closest to the test point. These neighbors can then be used to predict labels
via classification or regression.

Unlike most learning techniques, KNN does not have a training phase. It does
not build a model that generalizes the data; instead, the algorithm uses the
whole training dataset (or a specific subset of it) at prediction time.

When KNN is used for classification, the output is a class membership (a
discrete value). An object is classified by a majority vote of its neighbors,
and is assigned to the class most common among its k nearest neighbors. When
KNN is used for regression, the output is a continuous value for the object,
computed as the average (or median) of the values of its k nearest neighbors.
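As a toy illustration of the prediction step, consider a single test point
whose $k = 3$ nearest neighbors are already known. A minimal Python sketch
(the labels and values are made up) is:
\begin{verbatim}
from collections import Counter
from statistics import mean

# Labels / values of the k = 3 nearest neighbors of some test point.
neighbor_labels = ["A", "A", "B"]
neighbor_values = [2.0, 3.0, 4.0]

# Classification: majority vote among the neighbors' labels.
predicted_label = Counter(neighbor_labels).most_common(1)[0][0]   # 'A'

# Regression: average of the neighbors' values.
predicted_value = mean(neighbor_values)                           # 3.0
\end{verbatim}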
\section{Implementation Details}
The basic KNN implementation relies on a table join between the training dataset and the test dataset:
\begin{sql}
(SELECT test_id,
        train_id,
        fn_dist(train_col_name, test_col_name) AS dist,
        label
 FROM train_table, test_table) AS knn_sub
\end{sql}
Once we have the distance between every training--test pair, the algorithm picks the k smallest distances for each test point (the row number is computed in a subquery so that it can be filtered on):
\begin{sql}
SELECT *
FROM (SELECT row_number() OVER
                 (PARTITION BY test_id ORDER BY dist) AS r,
             test_id,
             train_id,
             label
      FROM knn_sub) AS knn_ranked
WHERE r <= k
\end{sql}
Finally, the prediction for each test point is computed from the labels of its
selected training points, by majority vote for classification or by averaging
for regression.
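For reference, the same pipeline can be expressed as a short in-memory sketch.
This is an illustrative Python version, not the SQL implementation above;
squared Euclidean distance stands in for \texttt{fn\_dist} and the sample data
is made up.
\begin{verbatim}
from collections import Counter

def knn_predict(train, test_point, k):
    """Brute-force KNN for one test point: compute the distance to every
    training point, keep the k smallest, then take a majority vote."""
    # Distance to every training point (the table join above).
    dists = [(sum((a - b) ** 2 for a, b in zip(features, test_point)), label)
             for features, label in train]
    # Keep the k nearest neighbors (the row_number() filter above).
    neighbors = sorted(dists)[:k]
    # Majority vote over the selected labels (classification).
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([5.0, 5.0], "B")]
print(knn_predict(train, [0.9, 1.1], k=2))   # -> 'A'
\end{verbatim}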
\section{Enabling KD-tree}
One of the major shortcomings of KNN is that it is computationally expensive.
In addition, since there is no training phase, every single prediction has to
compute the full table join. One way to improve performance is to reduce the
search space for the test points. The kd-tree option is developed to allow
trading output accuracy for higher performance by shrinking the neighbor
search space.

Kd-trees are used for organizing data in k dimensions. A kd-tree is
constructed like a binary search tree in which each level of the tree uses a
specific dimension for finding splits.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{figures/2d_kdtree.pdf}
\caption{A 2D kd-tree of depth 3}
\label{kdd:2d_kdtree}
\end{figure}
A kd-tree is constructed by finding the median value of the data in a
particular dimension and separating the data into two sections based on this
value. This process is repeated a number of times to construct smaller
regions. Once the kd-tree is prepared, every test point can be assigned to one
of its regions, and this partitioning can be used to limit the search space
for nearest neighbors.
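A minimal sketch of this construction and of the region lookup, assuming the
data points are plain Python lists and the tree is built to a fixed depth by
cycling through the dimensions (illustrative only; the function and field
names are not the actual implementation):
\begin{verbatim}
def build_kdtree(points, depth, dim=0):
    """Recursively split the data at the median of the current dimension.
    Leaves of the resulting nested dict are the regions."""
    if depth == 0 or len(points) <= 1:
        return {"points": points}                 # a leaf region
    points = sorted(points, key=lambda p: p[dim])
    split = points[len(points) // 2][dim]         # median in this dimension
    lower = [p for p in points if p[dim] < split]
    upper = [p for p in points if p[dim] >= split]
    next_dim = (dim + 1) % len(points[0])         # cycle through dimensions
    return {"dim": dim, "split": split,
            "lower": build_kdtree(lower, depth - 1, next_dim),
            "upper": build_kdtree(upper, depth - 1, next_dim)}

def find_region(tree, point):
    """Descend the kd-tree to the leaf region that contains the point."""
    while "split" in tree:
        tree = tree["lower"] if point[tree["dim"]] < tree["split"] else tree["upper"]
    return tree
\end{verbatim}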
Once we have the kd-tree regions and their borders, we find the region
associated with each test point. This gives us the first region to search for
nearest neighbors. In addition, we allow the user to request multiple regions
to search, which means we have to decide which additional regions to include.
We implemented a backtracking algorithm to find these regions. The core idea
is to find the closest border for each test point and to select the region on
the other side of that border. Note that points that reside in the same region
might have different secondary (or tertiary, etc.) regions. Consider the tree
in Figure~\ref{kdd:2d_kdtree}. A test point at $\langle 5, 2 \rangle$ is in
the same region as $\langle 3, 3.9 \rangle$. However, their closest borders,
and therefore the associated secondary regions, are quite different. In
addition, consider $\langle 3, 3.9 \rangle$ and $\langle 6, 3.9 \rangle$. They
share the same closest border ($y=4$), but the regions on the other side of
that border differ.

To make sure that we find the correct region, the following scheme is
implemented. For a given point $P$, we find the closest border, $dim[i] = x$,
and $P$'s position relative to it ($pos = -1$ for lower and $pos = +1$ for
higher). We then construct a new point that has the same values as the test
point in every dimension except $i$; for $dim[i]$, we set the value to
$x - pos \cdot \epsilon$. Finally, we use the existing kd-tree to find this
new point's assigned region. This region is the expansion target for the point
$P$. We repeat this process with the next closest border, as many times as
requested by the user.
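A sketch of this expansion step, reusing the hypothetical
\texttt{build\_kdtree} and \texttt{find\_region} helpers from the previous
sketch; here the candidate borders are taken to be the split planes crossed
while assigning the point to its region, and the epsilon value is arbitrary:
\begin{verbatim}
def expansion_region(tree, point, epsilon=1e-6):
    """Return the region on the other side of the point's closest border."""
    # Collect the borders (dimension, split value) crossed while the point
    # descends to its own region.
    borders, node = [], tree
    while "split" in node:
        borders.append((node["dim"], node["split"]))
        node = node["lower"] if point[node["dim"]] < node["split"] else node["upper"]
    if not borders:                      # the tree is a single region
        return node
    # Closest border to the point, and the point's side of it.
    dim, x = min(borders, key=lambda b: abs(point[b[0]] - b[1]))
    pos = -1 if point[dim] < x else +1
    # Same point in every dimension except dim, nudged just past the border.
    probe = list(point)
    probe[dim] = x - pos * epsilon
    return find_region(tree, probe)
\end{verbatim}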
The KNN algorithm does not change significantly with the addition of regions.
Assuming that the training and test datasets have their region information
stored in the tables, the only necessary change is to ensure that the table
join uses these region ids to limit the search space:
\begin{sql}
(SELECT test_id,
        train_id,
        fn_dist(train_col_name, test_col_name) AS dist,
        label
 FROM train_table, test_table
 WHERE train_table.region_id = test_table.region_id
) AS knn_sub
\end{sql}