blob: 2a4fccc67470aa85c2e51a6fa938c92bc46c4184 [file] [log] [blame]
<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: DBSCAN</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'madlib.apache.org');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.19.0</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__dbscan.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">DBSCAN<div class="ingroups"><a class="el" href="group__grp__early__stage.html">Early Stage Development</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> <ul>
<li class="level1">
<a href="#cluster">Clustering Function</a> </li>
<li class="level1">
<a href="#assignment">Cluster Assignment</a> </li>
<li class="level1">
<a href="#examples">Examples</a> </li>
<li class="level1">
<a href="#literature">Literature</a> </li>
</ul>
</div><dl class="section warning"><dt>Warning</dt><dd><em> This MADlib method is still in early stage development. Interface and implementation are subject to change. </em></dd></dl>
<p>Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm designed to discover clusters of arbitrary shape [1,2]. It places minimum requirements on domain knowledge to determine input parameters and has good efficiency on large databases.</p>
<p>Given a set of points, DBSCAN groups together points that are closely packed with many nearby neighbors (<em>core</em>), and marks points that lie alone in low-density regions with few nearby neighbors (<em>border</em>). Other points in the dataset that are not core or border are considered as <em>noise</em>. This method tends to be good for data which contains clusters of similar density.</p>
<p>Currently only a brute force approach is implemented, suitable for small datasets. Other approaches for larger datasets will be implemented in a future release.</p>
<p><a class="anchor" id="cluster"></a></p><dl class="section user"><dt>Clustering Function</dt><dd></dd></dl>
<pre class="syntax">
dbscan( source_table,
output_table,
id_column,
expr_point,
eps,
min_samples,
metric,
algorithm
)
</pre><p><b>Arguments</b> </p><dl class="arglist">
<dt>source_table </dt>
<dd><p class="startdd">TEXT. Name of the table containing the input data points. </p>
<p class="enddd"></p>
</dd>
<dt>output_table </dt>
<dd><p class="startdd">TEXT. Name of the table containing the clustering results. </p>
<p class="enddd"></p>
</dd>
<dt>id </dt>
<dd><p class="startdd">TEXT. Name of a column or expression containing a unique integer id for each training point. </p>
<p class="enddd"></p>
</dd>
<dt>point </dt>
<dd><p class="startdd">TEXT. Name of the column with point coordinates in array form, or an expression that evaluates to an array of point coordinates. </p>
<p class="enddd"></p>
</dd>
<dt>eps </dt>
<dd><p class="startdd">FLOAT8. Maximum distance between two samples for one to be considered in the neighborhood of the other. (Note this is not a maximum bound on the distances of points within a cluster.) This is an important parameter to choose appropriately and should consider both the nature of the data set and the distance function. </p>
<p class="enddd"></p>
</dd>
<dt>min_samples (optional) </dt>
<dd><p class="startdd">INTEGER, default: 5. Number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself. This parameter controls how tolerant the algorithm is towards noise, so on noisy data sets it may be useful to increase the magnitude of this parameter. </p>
<dl class="section note"><dt>Note</dt><dd>The parameters 'eps' and 'min_samples' together define the density of a cluster. A core point is where there exist 'min_samples' other points within a distance of 'eps', which are defined as neighbors of the core point. A higher value of 'min_samples' or a lower value of 'eps' indicate that higher density is needed to form a cluster.</dd></dl>
</dd>
<dt>metric (optional) </dt>
<dd>TEXT, default: 'squared_dist_norm2'. The name of the function used to calculate the distance between data points. The following distance functions can be used: <ul>
<li>
<b><a class="el" href="linalg_8sql__in.html#aad193850e79c4b9d811ca9bc53e13476">dist_norm1</a></b>: 1-norm/Manhattan (element-wise median). MADlib does not provide a median aggregate function for performance reasons. </li>
<li>
<b><a class="el" href="linalg_8sql__in.html#aa58e51526edea6ea98db30b6f250adb4">dist_norm2</a></b>: 2-norm/Euclidean (element-wise mean) </li>
<li>
<b><a class="el" href="linalg_8sql__in.html#a00a08e69f27524f2096032214e15b668">squared_dist_norm2</a></b>: squared Euclidean distance (element-wise mean) </li>
<li>
<b><a class="el" href="linalg_8sql__in.html#a8c7b9281a72ff22caf06161701b27e84">dist_angle</a></b>: angle (element-wise mean of normalized points) </li>
<li>
<p class="startli"><b><a class="el" href="linalg_8sql__in.html#afa13b4c6122b99422d666dedea136c18">dist_tanimoto</a></b>: tanimoto (element-wise mean of normalized points) </p>
<p class="endli"></p>
</li>
</ul>
</dd>
<dt>algorithm (optional) </dt>
<dd>TEXT, default: 'optimized'. The name of the algorithm used to compute clusters. <ul>
<li>
<p class="startli"><b>brute_force</b>: Brute force can be slow and should not be used for large datasets. However, it does have less initial setup overhead than the parllel-optimized algorithm, which may make it faster for some cases involving small datasets. You can also use a short form "b" or "brute" etc. to select brute force.</p>
<p class="endli"></p>
</li>
<li>
<p class="startli"><b>optimized</b>: This uses an rtree index to accelerate range_queries while performing the cluster detection. It is designed with gpdb in mind, in which case it intelligently partitions the space into a number of connected regions, runs DBSCAN in parallel on each region, and then merges the results together. By default, the maximum number of regions it will consider using is equal to the number of segments in the database, but if you suspect it may be spending too much time on the segmentation, this can be limited further by setting the max_segmentation_depth parameter to something lower. This should decrease the segmentation overhead, but will also decrease the amount of parallelism. If the dataset is relatively small, another option to consider is brute force, which has very little overhead but won't scale as well. For large enough datasets, and an appropriate choice of eps, this algorithm should be able to run in O((N/S) log N) time where N is the number of rows (points) in the input and S is the number of regions (often equal to the number of segments, but may be less). (Worst case for brute force is O(N^2).) eps should be chosen based on the density of the dataset, where DBSCAN works best for datasets where the clusters have a roughly uniform density. If eps is chosen too large, then the runtime will explode and nearly every point will be considered connected to every other point in one large cluster. Best practice is to start with a relatively small eps, so it can return the first result faster; if it looks like there are too many small clusters, increase eps and allow it to run for longer.</p>
<p class="endli"></p>
</li>
</ul>
</dd>
</dl>
<p><b>Output</b> <br />
The output table for DBSCAN module has the following columns: </p><table class="output">
<tr>
<th>id </th><td>SMALLINT|INTEGER|BIGINT. A column or expression which specifies unique ids for each row in the training dataset. Must evaluate to an integer type. </td></tr>
<tr>
<th>cluster_id </th><td>INTEGER. The cluster id of each classified data point. </td></tr>
<tr>
<th>is_core_point </th><td>BOOLEAN. Indicates if the training data point is core If it is not a core point and listed in the output table, it is a border point. Noise points are not shown in this table. </td></tr>
<tr>
<th>point </th><td>TEXT. The coordinates of each point classified </td></tr>
</table>
<p><a class="anchor" id="assignment"></a></p><dl class="section user"><dt>Cluster Assignment</dt><dd></dd></dl>
<p>After clustering, the cluster assignment for each data point can be computed:</p>
<pre class="syntax">
dbscan_predict( dbscan_table,
source_table,
id,
point,
output_table
)
</pre><p><b>Arguments</b> </p><dl class="arglist">
<dt>dbscan_table </dt>
<dd><p class="startdd">TEXT. Name of the table created by running DBSCAN.</p>
<p class="enddd"></p>
</dd>
<dt>source_table </dt>
<dd><p class="startdd">TEXT. Name of the table containing the input data points. </p>
<p class="enddd"></p>
</dd>
<dt>id </dt>
<dd><p class="startdd">VARCHAR. A column name or expression which selects a unique integer id for each training point. </p>
<p class="enddd"></p>
</dd>
<dt>point </dt>
<dd><p class="startdd">TEXT. Name of the column with point coordinates in array form, or an expression that evaluates to an array of point coordinates. </p>
<p class="enddd"></p>
</dd>
<dt>output_table </dt>
<dd>TEXT. Name of the table containing the clustering results. </dd>
</dl>
<p><b>Output</b> <br />
The output is a table with the following columns: </p><table class="output">
<tr>
<th>id </th><td>BIGINT. The unique id of each row, as it as passed in. </td></tr>
<tr>
<th>cluster_id </th><td>INTEGER. Cluster assignment (zero-based, i.e., 0,1,2...). </td></tr>
<tr>
<th>distance </th><td>DOUBLE PRECISION. Distance to the cluster centroid. </td></tr>
</table>
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<ol type="1">
<li>Create input data: <pre class="example">
DROP TABLE IF EXISTS dbscan_train_data;
CREATE TABLE dbscan_train_data (pid int, points double precision[]);
INSERT INTO dbscan_train_data VALUES
(1, '{1, 1}'),
(2, '{2, 1}'),
(3, '{1, 2}'),
(4, '{2, 2}'),
(5, '{3, 5}'),
(6, '{3, 9}'),
(7, '{3, 10}'),
(8, '{4, 10}'),
(9, '{4, 11}'),
(10, '{5, 10}'),
(11, '{7, 10}'),
(12, '{10, 9}'),
(13, '{10, 6}'),
(14, '{9, 5}'),
(15, '{10, 5}'),
(16, '{11, 5}'),
(17, '{9, 4}'),
(18, '{10, 4}'),
(19, '{11, 4}'),
(20, '{10, 3}');
CREATE TABLE dbscan_test_data (pid int, points double precision[]);
INSERT INTO dbscan_test_data VALUES
(1, '{1, 2}'),
(2, '{2, 2}'),
(3, '{1, 3}'),
(4, '{2, 2}'),
(10, '{5, 11}'),
(11, '{7, 10}'),
(12, '{10, 9}'),
(13, '{10, 6}'),
(14, '{9, 5}'),
(15, '{10, 6}');
</pre></li>
<li>Run DBSCAN using the brute force method with a Euclidean distance function: <pre class="example">
DROP TABLE IF EXISTS dbscan_result, dbscan_result_summary;
SELECT madlib.dbscan(
'dbscan_train_data', -- source table
'dbscan_result', -- output table
'pid', -- point id column
'points', -- data point
1.75, -- epsilon
4, -- min samples
'dist_norm2', -- metric
'brute_force'); -- algorithm
SELECT * FROM dbscan_result ORDER BY pid;
</pre> <pre class="result">
pid | cluster_id | is_core_point | __points__
-----+------------+---------------+------------
1 | 0 | t | {1,1}
2 | 0 | t | {2,1}
3 | 0 | t | {1,2}
4 | 0 | t | {2,2}
6 | 1 | f | {3,9}
7 | 1 | t | {3,10}
8 | 1 | t | {4,10}
9 | 1 | t | {4,11}
10 | 1 | f | {5,10}
13 | 2 | t | {10,6}
14 | 2 | t | {9,5}
15 | 2 | t | {10,5}
16 | 2 | t | {11,5}
17 | 2 | t | {9,4}
18 | 2 | t | {10,4}
19 | 2 | t | {11,4}
20 | 2 | t | {10,3}
(17 rows)
</pre> There are three clusters created. All points are core points except for 6 and 10 which are border points. The noise points do not show up in the output table. If you want to see the noise points you can use a query like: <pre class="example">
SELECT l.* FROM dbscan_train_data l WHERE NOT EXISTS
(SELECT NULL FROM dbscan_result r WHERE r.pid = l.pid)
ORDER BY l.pid;
</pre> <pre class="result">
pid | points
-----+--------
5 | {3,5}
11 | {7,10}
12 | {10,9}
</pre> The summary table lists the 'eps' value and the distance metric used: <pre class="example">
SELECT * FROM dbscan_result_summary;
</pre> <pre class="result">
id | eps | metric
-----------+------+------------
pid | 1.75 | dist_norm2
</pre></li>
<li>Find the cluster assignment for the test data points: <pre class="example">
SELECT madlib.dbscan_predict(
'dbscan_result', -- from DBSCAN run
'dbscan_test_data', -- test dataset
'pid', -- point id column
'points', -- data point
'dbscan_predict_out' -- output table
);
SELECT * FROM dbscan_predict_out ORDER BY pid;
</pre> <pre class="result">
pid | cluster_id | distance
-----+------------+----------
1 | 0 | 0
2 | 0 | 0
3 | 0 | 1
4 | 0 | 0
10 | 1 | 1
13 | 2 | 0
14 | 2 | 0
15 | 2 | 0
(8 rows)
</pre></li>
</ol>
<p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>[1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters
in Large Spatial Databases with Noise", KDD-96 Proceedings, <a href="https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf">https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf</a></p>
<p>[2] Erich Schubert, Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems, July 2017, Article No. 19, <a href="https://dl.acm.org/doi/10.1145/3068335">https://dl.acm.org/doi/10.1145/3068335</a> </p>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Wed Dec 15 2021 20:27:21 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
</ul>
</div>
</body>
</html>