| <!-- HTML header for doxygen 1.8.4--> |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| <html xmlns="http://www.w3.org/1999/xhtml"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/> |
| <meta http-equiv="X-UA-Compatible" content="IE=9"/> |
| <meta name="generator" content="Doxygen 1.8.13"/> |
| <meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/> |
| <title>MADlib: DBSCAN</title> |
| <link href="tabs.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="jquery.js"></script> |
| <script type="text/javascript" src="dynsections.js"></script> |
| <link href="navtree.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="resize.js"></script> |
| <script type="text/javascript" src="navtreedata.js"></script> |
| <script type="text/javascript" src="navtree.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(initResizable); |
| </script> |
| <link href="search/search.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="search/searchdata.js"></script> |
| <script type="text/javascript" src="search/search.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(function() { init_search(); }); |
| </script> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"], |
| jax: ["input/TeX","output/HTML-CSS"], |
| }); |
| </script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script> |
| <!-- hack in the navigation tree --> |
| <script type="text/javascript" src="eigen_navtree_hacks.js"></script> |
| <link href="doxygen.css" rel="stylesheet" type="text/css" /> |
| <link href="madlib_extra.css" rel="stylesheet" type="text/css"/> |
| <!-- google analytics --> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| ga('create', 'UA-45382226-1', 'madlib.apache.org'); |
| ga('send', 'pageview'); |
| </script> |
| </head> |
| <body> |
| <div id="top"><!-- do not remove this div, it is closed by doxygen! --> |
| <div id="titlearea"> |
| <table cellspacing="0" cellpadding="0"> |
| <tbody> |
| <tr style="height: 56px;"> |
| <td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td> |
| <td style="padding-left: 0.5em;"> |
| <div id="projectname"> |
| <span id="projectnumber">1.18.0</span> |
| </div> |
| <div id="projectbrief">User Documentation for Apache MADlib</div> |
| </td> |
| <td> <div id="MSearchBox" class="MSearchBoxInactive"> |
| <span class="left"> |
| <img id="MSearchSelect" src="search/mag_sel.png" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| alt=""/> |
| <input type="text" id="MSearchField" value="Search" accesskey="S" |
| onfocus="searchBox.OnSearchFieldFocus(true)" |
| onblur="searchBox.OnSearchFieldFocus(false)" |
| onkeyup="searchBox.OnSearchFieldChange(event)"/> |
| </span><span class="right"> |
| <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a> |
| </span> |
| </div> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <!-- end header part --> |
| <!-- Generated by Doxygen 1.8.13 --> |
| <script type="text/javascript"> |
| var searchBox = new SearchBox("searchBox", "search",false,'Search'); |
| </script> |
| </div><!-- top --> |
| <div id="side-nav" class="ui-resizable side-nav-resizable"> |
| <div id="nav-tree"> |
| <div id="nav-tree-contents"> |
| <div id="nav-sync" class="sync"></div> |
| </div> |
| </div> |
| <div id="splitbar" style="-moz-user-select:none;" |
| class="ui-resizable-handle"> |
| </div> |
| </div> |
| <script type="text/javascript"> |
| $(document).ready(function(){initNavTree('group__grp__dbscan.html','');}); |
| </script> |
| <div id="doc-content"> |
| <!-- window showing the filter options --> |
| <div id="MSearchSelectWindow" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| onkeydown="return searchBox.OnSearchSelectKey(event)"> |
| </div> |
| |
| <!-- iframe showing the search results (closed by default) --> |
| <div id="MSearchResultsWindow"> |
| <iframe src="javascript:void(0)" frameborder="0" |
| name="MSearchResults" id="MSearchResults"> |
| </iframe> |
| </div> |
| |
| <div class="header"> |
| <div class="headertitle"> |
| <div class="title">DBSCAN<div class="ingroups"><a class="el" href="group__grp__early__stage.html">Early Stage Development</a></div></div> </div> |
| </div><!--header--> |
| <div class="contents"> |
| <div class="toc"><b>Contents</b> <ul> |
| <li class="level1"> |
| <a href="#cluster">Clustering Function</a> </li> |
| <li class="level1"> |
| <a href="#assignment">Cluster Assignment</a> </li> |
| <li class="level1"> |
| <a href="#examples">Examples</a> </li> |
| <li class="level1"> |
| <a href="#literature">Literature</a> </li> |
| </ul> |
| </div><dl class="section warning"><dt>Warning</dt><dd><em> This MADlib method is still in early stage development. Interface and implementation are subject to change. </em></dd></dl> |
| <p>Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm designed to discover clusters of arbitrary shape [1,2]. It places minimum requirements on domain knowledge to determine input parameters and has good efficiency on large databases.</p> |
| <p>Given a set of points, DBSCAN groups together points that are closely packed with many nearby neighbors (<em>core</em>), and marks points that lie alone in low-density regions with few nearby neighbors (<em>border</em>). Other points in the dataset that are not core or border are considered as <em>noise</em>. This method tends to be good for data which contains clusters of similar density.</p> |
| <p>Currently only a brute force approach is implemented, suitable for small datasets. Other approaches for larger datasets will be implemented in a future release.</p> |
| <p><a class="anchor" id="cluster"></a></p><dl class="section user"><dt>Clustering Function</dt><dd></dd></dl> |
| <pre class="syntax"> |
| dbscan( source_table, |
| output_table, |
| id_column, |
| expr_point, |
| eps, |
| min_samples, |
| metric, |
| algorithm |
| ) |
| </pre><p><b>Arguments</b> </p><dl class="arglist"> |
| <dt>source_table </dt> |
| <dd><p class="startdd">TEXT. Name of the table containing the input data points. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>output_table </dt> |
| <dd><p class="startdd">TEXT. Name of the table containing the clustering results. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>id_column </dt> |
| <dd><p class="startdd">TEXT. Name of the column containing a unique integer id for each training point. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>expr_point </dt> |
| <dd><p class="startdd">TEXT. Name of the column with point coordinates in array form, or an expression that evaluates to an array of point coordinates. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>eps </dt> |
| <dd><p class="startdd">FLOAT8. Maximum distance between two samples for one to be considered in the neighborhood of the other. (Note this is not a maximum bound on the distances of points within a cluster.) This is an important parameter to choose appropriately and should consider both the nature of the data set and the distance function. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>min_samples (optional) </dt> |
| <dd><p class="startdd">INTEGER, default: 5. Number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself. This parameter controls how tolerant the algorithm is towards noise, so on noisy data sets it may be useful to increase the magnitude of this parameter. </p> |
| <dl class="section note"><dt>Note</dt><dd>The parameters 'eps' and 'min_samples' together define the density of a cluster. A core point is where there exist 'min_samples' other points within a distance of 'eps', which are defined as neighbors of the core point. A higher value of 'min_samples' or a lower value of 'eps' indicate that higher density is needed to form a cluster.</dd></dl> |
| </dd> |
| <dt>metric (optional) </dt> |
| <dd>TEXT, default: 'squared_dist_norm2'. The name of the function used to calculate the distance between data points. The following distance functions can be used: <ul> |
| <li> |
| <b><a class="el" href="linalg_8sql__in.html#aad193850e79c4b9d811ca9bc53e13476">dist_norm1</a></b>: 1-norm/Manhattan (element-wise median). MADlib does not provide a median aggregate function for performance reasons. </li> |
| <li> |
| <b><a class="el" href="linalg_8sql__in.html#aa58e51526edea6ea98db30b6f250adb4">dist_norm2</a></b>: 2-norm/Euclidean (element-wise mean) </li> |
| <li> |
| <b><a class="el" href="linalg_8sql__in.html#a00a08e69f27524f2096032214e15b668">squared_dist_norm2</a></b>: squared Euclidean distance (element-wise mean) </li> |
| <li> |
| <b><a class="el" href="linalg_8sql__in.html#a8c7b9281a72ff22caf06161701b27e84">dist_angle</a></b>: angle (element-wise mean of normalized points) </li> |
| <li> |
| <p class="startli"><b><a class="el" href="linalg_8sql__in.html#afa13b4c6122b99422d666dedea136c18">dist_tanimoto</a></b>: tanimoto (element-wise mean of normalized points) </p> |
| <p class="endli"></p> |
| </li> |
| </ul> |
| </dd> |
| <dt>algorithm (optional) </dt> |
| <dd>TEXT, default: 'brute_force'. The name of the algorithm used to compute clusters. Currently only brute force is supported: <ul> |
| <li> |
| <b>brute_force</b>: Produces an exact result by searching all points in the search space. Brute force can be slow and is intended to be used for small datasets only. You can also use a short form "b" or "brute" etc. to select brute force. </li> |
| </ul> |
| </dd> |
| </dl> |
| <p><b>Output</b> <br /> |
| The output table for DBSCAN module has the following columns: </p><table class="output"> |
| <tr> |
| <th>id_column </th><td>INTEGER. Test data point id. </td></tr> |
| <tr> |
| <th>cluster_id </th><td>INTEGER. Cluster id associated with the test data point. </td></tr> |
| <tr> |
| <th>is_core_point </th><td>BOOLEAN. Indicates if the test data point is core If it is not a core point and listed in the output table, it is a border point. Noise points are not shown in this table. </td></tr> |
| <tr> |
| <th>__points__ </th><td>TEXT. Column or expression for the data point. </td></tr> |
| </table> |
| <p><a class="anchor" id="assignment"></a></p><dl class="section user"><dt>Cluster Assignment</dt><dd></dd></dl> |
| <p>After clustering, the cluster assignment for each data point can be computed:</p> |
| <pre class="syntax"> |
| dbscan_predict( dbscan_table, |
| source_table, |
| id_column, |
| expr_point, |
| output_table |
| ) |
| </pre><p><b>Arguments</b> </p><dl class="arglist"> |
| <dt>dbscan_table </dt> |
| <dd><p class="startdd">TEXT. Name of the table created by running DBSCAN.</p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>source_table </dt> |
| <dd><p class="startdd">TEXT. Name of the table containing the input data points. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>id_column </dt> |
| <dd><p class="startdd">TEXT. Name of the column containing a unique integer id for each training point. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>expr_point </dt> |
| <dd><p class="startdd">TEXT. Name of the column with point coordinates in array form, or an expression that evaluates to an array of point coordinates. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>output_table </dt> |
| <dd>TEXT. Name of the table containing the clustering results. </dd> |
| </dl> |
| <p><b>Output</b> <br /> |
| The output is a table with the following columns: </p><table class="output"> |
| <tr> |
| <th>id_column </th><td>INTEGER. ID column passed to the function. </td></tr> |
| <tr> |
| <th>cluster_id </th><td>INTEGER. Cluster assignment (zero-based, i.e., 0,1,2...). </td></tr> |
| <tr> |
| <th>distance </th><td>DOUBLE PRECISION. Distance to the cluster centroid. </td></tr> |
| </table> |
| <p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl> |
| <ol type="1"> |
| <li>Create input data: <pre class="example"> |
| DROP TABLE IF EXISTS dbscan_train_data; |
| CREATE TABLE dbscan_train_data (pid int, points double precision[]); |
| INSERT INTO dbscan_train_data VALUES |
| (1, '{1, 1}'), |
| (2, '{2, 1}'), |
| (3, '{1, 2}'), |
| (4, '{2, 2}'), |
| (5, '{3, 5}'), |
| (6, '{3, 9}'), |
| (7, '{3, 10}'), |
| (8, '{4, 10}'), |
| (9, '{4, 11}'), |
| (10, '{5, 10}'), |
| (11, '{7, 10}'), |
| (12, '{10, 9}'), |
| (13, '{10, 6}'), |
| (14, '{9, 5}'), |
| (15, '{10, 5}'), |
| (16, '{11, 5}'), |
| (17, '{9, 4}'), |
| (18, '{10, 4}'), |
| (19, '{11, 4}'), |
| (20, '{10, 3}'); |
| CREATE TABLE dbscan_test_data (pid int, points double precision[]); |
| INSERT INTO dbscan_test_data VALUES |
| (1, '{1, 2}'), |
| (2, '{2, 2}'), |
| (3, '{1, 3}'), |
| (4, '{2, 2}'), |
| (10, '{5, 11}'), |
| (11, '{7, 10}'), |
| (12, '{10, 9}'), |
| (13, '{10, 6}'), |
| (14, '{9, 5}'), |
| (15, '{10, 6}'); |
| </pre></li> |
| <li>Run DBSCAN using the brute force method with a Euclidean distance function: <pre class="example"> |
| DROP TABLE IF EXISTS dbscan_result, dbscan_result_summary; |
| SELECT madlib.dbscan( |
| 'dbscan_train_data', -- source table |
| 'dbscan_result', -- output table |
| 'pid', -- point id column |
| 'points', -- data point |
| 1.75, -- epsilon |
| 4, -- min samples |
| 'dist_norm2', -- metric |
| 'brute_force'); -- algorithm |
| SELECT * FROM dbscan_result ORDER BY pid; |
| </pre> <pre class="result"> |
| pid | cluster_id | is_core_point | __points__ |
| -----+------------+---------------+------------ |
| 1 | 0 | t | {1,1} |
| 2 | 0 | t | {2,1} |
| 3 | 0 | t | {1,2} |
| 4 | 0 | t | {2,2} |
| 6 | 1 | f | {3,9} |
| 7 | 1 | t | {3,10} |
| 8 | 1 | t | {4,10} |
| 9 | 1 | t | {4,11} |
| 10 | 1 | f | {5,10} |
| 13 | 2 | t | {10,6} |
| 14 | 2 | t | {9,5} |
| 15 | 2 | t | {10,5} |
| 16 | 2 | t | {11,5} |
| 17 | 2 | t | {9,4} |
| 18 | 2 | t | {10,4} |
| 19 | 2 | t | {11,4} |
| 20 | 2 | t | {10,3} |
| (17 rows) |
| </pre> There are three clusters created. All points are core points except for 6 and 10 which are border points. The noise points do not show up in the output table. If you want to see the noise points you can use a query like: <pre class="example"> |
| SELECT l.* FROM dbscan_train_data l WHERE NOT EXISTS |
| (SELECT NULL FROM dbscan_result r WHERE r.pid = l.pid) |
| ORDER BY l.pid; |
| </pre> <pre class="result"> |
| pid | points |
| -----+-------- |
| 5 | {3,5} |
| 11 | {7,10} |
| 12 | {10,9} |
| </pre> The summary table lists the 'eps' value and the distance metric used: <pre class="example"> |
| SELECT * FROM dbscan_result_summary; |
| </pre> <pre class="result"> |
| id_column | eps | metric |
| -----------+------+------------ |
| pid | 1.75 | dist_norm2 |
| </pre></li> |
| <li>Find the cluster assignment for the test data points: <pre class="example"> |
| SELECT madlib.dbscan_predict( |
| 'dbscan_result', -- from DBSCAN run |
| 'dbscan_test_data', -- test dataset |
| 'pid', -- point id column |
| 'points', -- data point |
| 'dbscan_predict_out' -- output table |
| ); |
| SELECT * FROM dbscan_predict_out ORDER BY pid; |
| </pre> <pre class="result"> |
| pid | cluster_id | distance |
| -----+------------+---------- |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 0 |
| 10 | 1 | 1 |
| 13 | 2 | 0 |
| 14 | 2 | 0 |
| 15 | 2 | 0 |
| (8 rows) |
| </pre></li> |
| </ol> |
| <p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl> |
| <p>[1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters |
| in Large Spatial Databases with Noise", KDD-96 Proceedings, <a href="https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf">https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf</a></p> |
| <p>[2] Erich Schubert, Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems, July 2017, Article No. 19, <a href="https://dl.acm.org/doi/10.1145/3068335">https://dl.acm.org/doi/10.1145/3068335</a> </p> |
| </div><!-- contents --> |
| </div><!-- doc-content --> |
| <!-- start footer part --> |
| <div id="nav-path" class="navpath"><!-- id is needed for treeview function! --> |
| <ul> |
| <li class="footer">Generated on Wed Mar 31 2021 20:45:50 for MADlib by |
| <a href="http://www.doxygen.org/index.html"> |
| <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li> |
| </ul> |
| </div> |
| </body> |
| </html> |