<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: k-Means Clustering</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.19.0</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__kmeans.html','');});
</script>
<div id="doc-content">
<div class="header">
<div class="headertitle">
<div class="title">k-Means Clustering<div class="ingroups"><a class="el" href="group__grp__unsupervised.html">Unsupervised Learning</a> &raquo; <a class="el" href="group__grp__clustering.html">Clustering</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> <ul>
<li class="level1">
<a href="#cluster">Clustering Function</a> </li>
<li class="level1">
<a href="#auto_cluster">Auto Clustering Function</a> </li>
<li class="level1">
<a href="#assignment">Cluster Assignment</a> </li>
<li class="level1">
<a href="#silh">Simple Silhouette</a> </li>
<li class="level1">
<a href="#examples">Examples</a> </li>
<li class="level1">
<a href="#background">Technical Background</a> </li>
<li class="level1">
<a href="#literature">Literature</a> </li>
<li class="level1">
<a href="#related">Related Topics</a> </li>
</ul>
</div><p>Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of <em>similarity</em>. In the k-means variant, given \( n \) points \( x_1, \dots, x_n \in \mathbb R^d \), the goal is to position \( k \) centroids \( c_1, \dots, c_k \in \mathbb R^d \) so that the sum of <em>distances</em> between each point and its closest centroid is minimized. Each centroid represents a cluster that consists of all points to which this centroid is closest.</p>
<p>This module can compute clusters given the number of centroids <em>k</em> as an input, using a variety of seeding methods. It can also automatically select the best <em>k</em> value from a range of suggested <em>k</em> values, using the simplified silhouette method or the elbow method.</p>
<p><a class="anchor" id="cluster"></a></p><dl class="section user"><dt>Clustering Function</dt><dd></dd></dl>
<p>The k-means algorithm can be invoked in different ways, depending on the source of the initial set of centroids:</p>
<ul>
<li>Uniform random centroid seeding method. <pre class="syntax">
kmeans_random( rel_source,
expr_point,
k,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
</pre></li>
<li>k-means++ centroid seeding method <a href="#kmeans-lit-1">[1]</a>. This method can speed up convergence by seeding centroids spread out over the whole range of the input points, while at the same time not being too susceptible to outliers. <pre class="syntax">
kmeanspp( rel_source,
expr_point,
k,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned,
seeding_sample_ratio
)
</pre></li>
<li>Supply an initial centroid set in a relation identified by the <em>rel_initial_centroids</em> argument, for the case where initial centroids are stored in a table. <pre class="syntax">
kmeans( rel_source,
expr_point,
rel_initial_centroids,
expr_centroid,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
</pre></li>
<li>Provide an initial centroid set as an array expression in the <em>initial_centroids</em> argument. <pre class="syntax">
kmeans( rel_source,
expr_point,
initial_centroids,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned
)
</pre></li>
</ul>
<p><b>Arguments</b> </p><dl class="arglist">
<dt>rel_source </dt>
<dd><p class="startdd">TEXT. The name of the table containing the input data points. Data points and predefined centroids (if used) are expected to be stored row-wise, in a column of type <code><a class="el" href="group__grp__svec.html">SVEC</a></code> (or any type convertible to <code><a class="el" href="group__grp__svec.html">SVEC</a></code>, like <code>FLOAT[]</code> or <code>INTEGER[]</code>). Data points with non-finite values (NULL, NaN, infinity) in any component are skipped during analysis. </p>
<p class="enddd"></p>
</dd>
<dt>expr_point </dt>
<dd><p class="startdd">TEXT. The name of the column with point coordinates or an array expression.</p>
<p class="enddd"></p>
</dd>
<dt>k </dt>
<dd><p class="startdd">INTEGER. The number of centroids to calculate.</p>
<p class="enddd"></p>
</dd>
<dt>fn_dist (optional) </dt>
<dd><p class="startdd">TEXT, default: 'squared_dist_norm2'. The name of the function to use to calculate the distance from a data point vector to a centroid vector. The following distance functions can be used (computation of barycenter/mean in parentheses): </p><ul>
<li>
<b><a class="el" href="linalg_8sql__in.html#aad193850e79c4b9d811ca9bc53e13476">dist_norm1</a></b>: 1-norm/Manhattan (element-wise median). MADlib does not provide a median aggregate function for performance reasons. </li>
<li>
<b><a class="el" href="linalg_8sql__in.html#aa58e51526edea6ea98db30b6f250adb4">dist_norm2</a></b>: 2-norm/Euclidean (element-wise mean) </li>
<li>
<b><a class="el" href="linalg_8sql__in.html#a00a08e69f27524f2096032214e15b668">squared_dist_norm2</a></b>: squared Euclidean distance (element-wise mean) </li>
<li>
<b><a class="el" href="linalg_8sql__in.html#a8c7b9281a72ff22caf06161701b27e84">dist_angle</a></b>: angle (element-wise mean of normalized points) </li>
<li>
<b><a class="el" href="linalg_8sql__in.html#afa13b4c6122b99422d666dedea136c18">dist_tanimoto</a></b>: tanimoto (element-wise mean of normalized points <a href="#kmeans-lit-5">[5]</a>) </li>
<li>
<b>user-defined function</b> with signature <code>DOUBLE PRECISION[] x, DOUBLE PRECISION[] y -&gt; DOUBLE PRECISION</code>; see the sketch after the argument list below</li>
</ul>
<p class="enddd"></p>
</dd>
<dt>agg_centroid (optional) </dt>
<dd><p class="startdd">TEXT, default: 'avg'. The name of the aggregate function used to determine centroids. The following aggregate functions can be used:</p><ul>
<li>
<b><a class="el" href="linalg_8sql__in.html#a1aa37f73fb1cd8d7d106aa518dd8c0b4">avg</a></b>: average (default) </li>
<li>
<b><a class="el" href="linalg_8sql__in.html#a0b04663ca206f03e66aed5ea2b4cc461">normalized_avg</a></b>: normalized average</li>
</ul>
<p class="enddd"></p>
</dd>
<dt>max_num_iterations (optional) </dt>
<dd><p class="startdd">INTEGER, default: 20. The maximum number of iterations to perform.</p>
<p class="enddd"></p>
</dd>
<dt>min_frac_reassigned (optional) </dt>
<dd><p class="startdd">DOUBLE PRECISION, default: 0.001. The minimum fraction of centroids reassigned to continue iterating. When fewer than this fraction of centroids are reassigned in an iteration, the calculation completes.</p>
<p class="enddd"></p>
</dd>
<dt>seeding_sample_ratio (optional) </dt>
<dd><p class="startdd">DOUBLE PRECISION, default: 1.0. The proportion of subsample of original dataset to use for kmeans++ centroid seeding method. Kmeans++ scans through the data sequentially 'k' times and can be too slow for big datasets. When 'seeding_sample_ratio' is greater than 0 (thresholded to be maximum value of 1.0), the seeding is run on a uniform random subsample of the data. Note: the final K-means algorithm is run on the complete dataset. This parameter only builds a subsample for the seeding and is only available for kmeans++.</p>
<p class="enddd"></p>
</dd>
<dt>rel_initial_centroids </dt>
<dd><p class="startdd">TEXT. Table or view containing the set of initial centroids. </p>
<p class="enddd"></p>
</dd>
<dt>expr_centroid </dt>
<dd><p class="startdd">TEXT. The name of the column (or the array expression) in the <em>rel_initial_centroids</em> table or view that contains the centroid coordinates.</p>
<p class="enddd"></p>
</dd>
<dt>initial_centroids </dt>
<dd>DOUBLE PRECISION[][]. Array expression with the initial centroid coordinates. </dd>
</dl>
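<p>Any function with the signature shown for the <em>fn_dist</em> argument can serve as a custom distance measure. The following is a minimal sketch, not part of MADlib, of a user-defined Chebyshev (maximum-norm) distance; it assumes PostgreSQL 9.2 or later for referring to SQL function parameters by name:</p>
<pre class="example">
-- Hypothetical user-defined distance function: Chebyshev (max-norm) distance.
-- The signature matches what fn_dist expects:
-- (DOUBLE PRECISION[], DOUBLE PRECISION[]) -&gt; DOUBLE PRECISION.
CREATE OR REPLACE FUNCTION dist_chebyshev(x DOUBLE PRECISION[], y DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS $$
    -- Largest absolute element-wise difference between the two vectors.
    SELECT max(abs(x[i] - y[i]))
    FROM generate_series(1, array_upper(x, 1)) AS i
$$ LANGUAGE sql IMMUTABLE;
</pre>
<p>Such a function can then be passed by name (schema-qualified if necessary) as the <em>fn_dist</em> argument.</p>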
<p><b>Output</b> <br />
The output of the k-means module is a composite type with the following columns: </p><table class="output">
<tr>
<th>centroids </th><td>DOUBLE PRECISION[][]. The final centroid positions. </td></tr>
<tr>
<th>cluster_variance </th><td>DOUBLE PRECISION[]. The value of the objective function per cluster. </td></tr>
<tr>
<th>objective_fn </th><td>DOUBLE PRECISION. The value of the objective function. </td></tr>
<tr>
<th>frac_reassigned </th><td>DOUBLE PRECISION. The fraction of points reassigned in the last iteration. </td></tr>
<tr>
<th>num_iterations </th><td>INTEGER. The total number of iterations executed. </td></tr>
</table>
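<p>For the form that reads initial centroids from a relation, a minimal sketch follows. It assumes the <code>km_sample</code> table created in the Examples section below; the seed table name <code>km_seed</code> and the choice of seed rows are arbitrary:</p>
<pre class="example">
-- Sketch: seed k-means from a table of predefined centroids.
DROP TABLE IF EXISTS km_seed;
CREATE TABLE km_seed AS
    SELECT pid, points FROM km_sample WHERE pid IN (1, 5);  -- hand-picked seed points
SELECT * FROM madlib.kmeans( 'km_sample',                 -- Table of source data
                             'points',                    -- Column containing points
                             'km_seed',                   -- Table of initial centroids
                             'points',                    -- Centroid coordinates column
                             'madlib.squared_dist_norm2', -- Distance function
                             'madlib.avg',                -- Aggregate function
                             20,                          -- Max number of iterations
                             0.001                        -- Min fraction reassigned
                           );
</pre>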
<p><a class="anchor" id="auto_cluster"></a></p><dl class="section user"><dt>Auto Clustering Function</dt><dd></dd></dl>
<p>The optimal number of centroids can be determined automatically, from a set of candidate values that you provide, for random seeding or <em>k-means++</em> seeding. The <a href="#simplified_silhouette">simplified silhouette method</a> or the <a href="#elbow">elbow method</a> are used to pick the best <em>k</em> value.</p>
<ul>
<li>Uniform random centroid seeding method. <pre class="syntax">
kmeans_random_auto( rel_source,
output_table,
expr_point,
k,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned,
k_selection_algorithm
)
</pre></li>
<li>k-means++ centroid seeding method. <pre class="syntax">
kmeanspp_auto( rel_source,
output_table,
expr_point,
k,
fn_dist,
agg_centroid,
max_num_iterations,
min_frac_reassigned,
seeding_sample_ratio,
k_selection_algorithm
)
</pre></li>
</ul>
<p><b>Arguments</b> <br />
The arguments for auto k selection are the same as described above, with the following additions: </p><dl class="arglist">
<dt>output_table </dt>
<dd><p class="startdd">TEXT. Name of the output table containing results for each k value. Details of the output table are shown below. A summary table called &lt;output_table&gt;_summary will also be created for the best k value as per the selection algorithm.</p>
<p class="enddd"></p>
</dd>
<dt>k </dt>
<dd><p class="startdd">INTEGER[]. Array of k values to test. Does not need to be contiguous but all elements must be &gt;1 and cannot repeat within the array. Parameter 'k_selection_algorithm' determines the evaluation method.</p>
<p class="enddd"></p>
</dd>
<dt>k_selection_algorithm (optional) </dt>
<dd><p class="startdd">TEXT, default: 'silhouette'. Method to evaluate optimal number of centroids k. Current approaches supported: 'silhouette', 'elbow' or 'both'. The text can be any subset of the strings; for e.g., 'silh' will use the silhouette method. Note that for large data sets, the silhouette computation can be expensive.</p>
<dl class="section note"><dt>Note</dt><dd>Note that the auto k-means algorithms require the NumPy python package to be installed on the system since the elbow method uses the NumPy gradient function <a href="#kmeans-lit-2">[2]</a>. For Greenplum clusters, installing NumPy only on the host machine of the master segment will be sufficient. The suggested installation method is: <em>pip install numpy &ndash;user</em> </dd></dl>
</dd>
</dl>
<p><b>Output Tables</b> <br />
Two output tables are created for auto k-means. The first is called 'output_table' and contains one row per k value: </p><table class="output">
<tr>
<th>k </th><td>INTEGER. Number of centroids. </td></tr>
<tr>
<th>centroids </th><td>DOUBLE PRECISION[][]. The final centroid positions. </td></tr>
<tr>
<th>cluster_variance </th><td>DOUBLE PRECISION[]. The value of the objective function per cluster. </td></tr>
<tr>
<th>objective_fn </th><td>DOUBLE PRECISION. The value of the objective function. </td></tr>
<tr>
<th>frac_reassigned </th><td>DOUBLE PRECISION. The fraction of points reassigned in the last iteration. </td></tr>
<tr>
<th>num_iterations </th><td>INTEGER. The total number of iterations executed. </td></tr>
<tr>
<th>silhouette </th><td>DOUBLE PRECISION. The value of the simplified silhouette score for the k value, if computed. </td></tr>
<tr>
<th>elbow </th><td>DOUBLE PRECISION. The value of the elbow score for the k value, if computed. </td></tr>
</table>
<p>A summary table named &lt;output_table&gt;_summary is also created. It has the same columns as 'output_table', plus a 'selection_algorithm' column (either 'silhouette' or 'elbow'), and contains only the single row for the best k value as per the selection algorithm. If 'both' is specified for 'k_selection_algorithm', the best k value returned corresponds to the silhouette method.</p>
<p><a class="anchor" id="assignment"></a></p><dl class="section user"><dt>Cluster Assignment</dt><dd></dd></dl>
<p>After clustering, the cluster assignment for each data point can be computed with the help of the <a class="el" href="linalg_8sql__in.html#acf6628dfa4d73dfce65a582aa5c5a3db" title="Given matrix and vector compute the column of that is closest to . ">closest_column()</a> utility function:</p>
<pre class="syntax">
closest_column( m,
x,
dist
)
</pre><p><b>Arguments</b> </p><dl class="arglist">
<dt>m </dt>
<dd><p class="startdd">DOUBLE PRECISION[][]. Learned centroids from the training function.</p>
<p class="enddd"></p>
</dd>
<dt>x </dt>
<dd><p class="startdd">DOUBLE PRECISION[]. Data points.</p>
<p class="enddd"></p>
</dd>
<dt>dist (optional) </dt>
<dd>TEXT, default: 'squared_dist_norm2'. The name of the function to use to calculate the distance from a data point vector to a centroid vector. See the <em>fn_dist</em> argument of the k-means training function for more details on distance functions. </dd>
</dl>
<p><b>Output</b> <br />
The output is a composite type with the following columns: </p><table class="output">
<tr>
<th>column_id </th><td>INTEGER. Cluster assignment (zero-based, i.e., 0,1,2...). </td></tr>
<tr>
<th>distance </th><td>DOUBLE PRECISION. Distance to the cluster centroid. </td></tr>
</table>
<p><a class="anchor" id="silh"></a></p><dl class="section user"><dt>Simple Silhouette</dt><dd></dd></dl>
<p>A common method to assess the quality of the clustering is the <em>silhouette coefficient</em>, a simplified version of which is implemented in MADlib <a href="#kmeans-lit-3">[3]</a>. There are two silhouette functions: average score across all data points, and a per-point score. The average silhouette function has the following syntax: </p><pre class="syntax">
simple_silhouette( rel_source,
expr_point,
centroids,
fn_dist
)
</pre><p>The per-point silhouette function has two formats. The first takes an array of centroids: </p><pre class="syntax">
simple_silhouette_points(rel_source,
output_table,
pid,
expr_point,
centroids,
fn_dist
)
</pre><p>The second takes a centroids column from a table: </p><pre class="syntax">
simple_silhouette_points(rel_source,
output_table,
pid,
expr_point,
centroids_table,
centroids_col,
fn_dist
)
</pre><p><b>Arguments</b> </p><dl class="arglist">
<dt>rel_source </dt>
<dd><p class="startdd">TEXT. The name of the table containing the input data points.</p>
<p class="enddd"></p>
</dd>
<dt>expr_point </dt>
<dd><p class="startdd">TEXT. The name of the column with point coordinates or an array expression.</p>
<p class="enddd"></p>
</dd>
<dt>centroids </dt>
<dd><p class="startdd">DOUBLE PRECISION[][]. An expression evaluating to an array of centroids. </p>
<p class="enddd"></p>
</dd>
<dt>fn_dist (optional) </dt>
<dd><p class="startdd">TEXT, default: 'squared_dist_norm2'. The name of the function to use to calculate the distance from a data point vector to a centroid vector. See the <em>fn_dist</em> argument of the k-means training function for more details on distance functions.</p>
<p class="enddd"></p>
</dd>
<dt>output_table </dt>
<dd><p class="startdd">TEXT. Name of the output table to contain sihouette score for each point.</p>
<p class="enddd"></p>
</dd>
<dt>pid </dt>
<dd><p class="startdd">TEXT. Column containing point ID in the table 'rel_source' containing the data points. </p>
<p class="enddd"></p>
</dd>
<dt>centroids_table </dt>
<dd><p class="startdd">TEXT. Name of the table containing the centroids.</p>
<p class="enddd"></p>
</dd>
<dt>centroids_col </dt>
<dd><p class="startdd">TEXT. Name of the column in the table 'centroids_table' containing the centroids.</p>
<p class="enddd"></p>
</dd>
</dl>
<p><b>Output</b> <br />
For the average function, a single value for simple silhouette score is returned. For the per-point function, the output table contains the following columns: </p><table class="output">
<tr>
<th>pid </th><td><p class="starttd">INTEGER. Point ID. </p>
<p class="endtd"></p>
</td></tr>
<tr>
<th>centroid_id </th><td><p class="starttd">INTEGER. The cluster that the point is assigned to. </p>
<p class="endtd"></p>
</td></tr>
<tr>
<th>neighbor_centroid_id </th><td><p class="starttd">INTEGER. The next closest cluster to the one that the point is assigned to. </p>
<p class="endtd"></p>
</td></tr>
<tr>
<th>silh </th><td>DOUBLE PRECISION. Simplified silhouette score for the point. </td></tr>
</table>
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<dl class="section note"><dt>Note</dt><dd>Your results may not be exactly the same as below due to the nature random nature of the k-means algorithm. Also, remember to be consistent in the distance functions that you use in the k-means, silhouette and helper functions.</dd></dl>
<h4>Clustering for Single <em>k</em> Value</h4>
<ol type="1">
<li>Create input data: <pre class="example">
DROP TABLE IF EXISTS km_sample;
CREATE TABLE km_sample(pid int, points double precision[]);
INSERT INTO km_sample VALUES
(1, '{14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.0600, 0.2800, 2.29, 5.64, 1.04, 3.92, 1065}'),
(2, '{13.2, 1.78, 2.14, 11.2, 1, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.49, 1050}'),
(3, '{13.16, 2.36, 2.67, 18.6, 101, 2.8, 3.24, 0.3, 2.81, 5.6799, 1.03, 3.17, 1185}'),
(4, '{14.37, 1.95, 2.5, 16.8, 113, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86, 3.45, 1480}'),
(5, '{13.24, 2.59, 2.87, 21, 118, 2.8, 2.69, 0.39, 1.82, 4.32, 1.04, 2.93, 735}'),
(6, '{14.2, 1.76, 2.45, 15.2, 112, 3.27, 3.39, 0.34, 1.97, 6.75, 1.05, 2.85, 1450}'),
(7, '{14.39, 1.87, 2.45, 14.6, 96, 2.5, 2.52, 0.3, 1.98, 5.25, 1.02, 3.58, 1290}'),
(8, '{14.06, 2.15, 2.61, 17.6, 121, 2.6, 2.51, 0.31, 1.25, 5.05, 1.06, 3.58, 1295}'),
(9, '{14.83, 1.64, 2.17, 14, 97, 2.8, 2.98, 0.29, 1.98, 5.2, 1.08, 2.85, 1045}'),
(10, '{13.86, 1.35, 2.27, 16, 98, 2.98, 3.15, 0.22, 1.8500, 7.2199, 1.01, 3.55, 1045}');
</pre></li>
<li>Run k-means clustering using kmeans++ for centroid seeding. Use squared Euclidean distance, a commonly used distance function. <pre class="example">
DROP TABLE IF EXISTS km_result;
CREATE TABLE km_result AS
SELECT * FROM madlib.kmeanspp( 'km_sample', -- Table of source data
'points', -- Column containing point co-ordinates
2, -- Number of centroids to calculate
'madlib.squared_dist_norm2', -- Distance function
'madlib.avg', -- Aggregate function
20, -- Max number of iterations
0.001 -- Min fraction of centroids reassigned to keep iterating
);
\x on;
SELECT * FROM km_result;
</pre> <pre class="result">
-[ RECORD 1 ]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
centroids | {{14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75},{13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333}}
cluster_variance | {30561.74805,122999.110416013}
objective_fn | 153560.858466013
frac_reassigned | 0
num_iterations | 2
</pre></li>
<li>Simplified silhouette coefficient. First find the average over the whole data set. Make sure to use the same distance function as k-means above. <pre class="example">
SELECT * FROM madlib.simple_silhouette( 'km_sample', -- Input points table
'points', -- Column containing points
(SELECT centroids FROM km_result), -- Centroids
'madlib.squared_dist_norm2' -- Distance function
);
</pre> <pre class="result">
-[ RECORD 1 ]-----+------------------
simple_silhouette | 0.868174608939623
</pre> Now calculate the simplified silhouette coefficient for each point in the data set: <pre class="example">
DROP TABLE IF EXISTS km_points_silh;
SELECT * FROM madlib.simple_silhouette_points( 'km_sample', -- Input points table
'km_points_silh', -- Output table
'pid', -- Point ID column in input table
'points', -- Points column in input table
'km_result', -- Centroids table
'centroids', -- Column in centroids table containing centroids
'madlib.squared_dist_norm2' -- Distance function
);
SELECT * FROM km_points_silh ORDER BY pid;
</pre> <pre class="result">
pid | centroid_id | neighbor_centroid_id | silh
-----+-------------+----------------------+-------------------
1 | 1 | 0 | 0.966608819821713
2 | 1 | 0 | 0.926251125077039
3 | 1 | 0 | 0.28073008848306
4 | 0 | 1 | 0.951447083910869
5 | 1 | 0 | 0.80098239014753
6 | 0 | 1 | 0.972487557020722
7 | 0 | 1 | 0.88838568723116
8 | 0 | 1 | 0.906334719972002
9 | 1 | 0 | 0.994315140692314
10 | 1 | 0 | 0.99420347703982
(10 rows)
</pre></li>
<li>Find the cluster assignment for each point. Use the <a class="el" href="linalg_8sql__in.html#acf6628dfa4d73dfce65a582aa5c5a3db" title="Given matrix and vector compute the column of that is closest to . ">closest_column()</a> function to map each point to the cluster that it belongs to. <pre class="example">
\x off;
DROP TABLE IF EXISTS point_cluster_map;
CREATE TABLE point_cluster_map AS
SELECT data.*, (madlib.closest_column(centroids, points, 'madlib.squared_dist_norm2')).*
FROM km_sample as data, km_result;
ALTER TABLE point_cluster_map RENAME column_id to cluster_id; -- change column name
SELECT * FROM point_cluster_map ORDER BY pid;
</pre> <pre class="result">
pid | points | cluster_id | distance
-----+--------------------------------------------------------------------+------------+------------------
1 | {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065} | 1 | 3296.12614333444
2 | {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050} | 1 | 8856.90882600111
3 | {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185} | 1 | 27072.3216580044
4 | {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480} | 0 | 10261.9450375
5 | {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735} | 1 | 82492.8673553345
6 | {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450} | 0 | 5080.3612375
7 | {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290} | 0 | 8090.4485875
8 | {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295} | 0 | 7128.9931875
9 | {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045} | 1 | 634.301947334443
10 | {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045} | 1 | 646.584486004443
(10 rows)
</pre></li>
<li>Display cluster ID. If you need the cluster ID associated with each set of centroid coordinates, there are two steps. First, unnest the 2-D cluster centroids array to get a set of 1-D centroid arrays: <pre class="example">
DROP TABLE IF EXISTS km_centroids_unnest;
-- Run unnest function
CREATE TABLE km_centroids_unnest AS
SELECT (madlib.array_unnest_2d_to_1d(centroids)).*
FROM km_result;
SELECT * FROM km_centroids_unnest ORDER BY 1;
</pre> <pre class="result">
unnest_row_id | unnest_result
---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | {14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75}
2 | {13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333}
(2 rows)
</pre> Note that the ID column returned by <a class="el" href="array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70" title="This function takes a 2-D array as the input and unnests it by one level. It returns a set of 1-D arr...">array_unnest_2d_to_1d()</a> is just a row ID and not the cluster ID assigned by k-means. The second step to get the cluster_id is: <pre class="example">
SELECT cent.*, (madlib.closest_column(centroids, unnest_result, 'madlib.squared_dist_norm2')).column_id as cluster_id
FROM km_centroids_unnest as cent, km_result
ORDER BY cent.unnest_row_id;
</pre> <pre class="result">
unnest_row_id | unnest_result | cluster_id
---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------
1 | {14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75} | 0
2 | {13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333} | 1
(2 rows)
</pre></li>
<li>Run the same example as above, but using array input. Create the input table: <pre class="example">
DROP TABLE IF EXISTS km_arrayin CASCADE;
CREATE TABLE km_arrayin(pid int,
p1 float,
p2 float,
p3 float,
p4 float,
p5 float,
p6 float,
p7 float,
p8 float,
p9 float,
p10 float,
p11 float,
p12 float,
p13 float);
INSERT INTO km_arrayin VALUES
(1, 14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.0600, 0.2800, 2.29, 5.64, 1.04, 3.92, 1065),
(2, 13.2, 1.78, 2.14, 11.2, 1, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.49, 1050),
(3, 13.16, 2.36, 2.67, 18.6, 101, 2.8, 3.24, 0.3, 2.81, 5.6799, 1.03, 3.17, 1185),
(4, 14.37, 1.95, 2.5, 16.8, 113, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86, 3.45, 1480),
(5, 13.24, 2.59, 2.87, 21, 118, 2.8, 2.69, 0.39, 1.82, 4.32, 1.04, 2.93, 735),
(6, 14.2, 1.76, 2.45, 15.2, 112, 3.27, 3.39, 0.34, 1.97, 6.75, 1.05, 2.85, 1450),
(7, 14.39, 1.87, 2.45, 14.6, 96, 2.5, 2.52, 0.3, 1.98, 5.25, 1.02, 3.58, 1290),
(8, 14.06, 2.15, 2.61, 17.6, 121, 2.6, 2.51, 0.31, 1.25, 5.05, 1.06, 3.58, 1295),
(9, 14.83, 1.64, 2.17, 14, 97, 2.8, 2.98, 0.29, 1.98, 5.2, 1.08, 2.85, 1045),
(10, 13.86, 1.35, 2.27, 16, 98, 2.98, 3.15, 0.22, 1.8500, 7.2199, 1.01, 3.55, 1045);
</pre> Now run k-means with random seeding and find the cluster assignment for each point: <pre class="example">
DROP TABLE IF EXISTS km_result_array;
-- Run kmeans algorithm
CREATE TABLE km_result_array AS
SELECT * FROM madlib.kmeans_random('km_arrayin', -- Table of source data
'ARRAY[p1, p2, p3, p4, p5, p6, -- Points
p7, p8, p9, p10, p11, p12, p13]',
2, -- Number of centroids to calculate
'madlib.squared_dist_norm2', -- Distance function
'madlib.avg', -- Aggregate function
20, -- Max number of iterations
0.001); -- Min fraction of centroids reassigned to keep iterating
-- Get point assignment
SELECT data.*, (madlib.closest_column(centroids,
ARRAY[p1, p2, p3, p4, p5, p6,
p7, p8, p9, p10, p11, p12, p13],
'madlib.squared_dist_norm2')).column_id as cluster_id
FROM km_arrayin as data, km_result_array
ORDER BY data.pid;
</pre> This produces the result in column format: <pre class="result">
pid | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | p10 | p11 | p12 | p13 | cluster_id
-----+-------+------+------+------+-----+------+------+------+------+--------+------+------+------+------------
1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 | 0
2 | 13.2 | 1.78 | 2.14 | 11.2 | 1 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.49 | 1050 | 0
3 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.6799 | 1.03 | 3.17 | 1185 | 1
4 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 | 1
5 | 13.24 | 2.59 | 2.87 | 21 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 | 0
6 | 14.2 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 | 1
7 | 14.39 | 1.87 | 2.45 | 14.6 | 96 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290 | 1
8 | 14.06 | 2.15 | 2.61 | 17.6 | 121 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295 | 1
9 | 14.83 | 1.64 | 2.17 | 14 | 97 | 2.8 | 2.98 | 0.29 | 1.98 | 5.2 | 1.08 | 2.85 | 1045 | 0
10 | 13.86 | 1.35 | 2.27 | 16 | 98 | 2.98 | 3.15 | 0.22 | 1.85 | 7.2199 | 1.01 | 3.55 | 1045 | 0
(10 rows)
</pre></li>
</ol>
<h4>Auto Clustering for Range of <em>k</em> Values</h4>
<ol type="1">
<li>Auto k selection. Now let's run kmeans_random for a variety of k values and compare the results using the simple silhouette and elbow methods. <pre class="example">
DROP TABLE IF EXISTS km_result_auto, km_result_auto_summary;
SELECT madlib.kmeans_random_auto(
'km_sample', -- points table
'km_result_auto', -- output table
'points', -- column name in point table
ARRAY[2,3,4,5,6], -- k values to try
'madlib.squared_dist_norm2', -- distance function
'madlib.avg', -- aggregate function
20, -- max iterations
0.001, -- minimum fraction of centroids reassigned to continue iterating
'both' -- k selection algorithm (simple silhouette and elbow)
);
\x on
SELECT * FROM km_result_auto_summary;
</pre> <pre class="result">-[ RECORD 1 ]-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
k | 6
centroids | {{14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065},{13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735},{14.285,1.855,2.475,16,112.5,3.56,3.44,0.29,2.075,7.275,0.955,3.15,1465},{13.87,2.12666666666667,2.57666666666667,16.9333333333333,106,2.63333333333333,2.75666666666667,0.303333333333333,2.01333333333333,5.32663333333333,1.03666666666667,3.44333333333333,1256.66666666667},{14.345,1.495,2.22,15,97.5,2.89,3.065,0.255,1.915,6.20995,1.045,3.2,1045},{13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}}
cluster_variance | {0,0,452.7633,8078.22646267333,5.346498005,0}
objective_fn | 8536.33626067833
frac_reassigned | 0
num_iterations | 3
silhouette | 0.954496117675027
elbow | -50296.3677976633
selection_algorithm | silhouette
</pre> The best selection above is made by the silhouette algorithm by default. Note that the elbow method may select a different k value as the best. To see results for all k values: <pre class="example">
SELECT * FROM km_result_auto ORDER BY k;
</pre> <pre class="result">
-[ RECORD 1 ]----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
k | 2
centroids | {{14.036,2.018,2.536,16.56,108.6,3.004,3.03,0.298,2.038,6.10598,1.004,3.326,1340},{13.872,1.814,2.376,15.56,88.2,2.806,2.928,0.288,1.844,5.35198,1.044,3.348,988}}
cluster_variance | {60672.638245208,90512.324426408}
objective_fn | 151184.962671616
frac_reassigned | 0
num_iterations | 3
silhouette | 0.872087020146542
elbow | -1056.25028126836
-[ RECORD 2 ]----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
k | 3
centroids | {{13.87,2.12666666666667,2.57666666666667,16.9333333333333,106,2.63333333333333,2.75666666666667,0.303333333333333,2.01333333333333,5.32663333333333,1.03666666666667,3.44333333333333,1256.66666666667},{14.285,1.855,2.475,16,112.5,3.56,3.44,0.29,2.075,7.275,0.955,3.15,1465},{13.872,1.814,2.376,15.56,88.2,2.806,2.928,0.288,1.844,5.35198,1.044,3.348,988}}
cluster_variance | {8078.22646267333,452.7633,90512.324426408}
objective_fn | 99043.3141890814
frac_reassigned | 0
num_iterations | 2
silhouette | 0.895849874221733
elbow | 20549.7753189989
-[ RECORD 3 ]----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
k | 4
centroids | {{14.02,1.765,2.385,16.05,105.75,2.845,3.1075,0.2725,2.2325,5.93495,1.04,3.3725,1085},{13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735},{14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75},{13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}}
cluster_variance | {14227.41709401,0,30561.74805,0}
objective_fn | 44789.16514401
frac_reassigned | 0
num_iterations | 3
silhouette | 0.877839150000438
elbow | 17535.7421610686
-[ RECORD 4 ]----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
k | 5
centroids | {{14.285,1.855,2.475,16,112.5,3.56,3.44,0.29,2.075,7.275,0.955,3.15,1465},{14.225,2.01,2.53,16.1,108.5,2.55,2.515,0.305,1.615,5.15,1.04,3.58,1292.5},{13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050},{14.04,1.8225,2.435,16.65,110,2.845,2.97,0.295,1.985,5.594975,1.0425,3.3125,972.5},{13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185}}
cluster_variance | {452.7633,329.8988,0,76176.4564000075,0}
objective_fn | 76959.1185000075
frac_reassigned | 0
num_iterations | 2
silhouette | 0.771207558995578
elbow | -28690.3421973961
-[ RECORD 5 ]----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
k | 6
centroids | {{14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065},{13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735},{14.285,1.855,2.475,16,112.5,3.56,3.44,0.29,2.075,7.275,0.955,3.15,1465},{13.87,2.12666666666667,2.57666666666667,16.9333333333333,106,2.63333333333333,2.75666666666667,0.303333333333333,2.01333333333333,5.32663333333333,1.03666666666667,3.44333333333333,1256.66666666667},{14.345,1.495,2.22,15,97.5,2.89,3.065,0.255,1.915,6.20995,1.045,3.2,1045},{13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}}
cluster_variance | {0,0,452.7633,8078.22646267333,5.346498005,0}
objective_fn | 8536.33626067833
frac_reassigned | 0
num_iterations | 3
silhouette | 0.954496117675027
elbow | -50296.3677976633
</pre></li>
<li>Simplified silhouette per point. Let's say we want the simplified silhouette coefficient for each point in the data set, for the case where <em>k=3</em>: <pre class="example">
DROP TABLE IF EXISTS km_points_silh;
SELECT * FROM madlib.simple_silhouette_points( 'km_sample', -- Input points table
'km_points_silh', -- Output table
'pid', -- Point ID column in input table
'points', -- Points column in input table
(SELECT centroids FROM km_result_auto WHERE k=3), -- centroids array
'madlib.squared_dist_norm2' -- Distance function
);
SELECT * FROM km_points_silh ORDER BY pid;
</pre> <pre class="result">
pid | centroid_id | neighbor_centroid_id | silh
-----+-------------+----------------------+-------------------
1 | 2 | 0 | 0.800019825058391
2 | 2 | 0 | 0.786712987562913
3 | 0 | 2 | 0.867496080386644
4 | 1 | 0 | 0.995466498952947
5 | 2 | 0 | 0.761551610381542
6 | 1 | 0 | 0.993950286967157
7 | 0 | 1 | 0.960621637528528
8 | 0 | 1 | 0.941493577405087
9 | 2 | 0 | 0.925822020308802
10 | 2 | 0 | 0.92536421766532
(10 rows)
</pre></li>
</ol>
<p><a class="anchor" id="background"></a></p><dl class="section user"><dt>Technical Background</dt><dd></dd></dl>
<p><b>k-means Algorithm</b> <br />
Formally, we wish to minimize the following objective function: </p><p class="formulaDsp">
\[ (c_1, \dots, c_k) \mapsto \sum_{i=1}^n \min_{j=1}^k \operatorname{dist}(x_i, c_j) \]
</p>
<p> In the most common case, \( \operatorname{dist} \) is the square of the Euclidean distance, though other distance measures can be used.</p>
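<p>That is, for points in \( \mathbb R^d \):</p>
<p class="formulaDsp">
\[ \operatorname{dist}(x, c) = \|x - c\|_2^2 = \sum_{t=1}^d (x_t - c_t)^2. \]
</p>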
<p>This problem is computationally difficult (NP-hard), yet the local-search heuristic proposed by Lloyd <a href="#kmeans-lit-4">[4]</a> performs reasonably well in practice. In fact, it is so ubiquitous today that it is often referred to as the <em>standard algorithm</em> or even just the <em>k-means algorithm</em>. It works as follows:</p>
<ol type="1">
<li>Seed the \( k \) centroids, meaning specify their initial positions (see below).</li>
<li>Repeat until convergence:<ol type="a">
<li>Assign each point to its closest centroid.</li>
<li>Move each centroid to a position that minimizes the sum of distances in this cluster.</li>
</ol>
</li>
<li>Convergence is achieved when no points change their assignments during step 2a.</li>
</ol>
<p>Since the objective function decreases in every step, this algorithm is guaranteed to converge to a local optimum.</p>
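<p>To make steps 2a and 2b concrete, a single iteration can be sketched with the module's own building blocks; this assumes the <code>km_sample</code> and <code>km_result</code> tables from the Examples section:</p>
<pre class="example">
-- Sketch of one Lloyd iteration.
-- Step 2a: assign each point to its closest current centroid.
-- Step 2b: move each centroid to the mean of its assigned points.
SELECT cluster_id, madlib.avg(points) AS new_centroid
FROM (
    SELECT (madlib.closest_column(centroids, points,
                                  'madlib.squared_dist_norm2')).column_id AS cluster_id,
           points
    FROM km_sample, km_result
) AS assignment
GROUP BY cluster_id;
</pre>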
<p>The quality of k-means is highly influenced by the choice of seeding. MADlib supports uniform-at-random sampling and k-means++ seeding, as well as manually supplied initial centroids obtained by other methods.</p>
<p>k-means++ seeding <a href="#kmeans-lit-1">[1]</a> starts with a single centroid chosen randomly among the input points. It then iteratively chooses new centroids from the input points until there are a total of k centroids. The probability for picking a particular point is proportional to its minimum distance to any existing centroid. Intuitively, k-means++ favors seedings where centroids are spread out over the whole range of the input points, while at the same time not being too susceptible to outliers.</p>
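<p>In symbols: if \( C \) denotes the set of centroids chosen so far, the next centroid is drawn from the input points with probability</p>
<p class="formulaDsp">
\[ \Pr[\text{pick } x_i] = \frac{\min_{c \in C} \operatorname{dist}(x_i, c)}{\sum_{j=1}^n \min_{c \in C} \operatorname{dist}(x_j, c)}. \]
</p>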
<p><a class="anchor" id="simplified_silhouette"></a><b>Silhouette</b> <br />
</p>
<p>To evaluate the validity of clusterings with different k values, the objective function is not ideal, because it decreases as k increases and is therefore biased toward selecting the largest k as the best result <a href="#kmeans-lit-6">[6]</a>. We therefore use other internal measures of cluster validity.</p>
<p>One such measure is silhouette score. The original formulation <a href="#kmeans-lit-7">[7]</a> computes the average distance of a data point to all the other data points in its own cluster, and to all the data points in the neighbouring cluster nearest to the data point. This is expensive for a large number of points since it requires the full pairwise distance matrix over all data.</p>
<p>In the simplified silhouette <a href="#kmeans-lit-3">[3]</a> which is used in MADlib, the distance of a data point to a cluster is represented with the distance to the cluster centroid instead of the average distance to all (other) data points in the cluster. This has the advantage of being much faster to compute, and can be shown to be comparable in performance to the full silhouette <a href="#kmeans-lit-8">[8]</a>.</p>
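<p>Concretely, writing \( a(x) \) for the distance from a point \( x \) to the centroid of its own cluster and \( b(x) \) for the distance to the nearest other centroid, the simplified silhouette of \( x \) is</p>
<p class="formulaDsp">
\[ s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}, \]
</p>
<p>which lies in \( [-1, 1] \); scores close to 1 indicate that a point is much closer to its own centroid than to any other centroid.</p>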
<p><a class="anchor" id="elbow"></a><b>Elbow Method</b> <br />
The elbow method considers the percentage of variance explained as a function of the number of clusters. The idea is not to add another cluster if it doesn't model the data better. Graphically, it means identifying the "elbow" in the curve of sum of squared errors vs. number of clusters (k). This was possibly first suggested in <a href="#kmeans-lit-9">[9]</a>. To locate the elbow, we use the 2nd derivative of the objective function, computed with the NumPy gradient function <a href="#kmeans-lit-2">[2]</a>.</p>
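<p>As a sketch of the computation, assuming for readability that the candidate k values are consecutive with unit spacing (numpy.gradient also handles non-uniform spacing), the elbow score at \( k \) is approximately the central second difference of the objective \( J \):</p>
<p class="formulaDsp">
\[ \frac{\partial^2 J}{\partial k^2}\bigg|_k \approx J(k-1) - 2\,J(k) + J(k+1). \]
</p>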
<p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p><a class="anchor" id="kmeans-lit-1"></a>[1] David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07), pp. 1027-1035.</p>
<p><a class="anchor" id="kmeans-lit-2"></a>[2] NumPy gradient function, <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.gradient.html">https://docs.scipy.org/doc/numpy/reference/generated/numpy.gradient.html</a></p>
<p><a class="anchor" id="kmeans-lit-3"></a>[3] E. R. Hruschka, L. N. C. Silva, R. J. G. B. Campello: Clustering Gene-Expression Data: A Hybrid Approach that Iterates Between k-Means and Evolutionary Search. In: Studies in Computational Intelligence - Hybrid Evolutionary Algorithms. pp. 313-335. Springer. 2007.</p>
<p><a class="anchor" id="kmeans-lit-4"></a>[4] Lloyd, Stuart: Least squares quantization in PCM. Technical Note, Bell Laboratories. Published much later in: IEEE Transactions on Information Theory 28(2), pp. 128-137. 1982.</p>
<p><a class="anchor" id="kmeans-lit-5"></a>[5] Leisch, Friedrich: A Toolbox for K-Centroids Cluster Analysis. In: Computational Statistics and Data Analysis, 51(2). pp. 526-544. 2006.</p>
<p><a class="anchor" id="kmeans-lit-6"></a>[6] Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern recognition letters 31(8), 651–666 (2010)</p>
<p><a class="anchor" id="kmeans-lit-7"></a>[7] Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Computational and Applied Mathematics 20: 53-65</p>
<p><a class="anchor" id="kmeans-lit-8"></a>[8] Wang F., Franco-Penya HH., Kelleher J.D., Pugh J., Ross R. (2017) An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity. In: Perner P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2017. Lecture Notes in Computer Science, vol 10358. Springer, Cham</p>
<p><a class="anchor" id="kmeans-lit-9"></a>[9] Robert L. Thorndike (December 1953). "Who Belongs in the Family?". Psychometrika. 18 (4): 267–276. doi:10.1007/BF02289263.</p>
<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd></dd></dl>
<p>File <a class="el" href="kmeans_8sql__in.html" title="Set of functions for k-means clustering. ">kmeans.sql_in</a> documenting the k-Means SQL functions</p>
<p><a class="el" href="group__grp__svec.html">Sparse Vectors</a></p>
<p><a class="el" href="linalg_8sql__in.html#acf6628dfa4d73dfce65a582aa5c5a3db" title="Given matrix and vector compute the column of that is closest to . ">closest_column()</a></p>
<p><a class="el" href="array__ops_8sql__in.html#af057b589f2a2cb1095caa99feaeb3d70" title="This function takes a 2-D array as the input and unnests it by one level. It returns a set of 1-D arr...">array_unnest_2d_to_1d()</a></p>
</div><!-- contents -->
</div><!-- doc-content -->
</body>
</html>