blob: ae43cff836bcda101cfeaba512b4368138ba7139 [file] [log] [blame]
<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: CountMin (Cormode-Muthukrishnan)</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'madlib.apache.org');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.18.0</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__countmin.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">CountMin (Cormode-Muthukrishnan)<div class="ingroups"><a class="el" href="group__grp__stats.html">Statistics</a> &raquo; <a class="el" href="group__grp__desc__stats.html">Descriptive Statistics</a> &raquo; <a class="el" href="group__grp__sketches.html">Cardinality Estimators</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> <ul>
<li>
<a href="#syntax">Syntax</a> </li>
<li>
<a href="#examples">Examples</a> </li>
<li>
<a href="#literature">Literature</a> </li>
<li>
<a href="#related">Related Topics</a> </li>
</ul>
</div><p>This module implements Cormode-Muthukrishnan <em>CountMin</em> sketches on integer values, implemented as a user-defined aggregate. It also provides scalar functions over the sketches to produce approximate counts, order statistics, and histograms.</p>
<p><a class="anchor" id="syntax"></a></p><dl class="section user"><dt>Syntax</dt><dd><ul>
<li>Get a sketch of a selected column specified by <em>col_name</em>. <pre class="syntax">
cmsketch( col_name )
</pre></li>
<li>Get the number of rows where <em>col_name = p</em>, computed from the sketch obtained from <code>cmsketch</code>. <pre class="syntax">
cmsketch_count( cmsketch,
p )
</pre></li>
<li>Get the number of rows where <em>col_name</em> is between <em>m</em> and <em>n</em> inclusive. <pre class="syntax">
cmsketch_rangecount( cmsketch,
m,
n )
</pre></li>
<li>Get the <em>k</em>th percentile of <em>col_name</em> where <em>count</em> specifies number of rows. <em>k</em> should be an integer between 1 to 99. <pre class="syntax">
cmsketch_centile( cmsketch,
k,
count )
</pre></li>
<li>Get the median of col_name where <em>count</em> specifies number of rows. This is equivalent to <code><a class="el" href="sketch_8sql__in.html#a2f2ab2fe3244515f5f73d49690e73b39">cmsketch_centile</a>(<em>cmsketch</em>,50,<em>count</em>)</code>. <pre class="syntax">
cmsketch_median( cmsketch,
count )
</pre></li>
<li>Get an n-bucket histogram for values between min and max for the column where each bucket has approximately the same width. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate. <pre class="syntax">
cmsketch_width_histogram( cmsketch,
min,
max,
n )
</pre></li>
<li>Get an n-bucket histogram for the column where each bucket has approximately the same count. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate. Note that an equi-depth histogram is equivalent to a spanning set of equi-spaced centiles. <pre class="syntax">
cmsketch_depth_histogram( cmsketch,
n )
</pre></li>
</ul>
</dd></dl>
<dl class="section note"><dt>Note</dt><dd>This is a <a href="https://www.postgresql.org/docs/current/static/xaggr.html">User Defined Aggregate</a> which returns the results when used in a query. Use "CREATE TABLE AS ", with the UDA as subquery if the results are to be stored. This is unlike the usual MADlib stored procedure interface which places the results in a table instead of returning it.</dd></dl>
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<ol type="1">
<li>Generate some data. <pre class="example">
CREATE TABLE data(class INT, a1 INT);
INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
</pre></li>
<li>Count number of rows where a1 = 2 in each class. Store results in a table. <pre class="example">
CREATE TABLE sketch_count AS
SELECT class,
cmsketch_count( cmsketch( a1 ), 2 )
FROM data GROUP BY data.class;
SELECT * FROM sketch_count;
</pre> Result: <pre class="result">
class | cmsketch_count
&#160;------+----------------
2 | 0
1 | 15000
(2 rows)
</pre></li>
<li>Count number of rows where a1 is between 3 and 6. <pre class="example">
SELECT class,
cmsketch_rangecount( cmsketch(a1), 3, 6 )
FROM data GROUP BY data.class;
</pre> Result: <pre class="result">
class | cmsketch_rangecount
&#160;------+---------------------
2 | 2000
1 | 10000
(2 rows)
</pre></li>
<li>Compute the 90th percentile of all of a1. <pre class="example">
SELECT cmsketch_centile( cmsketch( a1 ), 90, count(*) )
FROM data;
</pre> Result: <pre class="result">
cmsketch_centile
&#160;-----------------
3
(1 row)
</pre></li>
<li>Produce an equi-width histogram with 2 bins between 0 and 10. <pre class="example">
SELECT cmsketch_width_histogram( cmsketch( a1 ), 0, 10, 2 )
FROM data;
</pre> Result: <pre class="result">
cmsketch_width_histogram
&#160;-----------------------------------
[[0L, 4L, 35000], [5L, 10L, 2000]]
(1 row)
</pre></li>
<li>Produce an equi-depth histogram of a1 with 2 bins of approximately equal depth. <pre class="example">
SELECT cmsketch_depth_histogram( cmsketch( a1 ), 2 )
FROM data;
</pre> Result: <pre class="result">
cmsketch_depth_histogram
&#160;----------------------------------------------------------------------
[[-9223372036854775807L, 1, 10000], [2, 9223372036854775807L, 27000]]
(1 row)
</pre></li>
</ol>
<p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>[1] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) . <a href="http://dimacs.rutgers.edu/~graham/pubs/html/CormodeMuthukrishnan04CMLatin.html">http://dimacs.rutgers.edu/~graham/pubs/html/CormodeMuthukrishnan04CMLatin.html</a></p>
<p>[2] G. Cormode. Encyclopedia entry on 'Count-Min Sketch'. In L. Liu and M. T. Ozsu, editors, Encyclopedia of Database Systems, pages 511-516. Springer, 2009. <a href="http://dimacs.rutgers.edu/~graham/pubs/html/Cormode09b.html">http://dimacs.rutgers.edu/~graham/pubs/html/Cormode09b.html</a></p>
<p><a class="anchor" id="related"></a>File <a class="el" href="sketch_8sql__in.html" title="SQL functions for sketch-based approximations of descriptive statistics. ">sketch.sql_in</a> documenting the SQL functions.</p>
<p>Module grp_quantile for a different implementation of quantile function. </p>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Wed Mar 31 2021 20:45:48 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
</ul>
</div>
</body>
</html>