<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: Apriori Algorithm</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'madlib.apache.org');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.17.0</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__assoc__rules.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Apriori Algorithm<div class="ingroups"><a class="el" href="group__grp__unsupervised.html">Unsupervised Learning</a> &raquo; <a class="el" href="group__grp__association__rules.html">Association Rules</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> <ul>
<li>
<a href="#rules">Rules</a> </li>
<li>
<a href="#algorithm">Apriori Algorithm</a> </li>
<li>
<a href="#syntax">Function Syntax</a> </li>
<li>
<a href="#examples">Examples</a> </li>
<li>
<a href="#notes">Notes</a> </li>
<li>
<a href="#literature">Literature</a> </li>
<li>
<a href="#related">Related Topics</a> </li>
</ul>
</div><p>This module implements the association rules data mining technique on a transactional data set. Given the name of the input table, the names of its transaction id and item columns, and minimum support and confidence values, the function generates all single and multidimensional association rules that meet the minimum thresholds.</p>
<p>Association rule mining is a widely used technique for discovering relationships between variables in a large data set (e.g., items in a store that are commonly purchased together). The classic market basket analysis example using association rules is the "beer and diapers" rule. According to data mining urban legend, a study of customer purchase behavior in a supermarket found that men often purchased beer and diapers together. After making this discovery, the managers strategically placed beer and diapers closer together on the shelves and saw a dramatic increase in sales. In addition to market basket analysis, association rules are also used in bioinformatics, web analytics, and several other fields.</p>
<p>This type of data mining algorithm uses transactional data. Every transaction has a unique identifier and consists of a set of items (an itemset). Purchases are treated as binary (an item was either purchased or not), so this implementation ignores the quantity of each item. The MADlib association rules function assumes that the data is stored in two columns, with one transaction id and one item per row. A transaction with multiple items spans multiple rows, one row per item.</p>
<pre>
trans_id | product
---------+---------
1 | 1
1 | 2
1 | 3
1 | 4
2 | 3
2 | 4
2 | 5
3 | 1
3 | 4
3 | 6
...
</pre><p><a class="anchor" id="rules"></a></p><dl class="section user"><dt>Rules</dt><dd></dd></dl>
<p>Association rules take the form "If X, then Y", where X and Y are non-empty itemsets. X and Y are called the antecedent and consequent, or the left-hand side and right-hand side, of the rule respectively. Using our previous example, the association rule may state "If {diapers}, then {beer}" with 0.2 support and 0.85 confidence.</p>
<p>The following metrics are defined for any given itemset "X".</p><ul>
<li>Count: The number of transactions that contain X</li>
<li>Support: The ratio of transactions that contain X to all transactions, T <p class="formulaDsp">
\[ S (X) = \frac{\text{Total } X}{\text{Total transactions}} \]
</p>
</li>
</ul>
<p>Given any association rule "If X, then Y", the association rules function will also calculate the following metrics:</p><ul>
<li>Count: The number of transactions that contain X,Y</li>
<li>Support: The ratio of transactions that contain X,Y to all transactions, T <p class="formulaDsp">
\[ S (X \Rightarrow Y) = \frac{\text{Total}(X \cup Y)}{\text{Total transactions}} \]
</p>
</li>
<li>Confidence: The ratio of transactions that contain \( X,Y \) to transactions that contain \( X \). This metric can be viewed as the conditional probability of \( Y \) given \( X \), \( P(Y|X) \). <p class="formulaDsp">
\[ C (X \Rightarrow Y) = \frac{S(X \cup Y)}{S(X)} \]
</p>
</li>
<li>Lift: The ratio of the observed support of \( X,Y \) to the expected support of \( X,Y \), assuming \( X \) and \( Y \) are independent. <p class="formulaDsp">
\[ L (X \Rightarrow Y) = \frac{S(X \cup Y)}{S(X) \cdot S(Y)} \]
</p>
</li>
<li><p class="startli">Conviction: The ratio of the expected support of \( X \) occurring without \( Y \), assuming \( X \) and \( \neg Y \) are independent, to the observed support of \( X \) occurring without \( Y \). A conviction greater than 1 shows that incorrect predictions of the rule \( X \Rightarrow Y \) occur less often than they would if the two itemsets were independent. Conviction can be read as how much more often the rule would be incorrect under independence (e.g., a conviction of 1.5 indicates that, if the variables were independent, the rule would be incorrect 50% more often).</p>
<p class="formulaDsp">
\[ Conv (X \Rightarrow Y) = \frac{1 - S(Y)}{1 - C(X \Rightarrow Y)} \]
</p>
</li>
</ul>
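<p>As an illustration only, the metrics above can be computed directly from one-item-per-row data. The following Python sketch is not part of MADlib; the helper names <code>to_transactions</code>, <code>support</code>, and <code>rule_metrics</code> are hypothetical, and the rows mirror the <code>test_data</code> table used in the Examples section. The conviction formula divides by zero when confidence is 1; the sketch returns infinity in that case, which is an assumption made here (the example results on this page report 0 for such rules).</p>

```python
from collections import defaultdict

# (trans_id, product) rows, mirroring the test_data table in the Examples section.
ROWS = [
    (1, "beer"), (1, "diapers"), (1, "chips"),
    (2, "beer"), (2, "diapers"),
    (3, "beer"), (3, "diapers"),
    (4, "beer"), (4, "chips"),
    (5, "beer"),
    (6, "beer"), (6, "diapers"), (6, "chips"),
    (7, "beer"), (7, "diapers"),
]


def to_transactions(rows):
    """Group one-item-per-row data into a set of items per transaction id."""
    baskets = defaultdict(set)
    for trans_id, product in rows:
        baskets[trans_id].add(product)
    return list(baskets.values())


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in transactions if itemset.issubset(basket))
    return hits / len(transactions)


def rule_metrics(lhs, rhs, transactions):
    """Support, confidence, lift and conviction for the rule lhs => rhs.

    Assumes lhs and rhs each occur in at least one transaction.
    """
    s_rule = support(lhs | rhs, transactions)
    s_lhs = support(lhs, transactions)
    s_rhs = support(rhs, transactions)
    confidence = s_rule / s_lhs
    lift = s_rule / (s_lhs * s_rhs)
    # The formula is undefined when confidence is 1; returning infinity is a
    # choice made in this sketch, not MADlib's behavior.
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - s_rhs) / (1 - confidence)
    return {
        "support": s_rule,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


TRANSACTIONS = to_transactions(ROWS)
METRICS = rule_metrics({"chips"}, {"diapers"}, TRANSACTIONS)
print(METRICS)
```

<p>For the rule {chips} =&gt; {diapers}, this gives a support of 2/7, a confidence of 2/3, a lift of 14/15, and a conviction of 6/7, which agrees with the corresponding row in the example results.</p>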
<p><a class="anchor" id="algorithm"></a></p><dl class="section user"><dt>Apriori Algorithm</dt><dd></dd></dl>
<p>Although there are many algorithms that generate association rules, the classic algorithm is Apriori [1], which is the one implemented in this module. It is a breadth-first search, as opposed to depth-first searches like Eclat. Frequent itemsets of order \( n \) are generated from frequent itemsets of order \( n - 1 \). By the downward closure property, every subset of a frequent itemset must itself be frequent, so a candidate with an infrequent subset can be discarded without counting it. The algorithm has two steps: generating frequent itemsets, and using these itemsets to construct the association rules. A simplified version of the algorithm follows; it assumes that minimum levels of support and confidence are provided:</p>
<p><em>Initial</em> <em>step</em> </p><ol type="1">
<li>Generate all itemsets of order 1.</li>
<li>Eliminate itemsets that have support less than minimum support.</li>
</ol>
<p><em>Main</em> <em>algorithm</em> </p><ol type="1">
<li>For \( n \ge 2 \), generate itemsets of order \( n \) by combining the itemsets of order \( n - 1 \). Each candidate is the union of two itemsets of order \( n - 1 \) that differ in exactly one item.</li>
<li>Eliminate itemsets that have (n-1) order subsets with insufficient support.</li>
<li>Eliminate itemsets with insufficient support.</li>
<li>Repeat until itemsets cannot be generated, or maximum itemset size is exceeded.</li>
</ol>
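<p>The initial step and main algorithm above can be sketched in a few lines of Python. This is a simplified, in-memory illustration under stated assumptions, not MADlib's implementation: the function name <code>frequent_itemsets</code> is hypothetical, transactions are assumed to be given as Python sets, and support is recounted by scanning every transaction.</p>

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_itemset_size=10):
    """Return {frozenset: support} for every itemset meeting min_support."""
    total = len(transactions)

    def support(itemset):
        hits = sum(1 for basket in transactions if itemset.issubset(basket))
        return hits / total

    # Initial step: keep the order-1 itemsets that meet minimum support.
    items = {item for basket in transactions for item in basket}
    current = {}
    for item in items:
        candidate = frozenset([item])
        s = support(candidate)
        if s >= min_support:
            current[candidate] = s

    frequent = dict(current)
    order = 2
    while current and max_itemset_size >= order:
        # 1. Join order-(n-1) itemsets that differ in exactly one item.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == order:
                candidates.add(union)
        # 2. Drop candidates that have an infrequent (n-1)-order subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, order - 1))
        }
        # 3. Drop candidates with insufficient support.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        # 4. Repeat with the next order until nothing new is generated.
        order += 1
    return frequent


# The seven transactions from the Examples section, as sets of items.
TRANSACTIONS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"beer"},
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
]
FREQUENT = frequent_itemsets(TRANSACTIONS, 0.25)
print(len(FREQUENT))
```

<p>Passing a smaller <code>max_itemset_size</code> stops the loop earlier, which is the same lever the note on combinatorial explosion recommends for controlling run time.</p>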
<p><em>Association</em> <em>rule</em> <em>generation</em> </p>
<p>Given a frequent itemset \( A \) generated by the Apriori algorithm, we consider every non-empty proper subset \( B \) of \( A \) and keep the rule \( B \Rightarrow (A - B) \) if it meets the minimum confidence requirement.</p>
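<p>A minimal Python sketch of this rule generation step follows. It is illustrative only: the function name <code>rules_from_itemset</code> and the <code>SUPPORT_OF</code> lookup table are hypothetical, and the support values are taken by hand from the seven-transaction data set in the Examples section.</p>

```python
from itertools import combinations


def rules_from_itemset(itemset, support_of, min_confidence):
    """Return (lhs, rhs, confidence) for each rule lhs => itemset - lhs."""
    whole = frozenset(itemset)
    items = sorted(whole)
    rules = []
    # Every non-empty proper subset of the itemset is a candidate left-hand side.
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            lhs = frozenset(combo)
            confidence = support_of[whole] / support_of[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, whole - lhs, confidence))
    return rules


# Supports of the frequent itemsets in the Examples data (7 transactions).
SUPPORT_OF = {
    frozenset({"beer"}): 7 / 7,
    frozenset({"diapers"}): 5 / 7,
    frozenset({"chips"}): 3 / 7,
    frozenset({"beer", "diapers"}): 5 / 7,
    frozenset({"beer", "chips"}): 3 / 7,
    frozenset({"diapers", "chips"}): 2 / 7,
    frozenset({"beer", "diapers", "chips"}): 2 / 7,
}
RULES = rules_from_itemset({"beer", "diapers", "chips"}, SUPPORT_OF, 0.5)
for lhs, rhs, confidence in RULES:
    print(sorted(lhs), "=>", sorted(rhs), round(confidence, 3))
```

<p>With a minimum confidence of 0.5, three of the six candidate rules survive for this itemset, matching the three-item rules in the first example result.</p>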
<dl class="section note"><dt>Note</dt><dd>Beware of combinatorial explosion. The Apriori algorithm can potentially generate a huge number of rules, even for fairly simple data sets, resulting in run times that are unreasonably long. To avoid this, it is recommended to cap the maximum itemset size to a small number to start with, then increase it gradually. Similarly, <em>max_LHS_size</em> and <em>max_RHS_size</em> limit the number of items on the LHS and RHS of the rules and can significantly reduce run times. <em>Support</em> and <em>confidence</em> values are parameters that can also be used to control rule generation.</dd></dl>
<p><a class="anchor" id="syntax"></a></p><dl class="section user"><dt>Function Syntax</dt><dd>The association rules function has the following syntax: <pre class="syntax">
assoc_rules( support,
confidence,
tid_col,
item_col,
input_table,
output_schema,
verbose,
max_itemset_size,
max_LHS_size,
max_RHS_size
);</pre> This generates all association rules that satisfy the specified minimum <em>support</em> and <em>confidence</em>.</dd></dl>
<p><b>Arguments</b> </p><dl class="arglist">
<dt>support </dt>
<dd><p class="startdd">Minimum level of support needed for each itemset to be included in result.</p>
<p class="enddd"></p>
</dd>
<dt>confidence </dt>
<dd><p class="startdd">Minimum level of confidence needed for each rule to be included in result.</p>
<p class="enddd"></p>
</dd>
<dt>tid_col </dt>
<dd><p class="startdd">Name of the column storing the transaction ids.</p>
<p class="enddd"></p>
</dd>
<dt>item_col </dt>
<dd><p class="startdd">Name of the column storing the products.</p>
<p class="enddd"></p>
</dd>
<dt>input_table </dt>
<dd><p class="startdd">Name of the table containing the input data.</p>
<p>The input data is expected to be of the following form: </p><pre>{TABLE|VIEW} <em>input_table</em> (
<em>trans_id</em> INTEGER,
<em>product</em> TEXT
)</pre><p>The algorithm maps the product names to consecutive integer ids starting at 1. If they are already structured this way, then the ids will not change. </p>
<p class="enddd"></p>
</dd>
<dt>output_schema </dt>
<dd><p class="startdd">The name of the schema where the final results will be stored. The schema must be created before calling the function. Alternatively, use <code>NULL</code> to output to the current schema.</p>
<p>The results containing the rules, support, count, confidence, lift, and conviction are stored in the table <code>assoc_rules</code> in the schema specified by <code>output_schema</code>.</p>
<p>The table has the following columns. </p><table class="output">
<tr>
<th>ruleid </th><td>integer </td></tr>
<tr>
<th>pre </th><td>text[] </td></tr>
<tr>
<th>post </th><td>text[] </td></tr>
<tr>
<th>count </th><td>integer </td></tr>
<tr>
<th>support </th><td>double </td></tr>
<tr>
<th>confidence </th><td>double </td></tr>
<tr>
<th>lift </th><td>double </td></tr>
<tr>
<th>conviction </th><td>double </td></tr>
</table>
<p>On Greenplum Database, the table is distributed by the <code>ruleid</code> column.</p>
<p>The <code>pre</code> and <code>post</code> columns hold the itemsets on the left-hand and right-hand sides of the association rule, respectively. The <code>support</code>, <code>confidence</code>, <code>lift</code>, and <code>conviction</code> columns are calculated as described earlier. </p>
<p class="enddd"></p>
</dd>
<dt>verbose (optional) </dt>
<dd><p class="startdd">BOOLEAN, default: FALSE. Determines if details are printed for each iteration as the algorithm progresses.</p>
<p class="enddd"></p>
</dd>
<dt>max_itemset_size (optional) </dt>
<dd><p class="startdd">INTEGER, default: 10. Determines the maximum size of frequent itemsets that are used for generating association rules. Must be 2 or more. This parameter can be used to reduce run time for data sets where itemset size is large, which is a common situation. If your query is not returning or is running too long, try using a lower value for this parameter.</p>
<p class="enddd"></p>
</dd>
<dt>max_LHS_size (optional) </dt>
<dd><p class="startdd">INTEGER, default: NULL. Determines the maximum size of the left hand side of the rule. Must be 1 or more. This parameter can be used to reduce run time.</p>
<p class="enddd"></p>
</dd>
<dt>max_RHS_size (optional) </dt>
<dd>INTEGER, default: NULL. Determines the maximum size of the right hand side of the rule. Must be 1 or more. This parameter can be used to reduce run time. For example, setting to 1 can significantly reduce run time if this makes sense for your use case. (The <em>apriori</em> algorithm in the R package <em>arules</em> [2] only supports a RHS of 1.) </dd>
</dl>
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<p>Let's look at some sample transactional data and generate association rules.</p>
<ol type="1">
<li>Create an input dataset: <pre class="example">
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data (
trans_id INT,
product TEXT
);
INSERT INTO test_data VALUES (1, 'beer');
INSERT INTO test_data VALUES (1, 'diapers');
INSERT INTO test_data VALUES (1, 'chips');
INSERT INTO test_data VALUES (2, 'beer');
INSERT INTO test_data VALUES (2, 'diapers');
INSERT INTO test_data VALUES (3, 'beer');
INSERT INTO test_data VALUES (3, 'diapers');
INSERT INTO test_data VALUES (4, 'beer');
INSERT INTO test_data VALUES (4, 'chips');
INSERT INTO test_data VALUES (5, 'beer');
INSERT INTO test_data VALUES (6, 'beer');
INSERT INTO test_data VALUES (6, 'diapers');
INSERT INTO test_data VALUES (6, 'chips');
INSERT INTO test_data VALUES (7, 'beer');
INSERT INTO test_data VALUES (7, 'diapers');
</pre></li>
<li>Set the minimum support to 0.25 and the minimum confidence to 0.5, and set the output schema to <code>NULL</code> so that the results are written to the current schema. In this example we set verbose to TRUE to get some insight into the progress of the function. We can now generate association rules as follows: <pre class="example">
SELECT * FROM madlib.assoc_rules( .25, -- Support
.5, -- Confidence
'trans_id', -- Transaction id col
'product', -- Product col
'test_data', -- Input data
NULL, -- Output schema
TRUE -- Verbose output
);
</pre> Result (iteration details not shown): <pre class="result">
output_schema | output_table | total_rules | total_time
---------------+--------------+-------------+-----------------
public | assoc_rules | 7 | 00:00:00.569254
(1 row)
</pre> The association rules are stored in the assoc_rules table: <pre class="example">
SELECT * FROM assoc_rules
ORDER BY support DESC, confidence DESC;
</pre> Result: <pre class="result">
ruleid | pre | post | count | support | confidence | lift | conviction
--------+-----------------+----------------+-------+-------------------+-------------------+-------------------+-------------------
2 | {diapers} | {beer} | 5 | 0.714285714285714 | 1 | 1 | 0
6 | {beer} | {diapers} | 5 | 0.714285714285714 | 0.714285714285714 | 1 | 1
5 | {chips} | {beer} | 3 | 0.428571428571429 | 1 | 1 | 0
4 | {chips,diapers} | {beer} | 2 | 0.285714285714286 | 1 | 1 | 0
1 | {chips} | {diapers,beer} | 2 | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
7 | {chips} | {diapers} | 2 | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
3 | {beer,chips} | {diapers} | 2 | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
(7 rows)
</pre></li>
<li>Limit the association rules to those generated from itemsets of size at most 2. This parameter is a good way to reduce long run times. <pre class="example">
SELECT * FROM madlib.assoc_rules( .25, -- Support
.5, -- Confidence
'trans_id', -- Transaction id col
'product', -- Product col
'test_data', -- Input data
NULL, -- Output schema
TRUE, -- Verbose output
2 -- Max itemset size
);
</pre> Result (iteration details not shown): <pre class="result">
output_schema | output_table | total_rules | total_time
---------------+--------------+-------------+-----------------
public | assoc_rules | 4 | 00:00:00.565176
(1 row)
</pre> The association rules are again stored in the assoc_rules table: <pre class="example">
SELECT * FROM assoc_rules
ORDER BY support DESC, confidence DESC;
</pre> Result: <pre class="result">
ruleid | pre | post | count | support | confidence | lift | conviction
--------+-----------+-----------+-------+-------------------+-------------------+-------------------+-------------------
1 | {diapers} | {beer} | 5 | 0.714285714285714 | 1 | 1 | 0
2 | {beer} | {diapers} | 5 | 0.714285714285714 | 0.714285714285714 | 1 | 1
3 | {chips} | {beer} | 3 | 0.428571428571429 | 1 | 1 | 0
4 | {chips} | {diapers} | 2 | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
(4 rows)
</pre></li>
<li>The output table can be post-processed to filter the results. For example, to keep only rules with a single item on the left hand side and a particular item on the right hand side: <pre class="example">
SELECT * FROM assoc_rules WHERE array_upper(pre,1) = 1 AND post = array['beer'];
</pre> Result: <pre class="result">
ruleid | pre | post | count | support | confidence | lift | conviction
--------+-----------+--------+-------+-------------------+------------+------+------------
1 | {diapers} | {beer} | 5 | 0.714285714285714 | 1 | 1 | 0
3 | {chips} | {beer} | 3 | 0.428571428571429 | 1 | 1 | 0
(2 rows)
</pre></li>
<li>Limit the size of the right hand side to 1. This parameter is a good way to reduce long run times. <pre class="example">
SELECT * FROM madlib.assoc_rules( .25, -- Support
.5, -- Confidence
'trans_id', -- Transaction id col
'product', -- Product col
'test_data', -- Input data
NULL, -- Output schema
TRUE, -- Verbose output
NULL, -- Max itemset size
NULL, -- Max LHS size
1 -- Max RHS size
);
</pre> Result (iteration details not shown): <pre class="result">
output_schema | output_table | total_rules | total_time
---------------+--------------+-------------+-----------------
public | assoc_rules | 6 | 00:00:00.031362
(1 row)
</pre> The association rules are again stored in the assoc_rules table: <pre class="example">
SELECT * FROM assoc_rules
ORDER BY support DESC, confidence DESC;
</pre> Result: <pre class="result">
ruleid | pre | post | count | support | confidence | lift | conviction
--------+-----------------+-----------+-------+-------------------+-------------------+-------------------+-------------------
4 | {diapers} | {beer} | 5 | 0.714285714285714 | 1 | 1 | 0
3 | {beer} | {diapers} | 5 | 0.714285714285714 | 0.714285714285714 | 1 | 1
1 | {chips} | {beer} | 3 | 0.428571428571429 | 1 | 1 | 0
6 | {diapers,chips} | {beer} | 2 | 0.285714285714286 | 1 | 1 | 0
2 | {chips} | {diapers} | 2 | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
5 | {beer,chips} | {diapers} | 2 | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
(6 rows)
</pre></li>
</ol>
<p><a class="anchor" id="notes"></a></p><dl class="section user"><dt>Notes</dt><dd></dd></dl>
<p>The association rules function always creates a table named <code>assoc_rules</code>. Make a copy of this table before running the function again if you would like to keep multiple association rule tables. This behavior will be improved in a later release.</p>
<p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>[1] <a href="https://en.wikipedia.org/wiki/Apriori_algorithm">https://en.wikipedia.org/wiki/Apriori_algorithm</a></p>
<p>[2] <a href="https://cran.r-project.org/web/packages/arules/arules.pdf">https://cran.r-project.org/web/packages/arules/arules.pdf</a></p>
<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd></dd></dl>
<p>File <a class="el" href="assoc__rules_8sql__in.html" title="The assoc_rules function computes association rules for a given set of data. The data is assumed to h...">assoc_rules.sql_in</a> documenting the SQL function. </p>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Mon Apr 6 2020 21:46:59 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
</ul>
</div>
</body>
</html>