blob: 4a61e9f488e6cff6061000a35bd56009490c7d0c [file] [log] [blame]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<title>MADlib: Association Rules</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { searchBox.OnSelectItem(0); });
</script>
<script src="../mathjax/MathJax.js">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script>
</head>
<body>
<div id="top"><!-- do not remove this div! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td style="padding-left: 0.5em;">
<div id="projectname">MADlib
&#160;<span id="projectnumber">0.6</span> <span style="font-size:10pt; font-style:italic"><a href="../latest/./group__grp__assoc__rules.html"> A newer version is available</a></span>
</div>
<div id="projectbrief">User Documentation</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- Generated by Doxygen 1.7.5.1 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<script type="text/javascript" src="dynsections.js"></script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
</div>
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
initNavTree('group__grp__assoc__rules.html','');
</script>
<div id="doc-content">
<div class="header">
<div class="headertitle">
<div class="title">Association Rules</div> </div>
<div class="ingroups"><a class="el" href="group__grp__unsuplearn.html">Unsupervised Learning</a></div></div>
<div class="contents">
<div id="dynsection-0" onclick="return toggleVisibility(this)" class="dynheader closed" style="cursor:pointer;">
<img id="dynsection-0-trigger" src="closed.png" alt="+"/> Collaboration diagram for Association Rules:</div>
<div id="dynsection-0-summary" class="dynsummary" style="display:block;">
</div>
<div id="dynsection-0-content" class="dyncontent" style="display:none;">
<center><table><tr><td><div class="center"><iframe scrolling="no" frameborder="0" src="group__grp__assoc__rules.svg" width="363" height="40"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe>
</div>
</td></tr></table></center>
</div>
<dl class="user"><dt><b>About:</b></dt><dd>This module implements the association rules data mining technique on a transactional data set. Given the names of a table and the columns, minimum support and confidence values, this function generates all single and multidimensional association rules that meet the minimum thresholds.</dd></dl>
<p>Association rule mining is a widely used technique for discovering relationships between variables in a large data set (e.g items in a store that are commonly purchased together). The classic market basket analysis example using association rules is the "beer and diapers" rule. According to data mining urban legend, a study of customers' purchase behavior in a supermarket found that men often purchased beer and diapers together. After making this discovery, the managers strategically placed beer and diapers closer together on the shelves and saw a dramatic increase in sales. In addition to market basket analysis, association rules are also used in bioinformatics, web analytics, and several other fields.</p>
<p>This type of data mining algorithm uses transactional data. Every transaction event has a unique identification, and each transaction consists of a set of items (or itemset). Purchases are considered binary (either it was purchased or not), and this implementation does not take into consideration the quantity of each item. For the MADlib association rules function, it is assumed that the data is stored in two columns with one item and transaction id per row. Transactions with multiple items will span multiple rows with one row per item.</p>
<pre>
tran_id | product
---------+---------
1 | 1
1 | 2
1 | 3
1 | 4
2 | 3
2 | 4
2 | 5
3 | 1
3 | 4
3 | 6
...
</pre><p><b>Rules</b> </p>
<p>Association rules take the form "If X, then Y", where X and Y are non-empty itemsets. X and Y are called the antecedent and consequent, or the left-hand-side and right-hand-side, of the rule respectively. Using our previous example, the association rule may state "If {diapers}, then {beer}" with .2 support and .85 confidence.</p>
<p>Given any association rule "If X, then Y", the association rules function will also calculate the following metrics:</p>
<ul>
<li>Support: The ratio of transactions that contain X to all transactions, T <p class="formulaDsp">
\[ S (X) = \frac{Total X}{Total transactions} \]
</p>
</li>
</ul>
<ul>
<li>Confidence: The ratio of transactions that contain \( X,Y \) to transactions that contain \( X \). One could view this metric as the conditional probability of \( Y \) , given \( X \) . \( P(Y|X) \)</li>
</ul>
<p class="formulaDsp">
\[ C (X \Rightarrow Y) = \frac{s(X \cap Y )}{s(X)} \]
</p>
<ul>
<li>Lift: The ratio of observed support of \( X,Y \) to the expected support of \( X,Y \) , assuming \( X \) and \( Y \) are independent. <p class="formulaDsp">
\[ L (X \Rightarrow Y) = \frac{s(X \cap Y )}{s(X) \cdot s(Y)} \]
</p>
</li>
</ul>
<ul>
<li>Conviction: The ratio of expected support of \( X \) occurring without \( Y \) assuming \( X \) and \( \neg Y \) are independent, to the observed support of \( X \) occuring without \( Y \). If conviction is greater than 1, then this metric shows that incorrect predictions ( \( X \Rightarrow Y \) ) occur less often than if these two actions were independent. This metric can be viewed as the ratio that the association rule would be incorrect if the actions were independent (i.e. a conviction of 1.5 indicates that if the variables were independent, this rule would be incorrect 50% more often.)</li>
</ul>
<p class="formulaDsp">
\[ Conv (X \Rightarrow Y) = \frac{1 - S(Y)}{1 - C(X \Rightarrow Y)} \]
</p>
<p><b>Apriori</b> <b>algorithm</b> </p>
<p>Although there are many algorithms that generate association rules, the classic algorithm used is called Apriori (which we implemented in this module). It is a breadth-first search, as opposed to depth-first searches like eclat. Frequent itemsets of order \( n \) are generated from sets of order \( n - 1 \). Using the downward closure property, all sets must have frequent subsets. There are two steps in this algorithm; generating frequent itemsets, and using these itemsets to construct the association rules. A simplified version of the algorithm is as follows, and assumes a minimum level of support and confidence is provided:</p>
<p><em>Initial</em> <em>step</em> </p>
<ol type="1">
<li>Generate all itemsets of order 1</li>
<li>Eliminate itemsets that have support is less than minimum support</li>
</ol>
<p><em>Main</em> <em>algorithm</em> </p>
<ol type="1">
<li>For \( n \ge 2 \), generate itemsets of order \( n \) by combining the itemsets of order \( n - 1 \). This is done by doing the union of two itemsets that have identical items except one.</li>
<li>Eliminate itemsets that have (n-1) order subsets with insufficient support</li>
<li>Eliminate itemsets with insufficient support</li>
<li>Repeat until itemsets cannot be generated</li>
</ol>
<p><em>Association</em> <em>rule</em> <em>generation</em> </p>
<p>Given a frequent itemset \( A \) generated from the Apriori algorithm, and all subsets \( B \) , we generate rules such that \( B \Rightarrow (A - B) \) meets minimum confidence requirements.</p>
<dl class="user"><dt><b>Input:</b></dt><dd></dd></dl>
<p>The input data is expected to be of the following form: </p>
<pre>{TABLE|VIEW} <em>input_table</em> (
<em>trans_id</em> INTEGER,
<em>product</em> TEXT
)</pre><p>The algorithm will map the product names to consective integer ids starting at 1. If they are already structured this way, then the ids will not change.</p>
<dl class="user"><dt><b>Usage:</b></dt><dd><ul>
<li>Association rules can be called by: <pre>SELECT <a class="el" href="assoc__rules_8sql__in.html#a68a256d98b82ac15bac7df92e806f6f8">assoc_rules</a>(
<em>support</em>, <em>confidence</em>,'<em>tid_col</em>','<em>item_col</em>',
'<em>input_table</em>','<em>output_schema</em>', <em> verbose </em>
);</pre> This will generate all association rules that meet a minimum support of <em>support</em> and confidence of <em>confidence</em>.</li>
</ul>
</dd></dl>
<ul>
<li>The results containing the rules, support, confidence, lift, and conviction are stored in the table assoc_rules in the schema specified by <em>output_schema</em>. <pre>
Table "output_schema.assoc_rules"
Column | Type | Modifiers
------------+------------------+-----------
ruleid | integer |
pre | text[] |
post | text[] |
support | double precision |
confidence | double precision |
lift | double precision |
conviction | double precision |
Distributed by: (ruleid)</pre></li>
</ul>
<pre></pre><p> The <code>pre</code> and <code>post</code> are the itemsets of left and right hand sides of the association rule respectively. The <code>support</code>, <code>confidence</code>, <code>lift</code>, and <code>conviction</code> columns are calculated as mentioned in the about section.</p>
<dl class="user"><dt><b>Implementation Notes:</b></dt><dd></dd></dl>
<p>The association rules function will always create a table named assoc_rules. Please make a copy of this table before running the function again if you would like to keep multiple association rule tables.</p>
<dl class="user"><dt><b>Examples:</b></dt><dd></dd></dl>
<p>Let us take a look at some sample transactional data and generate association rules:</p>
<div class="fragment"><pre class="fragment">DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data (
trans_id INT
, product text
);
INSERT INTO test_data VALUES (1, <span class="stringliteral">&#39;beer&#39;</span>);
INSERT INTO test_data VALUES (1, <span class="stringliteral">&#39;diapers&#39;</span>);
INSERT INTO test_data VALUES (1, <span class="stringliteral">&#39;chips&#39;</span>);
INSERT INTO test_data VALUES (2, <span class="stringliteral">&#39;beer&#39;</span>);
INSERT INTO test_data VALUES (2, <span class="stringliteral">&#39;diapers&#39;</span>);
INSERT INTO test_data VALUES (3, <span class="stringliteral">&#39;beer&#39;</span>);
INSERT INTO test_data VALUES (3, <span class="stringliteral">&#39;diapers&#39;</span>);
INSERT INTO test_data VALUES (4, <span class="stringliteral">&#39;beer&#39;</span>);
INSERT INTO test_data VALUES (4, <span class="stringliteral">&#39;chips&#39;</span>);
INSERT INTO test_data VALUES (5, <span class="stringliteral">&#39;beer&#39;</span>);
INSERT INTO test_data VALUES (6, <span class="stringliteral">&#39;beer&#39;</span>);
INSERT INTO test_data VALUES (6, <span class="stringliteral">&#39;diapers&#39;</span>);
INSERT INTO test_data VALUES (6, <span class="stringliteral">&#39;chips&#39;</span>);
INSERT INTO test_data VALUES (7, <span class="stringliteral">&#39;beer&#39;</span>);
INSERT INTO test_data VALUES (7, <span class="stringliteral">&#39;diapers&#39;</span>);
</pre></div><p>Let \( min(support) = .25 \) and \( min(confidence) = .5 \), and the output schema be 'myschema'. For this example, we will set verbose to 'true' so that we have some insight into the progress of the function. We can now generate association rules as follows:</p>
<div class="fragment"><pre class="fragment">SELECT * FROM MADlib.assoc_rules (.25, .5, <span class="stringliteral">&#39;trans_id&#39;</span>, <span class="stringliteral">&#39;product&#39;</span>, <span class="stringliteral">&#39;test_data&#39;</span>,<span class="stringliteral">&#39;myschema&#39;</span>, <span class="keyword">false</span>);
</pre></div><p>This should generate this output:</p>
<div class="fragment"><pre class="fragment"> output_schema | output_table | total_rules | total_time
---------------+--------------+-------------+-----------------
myschema | <a class="code" href="assoc__rules_8sql__in.html#af9456adb6dad01e452415b9a0a5371dc">assoc_rules</a> | 7 | 00:00:03.162094
(1 row)
</pre></div><p>The association rules are stored in the myschema.assoc_rules:</p>
<div class="fragment"><pre class="fragment">select * from myschema.assoc_rules order by support desc;
ruleid | pre | post | support | confidence | lift | conviction
--------+-----------------+----------------+-------------------+-------------------+-------------------+-------------------
4 | {diapers} | {beer} | 0.714285714285714 | 1 | 1 | 0
2 | {beer} | {diapers} | 0.714285714285714 | 0.714285714285714 | 1 | 1
1 | {chips} | {beer} | 0.428571428571429 | 1 | 1 | 0
5 | {chips} | {beer,diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
6 | {chips,beer} | {diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
7 | {chips,diapers} | {beer} | 0.285714285714286 | 1 | 1 | 0
3 | {chips} | {diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
(7 rows)
</pre></div><dl class="see"><dt><b>See also:</b></dt><dd>File <a class="el" href="assoc__rules_8sql__in.html" title="The assoc_rules function computes association rules for a given set of data. The data is assumed to h...">assoc_rules.sql_in</a> documenting the SQL function. </dd></dl>
</div>
</div>
<div id="nav-path" class="navpath">
<ul>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
<a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(0)"><span class="SelectionMark">&#160;</span>All</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(1)"><span class="SelectionMark">&#160;</span>Files</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(2)"><span class="SelectionMark">&#160;</span>Functions</a></div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<li class="footer">Generated on Tue Apr 2 2013 14:57:03 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.7.5.1 </li>
</ul>
</div>
</body>
</html>