<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<title>MADlib: Naive Bayes Classification</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { searchBox.OnSelectItem(0); });
</script>
<script src="../mathjax/MathJax.js">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script>
</head>
<body>
<div id="top"><!-- do not remove this div! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td style="padding-left: 0.5em;">
<div id="projectname">MADlib
&#160;<span id="projectnumber">0.6</span> <span style="font-size:10pt; font-style:italic"><a href="../latest/./group__grp__bayes.html"> A newer version is available</a></span>
</div>
<div id="projectbrief">User Documentation</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- Generated by Doxygen 1.7.5.1 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<script type="text/javascript" src="dynsections.js"></script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
</div>
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
initNavTree('group__grp__bayes.html','');
</script>
<div id="doc-content">
<div class="header">
<div class="headertitle">
<div class="title">Naive Bayes Classification</div> </div>
<div class="ingroups"><a class="el" href="group__grp__suplearn.html">Supervised Learning</a></div></div>
<div class="contents">
<div id="dynsection-0" onclick="return toggleVisibility(this)" class="dynheader closed" style="cursor:pointer;">
<img id="dynsection-0-trigger" src="closed.png" alt="+"/> Collaboration diagram for Naive Bayes Classification:</div>
<div id="dynsection-0-summary" class="dynsummary" style="display:block;">
</div>
<div id="dynsection-0-content" class="dyncontent" style="display:none;">
<center><table><tr><td><div class="center"><iframe scrolling="no" frameborder="0" src="group__grp__bayes.svg" width="403" height="40"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe>
</div>
</td></tr></table></center>
</div>
<dl class="user"><dt><b>About:</b></dt><dd></dd></dl>
<p>Naive Bayes refers to a stochastic model where all independent variables \( a_1, \dots, a_n \) (often referred to as attributes in this context) independently contribute to the probability that a data point belongs to a certain class \( c \). In detail, <b>Bayes'</b> theorem states that </p>
<p class="formulaDsp">
\[ \Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n) = \frac{\Pr(C = c) \cdot \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)} {\Pr(A_1 = a_1, \dots, A_n = a_n)} \,, \]
</p>
<p> and the <b>naive</b> assumption is that </p>
<p class="formulaDsp">
\[ \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c) = \prod_{i=1}^n \Pr(A_i = a_i \mid C = c) \,. \]
</p>
<p> Naive Bayes classification estimates feature probabilities and class priors using maximum-likelihood estimation or Laplace smoothing. These parameters are then used to classify new data.</p>
<p>A Naive Bayes classifier computes the following formula: </p>
<p class="formulaDsp">
\[ \text{classify}(a_1, ..., a_n) = \arg\max_c \left\{ \Pr(C = c) \cdot \prod_{i=1}^n \Pr(A_i = a_i \mid C = c) \right\} \]
</p>
<p> where \( c \) ranges over all classes in the training data and the probabilities are estimated from the training set. There are different ways to estimate the feature probabilities \( P(A_i = a \mid C = c) \). The maximum-likelihood estimate uses the relative frequencies observed in the training set: </p>
<p class="formulaDsp">
\[ P(A_i = a \mid C = c) = \frac{\#(c,i,a)}{\#c} \]
</p>
<p> where</p>
<ul>
<li>\( \#(c,i,a) \) denotes the number of training samples in which attribute \( i \) has value \( a \) and the class is \( c \),</li>
<li>\( \#c \) denotes the number of training samples in which the class is \( c \).</li>
</ul>
<p>Since the maximum-likelihood estimate is sometimes zero, you might want to use a "smoothed" estimate. To do this, you add a number of "virtual" samples and assume that these samples are evenly distributed among the values assumed by attribute \( i \) (that is, the set of all values observed for attribute \( i \) in any class):</p>
<p class="formulaDsp">
\[ P(A_i = a \mid C = c) = \frac{\#(c,i,a) + s}{\#c + s \cdot \#i} \]
</p>
<p> where</p>
<ul>
<li>\( \#i \) denotes the number of distinct values of attribute \( i \) (over all classes),</li>
<li>\( s \geq 0 \) denotes the smoothing factor.</li>
</ul>
<p>The case \( s = 1 \) is known as "Laplace smoothing". The case \( s = 0 \) trivially reduces to maximum-likelihood estimates.</p>
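<p>For a quick illustration with hypothetical counts, suppose a value \( a \) never co-occurs with class \( c \) in the training set, so that \( \#(c,i,a) = 0 \), and let \( \#c = 3 \) and \( \#i = 2 \). The maximum-likelihood estimate is then \( 0/3 = 0 \), whereas Laplace smoothing (\( s = 1 \)) gives </p>
<p class="formulaDsp">
\[ P(A_i = a \mid C = c) = \frac{0 + 1}{3 + 1 \cdot 2} = \frac{1}{5} \,, \]
</p>
<p> so the product in the classification formula is no longer forced to zero by a single unseen attribute value.</p>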
<p><b>Note:</b> (1) The probabilities computed on PostgreSQL and on Greenplum Database may differ slightly due to the nature of floating-point computation. Usually this is not important. However, if a data point has </p>
<p class="formulaDsp">
\[ P(C=c_i \mid A) \approx P(C=c_j \mid A) \]
</p>
<p> for two classes, it might be classified differently on PostgreSQL than on Greenplum. This can lead to differing classifications for some data sets, but it should not affect the quality of the results.</p>
<p>(2) When two classes tie for the highest probability, the classification result is an array containing both classes, in arbitrary order.</p>
<p>(3) The current implementation of Naive Bayes classification is only suitable for discrete (categorical) attributes.</p>
<p>For continuous data, a typical assumption, usually appropriate for small data sets, is that the continuous values associated with each class follow a Gaussian distribution, from which the probabilities \( P(A_i = a \mid C = c) \) can be estimated. Another common technique for handling continuous values, which works better for large data sets, is binning: the continuous values are discretized into categorical bins. These approaches are not currently implemented and are planned for future releases.</p>
<p>(4) One can still provide floating-point data to the Naive Bayes classification functions. Floating-point numbers can be used as symbolic substitutes for categorical data; classification works best if there are sufficient data points for each distinct floating-point value. However, if floating-point numbers are used as continuous data, no warning is raised and the result may not be as expected.</p>
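<p>As an illustration of the binning technique mentioned above, the following sketch discretizes continuous measurements into categorical bins before training. The table <code>raw_measurements</code>, its columns <code>x1</code> and <code>x2</code>, and the bucket boundaries are assumptions made for this sketch only; it simply uses PostgreSQL's <code>width_bucket</code> function to produce the <code>INTEGER[]</code> attribute arrays expected by the training functions described below.</p>
<div class="fragment"><pre class="fragment">
-- Hypothetical table with a class label and two continuous measurements.
-- Each measurement is discretized into 4 bins over an assumed value range,
-- yielding the INTEGER[] attribute array expected by the training functions.
sql&gt; CREATE VIEW training_binned AS
     SELECT id,
            class,
            ARRAY[ width_bucket(x1, 0.0, 10.0, 4),
                   width_bucket(x2, 0.0,  1.0, 4) ] AS attributes
       FROM raw_measurements;
</pre></div>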
<dl class="user"><dt><b>Input:</b></dt><dd></dd></dl>
<p>The <b>training data</b> is expected to be of the following form: </p>
<pre>{TABLE|VIEW} <em>trainingSource</em> (
...
<em>trainingClassColumn</em> INTEGER,
<em>trainingAttrColumn</em> INTEGER[],
...
)</pre><p>The <b>data to classify</b> is expected to be of the following form: </p>
<pre>{TABLE|VIEW} <em>classifySource</em> (
...
<em>classifyKeyColumn</em> ANYTYPE,
<em>classifyAttrColumn</em> INTEGER[],
...
)</pre><dl class="user"><dt><b>Usage:</b></dt><dd></dd></dl>
<ul>
<li>Precompute feature probabilities and class priors: <pre>SELECT <a class="el" href="bayes_8sql__in.html#aeb4eae7843dd789cc38d5fc57f4ccfb2">create_nb_prepared_data_tables</a>(
'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
<em>numAttrs</em>, '<em>featureProbsName</em>', '<em>classPriorsName</em>'
);</pre> This creates table <em>featureProbsName</em> for storing feature probabilities and table <em>classPriorsName</em> for storing the class priors.</li>
<li>Perform Naive Bayes classification: <pre>SELECT <a class="el" href="bayes_8sql__in.html#a798402280fc6db710957ae3ab58767e0">create_nb_classify_view</a>(
'<em>featureProbsName</em>', '<em>classPriorsName</em>',
'<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
<em>numAttrs</em>, '<em>destName</em>'
);</pre> This creates the view <code><em>destName</em></code> mapping <em>classifyKeyColumn</em> to the Naive Bayes classification: <pre>key | nb_classification
----+------------------
...</pre></li>
<li>Compute Naive Bayes probabilities: <pre>SELECT <a class="el" href="bayes_8sql__in.html#a163afffd0c845d325f060f74bcf02243">create_nb_probs_view</a>(
'<em>featureProbsName</em>', '<em>classPriorsName</em>',
'<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
<em>numAttrs</em>, '<em>destName</em>'
);</pre> This creates the view <code><em>destName</em></code> mapping <em>classifyKeyColumn</em> and every single class to the Naive Bayes probability: <pre>key | class | nb_prob
----+-------+--------
...</pre></li>
<li>Ad-hoc execution (no precomputation): Functions <a class="el" href="bayes_8sql__in.html#a798402280fc6db710957ae3ab58767e0">create_nb_classify_view</a> and <a class="el" href="bayes_8sql__in.html#a163afffd0c845d325f060f74bcf02243">create_nb_probs_view</a> can be used in an ad-hoc fashion without the above precomputation step. In this case, replace the function arguments <pre>'<em>featureProbsName</em>', '<em>classPriorsName</em>'</pre> with <pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>'</pre> (see the sketch after this list).</li>
</ul>
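<p>For example, an ad-hoc classification call might look as follows. This is only a sketch that combines the argument lists shown above; all table and column names are the placeholders used in this section: </p>
<pre>SELECT madlib.create_nb_classify_view(
    '<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
    '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
    <em>numAttrs</em>, '<em>destName</em>'
);</pre>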
<dl class="user"><dt><b>Examples:</b></dt><dd></dd></dl>
<p>The following is an extremely simplified example of the precomputed usage described above (option 1), which can be verified by hand.</p>
<ol type="1">
<li>The training and the classification data: <div class="fragment"><pre class="fragment">
sql&gt; SELECT * FROM training;
id | class | attributes
----+-------+------------
1 | 1 | {1,2,3}
2 | 1 | {1,2,1}
3 | 1 | {1,4,3}
4 | 2 | {1,2,2}
5 | 2 | {0,2,2}
6 | 2 | {0,1,3}
(6 rows)
sql&gt; SELECT * FROM toclassify;
id | attributes
----+------------
1 | {0,2,1}
2 | {1,2,3}
(2 rows)
</pre></div></li>
<li>Precompute feature probabilities and class priors <div class="fragment"><pre class="fragment">
sql&gt; SELECT madlib.create_nb_prepared_data_tables(
'training', 'class', 'attributes', 3, 'nb_feature_probs', 'nb_class_priors');
</pre></div></li>
<li>Optionally check the contents of the precomputed tables: <div class="fragment"><pre class="fragment">
sql&gt; SELECT * FROM nb_class_priors;
class | class_cnt | all_cnt
-------+-----------+---------
1 | 3 | 6
2 | 3 | 6
(2 rows)
sql&gt; SELECT * FROM nb_feature_probs;
class | attr | value | cnt | attr_cnt
-------+------+-------+-----+----------
1 | 1 | 0 | 0 | 2
1 | 1 | 1 | 3 | 2
1 | 2 | 1 | 0 | 3
1 | 2 | 2 | 2 | 3
...
</pre></div></li>
<li>Create the view with Naive Bayes classification and check the results: <div class="fragment"><pre class="fragment">
sql&gt; SELECT madlib.create_nb_classify_view (
'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_classify_view_fast');
sql&gt; SELECT * FROM nb_classify_view_fast;
key | nb_classification
-----+-------------------
1 | {2}
2 | {1}
(2 rows)
</pre></div></li>
<li>Look at the probabilities for each class (note that we use Laplace smoothing; these values are verified by hand after this list): <div class="fragment"><pre class="fragment">
sql&gt; SELECT madlib.create_nb_probs_view (
'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_probs_view_fast');
sql&gt; SELECT * FROM nb_probs_view_fast;
key | class | nb_prob
-----+-------+---------
1 | 1 | 0.4
1 | 2 | 0.6
2 | 1 | 0.75
2 | 2 | 0.25
(4 rows)
</pre></div></li>
</ol>
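<p>The probabilities above can be reproduced by hand. In the training data, attribute 1 takes the distinct values \( \{0,1\} \) and attributes 2 and 3 each take three distinct values, so \( \#1 = 2 \) and \( \#2 = \#3 = 3 \); both class priors are \( 3/6 \). For the first data point \( (0,2,1) \), Laplace smoothing (\( s = 1 \)) gives </p>
<p class="formulaDsp">
\[ \Pr(C = 1) \prod_{i=1}^3 P(A_i = a_i \mid C = 1) = \frac{3}{6} \cdot \frac{0+1}{3+2} \cdot \frac{2+1}{3+3} \cdot \frac{1+1}{3+3} = \frac{1}{60} \,, \qquad \Pr(C = 2) \prod_{i=1}^3 P(A_i = a_i \mid C = 2) = \frac{3}{6} \cdot \frac{2+1}{3+2} \cdot \frac{2+1}{3+3} \cdot \frac{0+1}{3+3} = \frac{1}{40} \,. \]
</p>
<p> Normalizing yields \( \frac{1/60}{1/60 + 1/40} = 0.4 \) and \( \frac{1/40}{1/60 + 1/40} = 0.6 \), matching <code>nb_probs_view_fast</code>; the same calculation for the second data point \( (1,2,3) \) gives \( 0.75 \) and \( 0.25 \).</p>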
<dl class="user"><dt><b>Literature:</b></dt><dd></dd></dl>
<p>[1] Tom Mitchell: Machine Learning, McGraw Hill, 1997. Book chapter <em>Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression</em> available at: <a href="http://www.cs.cmu.edu/~tom/NewChapters.html">http://www.cs.cmu.edu/~tom/NewChapters.html</a></p>
<p>[2] Wikipedia, Naive Bayes classifier, <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">http://en.wikipedia.org/wiki/Naive_Bayes_classifier</a></p>
<dl class="see"><dt><b>See also:</b></dt><dd>File <a class="el" href="bayes_8sql__in.html" title="SQL functions for naive Bayes.">bayes.sql_in</a> documenting the SQL functions. </dd></dl>
</div>
</div>
<div id="nav-path" class="navpath">
<ul>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
<a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(0)"><span class="SelectionMark">&#160;</span>All</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(1)"><span class="SelectionMark">&#160;</span>Files</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(2)"><span class="SelectionMark">&#160;</span>Functions</a></div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<li class="footer">Generated on Tue Apr 2 2013 14:57:03 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.7.5.1 </li>
</ul>
</div>
</body>
</html>