<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: Naive Bayes Classification</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'madlib.apache.org');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.16</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__bayes.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Naive Bayes Classification<div class="ingroups"><a class="el" href="group__grp__early__stage.html">Early Stage Development</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> <ul>
<li>
<a href="#train">Training Function(s)</a> </li>
<li>
<a href="#classify">Classify Function(s)</a> </li>
<li>
<a href="#probabilities">Probabilities Function(s)</a> </li>
<li>
<a href="#adhoc">Ad Hoc Computation</a> </li>
<li>
<a href="#notes">Implementation Notes</a> </li>
<li>
<a href="#examples">Examples</a> </li>
<li>
<a href="#background">Technical Background</a> </li>
<li>
<a href="#related">Related Topics</a> </li>
</ul>
</div><dl class="section warning"><dt>Warning</dt><dd><em> This MADlib method is still in early stage development. Interface and implementation are subject to change. </em></dd></dl>
<p>Naive Bayes refers to a stochastic model where all independent variables \( a_1, \dots, a_n \) (often referred to as attributes in this context) independently contribute to the probability that a data point belongs to a certain class \( c \).</p>
<p>Naive Bayes classification estimates feature probabilities and class priors using maximum likelihood or Laplacian smoothing. For numeric attributes, Gaussian smoothing can be used to estimate the feature probabilities. These parameters are then used to classify new data.</p>
<p><a class="anchor" id="train"></a></p><dl class="section user"><dt>Training Function(s)</dt><dd></dd></dl>
<p>For data with only categorical attributes, precompute feature probabilities and class priors using the following function:</p>
<pre class="syntax">
create_nb_prepared_data_tables ( trainingSource,
trainingClassColumn,
trainingAttrColumn,
numAttrs,
featureProbsName,
classPriorsName
)
</pre><p>For data containing both categorical and numeric attributes, use the following form to precompute the Gaussian parameters (mean and variance) for numeric attributes alongside the feature probabilities for categorical attributes and class priors.</p>
<pre class="syntax">
create_nb_prepared_data_tables ( trainingSource,
trainingClassColumn,
trainingAttrColumn,
numericAttrsColumnIndices,
numAttrs,
featureProbsName,
numericAttrParamsName,
classPriorsName
)
</pre><p>The <em>trainingSource</em> is expected to be of the following form: </p><pre>{TABLE|VIEW} <em>trainingSource</em> (
...
<em>trainingClassColumn</em> INTEGER,
<em>trainingAttrColumn</em> INTEGER[] OR NUMERIC[] OR FLOAT8[],
...
)</pre><p><em>numericAttrsColumnIndices</em> should be of type TEXT, specified as an array of indices (starting from 1) in the <em>trainingAttrColumn</em> attributes-array that correspond to numeric attributes.</p>
<p>The two output tables are:</p><ul>
<li><em>featureProbsName</em> &ndash; stores feature probabilities</li>
<li><em>classPriorsName</em> &ndash; stores the class priors</li>
</ul>
<p>In addition, if the form that specifies numeric attributes is used, a table <em>numericAttrParamsName</em> is created that stores the Gaussian parameters for the numeric attributes.</p>
<p><a class="anchor" id="classify"></a></p><dl class="section user"><dt>Classify Function(s)</dt><dd></dd></dl>
<p>Perform Naive Bayes classification: </p><pre class="syntax">
create_nb_classify_view ( featureProbsName,
classPriorsName,
classifySource,
classifyKeyColumn,
classifyAttrColumn,
numAttrs,
destName
)
</pre><p>For data with numeric attributes, use the following version:</p>
<pre class="syntax">
create_nb_classify_view ( featureProbsName,
classPriorsName,
classifySource,
classifyKeyColumn,
classifyAttrColumn,
numAttrs,
numericAttrParamsName,
destName
)
</pre><p>The <b>data to classify</b> is expected to be of the following form: </p><pre>{TABLE|VIEW} <em>classifySource</em> (
...
<em>classifyKeyColumn</em> ANYTYPE,
<em>classifyAttrColumn</em> INTEGER[],
...
)</pre><p>This function creates the view <code><em>destName</em></code> mapping <em>classifyKeyColumn</em> to the Naive Bayes classification. </p><pre class="result">
key | nb_classification
&#160;---+------------------
...
</pre><p><a class="anchor" id="probabilities"></a></p><dl class="section user"><dt>Probabilities Function(s)</dt><dd></dd></dl>
<p>Compute Naive Bayes probabilities. </p><pre class="syntax">
create_nb_probs_view( featureProbsName,
classPriorsName,
classifySource,
classifyKeyColumn,
classifyAttrColumn,
numAttrs,
destName
)
</pre><p>For data with numeric attributes, use the following version:</p>
<pre class="syntax">
create_nb_probs_view( featureProbsName,
classPriorsName,
classifySource,
classifyKeyColumn,
classifyAttrColumn,
numAttrs,
numericAttrParamsName,
destName
)
</pre><p>This creates the view <code><em>destName</em></code> mapping <em>classifyKeyColumn</em> and every single class to the Naive Bayes probability: </p><pre class="result">
key | class | nb_prob
&#160;---+-------+--------
...
</pre><p><a class="anchor" id="adhoc"></a></p><dl class="section user"><dt>Ad Hoc Computation Function</dt><dd></dd></dl>
<p>The functions <a class="el" href="bayes_8sql__in.html#a798402280fc6db710957ae3ab58767e0" title="Create a view with columns (key, nb_classification) ">create_nb_classify_view()</a> and <a class="el" href="bayes_8sql__in.html#a163afffd0c845d325f060f74bcf02243" title="Create view with columns (key, class, nb_prob) ">create_nb_probs_view()</a> can also be used in an ad hoc fashion, without the precomputation step. In this case, replace the function arguments</p>
<pre>'<em>featureProbsName</em>', '<em>classPriorsName</em>'</pre><p> with </p><pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>'</pre><p> for data without any numeric attributes, and with </p><pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>', '<em>numericAttrsColumnIndices</em>'</pre><p> for data that contains numeric attributes as well.</p>
<p><a class="anchor" id="notes"></a></p><dl class="section user"><dt>Implementation Notes</dt><dd><ul>
<li>Probabilities computed on PostgreSQL and on Greenplum Database can differ slightly because of the nature of floating-point computation. Usually this is not important. However, if a data point has <p class="formulaDsp">
\[ P(C=c_i \mid A) \approx P(C=c_j \mid A) \]
</p>
for two classes, it might be classified into different classes on PostgreSQL than on Greenplum. This can lead to differing classifications for some data sets, but it should not affect the quality of the results.</li>
<li>When two classes have equal and highest probability among all classes, the classification result is an array of these two classes, but the order of the two classes is random.</li>
<li>The current implementation of Naive Bayes classification supports both discrete (categorical) attributes and continuous (numeric) attributes.<br />
For continuous data, a typical assumption, usually appropriate for small data sets, is that the continuous values associated with each class follow a Gaussian distribution, so the probabilities \( P(A_i = a \mid C=c) \) are estimated with the Gaussian density: <p class="formulaDsp">
\[ P(A_i=a \mid C=c) = \frac{1}{\sqrt{2\pi\sigma^{2}_c}}\exp\left(-\frac{(a-\mu_c)^{2}}{2\sigma^{2}_c}\right) \]
</p>
where \(\mu_c\) and \(\sigma^{2}_c\) are the population mean and variance of the attribute for the class \(c\).<br />
Another common technique for handling continuous values, better suited to large data sets, is to discretize the values into bins, converting the continuous data into categorical data. This approach is currently not implemented.</li>
<li>One can provide floating-point data to the Naive Bayes classification function. If the corresponding attribute index is not listed in <em>numericAttrsColumnIndices</em>, the floating-point numbers are treated as symbolic substitutions for categorical data. In that case, classification works best when there are sufficient data points for each distinct floating-point value. If floating-point numbers are in fact continuous data but the attribute is not marked as numeric in <em>numericAttrsColumnIndices</em>, no warning is raised and the results may not be as expected.</li>
</ul>
</dd></dl>
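<p>The Gaussian estimate in the notes above can be sketched outside the database. The following minimal Python function (illustrative only, not MADlib code) computes the class-conditional density from a class mean and variance:</p>

```python
import math

def gaussian_prob(a, mean, var):
    """Class-conditional density P(A_i = a | C = c) under the
    Gaussian assumption, given the class mean and variance."""
    return math.exp(-(a - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# e.g. density of a height of 5.8 for a class with mean 5.855, variance 0.035
p = gaussian_prob(5.8, 5.855, 0.035)
```

<p>Note that this is a density rather than a probability: for small variances it can exceed 1, which is harmless because classification only compares the products of these terms across classes.</p>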
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<p>The following is an extremely simplified example of the first (categorical-only) form above, which can be verified by hand.</p>
<ol type="1">
<li>The training and the classification data. <pre class="example">
SELECT * FROM training;
</pre> Result: <pre class="result">
id | class | attributes
&#160;---+-------+------------
1 | 1 | {1,2,3}
2 | 1 | {1,2,1}
3 | 1 | {1,4,3}
4 | 2 | {1,2,2}
5 | 2 | {0,2,2}
6 | 2 | {0,1,3}
(6 rows)
</pre> <pre class="example">
SELECT * FROM toclassify;
</pre> Result: <pre class="result">
id | attributes
&#160;---+------------
1 | {0,2,1}
2 | {1,2,3}
(2 rows)
</pre></li>
<li>Precompute feature probabilities and class priors. <pre class="example">
SELECT madlib.create_nb_prepared_data_tables( 'training',
'class',
'attributes',
3,
'nb_feature_probs',
'nb_class_priors'
);
</pre></li>
<li>Optionally check the contents of the precomputed tables. <pre class="example">
SELECT * FROM nb_class_priors;
</pre> Result: <pre class="result">
class | class_cnt | all_cnt
&#160;------+-----------+---------
1 | 3 | 6
2 | 3 | 6
(2 rows)
</pre> <pre class="example">
SELECT * FROM nb_feature_probs;
</pre> Result: <pre class="result">
class | attr | value | cnt | attr_cnt
&#160;------+------+-------+-----+----------
1 | 1 | 0 | 0 | 2
1 | 1 | 1 | 3 | 2
1 | 2 | 1 | 0 | 3
1 | 2 | 2 | 2 | 3
...
</pre></li>
<li>Create the view with Naive Bayes classification and check the results. <pre class="example">
SELECT madlib.create_nb_classify_view( 'nb_feature_probs',
'nb_class_priors',
'toclassify',
'id',
'attributes',
3,
'nb_classify_view_fast'
);
&#160;
SELECT * FROM nb_classify_view_fast;
</pre> Result: <pre class="result">
key | nb_classification
&#160;----+-------------------
1 | {2}
2 | {1}
(2 rows)
</pre></li>
<li>Look at the probabilities for each class (note that Laplace smoothing is used). <pre class="example">
SELECT madlib.create_nb_probs_view( 'nb_feature_probs',
'nb_class_priors',
'toclassify',
'id',
'attributes',
3,
'nb_probs_view_fast'
);
&#160;
SELECT * FROM nb_probs_view_fast;
</pre> Result: <pre class="result">
key | class | nb_prob
&#160;----+-------+---------
1 | 1 | 0.4
1 | 2 | 0.6
2 | 1 | 0.75
2 | 2 | 0.25
(4 rows)
</pre></li>
</ol>
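<p>The probabilities above can be reproduced by hand. The following Python sketch (illustrative only; it assumes Laplace smoothing with \( s = 1 \), which matches the output shown) recomputes them from the six training rows:</p>

```python
from collections import Counter

training = [(1, [1, 2, 3]), (1, [1, 2, 1]), (1, [1, 4, 3]),
            (2, [1, 2, 2]), (2, [0, 2, 2]), (2, [0, 1, 3])]

def nb_probs(training, attrs, s=1):
    classes = Counter(c for c, _ in training)        # class counts (priors)
    # distinct values per attribute position, over all classes
    distinct = [len({row[i] for _, row in training}) for i in range(len(attrs))]
    scores = {}
    for c, c_cnt in classes.items():
        p = c_cnt / len(training)                    # Pr(C = c)
        for i, a in enumerate(attrs):
            cnt = sum(1 for cl, row in training if cl == c and row[i] == a)
            p *= (cnt + s) / (c_cnt + s * distinct[i])   # smoothed estimate
        scores[c] = p
    total = sum(scores.values())
    return {c: p / total for c, p in scores.items()}     # normalized nb_prob

print(nb_probs(training, [0, 2, 1]))  # class 1: ~0.4, class 2: ~0.6
print(nb_probs(training, [1, 2, 3]))  # class 1: ~0.75, class 2: ~0.25
```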
<p>The following is an example of using a dataset with both numeric and categorical attributes</p>
<ol type="1">
<li>The training and the classification data. Attributes {height(numeric),weight(numeric),shoe size(categorical)}, Class{sex(1=male,2=female)} <pre class="example">
SELECT * FROM gaussian_data;
</pre> Result: <pre class="result">
id | sex | attributes
&#160;----+-----+---------------
1 | 1 | {6,180,12}
2 | 1 | {5.92,190,12}
3 | 1 | {5.58,170,11}
4 | 1 | {5.92,165,11}
5 | 2 | {5,100,6}
6 | 2 | {5.5,150,6}
7 | 2 | {5.42,130,7}
8 | 2 | {5.75,150,8}
(8 rows)
</pre> <pre class="example">
SELECT * FROM gaussian_test;
</pre> Result: <pre class="result">
id | sex | attributes
----+-----+--------------
9 | 1 | {5.8,180,11}
10 | 2 | {5,160,6}
(2 rows)
</pre></li>
<li>Precompute feature probabilities and class priors. <pre class="example">
SELECT madlib.create_nb_prepared_data_tables( 'gaussian_data',
'sex',
'attributes',
'ARRAY[1,2]',
3,
'categ_feature_probs',
'numeric_attr_params',
'class_priors'
);
</pre></li>
<li>Optionally check the contents of the precomputed tables. <pre class="example">
SELECT * FROM class_priors;
</pre> Result: <pre class="result">
class | class_cnt | all_cnt
&#160;-------+-----------+---------
1 | 4 | 8
2 | 4 | 8
(2 rows)
</pre> <pre class="example">
SELECT * FROM categ_feature_probs;
</pre> Result: <pre class="result">
class | attr | value | cnt | attr_cnt
-------+------+-------+-----+----------
2 | 3 | 6 | 2 | 5
1 | 3 | 12 | 2 | 5
2 | 3 | 7 | 1 | 5
1 | 3 | 11 | 2 | 5
2 | 3 | 8 | 1 | 5
2 | 3 | 12 | 0 | 5
1 | 3 | 6 | 0 | 5
2 | 3 | 11 | 0 | 5
1 | 3 | 8 | 0 | 5
1 | 3 | 7 | 0 | 5
(10 rows)
</pre> <pre class="example">
SELECT * FROM numeric_attr_params;
</pre> Result: <pre class="result">
class | attr | attr_mean | attr_var
-------+------+----------------------+------------------------
1 | 1 | 5.8550000000000000 | 0.03503333333333333333
1 | 2 | 176.2500000000000000 | 122.9166666666666667
2 | 1 | 5.4175000000000000 | 0.09722500000000000000
2 | 2 | 132.5000000000000000 | 558.3333333333333333
(4 rows)
</pre></li>
<li>Create the view with Naive Bayes classification and check the results. <pre class="example">
SELECT madlib.create_nb_classify_view( 'categ_feature_probs',
'class_priors',
'gaussian_test',
'id',
'attributes',
3,
'numeric_attr_params',
'classify_view'
);
&#160;
SELECT * FROM classify_view;
</pre> Result: <pre class="result">
key | nb_classification
&#160;----+-------------------
9 | {1}
10 | {2}
(2 rows)
</pre></li>
<li>Look at the probabilities for each class. <pre class="example">
SELECT madlib.create_nb_probs_view( 'categ_feature_probs',
'class_priors',
'gaussian_test',
'id',
'attributes',
3,
'numeric_attr_params',
'probs_view'
);
&#160;
SELECT * FROM probs_view;
</pre> Result: <pre class="result">
key | class | nb_prob
-----+-------+----------------------
9 | 1 | 0.993556745948775
9 | 2 | 0.00644325405122553
10 | 1 | 5.74057538627122e-05
10 | 2 | 0.999942594246137
(4 rows)
</pre></li>
</ol>
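<p>The numbers in this example can likewise be checked by hand. The following Python sketch (illustrative only) assumes sample variance (the \( n-1 \) denominator, matching <em>numeric_attr_params</em> above) for the numeric attributes and a Laplace-smoothed estimate for the categorical shoe-size attribute:</p>

```python
import math
from statistics import mean, variance  # variance() uses the n-1 denominator

males   = [[6, 180, 12], [5.92, 190, 12], [5.58, 170, 11], [5.92, 165, 11]]
females = [[5, 100, 6], [5.5, 150, 6], [5.42, 130, 7], [5.75, 150, 8]]

def gauss(a, mu, var):
    return math.exp(-(a - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(rows, x, n_shoe_vals):
    p = 0.5                                  # class prior: 4 of 8 rows each
    for i in (0, 1):                         # numeric attrs: height, weight
        col = [r[i] for r in rows]
        p *= gauss(x[i], mean(col), variance(col))
    cnt = sum(1 for r in rows if r[2] == x[2])
    p *= (cnt + 1) / (len(rows) + n_shoe_vals)  # Laplace-smoothed categorical
    return p

n_shoe_vals = len({r[2] for r in males + females})   # 5 distinct shoe sizes
x = [5.8, 180, 11]                                   # the row with id 9
s1, s2 = score(males, x, n_shoe_vals), score(females, x, n_shoe_vals)
print(s1 / (s1 + s2))   # ~0.9936, the nb_prob for key 9, class 1
```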
<p><a class="anchor" id="background"></a></p><dl class="section user"><dt>Technical Background</dt><dd></dd></dl>
<p>In detail, <b>Bayes'</b> theorem states that </p><p class="formulaDsp">
\[ \Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n) = \frac{\Pr(C = c) \cdot \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)} {\Pr(A_1 = a_1, \dots, A_n = a_n)} \,, \]
</p>
<p> and the <b>naive</b> assumption is that </p><p class="formulaDsp">
\[ \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c) = \prod_{i=1}^n \Pr(A_i = a_i \mid C = c) \,. \]
</p>
<p> Naive Bayes classification estimates feature probabilities and class priors using maximum likelihood or Laplacian smoothing. These parameters are then used to classify new data.</p>
<p>A Naive Bayes classifier computes the following formula: </p><p class="formulaDsp">
\[ \text{classify}(a_1, ..., a_n) = \arg\max_c \left\{ \Pr(C = c) \cdot \prod_{i=1}^n \Pr(A_i = a_i \mid C = c) \right\} \]
</p>
<p> where \( c \) ranges over all classes in the training data. There are different ways to estimate the feature probabilities \( P(A_i = a \mid C = c) \). The maximum likelihood estimate uses the relative frequencies from the training set. That is: </p><p class="formulaDsp">
\[ P(A_i = a \mid C = c) = \frac{\#(c,i,a)}{\#c} \]
</p>
<p> where</p><ul>
<li>\( \#(c,i,a) \) denotes the # of training samples where attribute \( i \) is \( a \) and class is \( c \)</li>
<li>\( \#c \) denotes the # of training samples where class is \( c \).</li>
</ul>
<p>Since maximum likelihood sometimes produces estimates of "0", you might want to use a "smoothed" estimate. To do this, add a number of "virtual" samples and assume that these samples are evenly distributed among the values assumed by attribute \( i \) (that is, the set of all values observed for attribute \( i \) for any class):</p>
<p class="formulaDsp">
\[ P(A_i = a \mid C = c) = \frac{\#(c,i,a) + s}{\#c + s \cdot \#i} \]
</p>
<p> where</p><ul>
<li>\( \#i \) denotes the # of distinct values for attribute \( i \) (for all classes)</li>
<li>\( s \geq 0 \) denotes the smoothing factor.</li>
</ul>
<p>The case \( s = 1 \) is known as "Laplace smoothing". The case \( s = 0 \) trivially reduces to maximum-likelihood estimates.</p>
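<p>A tiny worked instance of the smoothed estimate: with \( \#(c,i,a) = 0 \), \( \#c = 3 \), and \( \#i = 3 \), maximum likelihood gives 0 while Laplace smoothing gives \( 1/6 \). In Python (illustrative only):</p>

```python
def smoothed(cnt_cia, cnt_c, distinct_i, s=1):
    """Smoothed feature probability P(A_i = a | C = c)."""
    return (cnt_cia + s) / (cnt_c + s * distinct_i)

smoothed(0, 3, 3, s=0)  # maximum likelihood: 0.0
smoothed(0, 3, 3, s=1)  # Laplace smoothing: 1/6
```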
<p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>[1] Tom Mitchell: Machine Learning, McGraw Hill, 1997. Book chapter <em>Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression</em> available at: <a href="http://www.cs.cmu.edu/~tom/NewChapters.html">http://www.cs.cmu.edu/~tom/NewChapters.html</a></p>
<p>[2] Wikipedia, Naive Bayes classifier, <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">http://en.wikipedia.org/wiki/Naive_Bayes_classifier</a></p>
<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd>File <a class="el" href="bayes_8sql__in.html" title="SQL functions for naive Bayes. ">bayes.sql_in</a> documenting the SQL functions.</dd></dl>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Tue Jul 2 2019 22:35:52 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
</ul>
</div>
</body>
</html>