<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.10"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: Random Forest (old implementation)</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
$(window).load(resizeHeight);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script src="../mathjax/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="navtree_hack.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'auto');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.incubator.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.8</span>
</div>
<div id="projectbrief">User Documentation for MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.10 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__rf.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Random Forest (old implementation)<div class="ingroups"><a class="el" href="group__grp__deprecated.html">Deprecated Modules</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> </p><ul>
<li>
<a href="#train">Training Function</a> </li>
<li>
<a href="#classify">Classification Function</a> </li>
<li>
<a href="#score">Scoring Function</a> </li>
<li>
<a href="#display">Display Function</a> </li>
<li>
<a href="#clean">Cleaning Function</a> </li>
<li>
<a href="#examples">Examples</a> </li>
<li>
<a href="#literature">Literature</a> </li>
<li>
<a href="#related">Related Topics</a> </li>
</ul>
</div><dl class="section warning"><dt>Warning</dt><dd><em> This is an old implementation of random forests. For a newer implementation, please see <a class="el" href="group__grp__random__forest.html">Random Forest</a></em></dd></dl>
<p>A random forest (RF) is an ensemble classifier that consists of many decision trees and outputs the class receiving the majority vote of the individual trees.</p>
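<p>To make the voting step concrete, the following is a minimal sketch of a per-record plurality vote in SQL. The table <code>tree_votes(id, predicted)</code>, holding one row per record per tree, is hypothetical; the rf_classify function described below performs this aggregation internally.</p><pre class="example">
-- Plurality vote per record id over hypothetical per-tree predictions
-- (ties broken arbitrarily by taking the first row per id).
SELECT DISTINCT ON (id) id, predicted AS voted_class
FROM (
    SELECT id, predicted, count(*) AS votes
    FROM tree_votes
    GROUP BY id, predicted
) v
ORDER BY id, votes DESC;
</pre>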
<p>A random forest has the following well-known advantages:</p><ul>
<li>Overall, RF produces better accuracy than a single decision tree.</li>
<li>It can be very efficient for large data sets, since the trees of an RF can be trained in parallel.</li>
<li>It can handle thousands of input attributes without attribute deletion.</li>
</ul>
<p>This module provides an implementation of the random forest algorithm described in [1].</p>
<p>The implementation supports:</p><ul>
<li>Building random forests</li>
<li>Multiple split criteria, including:<ul>
<li>Information Gain</li>
<li>Gini Coefficient (illustrated in the first sketch after this list)</li>
<li>Gain Ratio</li>
</ul></li>
<li>Random forest classification/scoring</li>
<li>Random forest display</li>
<li>Continuous and discrete features</li>
<li>Equal frequency discretization for continuous features (illustrated in the second sketch after this list)</li>
<li>Missing value handling</li>
<li>Sampling with replacement</li>
</ul>
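<p>As a concrete illustration of one split criterion, the Gini coefficient of a set of records is 1 - sum(p<sub>i</sub><sup>2</sup>), where p<sub>i</sub> is the relative frequency of class i. A minimal sketch, computed by hand over the <code>class</code> column of the <code>golf_data</code> table from the Examples section (rf_train evaluates such measures internally for each candidate split):</p><pre class="example">
-- Gini coefficient of the class distribution: 1 - sum(p_i^2).
SELECT 1 - sum(p * p) AS gini
FROM (
    SELECT count(*)::float8 / sum(count(*)) OVER () AS p
    FROM golf_data
    GROUP BY class
) freq;
</pre><p>For golf_data (9 'Play' rows and 5 'Do not Play' rows) this gives 1 - (9/14)<sup>2</sup> - (5/14)<sup>2</sup> &#8776; 0.459. Equal frequency discretization, in turn, assigns each continuous value to a bin containing roughly the same number of records. A minimal sketch using the <code>ntile</code> window function (the bin count of 3 is arbitrary; this step also happens internally during training):</p><pre class="example">
-- Equal-frequency discretization of a continuous feature into 3 bins.
SELECT id, humidity, ntile(3) OVER (ORDER BY humidity) AS humidity_bin
FROM golf_data;
</pre>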
<dl class="section user"><dt>Input</dt><dd></dd></dl>
<p>The <b>data to classify</b> is expected to be of the same form as <b>training data</b>, except that it does not need a class column.</p>
<p><a class="anchor" id="train"></a></p><dl class="section user"><dt>Training Function</dt><dd></dd></dl>
<p>Run the training algorithm on the source data. </p><pre class="syntax">
rf_train( split_criterion,
          training_table_name,
          result_rf_table_name,
          num_trees,
          features_per_node,
          sampling_percentage,
          continuous_feature_names,
          feature_col_names,
          id_col_name,
          class_col_name,
          how2handle_missing_value,
          max_tree_depth,
          node_prune_threshold,
          node_split_threshold,
          verbosity
        )
</pre><p> <b>Arguments</b> </p><dl class="arglist">
<dt>split_criterion </dt>
<dd><p class="startdd">The name of the split criterion that should be used for tree construction. The valid values are ‘infogain’, ‘gainratio’, and ‘gini’. It can't be NULL. Information gain(infogain) and gini index(gini) are biased toward multivalued attributes. Gain ratio(gainratio) adjusts for this bias. However, it tends to prefer unbalanced splits in which one partition is much smaller than the others.</p>
<p class="enddd"></p>
</dd>
<dt>training_table_name </dt>
<dd><p class="startdd">The name of the table/view with the training data. It can't be NULL and must exist.</p>
<p>The <b>training data</b> is expected to be of the following form: </p><pre>{TABLE|VIEW} <em>trainingSource</em> (
...
<em>id</em> INT|BIGINT,
<em>feature1</em> SUPPORTED_DATA_TYPE,
<em>feature2</em> SUPPORTED_DATA_TYPE,
<em>feature3</em> SUPPORTED_DATA_TYPE,
....................
<em>featureN</em> SUPPORTED_DATA_TYPE,
<em>class</em> SUPPORTED_DATA_TYPE,
...
)</pre><p>SUPPORTED_DATA_TYPE can be any of the following: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL. </p>
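<p>For instance, the <code>golf_data</code> table used in the Examples section below could be declared as follows (a minimal sketch; the exact type choices are illustrative): </p><pre class="example">
CREATE TABLE golf_data (
    id          INT,     -- unique record id (id_col_name)
    outlook     TEXT,    -- discrete feature
    temperature FLOAT8,  -- continuous feature
    humidity    FLOAT8,  -- continuous feature
    windy       TEXT,    -- discrete feature
    class       TEXT     -- class label (class_col_name)
);
</pre>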
<p class="enddd"></p>
</dd>
<dt>result_rf_table_name </dt>
<dd><p class="startdd">The name of the table where the resulting trees are stored. It can not be NULL and must not exist.</p>
<p class="enddd">The output table stores an abstract object (representing the model) used for further classification. The table has the following columns: </p><table class="output">
<tr>
<th>id</th><td></td></tr>
<tr>
<th>tree_location</th><td></td></tr>
<tr>
<th>feature</th><td></td></tr>
<tr>
<th>probability</th><td></td></tr>
<tr>
<th>ebp_coeff</th><td></td></tr>
<tr>
<th>maxclass</th><td></td></tr>
<tr>
<th>split_gain</th><td></td></tr>
<tr>
<th>live</th><td></td></tr>
<tr>
<th>cat_size</th><td></td></tr>
<tr>
<th>parent_id</th><td></td></tr>
<tr>
<th>lmc_nid</th><td></td></tr>
<tr>
<th>lmc_fval</th><td></td></tr>
<tr>
<th>is_feature_cont</th><td></td></tr>
<tr>
<th>split_value</th><td></td></tr>
<tr>
<th>tid</th><td></td></tr>
<tr>
<th>dp_ids </th><td></td></tr>
</table>
</dd>
<dt>num_trees </dt>
<dd>The number of trees to be trained. If it's NULL, 10 will be used. </dd>
<dt>features_per_node </dt>
<dd>The number of features to be considered when finding a best split. If it's NULL, sqrt(p) is used, where p is the number of features. </dd>
<dt>sampling_percentage </dt>
<dd>The percentage of records sampled to train a tree. If it's NULL, 0.632 bootstrap sampling will be used (when sampling with replacement, each tree sees about 63.2% of the distinct records in expectation). </dd>
<dt>continuous_feature_names </dt>
<dd>A comma-separated list of the names of the features whose values are continuous. NULL means there are no continuous features. </dd>
<dt>feature_col_names </dt>
<dd>A comma-separated list of names of the table columns, each of which defines a feature. NULL means all the columns except the ID and Class columns will be treated as features. </dd>
<dt>id_col_name </dt>
<dd>The name of the column containing the id of each record. It can't be NULL. </dd>
<dt>class_col_name </dt>
<dd>The name of the column containing the correct class of each record. It can't be NULL. </dd>
<dt>how2handle_missing_value </dt>
<dd>The way to handle missing values. The valid values are 'explicit' and 'ignore'. It can't be NULL. </dd>
<dt>max_tree_depth</dt>
<dd>The maximum tree depth. It can't be NULL. </dd>
<dt>node_prune_threshold </dt>
<dd>The minimum percentage of records required in a child node, in the range [0.0, 1.0]. It can't be NULL. This threshold applies only to non-root nodes. Let p be the ratio of a tree's sampled training set size (its number of rows) to the total training set size. If p is less than or equal to the value of this parameter, every child node falls below the threshold, so the tree consists of the root node only. In particular, a value of 1 always yields single-node trees (p is at most 1 by definition), and a value of 0 means no nodes are pruned by this parameter. For example, with a sampling_percentage of 0.632, p is roughly 0.632, so any node_prune_threshold of 0.632 or larger produces single-node trees. </dd>
<dt>node_split_threshold </dt>
<dd>The minimum percentage of records required in a node in order for a further split to be possible, in the range [0.0, 1.0]. It can't be NULL. Again let p be the ratio of a tree's sampled training set size (its number of rows) to the total training set size. If p is less than the value of this parameter, the root node itself cannot be split, so the trained tree has only one node. If p equals the value of this parameter, only the root node can be split, so the trained tree has at most two levels. A value of 0 lets trees grow without this restriction. </dd>
<dt>verbosity </dt>
<dd>A value greater than 0 makes the function run in verbose mode. It can't be NULL. </dd>
</dl>
<p><a class="anchor" id="classify"></a></p><dl class="section user"><dt>Classification Function</dt><dd></dd></dl>
<p>The classification function classifies the records of a table using the given random forest and creates the result table with the predicted class of each record. </p><pre class="syntax">
rf_classify( rf_table_name,
             classification_table_name,
             result_table_name
           )
</pre><p><a class="anchor" id="score"></a></p><dl class="section user"><dt>Scoring Function</dt><dd></dd></dl>
<p>The scoring function returns the ratio of correctly classified records in the validation data set. </p><pre class="syntax">
rf_score( rf_table_name,
          validation_table_name,
          verbosity
        )
</pre><p><a class="anchor" id="display"></a></p><dl class="section user"><dt>Display Function</dt><dd></dd></dl>
<p>The display function shows the trained trees in a human-readable format. </p><pre class="syntax">
rf_display( rf_table_name )
</pre><p><a class="anchor" id="clean"></a></p><dl class="section user"><dt>Cleaning Function</dt><dd></dd></dl>
<p>The cleaning function cleans up the learned model and its metadata. </p><pre class="syntax">
rf_clean( rf_table_name )
</pre><p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<ol type="1">
<li>Prepare an input table. <pre class="example">
SELECT * FROM golf_data ORDER BY id;
</pre> Result: <pre class="result">
 id | outlook  | temperature | humidity | windy  |    class
&#160;---+----------+-------------+----------+--------+--------------
  1 | sunny    |          85 |       85 | false  | Do not Play
  2 | sunny    |          80 |       90 | true   | Do not Play
  3 | overcast |          83 |       78 | false  | Play
  4 | rain     |          70 |       96 | false  | Play
  5 | rain     |          68 |       80 | false  | Play
  6 | rain     |          65 |       70 | true   | Do not Play
  7 | overcast |          64 |       65 | true   | Play
  8 | sunny    |          72 |       95 | false  | Do not Play
  9 | sunny    |          69 |       70 | false  | Play
 10 | rain     |          75 |       80 | false  | Play
 11 | sunny    |          75 |       70 | true   | Play
 12 | overcast |          72 |       90 | true   | Play
 13 | overcast |          81 |       75 | false  | Play
 14 | rain     |          71 |       80 | true   | Do not Play
(14 rows)
</pre></li>
<li>Train the random forest. <pre class="example">
SELECT * FROM madlib.rf_clean('trained_tree_infogain');
SELECT * FROM madlib.rf_train(
    'infogain',                            -- split_criterion
    'golf_data',                           -- training_table_name
    'trained_tree_infogain',               -- result_rf_table_name
    10,                                    -- num_trees
    NULL,                                  -- features_per_node
    0.632,                                 -- sampling_percentage
    'temperature,humidity',                -- continuous_feature_names
    'outlook,temperature,humidity,windy',  -- feature_col_names
    'id',                                  -- id_col_name
    'class',                               -- class_col_name
    'explicit',                            -- how2handle_missing_value
    10,                                    -- max_tree_depth
    0.0,                                   -- node_prune_threshold
    0.0,                                   -- node_split_threshold
    0);                                    -- verbosity
</pre> Result: <pre class="result">
 training_time  | num_of_samples | num_trees | features_per_node | num_tree_nodes | max_tree_depth | split_criterion |    acs_time     |    acc_time     |    olap_time    |   update_time   |    best_time
&#160;---------------+----------------+-----------+-------------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------
 00:00:03.60498 |             14 |        10 |                 3 |             71 |              6 | infogain        | 00:00:00.154991 | 00:00:00.404411 | 00:00:00.736876 | 00:00:00.374084 | 00:00:01.722658
(1 row)
</pre></li>
<li>Check the table records that hold the random forest. <pre class="example">
SELECT * FROM trained_tree_infogain ORDER BY tid, id;
</pre> <pre class="result">
id | tree_location | feature | probability | ebp_coeff | maxclass | split_gain | live | cat_size | parent_id | lmc_nid | lmc_fval | is_feature_cont | split_value | tid | dp_ids
&#160;---+---------------+---------+-------------------+-----------+----------+--------------------+------+----------+-----------+---------+----------+-----------------+-------------+-----+--------
1 | {0} | 3 | 0.777777777777778 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 24 | 1 | f | | 1 |
24 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 1 | | | f | | 1 | {3}
25 | {0,2} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 1 | | | f | | 1 | {3}
26 | {0,3} | 2 | 0.666666666666667 | 1 | 1 | 0.444444444444444 | 0 | 3 | 1 | 42 | 1 | t | 70 | 1 | {3}
42 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 26 | | | f | | 1 |
43 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 2 | 26 | | | f | | 1 |
2 | {0} | 2 | 0.555555555555556 | 1 | 1 | 0.17636684303351 | 0 | 9 | 0 | 11 | 1 | t | 65 | 2 |
11 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 2 | | | f | | 2 |
12 | {0,2} | 4 | 0.714285714285714 | 1 | 1 | 0.217687074829932 | 0 | 7 | 2 | 44 | 1 | f | | 2 |
44 | {0,2,1} | 3 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 12 | 57 | 1 | f | | 2 | {4}
45 | {0,2,2} | 3 | 1 | 1 | 1 | 0 | 0 | 4 | 12 | | | f | | 2 | {4}
57 | {0,2,1,1} | 2 | 1 | 1 | 2 | 0 | 0 | 1 | 44 | | | t | 78 | 2 | {4,3}
58 | {0,2,1,2} | 2 | 1 | 1 | 2 | 0 | 0 | 1 | 44 | | | t | 96 | 2 | {4,3}
59 | {0,2,1,3} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 44 | | | t | 85 | 2 | {4,3}
3 | {0} | 2 | 0.777777777777778 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 27 | 1 | t | 80 | 3 |
27 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 6 | 3 | | | f | | 3 |
28 | {0,2} | 2 | 0.666666666666667 | 1 | 1 | 0.444444444444444 | 0 | 3 | 3 | 46 | 1 | t | 90 | 3 |
46 | {0,2,1} | 4 | 1 | 1 | 1 | 0 | 0 | 2 | 28 | | | f | | 3 |
47 | {0,2,2} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 28 | | | f | | 3 |
4 | {0} | 4 | 0.888888888888889 | 1 | 2 | 0.0493827160493827 | 0 | 9 | 0 | 13 | 1 | f | | 4 |
13 | {0,1} | 3 | 1 | 1 | 2 | 0 | 0 | 6 | 4 | | | f | | 4 | {4}
14 | {0,2} | 3 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 4 | 48 | 1 | f | | 4 | {4}
48 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 14 | | | t | 90 | 4 | {4,3}
49 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 14 | | | t | 80 | 4 | {4,3}
5 | {0} | 2 | 0.888888888888889 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 29 | 1 | t | 90 | 5 |
29 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 8 | 5 | | | f | | 5 |
30 | {0,2} | 3 | 1 | 1 | 1 | 0 | 0 | 1 | 5 | | | f | | 5 |
6 | {0} | 3 | 0.555555555555556 | 1 | 2 | 0.345679012345679 | 0 | 9 | 0 | 15 | 1 | f | | 6 |
15 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 3 | 6 | | | f | | 6 | {3}
16 | {0,2} | 4 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 6 | 51 | 1 | f | | 6 | {3}
17 | {0,3} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 6 | | | f | | 6 | {3}
51 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 16 | | | t | 96 | 6 | {3,4}
52 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 16 | | | t | 70 | 6 | {3,4}
7 | {0} | 4 | 0.666666666666667 | 1 | 2 | 0.253968253968254 | 0 | 9 | 0 | 31 | 1 | f | | 7 |
31 | {0,1} | 2 | 0.857142857142857 | 1 | 2 | 0.102040816326531 | 0 | 7 | 7 | 36 | 1 | t | 80 | 7 | {4}
32 | {0,2} | 3 | 1 | 1 | 1 | 0 | 0 | 2 | 7 | | | f | | 7 | {4}
36 | {0,1,1} | 4 | 1 | 1 | 2 | 0 | 0 | 5 | 31 | | | f | | 7 |
37 | {0,1,2} | 2 | 0.5 | 1 | 2 | 0.5 | 0 | 2 | 31 | 60 | 1 | t | 95 | 7 |
60 | {0,1,2,1} | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 37 | | | f | | 7 |
61 | {0,1,2,2} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 37 | | | f | | 7 |
8 | {0} | 3 | 0.777777777777778 | 1 | 2 | 0.0864197530864197 | 0 | 9 | 0 | 18 | 1 | f | | 8 |
18 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 8 | | | f | | 8 | {3}
19 | {0,2} | 4 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 8 | 38 | 1 | f | | 8 | {3}
20 | {0,3} | 2 | 0.5 | 1 | 2 | 0.5 | 0 | 2 | 8 | 53 | 1 | t | 70 | 8 | {3}
38 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 19 | | | t | 80 | 8 | {3,4}
39 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 19 | | | t | 80 | 8 | {3,4}
53 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 20 | | | f | | 8 |
54 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 20 | | | f | | 8 |
9 | {0} | 3 | 0.555555555555556 | 1 | 2 | 0.327160493827161 | 0 | 9 | 0 | 33 | 1 | f | | 9 |
33 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 9 | | | f | | 9 | {3}
34 | {0,2} | 4 | 0.75 | 1 | 2 | 0.375 | 0 | 4 | 9 | 55 | 1 | f | | 9 | {3}
35 | {0,3} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 9 | | | f | | 9 | {3}
55 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 3 | 34 | | | t | 96 | 9 | {3,4}
56 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 34 | | | t | 70 | 9 | {3,4}
10 | {0} | 3 | 0.666666666666667 | 1 | 2 | 0.277777777777778 | 0 | 9 | 0 | 21 | 1 | f | | 10 |
21 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 10 | | | f | | 10 | {3}
22 | {0,2} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 10 | | | f | | 10 | {3}
23 | {0,3} | 2 | 0.75 | 1 | 1 | 0.375 | 0 | 4 | 10 | 40 | 1 | t | 70 | 10 | {3}
40 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 23 | | | f | | 10 |
41 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 23 | | | f | | 10 |
(60 rows)
</pre></li>
<li>Display the random forest in a human-readable format. <pre class="example">
SELECT * FROM madlib.rf_display('trained_tree_infogain');
</pre> Result: <pre class="result">
rf_display
&#160;----------------------------------------------------------------------------------------------------
&#160;
Tree 1
    Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778)
        outlook: = overcast : class( Play) num_elements(4) predict_prob(1)
        outlook: = rain : class( Play) num_elements(2) predict_prob(1)
        outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(0.666666666666667)
            humidity: &lt;= 70 : class( Play) num_elements(1) predict_prob(1)
            humidity: &gt; 70 : class( Do not Play) num_elements(2) predict_prob(1)
&#160;
Tree 2
    Root Node : class( Do not Play) num_elements(9) predict_prob(0.555555555555556)
        humidity: &lt;= 65 : class( Play) num_elements(2) predict_prob(1)
        humidity: &gt; 65 : class( Do not Play) num_elements(7) predict_prob(0.714285714285714)
            windy: = false : class( Play) num_elements(3) predict_prob(0.666666666666667)
                outlook: = overcast : class( Play) num_elements(1) predict_prob(1)
                outlook: = rain : class( Play) num_elements(1) predict_prob(1)
                outlook: = sunny : class( Do not Play) num_elements(1) predict_prob(1)
            windy: = true : class( Do not Play) num_elements(4) predict_prob(1)
&#160;
Tree 3
    Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778)
        humidity: &lt;= 80 : class( Play) num_elements(6) predict_prob(1)
        humidity: &gt; 80 : class( Do not Play) num_elements(3) predict_prob(0.666666666666667)
            humidity: &lt;= 90 : class( Do not Play) num_elements(2) predict_prob(1)
            humidity: &gt; 90 : class( Play) num_elements(1) predict_prob(1)
&#160;
Tree 4
    Root Node : class( Play) num_elements(9) predict_prob(0.888888888888889)
        windy: = false : class( Play) num_elements(6) predict_prob(1)
        windy: = true : class( Play) num_elements(3) predict_prob(0.666666666666667)
            outlook: = overcast : class( Play) num_elements(2) predict_prob(1)
            outlook: = rain : class( Do not Play) num_elements(1) predict_prob(1)
&#160;
Tree 5
    Root Node : class( Play) num_elements(9) predict_prob(0.888888888888889)
        humidity: &lt;= 90 : class( Play) num_elements(8) predict_prob(1)
        humidity: &gt; 90 : class( Do not Play) num_elements(1) predict_prob(1)
&#160;
Tree 6
    Root Node : class( Play) num_elements(9) predict_prob(0.555555555555556)
        outlook: = overcast : class( Play) num_elements(3) predict_prob(1)
        outlook: = rain : class( Play) num_elements(3) predict_prob(0.666666666666667)
            windy: = false : class( Play) num_elements(2) predict_prob(1)
            windy: = true : class( Do not Play) num_elements(1) predict_prob(1)
        outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(1)
&#160;
Tree 7
    Root Node : class( Play) num_elements(9) predict_prob(0.666666666666667)
        windy: = false : class( Play) num_elements(7) predict_prob(0.857142857142857)
            humidity: &lt;= 80 : class( Play) num_elements(5) predict_prob(1)
            humidity: &gt; 80 : class( Play) num_elements(2) predict_prob(0.5)
                humidity: &lt;= 95 : class( Do not Play) num_elements(1) predict_prob(1)
                humidity: &gt; 95 : class( Play) num_elements(1) predict_prob(1)
        windy: = true : class( Do not Play) num_elements(2) predict_prob(1)
&#160;
Tree 8
    Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778)
        outlook: = overcast : class( Play) num_elements(4) predict_prob(1)
        outlook: = rain : class( Play) num_elements(3) predict_prob(0.666666666666667)
            windy: = false : class( Play) num_elements(2) predict_prob(1)
            windy: = true : class( Do not Play) num_elements(1) predict_prob(1)
        outlook: = sunny : class( Play) num_elements(2) predict_prob(0.5)
            humidity: &lt;= 70 : class( Play) num_elements(1) predict_prob(1)
            humidity: &gt; 70 : class( Do not Play) num_elements(1) predict_prob(1)
&#160;
Tree 9
    Root Node : class( Play) num_elements(9) predict_prob(0.555555555555556)
        outlook: = overcast : class( Play) num_elements(2) predict_prob(1)
        outlook: = rain : class( Play) num_elements(4) predict_prob(0.75)
            windy: = false : class( Play) num_elements(3) predict_prob(1)
            windy: = true : class( Do not Play) num_elements(1) predict_prob(1)
        outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(1)
&#160;
Tree 10
    Root Node : class( Play) num_elements(9) predict_prob(0.666666666666667)
        outlook: = overcast : class( Play) num_elements(1) predict_prob(1)
        outlook: = rain : class( Play) num_elements(4) predict_prob(1)
        outlook: = sunny : class( Do not Play) num_elements(4) predict_prob(0.75)
            humidity: &lt;= 70 : class( Play) num_elements(1) predict_prob(1)
            humidity: &gt; 70 : class( Do not Play) num_elements(3) predict_prob(1)
(10 rows)
</pre></li>
<li>Classify data with the learned model. <pre class="example">
SELECT * FROM madlib.rf_classify( 'trained_tree_infogain',
                                  'golf_data',
                                  'classification_result'
                                );
</pre> Result: <pre class="result">
input_set_size | classification_time
&#160;---------------+---------------------
14 | 00:00:02.215017
(1 row)
</pre></li>
<li>Check the classification results. Note that record 11 is classified as 'Do not Play', although its label in the training data is 'Play'. <pre class="example">
SELECT t.id, t.outlook, t.temperature, t.humidity, t.windy, c.class
FROM classification_result c, golf_data t
WHERE t.id=c.id ORDER BY id;
</pre> Result: <pre class="result">
 id | outlook  | temperature | humidity | windy  |    class
&#160;---+----------+-------------+----------+--------+--------------
  1 | sunny    |          85 |       85 | false  | Do not Play
  2 | sunny    |          80 |       90 | true   | Do not Play
  3 | overcast |          83 |       78 | false  | Play
  4 | rain     |          70 |       96 | false  | Play
  5 | rain     |          68 |       80 | false  | Play
  6 | rain     |          65 |       70 | true   | Do not Play
  7 | overcast |          64 |       65 | true   | Play
  8 | sunny    |          72 |       95 | false  | Do not Play
  9 | sunny    |          69 |       70 | false  | Play
 10 | rain     |          75 |       80 | false  | Play
 11 | sunny    |          75 |       70 | true   | Do not Play
 12 | overcast |          72 |       90 | true   | Play
 13 | overcast |          81 |       75 | false  | Play
 14 | rain     |          71 |       80 | true   | Do not Play
(14 rows)
</pre></li>
<li>Score the data against a validation set (here, the training set itself). Since record 11 is the only misclassified record, the expected score is 13/14. A hand computation of the same ratio appears after these examples. <pre class="example">
SELECT * FROM madlib.rf_score( 'trained_tree_infogain',
                               'golf_data',
                               0
                             );
</pre> Result: <pre class="result">
rf_score
&#160;------------------
0.928571428571429
(1 row)
</pre></li>
<li>Clean up the random forest and other auxiliary information. <pre class="example">
SELECT madlib.rf_clean('trained_tree_infogain');
</pre> Result: <pre class="result">
rf_clean
&#160;---------
t
(1 row)
</pre></li>
</ol>
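<p>As a cross-check of step 7, the same ratio can be computed by hand from the <code>classification_result</code> table produced in step 5. This is a minimal sketch for illustration only; rf_score classifies the validation table internally rather than reading this table.</p><pre class="example">
-- Fraction of records whose predicted class matches the true label.
SELECT avg(CASE WHEN c.class = t.class THEN 1.0 ELSE 0.0 END) AS accuracy
FROM classification_result c
JOIN golf_data t ON t.id = c.id;
</pre><p>With record 11 as the only mismatch, this returns 13/14 &#8776; 0.928571428571429, matching the rf_score result above.</p>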
<p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>[1] <a href="http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm">http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm</a></p>
<p>[2] <a href="http://en.wikipedia.org/wiki/Discretization_of_continuous_features">http://en.wikipedia.org/wiki/Discretization_of_continuous_features</a></p>
<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd>File <a class="el" href="rf_8sql__in.html" title="random forest APIs and main control logic written in PL/PGSQL ">rf.sql_in</a> documenting the SQL functions. </dd></dl>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Mon Jul 27 2015 20:37:46 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.10 </li>
</ul>
</div>
</body>
</html>