Decision Tree
<div class="header">
<div class="headertitle">
<div class="title">Decision Tree<div class="ingroups"><a class="el" href="group__grp__early__stage.html">Early Stage Development</a></div></div> </div>
<div class="contents">
<dl class="section warning"><dt>Warning</dt><dd><em> This MADlib method is still in early stage development. There may be some issues that will be addressed in a future version. The interface and implementation are subject to change. </em></dd></dl>
<dl class="section user"><dt>About</dt><dd></dd></dl>
<p>This module provides an implementation of the C4.5 algorithm to grow decision trees.</p>
<p>The implementation supports:</p>
<ul>
<li>Building decision trees</li>
<li>Multiple split criteria, including:<ul>
<li>Information Gain</li>
<li>Gini Coefficient</li>
<li>Gain Ratio</li>
</ul></li>
<li>Decision tree pruning</li>
<li>Decision tree classification/scoring</li>
<li>Decision tree display</li>
<li>Rule generation</li>
<li>Continuous and discrete features</li>
<li>Missing value handling</li>
</ul>
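<p>The three split criteria listed above can be illustrated outside the database. The following is a minimal Python sketch, not the MADlib implementation: it uses the textbook definitions of information gain, gain ratio, and Gini gain, and a simplified threshold search for a continuous feature. The helper names (entropy, gini, gain, gain_ratio, best_threshold) are hypothetical, and the rows are the outlook, humidity, and class columns of the golf_data example further down this page.</p>

```python
import math
from collections import Counter

# (outlook, humidity, class) copied from the golf_data example on this page.
ROWS = [
    ("sunny", 85, "Do not Play"), ("sunny", 90, "Do not Play"),
    ("overcast", 78, "Play"), ("rain", 96, "Play"), ("rain", 80, "Play"),
    ("rain", 70, "Do not Play"), ("overcast", 65, "Play"),
    ("sunny", 95, "Do not Play"), ("sunny", 70, "Play"), ("rain", 80, "Play"),
    ("sunny", 70, "Play"), ("overcast", 90, "Play"), ("overcast", 75, "Play"),
    ("rain", 80, "Do not Play"),
]


def entropy(labels):
    """Shannon entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(keys, labels, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    groups = {}
    for key, label in zip(keys, labels):
        groups.setdefault(key, []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - remainder


def gain_ratio(keys, labels):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy(keys)
    return gain(keys, labels, entropy) / split_info if split_info > 0 else 0.0


def best_threshold(values, labels):
    """Simplified search: try each distinct value except the largest as 'value <= t'.

    Assumes at least two distinct values are present.
    """
    candidates = sorted(set(values))[:-1]
    return max(
        candidates,
        key=lambda t: gain([v <= t for v in values], labels, entropy),
    )


outlook = [r[0] for r in ROWS]
classes = [r[2] for r in ROWS]

info_gain = gain(outlook, classes, entropy)
ratio = gain_ratio(outlook, classes)
gini_gain = gain(outlook, classes, gini)

# Continuous feature: humidity within the rows where outlook is sunny.
sunny = [(r[1], r[2]) for r in ROWS if r[0] == "sunny"]
threshold = best_threshold([h for h, _ in sunny], [c for _, c in sunny])

print(f"information gain (outlook): {info_gain:.4f}")
print(f"gain ratio (outlook):       {ratio:.4f}")
print(f"Gini gain (outlook):        {gini_gain:.4f}")
print(f"humidity threshold (sunny): {threshold}")
```

<p>On this data the sketch selects a humidity threshold of 70 for the sunny rows, the same split that the displayed tree in the examples uses. Expressed with natural logarithms instead of base 2, the information gain for outlook is about 0.171, which appears to match the scv value shown for the root node in the example tree table; that correspondence is an inference from the numbers, not something this documentation states.</p>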
<dl class="section user"><dt>Input</dt><dd></dd></dl>
<p>The <b>training data</b> is expected to be of the following form: </p>
<pre>{TABLE|VIEW} <em>trainingSource</em> (
<em>id</em> INT|BIGINT,
<em>feature1</em> SUPPORTED_DATA_TYPE,
<em>feature2</em> SUPPORTED_DATA_TYPE,
<em>feature3</em> SUPPORTED_DATA_TYPE,
<em>featureN</em> SUPPORTED_DATA_TYPE,
<em>class</em> SUPPORTED_DATA_TYPE
)</pre>
<p>The <b>data to classify</b> is expected to be of the same form as <b>training data</b>, except that it does not need a class column.</p>
<dl class="section user"><dt>Usage</dt><dd><ul>
<li>Run the training algorithm on the source data: <pre>SELECT * FROM <a class="el" href="c45_8sql__in.html#a18b30ff1a063e7cd16274bf7ab2a71dc">c45_train</a>(
</pre> This will create the decision tree output table storing an abstract object (representing the model) used for further classification. Column names: <pre>
id | tree_location | feature | probability | ebp_coeff | maxclass | scv | live | sample_size | parent_id | lmc_nid | lmc_fval | is_continuous | split_value | tid | dp_ids</pre></li>
<li>Run the classification function using the learned model: <pre>SELECT * FROM <a class="el" href="c45_8sql__in.html#af5eb174eeecd11233409657221586cf1">c45_classify</a>(
'<em>result_table_name</em>');</pre> This will create the result_table with the classification results. <pre> </pre></li>
<li>Run the scoring function to score the learned model against a validation data set: <pre>SELECT * FROM <a class="el" href="c45_8sql__in.html#af0739749507c1097003dcf529d29fee2">c45_score</a>(
'<em>verbosity</em>');</pre> This will give a ratio of correctly classified items in the validation set. <pre> </pre></li>
<li>Run the display tree function using the learned model: <pre>SELECT * FROM <a class="el" href="c45_8sql__in.html#ad7f190eb8e5d53f4772fac699787c0fe">c45_display</a>(
'<em>tree_table_name</em>');</pre> This will display the trained tree in human readable format. <pre> </pre></li>
<li>Run the clean tree function as below: <pre>SELECT * FROM <a class="el" href="c45_8sql__in.html#ac25e17ecbc70149aa559018e718fc793">c45_clean</a>(
'<em>tree_table_name</em>');</pre> This will clean up the learned model and all metadata. <pre> </pre></li>
</ul></dd></dl>
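<p>Because c45_train takes a long positional argument list, it can help to assemble the call from named values. The sketch below is a hypothetical client-side helper, not part of MADlib: the function names, the parameter names, and the MADlib schema default are assumptions, and the argument order is taken from the commented training call in the examples on this page. It only builds the SQL text and does not connect to a database.</p>

```python
# Argument order assumed from the commented c45_train call in the examples.
# The parameter names themselves are hypothetical labels for this sketch.
TRAIN_ARG_ORDER = [
    "split_criterion",
    "input_table",
    "result_tree",
    "validation_table",
    "continuous_features",
    "feature_columns",
    "id_column",
    "class_column",
    "confidence_level",
    "missing_value_preparation",
    "max_tree_depth",
    "min_percent_mode",
    "min_percent_split",
    "verbosity",
]


def sql_literal(value):
    """Render a Python value as a SQL literal (None becomes null)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Strings: wrap in single quotes and double any embedded quote.
    return "'" + str(value).replace("'", "''") + "'"


def build_c45_train_sql(args, schema="MADlib"):
    """Return the text of a c45_train call; raises KeyError if a value is missing."""
    rendered = ", ".join(sql_literal(args[name]) for name in TRAIN_ARG_ORDER)
    return f"SELECT * FROM {schema}.c45_train({rendered});"


# Values copied from the training example on this page.
example_args = {
    "split_criterion": "infogain",
    "input_table": "golf_data",
    "result_tree": "trained_tree_infogain",
    "validation_table": None,
    "continuous_features": "temperature,humidity",
    "feature_columns": "outlook,temperature,humidity,windy",
    "id_column": "id",
    "class_column": "class",
    "confidence_level": 100,
    "missing_value_preparation": "explicit",
    "max_tree_depth": 5,
    "min_percent_mode": 0.001,
    "min_percent_split": 0.001,
    "verbosity": 0,
}

train_sql = build_c45_train_sql(example_args)
print(train_sql)
```

<p>Naming each value makes it harder to transpose two adjacent arguments of the same type, such as the two 0.001 thresholds.</p>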
<dl class="section user"><dt>Examples</dt><dd><ol type="1">
<li>Prepare an input table/view, e.g.: <pre class="fragment">sql&gt; select * from golf_data order by id;
id | outlook | temperature | humidity | windy | class
1 | sunny | 85 | 85 | false | Do not Play
2 | sunny | 80 | 90 | true | Do not Play
3 | overcast | 83 | 78 | false | Play
4 | rain | 70 | 96 | false | Play
5 | rain | 68 | 80 | false | Play
6 | rain | 65 | 70 | true | Do not Play
7 | overcast | 64 | 65 | true | Play
8 | sunny | 72 | 95 | false | Do not Play
9 | sunny | 69 | 70 | false | Play
10 | rain | 75 | 80 | false | Play
11 | sunny | 75 | 70 | true | Play
12 | overcast | 72 | 90 | true | Play
13 | overcast | 81 | 75 | false | Play
14 | rain | 71 | 80 | true | Do not Play
(14 rows)</pre></li>
<li>Train the decision tree model, e.g.: <pre class="fragment">sql&gt; SELECT * FROM MADlib.c45_clean('trained_tree_infogain');
sql&gt; SELECT * FROM MADlib.c45_train(
'infogain', -- split criterion name
'golf_data', -- input table name
'trained_tree_infogain', -- result tree name
null, -- validation table name
'temperature,humidity', -- continuous feature names
'outlook,temperature,humidity,windy', -- feature column names
'id', -- id column name
'class', -- class column name
100, -- confidence level
'explicit', -- missing value preparation
5, -- max tree depth
0.001, -- min percent mode
0.001, -- min percent split
0); -- verbosity
training_set_size | tree_nodes | tree_depth | training_time | split_criterion
14 | 8 | 3 | 00:00:00.871805 | infogain
(1 row)</pre></li>
<li>Check a few rows from the tree model table: <pre class="fragment">sql&gt; select * from trained_tree_infogain order by id;
id | tree_location | feature | probability | ebp_coeff | maxclass | scv | live | sample_size | parent_id | lmc_nid | lmc_fval | is_continuous | split_value
1 | {0} | 3 | 0.642857142857143 | 1 | 2 | 0.171033941880327 | 0 | 14 | 0 | 2 | 1 | f |
2 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 1 | | | f |
3 | {0,2} | 4 | 0.6 | 1 | 2 | 0.673011667009257 | 0 | 5 | 1 | 5 | 1 | f |
4 | {0,3} | 2 | 0.6 | 1 | 1 | 0.673011667009257 | 0 | 5 | 1 | 7 | 1 | t | 70
5 | {0,2,1} | 4 | 1 | 1 | 2 | 0 | 0 | 3 | 3 | | | f |
6 | {0,2,2} | 4 | 1 | 1 | 1 | 0 | 0 | 2 | 3 | | | f |
7 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 4 | | | f |
8 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 4 | | | f |
(8 rows)</pre></li>
<li>Display the tree in a human-readable format: <pre class="fragment">sql&gt; select MADlib.c45_display('trained_tree_infogain');
Tree 1
    Root Node : class( Play) num_elements(14) predict_prob(0.642857142857143)
        outlook: = overcast : class( Play) num_elements(4) predict_prob(1)
        outlook: = rain : class( Play) num_elements(5) predict_prob(0.6)
            windy: = false : class( Play) num_elements(3) predict_prob(1)
            windy: = true : class( Do not Play) num_elements(2) predict_prob(1)
        outlook: = sunny : class( Do not Play) num_elements(5) predict_prob(0.6)
            humidity: &lt;= 70 : class( Play) num_elements(2) predict_prob(1)
            humidity: &gt; 70 : class( Do not Play) num_elements(3) predict_prob(1)
(1 row)</pre></li>
<li>Classify data with the learned model: <pre class="fragment">sql&gt; select * from MADlib.c45_classify(
'trained_tree_infogain', -- name of the trained model
'golf_data', -- name of the table containing data to classify
'classification_result'); -- name of the output table
input_set_size | classification_time
14 | 00:00:00.247713
(1 row)</pre></li>
<li>Check classification results: <pre class="fragment">sql&gt; select t.id,t.outlook,t.temperature,t.humidity,t.windy,c.class from
MADlib.classification_result c,golf_data t where t.id=c.id order by t.id;
id | outlook | temperature | humidity | windy | class
1 | sunny | 85 | 85 | false | Do not Play
2 | sunny | 80 | 90 | true | Do not Play
3 | overcast | 83 | 78 | false | Play
4 | rain | 70 | 96 | false | Play
5 | rain | 68 | 80 | false | Play
6 | rain | 65 | 70 | true | Do not Play
7 | overcast | 64 | 65 | true | Play
8 | sunny | 72 | 95 | false | Do not Play
9 | sunny | 69 | 70 | false | Play
10 | rain | 75 | 80 | false | Play
11 | sunny | 75 | 70 | true | Play
12 | overcast | 72 | 90 | true | Play
13 | overcast | 81 | 75 | false | Play
14 | rain | 71 | 80 | true | Do not Play
(14 rows)</pre></li>
<li>Score the data against a validation set: <pre class="fragment">sql&gt; select * from MADlib.c45_score(
(1 row)</pre></li>
<li>Clean up the tree and metadata: <pre class="fragment">sql&gt; select MADlib.c45_clean('trained_tree_infogain');
(1 row)</pre></li></ol></dd></dl>
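<p>The tree displayed in the examples can be read as a small set of if/then rules, and the scoring step reports the ratio of correctly classified rows. The following Python sketch hand-translates the displayed tree into a function and computes that ratio on the golf_data rows from this page. It is an illustration only: the classify function and the GOLF list are hypothetical, and the result is not the output of c45_score.</p>

```python
# Rows copied from the golf_data example:
# (id, outlook, temperature, humidity, windy, class)
GOLF = [
    (1, "sunny", 85, 85, False, "Do not Play"),
    (2, "sunny", 80, 90, True, "Do not Play"),
    (3, "overcast", 83, 78, False, "Play"),
    (4, "rain", 70, 96, False, "Play"),
    (5, "rain", 68, 80, False, "Play"),
    (6, "rain", 65, 70, True, "Do not Play"),
    (7, "overcast", 64, 65, True, "Play"),
    (8, "sunny", 72, 95, False, "Do not Play"),
    (9, "sunny", 69, 70, False, "Play"),
    (10, "rain", 75, 80, False, "Play"),
    (11, "sunny", 75, 70, True, "Play"),
    (12, "overcast", 72, 90, True, "Play"),
    (13, "overcast", 81, 75, False, "Play"),
    (14, "rain", 71, 80, True, "Do not Play"),
]


def classify(outlook, humidity, windy):
    """Rules read off the displayed tree: split on outlook, then windy or humidity."""
    if outlook == "overcast":
        return "Play"
    if outlook == "rain":
        return "Do not Play" if windy else "Play"
    if outlook == "sunny":
        return "Play" if humidity <= 70 else "Do not Play"
    raise ValueError(f"unexpected outlook value: {outlook!r}")


# Compare the rule-based prediction with the recorded class for every row.
mismatches = [
    row_id
    for row_id, outlook, _temperature, humidity, windy, actual in GOLF
    if classify(outlook, humidity, windy) != actual
]
accuracy = (len(GOLF) - len(mismatches)) / len(GOLF)

print(f"misclassified ids: {mismatches}")
print(f"ratio of correctly classified rows: {accuracy}")
```

<p>Temperature never appears in the rules because the displayed tree does not split on it. Scoring against the training rows themselves, as done here, only confirms that the rules reproduce the tree; a separate validation table gives a more meaningful ratio.</p>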
<dl class="section see"><dt>See Also</dt><dd>File <a class="el" href="c45_8sql__in.html" title="C4.5 APIs and main controller written in PL/PGSQL. ">c45.sql_in</a> documenting the SQL functions. </dd></dl>
