<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: PMML Export</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'madlib.apache.org');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.16</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__pmml.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">PMML Export<div class="ingroups"><a class="el" href="group__grp__other__functions.html">Utilities</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b><ul>
<li class="level1">
<a href="#function">PMML Export Function</a> </li>
<li class="level1">
<a href="#examples">Examples</a> </li>
<li class="level1">
<a href="#background">Background</a> </li>
<li class="level1">
<a href="#related">Related Topics</a> </li>
</ul>
</div><p><a class="anchor" id="function"></a></p><dl class="section user"><dt>PMML Export Function</dt><dd>The PMML export function in MADlib has the following syntax: <pre class="syntax">
pmml ( model_table,
name_spec
)
</pre> <b>Arguments</b> <dl class="arglist">
<dt>model_table </dt>
<dd><p class="startdd">VARCHAR. The name of the table containing the model.</p>
<p class="enddd"></p>
</dd>
<dt>name_spec (optional) </dt>
<dd>VARCHAR or VARCHAR[]. Names to use in the Data Dictionary of the PMML document, given either as a formula-style string or as an array of names. See <a class="el" href="table__to__pmml_8sql__in.html#a9635b6989d9f972497b6b4164b77aa0a" title="Given the model constructed from a data mining algorithm, this function converts the model into PMML ...">pmml()</a> for a detailed explanation. </dd>
</dl>
</dd></dl>
<p><b>Output</b> XML. The function returns a standard PMML document; sample output is shown in the Examples section. </p>
<dl class="section note"><dt>Note</dt><dd>In PostgreSQL, the database may need to be installed with XML support (a server built with libxml) in order to use this function.</dd></dl>
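<p>A quick way to check for XML support is to cast a small literal to the xml type. This is a sketch rather than an official MADlib check: on a server with XML support the query returns true, and on a server built without it the cast is expected to fail with an error.</p>
<pre class="example">
-- Parses a trivial document; IS DOCUMENT is a standard PostgreSQL XML predicate.
SELECT '&lt;PMML/&gt;'::xml IS DOCUMENT AS xml_supported;
</pre>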
<p>Typically, the PMML output is written to a file so that external software can use it. One way to do this from psql is to start the client in unaligned output mode with the '-A' flag and then, inside the client, use '\t' to print rows only (no column headers or row-count footer) and '\o' to redirect query output to a file:</p>
<pre class="example">
&gt; # under bash
&gt; psql -A my_database
# -- in psql now
# \t
# -- send query output to a file
# \o test.pmml
# select madlib.pmml('tree_out');
# -- stop writing to the file and restore the default row output
# \o
# \t
</pre><p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd><ol type="1">
<li>Create the training data table. <pre class="example">
CREATE TABLE patients( id integer NOT NULL,
second_attack integer,
treatment integer,
trait_anxiety integer);
INSERT INTO patients(id, second_attack, treatment, trait_anxiety) VALUES
( 1, 1, 1, 70),
( 3, 1, 1, 50),
( 5, 1, 0, 40),
( 7, 1, 0, 75),
( 9, 1, 0, 70),
(11, 0, 1, 65),
(13, 0, 1, 45),
(15, 0, 1, 40),
(17, 0, 0, 55),
(19, 0, 0, 50),
( 2, 1, 1, 80),
( 4, 1, 0, 60),
( 6, 1, 0, 65),
( 8, 1, 0, 80),
(10, 1, 0, 60),
(12, 0, 1, 50),
(14, 0, 1, 35),
(16, 0, 1, 50),
(18, 0, 0, 45),
(20, 0, 0, 60);
</pre></li>
<li>Train a logistic regression model using <a class="el" href="logistic_8sql__in.html#a74210a7ef513dfcbdfdd9f3b37bfe428" title="Compute logistic-regression coefficients and diagnostic statistics. ">logregr_train()</a>. <pre class="example">
SELECT madlib.logregr_train(
'patients',
'patients_logregr',
'second_attack',
'ARRAY[1, treatment, trait_anxiety]');
</pre></li>
<li>View the PMML export for this model. <pre class="example">
SELECT madlib.pmml('patients_logregr');
</pre> Result: <pre class="result">
&lt;?xml version="1.0" standalone="yes"?&gt;
&lt;PMML version="4.1" xmlns="http://www.dmg.org/pmml-v4-1.html"&gt;
  &lt;Header copyright="redacted for this example"&gt;
    &lt;Extension extender="MADlib" name="user" value="gpadmin"/&gt;
    &lt;Application name="MADlib" version="1.7"/&gt;
    &lt;Timestamp&gt;
      2014-06-13 17:30:14.527899 PDT
    &lt;/Timestamp&gt;
  &lt;/Header&gt;
  &lt;DataDictionary numberOfFields="4"&gt;
    &lt;DataField dataType="boolean" name="second_attack_pmml_prediction" optype="categorical"/&gt;
    &lt;DataField dataType="double" name="1" optype="continuous"/&gt;
    &lt;DataField dataType="double" name="treatment" optype="continuous"/&gt;
    &lt;DataField dataType="double" name="trait_anxiety" optype="continuous"/&gt;
  &lt;/DataDictionary&gt;
  &lt;RegressionModel functionName="classification" normalizationMethod="softmax"&gt;
    &lt;MiningSchema&gt;
      &lt;MiningField name="second_attack_pmml_prediction" usageType="predicted"/&gt;
      &lt;MiningField name="1"/&gt;
      &lt;MiningField name="treatment"/&gt;
      &lt;MiningField name="trait_anxiety"/&gt;
    &lt;/MiningSchema&gt;
    &lt;RegressionTable intercept="0.0" targetCategory="True"&gt;
      &lt;NumericPredictor coefficient="-6.36346994178" name="1"/&gt;
      &lt;NumericPredictor coefficient="-1.02410605239" name="treatment"/&gt;
      &lt;NumericPredictor coefficient="0.119044916669" name="trait_anxiety"/&gt;
    &lt;/RegressionTable&gt;
    &lt;RegressionTable intercept="0.0" targetCategory="False"/&gt;
  &lt;/RegressionModel&gt;
&lt;/PMML&gt;
</pre></li>
</ol>
</dd></dl>
<p>Alternatively, if custom names are needed for the fields in the Data Dictionary, pass them as the second argument, listed in the same order as the variables used to train the model: </p><pre class="example">
SELECT madlib.pmml('patients_logregr',
'out_attack~1+in_treatment+in_trait_anxiety');
</pre><p><b>Note:</b> If the second argument of the 'pmml' function is not specified, the default suffix "_pmml_prediction" is automatically appended to the name of the column to be predicted. This helps avoid name conflicts.</p>
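<p>Because name_spec accepts VARCHAR[] as well as VARCHAR, the custom names can also be supplied as an array. The call below is a sketch that assumes the first element names the predicted field and the remaining elements name the independent variables in training order, as in the grouping example that follows; the names themselves are only illustrative.</p>
<pre class="example">
-- Assumed array layout: predicted field first, then one name per
-- independent variable, in the order used by logregr_train().
SELECT madlib.pmml('patients_logregr',
                   ARRAY['out_attack', '1', 'in_treatment', 'in_trait_anxiety']);
</pre>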
<p>The following example exports a model trained with a grouping column, using the same dataset as the previous example.</p>
<ol type="1">
<li>Train a different regression model with 'treatment' as the grouping column. <pre class="example">
SELECT madlib.logregr_train(
'patients',
'patients_logregr_grouping',
'second_attack',
'ARRAY[1, trait_anxiety]',
'treatment');
</pre></li>
<li>View the PMML export for this model. <pre class="example">
SELECT madlib.pmml('patients_logregr_grouping',
ARRAY['second_attack','1','in_trait_anxiety']);
</pre> Result: <pre class="result">
&lt;?xml version="1.0" standalone="yes"?&gt;
&lt;PMML version="4.1" xmlns="http://www.dmg.org/pmml-v4-1.html"&gt;
  &lt;Header copyright="redacted for this example"&gt;
    &lt;Extension extender="MADlib" name="user" value="gpadmin"/&gt;
    &lt;Application name="MADlib" version="1.7"/&gt;
    &lt;Timestamp&gt;
      2014-06-13 17:37:55.786307 PDT
    &lt;/Timestamp&gt;
  &lt;/Header&gt;
  &lt;DataDictionary numberOfFields="4"&gt;
    &lt;DataField dataType="boolean" name="second_attack" optype="categorical"/&gt;
    &lt;DataField dataType="double" name="1" optype="continuous"/&gt;
    &lt;DataField dataType="double" name="in_trait_anxiety" optype="continuous"/&gt;
    &lt;DataField dataType="string" name="treatment" optype="categorical"/&gt;
  &lt;/DataDictionary&gt;
  &lt;MiningModel functionName="classification"&gt;
    &lt;MiningSchema&gt;
      &lt;MiningField name="second_attack" usageType="predicted"/&gt;
      &lt;MiningField name="1"/&gt;
      &lt;MiningField name="in_trait_anxiety"/&gt;
      &lt;MiningField name="treatment"/&gt;
    &lt;/MiningSchema&gt;
    &lt;Segmentation multipleModelMethod="selectFirst"&gt;
      &lt;Segment&gt;
        &lt;SimplePredicate field="treatment" operator="equal" value="1"/&gt;
        &lt;RegressionModel functionName="classification" normalizationMethod="softmax"&gt;
          &lt;MiningSchema&gt;
            &lt;MiningField name="second_attack" usageType="predicted"/&gt;
            &lt;MiningField name="1"/&gt;
            &lt;MiningField name="in_trait_anxiety"/&gt;
          &lt;/MiningSchema&gt;
          &lt;RegressionTable intercept="0.0" targetCategory="True"&gt;
            &lt;NumericPredictor coefficient="-8.02068430057" name="1"/&gt;
            &lt;NumericPredictor coefficient="0.130090428526" name="in_trait_anxiety"/&gt;
          &lt;/RegressionTable&gt;
          &lt;RegressionTable intercept="0.0" targetCategory="False"/&gt;
        &lt;/RegressionModel&gt;
      &lt;/Segment&gt;
      &lt;Segment&gt;
        &lt;SimplePredicate field="treatment" operator="equal" value="0"/&gt;
        &lt;RegressionModel functionName="classification" normalizationMethod="softmax"&gt;
          &lt;MiningSchema&gt;
            &lt;MiningField name="second_attack" usageType="predicted"/&gt;
            &lt;MiningField name="1"/&gt;
            &lt;MiningField name="in_trait_anxiety"/&gt;
          &lt;/MiningSchema&gt;
          &lt;RegressionTable intercept="0.0" targetCategory="True"&gt;
            &lt;NumericPredictor coefficient="-5.75043192191" name="1"/&gt;
            &lt;NumericPredictor coefficient="0.108282446319" name="in_trait_anxiety"/&gt;
          &lt;/RegressionTable&gt;
          &lt;RegressionTable intercept="0.0" targetCategory="False"/&gt;
        &lt;/RegressionModel&gt;
      &lt;/Segment&gt;
    &lt;/Segmentation&gt;
  &lt;/MiningModel&gt;
&lt;/PMML&gt;
</pre></li>
</ol>
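<p>One way to sanity-check a grouped export is to compare each Segment with the model table it was built from. The query below is a minimal sketch that assumes the output table of logregr_train() has one row per value of the grouping column, with the fitted coefficients in an array column named 'coef'. Each row should then correspond to the NumericPredictor coefficients of the Segment whose SimplePredicate matches that treatment value.</p>
<pre class="example">
-- One row per group; coef is ordered like ARRAY[1, trait_anxiety].
SELECT treatment, coef
FROM patients_logregr_grouping
ORDER BY treatment;
</pre>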
<p><b>Note:</b> MADlib currently supports PMML export for Linear Regression, Logistic Regression, Generalized Linear Models, Multinomial Logistic Regression, Ordinal Regression, Decision Tree and Random Forest.</p>
<p>For Ordinal Regression, the signs of the feature coefficients in the PMML export differ from those in the default output model table from ordinal(). This is due to a difference in model settings.</p>
<p><a class="anchor" id="background"></a></p><dl class="section user"><dt>Background</dt><dd>The Predictive Model Markup Language (PMML) is an XML-based file format that provides a way for applications to describe and exchange models produced by data mining and machine learning algorithms. A PMML file comprises the following components:<ul>
<li>Header: Contains general information about the model, such as copyright information and a model description.</li>
<li>Data Dictionary: Contains definitions of fields used in the model.</li>
<li>Data Transformations: Contains transformations for mapping user data into a form that can be used by the mining model.</li>
<li>Model: Contains definitions of the data mining model, which includes attributes such as the model name, function name, and algorithm name.</li>
<li>Mining Schema: Contains specific information for the fields used in the model, which includes the name and usage type.</li>
<li>Targets: Allows for post-processing of the predicted value.</li>
<li>Output: Allows for naming of output fields expected from the model.</li>
</ul>
</dd></dl>
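<p>To see which of these components a particular export contains, the document can be probed with PostgreSQL's XPath functions. The query below is a sketch that assumes the patients_logregr model from the Examples section, that pmml() returns a value of type xml as stated under Output, and a PostgreSQL version that provides xpath_exists(). It matches on local element names so that the PMML namespace does not need to be spelled out. The element names TransformationDictionary, Targets and Output are taken from the PMML schema, not from MADlib output.</p>
<pre class="example">
WITH doc AS (
    SELECT madlib.pmml('patients_logregr') AS pmml
)
SELECT
    -- Top-level components are direct children of the root PMML element.
    xpath_exists('/*/*[local-name()="Header"]', pmml)                   AS has_header,
    xpath_exists('/*/*[local-name()="DataDictionary"]', pmml)           AS has_data_dictionary,
    xpath_exists('/*/*[local-name()="TransformationDictionary"]', pmml) AS has_transformations,
    -- Model-level components may be nested inside one or more models.
    xpath_exists('//*[local-name()="MiningSchema"]', pmml)              AS has_mining_schema,
    xpath_exists('//*[local-name()="Targets"]', pmml)                   AS has_targets,
    xpath_exists('//*[local-name()="Output"]', pmml)                    AS has_output
FROM doc;
</pre>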
<p>MADlib follows the PMML v4.1 standard. For more details about PMML, see <a href="http://www.dmg.org/v4-1/GeneralStructure.html">http://www.dmg.org/v4-1/GeneralStructure.html</a>.</p>
<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd></dd></dl>
<p>File <a class="el" href="table__to__pmml_8sql__in.html">table_to_pmml.sql_in</a> documenting the PMML export functions.</p>
<p><a class="el" href="group__grp__linreg.html">Linear Regression</a></p>
<p><a class="el" href="group__grp__logreg.html">Logistic Regression</a></p>
<p><a class="el" href="group__grp__glm.html">Generalized Linear Models</a></p>
<p><a class="el" href="group__grp__ordinal.html">Ordinal Regression</a></p>
<p><a class="el" href="group__grp__multinom.html">Multinomial Regression</a></p>
<p><a class="el" href="group__grp__decision__tree.html">Decision Tree</a></p>
<p><a class="el" href="group__grp__random__forest.html">Random Forest</a> </p>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Tue Jul 2 2019 22:35:52 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
</ul>
</div>
</body>
</html>