<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.10"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: Support Vector Machines</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
$(window).load(resizeHeight);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script src="../mathjax/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="navtree_hack.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'auto');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.incubator.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.8</span>
</div>
<div id="projectbrief">User Documentation for MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.10 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__kernmach.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Support Vector Machines<div class="ingroups"><a class="el" href="group__grp__early__stage.html">Early Stage Development</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b><ul>
<li>
<a href="#learn">Regression Learning Function</a> </li>
<li>
<a href="#classify">Classification Learning Function</a> </li>
<li>
<a href="#novelty">Novelty Detection Function</a> </li>
<li>
<a href="#predict">Prediction Functions</a> </li>
<li>
<a href="#notes">Notes</a> </li>
<li>
<a href="#examples">Examples</a> </li>
<li>
<a href="#literature">Literature</a> </li>
<li>
<a href="#related">Related Topics</a> </li>
</ul>
</div><dl class="section warning"><dt>Warning</dt><dd><em> This MADlib method is still in early stage development. There may be some issues that will be addressed in a future version. Interface and implementation are subject to change. </em></dd></dl>
<p>Support vector machines (SVMs) and related kernel methods have been among the most popular and well-studied machine learning techniques of the past fifteen years, with a large number of innovations and applications.</p>
<p>In a nutshell, an SVM model \(f(x)\) takes the form of </p><p class="formulaDsp">
\[ f(x) = \sum_i \alpha_i k(x_i,x), \]
</p>
<p> where each \( \alpha_i \) is a real number, each \( \boldsymbol x_i \) is a data point from the training set (called a support vector), and \( k(\cdot, \cdot) \) is a kernel function that measures how "similar" two objects are. In regression, \( f(\boldsymbol x) \) is the regression function we seek. In classification, \( f(\boldsymbol x) \) serves as the decision boundary; so for example in binary classification, the predictor can output class 1 for object \(x\) if \( f(\boldsymbol x) \geq 0 \), and class 2 otherwise.</p>
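<p>As an illustration (not part of MADlib; the helper names below are hypothetical), the following Python sketch evaluates the decision function \( f(x) = \sum_i \alpha_i k(x_i, x) \) over a small set of support vectors, using a Gaussian kernel as the similarity measure, and classifies a point by the sign of \( f(x) \):</p>

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    # One common choice of similarity measure k(., .)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def svm_decision(x, support_vectors, alphas, kernel=gaussian_kernel):
    # f(x) = sum_i alpha_i * k(x_i, x) over the support vectors x_i
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors))

# Two support vectors with opposite-signed coefficients: points near the
# first get f(x) >= 0 (class 1), points near the second get f(x) < 0 (class 2).
svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
alphas = [1.0, -1.0]
f = svm_decision(np.array([0.9, 1.1]), svs, alphas)  # close to the first SV
```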
<p>In the case when the kernel function \( k(\cdot, \cdot) \) is the standard inner product on vectors, \( f(\boldsymbol x) \) is just an alternative way of writing a linear function </p><p class="formulaDsp">
\[ f&#39;(\boldsymbol x) = \langle \boldsymbol w, \boldsymbol x \rangle, \]
</p>
<p> where \( \boldsymbol w \) is a weight vector having the same dimension as \( \boldsymbol x \). One of the key strengths of SVMs is that we can use richer kernel functions to efficiently learn linear models in high-dimensional feature spaces, since \( k(\boldsymbol x_i, \boldsymbol x_j) \) can be understood as an efficient way of computing an inner product in the feature space: </p><p class="formulaDsp">
\[ k(\boldsymbol x_i, \boldsymbol x_j) = \langle \phi(\boldsymbol x_i), \phi(\boldsymbol x_j) \rangle, \]
</p>
<p> where \( \phi(\boldsymbol x) \) projects \( \boldsymbol x \) into a (possibly infinite-dimensional) feature space.</p>
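<p>The identity above can be checked numerically. The following Python sketch (illustration only, not MADlib code) verifies that the degree-2 polynomial kernel \( k(x, y) = \langle x, y \rangle^2 \) equals the inner product of explicit quadratic feature maps \( \phi(x) \), while requiring only \( O(d) \) work instead of \( O(d^2) \):</p>

```python
import numpy as np

def poly2_kernel(x, y):
    # Degree-2 homogeneous polynomial kernel: k(x, y) = <x, y>^2
    return np.dot(x, y) ** 2

def poly2_features(x):
    # Explicit feature map phi(x): all pairwise products x_i * x_j,
    # chosen so that <phi(x), phi(y)> = <x, y>^2
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

k_implicit = poly2_kernel(x, y)                            # O(d) work
k_explicit = np.dot(poly2_features(x), poly2_features(y))  # O(d^2) work
# Both evaluate to (x . y)^2 = 4.5^2 = 20.25
```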
<p>There are many algorithms for learning kernel machines. This module implements the class of online learning with kernels algorithms described in Kivinen et al. [1]. It also includes the incremental gradient descent (IGD) method of Feng et al. [3] for learning linear SVMs with the hinge loss \(l(z) = \max(0, 1-z)\). See also the book by Scholkopf and Smola [2] for more details on SVMs in general.</p>
<p>The IGD implementation is based on the Bismarck project at the University of Wisconsin (<a href="http://hazy.cs.wisc.edu/hazy/victor/bismarck/">http://hazy.cs.wisc.edu/hazy/victor/bismarck/</a>). The methods introduced in [1] are implemented according to their original descriptions, except that we update the support vector model only when a significant error is made. The original algorithms in [1] update the support vector model at every step, even when no error is made, in the name of regularization. For practical purposes, updating only when necessary is both faster and, at least in the i.i.d. setting, better from a learning-theoretic point of view; this has been verified empirically to a certain degree.</p>
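<p>For intuition, here is a simplified single-machine Python sketch of the IGD update for a linear SVM with the hinge loss. The parameter names mirror the <code>eta</code>, <code>reg</code>, and <code>max_iter</code> arguments of <code>lsvm_classification</code>, but this is an illustrative sketch, not the MADlib implementation:</p>

```python
import numpy as np

def igd_linear_svm(X, y, eta=0.1, reg=0.001, max_iter=100, seed=0):
    """Incremental (stochastic) gradient descent for a linear SVM with
    hinge loss l(z) = max(0, 1 - z), where z = y * <w, x>."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        for i in rng.permutation(len(X)):
            z = y[i] * np.dot(w, X[i])
            grad = reg * w              # gradient of the L2 regularizer
            if z < 1:                   # hinge loss is active for this point
                grad -= y[i] * X[i]
            w -= eta * grad
    return w

# Toy linearly separable data; the last column is a constant-1 intercept feature.
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0],
              [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = igd_linear_svm(X, y)
pred = np.sign(X @ w)
```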
<p>Methods for classification, regression and novelty detection are available. Multiple instances of the algorithms can be executed in parallel on different subsets of the training data. The resultant support vector models can then be combined using standard techniques like averaging or majority voting.</p>
<p>Training data points are accessed via a table or a view. The support vector models can also be stored in tables for fast execution.</p>
<p><a class="anchor" id="learn"></a></p><dl class="section user"><dt>Regression Learning Function</dt><dd></dd></dl>
<p>Regression learning is achieved through the following function: </p><pre class="syntax">
svm_regression( input_table,
model_table,
parallel,
kernel_func,
verbose DEFAULT false,
eta DEFAULT 0.1,
nu DEFAULT 0.005,
slambda DEFAULT 0.05,
kernel_param DEFAULT 1.0
)</pre><p>For classification and regression, the training table/view is expected to be of the following form (the array size of <em>ind</em> must not exceed 102,400):<br />
</p><pre>{TABLE|VIEW} input_table (
...
id INT,
ind FLOAT8[],
label FLOAT8,
...
)</pre><p>For novelty detection, the label field is not required. Also note that the column names of input_table are required to be exactly as described above. This limitation will be removed when this module graduates from the early development stage.</p>
<p><a class="anchor" id="classify"></a></p><dl class="section user"><dt>Classification Learning Function</dt><dd></dd></dl>
<p>Classification learning is achieved through the following two functions:</p>
<ul>
<li>Learn linear SVM(s) using IGD [3]. <pre class="syntax">
lsvm_classification( input_table,
model_table,
parallel DEFAULT false,
verbose DEFAULT false,
eta DEFAULT 0.1,
reg DEFAULT 0.001,
max_iter DEFAULT 100
)
</pre> Note that, as with any gradient descent method, IGD converges with a larger eta (stepsize), and thus faster, if the input training data is well-conditioned. We highly recommend that users perform data preparation so that the mean value of each feature column is 0 and the standard deviation is 1, and append an extra feature with constant value 1 for the intercept term. We plan to provide a function for this when this module graduates from early-stage development.</li>
<li>Learn linear or non-linear SVM(s) using the method described in [1]. <pre class="syntax">
svm_classification( input_table,
model_table,
parallel,
kernel_func,
verbose DEFAULT false,
eta DEFAULT 0.1,
nu DEFAULT 0.005,
kernel_param DEFAULT 1.0
)
</pre></li>
</ul>
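<p>The data preparation recommended above (zero-mean, unit-standard-deviation features plus a constant intercept column) can be sketched in Python as follows; this is an illustration only, since MADlib does not yet provide such a function:</p>

```python
import numpy as np

def standardize_with_intercept(X):
    """Center each feature to mean 0, scale it to standard deviation 1,
    and append a constant-1 column for the intercept term."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # leave constant columns unscaled
    Z = (X - mu) / sigma
    return np.hstack([Z, np.ones((X.shape[0], 1))])

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = standardize_with_intercept(X)
# Each original column of Z now has mean 0 and std 1; the last column is all 1s.
```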
<p><a class="anchor" id="novelty"></a></p><dl class="section user"><dt>Novelty Detection Function</dt><dd></dd></dl>
<p>Novelty detection is achieved through the following function: </p><pre class="syntax">
svm_novelty_detection( input_table,
model_table,
parallel,
kernel_func,
verbose DEFAULT false,
eta DEFAULT 0.1,
nu DEFAULT 0.005,
kernel_param DEFAULT 1.0
)
</pre><p>Assuming the model_table parameter takes on value 'model', each learning function will produce two tables as output: 'model' and 'model_param'. The first contains the support vectors of the model(s) learned. The second contains the parameters of the model(s) learned, which include information like the kernel function used and the value of the intercept, if there is one.</p>
<p><a class="anchor" id="predict"></a></p><dl class="section user"><dt>Prediction Functions</dt><dd></dd></dl>
<ul>
<li>To make predictions on new data points using a single model learned previously, we use the function <pre class="syntax">
svm_predict_batch( input_table,
data_col,
id_col,
model_table,
output_table,
parallel
)
</pre> If the <code>parallel</code> parameter is true, then each data point in the input table will have multiple predicted values, one for each of the models learned in parallel.</li>
<li>If the model is produced by the <a class="el" href="lsvm_8sql__in.html#a6dcddc88d70523ddda32b46ab82dfbbf" title="This is the linear support vector classification function. ">lsvm_classification()</a> function, use the following prediction function instead. <pre class="syntax">
lsvm_predict_batch( input_table,
data_col,
id_col,
model_table,
output_table
)
</pre></li>
<li>Note that, to make predictions on a subset of the data points stored in a table, a separate view or table needs to be created ahead of time: <pre class="example">
-- create subset as a view
CREATE VIEW subset AS SELECT * FROM input_table WHERE id &lt;= 100;
-- prediction on the subset
SELECT svm_predict_batch('subset', 'ind', 'id',
'svm_model', 'subset_svm_predict', false);
-- prediction using linear SVMs
SELECT lsvm_predict_batch('subset', 'ind', 'id',
'lsvm_model', 'subset_lsvm_predict');
</pre></li>
</ul>
<p><a class="anchor" id="notes"></a></p><dl class="section user"><dt>Notes</dt><dd></dd></dl>
<p>The <code>kernel_func</code> argument of <code>svm_classification</code>, <code>svm_regression</code>, and <code>svm_novelty_detection</code> can only accept a kernel function in the following form:</p>
<pre class="syntax">
FLOAT8 kernel_function(FLOAT8[], FLOAT8[], FLOAT8)
</pre><p>The first two parameters are feature vectors, and the third is a control parameter for the kernel function. The value of the control parameter must be set through the <code>kernel_param</code> argument of <code>svm_classification</code>, <code>svm_regression</code>, and <code>svm_novelty_detection</code>.</p>
<p>Currently, three kernel functions have been implemented: linear or dot product (<a class="el" href="online__sv_8sql__in.html#ab53cac5790dafd7230359e08c72af4f1">svm_dot</a>), polynomial (<a class="el" href="online__sv_8sql__in.html#a1ac76fdf9623e0a4db47665f2a80be90">svm_polynomial</a>) and Gaussian (<a class="el" href="online__sv_8sql__in.html#a9f2a96e1a241ecc66386a78b110777d3">svm_gaussian</a>) kernels. Note that the dot product kernel actually requires only two FLOAT8[] parameters. To comply with the requirements for the kernel function, we provide an overloaded version of <code>svm_dot</code> that accepts two FLOAT8[] and one FLOAT8 and returns a FLOAT8; the FLOAT8 parameter is simply a placeholder and is ignored.</p>
<p>With the HAWQ database, only the above pre-built kernel functions can be used. With the Greenplum and PostgreSQL databases, one can use any user-defined function as long as it conforms to the requirements for the kernel function.</p>
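<p>For reference, the three kernel shapes can be sketched in Python as follows. The exact parameterizations of MADlib's <code>svm_polynomial</code> and <code>svm_gaussian</code> may differ from these common textbook forms, so treat this as an illustration of the required (vector, vector, parameter) signature rather than a specification:</p>

```python
import numpy as np

# Illustrative versions of the three kernel shapes. Each takes two vectors
# and one control parameter, matching the required signature
# FLOAT8 kernel_function(FLOAT8[], FLOAT8[], FLOAT8).
def dot_kernel(x, y, param):
    # Linear kernel; the control parameter is a placeholder and is ignored.
    return float(np.dot(x, y))

def polynomial_kernel(x, y, degree):
    # Homogeneous polynomial kernel: <x, y>^degree
    return float(np.dot(x, y) ** degree)

def gaussian_kernel(x, y, gamma):
    # Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)
    diff = np.asarray(x) - np.asarray(y)
    return float(np.exp(-gamma * np.dot(diff, diff)))
```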
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<p>As a general first step, prepare and populate an input table/view with the following structure: </p><pre class="example">
CREATE TABLE/VIEW my_schema.my_input_table(
id INT, -- point ID
ind FLOAT8[], -- data point
label FLOAT8 -- label of data point
);
</pre><p> The label field is not required for novelty detection.</p>
<p><b>Example usage for regression</b>:</p><ol type="1">
<li>Randomly generate 1000 5-dimensional data points labelled by the simple target function <pre class="example">
t(x) = if x[5] = 10 then 50 else if x[5] = -10 then 50 else 0;
</pre> and store that in the my_schema.my_train_data table as follows: <pre class="example">
SELECT madlib.svm_generate_reg_data(
'my_schema.my_train_data', 1000, 5);
</pre></li>
<li>Learn a regression model and store the resultant model in the table 'myexp1'. <pre class="example">
SELECT madlib.svm_regression('my_schema.my_train_data',
'myexp1', false, 'madlib.svm_dot');
</pre></li>
<li>To learn multiple support vector models, we replace the learning step above by <pre class="example">
SELECT madlib.svm_regression('my_schema.my_train_data',
'myexp2', true, 'madlib.svm_dot');
</pre></li>
<li>We can also predict the labels of data points stored in a table. For example, we can execute the following: <pre class="example">
-- prepare test data
CREATE TABLE madlib.svm_reg_test AS
SELECT id, ind
FROM my_schema.my_train_data
LIMIT 20;
-- prediction using a single model
SELECT madlib.svm_predict_batch('madlib.svm_reg_test', 'ind', 'id',
'myexp1', 'madlib.svm_reg_output1', false);
SELECT * FROM madlib.svm_reg_output1;
-- prediction using multiple models
SELECT madlib.svm_predict_batch('madlib.svm_reg_test', 'ind', 'id',
'myexp2', 'madlib.svm_reg_output2', true);
SELECT * FROM madlib.svm_reg_output2;
</pre></li>
</ol>
<p><b>Example usage for classification:</b></p><ol type="1">
<li>Randomly generate training and testing data labelled by the simple target function. <pre class="example">
t(x) = if x[1] &gt; 0 and x[2] &lt; 0 then 1 else -1;
</pre> and store that in tables as follows: <pre class="example">
SELECT madlib.svm_generate_cls_data(
'my_schema.my_train_data', 2000, 5);
SELECT madlib.svm_generate_cls_data(
'my_schema.my_test_data', 3000, 5);
</pre></li>
<li>Learn a classification model and store the resultant model in the table 'myexpc'. <pre class="example">
SELECT madlib.svm_classification('my_schema.my_train_data',
'myexpc', false, 'madlib.svm_dot');
</pre></li>
<li>Start using the model to predict the labels of testing data points. <pre class="example">
SELECT madlib.svm_predict_batch('my_schema.my_test_data', 'ind', 'id',
'myexpc', 'my_schema.svm_cls_output1', false);
</pre></li>
<li>To learn multiple support vector models, replace the model-building and prediction steps above. <pre class="example">
-- training
SELECT madlib.svm_classification('my_schema.my_train_data',
'myexpc', true, 'madlib.svm_dot');
-- predicting
SELECT madlib.svm_predict_batch('my_schema.my_test_data', 'ind', 'id',
'myexpc', 'my_schema.svm_cls_output1', true);
</pre></li>
<li>To learn a linear support vector model using IGD [3], replace the model-building and prediction steps. <pre class="example">
-- training
SELECT madlib.lsvm_classification(
'my_schema.my_train_data', 'my_lsvm');
-- predicting
SELECT madlib.lsvm_predict_batch('my_schema.my_test_data',
'ind', 'id', 'my_lsvm', 'my_lsvm_predict');
</pre></li>
</ol>
<p><b>Example usage for novelty detection:</b></p>
<ol type="1">
<li>Randomly generate 100 2-dimensional data points (the normal cases) and store them in the my_schema.my_train_data table. <pre class="example">
SELECT madlib.svm_generate_nd_data(
'my_schema.my_train_data', 100, 2);
</pre></li>
<li>Learning and predicting using a single novelty detection model: <pre class="example">
SELECT madlib.svm_novelty_detection( 'my_schema.my_train_data',
'myexpnd',
false,
'madlib.svm_dot'
);
SELECT * FROM myexpnd;
</pre></li>
<li>Learning and predicting using multiple models can be done as follows: <pre class="example">
SELECT madlib.svm_novelty_detection( 'my_schema.my_train_data',
'myexpnd',
true,
'madlib.svm_dot'
);
SELECT * FROM myexpnd;
</pre></li>
</ol>
<p><b>Model cleanup:</b> To drop all tables pertaining to the model, use: </p><pre class="example">
SELECT svm_drop_model('model_table');
</pre><p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>[1] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson: <em>Online Learning with Kernels</em>, IEEE Transactions on Signal Processing, 52(8), 2165-2176, 2004.</p>
<p>[2] Bernhard Scholkopf and Alexander J. Smola: <em>Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond</em>, MIT Press, 2002.</p>
<p>[3] X. Feng, A. Kumar, B. Recht, and C. R&eacute;: <em>Towards a unified architecture for in-RDBMS analytics</em>, In SIGMOD Conference, 2012.</p>
<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd></dd></dl>
<p>File <a class="el" href="online__sv_8sql__in.html" title="SQL functions for support vector machines. ">online_sv.sql_in</a> and <a class="el" href="lsvm_8sql__in.html" title="SQL functions for linear support vector machines. ">lsvm.sql_in</a> documenting the SQL functions.</p>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Mon Jul 27 2015 20:37:45 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.10 </li>
</ul>
</div>
</body>
</html>