<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<title>MADlib: Cross Validation</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { searchBox.OnSelectItem(0); });
</script>
<script src="../mathjax/MathJax.js">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script>
</head>
<body>
<div id="top"><!-- do not remove this div! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td style="padding-left: 0.5em;">
<div id="projectname">MADlib
&#160;<span id="projectnumber">0.6</span> <span style="font-size:10pt; font-style:italic"><a href="../latest/./group__grp__validation.html"> A newer version is available</a></span>
</div>
<div id="projectbrief">User Documentation</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- Generated by Doxygen 1.7.5.1 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<script type="text/javascript" src="dynsections.js"></script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
</div>
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
initNavTree('group__grp__validation.html','');
</script>
<div id="doc-content">
<div class="header">
<div class="headertitle">
<div class="title">Cross Validation</div> </div>
<div class="ingroups"><a class="el" href="group__grp__suplearn.html">Supervised Learning</a></div></div>
<div class="contents">
<div id="dynsection-0" onclick="return toggleVisibility(this)" class="dynheader closed" style="cursor:pointer;">
<img id="dynsection-0-trigger" src="closed.png" alt="+"/> Collaboration diagram for Cross Validation:</div>
<div id="dynsection-0-summary" class="dynsummary" style="display:block;">
</div>
<div id="dynsection-0-content" class="dyncontent" style="display:none;">
<center><table><tr><td><div class="center"><iframe scrolling="no" frameborder="0" src="group__grp__validation.svg" width="336" height="40"><p><b>This browser is not able to show SVG: try Firefox, Chrome, Safari, or Opera instead.</b></p></iframe>
</div>
</td></tr></table></center>
</div>
<dl class="user"><dt><b>About:</b></dt><dd></dd></dl>
<p>Cross-validation, sometimes called rotation estimation, is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.</p>
<p>In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds then can be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.</p>
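<p>As a concrete illustration of the fold assignment (generic SQL only; MADlib's CV function performs its own fold assignment internally, and the table name <em>data_tbl</em> is a placeholder):</p>
<pre>
-- Assign each row of data_tbl to one of 10 roughly equal-size folds at random.
CREATE TEMP TABLE data_folds AS
SELECT *, ntile(10) OVER (ORDER BY random()) AS fold_id
FROM data_tbl;

-- One round of 10-fold CV: fold 1 is the validation set, the rest is training data.
-- SELECT * FROM data_folds WHERE fold_id = 1;   -- validation set
-- SELECT * FROM data_folds WHERE fold_id != 1;  -- training set
</pre>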
<dl class="user"><dt><b>Input:</b></dt><dd></dd></dl>
<p><b>The flexible interface.</b></p>
<p>The input includes the data set, a training function, a prediction function and an error metric function.</p>
<p>The training function takes a data set containing both independent and dependent variables and produces a model, which is stored in an output table.</p>
<p>The prediction function takes the model generated by the training function and a different data set with independent variables in it, and it produces a prediction of the dependent variable based on the model. The prediction is stored in an output table. The prediction function should take the name of a unique ID column of the data table as one of its inputs; otherwise the prediction results cannot be compared with the validation values.</p>
<p>The error metric function takes the prediction made by the prediction function and compares it with the known values of the dependent variable in the data set that was fed into the prediction function. It computes the error metric defined by the function, and the results are stored in an output table.</p>
<p>Other inputs include the output table name, the value of k for the k-fold cross-validation, and how many of the folds the user wants to actually run (for example, the user can choose to run a simple validation instead of a full cross-validation).</p>
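<p>For orientation, a user-written error metric function could have the following shape (a hypothetical sketch, not the implementation of <em>madlib.mse_error</em>; the parameter order mirrors the <em>metric_params</em> array used in the example below, and the column name 'prediction' in the prediction table is an assumption):</p>
<pre>
-- Hypothetical MSE metric: joins the prediction table and the data table on the
-- unique ID column and writes a single error value into a new output table.
CREATE OR REPLACE FUNCTION my_mse_error(
    pred_tbl  VARCHAR,  -- table with (id, prediction) produced by the prediction function
    data_tbl  VARCHAR,  -- validation data table with the known dependent values
    id_col    VARCHAR,  -- name of the unique ID column present in both tables
    dep_col   VARCHAR,  -- name of the dependent-variable column in data_tbl
    error_tbl VARCHAR   -- name of the output table to create
) RETURNS VOID AS $$
BEGIN
    EXECUTE 'CREATE TABLE ' || error_tbl || ' AS
             SELECT avg((p.prediction - d.' || dep_col || ')^2) AS mse
             FROM ' || pred_tbl || ' p
             JOIN ' || data_tbl || ' d ON p.' || id_col || ' = d.' || id_col;
END;
$$ LANGUAGE plpgsql;
</pre>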
<dl class="user"><dt><b>Usage:</b></dt><dd></dd></dl>
<p><b>The flexible interface.</b></p>
<p>In order to choose the optimum value for a parameter of the model, the user needs to provide the training function, the prediction function, the error metric function, the parameter to be explored together with its candidate values, and the data set.</p>
<p>It is preferable for the data set to have a unique ID for each row, which makes it easier to split the data set into a training part and a validation part. The user also needs to tell the cross-validation (CV) function whether this ID value is randomly assigned to each row. If it is not randomly assigned, the CV function automatically generates a random ID for each row.</p>
<p>If the data set has no unique ID for each row, the CV function copies the data set into a temp table and creates a randomly assigned ID column for it. The new table is dropped after the computation is finished. To minimize the copying workload, the user needs to provide the names of the data columns (the independent and dependent variables) that are going to be used in the calculation, and only these columns are copied.</p>
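<p>For reference, a randomly assigned row ID of the kind the CV function generates internally could look like this (generic SQL sketch only; the actual internal implementation may differ, and <em>data_tbl</em>, 'val', 'dep' are placeholder names):</p>
<pre>
-- Copy only the needed columns and attach a random unique ID per row.
CREATE TEMP TABLE data_with_id AS
SELECT row_number() OVER (ORDER BY random()) AS id, val, dep
FROM data_tbl;
</pre>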
<pre>SELECT cross_validation_general(
<em>modelling_func</em>, -- Name of function that trains the model
<em>modelling_params</em>, -- Array of parameters for modelling function
<em>modelling_params_type</em>, -- Type of each parameter for the modelling function
--
<em>param_explored</em>, -- Name of the parameter to explore for its optimum value; the
---- same name must also appear in the array of modelling_params
<em>explore_values</em>, -- Values of this parameter that will be studied
--
<em>predict_func</em>, -- Name of function for prediction
<em>predict_params</em>, -- Array of parameters for prediction function
<em>predict_params_type</em>, -- Type of each parameter for the prediction function
--
<em>metric_func</em>, -- Name of function for measuring errors
<em>metric_params</em>, -- Array of parameters for error metric function
<em>metric_params_type</em>, -- Type of each parameter for the metric function
--
<em>data_tbl</em>, -- Data table which will be split into training and validation parts
<em>data_id</em>, -- Name of the unique ID associated with each row. Provide <em>NULL</em>
---- if there is no such column in the data table
<em>id_is_random</em>, -- Whether the provided ID is randomly assigned to each row
--
<em>validation_result</em>, -- Name of the table that stores the output of the CV function; see
---- the Output section for its format. It is created automatically by the CV function
--
<em>data_cols</em>, -- Names of the data columns to be used. Only relevant when
---- <em>data_id</em> is NULL; otherwise it is ignored.
<em>fold_num</em> -- Value of k, the number of folds. Each validation uses 1/fold_num
---- of the data for validation. Default value: 10.
);</pre><p>Special keywords in parameter arrays of modelling, prediction and metric functions:</p>
<p><em>%data%</em> : The argument position for training/validation data</p>
<p><em>%model%</em> : The argument position for the output/input of modelling/prediction function</p>
<p><em>%id%</em> : The argument position of the unique ID column (provided by the user or generated by the CV function, as mentioned above)</p>
<p><em>%prediction%</em> : The argument position for the output/input of prediction/metric function</p>
<p><em>%error%</em> : The argument position for the output of metric function</p>
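<p>For instance, with <em>predict_params</em> = '{%model%, %data%, %id%, %prediction%}' as in the example below, the CV function substitutes the current model table, the current validation split, the ID column name and an internally generated prediction table name before invoking the prediction function. The substituted call might look roughly like the following (the internal table names are purely illustrative):</p>
<pre>
SELECT madlib.elastic_net_predict('__cv_model_1', '__cv_valid_1', 'id', '__cv_prediction_1');
</pre>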
<p><b>Note</b>: If the parameter <em>explore_values</em> is NULL or has zero length, then the cross-validation function will only run the data folding itself, without exploring any parameter values.</p>
<dl class="user"><dt><b>Output:</b></dt><dd></dd></dl>
<pre> param_explored | average error | standard deviation of error
-------------------------|------------------|--------------------------------
.......
</pre><p><b>Note:</b></p>
<p><em>max_locks_per_transaction</em>, which usually has a default value of 64, limits the number of tables that can be dropped inside a single transaction (here, one call of the CV function). Thus the number of different values of <em>param_explored</em> (that is, the length of the array <em>explore_values</em>) cannot be too large. For 10-fold cross-validation, the limit on length(<em>explore_values</em>) is around 40. If this number is too large, the user might see an "out of shared memory" error because <em>max_locks_per_transaction</em> is used up.</p>
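<p>The current limit can be checked from psql (raising it requires changing <em>max_locks_per_transaction</em> in postgresql.conf and restarting the database server):</p>
<pre>
SHOW max_locks_per_transaction;
</pre>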
<p>One way to work around this limitation is to run the CV function multiple times, with each run covering a different region of the parameter's values, as sketched below.</p>
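<p>For example, the 18 <em>lambda</em> values explored in the example below could be covered by two runs of 9 values each, writing to two different result tables, and the per-run results can then be combined for comparison (sketch; the table names <em>valid_rst_tbl_1</em> and <em>valid_rst_tbl_2</em> are arbitrary):</p>
<pre>
-- Assumes two CV runs identical to the example below except for explore_values
-- (first half vs. second half of the lambda list) and the result table name.
CREATE VIEW valid_rst_all AS
SELECT * FROM valid_rst_tbl_1
UNION ALL
SELECT * FROM valid_rst_tbl_2;
</pre>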
<p>In the future, MADlib will provide cross-validation functions for individual applicable modules, where the calculation can be optimized to avoid dropping tables and hence avoid this <em>max_locks_per_transaction</em> limitation. However, such cross-validation functions need to know the implementation details of the modules in order to perform the optimization, and thus cannot be as flexible as the cross-validation function provided here.</p>
<p>The cross-validation function provided here is very flexible and can work with any algorithm the user wants to cross-validate, including algorithms written by the user. The price of this flexibility is that the algorithms' implementation details cannot be exploited to optimize the calculation, so the <em>max_locks_per_transaction</em> limitation cannot be avoided.</p>
<dl class="user"><dt><b>Examples:</b></dt><dd></dd></dl>
<p>Cross-validation is used with elastic net regression to find the best value of the regularization parameter.</p>
<p>(1) Populate the table 'cvtest' with a 101-dimensional independent variable 'val' and a dependent variable 'dep'.</p>
<p>(2) Run the general CV function </p>
<pre>
select madlib.cross_validation_general (
'madlib.elastic_net_train',
'{%data%, %model%, dep, val, gaussian, 1, lambda, True, Null, fista, "{eta = 2, max_stepsize = 2, use_active_set = t}", Null, 2000, 1e-6}'::varchar[],
'{varchar, varchar, varchar, varchar, varchar, double precision, double precision, boolean, varchar, varchar, varchar[], varchar, integer, double precision}'::varchar[],
--
'lambda',
'{0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26, 0.28, 0.30, 0.32, 0.34, 0.36}'::varchar[],
--
'madlib.elastic_net_predict',
'{%model%, %data%, %id%, %prediction%}'::varchar[],
'{text, text, text, text}'::varchar[],
--
'madlib.mse_error',
'{%prediction%, %data%, %id%, dep, %error%}'::varchar[],
'{varchar, varchar, varchar, varchar, varchar}'::varchar[],
--
'cvtest',
NULL::varchar,
False,
--
'valid_rst_tbl',
'{val, dep}'::varchar[],
10
);</pre>
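<p>(3) Inspect the result. The output table <em>valid_rst_tbl</em> contains one row per explored <em>lambda</em> value, with the average error and its standard deviation in the format shown in the Output section above: </p>
<pre>
select * from valid_rst_tbl order by 1;
</pre>
<p>The <em>lambda</em> value with the smallest average error (judged together with its standard deviation) is the natural choice for the final elastic net fit.</p>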
<dl class="see"><dt><b>See also:</b></dt><dd>File <a class="el" href="cross__validation_8sql__in.html" title="SQL functions for cross validation.">cross_validation.sql_in</a> documenting the SQL functions. </dd></dl>
</div>
</div>
<div id="nav-path" class="navpath">
<ul>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
<a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(0)"><span class="SelectionMark">&#160;</span>All</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(1)"><span class="SelectionMark">&#160;</span>Files</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(2)"><span class="SelectionMark">&#160;</span>Functions</a></div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<li class="footer">Generated on Tue Apr 2 2013 14:57:03 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.7.5.1 </li>
</ul>
</div>
</body>
</html>