| <!-- HTML header for doxygen 1.8.4--> |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| <html xmlns="http://www.w3.org/1999/xhtml"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/> |
| <meta http-equiv="X-UA-Compatible" content="IE=9"/> |
| <meta name="generator" content="Doxygen 1.8.4"/> |
| <meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/> |
| <title>MADlib: Huber White Variance</title> |
| <link href="tabs.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="jquery.js"></script> |
| <script type="text/javascript" src="dynsections.js"></script> |
| <link href="navtree.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="resize.js"></script> |
| <script type="text/javascript" src="navtree.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(initResizable); |
| $(window).load(resizeHeight); |
| </script> |
| <link href="search/search.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="search/search.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(function() { searchBox.OnSelectItem(0); }); |
| </script> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"], |
| jax: ["input/TeX","output/HTML-CSS"], |
| }); |
| </script><script src="../mathjax/MathJax.js"></script> |
| <link href="doxygen.css" rel="stylesheet" type="text/css" /> |
| <link href="madlib_extra.css" rel="stylesheet" type="text/css"/> |
| <!-- google analytics --> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| ga('create', 'UA-45382226-1', 'auto'); |
| ga('send', 'pageview'); |
| </script> |
| </head> |
| <body> |
| <div id="top"><!-- do not remove this div, it is closed by doxygen! --> |
| <div id="titlearea"> |
| <table cellspacing="0" cellpadding="0"> |
| <tbody> |
| <tr style="height: 56px;"> |
| <td style="padding-left: 0.5em;"> |
| <div id="projectname">MADlib |
|  <span id="projectnumber">1.3</span> <span style="font-size:10pt; font-style:italic"><a href="../latest/./group__grp__robust.html"> A newer version is available</a></span> |
| </div> |
| <div id="projectbrief">User Documentation</div> |
| </td> |
| <td> <div id="MSearchBox" class="MSearchBoxInactive"> |
| <span class="left"> |
| <img id="MSearchSelect" src="search/mag_sel.png" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| alt=""/> |
| <input type="text" id="MSearchField" value="Search" accesskey="S" |
| onfocus="searchBox.OnSearchFieldFocus(true)" |
| onblur="searchBox.OnSearchFieldFocus(false)" |
| onkeyup="searchBox.OnSearchFieldChange(event)"/> |
| </span><span class="right"> |
| <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a> |
| </span> |
| </div> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <!-- end header part --> |
| <!-- Generated by Doxygen 1.8.4 --> |
| <script type="text/javascript"> |
| var searchBox = new SearchBox("searchBox", "search",false,'Search'); |
| </script> |
| </div><!-- top --> |
| <div id="side-nav" class="ui-resizable side-nav-resizable"> |
| <div id="nav-tree"> |
| <div id="nav-tree-contents"> |
| <div id="nav-sync" class="sync"></div> |
| </div> |
| </div> |
| <div id="splitbar" style="-moz-user-select:none;" |
| class="ui-resizable-handle"> |
| </div> |
| </div> |
| <script type="text/javascript"> |
| $(document).ready(function(){initNavTree('group__grp__robust.html','');}); |
| </script> |
| <div id="doc-content"> |
| <!-- window showing the filter options --> |
| <div id="MSearchSelectWindow" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| onkeydown="return searchBox.OnSearchSelectKey(event)"> |
| <a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(0)"><span class="SelectionMark"> </span>All</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(1)"><span class="SelectionMark"> </span>Files</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(2)"><span class="SelectionMark"> </span>Functions</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(3)"><span class="SelectionMark"> </span>Variables</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(4)"><span class="SelectionMark"> </span>Groups</a></div> |
| |
| <!-- iframe showing the search results (closed by default) --> |
| <div id="MSearchResultsWindow"> |
| <iframe src="javascript:void(0)" frameborder="0" |
| name="MSearchResults" id="MSearchResults"> |
| </iframe> |
| </div> |
| |
| <div class="header"> |
| <div class="headertitle"> |
| <div class="title">Huber White Variance<div class="ingroups"><a class="el" href="group__grp__glm.html">Generalized Linear Models</a></div></div> </div> |
| </div><!--header--> |
| <div class="contents"> |
| <div class="toc"><b>Contents</b> |
| <ul> |
| <li class="level1"> |
| <a href="#train_linregr">Robust Linear Regression Training Function</a> </li> |
| <li class="level1"> |
| <a href="#train_logregr">Robust Logistic Regression Training Function</a> </li> |
| <li class="level1"> |
| <a href="#train_mlogregr">Robust Multinomial Logistic Regression Training Function</a> </li> |
| <li class="level1"> |
| <a href="#examples">Examples</a> </li> |
| <li class="level1"> |
| <a href="#background">Technical Background</a> </li> |
| <li class="level1"> |
| <a href="#literature">Literature</a> </li> |
| <li class="level1"> |
| <a href="#related">Related Topics</a> </li> |
| </ul> |
| </div><p>The functions in this module calculate robust variance (Huber-White estimates) for linear regression, logistic regression, and multinomial logistic regression. They are useful for estimating variances in a dataset with potentially noisy outliers. The Huber-White estimator implemented here is identical to the "HC0" sandwich estimator in the R package "sandwich".</p> |
| <p>The interfaces for robust linear, logistic, and multinomial logistic regression are similar. Each regression type has its own training function. The regression results are saved in an output table whose columns differ slightly depending on the regression type.</p> |
| <p><a class="anchor" id="train_linregr"></a></p> |
| <dl class="section user"><dt>Robust Linear Regression Training Function</dt><dd></dd></dl> |
| <p>The <a class="el" href="robust_8sql__in.html#afe2de0edcada1eee39175e75053e701e">robust_variance_linregr()</a> function has the following syntax: </p> |
| <pre class="syntax"> |
| robust_variance_linregr( source_table, |
| out_table, |
| dependent_varname, |
| independent_varname, |
| grouping_cols |
| ) |
| </pre> <dl class="arglist"> |
| <dt>source_table </dt> |
| <dd>VARCHAR. The name of the table containing the training data. The training data is expected to be of the following form: <pre class="fragment"> {TABLE|VIEW} sourceName ( |
| outputTable VARCHAR, |
| regressionType VARCHAR, |
| dependentVariable VARCHAR, |
| independentVariable VARCHAR |
| )</pre> </dd> |
| <dt>out_table </dt> |
| <dd>VARCHAR. Name of the generated table containing the output model. The output table contains the following columns. <table class="output"> |
| <tr> |
| <th>coef </th><td>DOUBLE PRECISION[]. Vector of the coefficients of the regression. </td></tr> |
| <tr> |
| <th>std_err </th><td>DOUBLE PRECISION[]. Vector of the standard error of the coefficients. </td></tr> |
| <tr> |
| <th>t_stats </th><td>DOUBLE PRECISION[]. Vector of the t-stats of the coefficients. </td></tr> |
| <tr> |
| <th>p_values </th><td>DOUBLE PRECISION[]. Vector of the p-values of the coefficients. </td></tr> |
| </table> |
| </dd> |
| <dt>dependent_varname </dt> |
| <dd>VARCHAR. The name of the column containing the dependent variable. </dd> |
| <dt>independent_varname </dt> |
| <dd>VARCHAR. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list. </dd> |
| <dt>grouping_cols (optional) </dt> |
| <dd>VARCHAR, default: NULL. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated. </dd> |
| </dl> |
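| <p>As a minimal sketch of the syntax above, the call below fits one robust linear regression per group. The table <em>houses</em> and its columns <em>price</em>, <em>tax</em>, <em>bath</em>, <em>size</em> and <em>bedroom</em> are hypothetical names chosen for illustration; they are not defined elsewhere in this document.</p> |
| <pre class="example"> |
| -- Hypothetical input table: houses(price, tax, bath, size, bedroom). |
| DROP TABLE IF EXISTS houses_robust; |
| SELECT madlib.robust_variance_linregr( 'houses',                     -- source_table |
| 'houses_robust',              -- out_table |
| 'price',                      -- dependent_varname |
| 'ARRAY[1, tax, bath, size]',  -- independent_varname, with an explicit intercept term |
| 'bedroom'                     -- grouping_cols: one regression per bedroom value |
| ); |
| -- Inspect the coefficients and their robust standard errors. |
| SELECT * FROM houses_robust; |
| </pre> |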
| <p><a class="anchor" id="train_logregr"></a></p> |
| <dl class="section user"><dt>Robust Logistic Regression Training Function</dt><dd></dd></dl> |
| <p>The <a class="el" href="robust_8sql__in.html#a8ae95592359e256816172d4a144f0ada">robust_variance_logregr()</a> function has the following syntax: </p> |
| <pre class="syntax"> |
| robust_variance_logregr( source_table, |
| out_table, |
| dependent_varname, |
| independent_varname, |
| grouping_cols, |
| max_iter, |
| optimizer, |
| tolerance, |
| print_warnings |
| ) |
| </pre> <dl class="arglist"> |
| <dt>source_table </dt> |
| <dd>VARCHAR. The name of the table containing the training data. </dd> |
| <dt>out_table </dt> |
| <dd>VARCHAR. Name of the generated table containing the output model. The output table has the following columns: <table class="output"> |
| <tr> |
| <th>coef </th><td>Vector of the coefficients of the regression. </td></tr> |
| <tr> |
| <th>std_err </th><td>Vector of the standard error of the coefficients. </td></tr> |
| <tr> |
| <th>z_stats </th><td>Vector of the z-stats of the coefficients. </td></tr> |
| <tr> |
| <th>p_values </th><td>Vector of the p-values of the coefficients. </td></tr> |
| </table> |
| </dd> |
| <dt>dependent_varname </dt> |
| <dd>VARCHAR. The name of the column containing the dependent variable. </dd> |
| <dt>independent_varname </dt> |
| <dd>VARCHAR. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list. </dd> |
| <dt>grouping_cols (optional) </dt> |
| <dd>VARCHAR, default: NULL. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated. </dd> |
| <dt>max_iter (optional) </dt> |
| <dd>INTEGER, default: 20. The maximum number of iterations that are allowed. </dd> |
| <dt>optimizer (optional) </dt> |
| <dd>VARCHAR, default: 'fista'. Name of optimizer, either 'fista' or 'igd'. </dd> |
| <dt>tolerance (optional) </dt> |
| <dd>DOUBLE PRECISION, default: 1e-6. The criterion for ending iterations. Both the 'fista' and 'igd' optimizers compute the average difference between the coefficients of two consecutive iterations; the computation stops when that difference is smaller than tolerance or the iteration number is larger than max_iter. </dd> |
| <dt>print_warnings (optional) </dt> |
| <dd>BOOLEAN, default: FALSE. Whether the regression fit should print any warning messages. </dd> |
| </dl> |
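| <p>The following sketch shows how the optional arguments might be supplied. It uses the <em>patients</em> table created in the Examples section and raises max_iter from its default of 20 to 50. It assumes that trailing optional arguments can be omitted and that the remaining ones are passed positionally in the order of the syntax block above; the output table name <em>patients_logregr_iter</em> is hypothetical.</p> |
| <pre class="example"> |
| DROP TABLE IF EXISTS patients_logregr_iter; |
| SELECT madlib.robust_variance_logregr( 'patients',                            -- source_table |
| 'patients_logregr_iter',               -- out_table (hypothetical name) |
| 'second_attack',                       -- dependent_varname |
| 'ARRAY[1, treatment, trait_anxiety]',  -- independent_varname |
| NULL,                                  -- grouping_cols: no grouping |
| 50                                     -- max_iter |
| ); |
| </pre> |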
| <p><a class="anchor" id="train_mlogregr"></a></p> |
| <dl class="section user"><dt>Robust Multinomial Logistic Regression Training Function</dt><dd></dd></dl> |
| <p>The <a class="el" href="robust_8sql__in.html#a111857acdd0f927160d24cbbc9fc1051">robust_variance_mlogregr()</a> function has the following syntax: </p> |
| <pre class="syntax"> |
| robust_variance_mlogregr( source_table, |
| out_table, |
| dependent_varname, |
| independent_varname, |
| ref_category, |
| grouping_cols, |
| max_iter, |
| optimizer, |
| tolerance, |
| print_warnings |
| ) |
| </pre> <dl class="arglist"> |
| <dt>source_table </dt> |
| <dd>VARCHAR. The name of the table containing training data, properly qualified. </dd> |
| <dt>out_table </dt> |
| <dd>VARCHAR. The name of the table where the regression model will be stored. The output table has the following columns: <table class="output"> |
| <tr> |
| <th>ref_category </th><td>The reference category used for modeling. </td></tr> |
| <tr> |
| <th>coef </th><td>Vector of the coefficients of the regression. </td></tr> |
| <tr> |
| <th>std_err </th><td>Vector of the standard error of the coefficients. </td></tr> |
| <tr> |
| <th>z_stats </th><td>Vector of the z-stats of the coefficients. </td></tr> |
| <tr> |
| <th>p_values </th><td>Vector of the p-values of the coefficients. </td></tr> |
| </table> |
| </dd> |
| <dt>dependent_varname </dt> |
| <dd>VARCHAR. The name of the column containing the dependent variable. </dd> |
| <dt>independent_varname </dt> |
| <dd>VARCHAR. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list. The <em>independent_varname</em> can be the name of a column that contains an array of numeric values. It can also be a string with the format 'ARRAY[1, x1, x2, x3]', where <em>x1</em>, <em>x2</em> and <em>x3</em> are each column names. </dd> |
| <dt>ref_category (optional) </dt> |
| <dd>INTEGER, default: 0. The reference category. </dd> |
| <dt>grouping_cols (optional) </dt> |
| <dd>VARCHAR, default: NULL. <em>Not currently implemented. Any non-NULL value is ignored.</em> An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated. </dd> |
| <dt>max_iter (optional) </dt> |
| <dd>INTEGER, default: 20. The maximum number of iterations to execute. </dd> |
| <dt>optimizer (optional) </dt> |
| <dd>VARCHAR, default: 'irls'. The optimizer to use in the multinomial logistic regression. </dd> |
| <dt>tolerance (optional) </dt> |
| <dd>DOUBLE PRECISION, default: 0.0001. The tolerance of the multinomial logistic regression optimizer. </dd> |
| <dt>print_warnings (optional) </dt> |
| <dd>BOOLEAN, default: FALSE. <em>Not currently implemented.</em> TRUE if the regression fit should print warning messages. </dd> |
| </dl> |
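| <p>A minimal sketch of a call with this syntax is shown below. It reuses the <em>patients</em> table from the Examples section and treats the binary <em>second_attack</em> column as a two-category response, with category 0 as the reference. It assumes that trailing optional arguments can be omitted; the output table name <em>patients_mlogregr</em> is hypothetical.</p> |
| <pre class="example"> |
| DROP TABLE IF EXISTS patients_mlogregr; |
| SELECT madlib.robust_variance_mlogregr( 'patients',                            -- source_table |
| 'patients_mlogregr',                   -- out_table (hypothetical name) |
| 'second_attack',                       -- dependent_varname |
| 'ARRAY[1, treatment, trait_anxiety]',  -- independent_varname |
| 0                                      -- ref_category |
| ); |
| -- The output table also records the reference category. |
| SELECT ref_category, coef, std_err, z_stats, p_values FROM patients_mlogregr; |
| </pre> |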
| <p><a class="anchor" id="examples"></a></p> |
| <dl class="section user"><dt>Examples</dt><dd><ol type="1"> |
| <li>View online help for the logistic regression training function. <pre class="example"> |
| SELECT madlib.robust_variance_logregr(); |
| </pre></li> |
| <li>Create the training data table. <pre class="example"> |
| DROP TABLE IF EXISTS patients; |
| CREATE TABLE patients (id INTEGER NOT NULL, second_attack INTEGER, |
| treatment INTEGER, trait_anxiety INTEGER); |
| COPY patients FROM STDIN WITH DELIMITER '|'; |
| 1 | 1 | 1 | 70 |
| 3 | 1 | 1 | 50 |
| 5 | 1 | 0 | 40 |
| 7 | 1 | 0 | 75 |
| 9 | 1 | 0 | 70 |
| 11 | 0 | 1 | 65 |
| 13 | 0 | 1 | 45 |
| 15 | 0 | 1 | 40 |
| 17 | 0 | 0 | 55 |
| 19 | 0 | 0 | 50 |
| 2 | 1 | 1 | 80 |
| 4 | 1 | 0 | 60 |
| 6 | 1 | 0 | 65 |
| 8 | 1 | 0 | 80 |
| 10 | 1 | 0 | 60 |
| 12 | 0 | 1 | 50 |
| 14 | 0 | 1 | 35 |
| 16 | 0 | 1 | 50 |
| 18 | 0 | 0 | 45 |
| 20 | 0 | 0 | 60 |
| \. |
| </pre></li> |
| <li>Run the logistic regression training function and compute the robust logistic variance of the regression: <pre class="example"> |
| DROP TABLE IF EXISTS patients_logregr; |
| SELECT madlib.robust_variance_logregr( 'patients', |
| 'patients_logregr', |
| 'second_attack', |
| 'ARRAY[1, treatment, trait_anxiety]' |
| ); |
| </pre></li> |
| <li>View the regression results. <pre class="example"> |
| \x on |
| SELECT * FROM patients_logregr; |
| </pre> Result: <pre class="result"> |
|  -[ RECORD 1 ]------------------------------------------------------- |
| coef | {-6.36346994178179,-1.02410605239327,0.119044916668605} |
| std_err | {3.45872062333648,1.1716192578234,0.0534328864185018} |
| z_stats | {-1.83983346294192,-0.874094587943036,2.22793348156809} |
| p_values | {0.0657926909738889,0.382066744585541,0.0258849510757339} |
| </pre> Alternatively, unnest the arrays in the results for easier reading of output. <pre class="example"> |
| \x off |
| SELECT unnest(array['intercept', 'treatment', 'trait_anxiety' ]) as attribute, |
| unnest(coef) as coefficient, |
| unnest(std_err) as standard_error, |
| unnest(z_stats) as z_stat, |
| unnest(p_values) as pvalue |
| FROM patients_logregr; |
| </pre></li> |
| </ol> |
| </dd></dl> |
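| <p>The robust standard errors can also be turned into approximate confidence intervals. The sketch below assumes the <em>patients_logregr</em> table from the example above and uses the normal approximation that underlies the z-statistics, so 1.96 is the two-sided 95% quantile. The column aliases are chosen for illustration.</p> |
| <pre class="example"> |
| -- Unnest the arrays first, then compute coef +/- 1.96 * std_err per attribute. |
| SELECT attribute, |
| coefficient - 1.96 * standard_error AS ci_lower, |
| coefficient + 1.96 * standard_error AS ci_upper |
| FROM ( |
| SELECT unnest(array['intercept', 'treatment', 'trait_anxiety']) AS attribute, |
| unnest(coef) AS coefficient, |
| unnest(std_err) AS standard_error |
| FROM patients_logregr |
| ) AS unnested; |
| </pre> |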
| <p><a class="anchor" id="background"></a></p> |
| <dl class="section user"><dt>Technical Background</dt><dd></dd></dl> |
| <p>When doing regression analysis, we are sometimes interested in the variance of the computed coefficients \( \boldsymbol c \). While the built-in regression functions provide variance estimates, we may prefer a <em>robust</em> variance estimate.</p> |
| <p>The robust variance calculation can be expressed in sandwich form: </p> |
| <p class="formulaDsp"> |
| \[ S( \boldsymbol c) = B( \boldsymbol c) M( \boldsymbol c) B( \boldsymbol c) \] |
| </p> |
| <p> where \( B( \boldsymbol c)\) and \( M( \boldsymbol c)\) are matrices. The \( B( \boldsymbol c) \) matrix, also known as the bread, is relatively straightforward and can be computed as </p> |
| <p class="formulaDsp"> |
| \[ B( \boldsymbol c) = n\left(\sum_{i=1}^n -H(y_i, x_i, \boldsymbol c) \right)^{-1} \] |
| </p> |
| <p> where \( H \) is the Hessian matrix.</p> |
| <p>The \( M( \boldsymbol c)\) matrix has several variations, each with different robustness properties. The form implemented here is the Huber-White sandwich operator, which takes the form </p> |
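| <p class="formulaDsp"> |
| \[ M_{H} =\frac{1}{n} \sum_{i=1}^n \psi(y_i,x_i, \boldsymbol c)^T \psi(y_i,x_i, \boldsymbol c). \] |
| </p> |
| <p>Here \( \psi \) is the estimating function, that is, the derivative of the objective function (for example, the log-likelihood) for observation \( i \) with respect to the coefficients \( \boldsymbol c \), written as a row vector.</p> |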
| <p>The above method for calculating robust variance (Huber-White estimates) is implemented for linear regression, logistic regression, and multinomial logistic regression. It is useful for estimating variances in a dataset with potentially noisy outliers. The Huber-White estimator implemented here is identical to the "HC0" sandwich estimator in the R package "sandwich".</p> |
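| <p>As a worked illustration (a sketch under stated assumptions, not taken from the MADlib source), consider linear regression with the per-observation objective \( -\frac{1}{2}(y_i - x_i^T \boldsymbol c)^2 \), where \( x_i \) is a column vector and \( r_i = y_i - x_i^T \boldsymbol c \) denotes the residual. Taking \( \psi \) to be the gradient of this objective with respect to \( \boldsymbol c \), written as a row vector, the quantities above become </p> |
| <p class="formulaDsp"> |
| \[ \psi(y_i, x_i, \boldsymbol c) = r_i x_i^T, \qquad H(y_i, x_i, \boldsymbol c) = -x_i x_i^T, \] |
| </p> |
| <p class="formulaDsp"> |
| \[ B( \boldsymbol c) = n\left(\sum_{i=1}^n x_i x_i^T \right)^{-1}, \qquad M_{H} = \frac{1}{n} \sum_{i=1}^n r_i^2 x_i x_i^T, \] |
| </p> |
| <p class="formulaDsp"> |
| \[ S( \boldsymbol c) = n \left(\sum_{i=1}^n x_i x_i^T \right)^{-1} \left(\sum_{i=1}^n r_i^2 x_i x_i^T \right) \left(\sum_{i=1}^n x_i x_i^T \right)^{-1}. \] |
| </p> |
| <p>Under the scaling convention of the R "sandwich" package, the covariance estimate of the coefficients is \( S( \boldsymbol c) / n \), which is the familiar HC0 expression.</p> |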
| <p>When multinomial logistic regression is computed before the multinomial robust regression, it uses a default reference category of zero and the regression coefficients are included in the output table. The regression coefficients in the output are in the same order as in the multinomial logistic regression function, which is described below. For a problem with \( K \) independent variables \( (1, ..., K) \) and \( J \) categories \( (0, ..., J-1) \), let \( {m_{k,j}} \) denote the coefficient for independent variable \( k \) and category \( j \). The output is \( \{m_{k_1, j_0}, m_{k_1, j_1}, \ldots, m_{k_1, j_{J-1}}, m_{k_2, j_0}, m_{k_2, j_1}, \ldots, m_{k_K, j_{J-1}}\} \). This order is NOT CONSISTENT with the multinomial regression marginal effect calculation in the function <em>marginal_mlogregr</em>. This is deliberate because the interfaces of all multinomial regressions (robust, clustered, ...) will be moved to match that used in marginal.</p> |
| <p><a class="anchor" id="literature"></a></p> |
| <dl class="section user"><dt>Literature</dt><dd></dd></dl> |
| <p>[1] vce(cluster) function in STATA: <a href="http://www.stata.com/help.cgi?vce_option">http://www.stata.com/help.cgi?vce_option</a></p> |
| <p>[2] clustered estimators in R: <a href="http://people.su.se/~ma/clustering.pdf">http://people.su.se/~ma/clustering.pdf</a></p> |
| <p>[3] Achim Zeileis: Object-oriented Computation of Sandwich Estimators. Research Report Series / Department of Statistics and Mathematics, 37. Department of Statistics and Mathematics, WU Vienna University of Economics and Business, Vienna. <a href="http://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf">http://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf</a></p> |
| <p><a class="anchor" id="related"></a></p> |
| <dl class="section user"><dt>Related Topics</dt><dd>File <a class="el" href="robust_8sql__in.html" title="SQL functions for linear regression. ">robust.sql_in</a> documenting the SQL functions </dd></dl> |
| </div><!-- contents --> |
| </div><!-- doc-content --> |
| <!-- start footer part --> |
| <div id="nav-path" class="navpath"><!-- id is needed for treeview function! --> |
| <ul> |
| <li class="footer">Generated on Thu Jan 9 2014 20:27:17 for MADlib by |
| <a href="http://www.doxygen.org/index.html"> |
| <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.4 </li> |
| </ul> |
| </div> |
| </body> |
| </html> |