| <!-- HTML header for doxygen 1.8.4--> |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| <html xmlns="http://www.w3.org/1999/xhtml"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/> |
| <meta http-equiv="X-UA-Compatible" content="IE=9"/> |
| <meta name="generator" content="Doxygen 1.8.4"/> |
| <meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/> |
| <title>MADlib: Clustered Variance</title> |
| <link href="tabs.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="jquery.js"></script> |
| <script type="text/javascript" src="dynsections.js"></script> |
| <link href="navtree.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="resize.js"></script> |
| <script type="text/javascript" src="navtree.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(initResizable); |
| $(window).load(resizeHeight); |
| </script> |
| <link href="search/search.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="search/search.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(function() { searchBox.OnSelectItem(0); }); |
| </script> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"], |
| jax: ["input/TeX","output/HTML-CSS"], |
| }); |
| </script><script src="../mathjax/MathJax.js"></script> |
| <link href="doxygen.css" rel="stylesheet" type="text/css" /> |
| <link href="madlib_extra.css" rel="stylesheet" type="text/css"/> |
| </head> |
| <body> |
| <div id="top"><!-- do not remove this div, it is closed by doxygen! --> |
| <div id="titlearea"> |
| <table cellspacing="0" cellpadding="0"> |
| <tbody> |
| <tr style="height: 56px;"> |
| <td style="padding-left: 0.5em;"> |
| <div id="projectname">MADlib |
|  <span id="projectnumber">1.3</span> <span style="font-size:10pt; font-style:italic"><a href="../latest/./group__grp__clustered__errors.html"> A newer version is available</a></span> |
| </div> |
| <div id="projectbrief">User Documentation</div> |
| </td> |
| <td> <div id="MSearchBox" class="MSearchBoxInactive"> |
| <span class="left"> |
| <img id="MSearchSelect" src="search/mag_sel.png" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| alt=""/> |
| <input type="text" id="MSearchField" value="Search" accesskey="S" |
| onfocus="searchBox.OnSearchFieldFocus(true)" |
| onblur="searchBox.OnSearchFieldFocus(false)" |
| onkeyup="searchBox.OnSearchFieldChange(event)"/> |
| </span><span class="right"> |
| <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a> |
| </span> |
| </div> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <!-- end header part --> |
| <!-- Generated by Doxygen 1.8.4 --> |
| <script type="text/javascript"> |
| var searchBox = new SearchBox("searchBox", "search",false,'Search'); |
| </script> |
| </div><!-- top --> |
| <div id="side-nav" class="ui-resizable side-nav-resizable"> |
| <div id="nav-tree"> |
| <div id="nav-tree-contents"> |
| <div id="nav-sync" class="sync"></div> |
| </div> |
| </div> |
| <div id="splitbar" style="-moz-user-select:none;" |
| class="ui-resizable-handle"> |
| </div> |
| </div> |
| <script type="text/javascript"> |
| $(document).ready(function(){initNavTree('group__grp__clustered__errors.html','');}); |
| </script> |
| <div id="doc-content"> |
| <!-- window showing the filter options --> |
| <div id="MSearchSelectWindow" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| onkeydown="return searchBox.OnSearchSelectKey(event)"> |
| <a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(0)"><span class="SelectionMark"> </span>All</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(1)"><span class="SelectionMark"> </span>Files</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(2)"><span class="SelectionMark"> </span>Functions</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(3)"><span class="SelectionMark"> </span>Variables</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(4)"><span class="SelectionMark"> </span>Groups</a></div> |
| |
| <!-- iframe showing the search results (closed by default) --> |
| <div id="MSearchResultsWindow"> |
| <iframe src="javascript:void(0)" frameborder="0" |
| name="MSearchResults" id="MSearchResults"> |
| </iframe> |
| </div> |
| |
| <div class="header"> |
| <div class="headertitle"> |
| <div class="title">Clustered Variance<div class="ingroups"><a class="el" href="group__grp__glm.html">Generalized Linear Models</a></div></div> </div> |
| </div><!--header--> |
| <div class="contents"> |
| <div class="toc"><b>Contents</b> |
| <ul> |
| <li> |
| <a href="#train_linregr">Clustered Variance Linear Regression Training Function</a> </li> |
| <li> |
| <a href="#train_logregr">Clustered Variance Logistic Regression Training Function</a> </li> |
| <li> |
| <a href="#train_mlogregr">Clustered Variance Multinomial Logistic Regression Training Function</a> </li> |
| <li> |
| <a href="#examples">Examples</a> </li> |
| <li> |
| <a href="#notes">Notes</a> </li> |
| <li> |
| <a href="#background">Technical Background</a> </li> |
| <li> |
| <a href="#literature">Literature</a> </li> |
| <li> |
| <a href="#related">Related Topics</a> </li> |
| </ul> |
| </div><p>The Clustered Variance module adjusts standard errors for clustering, that is, for correlation among observations within the same group. Under the IID assumption, for example, replicating a dataset 100 times shrinks the estimated standard errors even though no new information has been added, so the apparent gain in precision is spurious. Similarly, in economics of education research it is reasonable to expect that the error terms for children in the same class are not independent. Clustering standard errors can correct for this.</p> |
| <p>The MADlib Clustered Variance module includes functions to compute clustered standard errors for linear, logistic, and multinomial logistic regression problems.</p> |
| <p><a class="anchor" id="train_linregr"></a></p> |
| <dl class="section user"><dt>Clustered Variance Linear Regression Training Function</dt><dd></dd></dl> |
| <p>The clustered variance linear regression training function has the following syntax. </p> |
| <pre class="syntax"> |
| clustered_variance_linregr ( tbl_data, |
| tbl_output, |
| depvar, |
| indvar, |
| clustervar, |
| groupingvar |
| ) |
| </pre><p> <b>Arguments</b> </p> |
| <dl class="arglist"> |
| <dt>tbl_data </dt> |
| <dd>TEXT. The name of the table containing the input data. </dd> |
| <dt>tbl_output </dt> |
| <dd>TEXT. The name of the table to store the regression model. </dd> |
| <dt>depvar </dt> |
| <dd>TEXT. An expression to evaluate for the dependent variable. </dd> |
| <dt>indvar </dt> |
| <dd>TEXT. An expression to evaluate for the independent variables. </dd> |
| <dt>clustervar </dt> |
| <dd>TEXT. A comma-separated list of the columns to use as cluster variables. </dd> |
| <dt>groupingvar (optional) </dt> |
| <dd>TEXT, default: NULL. <em>Not currently implemented. Any non-NULL value is ignored.</em> An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single result model is generated. </dd> |
| </dl> |
| <p><a class="anchor" id="train_logregr"></a></p> |
| <dl class="section user"><dt>Clustered Variance Logistic Regression Training Function</dt><dd></dd></dl> |
| <p>The clustered variance logistic regression training function has the following syntax. </p> |
| <pre class="syntax"> |
| clustered_variance_logregr( tbl_data, |
| tbl_output, |
| depvar, |
| indvar, |
| clustervar, |
| groupingvar, |
| max_iter, |
| optimizer, |
| tolerance, |
| verbose |
| ) |
| </pre><p> <b>Arguments</b> </p> |
| <dl class="arglist"> |
| <dt>tbl_data </dt> |
| <dd>TEXT. The name of the table containing the input data. </dd> |
| <dt>tbl_output </dt> |
| <dd>TEXT. The name of the table to store the regression model. </dd> |
| <dt>depvar </dt> |
| <dd>TEXT. An expression to evaluate for the dependent variable. </dd> |
| <dt>indvar </dt> |
| <dd>TEXT. An expression to evaluate for the independent variables. </dd> |
| <dt>clustervar </dt> |
| <dd>TEXT. A comma-separated list of columns to use as cluster variables. </dd> |
| <dt>groupingvar (optional) </dt> |
| <dd>TEXT, default: NULL. <em>Not yet implemented. Any non-NULL values are ignored.</em> An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single result model is generated. </dd> |
| <dt>max_iter (optional) </dt> |
| <dd>INTEGER, default: 20. The maximum number of iterations that are allowed. </dd> |
| <dt>optimizer (optional) </dt> |
| <dd>TEXT, default: 'irls'. The name of the optimizer to use: <ul> |
| <li> |
| 'newton' or 'irls': Iteratively reweighted least squares </li> |
| <li> |
| 'cg': conjugate gradient </li> |
| <li> |
| 'igd': incremental gradient descent. </li> |
| </ul> |
| </dd> |
| <dt>tolerance (optional) </dt> |
| <dd>FLOAT8, default: 0.0001. The difference between log-likelihood values in successive iterations that indicates convergence. A value of zero disables the convergence criterion, so that execution stops only after <em>max_iter</em> iterations have completed. </dd> |
| <dt>verbose (optional) </dt> |
| <dd>BOOLEAN, default: FALSE. Provides verbose output of the results of training. </dd> |
| </dl> |
| <p><a class="anchor" id="train_mlogregr"></a></p> |
| <dl class="section user"><dt>Clustered Variance Multinomial Logistic Regression Training Function</dt><dd></dd></dl> |
| <p>The clustered variance multinomial logistic regression training function has the following syntax. </p> |
| <pre class="syntax"> |
| clustered_variance_mlogregr( tbl_data, |
| tbl_output, |
| depvar, |
| indvar, |
| clustervar, |
| ref_category, |
| groupingvar, |
| max_iter, |
| optimizer, |
| tolerance, |
| verbose |
| ) |
| </pre><p> <b>Arguments</b> </p> |
| <dl class="arglist"> |
| <dt>tbl_data </dt> |
| <dd>TEXT. The name of the table containing the input data. </dd> |
| <dt>tbl_output </dt> |
| <dd>TEXT. The name of the table to store the regression model. </dd> |
| <dt>depvar </dt> |
| <dd>TEXT. An expression to evaluate for the dependent variable. </dd> |
| <dt>indvar </dt> |
| <dd>TEXT. An expression to evaluate for the independent variables. </dd> |
| <dt>clustervar </dt> |
| <dd>TEXT. A comma-separated list of columns to use as cluster variables. </dd> |
| <dt>ref_category </dt> |
| <dd>INTEGER. Reference category in the range [0, num_category). </dd> |
| <dt>groupingvar (optional) </dt> |
| <dd>TEXT, default: NULL. <em>Not yet implemented. Any non-NULL values are ignored.</em> A comma-separated list of columns to use as grouping variables. </dd> |
| <dt>max_iter (optional) </dt> |
| <dd>INTEGER, default: 20. Maximum iteration number for logistic regression. </dd> |
| <dt>optimizer (optional) </dt> |
| <dd>TEXT, default: 'irls'. Optimization method to use for logistic regression. </dd> |
| <dt>tolerance (optional) </dt> |
| <dd>FLOAT8, default: 0.0001. The computation ends when the difference of likelihoods in two consecutive iterations is smaller than this value. </dd> |
| <dt>verbose (optional) </dt> |
| <dd>BOOLEAN, default: FALSE. If TRUE, detailed information is printed when computing logistic regression. </dd> |
| </dl> |
| <p><a class="anchor" id="examples"></a></p> |
| <dl class="section user"><dt>Examples</dt><dd><ol type="1"> |
| <li>View online help for the clustered variance linear regression function. <pre class="example"> |
| SELECT madlib.clustered_variance_linregr(); |
| </pre></li> |
| <li>Run the linear regression function and view the results. <pre class="example"> |
| DROP TABLE IF EXISTS tbl_output; |
| SELECT madlib.clustered_variance_linregr( 'abalone', |
| 'tbl_output', |
| 'rings', |
| 'ARRAY[1, diameter, length, width]', |
| 'sex', |
| NULL |
| ); |
| SELECT * FROM tbl_output; |
| </pre></li> |
| <li>View online help for the clustered variance logistic regression function. <pre class="example"> |
| SELECT madlib.clustered_variance_logregr(); |
| </pre></li> |
| <li>Run the logistic regression function and view the results. <pre class="example"> |
| DROP TABLE IF EXISTS tbl_output; |
| SELECT madlib.clustered_variance_logregr( 'abalone', |
| 'tbl_output', |
| 'rings &lt; 10', |
| 'ARRAY[1, diameter, length, width]', |
| 'sex' |
| ); |
| SELECT * FROM tbl_output; |
| </pre></li> |
| <li>Run the multinomial logistic regression and view the results. <pre class="example"> |
| DROP TABLE IF EXISTS tbl_output; |
| SELECT madlib.clustered_variance_mlogregr( 'abalone', |
| 'tbl_output', |
| 'CASE WHEN rings &lt; 10 THEN 1 ELSE 0 END', |
| 'ARRAY[1, diameter, length, width]', |
| 'sex', |
| 0 |
| ); |
| SELECT * FROM tbl_output; |
| </pre></li> |
| </ol> |
| </dd></dl> |
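| <p>The optional arguments of the logistic regression function can also be passed positionally, in the order given in the syntax above. The following is only a sketch: it assumes the same <em>abalone</em> table as the examples above, and the values chosen for <em>max_iter</em>, <em>optimizer</em>, <em>tolerance</em>, and <em>verbose</em> are illustrative rather than recommended settings.</p> |
| <pre class="example"> |
| -- Assumes the 'abalone' table used in the examples above. |
| DROP TABLE IF EXISTS tbl_output; |
| SELECT madlib.clustered_variance_logregr( 'abalone', -- tbl_data |
| 'tbl_output', -- tbl_output |
| 'rings &lt; 10', -- depvar |
| 'ARRAY[1, diameter, length, width]', -- indvar, with explicit intercept |
| 'sex', -- clustervar |
| NULL, -- groupingvar (not yet implemented) |
| 50, -- max_iter (illustrative value) |
| 'cg', -- optimizer: conjugate gradient |
| 0.00001, -- tolerance (illustrative value) |
| TRUE -- verbose |
| ); |
| SELECT * FROM tbl_output; |
| </pre> |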
| <p><a class="anchor" id="notes"></a></p> |
| <dl class="section user"><dt>Notes</dt><dd><ul> |
| <li>The intercept term is not added automatically. Include it explicitly in the independent variable expression, for example as the constant 1 in <em>ARRAY[1, diameter, length, width]</em>. A NULL value for <em>groupingvar</em> means that no grouping is used in the calculation.</li> |
| </ul> |
| </dd></dl> |
| <p><a class="anchor" id="background"></a></p> |
| <dl class="section user"><dt>Technical Background</dt><dd></dd></dl> |
| <p>Assume that the data can be separated into clusters, indexed by \(m\). Usually this can be done by grouping the data table according to one or more columns.</p> |
| <p>The estimator has a similar form to the usual sandwich estimator: </p> |
| <p class="formulaDsp"> |
| \[ S(\vec{c}) = B(\vec{c}) M(\vec{c}) B(\vec{c}) \] |
| </p> |
| <p>The bread part is the same as in the Huber-White sandwich estimator: </p> |
| <p class="formulaDsp"> |
| \begin{eqnarray} B(\vec{c}) & = & \left(-\sum_{i=1}^{n} H(y_i, \vec{x}_i, \vec{c})\right)^{-1}\\ & = & \left(-\sum_{i=1}^{n}\frac{\partial^2 l(y_i, \vec{x}_i, \vec{c})}{\partial c_\alpha \partial c_\beta}\right)^{-1} \end{eqnarray} |
| </p> |
| <p> where \(H\) is the Hessian matrix, that is, the matrix of second derivatives of the target function </p> |
| <p class="formulaDsp"> |
| \[ L(\vec{c}) = \sum_{i=1}^n l(y_i, \vec{x}_i, \vec{c})\ . \] |
| </p> |
| <p>The meat part is different: </p> |
| <p class="formulaDsp"> |
| \[ M(\vec{c}) = \mathbf{A}^T\mathbf{A} \] |
| </p> |
| <p> where the \(m\)-th row of \(\mathbf{A}\) is </p> |
| <p class="formulaDsp"> |
| \[ A_m = \sum_{i\in G_m}\frac{\partial l(y_i,\vec{x}_i,\vec{c})}{\partial \vec{c}} \] |
| </p> |
| <p> where \(G_m\) is the set of rows that belong to the \(m\)-th cluster.</p> |
| <p>The per-cluster contributions to \(B\) and \(A\) are computed in an aggregate function during a single scan of the data table. These contributions are then combined over all clusters, outside the aggregate function, to form the full \(B\) and \(A\). Finally, the matrix multiplications are performed in a separate function on the master node.</p> |
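| <p>As an illustration of the formulas above, consider the linear regression case. This is a sketch under an assumed per-row target \(l(y_i, \vec{x}_i, \vec{c}) = -\frac{1}{2}(y_i - \vec{x}_i^T\vec{c})^2\); the normalization is an assumption made here for illustration and is not taken from the module, and no finite-sample correction is shown. Writing \(X\) for the matrix whose rows are \(\vec{x}_i^T\), </p> |
| <p class="formulaDsp"> |
| \begin{eqnarray} \frac{\partial l(y_i, \vec{x}_i, \vec{c})}{\partial \vec{c}} & = & (y_i - \vec{x}_i^T\vec{c})\,\vec{x}_i \\ H(y_i, \vec{x}_i, \vec{c}) & = & -\vec{x}_i\vec{x}_i^T \\ B(\vec{c}) & = & \left(\sum_{i=1}^{n}\vec{x}_i\vec{x}_i^T\right)^{-1} = (X^T X)^{-1} \\ A_m & = & \sum_{i\in G_m}(y_i - \vec{x}_i^T\vec{c})\,\vec{x}_i^T \\ S(\vec{c}) & = & (X^T X)^{-1}\left(\sum_{m} A_m^T A_m\right)(X^T X)^{-1} \end{eqnarray} |
| </p> |
| <p>so rows in the same cluster contribute to the meat through their summed score, rather than row by row.</p> |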
| <p>Multinomial logistic regression is computed before the multinomial clustered variance calculation; it uses a default reference category of zero, and the regression coefficients are included in the output table. The coefficients in the output are in the same order as in the multinomial logistic regression function, which is as follows. For a problem with \( K \) independent variables \( (1, ..., K) \) and \( J \) categories \( (0, ..., J-1) \), let \( {m_{k,j}} \) denote the coefficient for independent variable \( k \) and category \( j \). The output is \( \{m_{k_1, j_0}, m_{k_1, j_1}, \ldots, m_{k_1, j_{J-1}}, m_{k_2, j_0}, m_{k_2, j_1}, \ldots, m_{k_K, j_{J-1}}\} \). This order is NOT CONSISTENT with the one used by the multinomial regression marginal effects function <em>marginal_mlogregr</em>. The inconsistency is deliberate, because the interfaces of all multinomial regressions (robust, clustered, ...) will be moved to match that used in marginal.</p> |
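| <p>As a small worked instance of this ordering, take \(K = 2\) and \(J = 3\) (values chosen here purely for illustration). The coefficients are then listed as </p> |
| <p class="formulaDsp"> |
| \[ m_{1,0},\ m_{1,1},\ m_{1,2},\ m_{2,0},\ m_{2,1},\ m_{2,2} \] |
| </p> |
| <p>that is, all categories for the first variable, followed by all categories for the second.</p> |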
| <p><a class="anchor" id="literature"></a></p> |
| <dl class="section user"><dt>Literature</dt><dd></dd></dl> |
| <p>[1] Standard, Robust, and Clustered Standard Errors Computed in R, <a href="http://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/">http://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/</a></p> |
| <p><a class="anchor" id="related"></a></p> |
| <dl class="section user"><dt>Related Topics</dt><dd>File <a class="el" href="clustered__variance_8sql__in.html">clustered_variance.sql_in</a> documenting the SQL function </dd></dl> |
| </div><!-- contents --> |
| </div><!-- doc-content --> |
| <!-- start footer part --> |
| <div id="nav-path" class="navpath"><!-- id is needed for treeview function! --> |
| <ul> |
| <li class="footer">Generated on Thu Jan 9 2014 20:27:17 for MADlib by |
| <a href="http://www.doxygen.org/index.html"> |
| <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.4 </li> |
| </ul> |
| </div> |
| </body> |
| </html> |