blob: 901e860607e7d0566f9956c6a09eaa0a53f4d178 [file] [log] [blame]
<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: Covariance and Correlation</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'madlib.apache.org');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.18.0</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__correlation.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Covariance and Correlation<div class="ingroups"><a class="el" href="group__grp__stats.html">Statistics</a> &raquo; <a class="el" href="group__grp__desc__stats.html">Descriptive Statistics</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> <ul>
<li>
<a href="#usage">Covariance and Correlation Functions</a> </li>
<li>
<a href="#examples">Examples</a> </li>
<li>
<a href="#literature">Literature</a> </li>
<li>
<a href="#related">Related Topics</a> </li>
</ul>
</div><p>A correlation function is the degree and direction of association of two variables&mdash;how well one random variable can be predicted from the other. It is a normalized version of covariance. The Pearson correlation coefficient is used here, which has a value between -1 and 1, where 1 implies total positive linear correlation, 0 means no linear correlation, and -1 means total negative linear correlation.</p>
<p>This function generates an \(N\)x \(N\) cross correlation matrix for pairs of numeric columns in a <em>source_table</em>. It is square symmetrical with the \( (i,j) \)th element equal to the correlation coefficient between the \(i\)th and the \(j\)th variable. The diagonal elements (correlations of variables with themselves) are always equal to 1.0.</p>
<p>We also provide a covariance function which is similar in nature to correlation, and is a measure of the joint variability of two random variables.</p>
<p><a class="anchor" id="usage"></a></p><dl class="section user"><dt>Covariance and Correlation Functions</dt><dd></dd></dl>
<p>The correlation function has the following syntax: </p><pre class="syntax">
correlation( source_table,
output_table,
target_cols,
verbose,
grouping_cols,
n_groups_per_run
)
</pre><p>The covariance function has a similar syntax: </p><pre class="syntax">
covariance( source_table,
output_table,
target_cols,
verbose,
grouping_cols,
n_groups_per_run
)
</pre><dl class="arglist">
<dt>source_table </dt>
<dd><p class="startdd">TEXT. Name of the table containing the input data.</p>
<p class="enddd"></p>
</dd>
<dt>output_table </dt>
<dd><p class="startdd">TEXT. Name of the table containing the cross correlation matrix. The output table has N rows, where N is the number of '<em>target_cols</em>' in the '<em>source_table'</em> for which correlation or covariance is being computed. It has the following columns: </p><table class="output">
<tr>
<th>column_position </th><td>An automatically generated sequential counter indicating the order of the variable in the '<em>output_table</em>'. </td></tr>
<tr>
<th>variable </th><td>Contains the row header for the variables of interest. </td></tr>
<tr>
<th>grouping_cols </th><td>Contains the grouping columns, if any. </td></tr>
<tr>
<th>&lt;...&gt; </th><td>The remainder of the table is the NxN correlation matrix for the pairs of variables of interest. </td></tr>
</table>
<p>The output table is arranged as a lower-triangular matrix with the upper triangle set to NULL and the diagonal elements set to 1.0. To obtain the result from the '<em>output_table</em>' order by '<em>column_position</em>': </p><pre class="example">
SELECT * FROM output_table ORDER BY column_position;
</pre><p>In addition to output table, a summary table named &lt;output_table&gt;_summary is also created, which has the following columns: </p><table class="output">
<tr>
<th>method</th><td>'Correlation' or 'Covariance' </td></tr>
<tr>
<th>source_table</th><td>VARCHAR. Data source table name. </td></tr>
<tr>
<th>output_table</th><td>VARCHAR. Output table name. </td></tr>
<tr>
<th>column_names</th><td>VARCHAR. Column names used for correlation computation, as a comma-separated string. </td></tr>
<tr>
<th>grouping_cols </th><td>Contains the grouping columns, if any. </td></tr>
<tr>
<th>mean_vector</th><td>FLOAT8[]. Mean value of column for variables of interest. </td></tr>
<tr>
<th>total_rows_processed </th><td>BIGINT. Total numbers of rows processed. </td></tr>
</table>
<p class="enddd"></p>
</dd>
<dt>target_cols (optional) </dt>
<dd><p class="startdd">TEXT, default: '*'. A comma-separated list of the columns to correlate. If NULL or <code>'*'</code>, results are produced for all numeric columns.</p>
<p class="enddd"></p>
</dd>
<dt>verbose (optional) </dt>
<dd><p class="startdd">BOOLEAN, default: FALSE. Print verbose information if TRUE.</p>
<p class="enddd"></p>
</dd>
<dt>grouping_cols (optional) </dt>
<dd><p class="startdd">TEXT, default: NULL. A comma-separated list of the columns to group by.</p>
<p class="enddd"></p>
</dd>
<dt>n_groups_per_run (optional) </dt>
<dd>INTEGER, default: 10. Number of groups to process at a time. This parameter is ignored if 'grouping_cols' is not specified. Generally the default value will work fine, but there may be cases (see below) where you will want to experiment with it to reduce execution time and memory usage. <dl class="section note"><dt>Note</dt><dd>This is a lower level parameter that can potentially be used to improve performance, but should be used with caution. It is designed to handle the case where you have a large number of groups. In general, increasing 'n_groups_per_run' means we construct a larger 'UNION ALL' query which uses more memory and may slow down execution if it gets too big. If you have a large number of groups and a smaller data size, there may be benefits to increasing this value. Conversely, decreasing 'n_groups_per_run' means we issue more 'plpy.execute' commands. This increases overhead and can modestly affect the execution time. </dd></dl>
</dd>
</dl>
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<ol type="1">
<li>Create an input dataset. <pre class="example">
DROP TABLE IF EXISTS example_data CASCADE;
CREATE TABLE example_data(
id SERIAL,
outlook TEXT,
temperature FLOAT8,
humidity FLOAT8,
windy TEXT,
class TEXT,
day TEXT
);
INSERT INTO example_data VALUES
(1, 'sunny', 85, 85, 'false', 'Dont Play', 'Mon'),
(2, 'sunny', 80, 90, 'true', 'Dont Play', 'Mon'),
(3, 'overcast', 83, 78, 'false', 'Play', 'Mon'),
(4, 'rain', 70, 96, 'false', 'Play', 'Mon'),
(5, 'rain', 68, 80, 'false', 'Play', 'Mon'),
(6, 'rain', 65, 70, 'true', 'Dont Play', 'Mon'),
(7, 'overcast', 64, 65, 'true', 'Play', 'Mon'),
(8, 'sunny', 72, 95, 'false', 'Dont Play', 'Mon'),
(9, 'sunny', 69, 70, 'false', 'Play', 'Mon'),
(10, 'rain', 75, 80, 'false', 'Play', 'Mon'),
(11, 'sunny', 75, 70, 'true', 'Play', 'Mon'),
(12, 'overcast', 72, 90, 'true', 'Play', 'Mon'),
(13, 'overcast', 81, 75, 'false', 'Play', 'Mon'),
(14, 'rain', 71, 80, 'true', 'Dont Play', 'Mon'),
(15, NULL, 100, 100, 'true', NULL, 'Mon'),
(16, NULL, 110, 100, 'true', NULL, 'Mon'),
(101, 'sunny', 85, 85, 'false', 'Dont Play', 'Tues'),
(102, 'sunny', 80, 90, 'true', 'Dont Play', 'Tues'),
(103, 'overcast', 83, 78, 'false', 'Play', 'Tues'),
(104, 'rain', 70, 96, 'false', 'Play', 'Tues'),
(105, 'rain', 68, 80, 'false', 'Play', 'Tues'),
(106, 'rain', 65, 70, 'true', 'Dont Play', 'Tues'),
(107, 'overcast', 64, 65, 'true', 'Play', 'Tues'),
(108, 'sunny', 72, 95, 'false', 'Dont Play', 'Tues'),
(109, 'sunny', 69, 70, 'false', 'Play', 'Tues'),
(110, 'rain', 75, 80, 'false', 'Play', 'Tues'),
(111, 'sunny', 75, 70, 'true', 'Play', 'Tues'),
(112, 'overcast', 72, 90, 'true', 'Play', 'Tues'),
(113, 'overcast', 81, 75, 'false', 'Play', 'Tues'),
(114, 'rain', 71, 80, 'true', 'Dont Play', 'Tues'),
(115, NULL, 100, 100, 'true', NULL, 'Tues'),
(116, NULL, 110, 100, 'true', NULL, 'Tues'),
(201, 'sunny', 85, 85, 'false', 'Dont Play', 'Wed'),
(202, 'sunny', 80, 90, 'true', 'Dont Play', 'Wed'),
(203, 'overcast', 83, 78, 'false', 'Play', 'Wed'),
(204, 'rain', 70, 96, 'false', 'Play', 'Wed'),
(205, 'rain', 68, 80, 'false', 'Play', 'Wed'),
(206, 'rain', 65, 70, 'true', 'Dont Play', 'Wed'),
(207, 'overcast', 64, 65, 'true', 'Play', 'Wed'),
(208, 'sunny', 7, 95, 'false', 'Dont Play', 'Wed'),
(209, 'sunny', 6, 70, 'false', 'Play', 'Wed'),
(210, 'rain', 7, 80, 'false', 'Play', 'Wed'),
(211, 'sunny', 75, 70, 'true', 'Play', 'Wed'),
(212, 'overcast', 72, 90, 'true', 'Play', 'Wed'),
(213, 'overcast', 81, 75, 'false', 'Play', 'Wed'),
(214, 'rain', 71, 80, 'true', 'Dont Play', 'Wed'),
(215, NULL, 10, 100, 'true', NULL, 'Wed'),
(216, NULL, 10, 100, 'true', NULL, 'Wed'),
(217, 'sunny', 85, 85, 'false', 'Dont Play', 'Wed'),
(218, 'sunny', 80, 9, 'true', 'Dont Play', 'Wed'),
(219, 'overcast', 83, 78, 'false', 'Play', 'Wed'),
(220, 'rain', 70, 9, 'false', 'Play', 'Wed'),
(221, 'rain', 68, 80, 'false', 'Play', 'Wed');
</pre></li>
<li>Get correlation between temperature and humidity: <pre class="example">
DROP TABLE IF EXISTS example_data_output, example_data_output_summary;
SELECT madlib.correlation( 'example_data',
'example_data_output',
'temperature, humidity'
);
</pre> View the correlation matrix: <pre class="example">
SELECT * FROM example_data_output ORDER BY column_position;
</pre> <pre class="result">
column_position | variable | temperature | humidity
-----------------+-------------+---------------------+----------
1 | temperature | 1 |
2 | humidity | 0.00607993890408995 | 1
(2 rows)
</pre> View the summary table: <pre class="example">
\x on
SELECT * FROM example_data_output_summary;
</pre> <pre class="result">
-[ RECORD 1 ]--------+-----------------------------------
method | Correlation
source | example_data
output_table | example_data_output
column_names | {temperature,humidity}
mean_vector | {70.188679245283,79.8679245283019}
total_rows_processed | 53
</pre></li>
<li>Correlation with grouping by day: <pre class="example">
\x off
DROP TABLE IF EXISTS example_data_output, example_data_output_summary;
SELECT madlib.correlation( 'example_data',
'example_data_output',
'temperature, humidity',
FALSE,
'day'
);
</pre> View the correlation matrix by group: <pre class="example">
SELECT * FROM example_data_output ORDER BY day, column_position;
</pre> <pre class="result">
column_position | variable | day | temperature | humidity
-----------------+-------------+------+-------------------+----------
1 | temperature | Mon | 1 |
2 | humidity | Mon | 0.616876934548786 | 1
1 | temperature | Tues | 1 |
2 | humidity | Tues | 0.616876934548786 | 1
1 | temperature | Wed | 1 |
2 | humidity | Wed | -0.28969669368457 | 1
(6 rows)
</pre> View the summary table: <pre class="example">
\x on
SELECT * FROM example_data_output_summary ORDER BY day;
</pre> <pre class="result">
-[ RECORD 1 ]--------+------------------------------------
method | Correlation
source | example_data
output_table | example_data_output
column_names | {temperature,humidity}
day | Mon
mean_vector | {77.5,82.75}
total_rows_processed | 16
-[ RECORD 2 ]--------+------------------------------------
method | Correlation
source | example_data
output_table | example_data_output
column_names | {temperature,humidity}
day | Tues
mean_vector | {77.5,82.75}
total_rows_processed | 16
-[ RECORD 3 ]--------+------------------------------------
method | Correlation
source | example_data
output_table | example_data_output
column_names | {temperature,humidity}
day | Wed
mean_vector | {59.0476190476191,75.4761904761905}
total_rows_processed | 21
</pre></li>
<li>Get covariance between temperature and humidity: <pre class="example">
\x off
DROP TABLE IF EXISTS example_data_output, example_data_output_summary;
SELECT madlib.covariance( 'example_data',
'example_data_output',
'temperature, humidity'
);
</pre> View the covariance matrix: <pre class="example">
SELECT * FROM example_data_output ORDER BY column_position;
</pre> <pre class="result">
column_position | variable | temperature | humidity
-----------------+-------------+------------------+------------------
1 | temperature | 507.926664293343 |
2 | humidity | 2.40227839088644 | 307.359914560342
(2 rows)
</pre> View the summary table: <pre class="example">
\x on
SELECT * FROM example_data_output_summary;
</pre> <pre class="result">
-[ RECORD 1 ]--------+-----------------------------------
method | Covariance
source | example_data
output_table | example_data_output
column_names | {temperature,humidity}
mean_vector | {70.188679245283,79.8679245283019}
total_rows_processed | 53
</pre></li>
</ol>
<dl class="section user"><dt>Notes</dt><dd></dd></dl>
<p>Null values will be replaced by the mean of their respective columns (mean imputation/substitution). Mean imputation is a method in which the missing value on a certain variable is replaced by the mean of the available cases. This method maintains the sample size and is easy to use, but the variability in the data is reduced, so the standard deviations and the variance estimates tend to be underestimated. Please refer to [1] and [2] for details.</p>
<p>If the mean imputation method is not suitable for the target use case, it is advised to employ a view that handles the NULL values prior to calling the correlation/covariance functions.</p>
<p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>[1] <a href="https://en.wikipedia.org/wiki/Imputation_(statistics)">https://en.wikipedia.org/wiki/Imputation_(statistics)</a></p>
<p>[2] <a href="https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/">https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/</a></p>
<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd></dd></dl>
<p>File <a class="el" href="correlation_8sql__in.html" title="SQL functions for correlation computation. ">correlation.sql_in</a> documenting the SQL functions</p>
<p><a class="el" href="group__grp__summary.html">Summary</a> for general descriptive statistics for a table </p>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Wed Mar 31 2021 20:45:48 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
</ul>
</div>
</body>
</html>