blob: 637902bc717ef8b66b0a8d3eec8a154779ac6a53 [file] [log] [blame]
<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: Singular Value Decomposition</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'madlib.apache.org');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.21.0</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__svd.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Singular Value Decomposition<div class="ingroups"><a class="el" href="group__grp__datatrans.html">Data Types and Transformations</a> &raquo; <a class="el" href="group__grp__arraysmatrix.html">Arrays and Matrices</a> &raquo; <a class="el" href="group__grp__matrix__factorization.html">Matrix Factorization</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> <ul>
<li>
<a href="#syntax">SVD Functions</a> </li>
<li>
<a href="#output">Output Tables</a> </li>
<li>
<a href="#examples">Examples</a></li>
<li>
</li>
<li>
<a href="#background">Technical Background</a> </li>
</ul>
</div><p>In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics.</p>
<p>Let \(A\) be a \(mxn\) matrix, where \(m \ge n\). Then \(A\) can be decomposed as follows: </p><p class="formulaDsp">
\[ A = U \Sigma V^T, \]
</p>
<p> where \(U\) is a \(m \times n\) orthonormal matrix, \(\Sigma\) is a \(n \times n\) diagonal matrix, and \(V\) is an \(n \times n\) orthonormal matrix. The diagonal elements of \(\Sigma\) are called the <em>singular values</em>.</p>
<p><a class="anchor" id="syntax"></a></p><dl class="section user"><dt>SVD Functions</dt><dd></dd></dl>
<p>SVD factorizations are provided for dense and sparse matrices. In addition, a native implementation is provided for very sparse matrices for improved performance.</p>
<p><b>SVD Function for Dense Matrices</b></p>
<pre class="syntax">
svd( source_table,
output_table_prefix,
row_id,
k,
n_iterations,
result_summary_table
);
</pre><p> <b>Arguments</b> </p><dl class="arglist">
<dt>source_table </dt>
<dd><p class="startdd">TEXT. Source table name (dense matrix).</p>
<p class="enddd">The table contains a <code>row_id</code> column that identifies each row, with numbering starting from 1. The other columns contain the data for the matrix. There are two possible dense formats as illustrated by the 2x2 matrix example below. You can use either of these dense formats:</p><ol type="1">
<li><pre class="example">
row_id col1 col2
row1 1 1 0
row2 2 0 1
</pre></li>
<li><pre class="example">
row_id row_vec
row1 1 {1, 0}
row2 2 {0, 1}
</pre> </li>
</ol>
</dd>
<dt>output_table_prefix </dt>
<dd>TEXT. Prefix for output tables. See <a href="#output">Output Tables</a> below for a description of the convention used. </dd>
<dt>row_id </dt>
<dd>TEXT. ID for each row. </dd>
<dt>k </dt>
<dd>INTEGER. Number of singular values to compute. </dd>
<dt>n_iterations (optional). </dt>
<dd>INTEGER. Number of iterations to run. <dl class="section note"><dt>Note</dt><dd>The number of iterations must be in the range [k, column dimension], where k is number of singular values. </dd></dl>
</dd>
<dt>result_summary_table (optional) </dt>
<dd>TEXT. The name of the table to store the result summary. </dd>
</dl>
<hr/>
<p> <b>SVD Function for Sparse Matrices</b></p>
<p>Use this function for matrices that are represented in the sparse-matrix format (example below). <b>Note that the input matrix is converted to a dense matrix before the SVD operation, for efficient computation reasons. </b></p>
<pre class="syntax">
svd_sparse( source_table,
output_table_prefix,
row_id,
col_id,
value,
row_dim,
col_dim,
k,
n_iterations,
result_summary_table
);
</pre><p> <b>Arguments</b> </p><dl class="arglist">
<dt>source_table </dt>
<dd><p class="startdd">TEXT. Source table name (sparse matrix).</p>
<p>A sparse matrix is represented using the row and column indices for each non-zero entry of the matrix. This representation is useful for matrices containing multiple zero elements. Below is an example of a sparse 4x7 matrix with just 6 out of 28 entries being non-zero. The dimensionality of the matrix is inferred using the max value in <em>row</em> and <em>col</em> columns. Note the last entry is included (even though it is 0) to provide the dimensionality of the matrix, indicating that the 4th row and 7th column contain all zeros. </p><pre class="example">
row_id | col_id | value
--------+--------+-------
1 | 1 | 9
1 | 5 | 6
1 | 6 | 6
2 | 1 | 8
3 | 1 | 3
3 | 2 | 9
4 | 7 | 0
(6 rows)
</pre> <p class="enddd"></p>
</dd>
<dt>output_table_prefix </dt>
<dd>TEXT. Prefix for output tables. See <a href="#output">Output Tables</a> below for a description of the convention used. </dd>
<dt>row_id </dt>
<dd>TEXT. Name of the column containing the row index for each entry in sparse matrix. </dd>
<dt>col_id </dt>
<dd>TEXT. Name of the column containing the column index for each entry in sparse matrix. </dd>
<dt>value </dt>
<dd>TEXT. Name of column containing the non-zero values of the sparse matrix. </dd>
<dt>row_dim </dt>
<dd>INTEGER. Number of rows in matrix. </dd>
<dt>col_dim </dt>
<dd>INTEGER. Number of columns in matrix. </dd>
<dt>k </dt>
<dd>INTEGER. Number of singular values to compute. </dd>
<dt>n_iterations (optional) </dt>
<dd>INTEGER. Number of iterations to run. <dl class="section note"><dt>Note</dt><dd>The number of iterations must be in the range [k, column dimension], where k is number of singular values. </dd></dl>
</dd>
<dt>result_summary_table (optional) </dt>
<dd>TEXT. The name of the table to store the result summary. </dd>
</dl>
<hr/>
<p> <b>Native Implementation for Sparse Matrices</b></p>
<p>Use this function for matrices that are represented in the sparse-matrix format (see sparse matrix example above). This function uses the native sparse representation while computing the SVD. </p><dl class="section note"><dt>Note</dt><dd>Note that this function should be favored if the matrix is highly sparse, since it computes very sparse matrices efficiently. </dd></dl>
<pre class="syntax">
svd_sparse_native( source_table,
output_table_prefix,
row_id,
col_id,
value,
row_dim,
col_dim,
k,
n_iterations,
result_summary_table
);
</pre><p> <b>Arguments</b> </p><dl class="arglist">
<dt>source_table </dt>
<dd>TEXT. Source table name (sparse matrix - see example above). </dd>
<dt>output_table_prefix </dt>
<dd>TEXT. Prefix for output tables. See <a href="#output">Output Tables</a> below for a description of the convention used. </dd>
<dt>row_id </dt>
<dd>TEXT. ID for each row. </dd>
<dt>col_id </dt>
<dd>TEXT. ID for each column. </dd>
<dt>value </dt>
<dd>TEXT. Non-zero values of the sparse matrix. </dd>
<dt>row_dim </dt>
<dd>INTEGER. Row dimension of sparse matrix. </dd>
<dt>col_dim </dt>
<dd>INTEGER. Col dimension of sparse matrix. </dd>
<dt>k </dt>
<dd>INTEGER. Number of singular values to compute. </dd>
<dt>n_iterations (optional) </dt>
<dd>INTEGER. Number of iterations to run. <dl class="section note"><dt>Note</dt><dd>The number of iterations must be in the range [k, column dimension], where k is number of singular values. </dd></dl>
</dd>
<dt>result_summary_table (optional) </dt>
<dd>TEXT. Table name to store result summary. </dd>
</dl>
<hr/>
<p><a class="anchor" id="output"></a></p><dl class="section user"><dt>Output Tables</dt><dd></dd></dl>
<p>Output for eigenvectors/values is in the following three tables:</p><ul>
<li>Left singular matrix: Table is named &lt;output_table_prefix&gt;_u (e.g. ‘netflix_u’)</li>
<li>Right singular matrix: Table is named &lt;output_table_prefix&gt;_v (e.g. ‘netflix_v’)</li>
<li>Singular values: Table is named &lt;output_table_prefix&gt;_s (e.g. ‘netflix_s’)</li>
</ul>
<p>The left and right singular vector tables are of the format: </p><table class="output">
<tr>
<th>row_id </th><td>INTEGER. The ID corresponding to each eigenvalue (in decreasing order). </td></tr>
<tr>
<th>row_vec </th><td>FLOAT8[]. Singular vector elements for this row_id. Each array is of size k. </td></tr>
</table>
<p>The singular values table is in sparse table format, since only the diagonal elements of the matrix are non-zero: </p><table class="output">
<tr>
<th>row_id </th><td>INTEGER. <em>i</em> for <em>ith</em> eigenvalue. </td></tr>
<tr>
<th>col_id </th><td>INTEGER. <em>i</em> for <em>ith</em> eigenvalue (same as row_id). </td></tr>
<tr>
<th>value </th><td>FLOAT8. Eigenvalue. </td></tr>
</table>
<p>All <code>row_id</code> and <code>col_id</code> in the above tables start from 1.</p>
<p>The result summary table has the following columns: </p><table class="output">
<tr>
<th>rows_used </th><td>INTEGER. Number of rows used for SVD calculation. </td></tr>
<tr>
<th>exec_time </th><td>FLOAT8. Total time for executing SVD. </td></tr>
<tr>
<th>iter </th><td>INTEGER. Total number of iterations run. </td></tr>
<tr>
<th>recon_error </th><td>FLOAT8. Total quality score (i.e. approximation quality) for this set of orthonormal basis. </td></tr>
<tr>
<th>relative_recon_error </th><td>FLOAT8. Relative quality score. </td></tr>
</table>
<p>In the result summary table, the reconstruction error is computed as \( \sqrt{mean((X - USV^T)_{ij}^2)} \), where the average is over all elements of the matrices. The relative reconstruction error is then computed as ratio of the reconstruction error and \( \sqrt{mean(X_{ij}^2)} \).</p>
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<ol type="1">
<li>View online help for the SVD function. <pre class="example">
SELECT madlib.svd();
</pre></li>
<li>Create an input dataset (dense matrix). <pre class="example">
DROP TABLE IF EXISTS mat, mat_sparse, svd_summary_table, svd_u, svd_v, svd_s;
CREATE TABLE mat (
row_id integer,
row_vec double precision[]
);
INSERT INTO mat VALUES
(1,'{396,840,353,446,318,886,15,584,159,383}'),
(2,'{691,58,899,163,159,533,604,582,269,390}'),
(3,'{293,742,298,75,404,857,941,662,846,2}'),
(4,'{462,532,787,265,982,306,600,608,212,885}'),
(5,'{304,151,337,387,643,753,603,531,459,652}'),
(6,'{327,946,368,943,7,516,272,24,591,204}'),
(7,'{877,59,260,302,891,498,710,286,864,675}'),
(8,'{458,959,774,376,228,354,300,669,718,565}'),
(9,'{824,390,818,844,180,943,424,520,65,913}'),
(10,'{882,761,398,688,761,405,125,484,222,873}'),
(11,'{528,1,860,18,814,242,314,965,935,809}'),
(12,'{492,220,576,289,321,261,173,1,44,241}'),
(13,'{415,701,221,503,67,393,479,218,219,916}'),
(14,'{350,192,211,633,53,783,30,444,176,932}'),
(15,'{909,472,871,695,930,455,398,893,693,838}'),
(16,'{739,651,678,577,273,935,661,47,373,618}');
</pre></li>
<li>Run SVD function for a dense matrix. <pre class="example">
SELECT madlib.svd( 'mat', -- Input table
'svd', -- Output table prefix
'row_id', -- Column name with row index
10, -- Number of singular values to compute
NULL, -- Use default number of iterations
'svd_summary_table' -- Result summary table
);
</pre></li>
<li>Print out the singular values and the summary table. For the singular values: <pre class="example">
SELECT * FROM svd_s ORDER BY row_id;
</pre> Result: <pre class="result">
row_id | col_id | value
&#160;--------+--------+------------------
1 | 1 | 6475.67225281804
2 | 2 | 1875.18065580415
3 | 3 | 1483.25228429636
4 | 4 | 1159.72262897427
5 | 5 | 1033.86092570574
6 | 6 | 948.437358703966
7 | 7 | 795.379572772455
8 | 8 | 709.086240684469
9 | 9 | 462.473775959371
10 | 10 | 365.875217945698
10 | 10 |
(11 rows)
</pre> For the summary table: <pre class="example">
SELECT * FROM svd_summary_table;
</pre> Result: <pre class="result">
rows_used | exec_time (ms) | iter | recon_error | relative_recon_error
&#160;-----------+----------------+------+-------------------+----------------------
16 | 1332.47 | 10 | 4.36920148766e-13 | 7.63134130332e-16
(1 row)
</pre></li>
<li>Create a sparse matrix by running the <a class="el" href="matrix__ops_8sql__in.html#a390fb7234f49e17c780e961184873759">matrix_sparsify()</a> utility function on the dense matrix. <pre class="example">
SELECT madlib.matrix_sparsify('mat',
'row=row_id, val=row_vec',
'mat_sparse',
'row=row_id, col=col_id, val=value');
</pre></li>
<li>Run the SVD function for a sparse matrix. <pre class="example">
SELECT madlib.svd_sparse( 'mat_sparse', -- Input table
'svd', -- Output table prefix
'row_id', -- Column name with row index
'col_id', -- Column name with column index
'value', -- Matrix cell value
16, -- Number of rows in matrix
10, -- Number of columns in matrix
10 -- Number of singular values to compute
);
</pre></li>
<li>Run the SVD function for a very sparse matrix. <pre class="example">
SELECT madlib.svd_sparse_native ( 'mat_sparse', -- Input table
'svd', -- Output table prefix
'row_id', -- Column name with row index
'col_id', -- Column name with column index
'value', -- Matrix cell value
16, -- Number of rows in matrix
10, -- Number of columns in matrix
10 -- Number of singular values to compute
);
</pre> <a class="anchor" id="background"></a><dl class="section user"><dt>Technical Background</dt><dd>In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics. Let \(A\) be a \(m \times n\) matrix, where \(m \ge n\). Then \(A\) can be decomposed as follows: <p class="formulaDsp">
\[ A = U \Sigma V^T, \]
</p>
where \(U\) is a \(m \times n\) orthonormal matrix, \(\Sigma\) is a \(n \times n\) diagonal matrix, and \(V\) is an \(n \times n\) orthonormal matrix. The diagonal elements of \(\Sigma\) are called the <em>singular values</em>. It is possible to formulate the problem of computing the singular triplets ( \(\sigma_i, u_i, v_i\)) of \(A\) as an eigenvalue problem involving a Hermitian matrix related to \(A\). There are two possible ways of achieving this:</dd></dl>
</li>
</ol>
<ul>
<li>With the cross product matrix, \(A^TA\) and \(AA^T\)</li>
<li>With the cyclic matrix <p class="formulaDsp">
\[ H(A) = \begin{bmatrix} 0 &amp; A\\ A^* &amp; 0 \end{bmatrix} \]
</p>
The singular values are the nonnegative square roots of the eigenvalues of the cross product matrix. This approach may imply a severe loss of accuracy in the smallest singular values. The cyclic matrix approach is an alternative that avoids this problem, but at the expense of significantly increasing the cost of the computation. Computing the cross product matrix explicitly is not recommended, especially in the case of sparse A. Bidiagonalization was proposed by Golub and Kahan [citation?] as a way of tridiagonalizing the cross product matrix without forming it explicitly. Consider the following decomposition <p class="formulaDsp">
\[ A = P B Q^T, \]
</p>
where \(P\) and \(Q\) are unitary matrices and \(B\) is an \(m \times n\) upper bidiagonal matrix. Then the tridiagonal matrix \(B*B\) is unitarily similar to \(A*A\). Additionally, specific methods exist that compute the singular values of \(B\) without forming \(B*B\). Therefore, after computing the SVD of B, <p class="formulaDsp">
\[ B = X\Sigma Y^T, \]
</p>
it only remains to compute the SVD of the original matrix with \(U = PX\) and \(V = QY\). </li>
</ul>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Thu Feb 23 2023 19:26:37 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
</ul>
</div>
</body>
</html>