blob: 763480a04f75f235d7a556fa2d295f75e84823bd [file] [log] [blame]
<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: PageRank</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<!-- hack in the navigation tree -->
<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'madlib.apache.org');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.21.0</span>
</div>
<div id="projectbrief">User Documentation for Apache MADlib</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__pagerank.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">PageRank<div class="ingroups"><a class="el" href="group__grp__graph.html">Graph</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b> <ul>
<li>
<a href="#pagerank">PageRank</a> </li>
<li>
<a href="#examples">Examples</a> </li>
<li>
<a href="#notes">Notes</a> </li>
<li>
<a href="#literature">Literature</a> </li>
</ul>
</div><p>Given a graph, the PageRank algorithm outputs a probability distribution representing the likelihood that a person randomly traversing the graph will arrive at any particular vertex. This algorithm was originally used by Google to rank websites where the World Wide Web was modeled as a directed graph with the vertices representing the websites. The PageRank algorithm initially proposed by Larry Page and Sergey Brin is implemented here [1].</p>
<p>We also implement personalized PageRank, in which a notion of importance provides personalization to a query. For example, importance scores can be biased according to a specified set of vertices in the graph that are of interest or special in some way [2].</p>
<p><a class="anchor" id="pagerank"></a></p><dl class="section user"><dt>PageRank</dt><dd><pre class="syntax">
pagerank( vertex_table,
vertex_id,
edge_table,
edge_args,
out_table,
damping_factor,
max_iter,
threshold,
grouping_cols,
personalization_vertices
)
</pre></dd></dl>
<p><b>Arguments</b> </p><dl class="arglist">
<dt>vertex_table </dt>
<dd><p class="startdd">TEXT. Name of the table containing the vertex data for the graph. Must contain the column specified in the 'vertex_id' parameter below.</p>
<p>TEXT, default = 'id'. Name of the column(s) in 'vertex_table' containing vertex ids. The vertex ids can be of type INTEGER or BIGINT with no duplicates. They do not need to be contiguous. If multiple columns are used as vertex ids, they are passed in the following format: [&lt;vertex_id1&gt;,&lt;vertex_id2&gt;,...]</p>
<p class="enddd"></p>
</dd>
<dt>edge_table </dt>
<dd><p class="startdd">TEXT. Name of the table containing the edge data. The edge table must contain columns for source vertex and destination vertex.</p>
<p class="enddd"></p>
</dd>
<dt>edge_args </dt>
<dd><p class="startdd">TEXT. A comma-delimited string containing multiple named arguments of the form "name=value". The following parameters are supported for this string argument:</p><ul>
<li>src (INTEGER or BIGINT): Name of the column(s) containing the source vertex ids in the edge table. Default column name is 'src'.</li>
<li>dest (INTEGER or BIGINT): Name of the column(s) containing the destination vertex ids in the edge table. Default column name is 'dest'.</li>
</ul>
<p class="enddd"></p>
</dd>
<dt>out_table </dt>
<dd><p class="startdd">TEXT. Name of the table to store the result of PageRank. It will contain a row for every vertex from 'vertex_table' with the following columns:</p><ul>
<li>vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for column naming.</li>
<li>pagerank : The vertex's PageRank.</li>
<li>grouping_cols : Grouping column (if any) values associated with the vertex_id.</li>
</ul>
<p>A summary table is also created that contains information regarding the number of iterations required for convergence. It is named by adding the suffix '_summary' to the 'out_table' parameter.</p>
<p class="enddd"></p>
</dd>
<dt>damping_factor (optional) </dt>
<dd><p class="startdd">FLOAT8, default 0.85. The probability, at any step, that a user will continue following the links in a random surfer model.</p>
<p class="enddd"></p>
</dd>
<dt>max_iter (optional) </dt>
<dd><p class="startdd">INTEGER, default: 100. The maximum number of iterations allowed.</p>
<p class="enddd"></p>
</dd>
<dt>threshold (optional) </dt>
<dd><p class="startdd">FLOAT8, default: (1/number of vertices * 1000). If the difference between the PageRank of every vertex of two consecutive iterations is smaller than 'threshold', or the iteration number is larger than 'max_iter', the computation stops. If you set the threshold to zero, then you will force the algorithm to run for the full number of iterations specified in 'max_iter'. It is advisable to set threshold to a value lower than 1/(number of vertices in the graph) since the PageRank value of nodes is initialized to that value.</p>
<p class="enddd"></p>
</dd>
<dt>grouping_cols (optional) </dt>
<dd>TEXT, default: NULL. A single column or a list of comma-separated columns that divides the input data into discrete groups, resulting in one distribution per group. When this value is NULL, no grouping is used and a single model is generated for all data. <dl class="section note"><dt>Note</dt><dd>Expressions are not currently supported for 'grouping_cols'.</dd></dl>
</dd>
<dt>personalization_vertices (optional) </dt>
<dd>INTEGER[] or BIGINT[], default: NULL. A comma separated list of vertices or nodes for personalized PageRank. When this parameter is provided, personalized PageRank will run. In the absence of this parameter, regular PageRank will run. If multiple columns are used for identifying vertices, a 2D array will be required for this parameter. </dd>
</dl>
<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
<p><a href="example/madlib_pagerank_example.sql">Download the example sql file here.</a></p>
<ol type="1">
<li>Create vertex and edge tables to represent the graph: <pre class="syntax">
DROP TABLE IF EXISTS vertex, edge;
CREATE TABLE vertex(
node_id INTEGER
);
CREATE TABLE edge(
conn_src INTEGER,
conn_dest INTEGER,
user_id INTEGER
);
INSERT INTO vertex VALUES
(0),
(1),
(2),
(3),
(4),
(5),
(6);
INSERT INTO edge VALUES
(0, 1, 1),
(0, 2, 1),
(0, 4, 1),
(1, 2, 1),
(1, 3, 1),
(2, 3, 1),
(2, 5, 1),
(2, 6, 1),
(3, 0, 1),
(4, 0, 1),
(5, 6, 1),
(6, 3, 1),
(0, 1, 2),
(0, 2, 2),
(0, 4, 2),
(1, 2, 2),
(1, 3, 2),
(2, 3, 2),
(3, 0, 2),
(4, 0, 2),
(5, 6, 2),
(6, 3, 2);
</pre></li>
<li>Running PageRank with default values for optional parameters: <pre class="syntax">
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
SELECT madlib.pagerank(
'vertex', -- Vertex table
'node_id', -- Vertex id column
'edge', -- Edge table
'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments
'pagerank_out'); -- Output table of PageRank
SELECT * FROM pagerank_out ORDER BY pagerank DESC;
</pre> <pre class="result">
node_id | pagerank
---------+-------------------
0 | 0.28753749341184
3 | 0.21016988901855
2 | 0.14662683454062
4 | 0.10289614384217
1 | 0.10289614384217
6 | 0.09728637768887
5 | 0.05258711765692
(7 rows)
</pre> <pre class="syntax">
SELECT * FROM pagerank_out_summary;
</pre> <pre class="result">
__iterations__
----------------+
16
(1 row)
</pre></li>
<li>Running PageRank with a damping factor of 0.5 results in different final values: <pre class="syntax">
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
SELECT madlib.pagerank(
'vertex', -- Vertex table
'node_id', -- Vertex id column
'edge', -- Edge table
'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments
'pagerank_out', -- Output table of PageRank
0.5); -- Damping factor
SELECT * FROM pagerank_out ORDER BY pagerank DESC;
</pre> <pre class="result">
node_id | pagerank
---------+--------------------
0 | 0.225477161441199
3 | 0.199090328586664
2 | 0.136261327206477
6 | 0.132691559968224
4 | 0.109009291409508
1 | 0.109009291409508
5 | 0.0884610399788161
(7 rows)
</pre></li>
<li>Now compute the PageRank of vertices associated with each user using the grouping feature: <pre class="syntax">
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
SELECT madlib.pagerank(
'vertex', -- Vertex table
'node_id', -- Vertex id column
'edge', -- Edge table
'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments
'pagerank_out', -- Output table of PageRank
NULL, -- Default damping factor (0.85)
NULL, -- Default max iters (100)
0.00000001, -- Threshold
'user_id'); -- Grouping column name
SELECT * FROM pagerank_out ORDER BY user_id, pagerank DESC;
</pre> <pre class="result">
user_id | node_id | pagerank
---------+---------+--------------------
1 | 0 | 0.27825488388552
1 | 3 | 0.20188114667075
1 | 2 | 0.14288112346059
1 | 6 | 0.11453637832147
1 | 1 | 0.10026745615438
1 | 4 | 0.10026745615438
1 | 5 | 0.06191155535288
2 | 0 | 0.31854625004173
2 | 3 | 0.23786686773343
2 | 2 | 0.15914876489397
2 | 1 | 0.11168334437971
2 | 4 | 0.11168334437971
2 | 6 | 0.03964285714285
2 | 5 | 0.02142857142857
(14 rows)
</pre> <pre class="syntax">
SELECT * FROM pagerank_out_summary ORDER BY user_id;
</pre> <pre class="result">
user_id | __iterations__
---------+----------------
1 | 27
2 | 31
(2 rows)
</pre></li>
<li>Personalized PageRank. Here we specify {2,4} as the personalization vertices. This parameter could be specified as ARRAY[2,4] as well. <pre class="syntax">
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
SELECT madlib.pagerank(
'vertex', -- Vertex table
'node_id', -- Vertex id column
'edge', -- Edge table
'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments
'pagerank_out', -- Output table of PageRank
NULL, -- Default damping factor (0.85)
NULL, -- Default max iters (100)
NULL, -- Default Threshold
NULL, -- No Grouping
'{2,4}'); -- Personalization vertices
SELECT * FROM pagerank_out ORDER BY pagerank DESC;
</pre> <pre class="result">
node_id | pagerank
---------+--------------------
0 | 0.565232961966315
2 | 0.378139420991773
3 | 0.355003292266017
4 | 0.310111215897626
1 | 0.160111215897626
6 | 0.148615315574136
5 | 0.0803403307142321
(7 rows)
</pre> <pre class="syntax">
SELECT * FROM pagerank_out_summary;
</pre> <pre class="result">
__iterations__
----------------+
37
(1 row)
</pre></li>
<li>Create vertex and edge tables with multiple column ids to represent the graph: <pre class="syntax">
DROP TABLE IF EXISTS vertex_multicol_pagerank, edge_multicol_pagerank;
CREATE TABLE vertex_multicol_pagerank(
node_id_major BIGINT,
node_id_minor BIGINT
);
CREATE TABLE edge_multicol_pagerank(
conn_src_major BIGINT,
conn_dest_major BIGINT,
user_id_major BIGINT,
conn_src_minor BIGINT,
conn_dest_minor BIGINT,
user_id_minor BIGINT
);
INSERT INTO vertex_multicol_pagerank VALUES
(0, 0),
(1, 1),
(2, 2),
(3, 3),
(4, 4),
(5, 5),
(6, 6);
INSERT INTO edge_multicol_pagerank VALUES
(0, 1, 1, 0, 1, 1),
(0, 2, 1, 0, 2, 1),
(0, 4, 1, 0, 4, 1),
(1, 2, 1, 1, 2, 1),
(1, 3, 1, 1, 3, 1),
(2, 3, 1, 2, 3, 1),
(2, 5, 1, 2, 5, 1),
(2, 6, 1, 2, 6, 1),
(3, 0, 1, 3, 0, 1),
(4, 0, 1, 4, 0, 1),
(5, 6, 1, 5, 6, 1),
(6, 3, 1, 6, 3, 1),
(0, 1, 2, 0, 1, 2),
(0, 2, 2, 0, 2, 2),
(0, 4, 2, 0, 4, 2),
(1, 2, 2, 1, 2, 2),
(1, 3, 2, 1, 3, 2),
(2, 3, 2, 2, 3, 2),
(3, 0, 2, 3, 0, 2),
(4, 0, 2, 4, 0, 2),
(5, 6, 2, 5, 6, 2),
(6, 3, 2, 6, 3, 2);
</pre></li>
<li>Personalized PageRank. Here we specify {2,4} as the personalization vertices. This parameter could be specified as ARRAY[2,4] as well. <pre class="syntax">
DROP TABLE IF EXISTS pagerank_multicol_out, pagerank_multicol_out_summary;
SELECT madlib.pagerank(
'vertex_multicol_pagerank', -- Vertex table
'[node_id_major,node_id_minor]', -- Vertex id column
'edge_multicol_pagerank', -- Edge table
'src=[conn_src_major,conn_src_minor], dest=[conn_dest_major,conn_dest_minor]', -- Comma delimted string of edge arguments
'pagerank_multicol_out', -- Output table of PageRank
NULL, -- Default damping factor (0.85)
NULL, -- Default max iters (100)
NULL, -- Default Threshold
'user_id_major,user_id_minor', -- Grouping Columns
'{{2,2},{4,4}}'); -- Personalization vertices
SELECT * FROM pagerank_multicol_out ORDER BY pagerank DESC;
</pre> <pre class="result">
user_id_major | user_id_minor | id | pagerank
---------------+---------------+-------+--------------------
2 | 2 | {0,0} | 0.448826703440932
2 | 2 | {3,3} | 0.325943770128465
1 | 1 | {0,0} | 0.270635964385879
2 | 2 | {2,2} | 0.256179815391031
2 | 2 | {4,4} | 0.202149921235622
1 | 1 | {2,2} | 0.18423239851445
1 | 1 | {3,3} | 0.166801820206414
1 | 1 | {4,4} | 0.151661035568349
2 | 2 | {1,1} | 0.127149921235622
1 | 1 | {6,6} | 0.0965411872854988
1 | 1 | {1,1} | 0.0766610355683489
2 | 2 | {5,5} | 0.075
2 | 2 | {6,6} | 0.06375
1 | 1 | {5,5} | 0.0521896024086525
(7 rows)
</pre> <pre class="syntax">
SELECT * FROM pagerank_multicol_out_summary;
</pre> <pre class="result">
user_id_major | user_id_minor | __iterations__
---------------+---------------+----------------
2 | 2 | 45
1 | 1 | 41
(1 row)
</pre></li>
</ol>
<p><a class="anchor" id="notes"></a></p><dl class="section user"><dt>Notes</dt><dd></dd></dl>
<ol type="1">
<li>On a Greenplum cluster, the edge table should be distributed by the source vertex id column for better performance.</li>
</ol>
<p><a class="anchor" id="literature"></a></p><dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>[1] Brin, S. and Page, L. (1998), "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems. 30: 107–117, <a href="http://infolab.stanford.edu/pub/papers/google.pdf">http://infolab.stanford.edu/pub/papers/google.pdf</a></p>
<p>[2] Jeh, Glen and Widom, Jennifer. "Scaling Personalized Web Search", Proceedings of the 12th international conference on World Wide Web, Pages 271-279 Budapest, Hungary, May 20-24, 2003, <a href="http://ilpubs.stanford.edu:8090/530/1/2002-12.pdf">http://ilpubs.stanford.edu:8090/530/1/2002-12.pdf</a> </p>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Thu Feb 23 2023 19:26:40 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
</ul>
</div>
</body>
</html>