| <!-- HTML header for doxygen 1.8.4--> |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| <html xmlns="http://www.w3.org/1999/xhtml"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/> |
| <meta http-equiv="X-UA-Compatible" content="IE=9"/> |
| <meta name="generator" content="Doxygen 1.8.10"/> |
| <meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/> |
| <title>MADlib: Term Frequency</title> |
| <link href="tabs.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="jquery.js"></script> |
| <script type="text/javascript" src="dynsections.js"></script> |
| <link href="navtree.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="resize.js"></script> |
| <script type="text/javascript" src="navtreedata.js"></script> |
| <script type="text/javascript" src="navtree.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(initResizable); |
| $(window).load(resizeHeight); |
| </script> |
| <link href="search/search.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="search/searchdata.js"></script> |
| <script type="text/javascript" src="search/search.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(function() { init_search(); }); |
| </script> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"], |
| jax: ["input/TeX","output/HTML-CSS"], |
| }); |
| </script><script src="../mathjax/MathJax.js"></script> |
| <!-- hack in the navigation tree --> |
| <script type="text/javascript" src="navtree_hack.js"></script> |
| <link href="doxygen.css" rel="stylesheet" type="text/css" /> |
| <link href="madlib_extra.css" rel="stylesheet" type="text/css"/> |
| <!-- google analytics --> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| ga('create', 'UA-45382226-1', 'auto'); |
| ga('send', 'pageview'); |
| </script> |
| </head> |
| <body> |
| <div id="top"><!-- do not remove this div, it is closed by doxygen! --> |
| <div id="titlearea"> |
| <table cellspacing="0" cellpadding="0"> |
| <tbody> |
| <tr style="height: 56px;"> |
| <td id="projectlogo"><a href="http://madlib.incubator.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td> |
| <td style="padding-left: 0.5em;"> |
| <div id="projectname"> |
| <span id="projectnumber">1.8</span> |
| </div> |
| <div id="projectbrief">User Documentation for MADlib</div> |
| </td> |
| <td> <div id="MSearchBox" class="MSearchBoxInactive"> |
| <span class="left"> |
| <img id="MSearchSelect" src="search/mag_sel.png" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| alt=""/> |
| <input type="text" id="MSearchField" value="Search" accesskey="S" |
| onfocus="searchBox.OnSearchFieldFocus(true)" |
| onblur="searchBox.OnSearchFieldFocus(false)" |
| onkeyup="searchBox.OnSearchFieldChange(event)"/> |
| </span><span class="right"> |
| <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a> |
| </span> |
| </div> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <!-- end header part --> |
| <!-- Generated by Doxygen 1.8.10 --> |
| <script type="text/javascript"> |
| var searchBox = new SearchBox("searchBox", "search",false,'Search'); |
| </script> |
| </div><!-- top --> |
| <div id="side-nav" class="ui-resizable side-nav-resizable"> |
| <div id="nav-tree"> |
| <div id="nav-tree-contents"> |
| <div id="nav-sync" class="sync"></div> |
| </div> |
| </div> |
| <div id="splitbar" style="-moz-user-select:none;" |
| class="ui-resizable-handle"> |
| </div> |
| </div> |
| <script type="text/javascript"> |
| $(document).ready(function(){initNavTree('group__grp__text__utilities.html','');}); |
| </script> |
| <div id="doc-content"> |
| <!-- window showing the filter options --> |
| <div id="MSearchSelectWindow" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| onkeydown="return searchBox.OnSearchSelectKey(event)"> |
| </div> |
| |
| <!-- iframe showing the search results (closed by default) --> |
| <div id="MSearchResultsWindow"> |
| <iframe src="javascript:void(0)" frameborder="0" |
| name="MSearchResults" id="MSearchResults"> |
| </iframe> |
| </div> |
| |
| <div class="header"> |
| <div class="headertitle"> |
| <div class="title">Term Frequency<div class="ingroups"><a class="el" href="group__grp__utility__functions.html">Utility Functions</a> » <a class="el" href="group__grp__text__analysis.html">Text Analysis</a></div></div> </div> |
| </div><!--header--> |
| <div class="contents"> |
| <div class="toc"><b>Contents</b> <ul> |
| <li> |
| <a href="#term_frequency">Term Frequency</a> </li> |
| <li> |
| <a href="#examples">Examples</a> </li> |
| <li> |
| <a href="#related">Related Topics</a> </li> |
| </ul> |
| </div><p><a class="anchor" id="term_frequency"></a></p><dl class="section user"><dt>Term frequency</dt><dd>Term frequency <code>tf(t,d)</code> is the raw frequency of a word/term in a document, i.e. the number of times that word/term <code>t</code> occurs in document <code>d</code>. For this function, 'word' and 'term' are used interchangeably. <b>Note:</b> the term frequency is not normalized by the document length. <pre class="syntax"> |
| term_frequency(input_table, |
| doc_id_col, |
| word_col, |
| output_table, |
| compute_vocab) |
| </pre></dd></dl> |
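| <p>For intuition, the raw count can be sketched in plain SQL by unnesting the word array and grouping by document and word. This is only an illustration under stated assumptions: the inline <code>docs</code> data is hypothetical, and the query is not necessarily how <code>term_frequency</code> is implemented.</p> |
| <pre class="example"> |
| -- Hypothetical two-document input; not part of the MADlib examples. |
| WITH docs(docid, words) AS ( |
|     VALUES (1, ARRAY['banana', 'and', 'banana']::TEXT[]), |
|            (2, ARRAY['kittens', 'and', 'cute']::TEXT[]) |
| ) |
| -- One output row per (document, word) pair, with the raw occurrence count. |
| SELECT docid, word, count(*) AS count |
| FROM (SELECT docid, unnest(words) AS word FROM docs) AS t |
| GROUP BY docid, word |
| ORDER BY docid, word; |
| </pre> |
| <p>Because a word appears in the array once per occurrence, 'banana' is counted twice for document 1. The count is not divided by the document length.</p> |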
| <p><b>Arguments:</b> </p><dl class="arglist"> |
| <dt>input_table </dt> |
| <dd><p class="startdd">TEXT. The name of the table storing the documents. Each row is in the form &lt;doc_id, word_vector&gt; where <code>doc_id</code> is an id, unique to each document, and <code>word_vector</code> is a text array containing the words in the document. The <code>word_vector</code> should contain multiple entries of a word if the document contains multiple occurrences of that word. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>doc_id_col </dt> |
| <dd><p class="startdd">TEXT. The name of the column containing the document id. </p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>word_col </dt> |
| <dd><p class="startdd">TEXT. The name of the column containing the vector of words/terms in the document. This column should be of a type that can be cast to TEXT[].</p> |
| <p class="enddd"></p> |
| </dd> |
| <dt>output_table </dt> |
| <dd><p class="startdd">TEXT. The name of the table to store the term frequency output. The output table contains the following columns:</p><ul> |
| <li><code>doc_id_col:</code> This is the document id column (same as the one provided as input).</li> |
| <li><code>word:</code> A word/term present in a document. This is either the original word present in <code>word_col</code> or an id representing the word (depending on the value of compute_vocab below).</li> |
| <li><code>count:</code> The number of times this word is found in the document. </li> |
| </ul> |
| <p class="enddd"></p> |
| </dd> |
| <dt>compute_vocab </dt> |
| <dd>BOOLEAN. (Optional, Default=FALSE) Flag to indicate if a vocabulary is to be created. If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is called <em>output_table</em>_vocabulary (suffix added to the <em>output_table</em> name) and contains the following columns:<ul> |
| <li><code>wordid:</code> An id assignment for each word</li> |
| <li><code>word:</code> The word/term </li> |
| </ul> |
| </dd> |
| </dl> |
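| <p>The shape of the vocabulary table can be sketched in plain SQL as follows. The id assignment used here (alphabetical order, starting at 0) is an assumption that happens to match the example output on this page; the function may assign ids differently. The inline <code>docs</code> data is hypothetical.</p> |
| <pre class="example"> |
| -- Hypothetical input; not part of the MADlib examples. |
| WITH docs(docid, words) AS ( |
|     VALUES (1, ARRAY['a', 'b', 'a']::TEXT[]), |
|            (2, ARRAY['b', 'c']::TEXT[]) |
| ) |
| -- One row per distinct word, with an assumed 0-based alphabetical id. |
| SELECT (row_number() OVER (ORDER BY word) - 1)::INTEGER AS wordid, |
|        word |
| FROM (SELECT DISTINCT unnest(words) AS word FROM docs) AS w |
| ORDER BY wordid; |
| </pre> |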
| <p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl> |
| <ol type="1"> |
| <li>Prepare datasets with some example documents <pre class="example"> |
| DROP TABLE IF EXISTS documents; |
| CREATE TABLE documents(docid INTEGER, doc_contents TEXT); |
| INSERT INTO documents VALUES |
| (1, 'I like to eat broccoli and banana. I ate a banana and spinach smoothie for breakfast.'), |
| (2, 'Chinchillas and kittens are cute.'), |
| (3, 'My sister adopted two kittens yesterday'), |
| (4, 'Look at this cute hamster munching on a piece of broccoli'); |
| </pre></li> |
| <li>Add a new column containing the words (lower-cased) in a text array <pre class="example"> |
| ALTER TABLE documents DROP COLUMN IF EXISTS words; |
| ALTER TABLE documents ADD COLUMN words TEXT[]; |
| UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), E'[\\s+\\.]'); |
| </pre></li> |
| <li>Compute the frequency of each word in each document <pre class="example"> |
| DROP TABLE IF EXISTS documents_tf; |
| SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf'); |
| SELECT * FROM documents_tf order by docid; |
| </pre> <pre class="result"> |
| docid | word | count |
| -------+------------+------- |
| 1 | ate | 1 |
| 1 | like | 1 |
| 1 | breakfast | 1 |
| 1 | to | 1 |
| 1 | broccoli | 1 |
| 1 | spinach | 1 |
| 1 | i | 2 |
| 1 | and | 2 |
| 1 | a | 1 |
| 1 | | 2 |
| 1 | smoothie | 1 |
| 1 | eat | 1 |
| 1 | banana | 2 |
| 1 | for | 1 |
| 2 | cute | 1 |
| 2 | are | 1 |
| 2 | kitten | 1 |
| 2 | and | 1 |
| 2 | chinchilla | 1 |
| 3 | kitten | 1 |
| 3 | my | 1 |
| 3 | a | 1 |
| 3 | sister | 1 |
| 3 | adopted | 1 |
| 3 | yesterday | 1 |
| 4 | at | 1 |
| 4 | of | 1 |
| 4 | piece | 1 |
| 4 | this | 1 |
| 4 | a | 1 |
| 4 | broccoli | 1 |
| 4 | hamster | 1 |
| 4 | munching | 1 |
| 4 | cute | 1 |
| 4 | look | 1 |
| (35 rows) |
| </pre></li> |
| <li>We can also create a vocabulary of the words and store a wordid in the output table instead of the actual word. <pre class="example"> |
| DROP TABLE IF EXISTS documents_tf; |
| DROP TABLE IF EXISTS documents_tf_vocabulary; |
| SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf', TRUE); |
| -- Output with wordid instead of the actual words |
| SELECT * FROM documents_tf order by docid; |
| </pre> <pre class="result"> |
| docid | wordid | count |
| -------+--------+------- |
| 1 | 0 | 2 |
| 1 | 1 | 1 |
| 1 | 3 | 2 |
| 1 | 6 | 1 |
| 1 | 7 | 2 |
| 1 | 8 | 1 |
| 1 | 9 | 1 |
| 1 | 12 | 1 |
| 1 | 13 | 1 |
| 1 | 15 | 2 |
| 1 | 17 | 1 |
| 1 | 24 | 1 |
| 1 | 25 | 1 |
| 1 | 27 | 1 |
| 2 | 16 | 1 |
| 2 | 3 | 1 |
| 2 | 4 | 1 |
| 2 | 10 | 1 |
| 2 | 11 | 1 |
| 3 | 1 | 1 |
| 3 | 16 | 1 |
| 3 | 28 | 1 |
| 3 | 23 | 1 |
| 3 | 2 | 1 |
| 3 | 20 | 1 |
| 4 | 9 | 1 |
| 4 | 11 | 1 |
| 4 | 22 | 1 |
| 4 | 14 | 1 |
| 4 | 26 | 1 |
| 4 | 1 | 1 |
| 4 | 5 | 1 |
| 4 | 18 | 1 |
| 4 | 19 | 1 |
| 4 | 21 | 1 |
| (35 rows) |
| </pre> <pre class="example"> |
| -- Vocabulary |
| SELECT * FROM documents_tf_vocabulary order by wordid; |
| </pre> <pre class="result"> |
| wordid | word |
| --------+------------ |
| 0 | |
| 1 | a |
| 2 | adopted |
| 3 | and |
| 4 | are |
| 5 | at |
| 6 | ate |
| 7 | banana |
| 8 | breakfast |
| 9 | broccoli |
| 10 | chinchilla |
| 11 | cute |
| 12 | eat |
| 13 | for |
| 14 | hamster |
| 15 | i |
| 16 | kitten |
| 17 | like |
| 18 | look |
| 19 | munching |
| 20 | my |
| 21 | of |
| 22 | piece |
| 23 | sister |
| 24 | smoothie |
| 25 | spinach |
| 26 | this |
| 27 | to |
| 28 | yesterday |
| (29 rows) |
| </pre></li> |
| </ol> |
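| <p>When <code>compute_vocab</code> is TRUE, the words can be recovered by joining the output table to the vocabulary table on <code>wordid</code>. The query below is a sketch that assumes the <code>documents_tf</code> and <code>documents_tf_vocabulary</code> tables from the last example exist. The <code>tf_normalized</code> column is a hypothetical addition that divides each count by the total number of tokens in the document, since the function itself does not normalize.</p> |
| <pre class="example"> |
| -- Map each wordid back to its word and add a length-normalized frequency. |
| SELECT tf.docid, |
|        v.word, |
|        tf.count, |
|        tf.count::FLOAT8 / sum(tf.count::FLOAT8) OVER (PARTITION BY tf.docid) AS tf_normalized |
| FROM documents_tf AS tf |
| JOIN documents_tf_vocabulary AS v ON v.wordid = tf.wordid |
| ORDER BY tf.docid, v.word; |
| </pre> |
| <p>The denominator counts every token in the word array, including any empty strings produced by the split in the second example.</p> |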
| <p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd></dd></dl> |
| <p>File <a class="el" href="text__utilities_8sql__in.html" title="SQL functions for carrying out routine text operations. ">text_utilities.sql_in</a> documents the SQL functions. File <a class="el" href="utilities_8sql__in.html" title="SQL functions for carrying out routine tasks. ">utilities.sql_in</a> documents the utility functions for DB administration. </p> |
| </div><!-- contents --> |
| </div><!-- doc-content --> |
| <!-- start footer part --> |
| <div id="nav-path" class="navpath"><!-- id is needed for treeview function! --> |
| <ul> |
| <li class="footer">Generated on Mon Jul 27 2015 20:37:45 for MADlib by |
| <a href="http://www.doxygen.org/index.html"> |
| <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.10 </li> |
| </ul> |
| </div> |
| </body> |
| </html> |