 <!-- HTML header for doxygen 1.8.4-->
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
 <meta http-equiv="X-UA-Compatible" content="IE=9"/>
 <meta name="generator" content="Doxygen 1.8.13"/>
 <meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
 <title>MADlib: Term Frequency</title>
 <link href="tabs.css" rel="stylesheet" type="text/css"/>
 <script type="text/javascript" src="jquery.js"></script>
 <script type="text/javascript" src="dynsections.js"></script>
 <link href="navtree.css" rel="stylesheet" type="text/css"/>
 <script type="text/javascript" src="resize.js"></script>
 <script type="text/javascript" src="navtreedata.js"></script>
 <script type="text/javascript" src="navtree.js"></script>
 <script type="text/javascript">
   $(document).ready(initResizable);
 </script>
 <link href="search/search.css" rel="stylesheet" type="text/css"/>
 <script type="text/javascript" src="search/searchdata.js"></script>
 <script type="text/javascript" src="search/search.js"></script>
 <script type="text/javascript">
   $(document).ready(function() { init_search(); });
 </script>
 <script type="text/x-mathjax-config">
   MathJax.Hub.Config({
     extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
     jax: ["input/TeX","output/HTML-CSS"],
 });
 </script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
 <!-- hack in the navigation tree -->
 <script type="text/javascript" src="eigen_navtree_hacks.js"></script>
 <link href="doxygen.css" rel="stylesheet" type="text/css" />
 <link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
 <!-- google analytics -->
 <script>
   (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
   (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
   m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
   })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
   ga('create', 'UA-45382226-1', 'madlib.apache.org');
   ga('send', 'pageview');
 </script>
 </head>
 <body>
 <div id="top"><!-- do not remove this div, it is closed by doxygen! -->
 <div id="titlearea">
 <table cellspacing="0" cellpadding="0">
  <tbody>
  <tr style="height: 56px;">
  <td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td>
   <td style="padding-left: 0.5em;">
    <div id="projectname">
    <span id="projectnumber">1.18.0</span>
    </div>
    <div id="projectbrief">User Documentation for Apache MADlib</div>
   </td>
    <td>        <div id="MSearchBox" class="MSearchBoxInactive">
         <span class="left">
           <img id="MSearchSelect" src="search/mag_sel.png"
                onmouseover="return searchBox.OnSearchSelectShow()"
                onmouseout="return searchBox.OnSearchSelectHide()"
                alt=""/>
           <input type="text" id="MSearchField" value="Search" accesskey="S"
                onfocus="searchBox.OnSearchFieldFocus(true)"
                onblur="searchBox.OnSearchFieldFocus(false)"
                onkeyup="searchBox.OnSearchFieldChange(event)"/>
           </span><span class="right">
             <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
           </span>
         </div>
 </td>
  </tr>
  </tbody>
 </table>
 </div>
 <!-- end header part -->
 <!-- Generated by Doxygen 1.8.13 -->
 <script type="text/javascript">
 var searchBox = new SearchBox("searchBox", "search",false,'Search');
 </script>
 </div><!-- top -->
 <div id="side-nav" class="ui-resizable side-nav-resizable">
   <div id="nav-tree">
     <div id="nav-tree-contents">
       <div id="nav-sync" class="sync"></div>
     </div>
   </div>
   <div id="splitbar" style="-moz-user-select:none;"
        class="ui-resizable-handle">
   </div>
 </div>
 <script type="text/javascript">
 $(document).ready(function(){initNavTree('group__grp__text__utilities.html','');});
 </script>
 <div id="doc-content">
 <!-- window showing the filter options -->
 <div id="MSearchSelectWindow"
      onmouseover="return searchBox.OnSearchSelectShow()"
      onmouseout="return searchBox.OnSearchSelectHide()"
      onkeydown="return searchBox.OnSearchSelectKey(event)">
 </div>

 <!-- iframe showing the search results (closed by default) -->
 <div id="MSearchResultsWindow">
 <iframe src="javascript:void(0)" frameborder="0"
         name="MSearchResults" id="MSearchResults">
 </iframe>
 </div>

 <div class="header">
   <div class="headertitle">
 <div class="title">Term Frequency<div class="ingroups"><a class="el" href="group__grp__other__functions.html">Utilities</a></div></div>  </div>
 </div><!--header-->
 <div class="contents">
 <div class="toc"><b>Contents</b> <ul>
 <li>
 <a href="#function_syntax">Function Syntax</a> </li>
 <li>
 <a href="#examples">Examples</a> </li>
 <li>
 <a href="#related">Related Topics</a> </li>
 </ul>
 </div><p>Term frequency computes the number of times that a word or term occurs in a document. Term frequency is often used as part of a larger text processing pipeline, which may include operations such as stemming, stop word removal and topic modelling.</p>
 <p><a class="anchor" id="function_syntax"></a></p><dl class="section user"><dt>Function Syntax</dt><dd></dd></dl>
 <pre class="syntax">
     term_frequency(input_table,
                    doc_id_col,
                    word_col,
                    output_table,
                    compute_vocab)
 </pre><p><b>Arguments:</b> </p><dl class="arglist">
 <dt>input_table </dt>
 <dd><p class="startdd">TEXT. The name of the table containing the documents, with one document per row. Each row is in the form &lt;doc_id, word_vector&gt; where <code>doc_id</code> is an id unique to each document, and <code>word_vector</code> is a text array containing the words in the document. The <code>word_vector</code> should contain one entry per occurrence, so a word that occurs several times in the document appears several times in the array. </p>
 <p class="enddd"></p>
 </dd>
 <dt>doc_id_col </dt>
 <dd><p class="startdd">TEXT. The name of the column containing the document id. </p>
 <p class="enddd"></p>
 </dd>
 <dt>word_col </dt>
 <dd><p class="startdd">TEXT. The name of the column containing the vector of words/terms in the document. This column should be of a type that can be cast to TEXT[].</p>
 <p class="enddd"></p>
 </dd>
 <dt>output_table </dt>
 <dd><p class="startdd">TEXT. The name of the table to store the term frequency output. The output table contains the following columns:</p><ul>
 <li><code>doc_id_col:</code> The document id column (its name is the same as the one provided as input).</li>
 <li><code>word:</code> Word/term present in a document. Depending on the value of <code>compute_vocab</code> below, this is either the original word as it appears in <code>word_col</code>, or an id representing the word (in which case the column is named <code>wordid</code>, as in the example output). Note that word ids start from 0, not 1.</li>
 <li><code>count:</code> The number of times this word is found in the document. </li>
 </ul>
 <p class="enddd"></p>
 </dd>
 <dt>compute_vocab </dt>
 <dd>BOOLEAN. (Optional, Default=FALSE) Flag to indicate if a vocabulary table is to be created. If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word in alphabetical order. The table is called <em>output_table</em>_vocabulary (i.e., suffix added to the <em>output_table</em> name) and contains the following columns:<ul>
 <li><code>wordid:</code> An id for each word in alphabetical order.</li>
 <li><code>word:</code> The word/term corresponding to the id.  </li>
 </ul>
 </dd>
 </dl>
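 <p>As a rough illustration of what the function computes when <code>compute_vocab</code> is FALSE, the counts can be reproduced in plain SQL. This is only a sketch, not MADlib's implementation; it assumes a table <code>documents(docid, words TEXT[])</code> like the one built in the Examples below:</p>
 <pre class="example">
 -- Sketch only (assumes documents(docid, words TEXT[]));
 -- not how MADlib implements term_frequency.
 SELECT docid, word, count(*) AS count
 FROM (
     SELECT docid, unnest(words) AS word   -- one row per word occurrence
     FROM documents
 ) AS occurrences
 GROUP BY docid, word
 ORDER BY docid, word;
 </pre>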
 <p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
 <ol type="1">
 <li>First we create a document table with one document per row: <pre class="example">
 DROP TABLE IF EXISTS documents;
 CREATE TABLE documents(docid INT4, contents TEXT);
 INSERT INTO documents VALUES
 (0, 'I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.'),
 (1, 'Chinchillas and kittens are cute.'),
 (2, 'My sister adopted two kittens yesterday.'),
 (3, 'Look at this cute hamster munching on a piece of broccoli.');
 </pre> You can apply stemming, stop word removal and tokenization at this point to prepare the documents for text processing. The tools available depend on your database version. On databases based on more recent versions of PostgreSQL, you can do something like: <pre class="example">
 SELECT tsvector_to_array(to_tsvector('english', contents)) FROM documents;
 </pre> <pre class="result">
                     tsvector_to_array
 +----------------------------------------------------------
  {ate,banana,breakfast,broccoli,eat,like,smoothi,spinach}
  {chinchilla,cute,kitten}
  {adopt,kitten,sister,two,yesterday}
  {broccoli,cute,hamster,look,munch,piec}
 (4 rows)
 </pre> In this example, we assume a database based on an older version of PostgreSQL and just perform basic punctuation removal and tokenization. The array of words is added as a new column to the documents table: <pre class="example">
 ALTER TABLE documents ADD COLUMN words TEXT[];
 UPDATE documents SET words =
     regexp_split_to_array(lower(
     regexp_replace(contents, E'[,.;\']','', 'g')
     ), E'\\s+');  -- split on runs of whitespace
 \x on
 SELECT * FROM documents ORDER BY docid;
 </pre> <pre class="result">
 -[ RECORD 1 ]------------------------------------------------------------------------------------
 docid    | 0
 contents | I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.
 words    | {i,like,to,eat,broccoli,and,bananas,i,ate,a,banana,and,spinach,smoothie,for,breakfast}
 -[ RECORD 2 ]------------------------------------------------------------------------------------
 docid    | 1
 contents | Chinchillas and kittens are cute.
 words    | {chinchillas,and,kittens,are,cute}
 -[ RECORD 3 ]------------------------------------------------------------------------------------
 docid    | 2
 contents | My sister adopted two kittens yesterday.
 words    | {my,sister,adopted,two,kittens,yesterday}
 -[ RECORD 4 ]------------------------------------------------------------------------------------
 docid    | 3
 contents | Look at this cute hamster munching on a piece of broccoli.
 words    | {look,at,this,cute,hamster,munching,on,a,piece,of,broccoli}
 </pre></li>
 <li>Compute the frequency of each word in each document: <pre class="example">
 DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
 SELECT madlib.term_frequency('documents',    -- input table
                              'docid',        -- document id column
                              'words',        -- vector of words in document
                              'documents_tf'  -- output table
                             );
 \x off
 SELECT * FROM documents_tf ORDER BY docid;
 </pre> <pre class="result">
  docid |    word     | count
 -------+-------------+-------
      0 | a           |     1
      0 | breakfast   |     1
      0 | banana      |     1
      0 | and         |     2
      0 | eat         |     1
      0 | smoothie    |     1
      0 | to          |     1
      0 | like        |     1
      0 | broccoli    |     1
      0 | bananas     |     1
      0 | spinach     |     1
      0 | i           |     2
      0 | ate         |     1
      0 | for         |     1
      1 | are         |     1
      1 | cute        |     1
      1 | kittens     |     1
      1 | chinchillas |     1
      1 | and         |     1
      2 | two         |     1
      2 | yesterday   |     1
      2 | kittens     |     1
      2 | sister      |     1
      2 | my          |     1
      2 | adopted     |     1
      3 | this        |     1
      3 | at          |     1
      3 | a           |     1
      3 | broccoli    |     1
      3 | of          |     1
      3 | look        |     1
      3 | hamster     |     1
      3 | on          |     1
      3 | piece       |     1
      3 | cute        |     1
      3 | munching    |     1
 (36 rows)
 </pre></li>
 <li>Next we create a vocabulary of the words and store a wordid in the output table instead of the actual word: <pre class="example">
 DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
 SELECT madlib.term_frequency('documents',    -- input table
                              'docid',        -- document id column
                              'words',        -- vector of words in document
                              'documents_tf', -- output table
                              TRUE            -- compute vocabulary
                             );
 SELECT * FROM documents_tf ORDER BY docid;
 </pre>  <pre class="result">
  docid | wordid | count
 -------+--------+-------
      0 |     17 |     1
      0 |      9 |     1
      0 |     25 |     1
      0 |     12 |     1
      0 |     13 |     1
      0 |     15 |     2
      0 |      0 |     1
      0 |      2 |     2
      0 |     28 |     1
      0 |      5 |     1
      0 |      6 |     1
      0 |      7 |     1
      0 |      8 |     1
      0 |     26 |     1
      1 |     16 |     1
      1 |     11 |     1
      1 |     10 |     1
      1 |      2 |     1
      1 |      3 |     1
      2 |     30 |     1
      2 |      1 |     1
      2 |     16 |     1
      2 |     20 |     1
      2 |     24 |     1
      2 |     29 |     1
      3 |      4 |     1
      3 |     21 |     1
      3 |     22 |     1
      3 |     23 |     1
      3 |      0 |     1
      3 |     11 |     1
      3 |      9 |     1
      3 |     27 |     1
      3 |     14 |     1
      3 |     18 |     1
      3 |     19 |     1
 (36 rows)
 </pre>  Note that the wordid values above start at 0, not 1. The vocabulary table maps each wordid to the actual word: <pre class="example">
 SELECT * FROM documents_tf_vocabulary ORDER BY wordid;
 </pre> <pre class="result">
  wordid |    word
 --------+-------------
       0 | a
       1 | adopted
       2 | and
       3 | are
       4 | at
       5 | ate
       6 | banana
       7 | bananas
       8 | breakfast
       9 | broccoli
      10 | chinchillas
      11 | cute
      12 | eat
      13 | for
      14 | hamster
      15 | i
      16 | kittens
      17 | like
      18 | look
      19 | munching
      20 | my
      21 | of
      22 | on
      23 | piece
      24 | sister
      25 | smoothie
      26 | spinach
      27 | this
      28 | to
      29 | two
      30 | yesterday
 (31 rows)
 </pre></li>
 </ol>
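 <p>To read the word-id output alongside the original words, the two output tables from the last example can be joined on <code>wordid</code>. A minimal sketch using only the tables and columns shown above:</p>
 <pre class="example">
 -- Map each wordid back to its word via the vocabulary table.
 SELECT tf.docid, v.word, tf.count
 FROM documents_tf AS tf
 JOIN documents_tf_vocabulary AS v USING (wordid)
 ORDER BY tf.docid, v.word;
 </pre>
 <p>Assuming the tables above, this should return the same 36 (docid, word, count) combinations as the second example, sorted by document and word.</p>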
 <p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd></dd></dl>
 <p>See <a class="el" href="text__utilities_8sql__in.html" title="SQL functions for carrying out routine text operations. ">text_utilities.sql_in</a> for the term frequency SQL function definition and <a class="el" href="porter__stemmer_8sql__in.html" title="implementation of porter stemmer operations in SQL ">porter_stemmer.sql_in</a> for the stemmer function. </p>
 </div><!-- contents -->
 </div><!-- doc-content -->
 <!-- start footer part -->
 <div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
   <ul>
     <li class="footer">Generated on Wed Mar 31 2021 20:45:50 for MADlib by
     <a href="http://www.doxygen.org/index.html">
     <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
   </ul>
 </div>
 </body>
 </html>