| <!-- HTML header for doxygen 1.8.4--> |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| <html xmlns="http://www.w3.org/1999/xhtml"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/> |
| <meta http-equiv="X-UA-Compatible" content="IE=9"/> |
| <meta name="generator" content="Doxygen 1.8.4"/> |
| <meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/> |
| <title>MADlib: Conditional Random Field</title> |
| <link href="tabs.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="jquery.js"></script> |
| <script type="text/javascript" src="dynsections.js"></script> |
| <link href="navtree.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="resize.js"></script> |
| <script type="text/javascript" src="navtree.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(initResizable); |
| $(window).load(resizeHeight); |
| </script> |
| <link href="search/search.css" rel="stylesheet" type="text/css"/> |
| <script type="text/javascript" src="search/search.js"></script> |
| <script type="text/javascript"> |
| $(document).ready(function() { searchBox.OnSelectItem(0); }); |
| </script> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"], |
| jax: ["input/TeX","output/HTML-CSS"], |
| }); |
| </script><script src="../mathjax/MathJax.js"></script> |
| <link href="doxygen.css" rel="stylesheet" type="text/css" /> |
| <link href="madlib_extra.css" rel="stylesheet" type="text/css"/> |
| <!-- google analytics --> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| ga('create', 'UA-45382226-1', 'auto'); |
| ga('send', 'pageview'); |
| </script> |
| </head> |
| <body> |
| <div id="top"><!-- do not remove this div, it is closed by doxygen! --> |
| <div id="titlearea"> |
| <table cellspacing="0" cellpadding="0"> |
| <tbody> |
| <tr style="height: 56px;"> |
| <td style="padding-left: 0.5em;"> |
| <div id="projectname">MADlib |
|  <span id="projectnumber">1.4.1</span> |
| </div> |
| <div id="projectbrief">User Documentation</div> |
| </td> |
| <td> <div id="MSearchBox" class="MSearchBoxInactive"> |
| <span class="left"> |
| <img id="MSearchSelect" src="search/mag_sel.png" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| alt=""/> |
| <input type="text" id="MSearchField" value="Search" accesskey="S" |
| onfocus="searchBox.OnSearchFieldFocus(true)" |
| onblur="searchBox.OnSearchFieldFocus(false)" |
| onkeyup="searchBox.OnSearchFieldChange(event)"/> |
| </span><span class="right"> |
| <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a> |
| </span> |
| </div> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <!-- end header part --> |
| <!-- Generated by Doxygen 1.8.4 --> |
| <script type="text/javascript"> |
| var searchBox = new SearchBox("searchBox", "search",false,'Search'); |
| </script> |
| </div><!-- top --> |
| <div id="side-nav" class="ui-resizable side-nav-resizable"> |
| <div id="nav-tree"> |
| <div id="nav-tree-contents"> |
| <div id="nav-sync" class="sync"></div> |
| </div> |
| </div> |
| <div id="splitbar" style="-moz-user-select:none;" |
| class="ui-resizable-handle"> |
| </div> |
| </div> |
| <script type="text/javascript"> |
| $(document).ready(function(){initNavTree('group__grp__crf.html','');}); |
| </script> |
| <div id="doc-content"> |
| <!-- window showing the filter options --> |
| <div id="MSearchSelectWindow" |
| onmouseover="return searchBox.OnSearchSelectShow()" |
| onmouseout="return searchBox.OnSearchSelectHide()" |
| onkeydown="return searchBox.OnSearchSelectKey(event)"> |
| <a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(0)"><span class="SelectionMark"> </span>All</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(1)"><span class="SelectionMark"> </span>Files</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(2)"><span class="SelectionMark"> </span>Functions</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(3)"><span class="SelectionMark"> </span>Variables</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(4)"><span class="SelectionMark"> </span>Groups</a></div> |
| |
| <!-- iframe showing the search results (closed by default) --> |
| <div id="MSearchResultsWindow"> |
| <iframe src="javascript:void(0)" frameborder="0" |
| name="MSearchResults" id="MSearchResults"> |
| </iframe> |
| </div> |
| |
| <div class="header"> |
| <div class="headertitle"> |
| <div class="title">Conditional Random Field<div class="ingroups"><a class="el" href="group__grp__early__stage.html">Early Stage Development</a></div></div> </div> |
| </div><!--header--> |
| <div class="contents"> |
<div class="toc"><b>Contents</b>
| <ul> |
| <li> |
| <a href="#train">Training Function</a> </li> |
| <li> |
| <a href="#usage">Using CRF</a> </li> |
| <li> |
| <a href="#input">Input</a> </li> |
| <li> |
| <a href="#examples">Examples</a> </li> |
| <li> |
| <a href="#background">Technical Background</a> </li> |
| <li> |
| <a href="#literature">Literature</a> </li> |
| <li> |
| <a href="#related">Related Topics</a> </li> |
| </ul> |
</div><dl class="section warning"><dt>Warning</dt><dd><em> This MADlib method is still in early stage development. There may be some issues that will be addressed in a future version. The interface and implementation are subject to change. </em></dd></dl>
| <p>A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special type of CRF that assumes the current state depends only on the previous state.</p> |
| <p>Feature extraction modules are provided for text-analysis tasks such as part-of-speech (POS) tagging and named-entity resolution (NER). Currently, six feature types are implemented:</p> |
| <ul> |
<li>Edge Feature: a transition feature that encodes the transition from the current label to the next label.</li>
| <li>Start Feature: fired when the current token is the first token in a sequence.</li> |
| <li>End Feature: fired when the current token is the last token in a sequence.</li> |
| <li>Word Feature: fired when the current token is observed in the trained dictionary.</li> |
<li>Unknown Feature: fired when the current token is observed in the trained dictionary fewer than a certain number of times (default 1).</li>
| <li>Regex Feature: fired when the current token can be matched by a regular expression.</li> |
| </ul> |
| <p>A Viterbi implementation is also provided to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).</p> |
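<p>The Viterbi decoding step can be sketched outside the database as follows. This is an illustrative Python sketch of the algorithm, not MADlib's in-database implementation; <code>emit</code> and <code>trans</code> are assumed toy log-space scores (single-state feature totals per position and edge feature weights, respectively).</p>

```python
# Illustrative Viterbi decoding for a linear-chain CRF in plain Python.
# Not MADlib's implementation; 'emit' and 'trans' are assumed toy inputs:
#   emit[i][k]  = total single-state feature score for label k at position i
#   trans[j][k] = edge feature score for the transition j -> k
import math

def log_partition(emit, trans):
    # Forward algorithm in log space: computes log Z_lambda(x).
    K = len(emit[0])
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            math.log(sum(math.exp(alpha[j] + trans[j][k]) for j in range(K)))
            + emit[i][k]
            for k in range(K)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

def viterbi(emit, trans):
    # Returns the best label sequence and Pr(best sequence | x).
    n, K = len(emit), len(emit[0])
    delta = list(emit[0])  # best score of any path ending at label k
    backptr = []
    for i in range(1, n):
        prev, delta, ptr = delta, [], []
        for k in range(K):
            j = max(range(K), key=lambda j: prev[j] + trans[j][k])
            delta.append(prev[j] + trans[j][k] + emit[i][k])
            ptr.append(j)
        backptr.append(ptr)
    k = max(range(K), key=lambda k: delta[k])
    path = [k]
    for ptr in reversed(backptr):
        k = ptr[k]
        path.append(k)
    path.reverse()
    prob = math.exp(max(delta) - log_partition(emit, trans))
    return path, prob
```

<p>The returned probability is exactly the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \) described above: the best path's score minus the log partition function, exponentiated.</p>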
| <p><a class="anchor" id="train"></a></p> |
<dl class="section user"><dt>Training Function</dt><dd>Compute the feature weights and the number of training iterations:<br/>
| </dd></dl> |
| <pre class="syntax"> |
| lincrf( source, |
| sparse_R, |
| dense_M, |
| sparse_M, |
| featureSize, |
| tagSize, |
| featureset, |
| crf_feature, |
| maxNumIterations |
| ) |
| </pre><p> <b>Arguments</b> </p> |
| <dl class="arglist"> |
| <dt>source </dt> |
| <dd>Name of the source relation containing the training data </dd> |
| <dt>sparse_R </dt> |
| <dd>Name of the sparse single state feature column (of type DOUBLE PRECISION[]) </dd> |
| <dt>dense_M </dt> |
| <dd>Name of the dense two state feature column (of type DOUBLE PRECISION[]) </dd> |
| <dt>sparse_M </dt> |
| <dd>Name of the sparse two state feature column (of type DOUBLE PRECISION[]) </dd> |
| <dt>featureSize </dt> |
<dd>Name of the feature size column (of type DOUBLE PRECISION) </dd>
| <dt>tagSize </dt> |
| <dd>The number of tags in the tag set </dd> |
| <dt>featureset </dt> |
| <dd>The unique feature set </dd> |
| <dt>crf_feature </dt> |
<dd>Name of the output feature table </dd>
| <dt>maxNumIterations </dt> |
| <dd>The maximum number of iterations </dd> |
| </dl> |
| <p>The features and weights are stored in the table named by <em>crf_feature</em>. This function returns a composite value containing the following columns: </p> |
| <table class="output"> |
| <tr> |
| <th>coef </th><td>FLOAT8[]. Array of coefficients </td></tr> |
| <tr> |
| <th>log_likelihood </th><td>FLOAT8. Log-likelihood </td></tr> |
| <tr> |
| <th>num_iterations </th><td>INTEGER. The number of iterations before the algorithm terminated </td></tr> |
| </table> |
| <p><a class="anchor" id="usage"></a></p> |
| <dl class="section user"><dt>Using CRF</dt><dd></dd></dl> |
| <p>Generate text features, calculate their weights, and output the best label sequence for test data:<br/> |
| </p> |
| <ol type="1"> |
| <li>Create tables to store the input data, intermediate data, and output data. Also import the training data to the database. <pre> |
| SELECT madlib.crf_train_data( '<em>/path/to/data</em>'); |
| </pre></li> |
| <li>Generate text analytics features for the training data. <pre>SELECT madlib.crf_train_fgen( |
| '<em>segmenttbl</em>', |
| '<em>regextbl</em>', |
| '<em>dictionary</em>', |
| '<em>featuretbl</em>', |
| '<em>featureset</em>');</pre></li> |
| <li>Use linear-chain CRF for training. <pre>SELECT madlib.lincrf( |
| '<em>source</em>', |
| '<em>sparse_r</em>', |
| '<em>dense_m</em>', |
| '<em>sparse_m</em>', |
| '<em>f_size</em>', |
| <em>tag_size</em>, |
| '<em>feature_set</em>', |
| '<em>featureWeights</em>', |
| '<em>maxNumIterations</em>');</pre></li> |
<li>Import the CRF model into the database. Also load the CRF testing data into the database. <pre>SELECT madlib.crf_test_data(
| '<em>/path/to/data</em>');</pre></li> |
| <li>Generate text analytics features for the testing data. <pre>SELECT madlib.crf_test_fgen( |
| '<em>segmenttbl</em>', |
| '<em>dictionary</em>', |
| '<em>labeltbl</em>', |
| '<em>regextbl</em>', |
| '<em>featuretbl</em>', |
| '<em>viterbi_mtbl</em>', |
'<em>viterbi_rtbl</em>');</pre> 'viterbi_mtbl' and 'viterbi_rtbl' are the names of tables created by the feature generation module (i.e., they are NOT empty tables supplied by the user).</li>
| <li>Run the Viterbi function to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \). <pre>SELECT madlib.vcrf_label( |
| '<em>segmenttbl</em>', |
| '<em>viterbi_mtbl</em>', |
| '<em>viterbi_rtbl</em>', |
| '<em>labeltbl</em>', |
| '<em>resulttbl</em>');</pre></li> |
| </ol> |
<p><a class="anchor" id="input"></a></p>
| <dl class="section user"><dt>Input</dt><dd><ul> |
| <li>User-provided input:<br/> |
The user is expected to provide at least the label table, the regular expression table, and the segment table: <pre>{TABLE|VIEW} <em>labelTableName</em> (
| ... |
| <em>id</em> INTEGER, |
| <em>label</em> TEXT, |
| ... |
| )</pre> where <em>id</em> is a unique ID for the label and <em>label</em> is the label name. <pre>{TABLE|VIEW} <em>regexTableName</em> ( |
| ... |
| <em>pattern</em> TEXT, |
| <em>name</em> TEXT, |
| ... |
| )</pre> where <em>pattern</em> is a regular expression pattern (e.g. '^.+ing$') and <em>name</em> is a name for the regular expression pattern (e.g. 'endsWithIng'). <pre>{TABLE|VIEW} <em>segmentTableName</em> ( |
| ... |
| <em>start_pos</em> INTEGER, |
| <em>doc_id</em> INTEGER, |
| <em>seg_text</em> TEXT, |
| <em>label</em> INTEGER, |
| <em>max_pos</em> INTEGER, |
| ... |
| )</pre> where <em>start_pos</em> is the position of the word in the sequence, <em>doc_id</em> is a unique ID for the sequence, <em>seg_text</em> is the word, <em>label</em> is the label for the word, and <em>max_pos</em> is the length of the sequence.</li> |
| <li>Training (<a class="el" href="crf_8sql__in.html#a836f0eefb7b6fbc37fa1c382b2a95127">lincrf</a>) input:<br/> |
| The feature table used for training is expected to be of the following form (this table can also be generated by <a class="el" href="crf__feature__gen_8sql__in.html#a80e9192613662ba6dfd3ac90057205ee">crf_train_fgen</a>):<br/> |
| <pre>{TABLE|VIEW} <em>featureTableName</em> ( |
| ... |
| <em>doc_id</em> INTEGER, |
| <em>f_size</em> INTEGER, |
| <em>sparse_r</em> FLOAT8[], |
| <em>dense_m</em> FLOAT8[], |
| <em>sparse_m</em> FLOAT8[], |
| ... |
| )</pre> where<ul> |
| <li><em>doc_id</em> is a unique ID for the sequence</li> |
| <li><em>f_size</em> is the number of features</li> |
<li><em>sparse_r</em> is the array union of (previous label, label, feature index, start position, training existence indicator) of individual single-state features (e.g. word features, regex features) ordered by their start position</li>
<li><em>dense_m</em> is the array union of (previous label, label, feature index, start position, training existence indicator) of edge features ordered by start position</li>
<li><em>sparse_m</em> is the array union of (feature index, previous label, label) of edge features ordered by feature index. Edge features are split into dense_m and sparse_m for performance reasons.</li>
| </ul> |
| </li> |
| </ul> |
| </dd></dl> |
<p>The set of features used for training is expected to be of the following form (this table can also be generated by <a class="el" href="crf__feature__gen_8sql__in.html#a80e9192613662ba6dfd3ac90057205ee">crf_train_fgen</a>):<br/>
| </p> |
| <pre>{TABLE|VIEW} <em>featureSetName</em> ( |
| ... |
| <em>f_index</em> INTEGER, |
| <em>f_name</em> TEXT, |
| <em>feature_labels</em> INTEGER[], |
| ... |
| )</pre><p> where</p> |
| <ul> |
| <li><em>f_index</em> is a unique ID for the feature</li> |
| <li><em>f_name</em> is the feature name</li> |
| <li><em>feature_labels</em> is an array representing {previous label, label}.</li> |
| </ul> |
| <p>The empty feature weight table (which will be populated after training) is expected to be of the following form: </p> |
| <pre>{TABLE|VIEW} <em>featureWeightsName</em> ( |
| ... |
| <em>f_index</em> INTEGER, |
| <em>f_name</em> TEXT, |
| <em>previous_label</em> INTEGER, |
| <em>label</em> INTEGER, |
| <em>weight</em> FLOAT8, |
| ... |
| )</pre><p><a class="anchor" id="examples"></a></p> |
| <dl class="section user"><dt>Examples</dt><dd>This example uses a trivial training and test data set.<ol type="1"> |
| <li>Load the label table, the regular expressions table, and the training segment table: <pre class="example"> |
| SELECT * FROM crf_label; |
| </pre> Result: <pre class="result"> |
| id | label |
|  ---+------- |
| 1 | CD |
| 13 | NNP |
| 15 | PDT |
| 17 | PRP |
| 29 | VBN |
| 31 | VBZ |
| 33 | WP |
| 35 | WRB |
| ... |
| </pre> The regular expressions table: <pre class="example"> |
| SELECT * from crf_regex; |
| </pre> <pre class="result"> |
| pattern | name |
|  --------------+---------------------- |
| ^.+ing$ | endsWithIng |
| ^[A-Z][a-z]+$ | InitCapital |
| ^[A-Z]+$ | isAllCapital |
| ^.*[0-9]+.*$ | containsDigit |
| ... |
| </pre> The training segment table: <pre class="example"> |
| SELECT * from train_segmenttbl; |
| </pre> <pre class="result"> |
| start_pos | doc_id | seg_text | label | max_pos |
|  ----------+--------+------------+-------+--------- |
| 8 | 1 | alliance | 11 | 26 |
| 10 | 1 | Ford | 13 | 26 |
| 12 | 1 | that | 5 | 26 |
| 24 | 1 | likely | 6 | 26 |
| 26 | 1 | . | 43 | 26 |
| 8 | 2 | interest | 11 | 10 |
| 10 | 2 | . | 43 | 10 |
| 9 | 1 | after | 5 | 26 |
| 11 | 1 | concluded | 27 | 26 |
| 23 | 1 | the | 2 | 26 |
| 25 | 1 | return | 11 | 26 |
| 9 | 2 | later | 19 | 10 |
| ... |
| </pre></li> |
| <li>Create the (empty) dictionary table, feature table, and feature set: <pre class="example"> |
| CREATE TABLE crf_dictionary(token text,total integer); |
| CREATE TABLE train_featuretbl(doc_id integer,f_size FLOAT8,sparse_r FLOAT8[],dense_m FLOAT8[],sparse_m FLOAT8[]); |
| CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]); |
| </pre></li> |
| <li>Generate the training features: <pre class="example"> |
| SELECT crf_train_fgen( 'train_segmenttbl', |
| 'crf_regex', |
| 'crf_dictionary', |
| 'train_featuretbl', |
| 'train_featureset' |
| ); |
| SELECT * from crf_dictionary; |
| </pre> Result: <pre class="result"> |
| token | total |
|  -----------+------- |
| talks | 1 |
| that | 1 |
| would | 1 |
| alliance | 1 |
| Saab | 2 |
| cost | 1 |
| after | 1 |
| operations | 1 |
| ... |
| </pre> <pre class="example"> |
| SELECT * from train_featuretbl; |
| </pre> Result: <pre class="result"> |
| doc_id | f_size | sparse_r | dense_m | sparse_m |
|  -------+--------+-------------------------------+---------------------------------+----------------------- |
| 2 | 87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...} |
| 1 | 87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...} |
| </pre> <pre class="example"> |
| SELECT * from train_featureset; |
| </pre> <pre class="result"> |
| f_index | f_name | feature |
|  --------+---------------+--------- |
| 1 | R_endsWithED | {-1,29} |
| 13 | W_outweigh | {-1,26} |
| 29 | U | {-1,5} |
| 31 | U | {-1,29} |
| 33 | U | {-1,12} |
| 35 | W_a | {-1,2} |
| 37 | W_possible | {-1,6} |
| 15 | W_signaled | {-1,29} |
| 17 | End. | {-1,43} |
| 49 | W_'s | {-1,16} |
| 63 | W_acquire | {-1,26} |
| 51 | E. | {26,2} |
| 69 | E. | {29,17} |
| 71 | E. | {2,11} |
| 83 | W_the | {-1,2} |
| 85 | E. | {16,11} |
| 4 | W_return | {-1,11} |
| ... |
| </pre></li> |
| <li>Create the (empty) feature weight table: <pre class="example"> |
| CREATE TABLE train_crf_feature (id integer,name text,prev_label_id integer,label_id integer,weight float); |
| </pre></li> |
| <li>Train using linear CRF: <pre class="example"> |
| SELECT lincrf( 'train_featuretbl', |
| 'sparse_r', |
| 'dense_m', |
| 'sparse_m', |
| 'f_size',45, |
| 'train_featureset', |
| 'train_crf_feature', |
| 20 |
| ); |
| </pre> <pre class="result"> |
| lincrf |
|  ------- |
| 20 |
| </pre> View the feature weight table. <pre class="example"> |
| SELECT * from train_crf_feature; |
| </pre> Result: <pre class="result"> |
| id | name | prev_label_id | label_id | weight |
|  ---+---------------+---------------+----------+------------------- |
| 1 | R_endsWithED | -1 | 29 | 1.54128249293937 |
| 13 | W_outweigh | -1 | 26 | 1.70691232223653 |
| 29 | U | -1 | 5 | 1.40708515869008 |
| 31 | U | -1 | 29 | 0.830356200936407 |
| 33 | U | -1 | 12 | 0.769587378281239 |
| 35 | W_a | -1 | 2 | 2.68470625883726 |
| 37 | W_possible | -1 | 6 | 3.41773107604468 |
| 15 | W_signaled | -1 | 29 | 1.68187039165771 |
| 17 | End. | -1 | 43 | 3.07687845517082 |
| 49 | W_'s | -1 | 16 | 2.61430312229883 |
| 63 | W_acquire | -1 | 26 | 1.67247047385797 |
| 51 | E. | 26 | 2 | 3.0114240119435 |
| 69 | E. | 29 | 17 | 2.82385531733866 |
| 71 | E. | 2 | 11 | 3.00970493772732 |
| 83 | W_the | -1 | 2 | 2.58742315259326 |
| ... |
| </pre></li> |
<li>To find the best labels for a test set using the trained linear CRF model, repeat steps 1 and 2 to generate the test features, except that instead of creating a new dictionary, you reuse the dictionary generated from the training set. <pre class="example">
| SELECT * from test_segmenttbl; |
| </pre> Result: <pre class="result"> |
| start_pos | doc_id | seg_text | max_pos |
|  ----------+--------+-------------+--------- |
| 1 | 1 | collapse | 22 |
| 13 | 1 | , | 22 |
| 15 | 1 | is | 22 |
| 17 | 1 | a | 22 |
| 4 | 1 | speculation | 22 |
| 6 | 1 | Ford | 22 |
| 18 | 1 | defensive | 22 |
| 20 | 1 | with | 22 |
| ... |
| </pre> <pre class="example"> |
| SELECT crf_test_fgen( 'test_segmenttbl', |
| 'crf_dictionary', |
| 'crf_label', |
| 'crf_regex', |
| 'train_crf_feature', |
| 'viterbi_mtbl', |
| 'viterbi_rtbl' |
| ); |
| </pre></li> |
| <li>Calculate the best label sequence and save in the table <code>extracted_best_labels</code>. <pre class="example"> |
| SELECT vcrf_label( 'test_segmenttbl', |
| 'viterbi_mtbl', |
| 'viterbi_rtbl', |
| 'crf_label', |
| 'extracted_best_labels' |
| ); |
| </pre> View the best labels. <pre class="example"> |
| SELECT * FROM extracted_best_labels; |
| </pre> Result: <pre class="result"> |
| doc_id | start_pos | seg_text | label | id | prob |
|  -------+-----------+-------------+-------+----+------- |
| 1 | 2 | Friday | NNP | 14 | 9e-06 |
| 1 | 6 | Ford | NNP | 14 | 9e-06 |
| 1 | 12 | Jaguar | NNP | 14 | 9e-06 |
| 1 | 3 | prompted | VBD | 28 | 9e-06 |
| 1 | 8 | intensify | NN | 12 | 9e-06 |
| 1 | 14 | which | NN | 12 | 9e-06 |
| 1 | 18 | defensive | NN | 12 | 9e-06 |
| 1 | 21 | GM | NN | 12 | 9e-06 |
| 1 | 22 | . | . | 44 | 9e-06 |
| 1 | 1 | collapse | CC | 1 | 9e-06 |
| 1 | 7 | would | POS | 17 | 9e-06 |
| ... |
| </pre></li> |
| </ol> |
| </dd></dl> |
| <p><a class="anchor" id="background"></a></p> |
| <dl class="section user"><dt>Technical Background</dt><dd></dd></dl> |
| <p>Specifically, a linear-chain CRF is a distribution defined by </p> |
| <p class="formulaDsp"> |
| \[ p_\lambda(\boldsymbol y | \boldsymbol x) = \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)} \,. \] |
| </p> |
| <p>where</p> |
| <ul> |
| <li>\( F_m(\boldsymbol x, \boldsymbol y) = \sum_{i=1}^n f_m(y_i,y_{i-1},x_i) \) is a global feature function that is a sum along a sequence \( \boldsymbol x \) of length \( n \)</li> |
| <li>\( f_m(y_i,y_{i-1},x_i) \) is a local feature function dependent on the current token label \( y_i \), the previous token label \( y_{i-1} \), and the observation \( x_i \)</li> |
| <li>\( \lambda_m \) is the corresponding feature weight</li> |
| <li>\( Z_\lambda(\boldsymbol x) \) is an instance-specific normalizer <p class="formulaDsp"> |
| \[ Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')} \] |
| </p> |
| </li> |
| </ul> |
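<p>To make the definition concrete, the distribution can be evaluated by brute force on a toy instance, folding the weighted feature sums \( \lambda_m F_m \) into per-position and per-transition log-space scores. This is an assumed toy setup for illustration, not MADlib code; it is only tractable because every one of the \( K^n \) label sequences is enumerated.</p>

```python
# Brute-force evaluation of p_lambda(y|x) for a toy linear-chain CRF.
# The weighted feature sums lambda_m * F_m are folded into log-space
# scores: emit[i][k] for single-state features, trans[j][k] for edge
# features. Assumed toy setup, for illustration only.
import itertools, math

def seq_score(y, emit, trans):
    # sum_m lambda_m F_m(x, y) for the label sequence y
    s = sum(emit[i][y[i]] for i in range(len(y)))
    return s + sum(trans[y[i - 1]][y[i]] for i in range(1, len(y)))

def crf_distribution(emit, trans):
    # Enumerate all K^n label sequences; Z_lambda(x) normalizes the scores.
    n, K = len(emit), len(emit[0])
    seqs = list(itertools.product(range(K), repeat=n))
    Z = sum(math.exp(seq_score(y, emit, trans)) for y in seqs)
    return {y: math.exp(seq_score(y, emit, trans)) / Z for y in seqs}
```

<p>By construction the probabilities sum to one; the forward-backward machinery described below exists precisely to avoid this exponential enumeration.</p>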
| <p>A linear-chain CRF estimates the weights \( \lambda_m \) by maximizing the log-likelihood of a given training set \( T=\{(x_k,y_k)\}_{k=1}^N \).</p> |
| <p>The log-likelihood is defined as </p> |
| <p class="formulaDsp"> |
| \[ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] \] |
| </p> |
| <p>and the zero of its gradient </p> |
| <p class="formulaDsp"> |
| \[ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] \] |
| </p> |
<p>is found, since the maximum of the likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses L-BFGS, a limited-memory variant of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method for unconstrained optimization.</p>
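<p>The gradient identity, empirical feature count minus model expectation, can be checked numerically on a toy instance by brute-force enumeration. This sketch uses assumed toy scores and a single edge feature and is not part of MADlib; it compares the analytic gradient against a finite difference of the log-likelihood.</p>

```python
# Check that d log p(y|x) / d trans[a][b] equals F(x,y) - E[F(x,Y)] for
# the (a -> b) edge feature, by comparing against a finite difference.
# Toy brute-force setup; assumes small K and n.
import itertools, math

def seq_score(y, emit, trans):
    s = sum(emit[i][y[i]] for i in range(len(y)))
    return s + sum(trans[y[i - 1]][y[i]] for i in range(1, len(y)))

def log_p(y, emit, trans):
    K = len(emit[0])
    seqs = itertools.product(range(K), repeat=len(y))
    Z = sum(math.exp(seq_score(s, emit, trans)) for s in seqs)
    return seq_score(y, emit, trans) - math.log(Z)

def gradient_check(y, emit, trans, a, b, h=1e-6):
    K = len(emit[0])
    seqs = list(itertools.product(range(K), repeat=len(y)))
    # F(x, y): number of times the (a -> b) transition fires in y
    count = lambda s: sum(
        1 for i in range(1, len(s)) if (s[i - 1], s[i]) == (a, b)
    )
    Z = sum(math.exp(seq_score(s, emit, trans)) for s in seqs)
    expected_F = sum(
        math.exp(seq_score(s, emit, trans)) / Z * count(s) for s in seqs
    )
    analytic = count(y) - expected_F  # F(x,y) - E[F(x,Y)]
    t0 = trans[a][b]
    trans[a][b] = t0 + h
    hi = log_p(y, emit, trans)
    trans[a][b] = t0 - h
    lo = log_p(y, emit, trans)
    trans[a][b] = t0  # restore the weight
    return analytic, (hi - lo) / (2 * h)
```

<p>The two values agree to within finite-difference error, which is what makes gradient-based optimizers such as L-BFGS applicable.</p>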
| <p>\(E_{p_\lambda(Y|x)}[F(x,Y)]\) is found by using a variant of the forward-backward algorithm: </p> |
| <p class="formulaDsp"> |
| \[ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y) = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)} \] |
| </p> |
<p class="formulaDsp">
\[ Z_\lambda(x) = \alpha_n \cdot \boldsymbol{1}^T \]
</p>
| <p> where \(\alpha_i\) and \( \beta_i\) are the forward and backward state cost vectors defined by </p> |
| <p class="formulaDsp"> |
\[ \alpha_i = \begin{cases} \alpha_{i-1}M_i, & 0 < i \leq n \\ 1, & i = 0 \end{cases} \]
| </p> |
| <p class="formulaDsp"> |
\[ \beta_i^T = \begin{cases} M_{i+1}\beta_{i+1}^T, & 1 \leq i < n \\ 1, & i = n \end{cases} \]
| </p> |
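<p>The forward recursion can be sketched directly from the matrix form above, with \( M_i[j][k] = \exp(\text{trans}[j][k] + \text{emit}[i][k]) \). The toy transition and emission scores are assumptions for illustration, and the recursion is cross-checked against brute-force enumeration of all label sequences.</p>

```python
# Forward recursion alpha_i = alpha_{i-1} M_i for a toy linear-chain CRF,
# where M_i[j][k] = exp(trans[j][k] + emit[i][k]); Z_lambda(x) = alpha_n . 1^T.
# Assumed toy scores, for illustration only.
import itertools, math

def partition_forward(emit, trans):
    K = len(emit[0])
    # alpha_1: start scores (position 0 emissions, no incoming transition)
    alpha = [math.exp(emit[0][k]) for k in range(K)]
    for i in range(1, len(emit)):
        alpha = [
            sum(alpha[j] * math.exp(trans[j][k] + emit[i][k]) for j in range(K))
            for k in range(K)
        ]
    return sum(alpha)  # alpha_n . 1^T

def partition_bruteforce(emit, trans):
    # O(K^n) reference: sum exp(score) over every label sequence.
    n, K = len(emit), len(emit[0])
    total = 0.0
    for y in itertools.product(range(K), repeat=n):
        s = sum(emit[i][y[i]] for i in range(n))
        s += sum(trans[y[i - 1]][y[i]] for i in range(1, n))
        total += math.exp(s)
    return total
```

<p>Both functions return the same \( Z_\lambda(x) \), but the forward recursion costs \( O(nK^2) \) rather than \( O(K^n) \), which is what makes training feasible. A production implementation would work in log space to avoid overflow, as the exponentiated scores here can grow quickly.</p>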
| <p>To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior: </p> |
| <p class="formulaDsp"> |
| \[ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2} \] |
| </p> |
| <p class="formulaDsp"> |
| \[ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2} \] |
| </p> |
| <dl class="section user"><dt>Literature</dt><dd>[1] F. Sha, F. Pereira. Shallow Parsing with Conditional Random Fields, <a href="http://www-bcf.usc.edu/~feisha/pubs/shallow03.pdf">http://www-bcf.usc.edu/~feisha/pubs/shallow03.pdf</a></dd></dl> |
| <p>[2] Wikipedia, Conditional Random Field, <a href="http://en.wikipedia.org/wiki/Conditional_random_field">http://en.wikipedia.org/wiki/Conditional_random_field</a></p> |
| <p>[3] A. Jaiswal, S.Tawari, I. Mansuri, K. Mittal, C. Tiwari (2012), CRF, <a href="http://crf.sourceforge.net/">http://crf.sourceforge.net/</a></p> |
| <p>[4] D. Wang, ViterbiCRF, <a href="http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html">http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html</a></p> |
| <p>[5] Wikipedia, Viterbi Algorithm, <a href="http://en.wikipedia.org/wiki/Viterbi_algorithm">http://en.wikipedia.org/wiki/Viterbi_algorithm</a></p> |
| <p>[6] J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782</p> |
| <p>[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, <a href="http://users.eecs.northwestern.edu/~nocedal/lbfgs.html">http://users.eecs.northwestern.edu/~nocedal/lbfgs.html</a></p> |
| <p><a class="anchor" id="related"></a></p> |
| <dl class="section user"><dt>Related Topics</dt><dd></dd></dl> |
<p>Files <a class="el" href="crf_8sql__in.html" title="SQL functions for conditional random field. ">crf.sql_in</a>, <a class="el" href="crf__feature__gen_8sql__in.html" title="SQL function for POS/NER feature extraction. ">crf_feature_gen.sql_in</a>, and <a class="el" href="viterbi_8sql__in.html" title="concatenate a set of input values into arrays to feed into viterbi c function and create a human read...">viterbi.sql_in</a> (documenting the SQL functions) </p>
| </div><!-- contents --> |
| </div><!-- doc-content --> |
| <!-- start footer part --> |
| <div id="nav-path" class="navpath"><!-- id is needed for treeview function! --> |
| <ul> |
| <li class="footer">Generated on Thu Jan 9 2014 20:25:07 for MADlib by |
| <a href="http://www.doxygen.org/index.html"> |
| <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.4 </li> |
| </ul> |
| </div> |
| </body> |
| </html> |