| /* ----------------------------------------------------------------------- *//** |
| * |
| * @file plda.sql_in |
| * |
| * @brief SQL functions for parallel Latent Dirichlet Allocation |
| * @date April 2011 |
| * |
| * @sa For an introduction to Latent Dirichlet Allocation models, see the |
| module description \ref grp_plda. |
| * |
| *//* ------------------------------------------------------------------------*/ |
| |
| m4_include(`SQLCommon.m4') |
| |
| /** |
| |
| @addtogroup grp_plda |
| |
| @about |
| |
Latent Dirichlet Allocation (LDA) is a generative probabilistic
model for natural language texts that has received considerable attention in recent years.
The model is quite versatile, having found uses in problems like automated
topic discovery, collaborative filtering, and document classification.
| |
| The LDA model posits that each document is associated with a mixture of various |
| topics (e.g. a document is related to Topic 1 with probability 0.7, and Topic 2 with |
| probability 0.3), and that each word in the document is attributable to one |
| of the document's topics. There is a (symmetric) Dirichlet prior with parameter |
\f$ \alpha \f$ on each document's topic mixture. In addition, there is another
(symmetric) Dirichlet prior with parameter \f$ \eta \f$ on the distribution
| of words for each topic. The following generative process then defines a distribution |
| over a corpus of documents. First sample, for each topic \f$ i \f$, a per-topic word distribution |
| \f$ \Phi_i \f$ from the Dirichlet(\f$\eta\f$) prior. |
| Then for each document: |
-# Sample a document length \f$ N \f$ from a suitable distribution, say, Poisson.
| -# Sample a topic mixture \f$ \theta \f$ for the document from the Dirichlet(\f$\alpha\f$) distribution. |
-# For each of the \f$ N \f$ words:
| -# Sample a topic \f$ z_n \f$ from the multinomial topic distribution \f$ \theta \f$. |
| -# Sample a word \f$ w_n \f$ from the multinomial word distribution \f$ \Phi_{z_n} \f$ associated with topic \f$ z_n \f$. |
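
Following the notation of Blei et al. [1], this generative process corresponds to the
following joint distribution over a single document's topic mixture \f$ \theta \f$,
topic assignments \f$ z \f$, and words \f$ w \f$ (with the per-topic word
distributions \f$ \Phi \f$ fixed):
\f[
  p(\theta, z, w \mid \alpha, \Phi) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid \Phi_{z_n}).
\f]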
| |
In practice, only the words in each document are observable. The topic mixture of
each document and the topic assigned to each word are latent variables that must
be inferred from the observed words; this is what is usually meant by the
inference problem for LDA. Exact inference is intractable, but several
approximate inference algorithms for LDA have been developed. The simple and
effective Gibbs sampling algorithm described by Griffiths and Steyvers [2]
appears to be the current algorithm of choice. Our parallel implementation of
LDA follows Wang et al. [3], and is essentially a straightforward
parallelization of the Gibbs sampling algorithm.
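
For reference, the collapsed Gibbs sampler of [2] integrates out \f$ \theta \f$ and
\f$ \Phi \f$, and resamples the topic \f$ z_i \f$ of each word \f$ w_i \f$ in document
\f$ d_i \f$ conditioned on all the other topic assignments \f$ z_{-i} \f$ according to
\f[
  P(z_i = k \mid z_{-i}, w) \;\propto\;
  \frac{n^{(w_i)}_{-i,k} + \eta}{n^{(\cdot)}_{-i,k} + W\eta}
  \left( n^{(d_i)}_{-i,k} + \alpha \right),
\f]
where \f$ n^{(w_i)}_{-i,k} \f$ is the number of times word \f$ w_i \f$ has been assigned
to topic \f$ k \f$, \f$ n^{(\cdot)}_{-i,k} \f$ is the total number of words assigned to
topic \f$ k \f$, \f$ n^{(d_i)}_{-i,k} \f$ is the number of words in document \f$ d_i \f$
assigned to topic \f$ k \f$ (all counts excluding the current word \f$ i \f$), and
\f$ W \f$ is the size of the dictionary.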
| |
| See also http://code.google.com/p/plda/. |
| |
| @input |
| The \b corpus to be analyzed is expected to be of the following form: |
| <pre>{TABLE|VIEW} <em>datatable</em> ( |
| <em>id</em> INTEGER, |
| <em>contents</em> INTEGER[], |
| ... |
| )</pre> |
| where \c id refers to the document ID, and \c contents is an integer array that specifies the words in the document using the index from |
the dictionary. Words must be represented using positive integers.
| |
| The \b dictionary that indexes all the words found in the corpus is of the following form: |
| <pre>{TABLE|VIEW} <em>dicttable</em> ( |
| <em>dict</em> TEXT[], |
| ... |
| )</pre> |
| |
| @usage |
| |
| - Topic inference is achieved through the following UDF |
| <pre> |
| SELECT \ref plda_run('<em>datatable</em>', '<em>dicttable</em>', '<em>modeltable</em>', '<em>outputdatatable</em>', |
| <em>numiter</em>, <em>numtopics</em>, <em>alpha</em>, <em>eta</em>); |
| </pre> |
| This function stores the resulting model in <tt><em>outputdatatable</em></tt>. |
| - Labelling of test documents using a learned LDA model is achieved using the following UDF |
| <pre> |
| SELECT \ref plda_label_test_documents('<em>testtable</em>', '<em>outputtable</em>', '<em>modeltable</em>', '<em>dicttable</em>', |
| <em>numtopics</em>, <em>alpha</em>, <em>eta</em>); |
| </pre> |
| This creates the following table with the assigned topic for each word in the test corpus. |
| <pre> |
| id | contents | topics |
| ----+-------------------------------+----------------------------------- |
| ... |
| </pre> |
| |
| @implementation |
| The input format for the Parallel LDA module is different from that used by the |
| `lda' package for R. In the `lda' package, each document is represented by two |
| equal dimensional integer arrays. The first array represents the words that occur |
| in the document, and the second array captures the number of times each word in |
| the first array occurs in the document. |
This representation has a major weakness: all the occurrences
of a word in a document must be assigned the same topic, which is clearly not
satisfactory. Further, at the time of writing, the main learning function in the
`lda' package does not work correctly when the occurrence counts for words
are greater than one.
| |
There is a script called generateTestCases.cc that can be used to generate some
simple test documents for validating the correctness and efficiency of the parallel
LDA implementation.
| |
| @examp |
| |
| We now give a usage example. |
| |
| -# As a first step, we need to prepare a corpus and dictionary in the appropriate structure. |
| \code |
| sql> CREATE TABLE MADLIB_SCHEMA.plda_mydict ( dict text[] ) DISTRIBUTED RANDOMLY; |
| sql> INSERT into MADLIB_SCHEMA.plda_mydict values |
| ('{human,machine,interface,for,abc,computer,applications,a,survey,of, |
| user,opinion,system,response,time,the,eps,management,and,engineering, |
| testing,relation,perceived,to,error,generation,random,binary,order,tree, |
| intersection,graph,path,in,minor,IV,widths,well,quasi}'); |
| |
| sql> CREATE TABLE MADLIB_SCHEMA.plda_mycorpus ( id int4, contents int4[] ); |
| sql> INSERT INTO MADLIB_SCHEMA.plda_mycorpus VALUES |
| (1, '{1,2,3,4,5,6,7}'), |
| (2, '{8,9,10,11,12,10,6,13,14,15}'), |
| (3, '{16,17,11,3,18,13}'), |
| (4, '{13,19,1,13,20,21,10,17}'), |
| (5, '{22,10,11,23,14,15,24,25,18}'), |
| (6, '{16,26,10,27,28,29,30}'), |
| (7, '{16,31,32,10,33,34,30}'), |
| (8, '{32,35,36,37,10,30,19,38,39,29}'), |
| (9, '{32,35,8,9}') ; |
| \endcode |
| -# To perform inference, we call the plda_run() function with the appropriate parameters. |
| Here is an example. |
| \code |
| sql> select MADLIB_SCHEMA.plda_run('MADLIB_SCHEMA.plda_mycorpus', 'MADLIB_SCHEMA.plda_mydict', |
| 'MADLIB_SCHEMA.plda_mymodel', 'MADLIB_SCHEMA.plda_corpus', |
| 30,10,0.5,0.5); |
| \endcode |
| After a successful run of the plda_run() function, the most probable words associated |
| with each topic are displayed. Other results of the learning process can be obtained |
| by running the following commands. Here we assume the modeltable and outputdatatable |
| parameters to plda_run() are 'MADLIB_SCHEMA.plda_mymodel' and 'MADLIB_SCHEMA.plda_corpus' respectively. |
| -# The topic assignments for each document can be obtained as follows: |
| \code |
| sql> select id, (topics).topics from MADLIB_SCHEMA.plda_corpus; |
| \endcode |
| -# The topic distribution of each document can be obtained as follows: |
| \code |
| sql> select id, (topics).topic_d from MADLIB_SCHEMA.plda_corpus; |
| \endcode |
| -# The number of times each word was assigned to a given topic in the whole corpus can |
| be computed as follows: |
| \code |
| sql> select ss.i, MADLIB_SCHEMA.plda_word_topic_distrn(gcounts,$numtopics,ss.i) |
| from MADLIB_SCHEMA.plda_mymodel, |
| (select generate_series(1,$dictsize) i) as ss; |
| \endcode |
| where $numtopics is the number of topics used in the learning process, and |
| $dictsize is the size of the dictionary. |
| -# The total number of words assigned to each topic in the whole corpus can be computed |
| as follows: |
| \code |
| sql> select sum((topics).topic_d) topic_sums from MADLIB_SCHEMA.plda_corpus; |
| \endcode |
| -# To use a learned LDA model to label new documents, we can use the following commands: |
| \code |
| sql> select MADLIB_SCHEMA.plda_label_test_documents('MADLIB_SCHEMA.plda_mycorpus', 'MADLIB_SCHEMA.plda_testresult', 'MADLIB_SCHEMA.plda_mymodel', 'MADLIB_SCHEMA.plda_mydict', 10,0.5,0.5); |
| sql> select * from MADLIB_SCHEMA.plda_testresult; |
| \endcode |
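-# The most probable words for a given topic, based on the learned model, can be
listed using the plda_topic_word_prob() function. For example, for Topic 1
(here <tt>10</tt> is the number of topics used in the learning process):
\code
sql> select word, prob, wcount
     from MADLIB_SCHEMA.plda_topic_word_prob(10, 1, 'MADLIB_SCHEMA.plda_mymodel', 'MADLIB_SCHEMA.plda_mydict')
     order by prob desc limit 5;
\endcode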
| |
| @literature |
| |
| [1] D.M. Blei, A.Y. Ng, M.I. Jordan, <em>Latent Dirichlet Allocation</em>, |
| Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003. |
| |
| [2] T. Griffiths and M. Steyvers, <em>Finding scientific topics</em>, |
| PNAS, vol. 101, pp. 5228-5235, 2004. |
| |
| [3] Y. Wang, H. Bai, M. Stanton, W-Y. Chen, and E.Y. Chang, <em>PLDA: |
| Parallel Dirichlet Allocation for Large-scale Applications</em>, AAIM, 2009. |
| |
| [4] http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation |
| |
| [5] J. Chang, Collapsed Gibbs sampling methods for topic models, R manual, 2010. |
| |
| @sa File plda.sql_in documenting the SQL functions. |
| |
| */ |
| |
| |
-- The plda_topics_t data type stores the assignment of topics to each word in a document,
-- plus the distribution of those topics in the document.
| CREATE TYPE MADLIB_SCHEMA.plda_topics_t AS ( |
| topics int4[], |
| topic_d int4[] |
| ); |
| |
-- Returns a zeroed array of a given dimension
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.plda_zero_array(d int4) RETURNS int4[] |
| AS 'MODULE_PATHNAME', 'zero_array' LANGUAGE C STRICT; |
| |
| -- Returns the element-wise sum of two integer arrays |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.plda_sum_int4array(int4[],int4[]) RETURNS int4[] |
| AS 'MODULE_PATHNAME', 'sum_int4array' LANGUAGE C; |
| |
| -- Aggregate function for computing the element-wise sum of a set of integer arrays |
| CREATE AGGREGATE MADLIB_SCHEMA.plda_sum_int4array_agg(int4[]) |
| ( |
| sfunc = MADLIB_SCHEMA.plda_sum_int4array, |
| stype = int4[] |
| ); |
| |
| -- Returns an array of random topic assignments for a given document length |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.plda_random_topics(doclen int4, numtopics int4) RETURNS MADLIB_SCHEMA.plda_topics_t |
| AS 'MODULE_PATHNAME', 'randomTopics' LANGUAGE C STRICT; |
| |
| -- This function assigns a randomly chosen topic to each word in a document according to |
| -- the count statistics obtained for the document and the whole corpus so far. |
| -- Parameters |
| -- doc : the document to be analysed |
| -- topics : the topic of each word in the doc |
| -- topic_d : the topic distribution for the doc |
| -- global_count : the global word-topic counts |
| -- topic_counts : the counts of all words in the corpus in each topic |
| -- num_topics : number of topics to be discovered |
| -- dsize : size of dictionary |
| -- alpha : the parameter of the Dirichlet distribution |
| -- eta : the parameter of the Dirichlet distribution |
| -- |
| CREATE OR REPLACE FUNCTION |
| MADLIB_SCHEMA.plda_sample_new_topics(doc int4[], topics int4[], topic_d int4[], global_count int4[], |
| topic_counts int4[], num_topics int4, dsize int4, alpha float, eta float) |
| RETURNS MADLIB_SCHEMA.plda_topics_t |
| AS 'MODULE_PATHNAME', 'sampleNewTopics' LANGUAGE C STRICT; |
| |
| -- Computes the per document word-topic counts |
| CREATE OR REPLACE FUNCTION |
| MADLIB_SCHEMA.plda_cword_count(mystate int4[], doc int4[], topics int4[], doclen int4, num_topics int4, dsize int4) |
| RETURNS int4[] |
| AS 'MODULE_PATHNAME', 'cword_count' LANGUAGE C; |
| |
| -- Aggregate function to compute all word-topic counts given topic assignments for each document |
| CREATE AGGREGATE MADLIB_SCHEMA.plda_cword_agg(int4[], int4[], int4, int4, int4) ( |
| sfunc = MADLIB_SCHEMA.plda_cword_count, |
| stype = int4[] |
| ); |
| |
| -- The main parallel LDA learning function |
| CREATE OR REPLACE FUNCTION |
| MADLIB_SCHEMA.plda_train(num_topics int4, num_iter int4, alpha float, eta float, |
| data_table text, dict_table text, model_table text, output_data_table text) |
| RETURNS int4 AS $$ |
| |
| PythonFunctionBodyOnly(`plda', `plda') |
| |
| # MADlibSchema comes from PythonFunctionBodyOnly |
| return plda.plda_train( MADlibSchema, num_topics, num_iter, alpha, eta, data_table, dict_table, model_table, output_data_table) |
| |
| $$ LANGUAGE plpythonu; |
| |
| CREATE TYPE MADLIB_SCHEMA.plda_word_weight AS ( word text, prob float, wcount int4 ); |
| |
-- Returns the most probable words for a given topic, based on Pr( word | topic ).
| CREATE OR REPLACE FUNCTION |
| MADLIB_SCHEMA.plda_topic_word_prob(num_topics int4, topic int4, model_table text, dict_table text) |
| RETURNS SETOF MADLIB_SCHEMA.plda_word_weight AS $$ |
| |
| PythonFunctionBodyOnly(`plda', `plda') |
| |
| # MADlibSchema comes from PythonFunctionBodyOnly |
| return plda.plda_topic_word_prob( MADlibSchema, num_topics, topic, model_table, dict_table) |
| |
| $$ LANGUAGE plpythonu; |
| |
| |
| CREATE TYPE MADLIB_SCHEMA.plda_word_distrn AS ( word text, distrn int4[], prob float8[] ); |
| |
| -- This function computes the topic assignments to words in a document given previously computed |
| -- statistics from the training corpus. |
-- This function has not been rewritten in PL/Python because it is hard to handle composite types
-- like MADLIB_SCHEMA.plda_topics_t in PL/Python (this would require awkward conversions to and from text).
| CREATE OR REPLACE FUNCTION |
| MADLIB_SCHEMA.plda_label_document(doc int4[], global_count int4[], topic_counts int4[], num_topics int4, dsize int4, |
| alpha float, eta float) |
| RETURNS MADLIB_SCHEMA.plda_topics_t AS $$ |
| DECLARE |
| ret MADLIB_SCHEMA.plda_topics_t; |
| BEGIN |
| ret := MADLIB_SCHEMA.plda_random_topics(array_upper(doc,1), num_topics); |
| FOR i in 1..20 LOOP |
| ret := MADLIB_SCHEMA.plda_sample_new_topics(doc,(ret).topics,(ret).topic_d,global_count,topic_counts,num_topics,dsize,alpha,eta); |
| END LOOP; |
| RETURN ret; |
| END; |
| $$ LANGUAGE plpgsql; |
| |
| -- This function computes the topic assignments to documents in a test corpus. |
| -- The data_table argument appears unnecessary, as long as topic_counts is saved in a table from the plda_train() routine |
| /** |
| * @brief This function computes the topic assignments to documents in a test corpus. |
| * |
| * @param test_table Name of table containing the test corpus (must have columns <tt>id INT</tt> and <tt>contents INT[]</tt>). |
| * @param output_table Name of table to store the results. |
| * @param model_table Name of table where learned model is stored (in the form of word-topic counts and topic counts). |
 * @param dict_table Name of the table containing the dictionary (must have column <tt>dict TEXT[]</tt>).
| * @param num_topics Number of topics to discover. |
| * @param alpha Parameter to the topic Dirichlet prior. |
| * @param eta Parameter to the Dirichlet prior on the per-topic word distributions. |
| * |
| */ |
| CREATE OR REPLACE FUNCTION |
| MADLIB_SCHEMA.plda_label_test_documents(test_table text, output_table text, model_table text, dict_table text, num_topics int4, alpha float, eta float) |
| RETURNS VOID AS $$ |
| |
| PythonFunctionBodyOnly(`plda', `plda') |
| |
| # MADlibSchema comes from PythonFunctionBodyOnly |
| plda.plda_label_test_documents( MADlibSchema, test_table, output_table, model_table, dict_table, num_topics, alpha, eta) |
| |
| $$ LANGUAGE plpythonu; |
| |
-- Returns the distribution of topics for a given word, i.e. the slice of the global
-- word-topic count array corresponding to that word (e.g. with ntopics = 10 and
-- word = 3, this returns arr[21:30])
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.plda_word_topic_distrn(arr int4[], ntopics int4, word int4, OUT ret int4[]) AS $$ |
| SELECT $1[(($3-1)*$2 + 1):(($3-1)*$2 + $2)]; |
| $$ LANGUAGE sql; |
| /** |
| * @brief Main plda function |
| * |
| * @param datatable Name of table containing the corpus (must have columns <tt>id INT</tt> and <tt>contents INT[]</tt>). |
| * @param dicttable Name of table containing the dictionary (must have column \c dict \c TEXT[]). |
| * @param modeltable Name of table where learned model will be stored (in the form of word-topic counts and topic counts). |
 * @param outputdatatable Name of the table in which the system will store a copy of the datatable plus topic assignments.
| * @param numiter Number of iterations to run the Gibbs sampling. |
| * @param numtopics Number of topics to discover. |
| * @param alpha Parameter to the topic Dirichlet prior. |
| * @param eta Parameter to the Dirichlet prior on the per-topic word distributions. |
| * |
| */ |
| CREATE OR REPLACE FUNCTION |
| MADLIB_SCHEMA.plda_run(datatable text, dicttable text, modeltable text, outputdatatable text, |
| numiter int4, numtopics int4, alpha float, eta float) |
| RETURNS VOID AS $$ |
| |
| PythonFunctionBodyOnly(`plda', `plda') |
| |
| # MADlibSchema comes from PythonFunctionBodyOnly |
| plda.plda_run( MADlibSchema, datatable, dicttable, modeltable, outputdatatable, numiter, numtopics, alpha, eta) |
| |
| $$ LANGUAGE plpythonu; |
| |
| |