Topic modeling is a way to analyze massive documents by clustering them into some topics. In particular, Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling techniques; papers introduce the method are as follows:
Hivemall enables you to analyze your data such as, but not limited to, documents based on LDA. This page gives usage instructions of the feature.
Note
This feature is supported from Hivemall v0.5-rc.1 or later.
Assume that we already have a table docs
which contains many documents as string format:
docid | doc |
---|---|
1 | “Fruits and vegetables are healthy.” |
2 | “I like apples, oranges, and avocados. I do not like the flu or colds.” |
... | ... |
Hivemall has several functions which are particularly useful for text processing. More specifically, by using tokenize()
and is_stopword()
, you can immediately convert the documents to bag-of-words-like format:
with word_counts as ( select docid, feature(word, count(word)) as word_count from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word where not is_stopword(word) group by docid, word ) select docid, collect_list(word_count) as features from word_counts group by docid ;
docid | features |
---|---|
1 | [“fruits:1”,“healthy:1”,“vegetables:1”] |
2 | [“apples:1”,“avocados:1”,“colds:1”,“flu:1”,“like:2”,“oranges:1”] |
Note
It should be noted that, as long as your data can be represented as the feature format, LDA can be applied for arbitrary data as a generic clustering technique.
Each feature vector is input to the train_lda()
function:
with word_counts as ( select docid, feature(word, count(word)) as word_count from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word where not is_stopword(word) group by docid, word ), input as ( select docid, collect_list(word_count) as features from word_counts group by docid ) select train_lda(features, '-topics 2 -iter 20') as (label, word, lambda) from input ;
Here, an option -topics 2
specifies the number of topics we assume in the set of documents.
Notice that order by docid
ensures building a LDA model precisely in a single node. In case that you like to launch train_lda
in parallel, following query hopefully returns similar (but might be slightly approximated) result:
with word_counts as ( -- same as above ), input as ( select docid, collect_list(f) as features from word_counts group by docid ) select label, word, avg(lambda) as lambda from ( select train_lda(features, '-topics 2 -iter 20') as (label, word, lambda) from input ) t2 group by label, word -- order by lambda desc -- ordering is optional ;
Eventually, a new table lda_model
is generated as shown below:
label | word | lambda |
---|---|---|
0 | fruits | 0.33372128 |
0 | vegetables | 0.33272517 |
0 | healthy | 0.33246377 |
0 | flu | 2.3617347E-4 |
0 | apples | 2.1898883E-4 |
0 | oranges | 1.8161473E-4 |
0 | like | 1.7666373E-4 |
0 | avocados | 1.726186E-4 |
0 | colds | 1.037139E-4 |
1 | colds | 0.16622013 |
1 | avocados | 0.16618845 |
1 | oranges | 0.1661859 |
1 | like | 0.16618414 |
1 | apples | 0.16616651 |
1 | flu | 0.16615893 |
1 | healthy | 0.0012059759 |
1 | vegetables | 0.0010818697 |
1 | fruits | 6.080827E-4 |
In the table, label
indicates a topic index, and lambda
is a value which represents how each word is likely to characterize a topic. That is, we can say that, in terms of lambda
, top-N words are the topic words of a topic.
Obviously, we can observe that topic 0
corresponds to document 1
, and topic 1
represents words in document 2
.
Once you have constructed topic models as described before, a function lda_predict()
allows you to predict topic assignments of documents.
For example, if we consider the docs
table, the exactly same set of documents as used for training, probability that a document is assigned to a topic can be computed by:
with test as ( select docid, word, count(word) as value from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word where not is_stopword(word) group by docid, word ) select t.docid, lda_predict(t.word, t.value, m.label, m.lambda, '-topics 2') as probabilities from test t JOIN lda_model m ON (t.word = m.word) group by t.docid ;
docid | probabilities (sorted by probabilities) |
---|---|
1 | [{“label”:0,“probability”:0.875},{“label”:1,“probability”:0.125}] |
2 | [{“label”:1,“probability”:0.9375},{“label”:0,“probability”:0.0625}] |
Importantly, an option -topics
is expected to be the same value as you set for training.
Since the probabilities are sorted in descending order, a label of the most promising topic is easily obtained as:
select docid, probabilities[0].label from topic;
docid | label |
---|---|
1 | 0 |
2 | 1 |
Of course, using the different set of documents for prediction is possible. Predicting topic assignments of newly observed documents should be more realistic scenario.