Okapi BM25 is a ranking function for documents for a given query.
It can also be used for a better replacement of TF-IDF and can be used for term-weight for each document.
Given a query $$Q$$, containing keywords $$q1,....,q_n$$, the BM25 score of a document $$D$$ is:
$$ score(Q, D) = \sum_{i=1}^{n}IDF(q_{i}) \cdot \frac{tf(q_{i},D) \cdot (k_{1}+1)}{tf(q_{i},D) + k_{1} \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$
where $$tf(q_{i}, D)$$ is $$q_{i}$$'s term frequency in the document $$D$$, $$|D|$$ is the length of the document $$D$$ in words, and $$avgdl$$ is the average document length in the text collection from which documents are drawn. $$k_{1}$$ and $$b$$ are free parameters, usually chosen, in absence of an advanced optimization, as $$k_{1} \in [1.2,2.0]$$ and $$b = 0.75$$.
BM25 can also be applied for term weighing, showing how important a word is to a document in a collection or corpus, as follows:
$$ score(t_{i}, D) = IDF(t_{i}) \cdot \frac{tf(t_{i},D) \cdot (k_{1}+1)}{tf(t_{i},D) + k_{1} \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$
where $$t_{i}$$ is a term appeared in document $$D$$.
In similar to TF-IDF, you need to prepare a relation consists of (docid,word) tuples to compute BM25 score.
create external table wikipage ( docid int, page string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE; cd ~/tmp wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage; create or replace view wikipage_exploded as select docid, word from wikipage LATERAL VIEW explode(tokenize(page,true)) t as word where not is_stopword(word);
create or replace view term_frequency as select t1.docid, t2.word, t2.freq from ( select docid, tf(word) as word2freq from wikipage_exploded group by docid ) t1 LATERAL VIEW explode(word2freq) t2 as word, freq; create or replace view document_frequency as select word, count(distinct docid) docs from wikipage_exploded group by word; create or replace view doc_len as select docid, count(1) as dl, avg(count(1)) over () as avgdl, count(distinct docid) over () as total_docs from wikipage_exploded group by docid ;
BM25 (and TF-IDF) score that represents importance of term for each document is useful for feature weight in feature engineering.
create table scores as select tf.docid, tf.word, bm25( tf.freq, dl.dl, dl.avgdl, dl.total_docs, df.docs -- , '-k1 1.5 -b 0.75' ) as bm25, tfidf(tf.freq, df.docs, dl.total_docs) as tfidf from term_frequency tf JOIN document_frequency df ON (tf.word = df.word) JOIN doc_len dl ON (tf.docid = dl.docid) ;
bm25()
's function signature and hyperparameters are as follows:
hive> select bm25(); FAILED: SemanticException Line 1:7 Wrong arguments 'bm25': #arguments must be greater than or equal to 5: 0 usage: bm25(double termFrequency, int docLength, double avgDocLength, int numDocs, int numDocsWithTerm [, const string options]) - Return an Okapi BM25 score in double [-b <arg>] [-d <arg>] [-k1 <arg>] [-min_idf <arg>] -b <arg> Hyperparameter with type double in range 0.0 and 1.0 [default: 0.75] -d,--delta <arg> Hyperparameter delta of BM25+ [default: 0.0] -k1 <arg> Hyperparameter with type double, usually in range 1.2 and 2.0 [default: 1.2] -min_idf,--epsilon <arg> Hyperparameter delta of BM25+ [default: 1e-8]
select docid, to_ordered_list(feature(word,bm25), bm25, '-k 10') as bm25_scores, to_ordered_list(feature(word,tfidf),tfidf, '-k 10') as tfidf_scores from scores group by docid limit 10;
You can retrieve relevant documents for a given search query wisdom, justice, discussion
as follows:
WITH scores as ( select tf.docid, tf.word, bm25( tf.freq, dl.dl, dl.avgdl, dl.total_docs, df.docs -- , '-k1 1.5 -b 0.75' ) as bm25, tfidf(tf.freq, df.docs, dl.total_docs) as tfidf from term_frequency tf JOIN document_frequency df ON (tf.word = df.word) JOIN doc_len dl ON (tf.docid = dl.docid) where tf.word in ('wisdom', 'justice', 'discussion') ) select docid, sum(bm25) as score from scores group by docid order by score DESC LIMIT 10 ;
docid | score |
---|---|
1 | 0.14190456024682774 |
2 | 0.02197354085722251 |