Okapi BM25 is a ranking function for documents for a given query.

It can also be used for a better replacement of TF-IDF and can be used for term-weight for each document.

The ranking function

Given a query $$Q$$, containing keywords $$q1,....,q_n$$, the BM25 score of a document $$D$$ is:

$$ score(Q, D) = \sum_{i=1}^{n}IDF(q_{i}) \cdot \frac{tf(q_{i},D) \cdot (k_{1}+1)}{tf(q_{i},D) + k_{1} \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$

where $$tf(q_{i}, D)$$ is $$q_{i}$$'s term frequency in the document $$D$$, $$|D|$$ is the length of the document $$D$$ in words, and $$avgdl$$ is the average document length in the text collection from which documents are drawn. $$k_{1}$$ and $$b$$ are free parameters, usually chosen, in absence of an advanced optimization, as $$k_{1} \in [1.2,2.0]$$ and $$b = 0.75$$.

BM25 can also be applied for term weighing, showing how important a word is to a document in a collection or corpus, as follows:

$$ score(t_{i}, D) = IDF(t_{i}) \cdot \frac{tf(t_{i},D) \cdot (k_{1}+1)}{tf(t_{i},D) + k_{1} \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} $$

where $$t_{i}$$ is a term appeared in document $$D$$.

Data preparation

In similar to TF-IDF, you need to prepare a relation consists of (docid,word) tuples to compute BM25 score.

create external table wikipage (
  docid int,
  page string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

cd ~/tmp
wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv

LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage;

create or replace view wikipage_exploded
as
select
  docid, 
  word
from
  wikipage LATERAL VIEW explode(tokenize(page,true)) t as word
where
  not is_stopword(word);

Define views of term/doc frequency

create or replace view term_frequency 
as
select
  t1.docid, 
  t2.word,
  t2.freq
from (
  select
    docid,
    tf(word) as word2freq
  from
    wikipage_exploded
  group by
    docid
) t1 
LATERAL VIEW explode(word2freq) t2 as word, freq;

create or replace view document_frequency
as
select
  word, 
  count(distinct docid) docs
from
  wikipage_exploded
group by
  word;

create or replace view doc_len
as
select 
  docid, 
  count(1) as dl,
  avg(count(1)) over () as avgdl,
  count(distinct docid) over () as total_docs
from
  wikipage_exploded
group by
  docid
;

Compute Okapi BM25 score

BM25 (and TF-IDF) score that represents importance of term for each document is useful for feature weight in feature engineering.

create table scores
as
select
  tf.docid,
  tf.word,
  bm25(
    tf.freq,
    dl.dl,
    dl.avgdl,
    dl.total_docs,
    df.docs
    -- , '-k1 1.5 -b 0.75'
  ) as bm25,
  tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
from
  term_frequency tf
  JOIN document_frequency df ON (tf.word = df.word)
  JOIN doc_len dl ON (tf.docid = dl.docid)
;

Hyperparameters

bm25()'s function signature and hyperparameters are as follows:

hive> select bm25();
FAILED: SemanticException Line 1:7 Wrong arguments 'bm25':

#arguments must be greater than or equal to 5: 0

usage: bm25(double termFrequency, int docLength, double avgDocLength, int
       numDocs, int numDocsWithTerm [, const string options]) - Return an
       Okapi BM25 score in double [-b <arg>] [-d <arg>] [-k1 <arg>]
       [-min_idf <arg>]
 -b <arg>                   Hyperparameter with type double in range 0.0
                            and 1.0 [default: 0.75]
 -d,--delta <arg>           Hyperparameter delta of BM25+ [default: 0.0]
 -k1 <arg>                  Hyperparameter with type double, usually in
                            range 1.2 and 2.0 [default: 1.2]
 -min_idf,--epsilon <arg>   Hyperparameter delta of BM25+ [default: 1e-8]

Show important terms for each document

select
  docid, 
  to_ordered_list(feature(word,bm25), bm25, '-k 10') as bm25_scores,
  to_ordered_list(feature(word,tfidf),tfidf, '-k 10') as tfidf_scores
from 
  scores
group by
  docid
limit 10;

Retrive relevant documents for a given search terms

You can retrieve relevant documents for a given search query wisdom, justice, discussion as follows:

WITH scores as (
  select
    tf.docid,
    tf.word,
    bm25(
      tf.freq,
      dl.dl,
      dl.avgdl,
      dl.total_docs,
      df.docs
      -- , '-k1 1.5 -b 0.75'
    ) as bm25,
    tfidf(tf.freq, df.docs, dl.total_docs) as tfidf
  from
    term_frequency tf
    JOIN document_frequency df ON (tf.word = df.word)
    JOIN doc_len dl ON (tf.docid = dl.docid)
  where
    tf.word in ('wisdom', 'justice', 'discussion')
)
select
  docid,
  sum(bm25) as score 
from
  scores 
group by
  docid
order by
  score DESC 
LIMIT 10
;
docidscore
10.14190456024682774
20.02197354085722251