Tokenizer for English Texts

Hivemall provides a simple English text tokenizer UDF with the following syntax:

tokenize(text input, optional boolean toLowerCase = false)
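For instance, assuming the tokenizer's default delimiter-based behavior (splitting on whitespace and punctuation), a query like the following is expected to return lowercased tokens; the exact token boundaries shown here are illustrative:

select tokenize("Hello, world! This is a test.", true);

["hello","world","this","is","a","test"]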

Tokenizer for Non-English Texts

The Hivemall NLP module provides tokenizer UDFs for non-English texts, as described below.

First of all, you need to issue the following DDLs to use the NLP module. Note that the NLP module is not included in hivemall-with-dependencies.jar.

add jar /path/to/hivemall-nlp-xxx-with-dependencies.jar;

source /path/to/define-additional.hive;
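To verify the setup, you can simply invoke one of the registered UDFs as a smoke test (the sample input and expected output are illustrative):

select tokenize_ja("テスト");

["テスト"]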

Japanese Tokenizer

The Japanese text tokenizer UDF uses Kuromoji.

The signature of the UDF is as follows:

tokenize_ja(text input, optional const text mode = "normal", optional const array<string> stopWords, optional const array<string> stopTags, optional const array<string> userDict)

Note

tokenize_ja has been supported since Hivemall v0.4.1, and the fifth argument (userDict) is supported since v0.5-rc.1.

Its basic usage is as follows:

select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");

["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]

(The input sentence reads: "This is a tokenization test using kuromoji. You can specify normal/search/extended as the second argument. The default is normal mode.")
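The mode argument switches Kuromoji's segmentation behavior: "search" mode additionally decomposes long compound nouns, which is usually preferable for indexing. The following pair of queries illustrates this; the expected outputs assume Kuromoji's standard handling of this well-known compound:

select tokenize_ja("関西国際空港", "normal");

["関西国際空港"]

select tokenize_ja("関西国際空港", "search");

["関西","国際","空港"]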

In addition, the third and fourth arguments allow you to provide your own lists of stop words and stop tags, respectively. For example, the following query ignores "kuromoji" (as a stop word) and the general noun "分かち書き" (via the stop tag 名詞-一般):

select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), array("名詞-一般"));

["を","使う","た","の","テスト","です"]

select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), stoptags_exclude(array("名詞")));

["分かち書き","テスト"]

stoptags_exclude(array<string> tags [, const string lang='ja']) is a useful UDF for getting the list of stop tags that excludes the given part-of-speech tags, as seen below:

select stoptags_exclude(array("名詞-固有名詞"));

["その他","その他-間投","フィラー","副詞","副詞-一般","副詞-助詞類接続","助動詞","助詞","助詞-並立助詞","助詞-係助詞","助詞-副助詞","助詞-副助詞/並立助詞/終助詞","助詞-副詞化","助詞-接続助詞","助詞-格助詞","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-特殊","助詞-終助詞","助詞-連体化","助詞-間投助詞","動詞","動詞-接尾","動詞-自立","動詞-非自立","名詞","名詞-サ変接続","名詞-ナイ形容詞語幹","名詞-一般","名詞-代名詞","名詞-代名詞-一般","名詞-代名詞-縮約","名詞-副詞可能","名詞-動詞非自立的","名詞-引用文字列","名詞-形容動詞語幹","名詞-接尾","名詞-接尾-サ変接続","名詞-接尾-一般","名詞-接尾-人名","名詞-接尾-副詞可能","名詞-接尾-助動詞語幹","名詞-接尾-助数詞","名詞-接尾-地域","名詞-接尾-形容動詞語幹","名詞-接尾-特殊","名詞-接続詞的","名詞-数","名詞-特殊","名詞-特殊-助動詞語幹","名詞-非自立","名詞-非自立-一般","名詞-非自立-副詞可能","名詞-非自立-助動詞語幹","名詞-非自立-形容動詞語幹","形容詞","形容詞-接尾","形容詞-自立","形容詞-非自立","感動詞","接続詞","接頭詞","接頭詞-動詞接続","接頭詞-名詞接続","接頭詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]

Moreover, the fifth argument userDict enables you to register a user-defined custom dictionary in the official Kuromoji dictionary format:

select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null, 
                   array(
                     "日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞", 
                     "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞"
                   ));

["日本","経済","新聞","関西","国際","空港"]

Note that you can pass null to the third and fourth arguments to explicitly use Kuromoji's default stop words and stop tags.

If you have a large custom dictionary as an external file, userDict can also be a const string userDictURL that indicates the URL of an external file hosted somewhere such as Amazon S3:

select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null,
                   "https://raw.githubusercontent.com/atilika/kuromoji/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt");

["日本","経済","新聞","関西","国際","空港"]

The dictionary SHOULD be accessible through the http/https protocol, and it SHOULD be compressed using gzip with a .gz suffix, because the maximum dictionary size is limited to 32MB, the read timeout is set to 60 seconds, and the connection must be established within 10 seconds.
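For example, a gzip-compressed dictionary hosted at a hypothetical URL (a placeholder, not a real file) would be referenced as follows:

select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null,
                   "https://example.com/kuromoji_user_dict.txt.gz");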

If you want to use HTTP Basic Authentication, use the following form: https://user:password@www.siteurl.com/my_dict.txt.gz (see Section 3.1 of RFC 1738).

For detailed APIs, please also refer to the Javadoc of JapaneseAnalyzer.

Part-of-speech

Since Hivemall v0.6.0, the second argument can also accept options in the following format:

 -mode <arg>   The tokenization mode. One of ['normal', 'search',
               'extended', 'default' (normal)]
 -pos          Return part-of-speech information

Then, you can get part-of-speech information as follows:

WITH tmp as (
  select
    tokenize_ja('kuromojiを使った分かち書きのテストです。','-mode search -pos') as r
)
select
  r.tokens,
  r.pos,
  r.tokens[0] as token0,
  r.pos[0] as pos0
from
  tmp;
| tokens | pos | token0 | pos0 |
|---|---|---|---|
| ["kuromoji","使う","分かち書き","テスト"] | ["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"] | kuromoji | 名詞-一般 |

Note that when the -pos option is specified, tokenize_ja returns a struct record containing array<string> tokens and array<string> pos as its elements.
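If you need one row per token, the struct can be flattened with Hive's posexplode; the following is a minimal sketch under that assumption (the aliases i and token are arbitrary):

WITH tmp as (
  select
    tokenize_ja('kuromojiを使った分かち書きのテストです。', '-mode search -pos') as r
)
select
  t.token,          -- one token per row
  r.pos[t.i] as pos -- look up the matching part-of-speech tag by position
from
  tmp
  LATERAL VIEW posexplode(r.tokens) t as i, token;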

Chinese Tokenizer

The Chinese text tokenizer UDF uses SmartChineseAnalyzer.

The signature of the UDF is as follows:

tokenize_cn(string line, optional const array<string> stopWords)

Its basic usage is as follows:

select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统,Java语言编写,修改的中科院计算所ICTCLAS分词系统。");

[smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]

(The input sentence reads: "Smartcn is an open-source Chinese word segmentation system under the Apache 2.0 license, written in Java; a modified version of the ICTCLAS segmentation system of the Institute of Computing Technology, Chinese Academy of Sciences.")
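As with tokenize_ja, the optional second argument lets you supply a custom stop-word list. A minimal sketch (the input is shortened from the example above, and the expectation that the particle "的" is dropped from the output is illustrative):

select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统", array("的"));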

For detailed APIs, please also refer to the Javadoc of SmartChineseAnalyzer.