Hivemall provides a simple English text tokenizer UDF that has the following syntax:
tokenize(text input, optional boolean toLowerCase = false)
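For example, the following query splits an English sentence into lowercased tokens (a usage sketch; the example sentence is illustrative):

```sql
-- Tokenize an English sentence; the second argument lowercases the tokens.
select tokenize("Hello, World! This is a test.", true);
```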
The Hivemall NLP module provides several non-English text tokenizer UDFs, as described below.
First of all, you need to issue the following DDLs to use the NLP module. Note that the NLP module is not included in hivemall-with-dependencies.jar.
add jar /path/to/hivemall-nlp-xxx-with-dependencies.jar;
source /path/to/define-additional.hive;
The Japanese text tokenizer UDF uses Kuromoji.
The signature of the UDF is as follows:
tokenize_ja(text input, optional const text mode = "normal", optional const array<string> stopWords, optional const array<string> stopTags, optional const array<string> userDict)
Note: tokenize_ja is supported since Hivemall v0.4.1, and the fifth argument is supported since v0.5-rc.1 and later.
Its basic usage is as follows:
select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
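The second argument selects Kuromoji's segmentation mode (normal, search, or extended). For instance, the following sketch uses the search mode, which decompounds long compound nouns into smaller segments (the input term is illustrative):

```sql
-- In "search" mode, a compound noun like 関西国際空港 is split into its
-- constituent words, which typically improves recall in search applications.
select tokenize_ja("関西国際空港", "search");
```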
In addition, the third and fourth arguments allow you to use your own lists of stop words and stop tags, respectively. For example, the following query ignores "kuromoji" (as a stop word) and general nouns such as "分かち書き" (via the stop tag 名詞-一般):
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), array("名詞-一般"));
["を","使う","た","の","テスト","です"]
Moreover, the fifth argument userDict enables you to register a user-defined custom dictionary in the Kuromoji official format:
select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null, array( "日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞", "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞" ));
["日本","経済","新聞","関西","国際","空港"]
Note that you can pass null as the third and fourth arguments to explicitly use Kuromoji's default stop words and stop tags.
If you have a large custom dictionary as an external file, userDict can also be a const string userDictURL that indicates the URL of the external file, e.g., hosted on Amazon S3:
select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null, "https://raw.githubusercontent.com/atilika/kuromoji/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt");
["日本","経済","新聞","関西","国際","空港"]
For detailed APIs, please refer to the Javadoc of JapaneseAnalyzer as well.
The Chinese text tokenizer UDF uses SmartChineseAnalyzer.
The signature of the UDF is as follows:
tokenize_cn(string line, optional const array<string> stopWords)
Its basic usage is as follows:
select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统,Java语言编写,修改的中科院计算所ICTCLAS分词系统。");
[smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]
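As with the Japanese tokenizer, the optional second argument lets you supply a custom stop-word list (the stop words below are illustrative assumptions):

```sql
-- Exclude the function words 为 and 的 from the tokenized output.
select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统", array("为", "的"));
```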
For detailed APIs, please refer to the Javadoc of SmartChineseAnalyzer as well.