<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!-- toc -->
# Tokenizer for English Texts
Hivemall provides a simple English text tokenizer UDF with the following syntax:
```sql
tokenize(text input, optional boolean toLowerCase = false)
```
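For example, the following query tokenizes a sentence and lowercases the resulting tokens into an `array<string>` (the output is illustrative; exact handling of punctuation depends on the UDF implementation):
```sql
select tokenize("Hello, world! This is a test.", true);
```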
# Tokenizer for Non-English Texts
The Hivemall NLP module provides several tokenizer UDFs for non-English texts.
First of all, you need to issue the following DDLs to use the NLP module. Note that the NLP module is not included in `hivemall-with-dependencies.jar`.
```sql
add jar /path/to/hivemall-nlp-xxx-with-dependencies.jar;
source /path/to/define-additional.hive;
```
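To confirm that the functions are registered in the current session, Hive's standard `describe function` command can be used:
```sql
describe function tokenize_ja;
```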
## Japanese Tokenizer
The Japanese text tokenizer UDF uses [Kuromoji](https://github.com/atilika/kuromoji).
The signature of the UDF is as follows:
```sql
tokenize_ja(text input, optional const text mode = "normal", optional const array<string> stopWords, optional const array<string> stopTags, optional const array<string> userDict)
```
> #### Note
> `tokenize_ja` is supported in Hivemall v0.4.1 and later, and the fifth argument (`userDict`) is supported in v0.5-rc.1 and later.
Its basic usage is as follows (the Japanese input roughly reads: "This is a word-segmentation test using kuromoji. You can specify normal/search/extended as the second argument. The default is normal mode."):
```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
```
> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
In addition, the third and fourth arguments allow you to provide your own lists of stop words and stop tags, respectively. For example, the following query ignores "kuromoji" (listed as a stop word) and the general noun "分かち書き" (filtered by the "名詞-一般" stop tag):
```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), array("名詞-一般"));
```
> ["を","使う","た","の","テスト","です"]
Conversely, combining the fourth argument with `stoptags_exclude` lets you keep only specific part-of-speech categories; the following query retains nouns only:
```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), stoptags_exclude(array("名詞")));
```
> ["分かち書き","テスト"]
`stoptags_exclude(array<string> tags [, const string lang='ja'])` is a useful UDF for getting [stoptags](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stoptags.txt) excluding the given part-of-speech tags, as seen below:
```sql
select stoptags_exclude(array("名詞-固有名詞"));
```
> ["その他","その他-間投","フィラー","副詞","副詞-一般","副詞-助詞類接続","助動詞","助詞","助詞-並立助詞"
,"助詞-係助詞","助詞-副助詞","助詞-副助詞/並立助詞/終助詞","助詞-副詞化","助詞-接続助詞","助詞-格助詞
","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-特殊","助詞-終助詞","助詞-連体化","助
詞-間投助詞","動詞","動詞-接尾","動詞-自立","動詞-非自立","名詞","名詞-サ変接続","名詞-ナイ形容詞語幹",
"名詞-一般","名詞-代名詞","名詞-代名詞-一般","名詞-代名詞-縮約","名詞-副詞可能","名詞-動詞非自立的","名
詞-引用文字列","名詞-形容動詞語幹","名詞-接尾","名詞-接尾-サ変接続","名詞-接尾-一般","名詞-接尾-人名","
名詞-接尾-副詞可能","名詞-接尾-助動詞語幹","名詞-接尾-助数詞","名詞-接尾-地域","名詞-接尾-形容動詞語幹"
,"名詞-接尾-特殊","名詞-接続詞的","名詞-数","名詞-特殊","名詞-特殊-助動詞語幹","名詞-非自立","名詞-非自
立-一般","名詞-非自立-副詞可能","名詞-非自立-助動詞語幹","名詞-非自立-形容動詞語幹","形容詞","形容詞-接
尾","形容詞-自立","形容詞-非自立","感動詞","接続詞","接頭詞","接頭詞-動詞接続","接頭詞-名詞接続","接頭
詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉
","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]
Moreover, the fifth argument `userDict` enables you to register a user-defined custom dictionary in [Kuromoji official format](https://github.com/atilika/kuromoji/blob/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt):
```sql
select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null,
array(
"日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞",
"関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞"
));
```
> ["日本","経済","新聞","関西","国際","空港"]
Note that you can pass `null` to each of the third and fourth arguments to explicitly use Kuromoji's [default stop words](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stopwords.txt) and [stop tags](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stoptags.txt).
If you have a large custom dictionary as an external file, `userDict` can alternatively be given as `const string userDictURL`, a URL pointing to the external file hosted somewhere like Amazon S3:
```sql
select tokenize_ja("日本経済新聞&関西国際空港", "normal", null, null,
"https://raw.githubusercontent.com/atilika/kuromoji/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt");
```
> ["日本","経済","新聞","関西","国際","空港"]
The dictionary SHOULD be accessible over the http/https protocol, and it SHOULD be compressed using gzip with a `.gz` suffix, because the maximum dictionary size is limited to 32 MB and the read timeout is set to 60 seconds. In addition, the connection must be established within 10 seconds.
If you want to use HTTP Basic Authentication, please use the following form: `https://user:password@www.siteurl.com/my_dict.txt.gz` (see Section 3.1 of [RFC 1738](https://www.ietf.org/rfc/rfc1738.txt)).
For detailed APIs, please refer to the Javadoc of [JapaneseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html) as well.
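In practice, tokenizer UDFs are often combined with Hive's standard `LATERAL VIEW explode` to flatten the token array for aggregation. The following sketch counts token frequencies over a hypothetical table `docs(doc_id int, content string)`:
```sql
-- Token frequency over a hypothetical `docs` table
select t.token, count(1) as cnt
from docs d
lateral view explode(tokenize_ja(d.content)) t as token
group by t.token
order by cnt desc;
```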
## Part-of-speech
Since Hivemall v0.6.0, the second argument can also accept options in the following format:
```
-mode <arg> The tokenization mode. One of ['normal', 'search',
'extended', 'default' (normal)]
-pos Return part-of-speech information
```
Then, you can get part-of-speech information as follows:
```sql
WITH tmp as (
select
tokenize_ja('kuromojiを使った分かち書きのテストです。','-mode search -pos') as r
)
select
r.tokens,
r.pos,
r.tokens[0] as token0,
r.pos[0] as pos0
from
tmp;
```
| tokens |pos | token0 | pos0 |
|:-:|:-:|:-:|:-:|
| ["kuromoji","使う","分かち書き","テスト"] | ["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"] | kuromoji | 名詞-一般 |
Note that, when the `-pos` option is specified, `tokenize_ja` returns a struct record containing `array<string> tokens` and `array<string> pos` as its elements.
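Since `tokens` and `pos` are parallel arrays, each token can be paired with its tag using Hive's standard `posexplode`, which emits the array index alongside each element (a sketch reusing the query above):
```sql
WITH tmp as (
  select
    tokenize_ja('kuromojiを使った分かち書きのテストです。', '-mode search -pos') as r
)
select
  t.i,
  t.token,
  r.pos[t.i] as tag
from
  tmp
lateral view posexplode(r.tokens) t as i, token;
```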
## Chinese Tokenizer
The Chinese text tokenizer UDF uses [SmartChineseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html).
The signature of the UDF is as follows:
```sql
tokenize_cn(string line, optional const array<string> stopWords)
```
Its basic usage is as follows:
```sql
select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统,Java语言编写,修改的中科院计算所ICTCLAS分词系统。");
```
> [smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]
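Similar to `tokenize_ja`, the optional second argument accepts a custom stop-word list; for example, passing `array("的")` should drop that token from the output above (a hedged sketch):
```sql
select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统,Java语言编写,修改的中科院计算所ICTCLAS分词系统。", array("的"));
```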
For detailed APIs, please refer to the Javadoc of [SmartChineseAnalyzer](https://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html) as well.