---
{
"title": "TOKENIZE",
"language": "en",
"description": "The TOKENIZE function tokenizes a string using a specified analyzer and returns the tokenization results as a JSON-formatted string array."
}
---
## Description
The `TOKENIZE` function tokenizes a string using a specified analyzer and returns the tokenization results as a JSON-formatted string array. This function is particularly useful for understanding how text will be analyzed when using inverted indexes with full-text search capabilities.
## Syntax
```sql
VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
```
## Parameters
- `str`: The input string to be tokenized. Type: `VARCHAR`
- `properties`: A property string specifying the analyzer configuration. Type: `VARCHAR`
The `properties` parameter supports the following key-value pairs (format: `"key1"="value1", "key2"="value2"`):
### Common Properties
| Property | Description | Example Values |
|----------|-------------|----------------|
| `built_in_analyzer` | Built-in analyzer type | `"english"`, `"chinese"`, `"unicode"`, `"icu"`, `"basic"`, `"ik"`, `"standard"`, `"none"` |
| `analyzer` | Custom analyzer name (created via `CREATE INVERTED INDEX ANALYZER`) | `"my_custom_analyzer"` |
| `parser_mode` | Parser mode (for the Chinese analyzer) | `"fine_grained"`, `"coarse_grained"` |
| `support_phrase` | Enable phrase support (stores position information) | `"true"`, `"false"` |
| `lower_case` | Convert tokens to lowercase | `"true"`, `"false"` |
| `char_filter_type` | Character filter type | Varies by filter |
| `stop_words` | Stop words configuration | Varies by implementation |
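Several properties can be combined in a single `properties` string. As an illustrative sketch (token output depends on the analyzer's dictionary and version, so none is shown), the Chinese analyzer can be paired with fine-grained parsing and phrase support:

```sql
-- Combine the chinese analyzer with fine-grained parsing and position output.
SELECT TOKENIZE("武汉市长江大桥",
    '"built_in_analyzer"="chinese", "parser_mode"="fine_grained", "support_phrase"="true"');
```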
## Return Value
Returns a `VARCHAR` containing a JSON array of tokenization results. Each element in the array is an object with the following structure:
- `token`: The tokenized term
- `position`: (Optional) The position index of the token when `support_phrase` is enabled
## Examples
### Example 1: Using built-in analyzers
```sql
-- Using the standard analyzer
SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
```
```
[{ "token": "hello" }, { "token": "world" }]
```
```sql
-- Using the english analyzer
SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
```
```
[{ "token": "run" }, { "token": "quick" }]
```
```sql
-- Using the unicode analyzer with Chinese text
SELECT TOKENIZE("Apache Doris数据库", '"built_in_analyzer"="unicode"');
```
```
[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" }, { "token": "库" }]
```
```sql
-- Using the chinese analyzer
SELECT TOKENIZE("我来到北京清华大学", '"built_in_analyzer"="chinese"');
```
```
[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
```
```sql
-- Using the icu analyzer for multilingual text
SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
```
```
[{ "token": "hello" }, { "token": "world" }, { "token": "世界" }]
```
```sql
-- Using the basic analyzer
SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", '"built_in_analyzer"="basic"');
```
```
[{ "token": "get" }, { "token": "images" }, { "token": "hm" }, { "token": "bg" }, { "token": "jpg" }, { "token": "http" }, { "token": "1" }, { "token": "0" }]
```
```sql
-- Using the ik analyzer for Chinese text
SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
```
```
[{ "token": "中华人民共和国" }, { "token": "国歌" }]
```
### Example 2: Using custom analyzers
First, create a custom analyzer:
```sql
CREATE INVERTED INDEX ANALYZER lowercase_delimited
PROPERTIES (
    "tokenizer" = "standard",
    "token_filter" = "asciifolding, lowercase"
);
```
Then use it with `TOKENIZE`:
```sql
SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
```
```
[{ "token": "foo" }, { "token": "bar" }]
```
### Example 3: With phrase support (position information)
```sql
SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", "support_phrase"="true"');
```
```
[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
```
## Notes
1. **Analyzer Configuration**: The `properties` parameter must be a valid property string. If using a custom analyzer, it must be created beforehand using `CREATE INVERTED INDEX ANALYZER`.
2. **Supported Analyzers**: Currently supported built-in analyzers include:
- `standard`: Standard analyzer for general text
- `english`: English language analyzer with stemming
- `chinese`: Chinese text analyzer
- `unicode`: Unicode-based analyzer for multilingual text
- `icu`: ICU-based analyzer for advanced Unicode processing
- `basic`: Basic tokenization
- `ik`: IK analyzer for Chinese text
- `none`: No tokenization (returns original string as single token)
3. **Performance**: The `TOKENIZE` function is primarily intended for testing and debugging analyzer configurations. For production full-text search, use inverted indexes with the `MATCH` or `SEARCH` operators.
4. **JSON Output**: The output is a formatted JSON string that can be further processed using JSON functions if needed.
5. **Compatibility with Inverted Indexes**: The same analyzer configuration used in `TOKENIZE` can be applied to inverted indexes when creating tables:
```sql
CREATE TABLE example (
    content TEXT,
    INDEX idx_content(content) USING INVERTED PROPERTIES("analyzer"="my_analyzer")
);
```
6. **Testing Analyzer Behavior**: Use `TOKENIZE` to preview how text will be tokenized before creating inverted indexes, helping to choose the most appropriate analyzer for your data.
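Because the return value is an ordinary JSON string, it can be fed into Doris JSON functions directly. As a sketch (the `json_extract` path syntax is assumed to follow the MySQL-style `$[index].key` form; verify against your Doris version), the first token could be pulled out like this:

```sql
-- Extract the first token from the tokenization result.
-- '$[0].token' addresses the "token" field of the first array element (assumed path syntax).
SELECT json_extract(
    TOKENIZE("Hello World", '"built_in_analyzer"="standard"'),
    '$[0].token');
```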
## Related Functions
- [MATCH](../../../../sql-manual/basic-element/operators/conditional-operators/full-text-search-operators): Full-text search using inverted indexes
- [SEARCH](../../../../ai/text-search/search-function): Advanced search with DSL support
## Keywords
TOKENIZE, STRING, FULL-TEXT SEARCH, INVERTED INDEX, ANALYZER