The TOKENIZE function tokenizes a string using a specified analyzer and returns the tokenization results as a JSON-formatted string array. This function is particularly useful for understanding how text will be analyzed when using inverted indexes with full-text search capabilities.
```sql
VARCHAR TOKENIZE(VARCHAR str, VARCHAR properties)
```
- `str`: The input string to be tokenized. Type: VARCHAR
- `properties`: A property string specifying the analyzer configuration. Type: VARCHAR

The `properties` parameter supports the following key-value pairs (format: `"key1"="value1", "key2"="value2"`):
| Property | Description | Example Values |
|---|---|---|
| `built_in_analyzer` | Built-in analyzer type | `"english"`, `"chinese"`, `"unicode"`, `"icu"`, `"basic"`, `"ik"`, `"standard"`, `"none"` |
| `analyzer` | Custom analyzer name (created via `CREATE INVERTED INDEX ANALYZER`) | `"my_custom_analyzer"` |
| `parser_mode` | Parser mode (for the chinese analyzer) | `"fine_grained"`, `"coarse_grained"` |
| `support_phrase` | Enable phrase support (stores position information) | `"true"`, `"false"` |
| `lower_case` | Convert tokens to lowercase | `"true"`, `"false"` |
| `char_filter_type` | Character filter type | Varies by filter |
| `stop_words` | Stop words configuration | Varies by implementation |
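Multiple properties can be combined in a single property string. As an illustrative sketch, the `parser_mode` property can be set alongside the chinese analyzer (the exact tokens returned depend on the analyzer's built-in dictionary, so no output is shown):

```sql
-- Fine-grained mode splits compound words into smaller terms (illustrative)
SELECT TOKENIZE("中华人民共和国", '"built_in_analyzer"="chinese", "parser_mode"="fine_grained"');
```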
Returns a VARCHAR containing a JSON array of tokenization results. Each element in the array is an object with the following structure:
- `token`: The tokenized term
- `position`: (Optional) The position index of the token when `support_phrase` is enabled

```sql
-- Using the standard analyzer
SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard"');
```

```json
[{ "token": "hello" }, { "token": "world" }]
```
```sql
-- Using the english analyzer
SELECT TOKENIZE("running quickly", '"built_in_analyzer"="english"');
```

```json
[{ "token": "run" }, { "token": "quick" }]
```

```sql
-- Using the unicode analyzer with Chinese text
SELECT TOKENIZE("Apache Doris数据库", '"built_in_analyzer"="unicode"');
```

```json
[{ "token": "apache" }, { "token": "doris" }, { "token": "数" }, { "token": "据" }, { "token": "库" }]
```

```sql
-- Using the chinese analyzer
SELECT TOKENIZE("我来到北京清华大学", '"built_in_analyzer"="chinese"');
```

```json
[{ "token": "我" }, { "token": "来到" }, { "token": "北京" }, { "token": "清华大学" }]
```

```sql
-- Using the icu analyzer for multilingual text
SELECT TOKENIZE("Hello World 世界", '"built_in_analyzer"="icu"');
```

```json
[{ "token": "hello" }, { "token": "world" }, { "token": "世界" }]
```

```sql
-- Using the basic analyzer
SELECT TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", '"built_in_analyzer"="basic"');
```

```json
[{ "token": "get" }, { "token": "images" }, { "token": "hm" }, { "token": "bg" }, { "token": "jpg" }, { "token": "http" }, { "token": "1" }, { "token": "0" }]
```

```sql
-- Using the ik analyzer for Chinese text
SELECT TOKENIZE("中华人民共和国国歌", '"built_in_analyzer"="ik"');
```

```json
[{ "token": "中华人民共和国" }, { "token": "国歌" }]
```
First, create a custom analyzer:
```sql
CREATE INVERTED INDEX ANALYZER lowercase_delimited
PROPERTIES (
    "tokenizer" = "standard",
    "token_filter" = "asciifolding, lowercase"
);
```
Then use it with TOKENIZE:
```sql
SELECT TOKENIZE("FOO-BAR", '"analyzer"="lowercase_delimited"');
```

```json
[{ "token": "foo" }, { "token": "bar" }]
```
With `support_phrase` enabled, each token also carries its position index:

```sql
SELECT TOKENIZE("Hello World", '"built_in_analyzer"="standard", "support_phrase"="true"');
```

```json
[{ "token": "hello", "position": 0 }, { "token": "world", "position": 1 }]
```
Analyzer Configuration: The properties parameter must be a valid property string. If using a custom analyzer, it must be created beforehand using CREATE INVERTED INDEX ANALYZER.
Supported Analyzers: Currently supported built-in analyzers include:
- `standard`: Standard analyzer for general text
- `english`: English language analyzer with stemming
- `chinese`: Chinese text analyzer
- `unicode`: Unicode-based analyzer for multilingual text
- `icu`: ICU-based analyzer for advanced Unicode processing
- `basic`: Basic tokenization
- `ik`: IK analyzer for Chinese text
- `none`: No tokenization (returns the original string as a single token)

Performance: The TOKENIZE function is primarily intended for testing and debugging analyzer configurations. For production full-text search, use inverted indexes with the MATCH or SEARCH operators.
JSON Output: The output is a formatted JSON string that can be further processed using JSON functions if needed.
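For instance, the token count for a string can be inspected by passing the result to a JSON function. This is a sketch that assumes the `JSON_LENGTH` function is available:

```sql
-- Count how many tokens the standard analyzer produces for a string
SELECT JSON_LENGTH(TOKENIZE("Hello World", '"built_in_analyzer"="standard"')) AS token_count;
```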
Compatibility with Inverted Indexes: The same analyzer configuration used in TOKENIZE can be applied to inverted indexes when creating tables:
```sql
CREATE TABLE example (
    content TEXT,
    INDEX idx_content (content) USING INVERTED PROPERTIES ("analyzer" = "my_analyzer")
);
```
Testing Analyzer Behavior: Use TOKENIZE to preview how text will be tokenized before creating inverted indexes, helping to choose the most appropriate analyzer for your data.
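One way to do this is to run the same sample text through several candidate analyzers side by side (an illustrative query; the resulting token lists differ by analyzer):

```sql
-- Preview the same text under two analyzers to compare tokenization
SELECT
    TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", '"built_in_analyzer"="standard"') AS standard_tokens,
    TOKENIZE("GET /images/hm_bg.jpg HTTP/1.0", '"built_in_analyzer"="basic"')    AS basic_tokens;
```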
TOKENIZE, STRING, FULL-TEXT SEARCH, INVERTED INDEX, ANALYZER