[doc] support ngram_search function (#899) https://github.com/apache/doris/pull/38226
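
The similarity formula documented in this change can be illustrated with a short Python sketch. This is not the Doris implementation; it only mirrors the documented definition: distinct character N-grams, `2 * |Intersection| / (|text set| + |pattern set|)`, and a result of 0 when either input is shorter than `gram_num`. The helper names `ngrams` and `ngram_similarity` are hypothetical, and returning 0 for a non-positive `gram_num` is an assumption the docs do not cover.

```python
def ngrams(s: str, n: int) -> set:
    """Return the set of distinct contiguous n-character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(text: str, pattern: str, gram_num: int) -> float:
    """Sketch of the documented formula; not the actual Doris code."""
    # Documented rule: inputs shorter than gram_num score 0.
    # Assumption (not in the docs): a non-positive gram_num also scores 0.
    if gram_num <= 0 or len(text) < gram_num or len(pattern) < gram_num:
        return 0.0
    text_set = ngrams(text, gram_num)
    pattern_set = ngrams(pattern, gram_num)
    shared = text_set & pattern_set  # N-grams present in both strings
    return 2 * len(shared) / (len(text_set) + len(pattern_set))


# The two examples from the added docs:
print(ngram_similarity("123456789", "12345", 3))    # 0.6
print(ngram_similarity("abababab", "babababa", 2))  # 1.0
```

Because the N-grams are compared as sets, `abababab` and `babababa` both reduce to {"ab", "ba"} and score 1 even though the strings differ, which is the case the docs warn about.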

commit: e676a91adecd4445ff81133e0bb6b4317a898aae
author: Mryange <59914473+Mryange@users.noreply.github.com> Fri Sep 27 20:06:27 2024 +0800
committer: GitHub <noreply@github.com> Fri Sep 27 20:06:27 2024 +0800
tree: 3be0f80dcde9c17ae2d0f067b8db05a19cb1f4d1
parent: 240c0e7ce99a2a080650e2c2e4f4427ef97155de
diff --git a/docs/sql-manual/sql-functions/string-functions/ngram-search.md b/docs/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000..ae42731
--- /dev/null
+++ b/docs/sql-manual/sql-functions/string-functions/ngram-search.md

@@ -0,0 +1,67 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+
+Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity measures how similar two texts are based on the N-grams they share. An N-gram is a contiguous sequence of N characters or words taken from a text string. For example, for the string "text" with N=2, the bigrams are: {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as:
+
+2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical; it only means they have the same set of N-grams.
+
+Only ASCII strings are supported.
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000..1a2eecc
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/string-functions/ngram-search.md

@@ -0,0 +1,67 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
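+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+Both `pattern` and `gram_num` must be constants.
+If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity measures how similar two texts are based on the N-grams they share. An N-gram is a contiguous sequence of N characters or words taken from a text string. For example, for the string "text" with N=2, the bigrams are: {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as 2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical; it only means they have the same set of N-grams.
+
+Only ASCII strings are supported.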
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000..e080165
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md

@@ -0,0 +1,65 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
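+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+Both `pattern` and `gram_num` must be constants.
+If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity measures how similar two texts are based on the N-grams they share. An N-gram is a contiguous sequence of N characters or words taken from a text string. For example, for the string "text" with N=2, the bigrams are: {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as 2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical; it only means they have the same set of N-grams.
+
+Only ASCII strings are supported.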
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000..e080165
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md

@@ -0,0 +1,65 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
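+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+Both `pattern` and `gram_num` must be constants.
+If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity measures how similar two texts are based on the N-grams they share. An N-gram is a contiguous sequence of N characters or words taken from a text string. For example, for the string "text" with N=2, the bigrams are: {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as 2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical; it only means they have the same set of N-grams.
+
+Only ASCII strings are supported.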
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH

diff --git a/sidebars.json b/sidebars.json
index 4f7831a..fc9c971 100644
--- a/sidebars.json
+++ b/sidebars.json

@@ -895,6 +895,7 @@
                                 "sql-manual/sql-functions/string-functions/split-by-regexp",
                                 "sql-manual/sql-functions/string-functions/substring-index",
                                 "sql-manual/sql-functions/string-functions/money-format",
+                                "sql-manual/sql-functions/string-functions/ngram-search",
                                 "sql-manual/sql-functions/string-functions/parse-url",
                                 "sql-manual/sql-functions/string-functions/quote",
                                 "sql-manual/sql-functions/string-functions/url-decode",

diff --git a/versioned_docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md b/versioned_docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000..a39c0f6
--- /dev/null
+++ b/versioned_docs/version-2.1/sql-manual/sql-functions/string-functions/ngram-search.md

@@ -0,0 +1,69 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+
+Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity measures how similar two texts are based on the N-grams they share. An N-gram is a contiguous sequence of N characters or words taken from a text string. For example, for the string "text" with N=2, the bigrams are: {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as:
+
+2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical; it only means they have the same set of N-grams.
+
+Only ASCII strings are supported.
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH

diff --git a/versioned_docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md b/versioned_docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md
new file mode 100644
index 0000000..ae42731
--- /dev/null
+++ b/versioned_docs/version-3.0/sql-manual/sql-functions/string-functions/ngram-search.md

@@ -0,0 +1,67 @@
+---
+{
+    "title": "NGRAM_SEARCH",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Description
+
+Calculates the N-gram similarity between `text` and `pattern`. The result ranges from 0 to 1; a higher value means the two strings are more similar.
+
+Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, the function returns 0.
+
+N-gram similarity measures how similar two texts are based on the N-grams they share. An N-gram is a contiguous sequence of N characters or words taken from a text string. For example, for the string "text" with N=2, the bigrams are: {"te", "ex", "xt"}.
+
+The N-gram similarity is calculated as:
+
+2 * |Intersection| / (|text set| + |pattern set|)
+
+where |text set| and |pattern set| are the sizes of the N-gram sets of `text` and `pattern`, and |Intersection| is the size of the intersection of the two sets.
+
+Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical; it only means they have the same set of N-grams.
+
+Only ASCII strings are supported.
+
+## Syntax
+
+`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
+
+## Example
+
+```sql
+mysql> select ngram_search('123456789' , '12345' , 3);
++---------------------------------------+
+| ngram_search('123456789', '12345', 3) |
++---------------------------------------+
+|                                   0.6 |
++---------------------------------------+
+
+mysql> select ngram_search("abababab","babababa",2);
++-----------------------------------------+
+| ngram_search('abababab', 'babababa', 2) |
++-----------------------------------------+
+|                                       1 |
++-----------------------------------------+
+```
+## keywords
+    NGRAM_SEARCH,NGRAM,SEARCH

diff --git a/versioned_sidebars/version-2.1-sidebars.json b/versioned_sidebars/version-2.1-sidebars.json
index 0746448..7572a6b 100644
--- a/versioned_sidebars/version-2.1-sidebars.json
+++ b/versioned_sidebars/version-2.1-sidebars.json

@@ -840,6 +840,7 @@
                                 "sql-manual/sql-functions/string-functions/split-by-string",
                                 "sql-manual/sql-functions/string-functions/substring-index",
                                 "sql-manual/sql-functions/string-functions/money-format",
+                                "sql-manual/sql-functions/string-functions/ngram-search",
                                 "sql-manual/sql-functions/string-functions/parse-url",
                                 "sql-manual/sql-functions/string-functions/quote",
                                 "sql-manual/sql-functions/string-functions/url-decode",

diff --git a/versioned_sidebars/version-3.0-sidebars.json b/versioned_sidebars/version-3.0-sidebars.json
index f82c6e8..d7a7efc 100644
--- a/versioned_sidebars/version-3.0-sidebars.json
+++ b/versioned_sidebars/version-3.0-sidebars.json

@@ -885,6 +885,7 @@
                                 "sql-manual/sql-functions/string-functions/split-by-string",
                                 "sql-manual/sql-functions/string-functions/substring-index",
                                 "sql-manual/sql-functions/string-functions/money-format",
+                                "sql-manual/sql-functions/string-functions/ngram-search",
                                 "sql-manual/sql-functions/string-functions/parse-url",
                                 "sql-manual/sql-functions/string-functions/quote",
                                 "sql-manual/sql-functions/string-functions/url-decode",