| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="utf_8"> |
| <title>UTF-8 Support</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Impala Functions"/> |
| <data name="Category" value="utf_8"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| <conbody> |
| <p>Impala has traditionally treated STRING data as a sequence of single bytes, with character |
| data assumed to be encoded in ASCII. Prior to this release, some Impala functions were |
| incompatible with Hive when applied to non-ASCII strings. For example, length() in Impala used |
| to return the number of bytes in the string, while length() in Hive returns the number of UTF-8 |
| characters. UTF-8 characters (code points) are encoded in a variable number of bytes (1 to 4), |
| so the results differ when the string contains non-ASCII characters. This release adds UTF-8 |
| aware behavior for the Impala STRING type, enabled through a query option, so that results on |
| UTF-8 strings are consistent with Hive.</p> |
| <p>UTF-8 support allows you to read and write UTF-8 data in standard formats such as Parquet and |
| ORC, which improves interoperability with other engines that support those formats.</p> |
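| <p>As a minimal sketch of the difference described above, the following statements compare |
| length() with the UTF-8 aware behavior off and on. The UTF8_MODE query option is described in |
| the next section. The string literal is only an illustrative value, chosen because each of its |
| two characters occupies 3 bytes in UTF-8.</p> |
| <codeblock>-- Byte-oriented behavior: length() counts bytes. |
| SET UTF8_MODE=false; |
| SELECT length('你好');  -- 6 |
| -- UTF-8 aware behavior: length() counts characters, as Hive does. |
| SET UTF8_MODE=true; |
| SELECT length('你好');  -- 2 |
| </codeblock> |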
| </conbody> |
| <concept id="turning_ON"> |
| <title>Turning ON the UTF-8 behavior</title> |
| <conbody> |
| <p>You can use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The |
| query option can be set globally or at the session level. Only queries run with UTF8_MODE=true |
| have UTF-8 aware behavior.</p> |
| <p> |
| <note> |
| <ul id="ul_vs2_qrx_p5b"> |
| <li>If the query option UTF8_MODE is turned on globally, existing queries that depend on |
| the original binary behavior need to explicitly set UTF8_MODE=false.</li> |
| <li>Deploy Impala daemons on nodes that use the same glibc version, because different |
| glibc versions support different versions of the Unicode standard. Also ensure that |
| the en_US.UTF-8 locale is installed on every node. Mismatched glibc versions might |
| result in inconsistent UTF-8 behavior when UTF8_MODE is set to true.</li> |
| </ul> |
| </note></p> |
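| <p>For illustration, a session-level setting might look like the following sketch, run from |
| impala-shell or any client that accepts SET statements. The global default is assumed here to |
| be configured through the impalad startup flag --default_query_options; verify the exact |
| mechanism against your own deployment.</p> |
| <codeblock>-- Turn on UTF-8 aware behavior for the current session only. |
| SET UTF8_MODE=true; |
| -- Opt a session back out when the option is enabled globally and a query |
| -- depends on the original byte-oriented behavior. |
| SET UTF8_MODE=false; |
| -- Assumed global default, passed as an impalad startup flag rather than as SQL: |
| --   --default_query_options=utf8_mode=true |
| </codeblock> |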
| </conbody> |
| </concept> |
| <concept id="list_string_functions"> |
| <title>List of STRING Functions</title> |
| <conbody> |
| <p>When UTF8_MODE is set to true, the following string functions become UTF-8 aware:</p> |
| <ul> |
| <li>LENGTH(STRING a)<ul id="ul_jgr_x1l_gtb"> |
| <li>returns the number of UTF-8 characters instead of bytes</li> |
| </ul></li> |
| <li>SUBSTR(STRING a, INT start [, INT len])</li> |
| <li>SUBSTRING(STRING a, INT start [, INT len])<ul id="ul_tkh_x1l_gtb"> |
| <li>the substring start position and length are counted in UTF-8 characters instead of |
| bytes</li> |
| </ul></li> |
| <li>REVERSE(STRING a)<ul id="ul_o1d_jbl_gtb"> |
| <li>the unit of the operation is a UTF-8 character, that is, it does not reverse the bytes |
| inside a UTF-8 character.<p> |
| <note>The result of reverse("最快的SQL引擎") used to be "��敼�LQS��竿倜�" and is now |
| "擎引LQS的快最".</note></p></li> |
| </ul></li> |
| <li>INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])</li> |
| <li>LOCATE(STRING substr, STRING str[, INT pos])<ul id="ul_y1p_sbl_gtb"> |
| <li>These functions take an optional position argument, and their return values are also |
| positions in the string. In UTF-8 mode, these positions are counted in UTF-8 characters |
| instead of bytes.</li> |
| </ul></li> |
| <li>mask functions<ul id="ul_qmg_5bl_gtb"> |
| <li>The unit of the operation is a UTF-8 character, that is, they do not mask the string |
| byte by byte.</li> |
| </ul></li> |
| <li>upper/lower/initcap<ul id="ul_x3c_wbl_gtb"> |
| <li>These functions recognize non-ASCII characters and transform them based on the |
| current locale used by the Impala process.</li> |
| </ul></li> |
| </ul> |
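| <p>The following sketch illustrates the character-based positions listed above, reusing the |
| string from the REVERSE note. The values in the comments follow from counting by hand (the |
| three leading characters occupy 9 bytes, so SQL starts at character 4 but at byte 10); they are |
| not captured output.</p> |
| <codeblock>SET UTF8_MODE=true; |
| SELECT substr('最快的SQL引擎', 4, 3);   -- 'SQL' (characters 4 through 6) |
| SELECT instr('最快的SQL引擎', 'SQL');   -- 4 |
| SELECT locate('SQL', '最快的SQL引擎');  -- 4 |
| SELECT reverse('最快的SQL引擎');        -- '擎引LQS的快最' |
| -- The same calls with byte-oriented counting: |
| SET UTF8_MODE=false; |
| SELECT substr('最快的SQL引擎', 4, 3);   -- '快' (bytes 4 through 6) |
| SELECT instr('最快的SQL引擎', 'SQL');   -- 10 |
| </codeblock> |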
| </conbody> |
| </concept> |
| <concept id="limitations"> |
| <title>Limitations</title> |
| <conbody> |
| <ul id="ul_dhh_dcl_gtb"> |
| <li>Use the UTF8_MODE option only when needed, because the performance of UTF-8 mode is not yet |
| optimized. It is currently an experimental feature.</li> |
| <li>UTF-8 support for the CHAR and VARCHAR types is not yet implemented, so VARCHAR(N) still |
| returns N bytes instead of N UTF-8 characters.</li> |
| </ul> |
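| <p>To make the VARCHAR limitation concrete, consider the following sketch. Based on the |
| limitation stated above, the cast is expected to keep 3 bytes, which is only the first |
| character of the illustrative literal, even with UTF8_MODE=true. The result is not shown |
| because it has not been verified against a running cluster.</p> |
| <codeblock>SET UTF8_MODE=true; |
| -- VARCHAR(3) is sized in bytes, not characters. |
| -- Each character in the literal below occupies 3 bytes in UTF-8. |
| SELECT CAST('你好' AS VARCHAR(3)); |
| </codeblock> |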
| </conbody> |
| </concept> |
| </concept> |