 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="utf_8">
  <title>UTF-8 Support</title>
  <prolog>
   <metadata>
    <data name="Category" value="Impala"/>
    <data name="Category" value="Impala Functions"/>
    <data name="Category" value="utf_8"/>
    <data name="Category" value="Developers"/>
    <data name="Category" value="Data Analysts"/>
   </metadata>
  </prolog>
  <conbody>
   <p>Impala has traditionally offered a single-byte binary character set for the STRING data type,
    with character data treated as ASCII-encoded. Prior to this release, Impala was
    incompatible with Hive in some functions applied to non-ASCII strings. For example, length() in
    Impala used to return the number of bytes in the string, while length() in Hive returns the
    number of UTF-8 characters in the string. UTF-8 characters (code points) are encoded in a
    variable number of bytes (1 to 4 bytes), so the results differ when the string contains
    non-ASCII characters. This release adds a query option that makes the Impala STRING type UTF-8
    aware, giving behavior consistent with Hive on UTF-8 strings.</p>
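   <p>As a rough illustration of the byte-versus-character difference, the following sketch assumes
    an impala-shell session and a two-character sample string chosen for this example. Each of the
    two characters occupies 3 bytes in UTF-8, so the byte count is 6 while the character count is
    2. The UTF8_MODE query option used here is described in the next section.</p>
   <codeblock>-- Sample literal chosen for illustration; not taken from any real table.
SET UTF8_MODE=false;
SELECT length('你好');  -- counts bytes: 6

SET UTF8_MODE=true;
SELECT length('你好');  -- counts UTF-8 characters: 2</codeblock>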
   <p>UTF-8 support allows you to read and write UTF-8 from standard formats like Parquet and ORC,
    thus improving interoperability with other engines that also support those standard formats.</p>
  </conbody>
  <concept id="turning_ON">
   <title>Turning ON the UTF-8 behavior</title>
   <conbody>
    <p>You can use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The
     query option can be set globally or at the session level. Only queries that run with
     UTF8_MODE=true have UTF-8 aware behavior.</p>
    <p>
     <note>
           <ul id="ul_vs2_qrx_p5b">
             <li>If the query option UTF8_MODE is turned on globally, existing queries that depend on
               the original binary behavior need to explicitly set UTF8_MODE=false.</li>
             <li>Deploy Impala daemons on nodes that use the same glibc version, because different
               glibc versions support different versions of the Unicode standard. Also ensure
               that the en_US.UTF-8 locale is installed on every node. Mismatched glibc
               versions might result in inconsistent UTF-8 behavior when UTF8_MODE is set to
               true.</li>
           </ul>
         </note></p>
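    <p>A minimal sketch of both ways to set the option. The session-level statements use the SET
     command in impala-shell. For the global case, this sketch assumes the option is supplied
     through the impalad startup flag default_query_options; verify the exact mechanism against
     your own deployment.</p>
    <codeblock>-- Session level: affects only queries issued in this session.
SET UTF8_MODE=true;

-- List the query options currently in effect for the session.
SET;

-- Return to the original byte-based behavior for this session.
SET UTF8_MODE=false;

-- Global level (assumed impalad startup flag, set outside SQL):
--   --default_query_options=utf8_mode=true</codeblock>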
   </conbody>
  </concept>
  <concept id="list_string_functions">
   <title>List of STRING Functions</title>
   <conbody>
    <p>Setting UTF8_MODE=true turns on the UTF-8 aware behavior of the following string
     functions:</p>
    <ul>
     <li>LENGTH(STRING a)<ul id="ul_jgr_x1l_gtb">
       <li>returns the number of UTF-8 characters instead of bytes</li>
      </ul></li>
     <li>SUBSTR(STRING a, INT start [, INT len])</li>
     <li>SUBSTRING(STRING a, INT start [, INT len])<ul id="ul_tkh_x1l_gtb">
       <li>the substring start position and length are counted in UTF-8 characters instead of
        bytes</li>
      </ul></li>
     <li>REVERSE(STRING a)<ul id="ul_o1d_jbl_gtb">
       <li>the unit of the operation is a UTF-8 character, i.e. it won't reverse bytes inside a UTF-8
        character.<p>
         <note>The result of reverse("最快的SQL引擎") used to be "��敼�LQS��竿倜�" and is now
          "擎引LQS的快最".</note></p></li>
      </ul></li>
     <li>INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])</li>
     <li>LOCATE(STRING substr, STRING str[, INT pos])<ul id="ul_y1p_sbl_gtb">
       <li>These functions have an optional position argument. The return values are also positions
        in the string. In UTF-8 mode, these positions are counted by UTF-8 characters instead of
        bytes.</li>
      </ul></li>
     <li>Mask functions<ul id="ul_qmg_5bl_gtb">
       <li>The unit of the operation is a UTF-8 character, i.e. they won't mask the string
        byte by byte.</li>
      </ul></li>
     <li>UPPER/LOWER/INITCAP<ul id="ul_x3c_wbl_gtb">
       <li>These functions recognize non-ASCII characters and transform them based on the
        current locale used by the Impala process.</li>
      </ul></li>
    </ul>
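    <p>The following sketch, which assumes an impala-shell session, applies several of these
     functions to the sample string from the REVERSE note above. The string has 8 characters and
     18 bytes (five 3-byte characters plus three ASCII letters). The commented results follow from
     those counts rather than from a recorded run.</p>
    <codeblock>SET UTF8_MODE=true;
SELECT length('最快的SQL引擎');        -- 8 characters
SELECT substr('最快的SQL引擎', 4, 3);  -- 'SQL': start and length counted in characters
SELECT instr('最快的SQL引擎', 'SQL');  -- 4: position counted in characters
SELECT reverse('最快的SQL引擎');       -- '擎引LQS的快最'

SET UTF8_MODE=false;
SELECT length('最快的SQL引擎');        -- 18 bytes
SELECT substr('最快的SQL引擎', 4, 3);  -- '快': bytes 4 to 6
SELECT instr('最快的SQL引擎', 'SQL');  -- 10: 'S' is the 10th byte</codeblock>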
   </conbody>
  </concept>
  <concept id="limitations">
   <title>Limitations</title>
   <conbody>
    <ul id="ul_dhh_dcl_gtb">
     <li>Use the UTF8_MODE option only when needed, because the performance of UTF-8 mode is not
      optimized yet. It is currently an experimental feature.</li>
     <li>UTF-8 support for the CHAR and VARCHAR types is not implemented yet, so VARCHAR(N) still
      returns N bytes instead of N UTF-8 characters.</li>
    </ul>
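    <p>As an illustration of the VARCHAR limitation, the sketch below casts a hypothetical
     four-character sample string to VARCHAR(6). Assuming the limitation applies as stated, the
     length is enforced in bytes, so only the first two 3-byte characters would be kept even with
     UTF8_MODE=true.</p>
    <codeblock>SET UTF8_MODE=true;
-- '你好世界' is 4 characters but 12 bytes in UTF-8.
SELECT CAST('你好世界' AS VARCHAR(6));  -- expected '你好' (6 bytes), not all 4 characters</codeblock>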
   </conbody>
  </concept>
 </concept>