<?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="string">

   <title>STRING Data Type</title>
   <titlealts audience="PDF"><navtitle>STRING</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Impala Data Types"/>
       <data name="Category" value="SQL"/>
       <data name="Category" value="Data Analysts"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Schemas"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       A data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements.
     </p>

     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

     <p>
       In the column definition of a <codeph>CREATE TABLE</codeph> statement:
     </p>

 <codeblock><varname>column_name</varname> STRING</codeblock>

     <p>
       <b>Length:</b> Maximum of 32,767 bytes. Do not specify a length constraint when declaring
       <codeph>STRING</codeph> columns, as you might for <codeph>VARCHAR</codeph>,
       <codeph>CHAR</codeph>, or similar column types in other relational database systems. <ph rev="2.0.0">If you
       need to manipulate string values with precise or maximum lengths, in Impala 2.0 and higher you can declare
       columns as <codeph>VARCHAR(<varname>max_length</varname>)</codeph> or
       <codeph>CHAR(<varname>length</varname>)</codeph>, but for best performance use <codeph>STRING</codeph>
       where practical.</ph>
     </p>
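
     <p>
       As a minimal sketch of the choice described above, the following statement declares one unconstrained
       <codeph>STRING</codeph> column alongside length-constrained <codeph>CHAR</codeph> and
       <codeph>VARCHAR</codeph> columns. The table and column names are hypothetical, and the
       <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> columns assume Impala 2.0 or higher:
     </p>

 <codeblock>-- Hypothetical table; all names are for illustration only.
 CREATE TABLE string_length_demo
 (
   notes STRING,          -- No length constraint; preferred where practical.
   country_code CHAR(2),  -- Fixed length.
   nickname VARCHAR(32)   -- Maximum length.
 );
 </codeblock>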

     <p>
       <b>Character sets:</b> For full support in all Impala subsystems, restrict string values to the ASCII
       character set. Although some UTF-8 character data can be stored in Impala and retrieved through queries,
       UTF-8 strings containing non-ASCII characters are not guaranteed to work properly with many SQL features,
       including but not limited to:
     </p>
     <ul>
       <li>
         String manipulation functions.
       </li>
       <li>
         Comparison operators.
       </li>
       <li>
         The <codeph>ORDER BY</codeph> clause.
       </li>
       <li>
         Values in partition key columns.
       </li>
     </ul>

     <p>
       Impala does not store national language metadata with the table definition, such as collation order or
       the interpretation of extended ASCII variants such as the ISO-8859-1 or ISO-8859-2 encodings. If you need
       to sort, manipulate, or display data according to those national language characteristics, implement
       that logic on the application side.
     </p>

     <p>
       <b>Conversions:</b>
     </p>

     <ul>
       <li>
         <p>
           Impala does not automatically convert <codeph>STRING</codeph> to any numeric type. Impala does
           automatically convert <codeph>STRING</codeph> to <codeph>TIMESTAMP</codeph> if the value matches one of
           the accepted <codeph>TIMESTAMP</codeph> formats; see <xref href="impala_timestamp.xml#timestamp"/> for
           details.
         </p>
       </li>

       <li>
         <p>
           You can use <codeph>CAST()</codeph> to convert <codeph>STRING</codeph> values to
           <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, <codeph>INT</codeph>, <codeph>BIGINT</codeph>,
           <codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, or <codeph>TIMESTAMP</codeph>.
         </p>
       </li>

       <li>
         <p>
           You cannot directly cast a <codeph>STRING</codeph> value to <codeph>BOOLEAN</codeph>. Instead, use a
           <codeph>CASE</codeph> expression to evaluate string values such as <codeph>'T'</codeph>,
           <codeph>'true'</codeph>, and so on, and return the Boolean value <codeph>true</codeph> or
           <codeph>false</codeph> as appropriate.
         </p>
       </li>

       <li>
         <p>
           You can cast a <codeph>BOOLEAN</codeph> value to <codeph>STRING</codeph>, returning <codeph>'1'</codeph>
           for <codeph>true</codeph> values and <codeph>'0'</codeph> for <codeph>false</codeph> values.
         </p>
       </li>
     </ul>
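
     <p>
       The conversion rules above can be sketched as follows. The literal values, and the set of strings treated
       as true in the <codeph>CASE</codeph> expression, are illustrative assumptions rather than a fixed
       convention:
     </p>

 <codeblock>-- Explicit casts from STRING to numeric and TIMESTAMP types.
 SELECT CAST('42' AS INT), CAST('3.14' AS DOUBLE), CAST('2015-04-09 14:07:46' AS TIMESTAMP);

 -- STRING is not cast directly to BOOLEAN; map the accepted spellings with CASE.
 SELECT CASE WHEN lower('T') IN ('t', 'true', 'y', 'yes') THEN true ELSE false END;

 -- BOOLEAN to STRING is allowed.
 SELECT CAST(true AS STRING);
 </codeblock>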

     <p conref="../shared/impala_common.xml#common/partitioning_blurb"/>

     <p>
       Although it might be convenient to use <codeph>STRING</codeph> columns for partition keys, even when those
       columns contain numbers, it is much better for performance and scalability to use numeric columns as
       partition keys whenever practical. Although the underlying HDFS directory name might be the same in either
       case, the in-memory storage for the partition key columns is more compact, and computations are faster, if
       partition key columns such as <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>
       are declared with integer types such as <codeph>INT</codeph> or <codeph>SMALLINT</codeph>.
     </p>
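
     <p>
       For instance, a table partitioned by date components might declare its partition key columns with
       integer types rather than <codeph>STRING</codeph>. This is a sketch only; the table name, column names,
       and filter values are hypothetical:
     </p>

 <codeblock>-- Hypothetical table with numeric partition key columns.
 CREATE TABLE logs_by_day (msg STRING)
   PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT);

 -- Filters on the partition keys compare numbers, not quoted strings.
 SELECT count(*) FROM logs_by_day WHERE year = 2015 AND month = 4;
 </codeblock>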

     <p conref="../shared/impala_common.xml#common/zero_length_strings"/>

 <!-- <p conref="../shared/impala_common.xml#common/hbase_blurb"/> -->

 <!-- <p conref="../shared/impala_common.xml#common/parquet_blurb"/> -->

     <p conref="../shared/impala_common.xml#common/text_bulky"/>

     <p><b>Avro considerations:</b></p>
     <p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>

 <!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->

 <!-- <p conref="../shared/impala_common.xml#common/internals_blurb"/> -->

 <!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> -->

     <p conref="../shared/impala_common.xml#common/column_stats_variable"/>

     <p conref="../shared/impala_common.xml#common/example_blurb"/>

     <p>
       The following examples demonstrate single-quoted and double-quoted string literals, and how to escape
       quotation marks within string literals:
     </p>

 <codeblock>SELECT 'I am a single-quoted string';
 SELECT "I am a double-quoted string";
 SELECT 'I\'m a single-quoted string with an apostrophe';
 SELECT "I\'m a double-quoted string with an apostrophe";
 SELECT 'I am a "short" single-quoted string containing quotes';
 SELECT "I am a \"short\" double-quoted string containing quotes";
 </codeblock>

     <p>
       The following examples call string manipulation functions to concatenate strings, convert
       numbers to strings, and extract substrings:
     </p>

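 <codeblock>-- Concatenate two literals with a number converted to a string.
 SELECT CONCAT("Once upon a time, there were ", CAST(3 AS STRING), ' little pigs.');
 -- Extract 5 characters starting at position 7; returns 'world'.
 SELECT SUBSTR("hello world",7,5);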
 </codeblock>

     <p>
       The following examples show how to perform operations on <codeph>STRING</codeph> columns within a table:
     </p>

 <codeblock>CREATE TABLE t1 (s1 STRING, s2 STRING);
 INSERT INTO t1 VALUES ("hello", 'world'), (CAST(7 AS STRING), "wonders");
 SELECT s1, s2, length(s1) FROM t1 WHERE s2 LIKE 'w%';
 </codeblock>

     <p conref="../shared/impala_common.xml#common/related_info"/>

     <p>
       <xref href="impala_literals.xml#string_literals"/>, <xref href="impala_char.xml#char"/>,
       <xref href="impala_varchar.xml#varchar"/>, <xref href="impala_string_functions.xml#string_functions"/>,
       <xref href="impala_datetime_functions.xml#datetime_functions"/>
     </p>
   </conbody>
 </concept>