| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="string"> |
| |
| <title>STRING Data Type</title> |
| <titlealts audience="PDF"><navtitle>STRING</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Impala Data Types"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Schemas"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| <p> |
| A data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER |
| TABLE</codeph> statements. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <p> |
| In the column definition of a <codeph>CREATE TABLE</codeph> and |
| <codeph>ALTER TABLE</codeph> statements: |
| </p> |
| |
| <codeblock><varname>column_name</varname> STRING</codeblock> |
| |
| <p> |
| <b>Length:</b> |
| </p> |
| <p><ph rev="2.0.0"> |
| If you need to manipulate string values with precise or |
| maximum lengths, in Impala 2.0 and higher you can declare columns as |
| <codeph>VARCHAR(<varname>max_length</varname>)</codeph> or |
| <codeph>CHAR(<varname>length</varname>)</codeph>, but for best |
| performance use <codeph>STRING</codeph> where practical.</ph> |
| </p> |
| |
| <p> |
| Take the following considerations for <codeph>STRING</codeph> |
| lengths: |
| </p> |
| |
| <ul> |
| <li> |
| The hard limit on the size of a <codeph>STRING</codeph> and the total |
| size of a row is 2 GB. |
| |
| <p> |
| If a query tries to process or create a string |
| larger than this limit, it will return an error to the user. |
| </p> |
| </li> |
| |
| <li> |
| The limit is 1 GB on <codeph>STRING</codeph> when writing to Parquet |
| files. |
| </li> |
| |
| <li> |
| Queries operating on strings with 32 KB or less will work reliably and |
| will not hit significant performance or memory problems (unless you have |
| very complex queries, very many columns, etc.) |
| </li> |
| |
| <li> |
| Performance and memory consumption may degrade with strings larger |
| than 32 KB. |
| </li> |
| |
| <li> |
| The row size, i.e. the total size of all string and other columns, is |
| subject to lower limits at various points in query execution that |
| support spill-to-disk. A few examples for lower row size limits are: |
| |
| <ul> |
| <li> |
| Rows coming from the right side of any hash join |
| </li> |
| |
| <li> |
| Rows coming from either side of a hash join that spills to disk |
| </li> |
| |
| <li> |
| Rows being sorted by the <codeph>SORT</codeph> operator without a |
| limit |
| </li> |
| |
| <li> |
| Rows in a grouping aggregation |
| </li> |
| |
| </ul> |
| |
| <p> |
| In <keyword keyref="impala29"/> and lower, the default limit of |
| the row size in the above cases is 8 MB. |
| </p> |
| |
| <p> |
| In <keyword |
| keyref="impala210"/> and higher, the max row size is configurable on |
| a per-query basis with the <codeph>MAX_ROW_SIZE</codeph> query option. |
| Rows up to <codeph>MAX_ROW_SIZE</codeph> (which defaults to 512 KB) |
| can always be processed in the above cases. Rows larger than |
| <codeph>MAX_ROW_SIZE</codeph> are processed on a best-effort basis. |
| See <keyword keyref="max_row_size">MAX_ROW_SIZE</keyword> for more |
| details. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| <b>Character sets:</b> |
| </p> |
| |
| <p> |
| For full support in all Impala subsystems, restrict string values to the |
| ASCII character set. Although some UTF-8 character data can be stored in |
| Impala and retrieved through queries, UTF-8 strings containing non-ASCII |
| characters are not guaranteed to work properly in combination with many |
| SQL aspects, including but not limited to: |
| </p> |
| |
| <ul> |
| <li>CHAR/VARCHAR truncating/padding.</li> |
| |
| <li> |
| Comparison operators. |
| </li> |
| |
| <li> |
| The <codeph>ORDER BY</codeph> clause. |
| </li> |
| |
| <li> Values in partition key columns. |
| </li> |
| </ul> |
| |
| <p> |
| For any national language aspects such as collation order or |
| interpreting extended ASCII variants such as ISO-8859-1 or ISO-8859-2 |
| encodings, Impala does not include such metadata with the table |
| definition. If you need to sort, manipulate, or display data depending on |
| those national language characteristics of string data, use logic on the |
| application side. |
| </p> |
| <p>If you just need Hive-compatible string function behaviors on UTF-8 encoded strings, turn on |
| the query option UTF8_MODE. See more in <xref href="impala_utf_8.xml"/>.</p> |
| <p> |
| <b>Conversions:</b> |
| </p> |
| <ul> |
| <li> |
| <p> |
| Impala does not automatically convert <codeph>STRING</codeph> to any |
| numeric type. Impala does automatically convert |
| <codeph>STRING</codeph> to <codeph>TIMESTAMP</codeph> if the value |
| matches one of the accepted <codeph>TIMESTAMP</codeph> formats; see |
| <xref href="impala_timestamp.xml#timestamp"/> for details. |
| </p> |
| </li> |
| <li> |
| <p> |
| You can use <codeph>CAST()</codeph> to convert |
| <codeph>STRING</codeph> values to <codeph>TINYINT</codeph>, |
| <codeph>SMALLINT</codeph>, <codeph>INT</codeph>, |
| <codeph>BIGINT</codeph>, <codeph>FLOAT</codeph>, |
| <codeph>DOUBLE</codeph>, or <codeph>TIMESTAMP</codeph>. |
| </p> |
| </li> |
| <li> |
| <p> |
| You cannot directly cast a <codeph>STRING</codeph> value to |
| <codeph>BOOLEAN</codeph>. You can use a <codeph>CASE</codeph> |
| expression to evaluate string values such as <codeph>'T'</codeph>, |
| <codeph>'true'</codeph>, and so on and return Boolean |
| <codeph>true</codeph> and <codeph>false</codeph> values as |
| appropriate. |
| </p> |
| </li> |
| <li> |
| <p> |
| You can cast a <codeph>BOOLEAN</codeph> value to |
| <codeph>STRING</codeph>, returning <codeph>'1'</codeph> for |
| <codeph>true</codeph> values and <codeph>'0'</codeph> for |
| <codeph>false</codeph> values. |
| </p> |
| </li> |
| </ul> |
| <p conref="../shared/impala_common.xml#common/partitioning_blurb"/> |
| <p> |
| Although it might be convenient to use <codeph>STRING</codeph> columns |
| for partition keys, even when those columns contain numbers, for |
| performance and scalability it is much better to use numeric columns as |
| partition keys whenever practical. Although the underlying HDFS directory |
| name might be the same in either case, the in-memory storage for the |
| partition key columns is more compact, and computations are faster, if |
| partition key columns such as <codeph>YEAR</codeph>, |
| <codeph>MONTH</codeph>, <codeph>DAY</codeph> and so on are declared as |
| <codeph>INT</codeph>, <codeph>SMALLINT</codeph>, and so on. |
| </p> |
| <p conref="../shared/impala_common.xml#common/zero_length_strings"/> |
| <!-- <p conref="../shared/impala_common.xml#common/hbase_blurb"/> --> |
| <!-- <p conref="../shared/impala_common.xml#common/parquet_blurb"/> --> |
| <p conref="../shared/impala_common.xml#common/text_bulky"/> |
| <p><b>Avro considerations:</b></p> |
| <p conref="../shared/impala_common.xml#common/avro_2gb_strings"/> |
| <!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> --> |
| <!-- <p conref="../shared/impala_common.xml#common/internals_blurb"/> --> |
| <!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> --> |
| <p conref="../shared/impala_common.xml#common/column_stats_variable"/> |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| <p> |
| The following examples demonstrate double-quoted and single-quoted |
| string literals, and required escaping for quotation marks within string |
| literals: |
| </p> |
| <codeblock>SELECT 'I am a single-quoted string'; |
| SELECT "I am a double-quoted string"; |
| SELECT 'I\'m a single-quoted string with an apostrophe'; |
| SELECT "I\'m a double-quoted string with an apostrophe"; |
| SELECT 'I am a "short" single-quoted string containing quotes'; |
| SELECT "I am a \"short\" double-quoted string containing quotes"; |
| </codeblock> |
| <p> |
| The following examples demonstrate calls to string manipulation |
| functions to concatenate strings, convert numbers to strings, or pull out |
| substrings: |
| </p> |
| <codeblock>SELECT CONCAT("Once upon a time, there were ", CAST(3 AS STRING), ' little pigs.'); |
| SELECT SUBSTR("hello world",7,5); |
| </codeblock> |
| <p> |
| The following examples show how to perform operations on |
| <codeph>STRING</codeph> columns within a table: |
| </p> |
| <codeblock>CREATE TABLE t1 (s1 STRING, s2 STRING); |
| INSERT INTO t1 VALUES ("hello", 'world'), (CAST(7 AS STRING), "wonders"); |
| SELECT s1, s2, length(s1) FROM t1 WHERE s2 LIKE 'w%'; |
| </codeblock> |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| <p> |
| <xref href="impala_literals.xml#string_literals"/>, <xref |
| href="impala_char.xml#char"/>, <xref href="impala_varchar.xml#varchar" |
| />, <xref href="impala_string_functions.xml#string_functions"/>, <xref |
| href="impala_datetime_functions.xml#datetime_functions"/> |
| </p> |
| </conbody> |
| </concept> |