blob: 920481293d1e203047b563ce62ee62cd21b16fc9 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="char" rev="2.0.0">
<title>CHAR Data Type (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>CHAR</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Data Types"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p rev="2.0.0">
<indexterm audience="hidden">CHAR data type</indexterm>
A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If
values are longer than the specified length, Impala truncates any trailing characters.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
</p>
<codeblock><varname>column_name</varname> CHAR(<varname>length</varname>)</codeblock>
<p>
The maximum length you can specify is 255.
</p>
<p>
<b>Semantics of trailing spaces:</b>
</p>
<ul>
<li>
When you store a <codeph>CHAR</codeph> value shorter than the specified length in a table, queries return
the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
the column definition.
</li>
<li>
If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those trailing spaces are
not stored in the data file. When the value is retrieved by a query, the result could have a different
number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
specified length of the column.
</li>
<li>
If you compare two <codeph>CHAR</codeph> values that differ only in the number of trailing spaces, those
values are considered identical.
</li>
</ul>
<p conref="../shared/impala_common.xml#common/partitioning_bad"/>
<p conref="../shared/impala_common.xml#common/hbase_no"/>
<p conref="../shared/impala_common.xml#common/parquet_blurb"/>
<ul>
<li>
This type can be read from and written to Parquet files.
</li>
<li>
There is no requirement for a particular level of Parquet.
</li>
<li>
Parquet files generated by Impala and containing this type can be freely interchanged with other components
such as Hive and MapReduce.
</li>
<li>
Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
</li>
<li>
Parquet data files might contain values that are longer than allowed by the
<codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra trailing characters when
it processes those values during a query.
</li>
</ul>
<p conref="../shared/impala_common.xml#common/text_blurb"/>
<p>
Text data files might contain values that are longer than allowed for a particular
<codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are ignored when Impala
processes those values during a query. Text data files can also contain values that are shorter than the
defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
files produced by Impala <codeph>INSERT</codeph> statements do not include any trailing blanks for
<codeph>CHAR</codeph> columns.
</p>
<p><b>Avro considerations:</b></p>
<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
<p>
This type is available using <keyword keyref="impala20_full"/> or higher.
</p>
<p>
Some other database systems make the length specification optional. For Impala, the length is required.
</p>
<!--
<p>
The Impala maximum length is larger than for the <codeph>CHAR</codeph> data type in Hive.
If a Hive query encounters a <codeph>CHAR</codeph> value longer than 255 during processing,
it silently treats the value as length 255.
</p>
-->
<p conref="../shared/impala_common.xml#common/internals_max_bytes"/>
<p conref="../shared/impala_common.xml#common/added_in_20"/>
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
<!-- Seems like a logical design decision but don't think it's currently implemented like this.
<p>
Because both the maximum and average length are always known and always the same for
any given <codeph>CHAR(<varname>n</varname>)</codeph> column, those fields are always filled
in for <codeph>SHOW COLUMN STATS</codeph> output, even before you run
<codeph>COMPUTE STATS</codeph> on the table.
</p>
-->
<p conref="../shared/impala_common.xml#common/udf_blurb_no"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
These examples show how trailing spaces are not considered significant when comparing or processing
<codeph>CHAR</codeph> values. <codeph>CAST()</codeph> truncates any longer string to fit within the defined
length. If a <codeph>CHAR</codeph> value is shorter than the specified length, it is padded on the right with
spaces until it matches the specified length. Therefore, <codeph>LENGTH()</codeph> represents the length
including any trailing spaces, and <codeph>CONCAT()</codeph> also treats the column value as if it has
trailing spaces.
</p>
<codeblock>select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded";
+--------------------------+
| unpadded equal to padded |
+--------------------------+
| true |
+--------------------------+
create table char_length(c char(3));
insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
select concat("[",c,"]") as c, length(c) from char_length;
+-------+-----------+
| c | length(c) |
+-------+-----------+
| [1 ] | 3 |
| [12 ] | 3 |
| [123] | 3 |
| [123] | 3 |
+-------+-----------+
</codeblock>
<p>
This example shows a case where data values are known to have a specific length, where <codeph>CHAR</codeph>
is a logical data type to use.
<!--
Because all the <codeph>CHAR</codeph> values have a constant predictable length,
Impala can efficiently analyze how best to use these values in join queries,
aggregation queries, and other contexts where column length is significant.
-->
</p>
<codeblock>create table addresses
(id bigint,
street_name string,
state_abbreviation char(2),
country_abbreviation char(2));
</codeblock>
<p>
The following example shows how values written by Impala do not physically include the trailing spaces. It
creates a table using text format, with <codeph>CHAR</codeph> values much shorter than the declared length,
and then prints the resulting data file to show that the delimited values are not separated by spaces. The
same behavior applies to binary-format Parquet data files.
</p>
<codeblock>create table char_in_text (a char(20), b char(30), c char(40))
row format delimited fields terminated by ',';
insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
-- Running this Linux command inside impala-shell using the ! shortcut.
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
foo,bar,baz
hello,goodbye,aloha
</codeblock>
<p>
The following example further illustrates the treatment of spaces. It replaces the contents of the previous
table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved
within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the
leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
</p>
<codeblock>insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40)));
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
trailing, leading and trailing, leading
select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
+------------------------+----------------------------------+--------------------------------------------+
| a | b | c |
+------------------------+----------------------------------+--------------------------------------------+
| [trailing ] | [ leading and trailing ] | [ leading ] |
+------------------------+----------------------------------+--------------------------------------------+
</codeblock>
<p conref="../shared/impala_common.xml#common/kudu_blurb"/>
<p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p>
Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
scalability reasons avoid declaring <codeph>CHAR</codeph> columns that are much longer than typical values in
that column.
</p>
<p conref="../shared/impala_common.xml#common/blobs_are_strings"/>
<p>
When an expression compares a <codeph>CHAR</codeph> with a <codeph>STRING</codeph> or
<codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to <codeph>STRING</codeph>
first, with trailing spaces preserved.
</p>
<codeblock>select cast("foo " as char(5)) = 'foo' as "char equal to string";
+----------------------+
| char equal to string |
+----------------------+
| false |
+----------------------+
</codeblock>
<p>
This behavior differs from other popular database systems. To get the expected result of
<codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values of the appropriate
length:
</p>
<codeblock>select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string";
+----------------------+
| char equal to string |
+----------------------+
| true |
+----------------------+
</codeblock>
<p>
This behavior is subject to change in future releases.
</p>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_string.xml#string"/>, <xref href="impala_varchar.xml#varchar"/>,
<xref href="impala_literals.xml#string_literals"/>,
<xref href="impala_string_functions.xml#string_functions"/>
</p>
</conbody>
</concept>