docs/topics/impala_timestamp.xml - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="timestamp">

   <title>TIMESTAMP Data Type</title>

   <titlealts audience="PDF">

     <navtitle>TIMESTAMP</navtitle>

   </titlealts>

   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Impala Data Types"/>
       <data name="Category" value="SQL"/>
       <data name="Category" value="Data Analysts"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Dates and Times"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       A data type used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
       statements, representing a point in time.
     </p>

     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

     <p>
       In the column definition of a <codeph>CREATE TABLE</codeph> statement:
     </p>

 <codeblock><varname>column_name</varname> TIMESTAMP</codeblock>

     <p>
       <b>Range:</b> Allowed date values range from 1400-01-01 to 9999-12-31; this range is
       different from the Hive <codeph>TIMESTAMP</codeph> type. Internally, the resolution of the
       time portion of a <codeph>TIMESTAMP</codeph> value is in nanoseconds.
     </p>

     <p>
       <b>INTERVAL expressions:</b>
     </p>

     <p>
       You can perform date arithmetic by adding or subtracting a specified number of time units,
       using the <codeph>INTERVAL</codeph> keyword and the <codeph>+</codeph> and
       <codeph>-</codeph> operators or <codeph>date_add()</codeph> and
       <codeph>date_sub()</codeph> functions. You can specify units as <codeph>YEAR[S]</codeph>,
       <codeph>MONTH[S]</codeph>, <codeph>WEEK[S]</codeph>, <codeph>DAY[S]</codeph>,
       <codeph>HOUR[S]</codeph>, <codeph>MINUTE[S]</codeph>, <codeph>SECOND[S]</codeph>,
       <codeph>MILLISECOND[S]</codeph>, <codeph>MICROSECOND[S]</codeph>, and
       <codeph>NANOSECOND[S]</codeph>. You can only specify one time unit in each interval
       expression, for example <codeph>INTERVAL 3 DAYS</codeph> or <codeph>INTERVAL 25
       HOURS</codeph>, but you can produce any granularity by adding together successive
       <codeph>INTERVAL</codeph> values, such as <codeph><varname>timestamp_value</varname> +
       INTERVAL 3 WEEKS - INTERVAL 1 DAY + INTERVAL 10 MICROSECONDS</codeph>.
     </p>

     <p>
       For example:
     </p>

 <codeblock>select now() + interval 1 day;
 select date_sub(now(), interval 5 minutes);
 insert into auction_details
   select auction_id, auction_start_time, auction_start_time + interval 2 days + interval 12 hours
   from new_auctions;</codeblock>

     <p>
       <b>Time zones:</b>
     </p>

     <p>
       By default, Impala does not store timestamps using the local timezone, to avoid undesired
       results from unexpected time zone issues. Timestamps are stored and interpreted relative
       to UTC, both when written to or read from data files, or when converted to or from Unix
       time values through functions such as <codeph>from_unixtime()</codeph> or
       <codeph>unix_timestamp()</codeph>. To convert such a <codeph>TIMESTAMP</codeph> value to
       one that represents the date and time in a specific time zone, convert the original value
       with the <codeph>from_utc_timestamp()</codeph> function.
     </p>

     <p>
       Because Impala does not assume that <codeph>TIMESTAMP</codeph> values are in any
       particular time zone, you must be conscious of the time zone aspects of data that you
       query, insert, or convert.
     </p>

     <p>
       For consistency with Unix system calls, the <codeph>TIMESTAMP</codeph> returned by the
       <codeph>now()</codeph> function represents the local time in the system time zone, rather
       than in UTC. To store values relative to the current time in a portable way, convert any
       <codeph>now()</codeph> return values using the <codeph>to_utc_timestamp()</codeph>
       function first. For example, the following example shows that the current time in
       California (where this Impala cluster is located) is shortly after 2 PM. If that value was
       written to a data file, and shipped off to a distant server to be analyzed alongside other
       data from far-flung locations, the dates and times would not match up precisely because of
       time zone differences. Therefore, the <codeph>to_utc_timestamp()</codeph> function
       converts it using a common reference point, the UTC time zone (descended from the old
       Greenwich Mean Time standard). The <codeph>'PDT'</codeph> argument indicates that the
       original value is from the Pacific time zone with Daylight Saving Time in effect. When
       servers in all geographic locations run the same transformation on any local date and time
       values (with the appropriate time zone argument), the stored data uses a consistent
       representation. Impala queries can use functions such as <codeph>EXTRACT()</codeph>,
       <codeph>MIN()</codeph>, <codeph>AVG()</codeph>, and so on to do time-series analysis on
       those timestamps.
     </p>

 <codeblock>[localhost:21000] > select now();
 +-------------------------------+
 | now()                         |
 +-------------------------------+
 | 2015-04-09 14:07:46.580465000 |
 +-------------------------------+
 [localhost:21000] > select to_utc_timestamp(now(), 'PDT');
 +--------------------------------+
 | to_utc_timestamp(now(), 'pdt') |
 +--------------------------------+
 | 2015-04-09 21:08:07.664547000  |
 +--------------------------------+
 </codeblock>

     <p>
       The converse function, <codeph>from_utc_timestamp()</codeph>, lets you take stored
       <codeph>TIMESTAMP</codeph> data or calculated results and convert back to local date and
       time for processing on the application side. The following example shows how you might
       represent some future date (such as the ending date and time of an auction) in UTC, and
       then convert back to local time when convenient for reporting or other processing. The
       final query in the example tests whether this arbitrary UTC date and time has passed yet,
       by converting it back to the local time zone and comparing it against the current date and
       time.
     </p>

 <codeblock>[localhost:21000] > select to_utc_timestamp(now() + interval 2 weeks, 'PDT');
 +---------------------------------------------------+
 | to_utc_timestamp(now() + interval 2 weeks, 'pdt') |
 +---------------------------------------------------+
 | 2015-04-23 21:08:34.152923000                     |
 +---------------------------------------------------+
 [localhost:21000] > select from_utc_timestamp('2015-04-23 21:08:34.152923000','PDT');
 +------------------------------------------------------------+
 | from_utc_timestamp('2015-04-23 21:08:34.152923000', 'pdt') |
 +------------------------------------------------------------+
 | 2015-04-23 14:08:34.152923000                              |
 +------------------------------------------------------------+
 [localhost:21000] > select from_utc_timestamp('2015-04-23 21:08:34.152923000','PDT') &lt; now();
 +--------------------------------------------------------------------+
 | from_utc_timestamp('2015-04-23 21:08:34.152923000', 'pdt') &lt; now() |
 +--------------------------------------------------------------------+
 | false                                                              |
 +--------------------------------------------------------------------+
 </codeblock>

     <p rev="2.2.0">
       If you have data files written by Hive, those <codeph>TIMESTAMP</codeph> values represent
       the local timezone of the host where the data was written, potentially leading to
       inconsistent results when processed by Impala. To avoid compatibility problems or having
       to code workarounds, you can specify one or both of these <cmdname>impalad</cmdname>
       startup flags: <codeph>--use_local_tz_for_unix_timestamp_conversions=true</codeph>
       <codeph>-convert_legacy_hive_parquet_utc_timestamps=true</codeph>. Although
       <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> is turned off by default to
       avoid performance overhead, where practical turn it on when processing
       <codeph>TIMESTAMP</codeph> columns in Parquet files written by Hive, to avoid unexpected
       behavior.
     </p>

     <p rev="2.2.0">
       The <codeph>--use_local_tz_for_unix_timestamp_conversions</codeph> setting affects
       conversions from <codeph>TIMESTAMP</codeph> to <codeph>BIGINT</codeph>, or from
       <codeph>BIGINT</codeph> to <codeph>TIMESTAMP</codeph>. By default, Impala treats all
       <codeph>TIMESTAMP</codeph> values as UTC, to simplify analysis of time-series data from
       different geographic regions. When you enable the
       <codeph>--use_local_tz_for_unix_timestamp_conversions</codeph> setting, these operations
       treat the input values as if they are in the local tie zone of the host doing the
       processing. See <xref
         href="impala_datetime_functions.xml#datetime_functions"/>
       for the list of functions affected by the
       <codeph>--use_local_tz_for_unix_timestamp_conversions</codeph> setting.
     </p>

     <p>
       The following sequence of examples shows how the interpretation of
       <codeph>TIMESTAMP</codeph> values in Parquet tables is affected by the setting of the
       <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> setting.
     </p>

     <p>
       Regardless of the <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> setting,
       <codeph>TIMESTAMP</codeph> columns in text tables can be written and read interchangeably
       by Impala and Hive:
     </p>

 <codeblock>Impala DDL and queries for text table:

 [localhost:21000] > create table t1 (x timestamp);
 [localhost:21000] > insert into t1 values (now()), (now() + interval 1 day);
 [localhost:21000] > select x from t1;
 +-------------------------------+
 | x                             |
 +-------------------------------+
 | 2015-04-07 15:43:02.892403000 |
 | 2015-04-08 15:43:02.892403000 |
 +-------------------------------+
 [localhost:21000] > select to_utc_timestamp(x, 'PDT') from t1;
 +-------------------------------+
 | to_utc_timestamp(x, 'pdt')    |
 +-------------------------------+
 | 2015-04-07 22:43:02.892403000 |
 | 2015-04-08 22:43:02.892403000 |
 +-------------------------------+

 Hive query for text table:

 hive> select * from t1;
 OK
 2015-04-07 15:43:02.892403
 2015-04-08 15:43:02.892403
 Time taken: 1.245 seconds, Fetched: 2 row(s)
 </codeblock>

     <p>
       When the table uses Parquet format, Impala expects any time zone adjustment to be applied
       prior to writing, while <codeph>TIMESTAMP</codeph> values written by Hive are adjusted to
       be in the UTC time zone. When Hive queries Parquet data files that it wrote, it adjusts
       the <codeph>TIMESTAMP</codeph> values back to the local time zone, while Impala does no
       conversion. Hive does no time zone conversion when it queries Impala-written Parquet
       files.
     </p>

 <codeblock>Impala DDL and queries for Parquet table:

 [localhost:21000] > create table p1 stored as parquet as select x from t1;
 +-------------------+
 | summary           |
 +-------------------+
 | Inserted 2 row(s) |
 +-------------------+
 [localhost:21000] > select x from p1;
 +-------------------------------+
 | x                             |
 +-------------------------------+
 | 2015-04-07 15:43:02.892403000 |
 | 2015-04-08 15:43:02.892403000 |
 +-------------------------------+

 Hive DDL and queries for Parquet table:

 hive> create table h1 (x timestamp) stored as parquet;
 OK
 hive> insert into h1 select * from p1;
 ...
 OK
 Time taken: 35.573 seconds
 hive> select x from p1;
 OK
 2015-04-07 15:43:02.892403
 2015-04-08 15:43:02.892403
 Time taken: 0.324 seconds, Fetched: 2 row(s)
 hive> select x from h1;
 OK
 2015-04-07 15:43:02.892403
 2015-04-08 15:43:02.892403
 Time taken: 0.197 seconds, Fetched: 2 row(s)
 </codeblock>

     <p>
       The discrepancy arises when Impala queries the Hive-created Parquet table. The underlying
       values in the <codeph>TIMESTAMP</codeph> column are different from the ones written by
       Impala, even though they were copied from one table to another by an <codeph>INSERT ...
       SELECT</codeph> statement in Hive. Hive did an implicit conversion from the local time
       zone to UTC as it wrote the values to Parquet.
     </p>

 <codeblock>Impala query for TIMESTAMP values from Impala-written and Hive-written data:

 [localhost:21000] > select * from p1;
 +-------------------------------+
 | x                             |
 +-------------------------------+
 | 2015-04-07 15:43:02.892403000 |
 | 2015-04-08 15:43:02.892403000 |
 +-------------------------------+
 Fetched 2 row(s) in 0.29s
 [localhost:21000] > select * from h1;
 +-------------------------------+
 | x                             |
 +-------------------------------+
 | 2015-04-07 22:43:02.892403000 |
 | 2015-04-08 22:43:02.892403000 |
 +-------------------------------+
 Fetched 2 row(s) in 0.41s

 Underlying integer values for Impala-written and Hive-written data:

 [localhost:21000] > select cast(x as bigint) from p1;
 +-------------------+
 | cast(x as bigint) |
 +-------------------+
 | 1428421382        |
 | 1428507782        |
 +-------------------+
 Fetched 2 row(s) in 0.38s
 [localhost:21000] > select cast(x as bigint) from h1;
 +-------------------+
 | cast(x as bigint) |
 +-------------------+
 | 1428446582        |
 | 1428532982        |
 +-------------------+
 Fetched 2 row(s) in 0.20s
 </codeblock>

     <p>
       When the <codeph>-convert_legacy_hive_parquet_utc_timestamps</codeph> setting is enabled,
       Impala recognizes the Parquet data files written by Hive, and applies the same
       UTC-to-local-timezone conversion logic during the query as Hive uses, making the contents
       of the Impala-written <codeph>P1</codeph> table and the Hive-written <codeph>H1</codeph>
       table appear identical, whether represented as <codeph>TIMESTAMP</codeph> values or the
       underlying <codeph>BIGINT</codeph> integers:
     </p>

 <codeblock>[localhost:21000] > select x from p1;
 +-------------------------------+
 | x                             |
 +-------------------------------+
 | 2015-04-07 15:43:02.892403000 |
 | 2015-04-08 15:43:02.892403000 |
 +-------------------------------+
 Fetched 2 row(s) in 0.37s
 [localhost:21000] > select x from h1;
 +-------------------------------+
 | x                             |
 +-------------------------------+
 | 2015-04-07 15:43:02.892403000 |
 | 2015-04-08 15:43:02.892403000 |
 +-------------------------------+
 Fetched 2 row(s) in 0.19s
 [localhost:21000] > select cast(x as bigint) from p1;
 +-------------------+
 | cast(x as bigint) |
 +-------------------+
 | 1428446582        |
 | 1428532982        |
 +-------------------+
 Fetched 2 row(s) in 0.29s
 [localhost:21000] > select cast(x as bigint) from h1;
 +-------------------+
 | cast(x as bigint) |
 +-------------------+
 | 1428446582        |
 | 1428532982        |
 +-------------------+
 Fetched 2 row(s) in 0.22s
 </codeblock>

     <p>
       <b>Conversions:</b>
     </p>

     <p conref="../shared/impala_common.xml#common/timestamp_conversions"
       conrefend="../shared/impala_common.xml#common/cast_string_to_timestamp"/>

     <p>
       <ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
     </p>

     <p>
       In Impala 1.3 and higher, the <codeph>FROM_UNIXTIME()</codeph> and
       <codeph>UNIX_TIMESTAMP()</codeph> functions allow a wider range of format strings, with
       more flexibility in element order, repetition of letter placeholders, and separator
       characters. In <keyword
         keyref="impala23_full"/> and higher, the
       <codeph>UNIX_TIMESTAMP()</codeph> function also allows a numeric timezone offset to be
       specified as part of the input string. See
       <xref
         href="impala_datetime_functions.xml#datetime_functions"/> for details.
     </p>

     <p conref="../shared/impala_common.xml#common/y2k38"/>

     <p>
       <b>Partitioning:</b>
     </p>

     <p>
       Although you cannot use a <codeph>TIMESTAMP</codeph> column as a partition key, you can
       extract the individual years, months, days, hours, and so on and partition based on those
       columns. Because the partition key column values are represented in HDFS directory names,
       rather than as fields in the data files themselves, you can also keep the original
       <codeph>TIMESTAMP</codeph> values if desired, without duplicating data or wasting storage
       space. See <xref
         href="impala_partitioning.xml#partition_key_columns"/> for more
       details on partitioning with date and time values.
     </p>

 <codeblock>[localhost:21000] &gt; create table timeline (event string) partitioned by (happened timestamp);
 ERROR: AnalysisException: Type 'TIMESTAMP' is not supported as partition-column type in column: happened
 </codeblock>

     <p conref="../shared/impala_common.xml#common/null_bad_timestamp_cast"/>

     <p conref="../shared/impala_common.xml#common/partitioning_worrisome"/>

     <p conref="../shared/impala_common.xml#common/hbase_ok"/>

     <p conref="../shared/impala_common.xml#common/parquet_ok"/>

     <p conref="../shared/impala_common.xml#common/text_bulky"/>

 <!--    <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->

     <p conref="../shared/impala_common.xml#common/internals_16_bytes"/>

     <p conref="../shared/impala_common.xml#common/added_forever"/>

     <p conref="../shared/impala_common.xml#common/column_stats_constant"/>

     <p conref="../shared/impala_common.xml#common/sqoop_blurb"/>

     <p conref="../shared/impala_common.xml#common/sqoop_timestamp_caveat"/>

     <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

     <p>
       If you cast a <codeph>STRING</codeph> with an unrecognized format to a
       <codeph>TIMESTAMP</codeph>, the result is <codeph>NULL</codeph> rather than an error. Make
       sure to test your data pipeline to be sure any textual date and time values are in a
       format that Impala <codeph>TIMESTAMP</codeph> can recognize.
     </p>

     <p conref="../shared/impala_common.xml#common/avro_no_timestamp"/>

     <p conref="../shared/impala_common.xml#common/kudu_blurb"/>

     <p conref="../shared/impala_common.xml#common/kudu_timestamp_details"/>

     <p conref="../shared/impala_common.xml#common/example_blurb"/>

     <p>
       The following examples demonstrate using <codeph>TIMESTAMP</codeph> values with built-in
       functions:
     </p>

 <codeblock>select cast('1966-07-30' as timestamp);
 select cast('1985-09-25 17:45:30.005' as timestamp);
 select cast('08:30:00' as timestamp);
 select hour('1970-01-01 15:30:00');         -- Succeeds, returns 15.
 select hour('1970-01-01 15:30');            -- Returns NULL because seconds field required.
 select hour('1970-01-01 27:30:00');         -- Returns NULL because hour value out of range.
 select dayofweek('2004-06-13');             -- Returns 1, representing Sunday.
 select dayname('2004-06-13');               -- Returns 'Sunday'.
 select date_add('2004-06-13', 365);         -- Returns 2005-06-13 with zeros for hh:mm:ss fields.
 select day('2004-06-13');                   -- Returns 13.
 select datediff('1989-12-31','1984-09-01'); -- How many days between these 2 dates?
 select now();                               -- Returns current date and time in local timezone.
 </codeblock>

     <p>
       The following examples demonstrate using <codeph>TIMESTAMP</codeph> values with
       HDFS-backed tables:
     </p>

 <codeblock>create table dates_and_times (t timestamp);
 insert into dates_and_times values
   ('1966-07-30'), ('1985-09-25 17:45:30.005'), ('08:30:00'), (now());
 </codeblock>

     <p rev="IMPALA-5137">
       The following examples demonstrate using <codeph>TIMESTAMP</codeph> values with Kudu
       tables:
     </p>

 <codeblock rev="IMPALA-5137">create table timestamp_t (x int primary key, s string, t timestamp, b bigint)
   partition by hash (x) partitions 16
   stored as kudu;

 -- The default value of now() has microsecond precision, so the final 3 digits
 -- representing nanoseconds are all zero.
 insert into timestamp_t values (1, cast(now() as string), now(), unix_timestamp(now()));

 -- Values with 1-499 nanoseconds are rounded down in the Kudu TIMESTAMP column.
 insert into timestamp_t values (2, cast(now() + interval 100 nanoseconds as string), now() + interval 100 nanoseconds, unix_timestamp(now() + interval 100 nanoseconds));
 insert into timestamp_t values (3, cast(now() + interval 499 nanoseconds as string), now() + interval 499 nanoseconds, unix_timestamp(now() + interval 499 nanoseconds));

 -- Values with 500-999 nanoseconds are rounded up in the Kudu TIMESTAMP column.
 insert into timestamp_t values (4, cast(now() + interval 500 nanoseconds as string), now() + interval 500 nanoseconds, unix_timestamp(now() + interval 500 nanoseconds));
 insert into timestamp_t values (5, cast(now() + interval 501 nanoseconds as string), now() + interval 501 nanoseconds, unix_timestamp(now() + interval 501 nanoseconds));

 -- The string representation shows how underlying Impala TIMESTAMP can have nanosecond precision.
 -- The TIMESTAMP column shows how timestamps in a Kudu table are rounded to microsecond precision.
 -- The BIGINT column represents seconds past the epoch and so if not affected much by nanoseconds.
 select s, t, b from timestamp_t order by t;
 +-------------------------------+-------------------------------+------------+
 | s                             | t                             | b          |
 +-------------------------------+-------------------------------+------------+
 | 2017-05-31 15:30:05.107157000 | 2017-05-31 15:30:05.107157000 | 1496244605 |
 | 2017-05-31 15:30:28.868151100 | 2017-05-31 15:30:28.868151000 | 1496244628 |
 | 2017-05-31 15:34:33.674692499 | 2017-05-31 15:34:33.674692000 | 1496244873 |
 | 2017-05-31 15:35:04.769166500 | 2017-05-31 15:35:04.769167000 | 1496244904 |
 | 2017-05-31 15:35:33.033082501 | 2017-05-31 15:35:33.033083000 | 1496244933 |
 +-------------------------------+-------------------------------+------------+
 </codeblock>

     <p conref="../shared/impala_common.xml#common/related_info"/>

     <ul>
       <li>
 <!-- The Timestamp Literals topic is pretty brief. Consider adding more examples there. -->
         <xref href="impala_literals.xml#timestamp_literals"/>.
       </li>

       <li>
         To convert to or from different date formats, or perform date arithmetic, use the date
         and time functions described in
         <xref
           href="impala_datetime_functions.xml#datetime_functions"/>. In
         particular, the <codeph>from_unixtime()</codeph> function requires a case-sensitive
         format string such as <codeph>"yyyy-MM-dd HH:mm:ss.SSSS"</codeph>, matching one of the
         allowed variations of a <codeph>TIMESTAMP</codeph> value (date plus time, only date,
         only time, optional fractional seconds).
       </li>

       <li>
         See <xref href="impala_langref_unsupported.xml#langref_hiveql_delta"
         /> for
         details about differences in <codeph>TIMESTAMP</codeph> handling between Impala and
         Hive.
       </li>
     </ul>

   </conbody>

 </concept>