.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
**********
Timestamps
**********
Arrow/Pandas Timestamps
=======================
Arrow timestamps are stored as 64-bit integers with column metadata that
records a time unit (e.g. milliseconds, microseconds, or nanoseconds) and an
optional time zone. Pandas (`Timestamp`) uses a 64-bit integer representing
nanoseconds and an optional time zone.
Python/Pandas timestamp types without an associated time zone are referred to
as "Time Zone Naive". Python/Pandas timestamp types with an associated time
zone are referred to as "Time Zone Aware".
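The distinction can be seen directly with ``pyarrow`` and ``pandas`` (a quick
illustration; exact reprs may vary by library version)::
>>> import pyarrow as pa
>>> import pandas as pd
>>> # The unit and time zone live in the Arrow type, not in the values.
>>> pa.timestamp('us', tz='UTC')
TimestampType(timestamp[us, tz=UTC])
>>> # A pandas Timestamp is always nanosecond resolution; tz is optional.
>>> pd.Timestamp('2019-01-01').tz is None            # time zone naive
True
>>> pd.Timestamp('2019-01-01', tz='UTC').tz is None  # time zone aware
False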
Timestamp Conversions
=====================
Pandas/Arrow ⇄ Spark
--------------------
Spark stores timestamps as 64-bit integers representing microseconds since
the UNIX epoch. It does not store any metadata about time zones with its
timestamps.
Spark interprets timestamps with the *session local time zone* (i.e.
``spark.sql.session.timeZone``). If that time zone is undefined, Spark
falls back to the default system time zone. For simplicity's sake, the
session local time zone is always defined in the examples below.
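The session time zone is an ordinary Spark configuration value. A minimal
sketch, assuming ``spark`` is an active ``SparkSession``::
>>> spark.conf.set("spark.sql.session.timeZone", "UTC")
>>> spark.conf.get("spark.sql.session.timeZone")
'UTC'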
This implies a few things when round-tripping timestamps:
#. Time zone information is lost (all timestamps that result from
converting from Spark to Arrow/Pandas are "time zone naive").
#. Timestamps are truncated to microseconds (illustrated in the sketch
after this list).
#. The session time zone might have unintuitive impacts on the
translation of timestamp values.
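For instance, sub-microsecond precision survives in pandas but is lost in a
Spark round trip. A minimal sketch, assuming an active ``spark`` session with
Arrow conversion enabled::
>>> import pandas as pd
>>> ts = pd.Timestamp(year=2019, month=1, day=1, nanosecond=500)
>>> ts.nanosecond
500
>>> # Everything below microsecond resolution is dropped by Spark.
>>> spark.createDataFrame(pd.DataFrame({'t': [ts]})).toPandas()['t'][0].nanosecond
0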
Pandas to Spark (through Apache Arrow)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following cases assume the Spark configuration
``spark.sql.execution.arrow.enabled`` is set to ``"true"``.
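That configuration can be set at runtime as shown below (note that in Spark
3.0 and later the key was renamed to
``spark.sql.execution.arrow.pyspark.enabled``; the older name is used here to
match the text)::
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
We start with a pandas data frame containing one naive and one aware
timestamp: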
::
>>> import pandas as pd
>>> from datetime import datetime, timedelta, timezone
>>> from pandas import Timestamp
>>> pdf = pd.DataFrame({'naive': [datetime(2019, 1, 1, 0)],
... 'aware': [Timestamp(year=2019, month=1, day=1,
... nanosecond=500, tz=timezone(timedelta(hours=-8)))]})
>>> pdf
naive aware
0 2019-01-01 2019-01-01 00:00:00.000000500-08:00
>>> spark.conf.set("spark.sql.session.timeZone", "UTC")
>>> utc_df = sqlContext.createDataFrame(pdf)
>>> utc_df.show()
+-------------------+-------------------+
| naive| aware|
+-------------------+-------------------+
|2019-01-01 00:00:00|2019-01-01 08:00:00|
+-------------------+-------------------+
Note that the conversion of the aware timestamp is shifted to reflect the
time assuming UTC (it represents the same instant in time). For naive
timestamps, Spark treats them as being in the session local
time zone and converts them to UTC. Recall that internally, the schema
of a Spark DataFrame does not store any time zone information with
timestamps.
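The absence of time zone metadata is visible in the schema (a quick check)::
>>> utc_df.printSchema()
root
 |-- naive: timestamp (nullable = true)
 |-- aware: timestamp (nullable = true)
Both columns have the same plain ``timestamp`` type; the naive/aware
distinction existed only on the pandas side.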
Now if the session time zone is set to US Pacific Time (PST), we don't
see any shift in the display of the aware timestamp (it
still represents the same instant in time):
::
>>> spark.conf.set("spark.sql.session.timeZone", "US/Pacific")
>>> pst_df = sqlContext.createDataFrame(pdf)
>>> pst_df.show()
+-------------------+-------------------+
| naive| aware|
+-------------------+-------------------+
|2019-01-01 00:00:00|2019-01-01 00:00:00|
+-------------------+-------------------+
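We can confirm from the pandas side that the aware value still refers to
the same instant by rendering it in UTC (a quick check outside of Spark)::
>>> pdf['aware'][0].tz_convert('UTC')
Timestamp('2019-01-01 08:00:00.000000500+0000', tz='UTC')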
Looking again at ``utc_df.show()``, we see one of the impacts of the session
time zone. The naive timestamp was initially converted assuming UTC, so the
instant it reflects is actually earlier than that of the naive timestamp from
the PST-converted data frame:
::
>>> utc_df.show()
+-------------------+-------------------+
| naive| aware|
+-------------------+-------------------+
|2018-12-31 16:00:00|2019-01-01 00:00:00|
+-------------------+-------------------+
Spark to Pandas
~~~~~~~~~~~~~~~
We can observe what happens when converting back to Arrow/Pandas. Assuming the
session time zone is still PST:
::
>>> pst_df.show()
+-------------------+-------------------+
| naive| aware|
+-------------------+-------------------+
|2019-01-01 00:00:00|2019-01-01 00:00:00|
+-------------------+-------------------+
>>> pst_df.toPandas()
naive aware
0 2019-01-01 2019-01-01
>>> pst_df.toPandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
naive 1 non-null datetime64[ns]
aware 1 non-null datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 96.0 bytes
Notice that, in addition to being "time zone naive", the 'aware'
value will now differ when converted to an epoch offset. Spark does the
conversion by first converting to the session time zone (or the system local
time zone if the session time zone isn't set) and then localizing to remove
the time zone information. This results in the timestamp being 8 hours before
the original time:
::
>>> pst_df.toPandas()['aware'][0]
Timestamp('2019-01-01 00:00:00')
>>> pdf['aware'][0]
Timestamp('2019-01-01 00:00:00.000000500-0800', tz='UTC-08:00')
>>> (pst_df.toPandas()['aware'][0].timestamp()-pdf['aware'][0].timestamp())/3600
-8.0
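In pandas terms, the conversion Spark applies to the aware column is roughly
equivalent to the following (a sketch of the semantics, not Spark's actual
code path; Spark additionally truncates to microseconds)::
>>> # Convert to the session time zone, then drop the zone information.
>>> pdf['aware'][0].tz_convert('US/Pacific').tz_localize(None)
Timestamp('2019-01-01 00:00:00.000000500')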
The same type of conversion happens with the data frame created while
the session time zone was UTC. In this case both naive and aware
represent different instants in time (the difference for naive is due to
the change in session time zone between the creation of the two data frames):
::
>>> utc_df.show()
+-------------------+-------------------+
| naive| aware|
+-------------------+-------------------+
|2018-12-31 16:00:00|2019-01-01 00:00:00|
+-------------------+-------------------+
>>> utc_df.toPandas()
naive aware
0 2018-12-31 16:00:00 2019-01-01
Note that the surprising shift for the aware timestamp doesn't happen
when the session time zone is UTC (but the timestamps
still become "time zone naive"):
::
>>> spark.conf.set("spark.sql.session.timeZone", "UTC")
>>> pst_df.show()
+-------------------+-------------------+
| naive| aware|
+-------------------+-------------------+
|2019-01-01 08:00:00|2019-01-01 08:00:00|
+-------------------+-------------------+
>>> pst_df.toPandas()['aware'][0]
Timestamp('2019-01-01 08:00:00')
>>> pdf['aware'][0]
Timestamp('2019-01-01 00:00:00.000000500-0800', tz='UTC-08:00')
>>> (pst_df.toPandas()['aware'][0].timestamp()-pdf['aware'][0].timestamp())/3600
0.0
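Because the values written while the session time zone is UTC are true UTC
instants, time zone awareness can be restored on the pandas side if needed
(a usage sketch; Spark does not do this for you)::
>>> pst_df.toPandas()['aware'].dt.tz_localize('UTC')[0]
Timestamp('2019-01-01 08:00:00+0000', tz='UTC')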