---
layout: global
title: Compatibility with Apache Hive
displayTitle: Compatibility with Apache Hive
---

* Table of contents
{:toc}

Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.
Currently, Hive SerDes and UDFs are based on Hive 1.2.1,
and Spark SQL can be connected to different versions of Hive Metastore
(from 0.12.0 to 2.3.3; also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)).
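
For example, a minimal sketch of pointing a `SparkSession` at an existing Hive metastore; the metastore version and the jar resolution strategy below are illustrative assumptions to adapt to your deployment:

```scala
import org.apache.spark.sql.SparkSession

// Enable Hive support and pin the metastore client version.
// The version value and jar resolution strategy are assumptions;
// adjust them to match your Hive deployment.
val spark = SparkSession.builder()
  .appName("HiveCompatibilitySketch")
  .config("spark.sql.hive.metastore.version", "2.3.3")  // version of the metastore to talk to
  .config("spark.sql.hive.metastore.jars", "maven")     // download matching client jars (or point to a classpath)
  .enableHiveSupport()                                   // use the Hive metastore, SerDes and UDFs
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```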

### Deploying in Existing Hive Warehouses

The Spark SQL Thrift JDBC server is designed to be "out of the box" compatible with existing Hive
installations. You do not need to modify your existing Hive Metastore or change the data placement
or partitioning of your tables.
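
As an illustration of that compatibility, the same Hive JDBC driver used against HiveServer2 can connect to the Spark SQL Thrift server. The sketch below assumes the server is running on its default port (10000) on localhost and that the `hive-jdbc` jar is on the classpath:

```scala
import java.sql.DriverManager

// Load the standard Hive JDBC driver (JDBC 4+ drivers register automatically;
// shown explicitly here for clarity).
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Host, port, database and credentials are assumptions for this sketch.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) {
  println(rs.getString(1))
}
rs.close(); stmt.close(); conn.close()
```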

### Supported Hive Features

Spark SQL supports the vast majority of Hive features, such as the following (a short sketch exercising several of them follows the list):

* Hive query statements, including:
  * `SELECT`
  * `GROUP BY`
  * `ORDER BY`
  * `CLUSTER BY`
  * `SORT BY`
* All Hive operators, including:
  * Relational operators (`=`, `<=>`, `==`, `<>`, `<`, `>`, `>=`, `<=`, etc)
  * Arithmetic operators (`+`, `-`, `*`, `/`, `%`, etc)
  * Logical operators (`AND`, `&&`, `OR`, `||`, etc)
  * Complex type constructors
  * Mathematical functions (`sign`, `ln`, `cos`, etc)
  * String functions (`instr`, `length`, `printf`, etc)
* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDes)
* Window functions
* Joins
  * `JOIN`
  * `{LEFT|RIGHT|FULL} OUTER JOIN`
  * `LEFT SEMI JOIN`
  * `CROSS JOIN`
* Unions
* Sub-queries
  * `SELECT col FROM (SELECT a + b AS col FROM t1) t2`
* Sampling
* Explain
* Partitioned tables including dynamic partition insertion
* View
* All Hive DDL Functions, including:
  * `CREATE TABLE`
  * `CREATE TABLE AS SELECT`
  * `ALTER TABLE`
* Most Hive Data types, including:
  * `TINYINT`
  * `SMALLINT`
  * `INT`
  * `BIGINT`
  * `BOOLEAN`
  * `FLOAT`
  * `DOUBLE`
  * `STRING`
  * `BINARY`
  * `TIMESTAMP`
  * `DATE`
  * `ARRAY<>`
  * `MAP<>`
  * `STRUCT<>`

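A small sketch that exercises a few of the features above (window functions, `LEFT SEMI JOIN`, and dynamic partition insertion), using the `spark` session from the earlier sketch and hypothetical tables `sales`, `regions`, and `sales_by_region`:

```scala
// Window function: rank sales within each region by amount.
spark.sql("""
  SELECT region, amount,
         rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
  FROM sales
""").show()

// LEFT SEMI JOIN: keep only sales whose region appears in the regions table.
spark.sql("""
  SELECT s.* FROM sales s LEFT SEMI JOIN regions r ON s.region = r.region
""").show()

// Dynamic partition insertion into a partitioned Hive table
// (sales_by_region is assumed to be partitioned by region, with the
// partition column listed last in the SELECT).
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
  INSERT OVERWRITE TABLE sales_by_region PARTITION (region)
  SELECT amount, region FROM sales
""")
```
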
### Unsupported Hive Functionality

Below is a list of Hive features that we don't support yet. Most of these features are rarely used
in Hive deployments.

**Major Hive Features**

* Tables with buckets: bucketing is hash partitioning within a Hive table partition. Spark SQL
  doesn't support buckets yet.

**Esoteric Hive Features**

* `UNION` type
* Unique join
* Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at
  the moment and only supports populating the `sizeInBytes` field of the Hive metastore.

**Hive Input/Output Formats**

* File format for CLI: for results being displayed back to the CLI, Spark SQL only supports `TextOutputFormat`.
* Hadoop archive

**Hive Optimizations**

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are
less important due to Spark SQL's in-memory computational model. Others are slotted for future
releases of Spark SQL.

* Block-level bitmap indexes and virtual columns (used to build indexes)
* Automatically determine the number of reducers for joins and group-bys: currently, in Spark SQL, you
  need to control the degree of parallelism post-shuffle using "`SET spark.sql.shuffle.partitions=[num_tasks];`".
* Metadata-only query: for queries that can be answered by using only metadata, Spark SQL still
  launches tasks to compute the result.
* Skew data flag: Spark SQL does not follow the skew data flags in Hive.
* `STREAMTABLE` hint in join: Spark SQL does not follow the `STREAMTABLE` hint.
* Merge multiple small files for query results: if the result output contains multiple small files,
  Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
  metadata. Spark SQL does not support that.

**Hive UDF/UDTF/UDAF**

Not all the APIs of Hive UDFs/UDTFs/UDAFs are supported by Spark SQL. Below are the unsupported APIs; a sketch of the supported registration path follows the list:

* `getRequiredJars` and `getRequiredFiles` (`UDF` and `GenericUDF`) are functions to automatically
  include additional resources required by a UDF.
* `initialize(StructObjectInspector)` in `GenericUDTF` is not supported yet. Spark SQL currently only uses
  the deprecated interface `initialize(ObjectInspector[])`.
* `configure` (`GenericUDF`, `GenericUDTF`, and `GenericUDAFEvaluator`) is a function to initialize
  functions with `MapredContext`, which is inapplicable to Spark.
* `close` (`GenericUDF` and `GenericUDAFEvaluator`) is a function to release associated resources.
  Spark SQL does not call this function when tasks finish.
* `reset` (`GenericUDAFEvaluator`) is a function to re-initialize an evaluator so that the same aggregation can be reused.
  Spark SQL currently does not support the reuse of aggregation.
* `getWindowingEvaluator` (`GenericUDAFEvaluator`) is a function to optimize aggregation by evaluating
  an aggregate over a fixed window.
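
Hive UDFs that avoid the unsupported APIs above can be registered and called from Spark SQL in the usual way. A minimal sketch, using a hypothetical jar path and class name:

```scala
// The jar path and class name below are placeholders; substitute your own Hive UDF.
spark.sql("ADD JAR /path/to/my-hive-udfs.jar")
spark.sql("CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUDF'")
spark.sql("SELECT my_lower('SPARK SQL')").show()
```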

### Incompatible Hive UDF

Below are the scenarios in which Hive and Spark generate different results (a quick check from the Spark side follows the list):

* `SQRT(n)`: if n < 0, Hive returns null while Spark SQL returns NaN.
* `ACOS(n)`: if n < -1 or n > 1, Hive returns null while Spark SQL returns NaN.
* `ASIN(n)`: if n < -1 or n > 1, Hive returns null while Spark SQL returns NaN.
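
For example, running the expressions below in Spark SQL yields NaN in each column, whereas Hive would return NULL:

```scala
// Spark SQL returns NaN for out-of-domain inputs to these functions; Hive returns NULL.
spark.sql("SELECT sqrt(-1), acos(2), asin(-2)").show()
```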