---
layout: global
title: Compatibility with Apache Hive
displayTitle: Compatibility with Apache Hive
---

* Table of contents
{:toc}

Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.
Currently, Hive SerDes and UDFs are based on Hive 1.2.1,
and Spark SQL can be connected to different versions of Hive Metastore
(from 0.12.0 to 2.3.3; also see [Interacting with Different Versions of Hive Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)).
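
For example, a minimal sketch of pointing a `SparkSession` at an existing Hive metastore; the metastore version and the jar resolution strategy below are illustrative assumptions to adapt to your deployment:

```scala
import org.apache.spark.sql.SparkSession

// Enable Hive support and pin the metastore client version.
// The version value and jar resolution strategy are assumptions;
// adjust them to match your Hive deployment.
val spark = SparkSession.builder()
  .appName("HiveCompatibilitySketch")
  .config("spark.sql.hive.metastore.version", "2.3.3")  // version of the metastore to talk to
  .config("spark.sql.hive.metastore.jars", "maven")     // download matching client jars (or point to a classpath)
  .enableHiveSupport()                                   // use the Hive metastore, SerDes and UDFs
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```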

### Deploying in Existing Hive Warehouses

The Spark SQL Thrift JDBC server is designed to be "out of the box" compatible with existing Hive
installations. You do not need to modify your existing Hive Metastore or change the data placement
or partitioning of your tables.
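
As an illustration of that compatibility, the same Hive JDBC driver used against HiveServer2 can connect to the Spark SQL Thrift server. The sketch below assumes the server is running on its default port (10000) on localhost and that the `hive-jdbc` jar is on the classpath:

```scala
import java.sql.DriverManager

// Load the standard Hive JDBC driver (JDBC 4+ drivers register automatically;
// shown explicitly here for clarity).
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Host, port, database and credentials are assumptions for this sketch.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) {
  println(rs.getString(1))
}
rs.close(); stmt.close(); conn.close()
```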

### Supported Hive Features

Spark SQL supports the vast majority of Hive features, such as the following (a short sketch exercising several of them follows the list):

* Hive query statements, including:
  * `SELECT`
  * `GROUP BY`
  * `ORDER BY`
  * `CLUSTER BY`
  * `SORT BY`
* All Hive operators, including:
  * Relational operators (`=`, `<=>`, `==`, `<>`, `<`, `>`, `>=`, `<=`, etc)
  * Arithmetic operators (`+`, `-`, `*`, `/`, `%`, etc)
  * Logical operators (`AND`, `&&`, `OR`, `||`, etc)
  * Complex type constructors
  * Mathematical functions (`sign`, `ln`, `cos`, etc)
  * String functions (`instr`, `length`, `printf`, etc)
* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDes)
* Window functions
* Joins
  * `JOIN`
  * `{LEFT|RIGHT|FULL} OUTER JOIN`
  * `LEFT SEMI JOIN`
  * `CROSS JOIN`
* Unions
* Sub-queries
  * `SELECT col FROM (SELECT a + b AS col FROM t1) t2`
* Sampling
* Explain
* Partitioned tables including dynamic partition insertion
* View
* All Hive DDL Functions, including:
  * `CREATE TABLE`
  * `CREATE TABLE AS SELECT`
  * `ALTER TABLE`
* Most Hive Data types, including:
  * `TINYINT`
  * `SMALLINT`
  * `INT`
  * `BIGINT`
  * `BOOLEAN`
  * `FLOAT`
  * `DOUBLE`
  * `STRING`
  * `BINARY`
  * `TIMESTAMP`
  * `DATE`
  * `ARRAY<>`
  * `MAP<>`
  * `STRUCT<>`

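A small sketch that exercises a few of the features above (window functions, `LEFT SEMI JOIN`, and dynamic partition insertion), using the `spark` session from the earlier sketch and hypothetical tables `sales`, `regions`, and `sales_by_region`:

```scala
// Window function: rank sales within each region by amount.
spark.sql("""
  SELECT region, amount,
         rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
  FROM sales
""").show()

// LEFT SEMI JOIN: keep only sales whose region appears in the regions table.
spark.sql("""
  SELECT s.* FROM sales s LEFT SEMI JOIN regions r ON s.region = r.region
""").show()

// Dynamic partition insertion into a partitioned Hive table
// (sales_by_region is assumed to be partitioned by region, with the
// partition column listed last in the SELECT).
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
  INSERT OVERWRITE TABLE sales_by_region PARTITION (region)
  SELECT amount, region FROM sales
""")
```
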
### Unsupported Hive Functionality

Below is a list of Hive features that we don't support yet. Most of these features are rarely used
in Hive deployments.

**Major Hive Features**

* Tables with buckets: bucketing is hash partitioning within a Hive table partition. Spark SQL
  doesn't support buckets yet.

**Esoteric Hive Features**

* `UNION` type
* Unique join
* Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at
  the moment and only supports populating the `sizeInBytes` field of the Hive metastore.

**Hive Input/Output Formats**

* File format for CLI: for results being displayed back to the CLI, Spark SQL only supports `TextOutputFormat`.
* Hadoop archive

**Hive Optimizations**

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are
less important due to Spark SQL's in-memory computational model. Others are slotted for future
releases of Spark SQL.

* Block-level bitmap indexes and virtual columns (used to build indexes)
* Automatically determine the number of reducers for joins and group-bys: currently, in Spark SQL, you
  need to control the degree of parallelism post-shuffle using "`SET spark.sql.shuffle.partitions=[num_tasks];`".
* Metadata-only query: for queries that can be answered by using only metadata, Spark SQL still
  launches tasks to compute the result.
* Skew data flag: Spark SQL does not follow the skew data flags in Hive.
* `STREAMTABLE` hint in join: Spark SQL does not follow the `STREAMTABLE` hint.
* Merge multiple small files for query results: if the result output contains multiple small files,
  Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
  metadata. Spark SQL does not support that.

**Hive UDF/UDTF/UDAF**

Not all the APIs of Hive UDFs/UDTFs/UDAFs are supported by Spark SQL. Below are the unsupported APIs; a sketch of the supported registration path follows the list:

* `getRequiredJars` and `getRequiredFiles` (`UDF` and `GenericUDF`) are functions to automatically
  include additional resources required by a UDF.
* `initialize(StructObjectInspector)` in `GenericUDTF` is not supported yet. Spark SQL currently only uses
  the deprecated interface `initialize(ObjectInspector[])`.
* `configure` (`GenericUDF`, `GenericUDTF`, and `GenericUDAFEvaluator`) is a function to initialize
  functions with `MapredContext`, which is inapplicable to Spark.
* `close` (`GenericUDF` and `GenericUDAFEvaluator`) is a function to release associated resources.
  Spark SQL does not call this function when tasks finish.
* `reset` (`GenericUDAFEvaluator`) is a function to re-initialize an evaluator so that the same aggregation can be reused.
  Spark SQL currently does not support the reuse of aggregation.
* `getWindowingEvaluator` (`GenericUDAFEvaluator`) is a function to optimize aggregation by evaluating
  an aggregate over a fixed window.
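
Hive UDFs that avoid the unsupported APIs above can be registered and called from Spark SQL in the usual way. A minimal sketch, using a hypothetical jar path and class name:

```scala
// The jar path and class name below are placeholders; substitute your own Hive UDF.
spark.sql("ADD JAR /path/to/my-hive-udfs.jar")
spark.sql("CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUDF'")
spark.sql("SELECT my_lower('SPARK SQL')").show()
```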

### Incompatible Hive UDF

Below are the scenarios in which Hive and Spark generate different results (a quick check from the Spark side follows the list):

* `SQRT(n)`: if n < 0, Hive returns null while Spark SQL returns NaN.
* `ACOS(n)`: if n < -1 or n > 1, Hive returns null while Spark SQL returns NaN.
* `ASIN(n)`: if n < -1 or n > 1, Hive returns null while Spark SQL returns NaN.
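
For example, running the expressions below in Spark SQL yields NaN in each column, whereas Hive would return NULL:

```scala
// Spark SQL returns NaN for out-of-domain inputs to these functions; Hive returns NULL.
spark.sql("SELECT sqrt(-1), acos(2), asin(-2)").show()
```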