Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently, Hive SerDes and UDFs are based on Hive 1.2.1, and Spark SQL can be connected to different versions of Hive Metastore (from 0.12.0 to 2.3.3; see also Interacting with Different Versions of Hive Metastore).
The Spark SQL Thrift JDBC server is designed to be “out of the box” compatible with existing Hive installations. You do not need to modify your existing Hive Metastore or change the data placement or partitioning of your tables.
Spark SQL supports the vast majority of Hive features, such as:
SELECT
GROUP BY
ORDER BY
CLUSTER BY
SORT BY
Relational operators (=, ⇔, ==, <>, <, >, >=, <=, etc)
Arithmetic operators (+, -, *, /, %, etc)
Logical operators (AND, &&, OR, ||, etc)
Mathematical functions (sign, ln, cos, etc)
String functions (instr, length, printf, etc)
JOIN
{LEFT|RIGHT|FULL} OUTER JOIN
LEFT SEMI JOIN
CROSS JOIN
Sub-queries, for example: SELECT col FROM (SELECT a + b AS col FROM t1) t2
CREATE TABLE
CREATE TABLE AS SELECT
ALTER TABLE
TINYINT
SMALLINT
INT
BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
BINARY
TIMESTAMP
DATE
ARRAY<>
MAP<>
STRUCT<>
Below is a list of Hive features that we don't support yet. Most of these features are rarely used in Hive deployments.
Major Hive Features
Esoteric Hive Features
UNION type
Hive Input/Output Formats
Hive Optimizations
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.
The degree of post-shuffle parallelism is controlled manually using “SET spark.sql.shuffle.partitions=[num_tasks];”.
STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
Hive UDF/UDTF/UDAF
Not all the APIs of the Hive UDF/UDTF/UDAF are supported by Spark SQL. Below are the unsupported APIs:
getRequiredJars and getRequiredFiles (UDF and GenericUDF) are functions to automatically include additional resources required by this UDF.
initialize(StructObjectInspector) in GenericUDTF is not supported yet. Spark SQL currently uses a deprecated interface initialize(ObjectInspector[]) only.
configure (GenericUDF, GenericUDTF, and GenericUDAFEvaluator) is a function to initialize functions with MapredContext, which is inapplicable to Spark.
close (GenericUDF and GenericUDAFEvaluator) is a function to release associated resources. Spark SQL does not call this function when tasks finish.
reset (GenericUDAFEvaluator) is a function to re-initialize aggregation for reusing the same aggregation. Spark SQL currently does not support the reuse of aggregation.
getWindowingEvaluator (GenericUDAFEvaluator) is a function to optimize aggregation by evaluating an aggregate over a fixed window.
Below are the scenarios in which Hive and Spark generate different results:
SQRT(n): If n < 0, Hive returns null, Spark SQL returns NaN.
ACOS(n): If n < -1 or n > 1, Hive returns null, Spark SQL returns NaN.
ASIN(n): If n < -1 or n > 1, Hive returns null, Spark SQL returns NaN.
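The difference in out-of-domain results for SQRT, ACOS and ASIN can be illustrated with a small plain-Python sketch. This is neither Hive nor Spark code: hive_like and spark_like are hypothetical helper names that only mimic the results described above, with None standing in for null and float("nan") for NaN.

```python
import math

def _domain_safe(fn, n):
    """Apply fn to n; return None when n is outside fn's real domain."""
    try:
        return fn(n)
    except ValueError:  # math.sqrt/acos/asin raise ValueError on domain errors
        return None

def hive_like(fn, n):
    """Hypothetical helper: mimic the Hive result (null, modelled as None)."""
    return _domain_safe(fn, n)

def spark_like(fn, n):
    """Hypothetical helper: mimic the Spark SQL result (NaN)."""
    result = _domain_safe(fn, n)
    return float("nan") if result is None else result

# One out-of-domain input per function listed above.
cases = [("SQRT", math.sqrt, -1.0), ("ACOS", math.acos, 2.0), ("ASIN", math.asin, -2.0)]
for name, fn, n in cases:
    print(name, n, "hive-like:", hive_like(fn, n), "spark-like:", spark_like(fn, n))
```

In practice this suggests that a query relying on an IS NULL check after one of these functions in Hive may also need a NaN check (for example with isnan) when run on Spark SQL.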