<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Compatibility with Apache Hive - Spark 2.4.6 Documentation</title>
<link rel="stylesheet" href="css/bootstrap.min.css">
<style>
body {
padding-top: 60px;
padding-bottom: 40px;
}
</style>
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="css/bootstrap-responsive.min.css">
<link rel="stylesheet" href="css/main.css">
<script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
<link rel="stylesheet" href="css/pygments-default.css">
<!-- Google analytics script -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-32518208-2']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<!--[if lt IE 7]>
<p class="chromeframe">You are using an outdated browser. <a href="https://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p>
<![endif]-->
<!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html -->
<div class="navbar navbar-fixed-top" id="topbar">
<div class="navbar-inner">
<div class="container">
<div class="brand"><a href="index.html">
<img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">2.4.6</span>
</div>
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li somehow.-->
<li><a href="index.html">Overview</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
<li><a href="rdd-programming-guide.html">RDDs, Accumulators, Broadcasts Vars</a></li>
<li><a href="sql-programming-guide.html">SQL, DataFrames, and Datasets</a></li>
<li><a href="structured-streaming-programming-guide.html">Structured Streaming</a></li>
<li><a href="streaming-programming-guide.html">Spark Streaming (DStreams)</a></li>
<li><a href="ml-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
<li><a href="sparkr.html">SparkR (R on Spark)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
<li><a href="api/java/index.html">Java</a></li>
<li><a href="api/python/index.html">Python</a></li>
<li><a href="api/R/index.html">R</a></li>
<li><a href="api/sql/index.html">SQL, Built-in Functions</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="cluster-overview.html">Overview</a></li>
<li><a href="submitting-applications.html">Submitting Applications</a></li>
<li class="divider"></li>
<li><a href="spark-standalone.html">Spark Standalone</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="running-on-yarn.html">YARN</a></li>
<li><a href="running-on-kubernetes.html">Kubernetes</a></li>
</ul>
</li>
<li class="dropdown">
<a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="configuration.html">Configuration</a></li>
<li><a href="monitoring.html">Monitoring</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li><a href="security.html">Security</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li class="divider"></li>
<li><a href="building-spark.html">Building Spark</a></li>
<li><a href="https://spark.apache.org/contributing.html">Contributing to Spark</a></li>
<li><a href="https://spark.apache.org/third-party-projects.html">Third Party Projects</a></li>
</ul>
</li>
</ul>
<!--<p class="navbar-text pull-right"><span class="version-text">v2.4.6</span></p>-->
</div>
</div>
</div>
<div class="container-wrapper">
<div class="left-menu-wrapper">
<div class="left-menu">
<h3><a href="sql-programming-guide.html">Spark SQL Guide</a></h3>
<ul>
<li>
<a href="sql-getting-started.html">
Getting Started
</a>
</li>
<li>
<a href="sql-data-sources.html">
Data Sources
</a>
</li>
<li>
<a href="sql-performance-tuning.html">
Performance Tuning
</a>
</li>
<li>
<a href="sql-distributed-sql-engine.html">
Distributed SQL Engine
</a>
</li>
<li>
<a href="sql-pyspark-pandas-with-arrow.html">
PySpark Usage Guide for Pandas with Apache Arrow
</a>
</li>
<li>
<a href="sql-migration-guide.html">
Migration Guide
</a>
</li>
<ul>
<li>
<a href="sql-migration-guide-upgrade.html">
Spark SQL Upgrading Guide
</a>
</li>
<li>
<a href="sql-migration-guide-hive-compatibility.html">
<b>Compatibility with Apache Hive</b>
</a>
</li>
</ul>
<li>
<a href="sql-reference.html">
Reference
</a>
</li>
</ul>
</div>
</div>
<input id="nav-trigger" class="nav-trigger" checked type="checkbox">
<label for="nav-trigger"></label>
<div class="content-with-sidebar" id="content">
<h1 class="title">Compatibility with Apache Hive</h1>
<ul id="markdown-toc">
<li><a href="#deploying-in-existing-hive-warehouses" id="markdown-toc-deploying-in-existing-hive-warehouses">Deploying in Existing Hive Warehouses</a></li>
<li><a href="#supported-hive-features" id="markdown-toc-supported-hive-features">Supported Hive Features</a></li>
<li><a href="#unsupported-hive-functionality" id="markdown-toc-unsupported-hive-functionality">Unsupported Hive Functionality</a></li>
<li><a href="#incompatible-hive-udf" id="markdown-toc-incompatible-hive-udf">Incompatible Hive UDF</a></li>
</ul>
<p>Spark SQL is designed to be compatible with the Hive Metastore, SerDes, and UDFs.
Currently, Hive SerDes and UDFs are based on Hive 1.2.1,
and Spark SQL can connect to different versions of the Hive Metastore
(from 0.12.0 to 2.3.3; see also <a href="sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore">Interacting with Different Versions of Hive Metastore</a>).</p>
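<p>For example, here is a minimal sketch of pointing a <code>SparkSession</code> at an external Hive metastore; the metastore version, jars setting, and warehouse path below are illustrative assumptions, not required values:</p>
<pre><code>import org.apache.spark.sql.SparkSession

// A minimal sketch; version, jar source, and warehouse path are assumptions.
val spark = SparkSession.builder()
  .appName("HiveCompatibilitySketch")
  .config("spark.sql.hive.metastore.version", "2.3.3")
  .config("spark.sql.hive.metastore.jars", "maven")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
</code></pre>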
<h4 id="deploying-in-existing-hive-warehouses">Deploying in Existing Hive Warehouses</h4>
<p>The Spark SQL Thrift JDBC server is designed to be &#8220;out of the box&#8221; compatible with existing Hive
installations. You do not need to modify your existing Hive Metastore or change the data placement
or partitioning of your tables.</p>
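<p>As an illustration, a client can connect to the Thrift JDBC server with the standard Hive JDBC driver, exactly as it would to HiveServer2; the host, port, and user below are placeholder assumptions:</p>
<pre><code>import java.sql.DriverManager

// Placeholder host, port, and user; adjust for your deployment.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()
</code></pre>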
<h3 id="supported-hive-features">Supported Hive Features</h3>
<p>Spark SQL supports the vast majority of Hive features, such as the following (a short sketch exercising a few of them appears after the list):</p>
<ul>
<li>Hive query statements, including:
<ul>
<li><code>SELECT</code></li>
<li><code>GROUP BY</code></li>
<li><code>ORDER BY</code></li>
<li><code>CLUSTER BY</code></li>
<li><code>SORT BY</code></li>
</ul>
</li>
<li>All Hive operators, including:
<ul>
<li>Relational operators (<code>=</code>, <code>&lt;=&gt;</code>, <code>==</code>, <code>&lt;&gt;</code>, <code>&lt;</code>, <code>&gt;</code>, <code>&gt;=</code>, <code>&lt;=</code>, etc)</li>
<li>Arithmetic operators (<code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, <code>%</code>, etc)</li>
<li>Logical operators (<code>AND</code>, <code>&amp;&amp;</code>, <code>OR</code>, <code>||</code>, etc)</li>
<li>Complex type constructors</li>
<li>Mathematical functions (<code>sign</code>, <code>ln</code>, <code>cos</code>, etc)</li>
<li>String functions (<code>instr</code>, <code>length</code>, <code>printf</code>, etc)</li>
</ul>
</li>
<li>User defined functions (UDF)</li>
<li>User defined aggregation functions (UDAF)</li>
<li>User defined serialization formats (SerDes)</li>
<li>Window functions</li>
<li>Joins
<ul>
<li><code>JOIN</code></li>
<li><code>{LEFT|RIGHT|FULL} OUTER JOIN</code></li>
<li><code>LEFT SEMI JOIN</code></li>
<li><code>CROSS JOIN</code></li>
</ul>
</li>
<li>Unions</li>
<li>Sub-queries
<ul>
<li><code>SELECT col FROM (SELECT a + b AS col FROM t1) t2</code></li>
</ul>
</li>
<li>Sampling</li>
<li>Explain</li>
<li>Partitioned tables including dynamic partition insertion</li>
<li>View</li>
<li>All Hive DDL Functions, including:
<ul>
<li><code>CREATE TABLE</code></li>
<li><code>CREATE TABLE AS SELECT</code></li>
<li><code>ALTER TABLE</code></li>
</ul>
</li>
<li>Most Hive Data types, including:
<ul>
<li><code>TINYINT</code></li>
<li><code>SMALLINT</code></li>
<li><code>INT</code></li>
<li><code>BIGINT</code></li>
<li><code>BOOLEAN</code></li>
<li><code>FLOAT</code></li>
<li><code>DOUBLE</code></li>
<li><code>STRING</code></li>
<li><code>BINARY</code></li>
<li><code>TIMESTAMP</code></li>
<li><code>DATE</code></li>
<li><code>ARRAY&lt;&gt;</code></li>
<li><code>MAP&lt;&gt;</code></li>
<li><code>STRUCT&lt;&gt;</code></li>
</ul>
</li>
</ul>
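<p>A short sketch exercising a few of the features above (<code>GROUP BY</code>, <code>ORDER BY</code>, and <code>LEFT SEMI JOIN</code>); the table and column names are illustrative assumptions:</p>
<pre><code>// Assumes a SparkSession `spark` with Hive support enabled.
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
spark.sql(
  """SELECT key, count(*) AS cnt
    |FROM src
    |GROUP BY key
    |ORDER BY cnt DESC""".stripMargin).show()
// LEFT SEMI JOIN, a Hive join type supported by Spark SQL:
spark.sql("SELECT a.key FROM src a LEFT SEMI JOIN src b ON a.key = b.key").show()
</code></pre>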
<h3 id="unsupported-hive-functionality">Unsupported Hive Functionality</h3>
<p>Below is a list of Hive features that we don&#8217;t support yet. Most of these features are rarely used
in Hive deployments.</p>
<p><strong>Major Hive Features</strong></p>
<ul>
<li>Tables with buckets: bucketing is hash partitioning within a Hive table partition. Spark SQL
doesn&#8217;t support Hive buckets yet.</li>
</ul>
<p><strong>Esoteric Hive Features</strong></p>
<ul>
<li><code>UNION</code> type</li>
<li>Unique join</li>
<li>Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at
the moment, and only supports populating the <code>sizeInBytes</code> field of the Hive metastore.</li>
</ul>
<p><strong>Hive Input/Output Formats</strong></p>
<ul>
<li>File format for CLI: for results displayed back in the CLI, Spark SQL only supports <code>TextOutputFormat</code>.</li>
<li>Hadoop archive</li>
</ul>
<p><strong>Hive Optimizations</strong></p>
<p>A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are
less important due to Spark SQL&#8217;s in-memory computational model. Others are slotted for future
releases of Spark SQL.</p>
<ul>
<li>Block-level bitmap indexes and virtual columns (used to build indexes)</li>
<li>Automatically determine the number of reducers for joins and groupbys: currently, in Spark SQL, you
need to control the degree of post-shuffle parallelism using &#8220;<code>SET spark.sql.shuffle.partitions=[num_tasks];</code>&#8221; (see the sketch after this list).</li>
<li>Meta-data only query: For queries that can be answered by using only metadata, Spark SQL still
launches tasks to compute the result.</li>
<li>Skew data flag: Spark SQL does not follow the skew data flags in Hive.</li>
<li><code>STREAMTABLE</code> hint in join: Spark SQL does not follow the <code>STREAMTABLE</code> hint.</li>
<li>Merge multiple small files for query results: if the result output contains multiple small files,
Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
metadata. Spark SQL does not support that.</li>
</ul>
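<p>A minimal sketch of setting post-shuffle parallelism by hand; the value 64 is only an illustrative assumption (Spark&#8217;s default is 200):</p>
<pre><code>// Via SQL, as in the note above:
spark.sql("SET spark.sql.shuffle.partitions=64")
// Or equivalently, via the runtime configuration API:
spark.conf.set("spark.sql.shuffle.partitions", "64")
</code></pre>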
<p><strong>Hive UDF/UDTF/UDAF</strong></p>
<p>Not all the APIs of the Hive UDF/UDTF/UDAF are supported by Spark SQL. The unsupported APIs are listed below, followed by a sketch of the API surface that Spark SQL does call:</p>
<ul>
<li><code>getRequiredJars</code> and <code>getRequiredFiles</code> (<code>UDF</code> and <code>GenericUDF</code>) are functions to automatically
include additional resources required by a UDF.</li>
<li><code>initialize(StructObjectInspector)</code> in <code>GenericUDTF</code> is not supported yet. Spark SQL currently uses
only the deprecated interface <code>initialize(ObjectInspector[])</code>.</li>
<li><code>configure</code> (<code>GenericUDF</code>, <code>GenericUDTF</code>, and <code>GenericUDAFEvaluator</code>) is a function to initialize
functions with <code>MapredContext</code>, which is inapplicable to Spark.</li>
<li><code>close</code> (<code>GenericUDF</code> and <code>GenericUDAFEvaluator</code>) is a function to release associated resources.
Spark SQL does not call this function when tasks finish.</li>
<li><code>reset</code> (<code>GenericUDAFEvaluator</code>) is a function to re-initialize an aggregation so that the same evaluator
can be reused. Spark SQL currently does not support such reuse.</li>
<li><code>getWindowingEvaluator</code> (<code>GenericUDAFEvaluator</code>) is a function to optimize aggregation by evaluating
an aggregate over a fixed window.</li>
</ul>
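<p>For contrast, here is a hedged sketch of the portion of the <code>GenericUDF</code> surface that Spark SQL does call (<code>initialize</code>, <code>evaluate</code>, and <code>getDisplayString</code>); the class and function names are illustrative assumptions:</p>
<pre><code>import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Illustrative UDF that upper-cases a string argument. Spark SQL calls
// initialize(ObjectInspector[]), evaluate, and getDisplayString, but not
// configure, close, getRequiredJars, or getRequiredFiles (see list above).
class UpperUDF extends GenericUDF {
  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector =
    PrimitiveObjectInspectorFactory.javaStringObjectInspector

  override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
    val v = arguments(0).get()
    if (v == null) null else v.toString.toUpperCase  // simplified conversion
  }

  override def getDisplayString(children: Array[String]): String =
    "upper_udf(" + children.mkString(", ") + ")"
}
</code></pre>
<p>Such a UDF would then be registered with, e.g., <code>spark.sql("CREATE TEMPORARY FUNCTION upper_udf AS 'UpperUDF'")</code>, assuming the compiled class is on the classpath.</p>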
<h3 id="incompatible-hive-udf">Incompatible Hive UDF</h3>
<p>Below are the scenarios in which Hive and Spark SQL generate different results (the sketch after this list demonstrates them):</p>
<ul>
<li><code>SQRT(n)</code>: if n &lt; 0, Hive returns null while Spark SQL returns NaN.</li>
<li><code>ACOS(n)</code>: if n &lt; -1 or n &gt; 1, Hive returns null while Spark SQL returns NaN.</li>
<li><code>ASIN(n)</code>: if n &lt; -1 or n &gt; 1, Hive returns null while Spark SQL returns NaN.</li>
</ul>
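<p>A one-line sketch demonstrating the difference: in Spark SQL all three expressions below evaluate to NaN, whereas Hive returns null for the same inputs.</p>
<pre><code>// Out-of-domain inputs: Hive yields NULL, Spark SQL yields NaN.
spark.sql("SELECT sqrt(-1.0), acos(2.0), asin(-2.0)").show()
</code></pre>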
</div>
<!-- /container -->
</div>
<script src="js/vendor/jquery-1.12.4.min.js"></script>
<script src="js/vendor/bootstrap.min.js"></script>
<script src="js/vendor/anchor.min.js"></script>
<script src="js/main.js"></script>
<!-- MathJax Section -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<script>
// Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS.
// We could use "//cdn.mathjax...", but that won't support "file://".
(function(d, script) {
script = d.createElement('script');
script.type = 'text/javascript';
script.async = true;
script.onload = function(){
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
}
});
};
script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') +
'cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js' +
'?config=TeX-AMS-MML_HTMLorMML';
d.getElementsByTagName('head')[0].appendChild(script);
}(document));
</script>
</body>
</html>