Hoodie registers the dataset into the Hive metastore backed by HoodieInputFormat
. This makes the data accessible to Hive & Spark & Presto automatically. To be able to perform normal SQL queries on such a dataset, we need to get the individual query engines to call HoodieInputFormat.getSplits()
, during query planning such that the right versions of files are exposed to it.
In the following sections, we cover the configs needed across different query engines to achieve this.
{% include callout.html content=“Instructions are currently only for Copy-on-write storage” type=“info” %}
For HiveServer2 access, install the hoodie-hadoop-mr-x.y.z-SNAPSHOT.jar into the aux jars path and we should be able to recognize the Hoodie tables and query them correctly.
For beeline access, the hive.input.format
variable needs to be set to the fully qualified path name of the inputformat com.uber.hoodie.hadoop.HoodieInputFormat
For Tez, additionally the hive.tez.input.format
needs to be set to org.apache.hadoop.hive.ql.io.HiveInputFormat
There are two ways of running Spark SQL on Hoodie datasets.
First method involves, setting spark.sql.hive.convertMetastoreParquet=false
, forcing Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark). This turns off optimizations in Spark towards Parquet reading, which we will address in the next method based on path filters. However benchmarks have not revealed any real performance degradation with Hoodie & SparkSQL, compared to native support.
{% include callout.html content="Get involved to improve this integration here and here " type=“info” %}
Sample command is provided below to spin up Spark Shell
$ spark-shell --jars hoodie-hadoop-mr-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g --master yarn-client scala> sqlContext.sql("select count(*) from uber.trips where datestr = '2016-10-02'").show()
For scheduled Spark jobs, a dependency to hoodie-hadoop-mr and hoodie-client modules needs to be added and the same config needs to be set on SparkConf
or conveniently via HoodieReadClient.addHoodieSupport(conf)
{% include callout.html content="Don‘t instantiate a HoodieWriteClient against a table you don’t own. Hoodie is a single writer & multiple reader system as of now. You may accidentally cause incidents otherwise. " type=“warning” %}
The second method uses a new feature in Spark 2.x, which allows for the work of HoodieInputFormat to be done via a path filter as below. This method uses Spark built-in optimizations for reading Parquet files, just like queries on non-hoodie tables.
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
Presto requires a patch (until the PR is merged) and the hoodie-hadoop-mr jar to be placed into <presto_install>/plugin/hive-hadoop2/
.
{% include callout.html content=“Get involved to improve this integration here” type=“info” %}