docs/sql_queries.md

title: SQL Queries keywords: sql hive spark presto sidebar: mydoc_sidebar permalink: sql_queries.html toc: false summary: In this page, we go over how to enable SQL queries on Hoodie built tables.

Hoodie registers the dataset into the Hive metastore backed by HoodieInputFormat. This makes the data accessible to Hive & Spark & Presto automatically. To be able to perform normal SQL queries on such a dataset, we need to get the individual query engines to call HoodieInputFormat.getSplits(), during query planning such that the right versions of files are exposed to it.

In the following sections, we cover the configs needed across different query engines to achieve this.

{% include callout.html content=“Instructions are currently only for Copy-on-write storage” type=“info” %}

Hive

For HiveServer2 access, install the hoodie-hadoop-mr-x.y.z-SNAPSHOT.jar into the aux jars path and we should be able to recognize the Hoodie tables and query them correctly.

For beeline access, the hive.input.format variable needs to be set to the fully qualified path name of the inputformat com.uber.hoodie.hadoop.HoodieInputFormat For Tez, additionally the hive.tez.input.format needs to be set to org.apache.hadoop.hive.ql.io.HiveInputFormat

Spark

There are two ways of running Spark SQL on Hoodie datasets.

First method involves, setting spark.sql.hive.convertMetastoreParquet=false, forcing Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark). This turns off optimizations in Spark towards Parquet reading, which we will address in the next method based on path filters. However benchmarks have not revealed any real performance degradation with Hoodie & SparkSQL, compared to native support.

{% include callout.html content="Get involved to improve this integration here and here " type=“info” %}

Sample command is provided below to spin up Spark Shell

$ spark-shell --jars hoodie-hadoop-mr-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf  --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g  --master yarn-client

scala> sqlContext.sql("select count(*) from uber.trips where datestr = '2016-10-02'").show()

For scheduled Spark jobs, a dependency to hoodie-hadoop-mr and hoodie-client modules needs to be added and the same config needs to be set on SparkConf or conveniently via HoodieReadClient.addHoodieSupport(conf)

{% include callout.html content="Don‘t instantiate a HoodieWriteClient against a table you don’t own. Hoodie is a single writer & multiple reader system as of now. You may accidentally cause incidents otherwise. " type=“warning” %}

The second method uses a new feature in Spark 2.x, which allows for the work of HoodieInputFormat to be done via a path filter as below. This method uses Spark built-in optimizations for reading Parquet files, just like queries on non-hoodie tables.

spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);

Presto

Presto requires a patch (until the PR is merged) and the hoodie-hadoop-mr jar to be placed into <presto_install>/plugin/hive-hadoop2/.

{% include callout.html content=“Get involved to improve this integration here” type=“info” %}