This module provides support for executing relational queries expressed in either SQL or a LINQ-like Scala DSL.
Spark SQL is broken up into four subprojects:
 - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a public interface, SQLContext, that allows users to execute SQL or DataFrame queries against existing RDDs and Parquet files.
 - Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries in a subset of HiveQL and access data from a Hive metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
 - HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2-compatible server for JDBC/ODBC.
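As a quick illustration of the two query styles mentioned above, here is a minimal sketch (not taken from the codebase); it assumes you are in the Spark shell, where `sqlContext` is already defined, and that a table named `src` with columns `key` and `value` has been registered:

```
// Assumed setup: sqlContext exists (e.g. in the Spark shell) and "src" is a registered table.

// The same query expressed as SQL text...
val viaSql = sqlContext.sql("SELECT key, value FROM src WHERE key > 10")

// ...and through the DataFrame DSL.
val src = sqlContext.table("src")
val viaDsl = src.where(src("key") > 10).select(src("key"), src("value"))
```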
In order to create new Hive test cases (i.e. a test suite based on `HiveComparisonTest`), you will need to set up your development environment based on the following instructions. A sketch of such a test suite appears after the setup steps below.
If you are working with Hive 0.12.0, you will need to set several environment variables as follows.
export HIVE_HOME="<path to>/hive/build/dist" export HIVE_DEV_HOME="<path to>/hive/" export HADOOP_HOME="<path to>/hadoop-1.0.4"
If you are working with Hive 0.13.1, the following steps are needed:

1. Download Hive's 0.13.1 release and set `HIVE_HOME` with `export HIVE_HOME="<path to hive>"`. Please do not set `HIVE_DEV_HOME` (see [SPARK-4119](https://issues.apache.org/jira/browse/SPARK-4119)).
2. Set `HADOOP_HOME` with `export HADOOP_HOME="<path to hadoop>"`.
3. Download the Hive 0.13.1a jars (the Hive jars actually used by Spark) and replace the corresponding original 0.13.1 jars in `$HIVE_HOME/lib`.
4. Download the Kryo 2.21 and Javolution 5.5.1 jars to `$HIVE_HOME/lib`.
The test setup also applies the following settings:

```
val testTempDir = Utils.createTempDir()

// We have to use kryo to let Hive correctly serialize some plans.
sql("set hive.plan.serialization.format=kryo")

// Explicitly set fs to local fs.
sql(s"set fs.default.name=file://$testTempDir/")

// Ask Hive to run jobs in-process as a single map and reduce task.
sql("set mapred.job.tracker=local")
```
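With the environment configured, a new Hive test case is typically added as a suite extending `HiveComparisonTest`. Below is a minimal sketch; the suite name and query are hypothetical, and it assumes the `createQueryTest(name, sql)` helper from `org.apache.spark.sql.hive.execution` in this codebase.

```
package org.apache.spark.sql.hive.execution

// Hypothetical example suite: each createQueryTest call runs the given HiveQL
// both through Hive and through Spark SQL and compares the answers against a
// generated golden answer file.
class ExampleQuerySuite extends HiveComparisonTest {
  createQueryTest("simple key filter",
    "SELECT key, value FROM src WHERE key < 10")
}
```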
An interactive Scala console can be invoked by running `build/sbt hive/console`. From here you can execute queries with HiveQL and manipulate DataFrames using the DSL.
```
catalyst$ build/sbt hive/console

[info] Starting scala interpreter...
import org.apache.spark.sql.catalyst.analysis._
import org.apache.spark.sql.catalyst.dsl._
import org.apache.spark.sql.catalyst.errors._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules._
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.execution
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.test.TestHive._
import org.apache.spark.sql.types._
Type in expressions to have them evaluated.
Type :help for more information.

scala> val query = sql("SELECT * FROM (SELECT * FROM src) a")
query: org.apache.spark.sql.DataFrame = org.apache.spark.sql.DataFrame@74448eed
```
Query results are `DataFrames` and can be operated on as such.
```
scala> query.collect()
res2: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86], [311,val_311], [27,val_27]...
```
You can also build further queries on top of these `DataFrames` using the query DSL.
```
scala> query.where(query("key") > 30).select(avg(query("key"))).collect()
res3: Array[org.apache.spark.sql.Row] = Array([274.79025423728814])
```
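Because the console is mostly used while working on Catalyst, it is also handy for inspecting the plans behind a DataFrame. A minimal sketch (output omitted; `query` is the DataFrame created above):

```
// Print the parsed, analyzed, optimized, and physical plans for the query.
query.explain(true)

// Or grab the intermediate Catalyst plans directly for further inspection.
val optimized = query.queryExecution.optimizedPlan
```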