---
layout: global
title: Getting Started
displayTitle: Getting Started
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* Table of contents
{:toc}

## Starting Point: SparkSession

<div class="codetabs">
<div data-lang="scala" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/scala/org/apache/spark/sql/SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:

{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:

{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder`:

{% include_example init_session python/sql/basic.py %}
</div>

<div data-lang="r" markdown="1">

The entry point into all functionality in Spark is the [`SparkSession`](api/R/reference/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:

{% include_example init_session r/RSparkSQLExample.R %}

Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once; SparkR functions like `read.df` can then access this global instance implicitly, without the `SparkSession` instance being passed around.
</div>
</div>

`SparkSession` in Spark 2.0 provides built-in support for Hive features, including the ability to
write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
To use these features, you do not need to have an existing Hive setup.
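
As a rough Scala sketch of how such a session might be constructed (the application name and warehouse path below are placeholders, not values Spark requires):

{% highlight scala %}
import org.apache.spark.sql.SparkSession

// Hive support is opted into on the builder; no existing Hive installation is needed.
val spark = SparkSession.builder()
  .appName("Getting Started")                           // placeholder application name
  .config("spark.sql.warehouse.dir", "spark-warehouse") // placeholder warehouse location
  .enableHiveSupport()
  .getOrCreate()
{% endhighlight %}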

## Creating DataFrames

<div class="codetabs">
<div data-lang="scala" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df python/sql/basic.py %}
</div>

<div data-lang="r" markdown="1">
With a `SparkSession`, applications can create DataFrames from a local R data.frame,
from a Hive table, or from [Spark data sources](sql-data-sources.html).

As an example, the following creates a DataFrame based on the content of a JSON file:

{% include_example create_df r/RSparkSQLExample.R %}

</div>
</div>
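
A DataFrame can also be built from a local collection, which is convenient for quick experiments. A minimal Scala sketch (the column names and JSON path here are only illustrative, and `spark` is the `SparkSession` created above):

{% highlight scala %}
import spark.implicits._

// From a local collection of tuples, naming the columns explicitly.
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
df.show()

// From a data source, e.g. a JSON file (the path is a placeholder).
val people = spark.read.json("path/to/people.json")
people.printSchema()
{% endhighlight %}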


## Untyped Dataset Operations (aka DataFrame Operations)

DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/org/apache/spark/sql/Dataset.html), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html) and [R](api/R/reference/SparkDataFrame.html).

As mentioned above, in Spark 2.0, DataFrames are just Datasets of `Row`s in the Scala and Java APIs. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.

Here we include some basic examples of structured data processing using Datasets:

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}

For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/org/apache/spark/sql/Dataset.html).

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/org/apache/spark/sql/functions$.html).
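
For instance, a brief sketch of combining column expressions with a couple of these functions (assuming `df` is the DataFrame from the example above, with `name` and `age` columns):

{% highlight scala %}
import spark.implicits._
import org.apache.spark.sql.functions.{upper, length}

// Column expressions can mix arithmetic with built-in functions.
df.select(upper($"name"), $"age" + 1).show()
df.filter(length($"name") > 3).show()
{% endhighlight %}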
</div>

<div data-lang="java" markdown="1">

{% include_example untyped_ops java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}

For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/java/org/apache/spark/sql/Dataset.html).

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
</div>

<div data-lang="python" markdown="1">
In Python, it's possible to access a DataFrame's columns either by attribute
(`df.age`) or by indexing (`df['age']`). While the former is convenient for
interactive data exploration, users are highly encouraged to use the
latter form, which is future-proof and won't break with column names that
are also attributes on the DataFrame class.

{% include_example untyped_ops python/sql/basic.py %}
For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/reference/pyspark.sql.html#dataframe-apis).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/reference/pyspark.sql.html#functions).

</div>

<div data-lang="r" markdown="1">

{% include_example untyped_ops r/RSparkSQLExample.R %}

For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/reference/index.html).

In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/reference/SparkDataFrame.html).

</div>

</div>

## Running SQL Queries Programmatically

<div class="codetabs">
<div data-lang="scala" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.

{% include_example run_sql scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `Dataset<Row>`.

{% include_example run_sql java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.

{% include_example run_sql python/sql/basic.py %}
</div>

<div data-lang="r" markdown="1">
The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.

{% include_example run_sql r/RSparkSQLExample.R %}

</div>
</div>
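
As a minimal Scala sketch of the pattern (reusing the `df` DataFrame from the earlier examples; the view name `people` is arbitrary):

{% highlight scala %}
// Register the DataFrame under a name so it can be referenced in SQL.
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT name, age FROM people WHERE age > 20")
sqlDF.show()
{% endhighlight %}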


## Global Temporary View

Temporary views in Spark SQL are session-scoped and will disappear if the session that created them
terminates. If you want a temporary view that is shared among all sessions and kept alive
until the Spark application terminates, you can create a global temporary view. Global temporary
views are tied to the system-preserved database `global_temp`, and you must use the qualified name to
refer to them, e.g. `SELECT * FROM global_temp.view1`.

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example global_temp_view scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example global_temp_view java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example global_temp_view python/sql/basic.py %}
</div>

<div data-lang="SQL" markdown="1">

{% highlight sql %}

CREATE GLOBAL TEMPORARY VIEW temp_view AS SELECT a + 1, b * 2 FROM tbl

SELECT * FROM global_temp.temp_view

{% endhighlight %}

</div>
</div>
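
In Scala, the cross-session behavior can be sketched roughly as follows (again reusing `df`; the view name is arbitrary):

{% highlight scala %}
df.createGlobalTempView("people")

// The view lives in the system-preserved `global_temp` database...
spark.sql("SELECT * FROM global_temp.people").show()

// ...and remains visible from other sessions of the same application.
spark.newSession().sql("SELECT * FROM global_temp.people").show()
{% endhighlight %}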


## Creating Datasets

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use
a specialized [Encoder](api/scala/org/apache/spark/sql/Encoder.html) to serialize the objects
for processing or transmitting over the network. While both encoders and standard serialization are
responsible for turning an object into bytes, encoders are code-generated dynamically and use a format
that allows Spark to perform many operations like filtering, sorting and hashing without deserializing
the bytes back into an object.

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example create_ds scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example create_ds java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>
</div>
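
A compact Scala sketch of the idea (the `Point` case class is hypothetical; importing `spark.implicits._` brings the required `Encoder` into scope):

{% highlight scala %}
case class Point(x: Double, y: Double) // hypothetical example class

import spark.implicits._ // derives an implicit Encoder[Point]

val ds = Seq(Point(0.0, 1.0), Point(2.0, 3.0)).toDS()
// Typed operations work on the objects while Spark keeps the data encoded.
ds.map(p => p.x + p.y).show()
{% endhighlight %}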

## Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection-based approach leads to more concise code and works well when you already know the schema
while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct Datasets when the columns and their types are not known until runtime.

### Inferring the Schema Using Reflection
<div class="codetabs">

<div data-lang="scala" markdown="1">

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
to a DataFrame. The case class
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as `Seq`s or `Array`s. This RDD can be implicitly converted to a DataFrame and then be
registered as a table. Tables can be used in subsequent SQL statements.

{% include_example schema_inferring scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
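
To illustrate the point about nested and complex types, a small sketch (the `Person` and `Address` case classes here are hypothetical):

{% highlight scala %}
case class Address(city: String, zip: String)
case class Person(name: String, age: Long, addresses: Seq[Address])

import spark.implicits._

// Reflection over the case classes produces a nested schema.
val peopleDF = Seq(Person("Ana", 30L, Seq(Address("Lyon", "69000")))).toDF()
peopleDF.printSchema()
{% endhighlight %}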
</div>

<div data-lang="java" markdown="1">

Spark SQL supports automatically converting an RDD of
[JavaBeans](http://stackoverflow.com/questions/3295496/what-is-a-javabean-exactly) into a DataFrame.
The `BeanInfo`, obtained using reflection, defines the schema of the table. Currently, Spark SQL
does not support JavaBeans that contain `Map` field(s). Nested JavaBeans and `List` or `Array`
fields are supported though. You can create a JavaBean by creating a class that implements
Serializable and has getters and setters for all of its fields.

{% include_example schema_inferring java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">

Spark SQL can convert an RDD of `Row` objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a set of
key/value pairs as keyword arguments to the `Row` class. The keys define the column names of the table,
and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.

{% include_example schema_inferring python/sql/basic.py %}
</div>

</div>

### Programmatically Specifying the Schema

<div class="codetabs">

<div data-lang="scala" markdown="1">

When case classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed
and fields will be projected differently for different users),
a `DataFrame` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
`Row`s in the RDD created in step 1;
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided
by `SparkSession`.

For example:

{% include_example programmatic_schema scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
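
The schema can equally be built up from `StructField`s directly rather than parsed from a string; a minimal sketch (column names and sample rows are illustrative):

{% highlight scala %}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Step 1: an RDD of Rows; Step 2: a matching StructType; Step 3: createDataFrame.
val rowRDD = spark.sparkContext.parallelize(Seq(Row("Ana", 30), Row("Bo", 25)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.show()
{% endhighlight %}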
</div>

<div data-lang="java" markdown="1">

When JavaBean classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `Dataset<Row>` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
`Row`s in the RDD created in step 1;
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided
by `SparkSession`.

For example:

{% include_example programmatic_schema java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
</div>

<div data-lang="python" markdown="1">

When a dictionary of kwargs cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `DataFrame` can be created programmatically with three steps.

1. Create an RDD of tuples or lists from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
tuples or lists in the RDD created in step 1;
3. Apply the schema to the RDD via the `createDataFrame` method provided by `SparkSession`.

For example:

{% include_example programmatic_schema python/sql/basic.py %}
</div>

</div>

## Scalar Functions

Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of [Built-in Scalar Functions](sql-ref-functions.html#scalar-functions). It also supports [User Defined Scalar Functions](sql-ref-functions-udf-scalar.html).
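
As a brief Scala sketch, a built-in scalar function can be used directly in SQL, and a user-defined one can be registered under a name of your choosing (`plus_one` below is purely illustrative):

{% highlight scala %}
// A built-in scalar function evaluated per row.
spark.sql("SELECT abs(-3) AS a, substring('Spark SQL', 1, 5) AS s").show()

// A user-defined scalar function registered for use in SQL.
spark.udf.register("plus_one", (x: Long) => x + 1)
spark.sql("SELECT plus_one(41) AS answer").show()
{% endhighlight %}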

## Aggregate Functions

Aggregate functions are functions that return a single value for a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `count_distinct()`, `avg()`, `max()`, `min()`, etc.
Users are not limited to the predefined aggregate functions and can create their own. For more details
about user-defined aggregate functions, please refer to the documentation of
[User Defined Aggregate Functions](sql-ref-functions-udf-aggregate.html).
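
A short Scala sketch of built-in aggregations through the DataFrame API (assuming `df` has the `name` and `age` columns used throughout this page):

{% highlight scala %}
import spark.implicits._
import org.apache.spark.sql.functions.{avg, countDistinct, max, min}

// A global aggregation over the whole DataFrame...
df.agg(avg($"age"), max($"age"), min($"age"), countDistinct($"name")).show()

// ...and the same functions applied per group.
df.groupBy($"name").agg(avg($"age")).show()
{% endhighlight %}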