| <!DOCTYPE html> |
| <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> |
| <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> |
| <!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> |
| <!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]--> |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Spark SQL Programming Guide - Spark 1.2.1 Documentation</title> |
| <meta name="description" content=""> |
| |
| |
| |
| <link rel="stylesheet" href="css/bootstrap.min.css"> |
| <style> |
| body { |
| padding-top: 60px; |
| padding-bottom: 40px; |
| } |
| </style> |
| <meta name="viewport" content="width=device-width"> |
| <link rel="stylesheet" href="css/bootstrap-responsive.min.css"> |
| <link rel="stylesheet" href="css/main.css"> |
| |
| <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script> |
| |
| <link rel="stylesheet" href="css/pygments-default.css"> |
| |
| |
| <!-- Google analytics script --> |
| <script type="text/javascript"> |
| var _gaq = _gaq || []; |
| _gaq.push(['_setAccount', 'UA-32518208-2']); |
| _gaq.push(['_trackPageview']); |
| |
| (function() { |
| var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; |
| ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; |
| var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); |
| })(); |
| </script> |
| |
| |
| </head> |
| <body> |
| <!--[if lt IE 7]> |
| <p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p> |
| <![endif]--> |
| |
| <!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html --> |
| |
| <div class="navbar navbar-fixed-top" id="topbar"> |
| <div class="navbar-inner"> |
| <div class="container"> |
| <div class="brand"><a href="index.html"> |
| <img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">1.2.1</span> |
| </div> |
| <ul class="nav"> |
<!--TODO(andyk): Add class="active" attribute to li somehow.-->
| <li><a href="index.html">Overview</a></li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a> |
| <ul class="dropdown-menu"> |
| <li><a href="quick-start.html">Quick Start</a></li> |
| <li><a href="programming-guide.html">Spark Programming Guide</a></li> |
| <li class="divider"></li> |
| <li><a href="streaming-programming-guide.html">Spark Streaming</a></li> |
| <li><a href="sql-programming-guide.html">Spark SQL</a></li> |
| <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li> |
| <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li> |
| <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a> |
| <ul class="dropdown-menu"> |
| <li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li> |
| <li><a href="api/java/index.html">Java</a></li> |
| <li><a href="api/python/index.html">Python</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a> |
| <ul class="dropdown-menu"> |
| <li><a href="cluster-overview.html">Overview</a></li> |
| <li><a href="submitting-applications.html">Submitting Applications</a></li> |
| <li class="divider"></li> |
| <li><a href="spark-standalone.html">Spark Standalone</a></li> |
| <li><a href="running-on-mesos.html">Mesos</a></li> |
| <li><a href="running-on-yarn.html">YARN</a></li> |
| <li class="divider"></li> |
| <li><a href="ec2-scripts.html">Amazon EC2</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a> |
| <ul class="dropdown-menu"> |
| <li><a href="configuration.html">Configuration</a></li> |
| <li><a href="monitoring.html">Monitoring</a></li> |
| <li><a href="tuning.html">Tuning Guide</a></li> |
| <li><a href="job-scheduling.html">Job Scheduling</a></li> |
| <li><a href="security.html">Security</a></li> |
| <li><a href="hardware-provisioning.html">Hardware Provisioning</a></li> |
| <li><a href="hadoop-third-party-distributions.html">3<sup>rd</sup>-Party Hadoop Distros</a></li> |
| <li class="divider"></li> |
| <li><a href="building-spark.html">Building Spark</a></li> |
| <li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li> |
| <li><a href="https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects">Supplemental Projects</a></li> |
| </ul> |
| </li> |
| </ul> |
| <!--<p class="navbar-text pull-right"><span class="version-text">v1.2.1</span></p>--> |
| </div> |
| </div> |
| </div> |
| |
| <div class="container" id="content"> |
| |
| <h1 class="title">Spark SQL Programming Guide</h1> |
| |
| |
| <ul id="markdown-toc"> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#getting-started">Getting Started</a></li> |
| <li><a href="#data-sources">Data Sources</a> <ul> |
| <li><a href="#rdds">RDDs</a> <ul> |
| <li><a href="#inferring-the-schema-using-reflection">Inferring the Schema Using Reflection</a></li> |
| <li><a href="#programmatically-specifying-the-schema">Programmatically Specifying the Schema</a></li> |
| </ul> |
| </li> |
| <li><a href="#parquet-files">Parquet Files</a> <ul> |
| <li><a href="#loading-data-programmatically">Loading Data Programmatically</a></li> |
| <li><a href="#configuration">Configuration</a></li> |
| </ul> |
| </li> |
| <li><a href="#json-datasets">JSON Datasets</a></li> |
| <li><a href="#hive-tables">Hive Tables</a></li> |
| </ul> |
| </li> |
| <li><a href="#performance-tuning">Performance Tuning</a> <ul> |
| <li><a href="#caching-data-in-memory">Caching Data In Memory</a></li> |
| <li><a href="#other-configuration-options">Other Configuration Options</a></li> |
| </ul> |
| </li> |
| <li><a href="#other-sql-interfaces">Other SQL Interfaces</a> <ul> |
| <li><a href="#running-the-thrift-jdbcodbc-server">Running the Thrift JDBC/ODBC server</a></li> |
| <li><a href="#running-the-spark-sql-cli">Running the Spark SQL CLI</a></li> |
| </ul> |
| </li> |
| <li><a href="#compatibility-with-other-systems">Compatibility with Other Systems</a> <ul> |
| <li><a href="#migration-guide-for-shark-user">Migration Guide for Shark User</a> <ul> |
| <li><a href="#scheduling">Scheduling</a></li> |
| <li><a href="#reducer-number">Reducer number</a></li> |
| <li><a href="#caching">Caching</a></li> |
| </ul> |
| </li> |
| <li><a href="#compatibility-with-apache-hive">Compatibility with Apache Hive</a> <ul> |
| <li><a href="#deploying-in-existing-hive-warehouses">Deploying in Existing Hive Warehouses</a></li> |
| <li><a href="#supported-hive-features">Supported Hive Features</a></li> |
| <li><a href="#unsupported-hive-functionality">Unsupported Hive Functionality</a></li> |
| </ul> |
| </li> |
| </ul> |
| </li> |
| <li><a href="#writing-language-integrated-relational-queries">Writing Language-Integrated Relational Queries</a></li> |
| <li><a href="#spark-sql-datatype-reference">Spark SQL DataType Reference</a></li> |
| </ul> |
| |
| <h1 id="overview">Overview</h1> |
| |
| <div class="codetabs"> |
| <div data-lang="scala"> |
| |
| <p>Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using |
| Spark. At the core of this component is a new type of RDD, |
| <a href="api/scala/index.html#org.apache.spark.sql.SchemaRDD">SchemaRDD</a>. SchemaRDDs are composed of |
| <a href="api/scala/index.html#org.apache.spark.sql.package@Row:org.apache.spark.sql.catalyst.expressions.Row.type">Row</a> objects, along with |
| a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table |
in a traditional relational database. A SchemaRDD can be created from an existing RDD, from a <a href="http://parquet.io">Parquet</a>
file, from a JSON dataset, or by running HiveQL against data stored in <a href="http://hive.apache.org/">Apache Hive</a>.</p>
| |
| <p>All of the examples on this page use sample data included in the Spark distribution and can be run in the <code>spark-shell</code>.</p> |
| |
| </div> |
| |
| <div data-lang="java"> |
| <p>Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using |
| Spark. At the core of this component is a new type of RDD, |
| <a href="api/scala/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD">JavaSchemaRDD</a>. JavaSchemaRDDs are composed of |
| <a href="api/scala/index.html#org.apache.spark.sql.api.java.Row">Row</a> objects, along with |
| a schema that describes the data types of each column in the row. A JavaSchemaRDD is similar to a table |
in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, from a <a href="http://parquet.io">Parquet</a>
file, from a JSON dataset, or by running HiveQL against data stored in <a href="http://hive.apache.org/">Apache Hive</a>.</p>
| </div> |
| |
| <div data-lang="python"> |
| |
| <p>Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using |
| Spark. At the core of this component is a new type of RDD, |
| <a href="api/python/pyspark.sql.SchemaRDD-class.html">SchemaRDD</a>. SchemaRDDs are composed of |
| <a href="api/python/pyspark.sql.Row-class.html">Row</a> objects, along with |
| a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table |
in a traditional relational database. A SchemaRDD can be created from an existing RDD, from a <a href="http://parquet.io">Parquet</a>
file, from a JSON dataset, or by running HiveQL against data stored in <a href="http://hive.apache.org/">Apache Hive</a>.</p>
| |
| <p>All of the examples on this page use sample data included in the Spark distribution and can be run in the <code>pyspark</code> shell.</p> |
| </div> |
| </div> |
| |
| <p><strong>Spark SQL is currently an alpha component. While we will minimize API changes, some APIs may change in future releases.</strong></p> |
| |
| <hr /> |
| |
| <h1 id="getting-started">Getting Started</h1> |
| |
| <div class="codetabs"> |
| <div data-lang="scala"> |
| |
| <p>The entry point into all relational functionality in Spark is the |
| <a href="api/scala/index.html#org.apache.spark.sql.SQLContext">SQLContext</a> class, or one of its |
| descendants. To create a basic SQLContext, all you need is a SparkContext.</p> |
| |
| <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">sc</span><span class="k">:</span> <span class="kt">SparkContext</span> <span class="c1">// An existing SparkContext.</span> |
| <span class="k">val</span> <span class="n">sqlContext</span> <span class="k">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">.</span><span class="nc">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">)</span> |
| |
| <span class="c1">// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.</span> |
| <span class="k">import</span> <span class="nn">sqlContext.createSchemaRDD</span></code></pre></div> |
| |
<p>In addition to the basic SQLContext, you can also create a HiveContext, which provides a
superset of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default
Spark build. If these dependencies are not a problem for your application, then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
feature parity with HiveContext.</p>
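
<p>For example, a HiveContext is constructed from the same SparkContext (a minimal sketch; no existing Hive installation is needed):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// sc is an existing SparkContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)</code></pre></div>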
| |
| </div> |
| |
| <div data-lang="java"> |
| |
| <p>The entry point into all relational functionality in Spark is the |
| <a href="api/scala/index.html#org.apache.spark.sql.api.java.JavaSQLContext">JavaSQLContext</a> class, or one |
| of its descendants. To create a basic JavaSQLContext, all you need is a JavaSparkContext.</p> |
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">JavaSparkContext</span> <span class="n">sc</span> <span class="o">=</span> <span class="o">...;</span> <span class="c1">// An existing JavaSparkContext.</span> |
| <span class="n">JavaSQLContext</span> <span class="n">sqlContext</span> <span class="o">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">sql</span><span class="o">.</span><span class="na">api</span><span class="o">.</span><span class="na">java</span><span class="o">.</span><span class="na">JavaSQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">);</span></code></pre></div> |
| |
<p>In addition to the basic SQLContext, you can also create a HiveContext, which provides a strict
superset of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default
Spark build. If these dependencies are not a problem for your application, then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
feature parity with HiveContext.</p>
| |
| </div> |
| |
| <div data-lang="python"> |
| |
| <p>The entry point into all relational functionality in Spark is the |
| <a href="api/python/pyspark.sql.SQLContext-class.html">SQLContext</a> class, or one |
of its descendants. To create a basic SQLContext, all you need is a SparkContext.</p>
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SQLContext</span> |
| <span class="n">sqlContext</span> <span class="o">=</span> <span class="n">SQLContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span></code></pre></div> |
| |
<p>In addition to the basic SQLContext, you can also create a HiveContext, which provides a strict
superset of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default
Spark build. If these dependencies are not a problem for your application, then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
feature parity with HiveContext.</p>
| |
| </div> |
| |
| </div> |
| |
<p>The specific variant of SQL that is used to parse queries can also be selected using the
<code>spark.sql.dialect</code> option. This parameter can be changed using either the <code>setConf</code> method on
a SQLContext or by using a <code>SET key=value</code> command in SQL. For a SQLContext, the only dialect
available is “sql”, which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the
default is “hiveql”, though “sql” is also available. Since the HiveQL parser is much more complete,
it is recommended for most use cases.</p>
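
<p>For example, a minimal sketch of both ways to change the dialect, assuming a HiveContext named <code>hiveContext</code> as created above:</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// Switch to the simple SQL parser programmatically via setConf...
hiveContext.setConf("spark.sql.dialect", "sql")

// ...or switch back to HiveQL from within SQL using a SET command.
hiveContext.sql("SET spark.sql.dialect=hiveql")</code></pre></div>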
| |
| <h1 id="data-sources">Data Sources</h1> |
| |
<p>Spark SQL supports operating on a variety of data sources through the <code>SchemaRDD</code> interface.
A SchemaRDD can be operated on as a normal RDD and can also be registered as a temporary table.
Registering a SchemaRDD as a table allows you to run SQL queries over its data. This section
describes the various methods for loading data into a SchemaRDD.</p>
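
<p>For example, a minimal sketch of this register-then-query pattern (assuming an existing SchemaRDD named <code>rdd</code> and a <code>sqlContext</code> as above):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// Register the SchemaRDD under a table name, then query it with SQL.
rdd.registerTempTable("my_table")
val result = sqlContext.sql("SELECT * FROM my_table")</code></pre></div>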
| |
| <h2 id="rdds">RDDs</h2> |
| |
<p>Spark SQL supports two different methods for converting existing RDDs into SchemaRDDs. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection-based approach leads to more concise code and works well when you already know the schema
while writing your Spark application.</p>
| |
| <p>The second method for creating SchemaRDDs is through a programmatic interface that allows you to |
| construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows |
| you to construct SchemaRDDs when the columns and their types are not known until runtime.</p> |
| |
| <h3 id="inferring-the-schema-using-reflection">Inferring the Schema Using Reflection</h3> |
| <div class="codetabs"> |
| |
| <div data-lang="scala"> |
| |
<p>The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
to a SchemaRDD. The case class
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as Sequences or Arrays. This RDD can be implicitly converted to a SchemaRDD and then
registered as a table. Tables can be used in subsequent SQL statements.</p>
| |
| <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// sc is an existing SparkContext.</span> |
| <span class="k">val</span> <span class="n">sqlContext</span> <span class="k">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">.</span><span class="nc">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">)</span> |
| <span class="c1">// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.</span> |
| <span class="k">import</span> <span class="nn">sqlContext.createSchemaRDD</span> |
| |
| <span class="c1">// Define the schema using a case class.</span> |
| <span class="c1">// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,</span> |
| <span class="c1">// you can use custom classes that implement the Product interface.</span> |
| <span class="k">case</span> <span class="k">class</span> <span class="nc">Person</span><span class="o">(</span><span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">age</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> |
| |
| <span class="c1">// Create an RDD of Person objects and register it as a table.</span> |
| <span class="k">val</span> <span class="n">people</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">","</span><span class="o">)).</span><span class="n">map</span><span class="o">(</span><span class="n">p</span> <span class="k">=></span> <span class="nc">Person</span><span class="o">(</span><span class="n">p</span><span class="o">(</span><span class="mi">0</span><span class="o">),</span> <span class="n">p</span><span class="o">(</span><span class="mi">1</span><span class="o">).</span><span class="n">trim</span><span class="o">.</span><span class="n">toInt</span><span class="o">))</span> |
| <span class="n">people</span><span class="o">.</span><span class="n">registerTempTable</span><span class="o">(</span><span class="s">"people"</span><span class="o">)</span> |
| |
| <span class="c1">// SQL statements can be run by using the sql methods provided by sqlContext.</span> |
| <span class="k">val</span> <span class="n">teenagers</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="o">)</span> |
| |
| <span class="c1">// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.</span> |
| <span class="c1">// The columns of a row in the result can be accessed by ordinal.</span> |
| <span class="n">teenagers</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">t</span> <span class="k">=></span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">t</span><span class="o">(</span><span class="mi">0</span><span class="o">)).</span><span class="n">collect</span><span class="o">().</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">)</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="java"> |
| |
| <p>Spark SQL supports automatically converting an RDD of <a href="http://stackoverflow.com/questions/3295496/what-is-a-javabean-exactly">JavaBeans</a> |
into a SchemaRDD. The BeanInfo, obtained using reflection, defines the schema of the table.
Currently, Spark SQL does not support JavaBeans that contain
nested fields or complex types such as Lists or Arrays. You can create a JavaBean by creating a
| class that implements Serializable and has getters and setters for all of its fields.</p> |
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kd">public</span> <span class="kd">static</span> <span class="kd">class</span> <span class="nc">Person</span> <span class="kd">implements</span> <span class="n">Serializable</span> <span class="o">{</span> |
| <span class="kd">private</span> <span class="n">String</span> <span class="n">name</span><span class="o">;</span> |
| <span class="kd">private</span> <span class="kt">int</span> <span class="n">age</span><span class="o">;</span> |
| |
| <span class="kd">public</span> <span class="n">String</span> <span class="nf">getName</span><span class="o">()</span> <span class="o">{</span> |
| <span class="k">return</span> <span class="n">name</span><span class="o">;</span> |
| <span class="o">}</span> |
| |
| <span class="kd">public</span> <span class="kt">void</span> <span class="nf">setName</span><span class="o">(</span><span class="n">String</span> <span class="n">name</span><span class="o">)</span> <span class="o">{</span> |
| <span class="k">this</span><span class="o">.</span><span class="na">name</span> <span class="o">=</span> <span class="n">name</span><span class="o">;</span> |
| <span class="o">}</span> |
| |
| <span class="kd">public</span> <span class="kt">int</span> <span class="nf">getAge</span><span class="o">()</span> <span class="o">{</span> |
| <span class="k">return</span> <span class="n">age</span><span class="o">;</span> |
| <span class="o">}</span> |
| |
| <span class="kd">public</span> <span class="kt">void</span> <span class="nf">setAge</span><span class="o">(</span><span class="kt">int</span> <span class="n">age</span><span class="o">)</span> <span class="o">{</span> |
| <span class="k">this</span><span class="o">.</span><span class="na">age</span> <span class="o">=</span> <span class="n">age</span><span class="o">;</span> |
| <span class="o">}</span> |
| <span class="o">}</span></code></pre></div> |
| |
| <p>A schema can be applied to an existing RDD by calling <code>applySchema</code> and providing the Class object |
| for the JavaBean.</p> |
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// sc is an existing JavaSparkContext.</span> |
| <span class="n">JavaSQLContext</span> <span class="n">sqlContext</span> <span class="o">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">sql</span><span class="o">.</span><span class="na">api</span><span class="o">.</span><span class="na">java</span><span class="o">.</span><span class="na">JavaSQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">);</span> |
| |
| <span class="c1">// Load a text file and convert each line to a JavaBean.</span> |
| <span class="n">JavaRDD</span><span class="o"><</span><span class="n">Person</span><span class="o">></span> <span class="n">people</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="na">textFile</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="o">).</span><span class="na">map</span><span class="o">(</span> |
| <span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">Person</span><span class="o">>()</span> <span class="o">{</span> |
| <span class="kd">public</span> <span class="n">Person</span> <span class="nf">call</span><span class="o">(</span><span class="n">String</span> <span class="n">line</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="o">{</span> |
| <span class="n">String</span><span class="o">[]</span> <span class="n">parts</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">","</span><span class="o">);</span> |
| |
| <span class="n">Person</span> <span class="n">person</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">Person</span><span class="o">();</span> |
| <span class="n">person</span><span class="o">.</span><span class="na">setName</span><span class="o">(</span><span class="n">parts</span><span class="o">[</span><span class="mi">0</span><span class="o">]);</span> |
| <span class="n">person</span><span class="o">.</span><span class="na">setAge</span><span class="o">(</span><span class="n">Integer</span><span class="o">.</span><span class="na">parseInt</span><span class="o">(</span><span class="n">parts</span><span class="o">[</span><span class="mi">1</span><span class="o">].</span><span class="na">trim</span><span class="o">()));</span> |
| |
| <span class="k">return</span> <span class="n">person</span><span class="o">;</span> |
| <span class="o">}</span> |
| <span class="o">});</span> |
| |
| <span class="c1">// Apply a schema to an RDD of JavaBeans and register it as a table.</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">schemaPeople</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">applySchema</span><span class="o">(</span><span class="n">people</span><span class="o">,</span> <span class="n">Person</span><span class="o">.</span><span class="na">class</span><span class="o">);</span> |
| <span class="n">schemaPeople</span><span class="o">.</span><span class="na">registerTempTable</span><span class="o">(</span><span class="s">"people"</span><span class="o">);</span> |
| |
| <span class="c1">// SQL can be run over RDDs that have been registered as tables.</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">teenagers</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">sql</span><span class="o">(</span><span class="s">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="o">)</span> |
| |
| <span class="c1">// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.</span> |
| <span class="c1">// The columns of a row in the result can be accessed by ordinal.</span> |
| <span class="n">List</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">teenagerNames</span> <span class="o">=</span> <span class="n">teenagers</span><span class="o">.</span><span class="na">map</span><span class="o">(</span><span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">Row</span><span class="o">,</span> <span class="n">String</span><span class="o">>()</span> <span class="o">{</span> |
| <span class="kd">public</span> <span class="n">String</span> <span class="nf">call</span><span class="o">(</span><span class="n">Row</span> <span class="n">row</span><span class="o">)</span> <span class="o">{</span> |
| <span class="k">return</span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">row</span><span class="o">.</span><span class="na">getString</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span> |
| <span class="o">}</span> |
| <span class="o">}).</span><span class="na">collect</span><span class="o">();</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="python"> |
| |
<p>Spark SQL can convert an RDD of Row objects to a SchemaRDD, inferring the datatypes. Rows are
constructed by passing key/value pairs as keyword arguments to the Row class. The keys define
the column names of the table, and the types are inferred by looking at the first row. Since only
the first row is examined, it is important that there is no missing data in the first row of the RDD.
In future versions we plan to infer the schema more completely by looking at more data, similar to
the inference that is performed on JSON files.</p>
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># sc is an existing SparkContext.</span> |
| <span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SQLContext</span><span class="p">,</span> <span class="n">Row</span> |
| <span class="n">sqlContext</span> <span class="o">=</span> <span class="n">SQLContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span> |
| |
| <span class="c"># Load a text file and convert each line to a Row.</span> |
| <span class="n">lines</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="p">)</span> |
| <span class="n">parts</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="n">l</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">","</span><span class="p">))</span> |
| <span class="n">people</span> <span class="o">=</span> <span class="n">parts</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">age</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">1</span><span class="p">])))</span> |
| |
| <span class="c"># Infer the schema, and register the SchemaRDD as a table.</span> |
| <span class="n">schemaPeople</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">inferSchema</span><span class="p">(</span><span class="n">people</span><span class="p">)</span> |
| <span class="n">schemaPeople</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span> |
| |
| <span class="c"># SQL can be run over SchemaRDDs that have been registered as a table.</span> |
| <span class="n">teenagers</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="p">)</span> |
| |
| <span class="c"># The results of SQL queries are RDDs and support all the normal RDD operations.</span> |
| <span class="n">teenNames</span> <span class="o">=</span> <span class="n">teenagers</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">p</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> |
| <span class="k">for</span> <span class="n">teenName</span> <span class="ow">in</span> <span class="n">teenNames</span><span class="o">.</span><span class="n">collect</span><span class="p">():</span> |
| <span class="k">print</span> <span class="n">teenName</span></code></pre></div> |
| |
| </div> |
| |
| </div> |
| |
| <h3 id="programmatically-specifying-the-schema">Programmatically Specifying the Schema</h3> |
| |
| <div class="codetabs"> |
| |
| <div data-lang="scala"> |
| |
| <p>When case classes cannot be defined ahead of time (for example, |
| the structure of records is encoded in a string, or a text dataset will be parsed |
| and fields will be projected differently for different users), |
| a <code>SchemaRDD</code> can be created programmatically with three steps.</p> |
| |
| <ol> |
| <li>Create an RDD of <code>Row</code>s from the original RDD;</li> |
| <li>Create the schema represented by a <code>StructType</code> matching the structure of |
| <code>Row</code>s in the RDD created in Step 1.</li> |
  <li>Apply the schema to the RDD of <code>Row</code>s via the <code>applySchema</code> method provided
| by <code>SQLContext</code>.</li> |
| </ol> |
| |
| <p>For example:</p> |
| |
| <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// sc is an existing SparkContext.</span> |
| <span class="k">val</span> <span class="n">sqlContext</span> <span class="k">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">.</span><span class="nc">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">)</span> |
| |
| <span class="c1">// Create an RDD</span> |
| <span class="k">val</span> <span class="n">people</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="o">)</span> |
| |
| <span class="c1">// The schema is encoded in a string</span> |
| <span class="k">val</span> <span class="n">schemaString</span> <span class="k">=</span> <span class="s">"name age"</span> |
| |
| <span class="c1">// Import Spark SQL data types and Row.</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.sql._</span> |
| |
| <span class="c1">// Generate the schema based on the string of schema</span> |
| <span class="k">val</span> <span class="n">schema</span> <span class="k">=</span> |
| <span class="nc">StructType</span><span class="o">(</span> |
| <span class="n">schemaString</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">fieldName</span> <span class="k">=></span> <span class="nc">StructField</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="nc">StringType</span><span class="o">,</span> <span class="kc">true</span><span class="o">)))</span> |
| |
| <span class="c1">// Convert records of the RDD (people) to Rows.</span> |
| <span class="k">val</span> <span class="n">rowRDD</span> <span class="k">=</span> <span class="n">people</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">","</span><span class="o">)).</span><span class="n">map</span><span class="o">(</span><span class="n">p</span> <span class="k">=></span> <span class="nc">Row</span><span class="o">(</span><span class="n">p</span><span class="o">(</span><span class="mi">0</span><span class="o">),</span> <span class="n">p</span><span class="o">(</span><span class="mi">1</span><span class="o">).</span><span class="n">trim</span><span class="o">))</span> |
| |
| <span class="c1">// Apply the schema to the RDD.</span> |
| <span class="k">val</span> <span class="n">peopleSchemaRDD</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">applySchema</span><span class="o">(</span><span class="n">rowRDD</span><span class="o">,</span> <span class="n">schema</span><span class="o">)</span> |
| |
| <span class="c1">// Register the SchemaRDD as a table.</span> |
| <span class="n">peopleSchemaRDD</span><span class="o">.</span><span class="n">registerTempTable</span><span class="o">(</span><span class="s">"people"</span><span class="o">)</span> |
| |
| <span class="c1">// SQL statements can be run by using the sql methods provided by sqlContext.</span> |
| <span class="k">val</span> <span class="n">results</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"SELECT name FROM people"</span><span class="o">)</span> |
| |
| <span class="c1">// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.</span> |
| <span class="c1">// The columns of a row in the result can be accessed by ordinal.</span> |
| <span class="n">results</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">t</span> <span class="k">=></span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">t</span><span class="o">(</span><span class="mi">0</span><span class="o">)).</span><span class="n">collect</span><span class="o">().</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">)</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="java"> |
| |
| <p>When JavaBean classes cannot be defined ahead of time (for example, |
| the structure of records is encoded in a string, or a text dataset will be parsed and |
| fields will be projected differently for different users), |
| a <code>SchemaRDD</code> can be created programmatically with three steps.</p> |
| |
| <ol> |
| <li>Create an RDD of <code>Row</code>s from the original RDD;</li> |
| <li>Create the schema represented by a <code>StructType</code> matching the structure of |
| <code>Row</code>s in the RDD created in Step 1.</li> |
  <li>Apply the schema to the RDD of <code>Row</code>s via the <code>applySchema</code> method provided
| by <code>JavaSQLContext</code>.</li> |
| </ol> |
| |
| <p>For example:</p> |
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// Import factory methods provided by DataType.</span> |
| <span class="kn">import</span> <span class="nn">org.apache.spark.sql.api.java.DataType</span> |
| <span class="c1">// Import StructType and StructField</span> |
| <span class="kn">import</span> <span class="nn">org.apache.spark.sql.api.java.StructType</span> |
| <span class="kn">import</span> <span class="nn">org.apache.spark.sql.api.java.StructField</span> |
| <span class="c1">// Import Row.</span> |
| <span class="kn">import</span> <span class="nn">org.apache.spark.sql.api.java.Row</span> |
| |
| <span class="c1">// sc is an existing JavaSparkContext.</span> |
| <span class="n">JavaSQLContext</span> <span class="n">sqlContext</span> <span class="o">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">sql</span><span class="o">.</span><span class="na">api</span><span class="o">.</span><span class="na">java</span><span class="o">.</span><span class="na">JavaSQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">);</span> |
| |
| <span class="c1">// Load a text file and convert each line to a JavaBean.</span> |
| <span class="n">JavaRDD</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">people</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="na">textFile</span><span class="o">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="o">);</span> |
| |
| <span class="c1">// The schema is encoded in a string</span> |
| <span class="n">String</span> <span class="n">schemaString</span> <span class="o">=</span> <span class="s">"name age"</span><span class="o">;</span> |
| |
| <span class="c1">// Generate the schema based on the string of schema</span> |
| <span class="n">List</span><span class="o"><</span><span class="n">StructField</span><span class="o">></span> <span class="n">fields</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">StructField</span><span class="o">>();</span> |
| <span class="k">for</span> <span class="o">(</span><span class="n">String</span> <span class="nl">fieldName:</span> <span class="n">schemaString</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">" "</span><span class="o">))</span> <span class="o">{</span> |
| <span class="n">fields</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">DataType</span><span class="o">.</span><span class="na">createStructField</span><span class="o">(</span><span class="n">fieldName</span><span class="o">,</span> <span class="n">DataType</span><span class="o">.</span><span class="na">StringType</span><span class="o">,</span> <span class="kc">true</span><span class="o">));</span> |
| <span class="o">}</span> |
| <span class="n">StructType</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">DataType</span><span class="o">.</span><span class="na">createStructType</span><span class="o">(</span><span class="n">fields</span><span class="o">);</span> |
| |
| <span class="c1">// Convert records of the RDD (people) to Rows.</span> |
| <span class="n">JavaRDD</span><span class="o"><</span><span class="n">Row</span><span class="o">></span> <span class="n">rowRDD</span> <span class="o">=</span> <span class="n">people</span><span class="o">.</span><span class="na">map</span><span class="o">(</span> |
| <span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">Row</span><span class="o">>()</span> <span class="o">{</span> |
| <span class="kd">public</span> <span class="n">Row</span> <span class="nf">call</span><span class="o">(</span><span class="n">String</span> <span class="n">record</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="o">{</span> |
| <span class="n">String</span><span class="o">[]</span> <span class="n">fields</span> <span class="o">=</span> <span class="n">record</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">","</span><span class="o">);</span> |
| <span class="k">return</span> <span class="n">Row</span><span class="o">.</span><span class="na">create</span><span class="o">(</span><span class="n">fields</span><span class="o">[</span><span class="mi">0</span><span class="o">],</span> <span class="n">fields</span><span class="o">[</span><span class="mi">1</span><span class="o">].</span><span class="na">trim</span><span class="o">());</span> |
| <span class="o">}</span> |
| <span class="o">});</span> |
| |
| <span class="c1">// Apply the schema to the RDD.</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">peopleSchemaRDD</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">applySchema</span><span class="o">(</span><span class="n">rowRDD</span><span class="o">,</span> <span class="n">schema</span><span class="o">);</span> |
| |
| <span class="c1">// Register the SchemaRDD as a table.</span> |
| <span class="n">peopleSchemaRDD</span><span class="o">.</span><span class="na">registerTempTable</span><span class="o">(</span><span class="s">"people"</span><span class="o">);</span> |
| |
| <span class="c1">// SQL can be run over RDDs that have been registered as tables.</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">results</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">sql</span><span class="o">(</span><span class="s">"SELECT name FROM people"</span><span class="o">);</span> |
| |
| <span class="c1">// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.</span> |
| <span class="c1">// The columns of a row in the result can be accessed by ordinal.</span> |
| <span class="n">List</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">names</span> <span class="o">=</span> <span class="n">results</span><span class="o">.</span><span class="na">map</span><span class="o">(</span><span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">Row</span><span class="o">,</span> <span class="n">String</span><span class="o">>()</span> <span class="o">{</span> |
| <span class="kd">public</span> <span class="n">String</span> <span class="nf">call</span><span class="o">(</span><span class="n">Row</span> <span class="n">row</span><span class="o">)</span> <span class="o">{</span> |
| <span class="k">return</span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">row</span><span class="o">.</span><span class="na">getString</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span> |
| <span class="o">}</span> |
| <span class="o">}).</span><span class="na">collect</span><span class="o">();</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="python"> |
| |
<p>When the set of keyword arguments cannot be defined ahead of time (for example,
| the structure of records is encoded in a string, or a text dataset will be parsed and |
| fields will be projected differently for different users), |
| a <code>SchemaRDD</code> can be created programmatically with three steps.</p> |
| |
| <ol> |
| <li>Create an RDD of tuples or lists from the original RDD;</li> |
| <li>Create the schema represented by a <code>StructType</code> matching the structure of |
tuples or lists in the RDD created in Step 1.</li>
  <li>Apply the schema to the RDD via the <code>applySchema</code> method provided by <code>SQLContext</code>.</li>
| </ol> |
| |
| <p>For example:</p> |
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Import SQLContext and data types</span> |
| <span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="o">*</span> |
| |
| <span class="c"># sc is an existing SparkContext.</span> |
| <span class="n">sqlContext</span> <span class="o">=</span> <span class="n">SQLContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span> |
| |
| <span class="c"># Load a text file and convert each line to a tuple.</span> |
| <span class="n">lines</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s">"examples/src/main/resources/people.txt"</span><span class="p">)</span> |
| <span class="n">parts</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="n">l</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">","</span><span class="p">))</span> |
| <span class="n">people</span> <span class="o">=</span> <span class="n">parts</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">p</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()))</span> |
| |
| <span class="c"># The schema is encoded in a string.</span> |
| <span class="n">schemaString</span> <span class="o">=</span> <span class="s">"name age"</span> |
| |
| <span class="n">fields</span> <span class="o">=</span> <span class="p">[</span><span class="n">StructField</span><span class="p">(</span><span class="n">field_name</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">)</span> <span class="k">for</span> <span class="n">field_name</span> <span class="ow">in</span> <span class="n">schemaString</span><span class="o">.</span><span class="n">split</span><span class="p">()]</span> |
| <span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">(</span><span class="n">fields</span><span class="p">)</span> |
| |
| <span class="c"># Apply the schema to the RDD.</span> |
| <span class="n">schemaPeople</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">applySchema</span><span class="p">(</span><span class="n">people</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span> |
| |
| <span class="c"># Register the SchemaRDD as a table.</span> |
| <span class="n">schemaPeople</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span> |
| |
| <span class="c"># SQL can be run over SchemaRDDs that have been registered as a table.</span> |
| <span class="n">results</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT name FROM people"</span><span class="p">)</span> |
| |
| <span class="c"># The results of SQL queries are RDDs and support all the normal RDD operations.</span> |
| <span class="n">names</span> <span class="o">=</span> <span class="n">results</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">p</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> |
| <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">names</span><span class="o">.</span><span class="n">collect</span><span class="p">():</span> |
| <span class="k">print</span> <span class="n">name</span></code></pre></div> |
| |
| </div> |
| |
| </div> |
| |
| <h2 id="parquet-files">Parquet Files</h2> |
| |
| <p><a href="http://parquet.io">Parquet</a> is a columnar format that is supported by many other data processing systems. |
| Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema |
| of the original data.</p> |
| |
| <h3 id="loading-data-programmatically">Loading Data Programmatically</h3> |
| |
| <p>Using the data from the above example:</p> |
| |
| <div class="codetabs"> |
| |
| <div data-lang="scala"> |
| |
| <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// sqlContext from the previous example is used in this example.</span> |
| <span class="c1">// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.</span> |
| <span class="k">import</span> <span class="nn">sqlContext.createSchemaRDD</span> |
| |
| <span class="k">val</span> <span class="n">people</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Person</span><span class="o">]</span> <span class="k">=</span> <span class="o">...</span> <span class="c1">// An RDD of case class objects, from the previous example.</span> |
| |
| <span class="c1">// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.</span> |
| <span class="n">people</span><span class="o">.</span><span class="n">saveAsParquetFile</span><span class="o">(</span><span class="s">"people.parquet"</span><span class="o">)</span> |
| |
| <span class="c1">// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.</span> |
| <span class="c1">// The result of loading a Parquet file is also a SchemaRDD.</span> |
| <span class="k">val</span> <span class="n">parquetFile</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">parquetFile</span><span class="o">(</span><span class="s">"people.parquet"</span><span class="o">)</span> |
| |
| <span class="c1">//Parquet files can also be registered as tables and then used in SQL statements.</span> |
| <span class="n">parquetFile</span><span class="o">.</span><span class="n">registerTempTable</span><span class="o">(</span><span class="s">"parquetFile"</span><span class="o">)</span> |
| <span class="k">val</span> <span class="n">teenagers</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19"</span><span class="o">)</span> |
| <span class="n">teenagers</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">t</span> <span class="k">=></span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">t</span><span class="o">(</span><span class="mi">0</span><span class="o">)).</span><span class="n">collect</span><span class="o">().</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">)</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="java"> |
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// sqlContext from the previous example is used in this example.</span> |
| |
| <span class="n">JavaSchemaRDD</span> <span class="n">schemaPeople</span> <span class="o">=</span> <span class="o">...</span> <span class="c1">// The JavaSchemaRDD from the previous example.</span> |
| |
| <span class="c1">// JavaSchemaRDDs can be saved as Parquet files, maintaining the schema information.</span> |
| <span class="n">schemaPeople</span><span class="o">.</span><span class="na">saveAsParquetFile</span><span class="o">(</span><span class="s">"people.parquet"</span><span class="o">);</span> |
| |
| <span class="c1">// Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.</span> |
| <span class="c1">// The result of loading a parquet file is also a JavaSchemaRDD.</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">parquetFile</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">parquetFile</span><span class="o">(</span><span class="s">"people.parquet"</span><span class="o">);</span> |
| |
| <span class="c1">//Parquet files can also be registered as tables and then used in SQL statements.</span> |
| <span class="n">parquetFile</span><span class="o">.</span><span class="na">registerTempTable</span><span class="o">(</span><span class="s">"parquetFile"</span><span class="o">);</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">teenagers</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">sql</span><span class="o">(</span><span class="s">"SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19"</span><span class="o">);</span> |
| <span class="n">List</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">teenagerNames</span> <span class="o">=</span> <span class="n">teenagers</span><span class="o">.</span><span class="na">map</span><span class="o">(</span><span class="k">new</span> <span class="n">Function</span><span class="o"><</span><span class="n">Row</span><span class="o">,</span> <span class="n">String</span><span class="o">>()</span> <span class="o">{</span> |
| <span class="kd">public</span> <span class="n">String</span> <span class="nf">call</span><span class="o">(</span><span class="n">Row</span> <span class="n">row</span><span class="o">)</span> <span class="o">{</span> |
| <span class="k">return</span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">row</span><span class="o">.</span><span class="na">getString</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span> |
| <span class="o">}</span> |
| <span class="o">}).</span><span class="na">collect</span><span class="o">();</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="python"> |
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># sqlContext from the previous example is used in this example.</span> |
| |
| <span class="n">schemaPeople</span> <span class="c"># The SchemaRDD from the previous example.</span> |
| |
| <span class="c"># SchemaRDDs can be saved as Parquet files, maintaining the schema information.</span> |
| <span class="n">schemaPeople</span><span class="o">.</span><span class="n">saveAsParquetFile</span><span class="p">(</span><span class="s">"people.parquet"</span><span class="p">)</span> |
| |
| <span class="c"># Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.</span> |
| <span class="c"># The result of loading a parquet file is also a SchemaRDD.</span> |
| <span class="n">parquetFile</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">parquetFile</span><span class="p">(</span><span class="s">"people.parquet"</span><span class="p">)</span> |
| |
| <span class="c"># Parquet files can also be registered as tables and then used in SQL statements.</span> |
| <span class="n">parquetFile</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s">"parquetFile"</span><span class="p">);</span> |
| <span class="n">teenagers</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19"</span><span class="p">)</span> |
| <span class="n">teenNames</span> <span class="o">=</span> <span class="n">teenagers</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">p</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> |
| <span class="k">for</span> <span class="n">teenName</span> <span class="ow">in</span> <span class="n">teenNames</span><span class="o">.</span><span class="n">collect</span><span class="p">():</span> |
| <span class="k">print</span> <span class="n">teenName</span></code></pre></div> |
| |
| </div> |
| |
| </div> |
| |
| <h3 id="configuration">Configuration</h3> |
| |
| <p>Configuration of Parquet can be done using the <code>setConf</code> method on SQLContext or by running |
| <code>SET key=value</code> commands using SQL.</p> |
| |
| <table class="table"> |
| <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> |
| <tr> |
| <td><code>spark.sql.parquet.binaryAsString</code></td> |
| <td>false</td> |
| <td> |
| Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do |
| not differentiate between binary data and strings when writing out the Parquet schema. This |
| flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems. |
| </td> |
| </tr> |
| <tr> |
| <td><code>spark.sql.parquet.cacheMetadata</code></td> |
| <td>true</td> |
| <td> |
| Turns on caching of Parquet schema metadata. Can speed up querying of static data. |
| </td> |
| </tr> |
| <tr> |
| <td><code>spark.sql.parquet.compression.codec</code></td> |
| <td>gzip</td> |
| <td> |
    Sets the compression codec to use when writing Parquet files. Acceptable values include:
| uncompressed, snappy, gzip, lzo. |
| </td> |
| </tr> |
| <tr> |
| <td><code>spark.sql.parquet.filterPushdown</code></td> |
| <td>false</td> |
| <td> |
    Turns on Parquet filter pushdown optimization. This feature is turned off by default because of a known
    bug in Parquet 1.6.0rc3 (<a href="https://issues.apache.org/jira/browse/PARQUET-136">PARQUET-136</a>).
| However, if your table doesn't contain any nullable string or binary columns, it's still safe to turn |
| this feature on. |
| </td> |
| </tr> |
| <tr> |
| <td><code>spark.sql.hive.convertMetastoreParquet</code></td> |
| <td>true</td> |
| <td> |
    When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in
    support.
| </td> |
| </tr> |
| </table> |
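
<p>For example, using the <code>sqlContext</code> from the earlier examples (a sketch; pick values appropriate for your data):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// Interpret binary columns written by Impala or older versions of Spark SQL as strings.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// The same options can also be set through SQL.
sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")</code></pre></div>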
| |
| <h2 id="json-datasets">JSON Datasets</h2> |
| <div class="codetabs"> |
| |
| <div data-lang="scala"> |
| <p>Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. |
| This conversion can be done using one of two methods in a SQLContext:</p> |
| |
| <ul> |
| <li><code>jsonFile</code> - loads data from a directory of JSON files where each line of the files is a JSON object.</li> |
| <li><code>jsonRDD</code> - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.</li> |
| </ul> |
| |
| <p>Note that the file that is offered as <em>jsonFile</em> is not a typical JSON file. Each |
| line must contain a separate, self-contained valid JSON object. As a consequence, |
| a regular multi-line JSON file will most often fail.</p> |
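
<p>For example, the <code>people.json</code> file used below stores one complete JSON object per line:</p>

<pre><code>{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
</code></pre>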
| |
| <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// sc is an existing SparkContext.</span> |
| <span class="k">val</span> <span class="n">sqlContext</span> <span class="k">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">.</span><span class="nc">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">)</span> |
| |
| <span class="c1">// A JSON dataset is pointed to by path.</span> |
| <span class="c1">// The path can be either a single text file or a directory storing text files.</span> |
| <span class="k">val</span> <span class="n">path</span> <span class="k">=</span> <span class="s">"examples/src/main/resources/people.json"</span> |
| <span class="c1">// Create a SchemaRDD from the file(s) pointed to by path</span> |
| <span class="k">val</span> <span class="n">people</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">jsonFile</span><span class="o">(</span><span class="n">path</span><span class="o">)</span> |
| |
| <span class="c1">// The inferred schema can be visualized using the printSchema() method.</span> |
| <span class="n">people</span><span class="o">.</span><span class="n">printSchema</span><span class="o">()</span> |
| <span class="c1">// root</span> |
| <span class="c1">// |-- age: integer (nullable = true)</span> |
| <span class="c1">// |-- name: string (nullable = true)</span> |
| |
| <span class="c1">// Register this SchemaRDD as a table.</span> |
| <span class="n">people</span><span class="o">.</span><span class="n">registerTempTable</span><span class="o">(</span><span class="s">"people"</span><span class="o">)</span> |
| |
| <span class="c1">// SQL statements can be run by using the sql methods provided by sqlContext.</span> |
| <span class="k">val</span> <span class="n">teenagers</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="o">)</span> |
| |
| <span class="c1">// Alternatively, a SchemaRDD can be created for a JSON dataset represented by</span> |
| <span class="c1">// an RDD[String] storing one JSON object per string.</span> |
| <span class="k">val</span> <span class="n">anotherPeopleRDD</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="o">(</span> |
| <span class="s">"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}"""</span> <span class="o">::</span> <span class="nc">Nil</span><span class="o">)</span> |
| <span class="k">val</span> <span class="n">anotherPeople</span> <span class="k">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">jsonRDD</span><span class="o">(</span><span class="n">anotherPeopleRDD</span><span class="o">)</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="java"> |
| <p>Spark SQL can automatically infer the schema of a JSON dataset and load it as a JavaSchemaRDD. |
This conversion can be done using one of two methods in a JavaSQLContext:</p>
| |
| <ul> |
| <li><code>jsonFile</code> - loads data from a directory of JSON files where each line of the files is a JSON object.</li> |
| <li><code>jsonRDD</code> - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.</li> |
| </ul> |
| |
| <p>Note that the file that is offered as <em>jsonFile</em> is not a typical JSON file. Each |
| line must contain a separate, self-contained valid JSON object. As a consequence, |
| a regular multi-line JSON file will most often fail.</p> |
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// sc is an existing JavaSparkContext.</span> |
| <span class="n">JavaSQLContext</span> <span class="n">sqlContext</span> <span class="o">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">sql</span><span class="o">.</span><span class="na">api</span><span class="o">.</span><span class="na">java</span><span class="o">.</span><span class="na">JavaSQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">);</span> |
| |
| <span class="c1">// A JSON dataset is pointed to by path.</span> |
| <span class="c1">// The path can be either a single text file or a directory storing text files.</span> |
| <span class="n">String</span> <span class="n">path</span> <span class="o">=</span> <span class="s">"examples/src/main/resources/people.json"</span><span class="o">;</span> |
| <span class="c1">// Create a JavaSchemaRDD from the file(s) pointed to by path</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">people</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">jsonFile</span><span class="o">(</span><span class="n">path</span><span class="o">);</span> |
| |
| <span class="c1">// The inferred schema can be visualized using the printSchema() method.</span> |
| <span class="n">people</span><span class="o">.</span><span class="na">printSchema</span><span class="o">();</span> |
| <span class="c1">// root</span> |
| <span class="c1">// |-- age: integer (nullable = true)</span> |
| <span class="c1">// |-- name: string (nullable = true)</span> |
| |
| <span class="c1">// Register this JavaSchemaRDD as a table.</span> |
| <span class="n">people</span><span class="o">.</span><span class="na">registerTempTable</span><span class="o">(</span><span class="s">"people"</span><span class="o">);</span> |
| |
| <span class="c1">// SQL statements can be run by using the sql methods provided by sqlContext.</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">teenagers</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">sql</span><span class="o">(</span><span class="s">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="o">);</span> |
| |
| <span class="c1">// Alternatively, a JavaSchemaRDD can be created for a JSON dataset represented by</span> |
| <span class="c1">// an RDD[String] storing one JSON object per string.</span> |
| <span class="n">List</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">jsonData</span> <span class="o">=</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span> |
| <span class="s">"{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}"</span><span class="o">);</span> |
| <span class="n">JavaRDD</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">anotherPeopleRDD</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="na">parallelize</span><span class="o">(</span><span class="n">jsonData</span><span class="o">);</span> |
| <span class="n">JavaSchemaRDD</span> <span class="n">anotherPeople</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">jsonRDD</span><span class="o">(</span><span class="n">anotherPeopleRDD</span><span class="o">);</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="python"> |
| <p>Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. |
| This conversion can be done using one of two methods in a SQLContext:</p> |
| |
| <ul> |
| <li><code>jsonFile</code> - loads data from a directory of JSON files where each line of the files is a JSON object.</li> |
| <li><code>jsonRDD</code> - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.</li> |
| </ul> |
| |
| <p>Note that the file that is offered as <em>jsonFile</em> is not a typical JSON file. Each |
| line must contain a separate, self-contained valid JSON object. As a consequence, |
| a regular multi-line JSON file will most often fail.</p> |
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># sc is an existing SparkContext.</span> |
| <span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SQLContext</span> |
| <span class="n">sqlContext</span> <span class="o">=</span> <span class="n">SQLContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span> |
| |
| <span class="c"># A JSON dataset is pointed to by path.</span> |
| <span class="c"># The path can be either a single text file or a directory storing text files.</span> |
| <span class="n">path</span> <span class="o">=</span> <span class="s">"examples/src/main/resources/people.json"</span> |
| <span class="c"># Create a SchemaRDD from the file(s) pointed to by path</span> |
| <span class="n">people</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">jsonFile</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> |
| |
| <span class="c"># The inferred schema can be visualized using the printSchema() method.</span> |
| <span class="n">people</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span> |
| <span class="c"># root</span> |
| <span class="c"># |-- age: integer (nullable = true)</span> |
| <span class="c"># |-- name: string (nullable = true)</span> |
| |
| <span class="c"># Register this SchemaRDD as a table.</span> |
| <span class="n">people</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span> |
| |
| <span class="c"># SQL statements can be run by using the sql methods provided by sqlContext.</span> |
| <span class="n">teenagers</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT name FROM people WHERE age >= 13 AND age <= 19"</span><span class="p">)</span> |
| |
| <span class="c"># Alternatively, a SchemaRDD can be created for a JSON dataset represented by</span> |
| <span class="c"># an RDD[String] storing one JSON object per string.</span> |
| <span class="n">anotherPeopleRDD</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span> |
| <span class="s">'{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}'</span><span class="p">])</span> |
| <span class="n">anotherPeople</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">jsonRDD</span><span class="p">(</span><span class="n">anotherPeopleRDD</span><span class="p">)</span></code></pre></div> |
| |
| </div> |
| |
| </div> |
| |
| <h2 id="hive-tables">Hive Tables</h2> |
| |
| <p>Spark SQL also supports reading and writing data stored in <a href="http://hive.apache.org/">Apache Hive</a>. |
| However, since Hive has a large number of dependencies, it is not included in the default Spark assembly. |
Hive support is enabled by adding the <code>-Phive</code> and <code>-Phive-thriftserver</code> flags to Spark’s build.
Building with these flags produces a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present
| on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries |
| (SerDes) in order to access data stored in Hive.</p> |
| |
| <p>Configuration of Hive is done by placing your <code>hive-site.xml</code> file in <code>conf/</code>.</p> |
| |
| <div class="codetabs"> |
| |
| <div data-lang="scala"> |
| |
<p>When working with Hive one must construct a <code>HiveContext</code>, which inherits from <code>SQLContext</code> and
adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do
not have an existing Hive deployment can still create a HiveContext. When not configured by
hive-site.xml, the context automatically creates <code>metastore_db</code> and <code>warehouse</code> in the current
directory.</p>
| |
| <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// sc is an existing SparkContext.</span> |
| <span class="k">val</span> <span class="n">sqlContext</span> <span class="k">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">.</span><span class="n">hive</span><span class="o">.</span><span class="nc">HiveContext</span><span class="o">(</span><span class="n">sc</span><span class="o">)</span> |
| |
| <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"CREATE TABLE IF NOT EXISTS src (key INT, value STRING)"</span><span class="o">)</span> |
| <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src"</span><span class="o">)</span> |
| |
| <span class="c1">// Queries are expressed in HiveQL</span> |
| <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="o">(</span><span class="s">"FROM src SELECT key, value"</span><span class="o">).</span><span class="n">collect</span><span class="o">().</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">)</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="java"> |
| |
<p>When working with Hive one must construct a <code>JavaHiveContext</code>, which inherits from <code>JavaSQLContext</code> and
adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
the <code>sql</code> method a <code>JavaHiveContext</code> also provides an <code>hql</code> method, which allows queries to be
expressed in HiveQL.</p>
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="c1">// sc is an existing JavaSparkContext.</span> |
| <span class="n">JavaHiveContext</span> <span class="n">sqlContext</span> <span class="o">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">spark</span><span class="o">.</span><span class="na">sql</span><span class="o">.</span><span class="na">hive</span><span class="o">.</span><span class="na">api</span><span class="o">.</span><span class="na">java</span><span class="o">.</span><span class="na">HiveContext</span><span class="o">(</span><span class="n">sc</span><span class="o">);</span> |
| |
| <span class="n">sqlContext</span><span class="o">.</span><span class="na">sql</span><span class="o">(</span><span class="s">"CREATE TABLE IF NOT EXISTS src (key INT, value STRING)"</span><span class="o">);</span> |
| <span class="n">sqlContext</span><span class="o">.</span><span class="na">sql</span><span class="o">(</span><span class="s">"LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src"</span><span class="o">);</span> |
| |
| <span class="c1">// Queries are expressed in HiveQL.</span> |
| <span class="n">Row</span><span class="o">[]</span> <span class="n">results</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="na">sql</span><span class="o">(</span><span class="s">"FROM src SELECT key, value"</span><span class="o">).</span><span class="na">collect</span><span class="o">();</span></code></pre></div> |
| |
| </div> |
| |
| <div data-lang="python"> |
| |
<p>When working with Hive one must construct a <code>HiveContext</code>, which inherits from <code>SQLContext</code> and
adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
the <code>sql</code> method a <code>HiveContext</code> also provides an <code>hql</code> method, which allows queries to be
expressed in HiveQL.</p>
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># sc is an existing SparkContext.</span> |
| <span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">HiveContext</span> |
| <span class="n">sqlContext</span> <span class="o">=</span> <span class="n">HiveContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span> |
| |
| <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"CREATE TABLE IF NOT EXISTS src (key INT, value STRING)"</span><span class="p">)</span> |
| <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src"</span><span class="p">)</span> |
| |
| <span class="c"># Queries can be expressed in HiveQL.</span> |
| <span class="n">results</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"FROM src SELECT key, value"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span></code></pre></div> |
| |
| </div> |
| </div> |
| |
| <h1 id="performance-tuning">Performance Tuning</h1> |
| |
<p>For some workloads it is possible to improve performance by either caching data in memory or by
turning on some experimental options.</p>
| |
| <h2 id="caching-data-in-memory">Caching Data In Memory</h2> |
| |
| <p>Spark SQL can cache tables using an in-memory columnar format by calling <code>sqlContext.cacheTable("tableName")</code>. |
| Then Spark SQL will scan only required columns and will automatically tune compression to minimize |
| memory usage and GC pressure. You can call <code>sqlContext.uncacheTable("tableName")</code> to remove the table from memory.</p> |
| |
| <p>Note that if you call <code>schemaRDD.cache()</code> rather than <code>sqlContext.cacheTable(...)</code>, tables will <em>not</em> be cached using |
| the in-memory columnar format, and therefore <code>sqlContext.cacheTable(...)</code> is strongly recommended for this use case.</p> |
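
<p>For example, with the <code>people</code> table registered in the earlier examples (a sketch; caching is lazy, so the first query that scans the table materializes the cache):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// Cache the table in the in-memory columnar format.
sqlContext.cacheTable("people")

// The first query that scans the table populates the cache.
sqlContext.sql("SELECT COUNT(*) FROM people").collect()

// Remove the table from memory when it is no longer needed.
sqlContext.uncacheTable("people")</code></pre></div>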
| |
| <p>Configuration of in-memory caching can be done using the <code>setConf</code> method on SQLContext or by running |
| <code>SET key=value</code> commands using SQL.</p> |
| |
| <table class="table"> |
| <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> |
| <tr> |
| <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td> |
| <td>true</td> |
| <td> |
    When set to true, Spark SQL will automatically select a compression codec for each column based
| on statistics of the data. |
| </td> |
| </tr> |
| <tr> |
| <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td> |
| <td>10000</td> |
| <td> |
| Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization |
| and compression, but risk OOMs when caching data. |
| </td> |
| </tr> |
| |
| </table> |
| |
| <h2 id="other-configuration-options">Other Configuration Options</h2> |
| |
| <p>The following options can also be used to tune the performance of query execution. It is possible |
that these options will be deprecated in a future release as more optimizations are performed automatically.</p>
| |
| <table class="table"> |
| <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> |
| <tr> |
| <td><code>spark.sql.autoBroadcastJoinThreshold</code></td> |
| <td>10485760 (10 MB)</td> |
| <td> |
| Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when |
| performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently |
    statistics are only supported for Hive Metastore tables where the command
    <code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been run.
| </td> |
| </tr> |
| <tr> |
| <td><code>spark.sql.codegen</code></td> |
| <td>false</td> |
| <td> |
| When true, code will be dynamically generated at runtime for expression evaluation in a specific |
    query. For some queries with complicated expressions this option can lead to significant speed-ups.
| However, for simple queries this can actually slow down query execution. |
| </td> |
| </tr> |
| <tr> |
| <td><code>spark.sql.shuffle.partitions</code></td> |
| <td>200</td> |
| <td> |
| Configures the number of partitions to use when shuffling data for joins or aggregations. |
| </td> |
| </tr> |
| </table> |
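
<p>For example, to disable automatic broadcast joins or to raise post-shuffle parallelism for a session (a sketch; the values shown are illustrative):</p>

<div class="highlight"><pre><code class="language-scala" data-lang="scala">// A threshold of -1 disables broadcast joins entirely.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

// Use more partitions when shuffling data for joins or aggregations.
sqlContext.sql("SET spark.sql.shuffle.partitions=400")</code></pre></div>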
| |
| <h1 id="other-sql-interfaces">Other SQL Interfaces</h1> |
| |
| <p>Spark SQL also supports interfaces for running SQL queries directly without the need to write any |
| code.</p> |
| |
| <h2 id="running-the-thrift-jdbcodbc-server">Running the Thrift JDBC/ODBC server</h2> |
| |
| <p>The Thrift JDBC/ODBC server implemented here corresponds to the <a href="https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2"><code>HiveServer2</code></a> |
| in Hive 0.12. You can test the JDBC server with the beeline script that comes with either Spark or Hive 0.12.</p> |
| |
| <p>To start the JDBC/ODBC server, run the following in the Spark directory:</p> |
| |
| <pre><code>./sbin/start-thriftserver.sh |
| </code></pre> |
| |
| <p>This script accepts all <code>bin/spark-submit</code> command line options, plus a <code>--hiveconf</code> option to |
| specify Hive properties. You may run <code>./sbin/start-thriftserver.sh --help</code> for a complete list of |
| all available options. By default, the server listens on localhost:10000. You may override this |
behaviour via either environment variables, i.e.:</p>
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">export </span><span class="nv">HIVE_SERVER2_THRIFT_PORT</span><span class="o">=</span><listening-port> |
| <span class="nb">export </span><span class="nv">HIVE_SERVER2_THRIFT_BIND_HOST</span><span class="o">=</span><listening-host> |
| ./sbin/start-thriftserver.sh <span class="se">\</span> |
| --master <master-uri> <span class="se">\</span> |
| ...</code></pre></div> |
| |
| <p>or system properties:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">./sbin/start-thriftserver.sh <span class="se">\</span> |
| --hiveconf hive.server2.thrift.port<span class="o">=</span><listening-port> <span class="se">\</span> |
| --hiveconf hive.server2.thrift.bind.host<span class="o">=</span><listening-host> <span class="se">\</span> |
| --master <master-uri> |
| ...</code></pre></div> |
| |
| <p>Now you can use beeline to test the Thrift JDBC/ODBC server:</p> |
| |
| <pre><code>./bin/beeline |
| </code></pre> |
| |
| <p>Connect to the JDBC/ODBC server in beeline with:</p> |
| |
| <pre><code>beeline> !connect jdbc:hive2://localhost:10000 |
| </code></pre> |
| |
| <p>Beeline will ask you for a username and password. In non-secure mode, simply enter the username on |
| your machine and a blank password. For secure mode, please follow the instructions given in the |
| <a href="https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients">beeline documentation</a>.</p> |
| |
| <p>Configuration of Hive is done by placing your <code>hive-site.xml</code> file in <code>conf/</code>.</p> |
| |
| <p>You may also use the beeline script that comes with Hive.</p> |
| |
| <h2 id="running-the-spark-sql-cli">Running the Spark SQL CLI</h2> |
| |
| <p>The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute |
| queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.</p> |
| |
| <p>To start the Spark SQL CLI, run the following in the Spark directory:</p> |
| |
| <pre><code>./bin/spark-sql |
| </code></pre> |
| |
| <p>Configuration of Hive is done by placing your <code>hive-site.xml</code> file in <code>conf/</code>. |
| You may run <code>./bin/spark-sql --help</code> for a complete list of all available |
| options.</p> |
| |
| <h1 id="compatibility-with-other-systems">Compatibility with Other Systems</h1> |
| |
| <h2 id="migration-guide-for-shark-user">Migration Guide for Shark User</h2> |
| |
| <h3 id="scheduling">Scheduling</h3> |
| <p>To set a <a href="job-scheduling.html#fair-scheduler-pools">Fair Scheduler</a> pool for a JDBC client session, |
| users can set the <code>spark.sql.thriftserver.scheduler.pool</code> variable:</p> |
| |
| <pre><code>SET spark.sql.thriftserver.scheduler.pool=accounting; |
| </code></pre> |
| |
| <h3 id="reducer-number">Reducer number</h3> |
| |
<p>In Shark, the default reducer number is 1 and is controlled by the property <code>mapred.reduce.tasks</code>. Spark
| SQL deprecates this property in favor of <code>spark.sql.shuffle.partitions</code>, whose default value |
| is 200. Users may customize this property via <code>SET</code>:</p> |
| |
| <pre><code>SET spark.sql.shuffle.partitions=10; |
| SELECT page, count(*) c |
| FROM logs_last_month_cached |
| GROUP BY page ORDER BY c DESC LIMIT 10; |
| </code></pre> |
| |
| <p>You may also put this property in <code>hive-site.xml</code> to override the default value.</p> |
| |
| <p>For now, the <code>mapred.reduce.tasks</code> property is still recognized, and is converted to |
| <code>spark.sql.shuffle.partitions</code> automatically.</p> |
| |
| <h3 id="caching">Caching</h3> |
| |
<p>The <code>shark.cache</code> table property no longer exists, and tables whose names end with <code>_cached</code> are no
longer automatically cached. Instead, we provide <code>CACHE TABLE</code> and <code>UNCACHE TABLE</code> statements to
let users control table caching explicitly:</p>
| |
| <pre><code>CACHE TABLE logs_last_month; |
| UNCACHE TABLE logs_last_month; |
| </code></pre> |
| |
<p><strong>NOTE:</strong> <code>CACHE TABLE tbl</code> is lazy, similar to <code>.cache</code> on an RDD. This command only marks <code>tbl</code> so that
its partitions are cached when computed, but doesn’t actually cache it until a query that touches <code>tbl</code> is executed.
| To force the table to be cached, you may simply count the table immediately after executing <code>CACHE TABLE</code>:</p> |
| |
| <pre><code>CACHE TABLE logs_last_month; |
| SELECT COUNT(1) FROM logs_last_month; |
| </code></pre> |
| |
| <p>Several caching related features are not supported yet:</p> |
| |
| <ul> |
| <li>User defined partition level cache eviction policy</li> |
| <li>RDD reloading</li> |
| <li>In-memory cache write through policy</li> |
| </ul> |
| |
| <h2 id="compatibility-with-apache-hive">Compatibility with Apache Hive</h2> |
| |
| <p>Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently Spark |
| SQL is based on Hive 0.12.0 and 0.13.1.</p> |
| |
| <h4 id="deploying-in-existing-hive-warehouses">Deploying in Existing Hive Warehouses</h4> |
| |
| <p>The Spark SQL Thrift JDBC server is designed to be “out of the box” compatible with existing Hive |
| installations. You do not need to modify your existing Hive Metastore or change the data placement |
| or partitioning of your tables.</p> |
| |
| <h3 id="supported-hive-features">Supported Hive Features</h3> |
| |
| <p>Spark SQL supports the vast majority of Hive features, such as:</p> |
| |
| <ul> |
| <li>Hive query statements, including: |
| <ul> |
| <li><code>SELECT</code></li> |
| <li><code>GROUP BY</code></li> |
| <li><code>ORDER BY</code></li> |
| <li><code>CLUSTER BY</code></li> |
| <li><code>SORT BY</code></li> |
| </ul> |
| </li> |
| <li>All Hive operators, including: |
| <ul> |
<li>Relational operators (<code>=</code>, <code><=></code>, <code>==</code>, <code><></code>, <code><</code>, <code>></code>, <code>>=</code>, <code><=</code>, etc)</li>
| <li>Arithmetic operators (<code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, <code>%</code>, etc)</li> |
| <li>Logical operators (<code>AND</code>, <code>&&</code>, <code>OR</code>, <code>||</code>, etc)</li> |
| <li>Complex type constructors</li> |
| <li>Mathematical functions (<code>sign</code>, <code>ln</code>, <code>cos</code>, etc)</li> |
| <li>String functions (<code>instr</code>, <code>length</code>, <code>printf</code>, etc)</li> |
| </ul> |
| </li> |
| <li>User defined functions (UDF)</li> |
| <li>User defined aggregation functions (UDAF)</li> |
| <li>User defined serialization formats (SerDes)</li> |
| <li>Joins |
| <ul> |
| <li><code>JOIN</code></li> |
| <li><code>{LEFT|RIGHT|FULL} OUTER JOIN</code></li> |
| <li><code>LEFT SEMI JOIN</code></li> |
| <li><code>CROSS JOIN</code></li> |
| </ul> |
| </li> |
| <li>Unions</li> |
| <li>Sub-queries |
| <ul> |
| <li><code>SELECT col FROM ( SELECT a + b AS col from t1) t2</code></li> |
| </ul> |
| </li> |
| <li>Sampling</li> |
| <li>Explain</li> |
| <li>Partitioned tables</li> |
| <li>View</li> |
| <li>All Hive DDL Functions, including: |
| <ul> |
| <li><code>CREATE TABLE</code></li> |
| <li><code>CREATE TABLE AS SELECT</code></li> |
| <li><code>ALTER TABLE</code></li> |
| </ul> |
| </li> |
| <li>Most Hive Data types, including: |
| <ul> |
| <li><code>TINYINT</code></li> |
| <li><code>SMALLINT</code></li> |
| <li><code>INT</code></li> |
| <li><code>BIGINT</code></li> |
| <li><code>BOOLEAN</code></li> |
| <li><code>FLOAT</code></li> |
| <li><code>DOUBLE</code></li> |
| <li><code>STRING</code></li> |
| <li><code>BINARY</code></li> |
| <li><code>TIMESTAMP</code></li> |
| <li><code>DATE</code></li> |
| <li><code>ARRAY<></code></li> |
| <li><code>MAP<></code></li> |
| <li><code>STRUCT<></code></li> |
| </ul> |
| </li> |
| </ul> |
| |
| <h3 id="unsupported-hive-functionality">Unsupported Hive Functionality</h3> |
| |
| <p>Below is a list of Hive features that we don’t support yet. Most of these features are rarely used |
| in Hive deployments.</p> |
| |
| <p><strong>Major Hive Features</strong></p> |
| |
| <ul> |
<li>Tables with buckets: bucketing is hash partitioning within a Hive table partition. Spark SQL
doesn’t support buckets yet.</li>
| </ul> |
| |
| <p><strong>Esoteric Hive Features</strong></p> |
| |
| <ul> |
| <li>Tables with partitions using different input formats: In Spark SQL, all table partitions need to |
| have the same input format.</li> |
| <li>Non-equi outer join: For the uncommon use case of using outer joins with non-equi join conditions |
(e.g. the condition “<code>key < 10</code>”), Spark SQL will output wrong results for the <code>NULL</code> tuples.</li>
| <li><code>UNION</code> type and <code>DATE</code> type</li> |
| <li>Unique join</li> |
| <li>Single query multi insert</li> |
| <li>Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at |
the moment and only supports populating the sizeInBytes field of the Hive metastore.</li>
| </ul> |
| |
| <p><strong>Hive Input/Output Formats</strong></p> |
| |
| <ul> |
<li>File format for CLI: For results displayed back in the CLI, Spark SQL only supports TextOutputFormat.</li>
| <li>Hadoop archive</li> |
| </ul> |
| |
| <p><strong>Hive Optimizations</strong></p> |
| |
| <p>A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are |
| less important due to Spark SQL’s in-memory computational model. Others are slotted for future |
| releases of Spark SQL.</p> |
| |
| <ul> |
| <li>Block level bitmap indexes and virtual columns (used to build indexes)</li> |
| <li>Automatically convert a join to map join: For joining a large table with multiple small tables, |
| Hive automatically converts the join into a map join. We are adding this auto conversion in the |
| next release.</li> |
| <li>Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you |
| need to control the degree of parallelism post-shuffle using “<code>SET spark.sql.shuffle.partitions=[num_tasks];</code>”.</li> |
| <li>Meta-data only query: For queries that can be answered by using only meta data, Spark SQL still |
| launches tasks to compute the result.</li> |
| <li>Skew data flag: Spark SQL does not follow the skew data flags in Hive.</li> |
| <li><code>STREAMTABLE</code> hint in join: Spark SQL does not follow the <code>STREAMTABLE</code> hint.</li> |
<li>Merge multiple small files for query results: If the result output contains multiple small files,
| Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS |
| metadata. Spark SQL does not support that.</li> |
| </ul> |
| |
| <h1 id="writing-language-integrated-relational-queries">Writing Language-Integrated Relational Queries</h1> |
| |
| <p><strong>Language-Integrated queries are experimental and currently only supported in Scala.</strong></p> |
| |
| <p>Spark SQL also supports a domain specific language for writing queries. Once again, |
| using the data from the above examples:</p> |
| |
| <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// sc is an existing SparkContext.</span> |
| <span class="k">val</span> <span class="n">sqlContext</span> <span class="k">=</span> <span class="k">new</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="o">.</span><span class="nc">SQLContext</span><span class="o">(</span><span class="n">sc</span><span class="o">)</span> |
| <span class="c1">// Importing the SQL context gives access to all the public SQL functions and implicit conversions.</span> |
| <span class="k">import</span> <span class="nn">sqlContext._</span> |
| <span class="k">val</span> <span class="n">people</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Person</span><span class="o">]</span> <span class="k">=</span> <span class="o">...</span> <span class="c1">// An RDD of case class objects, from the first example.</span> |
| |
| <span class="c1">// The following is the same as 'SELECT name FROM people WHERE age >= 10 AND age <= 19'</span> |
| <span class="k">val</span> <span class="n">teenagers</span> <span class="k">=</span> <span class="n">people</span><span class="o">.</span><span class="n">where</span><span class="o">(</span><span class="-Symbol">'age</span> <span class="o">>=</span> <span class="mi">10</span><span class="o">).</span><span class="n">where</span><span class="o">(</span><span class="-Symbol">'age</span> <span class="o"><=</span> <span class="mi">19</span><span class="o">).</span><span class="n">select</span><span class="o">(</span><span class="-Symbol">'name</span><span class="o">)</span> |
| <span class="n">teenagers</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">t</span> <span class="k">=></span> <span class="s">"Name: "</span> <span class="o">+</span> <span class="n">t</span><span class="o">(</span><span class="mi">0</span><span class="o">)).</span><span class="n">collect</span><span class="o">().</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">)</span></code></pre></div> |
| |
| <p>The DSL uses Scala symbols to represent columns in the underlying table, which are identifiers |
| prefixed with a tick (<code>'</code>). Implicit conversions turn these symbols into expressions that are |
| evaluated by the SQL execution engine. A full list of the functions supported can be found in the |
| <a href="api/scala/index.html#org.apache.spark.sql.SchemaRDD">ScalaDoc</a>.</p> |
| |
| <!-- TODO: Include the table of operations here. --> |
| |
| <h1 id="spark-sql-datatype-reference">Spark SQL DataType Reference</h1> |
| |
| <ul> |
| <li>Numeric types |
| <ul> |
| <li><code>ByteType</code>: Represents 1-byte signed integer numbers. |
| The range of numbers is from <code>-128</code> to <code>127</code>.</li> |
| <li><code>ShortType</code>: Represents 2-byte signed integer numbers. |
| The range of numbers is from <code>-32768</code> to <code>32767</code>.</li> |
| <li><code>IntegerType</code>: Represents 4-byte signed integer numbers. |
| The range of numbers is from <code>-2147483648</code> to <code>2147483647</code>.</li> |
| <li><code>LongType</code>: Represents 8-byte signed integer numbers. |
| The range of numbers is from <code>-9223372036854775808</code> to <code>9223372036854775807</code>.</li> |
| <li><code>FloatType</code>: Represents 4-byte single-precision floating point numbers.</li> |
| <li><code>DoubleType</code>: Represents 8-byte double-precision floating point numbers.</li> |
| <li><code>DecimalType</code>: Represents arbitrary-precision signed decimal numbers. Backed internally by <code>java.math.BigDecimal</code>. A <code>BigDecimal</code> consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.</li> |
| </ul> |
| </li> |
| <li>String type |
| <ul> |
| <li><code>StringType</code>: Represents character string values.</li> |
| </ul> |
| </li> |
| <li>Binary type |
| <ul> |
| <li><code>BinaryType</code>: Represents byte sequence values.</li> |
| </ul> |
| </li> |
| <li>Boolean type |
| <ul> |
| <li><code>BooleanType</code>: Represents boolean values.</li> |
| </ul> |
| </li> |
| <li>Datetime type |
| <ul> |
| <li><code>TimestampType</code>: Represents values comprising values of fields year, month, day, |
| hour, minute, and second.</li> |
| <li><code>DateType</code>: Represents values comprising values of fields year, month, day.</li> |
| </ul> |
| </li> |
| <li>Complex types |
| <ul> |
<li><code>ArrayType(elementType, containsNull)</code>: Represents values comprising a sequence of
elements with the type of <code>elementType</code>. <code>containsNull</code> is used to indicate if
elements in an <code>ArrayType</code> value can have <code>null</code> values.</li>
<li><code>MapType(keyType, valueType, valueContainsNull)</code>:
Represents values comprising a set of key-value pairs. The data type of keys is
described by <code>keyType</code> and the data type of values is described by <code>valueType</code>.
For a <code>MapType</code> value, keys are not allowed to have <code>null</code> values. <code>valueContainsNull</code>
is used to indicate if values of a <code>MapType</code> value can have <code>null</code> values.</li>
| <li><code>StructType(fields)</code>: Represents values with the structure described by |
| a sequence of <code>StructField</code>s (<code>fields</code>). |
| <ul> |
<li><code>StructField(name, dataType, nullable)</code>: Represents a field in a <code>StructType</code>.
The name of a field is indicated by <code>name</code>. The data type of a field is indicated
by <code>dataType</code>. <code>nullable</code> is used to indicate if values of this field can have
<code>null</code> values.</li>
| </ul> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| <div class="codetabs"> |
| <div data-lang="scala"> |
| |
| <p>All data types of Spark SQL are located in the package <code>org.apache.spark.sql</code>. |
| You can access them by doing</p> |
| |
| <div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">org.apache.spark.sql._</span></code></pre></div> |
| |
| <table class="table"> |
| <tr> |
| <th style="width:20%">Data type</th> |
| <th style="width:40%">Value type in Scala</th> |
| <th>API to access or create a data type</th></tr> |
| <tr> |
| <td> <b>ByteType</b> </td> |
| <td> Byte </td> |
| <td> |
| ByteType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ShortType</b> </td> |
| <td> Short </td> |
| <td> |
| ShortType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>IntegerType</b> </td> |
| <td> Int </td> |
| <td> |
| IntegerType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>LongType</b> </td> |
| <td> Long </td> |
| <td> |
| LongType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>FloatType</b> </td> |
| <td> Float </td> |
| <td> |
| FloatType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DoubleType</b> </td> |
| <td> Double </td> |
| <td> |
| DoubleType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DecimalType</b> </td> |
| <td> scala.math.BigDecimal </td> |
| <td> |
| DecimalType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StringType</b> </td> |
| <td> String </td> |
| <td> |
| StringType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BinaryType</b> </td> |
| <td> Array[Byte] </td> |
| <td> |
| BinaryType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BooleanType</b> </td> |
| <td> Boolean </td> |
| <td> |
| BooleanType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>TimestampType</b> </td> |
| <td> java.sql.Timestamp </td> |
| <td> |
| TimestampType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DateType</b> </td> |
| <td> java.sql.Date </td> |
| <td> |
| DateType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ArrayType</b> </td> |
| <td> scala.collection.Seq </td> |
| <td> |
| ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>MapType</b> </td> |
| <td> scala.collection.Map </td> |
| <td> |
| MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructType</b> </td> |
| <td> org.apache.spark.sql.Row </td> |
| <td> |
| StructType(<i>fields</i>)<br /> |
| <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same |
| name are not allowed. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructField</b> </td> |
| <td> The value type in Scala of the data type of this field |
| (For example, Int for a StructField with the data type IntegerType) </td> |
| <td> |
| StructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>) |
| </td> |
| </tr> |
| </table> |
| |
| </div> |
| |
| <div data-lang="java"> |
| |
| <p>All data types of Spark SQL are located in the package of |
| <code>org.apache.spark.sql.api.java</code>. To access or create a data type, |
| please use factory methods provided in |
| <code>org.apache.spark.sql.api.java.DataType</code>.</p> |
| |
| <table class="table"> |
| <tr> |
| <th style="width:20%">Data type</th> |
| <th style="width:40%">Value type in Java</th> |
| <th>API to access or create a data type</th></tr> |
| <tr> |
| <td> <b>ByteType</b> </td> |
| <td> byte or Byte </td> |
| <td> |
| DataType.ByteType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ShortType</b> </td> |
| <td> short or Short </td> |
| <td> |
| DataType.ShortType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>IntegerType</b> </td> |
| <td> int or Integer </td> |
| <td> |
| DataType.IntegerType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>LongType</b> </td> |
| <td> long or Long </td> |
| <td> |
| DataType.LongType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>FloatType</b> </td> |
| <td> float or Float </td> |
| <td> |
| DataType.FloatType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DoubleType</b> </td> |
| <td> double or Double </td> |
| <td> |
| DataType.DoubleType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DecimalType</b> </td> |
| <td> java.math.BigDecimal </td> |
| <td> |
| DataType.DecimalType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StringType</b> </td> |
| <td> String </td> |
| <td> |
| DataType.StringType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BinaryType</b> </td> |
| <td> byte[] </td> |
| <td> |
| DataType.BinaryType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BooleanType</b> </td> |
| <td> boolean or Boolean </td> |
| <td> |
| DataType.BooleanType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>TimestampType</b> </td> |
| <td> java.sql.Timestamp </td> |
| <td> |
| DataType.TimestampType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DateType</b> </td> |
| <td> java.sql.Date </td> |
| <td> |
| DataType.DateType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ArrayType</b> </td> |
| <td> java.util.List </td> |
| <td> |
    DataType.createArrayType(<i>elementType</i>)<br />
    <b>Note:</b> The value of <i>containsNull</i> will be <i>true</i>.<br />
    DataType.createArrayType(<i>elementType</i>, <i>containsNull</i>)
| </td> |
| </tr> |
| <tr> |
| <td> <b>MapType</b> </td> |
| <td> java.util.Map </td> |
| <td> |
| DataType.createMapType(<i>keyType</i>, <i>valueType</i>)<br /> |
| <b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br /> |
    DataType.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructType</b> </td> |
| <td> org.apache.spark.sql.api.java.Row </td> |
| <td> |
| DataType.createStructType(<i>fields</i>)<br /> |
| <b>Note:</b> <i>fields</i> is a List or an array of StructFields. |
| Also, two fields with the same name are not allowed. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructField</b> </td> |
| <td> The value type in Java of the data type of this field |
| (For example, int for a StructField with the data type IntegerType) </td> |
| <td> |
| DataType.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>) |
| </td> |
| </tr> |
| </table> |
| |
| </div> |
| |
| <div data-lang="python"> |
| |
| <p>All data types of Spark SQL are located in the package of <code>pyspark.sql</code>. |
| You can access them by doing</p> |
| |
| <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="o">*</span></code></pre></div> |
| |
| <table class="table"> |
| <tr> |
| <th style="width:20%">Data type</th> |
| <th style="width:40%">Value type in Python</th> |
| <th>API to access or create a data type</th></tr> |
| <tr> |
| <td> <b>ByteType</b> </td> |
| <td> |
| int or long <br /> |
| <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of -128 to 127. |
| </td> |
| <td> |
| ByteType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ShortType</b> </td> |
| <td> |
| int or long <br /> |
| <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of -32768 to 32767. |
| </td> |
| <td> |
| ShortType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>IntegerType</b> </td> |
| <td> int or long </td> |
| <td> |
| IntegerType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>LongType</b> </td> |
| <td> |
| long <br /> |
| <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of |
| -9223372036854775808 to 9223372036854775807. |
| Otherwise, please convert data to decimal.Decimal and use DecimalType. |
| </td> |
| <td> |
| LongType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>FloatType</b> </td> |
| <td> |
| float <br /> |
| <b>Note:</b> Numbers will be converted to 4-byte single-precision floating |
| point numbers at runtime. |
| </td> |
| <td> |
| FloatType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DoubleType</b> </td> |
| <td> float </td> |
| <td> |
| DoubleType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DecimalType</b> </td> |
| <td> decimal.Decimal </td> |
| <td> |
| DecimalType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StringType</b> </td> |
| <td> string </td> |
| <td> |
| StringType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BinaryType</b> </td> |
| <td> bytearray </td> |
| <td> |
| BinaryType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BooleanType</b> </td> |
| <td> bool </td> |
| <td> |
| BooleanType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>TimestampType</b> </td> |
| <td> datetime.datetime </td> |
| <td> |
| TimestampType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DateType</b> </td> |
| <td> datetime.date </td> |
| <td> |
| DateType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ArrayType</b> </td> |
| <td> list, tuple, or array </td> |
| <td> |
| ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>MapType</b> </td> |
| <td> dict </td> |
| <td> |
| MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructType</b> </td> |
| <td> list or tuple </td> |
| <td> |
| StructType(<i>fields</i>)<br /> |
    <b>Note:</b> <i>fields</i> is a list or tuple of StructFields. Also, two fields with the same
    name are not allowed.
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructField</b> </td> |
<td> The value type in Python of the data type of this field
(For example, int for a StructField with the data type IntegerType) </td>
| <td> |
| StructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>) |
| </td> |
| </tr> |
| </table> |
| |
| </div> |
| |
| </div> |
| |
| |
| |
| </div> <!-- /container --> |
| |
| <script src="js/vendor/jquery-1.8.0.min.js"></script> |
| <script src="js/vendor/bootstrap.min.js"></script> |
| <script src="js/main.js"></script> |
| |
| <!-- MathJax Section --> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| TeX: { equationNumbers: { autoNumber: "AMS" } } |
| }); |
| </script> |
| <script> |
| // Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS. |
| // We could use "//cdn.mathjax...", but that won't support "file://". |
| (function(d, script) { |
| script = d.createElement('script'); |
| script.type = 'text/javascript'; |
| script.async = true; |
| script.onload = function(){ |
| MathJax.Hub.Config({ |
| tex2jax: { |
| inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ], |
| displayMath: [ ["$$","$$"], ["\\[", "\\]"] ], |
| processEscapes: true, |
| skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] |
| } |
| }); |
| }; |
| script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') + |
| 'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; |
| d.getElementsByTagName('head')[0].appendChild(script); |
| }(document)); |
| </script> |
| </body> |
| </html> |