docs/sql-performance-tuning.md

layout: global
title: Performance Tuning
displayTitle: Performance Tuning

Table of contents {:toc}

For some workloads, performance can be improved by caching data in memory or by turning on some experimental options.

Caching Data In Memory

Spark SQL can cache tables using an in-memory columnar format by calling spark.catalog.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call spark.catalog.uncacheTable("tableName") to remove the table from memory.

In-memory caching can be configured through the session's runtime configuration (for example, spark.conf.set(key, value) on a SparkSession) or by running SET key=value commands using SQL.

Other Configuration Options

The following options can also be used to tune the performance of query execution. These options may be deprecated in a future release as more optimizations are performed automatically.

Broadcast Hint for SQL Queries

The BROADCAST hint guides Spark to broadcast each specified table when joining it with another table or view. When Spark decides on the join method, the broadcast hash join (BHJ) is preferred, even if the table's estimated size exceeds the threshold set by spark.sql.autoBroadcastJoinThreshold. When both sides of a join are specified in the hint, Spark broadcasts the side with the smaller estimated size. Note that Spark does not guarantee that BHJ is always chosen, since not all cases (e.g. full outer join) support BHJ. When the broadcast nested loop join is selected instead, Spark still respects the hint.

{% highlight scala %}
import org.apache.spark.sql.functions.broadcast
broadcast(spark.table("src")).join(spark.table("records"), "key").show()
{% endhighlight %}

{% highlight java %}
import static org.apache.spark.sql.functions.broadcast;
broadcast(spark.table("src")).join(spark.table("records"), "key").show();
{% endhighlight %}

{% highlight python %}
from pyspark.sql.functions import broadcast
broadcast(spark.table("src")).join(spark.table("records"), "key").show()
{% endhighlight %}

{% highlight r %}
src <- sql("SELECT * FROM src")
records <- sql("SELECT * FROM records")
head(join(broadcast(src), records, src$key == records$key))
{% endhighlight %}

{% highlight sql %}
-- We accept BROADCAST, BROADCASTJOIN and MAPJOIN for broadcast hint
SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key
{% endhighlight %}
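To make the mechanics behind the hint more concrete, the following pure-Python toy model sketches the general idea of a broadcast hash join: the small side is hashed once and made available to every partition of the large side, so each partition can join locally without a shuffle. This is not Spark's implementation; the function name `broadcast_hash_join` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_partitions, key):
    """Toy model: hash the small side once, then probe it from every partition."""
    # Build phase: in a real engine, this hash table is what gets
    # broadcast to every executor.
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)

    # Probe phase: each partition of the large side joins locally
    # against the shared hash table.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in table.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical data: "src" is the small (broadcast) side, and "records"
# is the large side, split into two partitions.
src = [{"key": 1, "value": "a"}, {"key": 2, "value": "b"}]
records = [
    [{"key": 1, "amount": 10}, {"key": 3, "amount": 30}],
    [{"key": 2, "amount": 20}, {"key": 1, "amount": 5}],
]

result = broadcast_hash_join(src, records, "key")
for row in result:
    print(row)
```

Because the large side only probes the hash table, unmatched rows on the broadcast side are never visited in this model, which gives some intuition for why joins that must preserve both sides, such as a full outer join, do not fit the pattern.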