<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>pyspark.sql module &#8212; PySpark 2.2.1 documentation</title>
<link rel="stylesheet" href="_static/nature.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/pyspark.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: './',
VERSION: '2.2.1',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true,
SOURCELINK_SUFFIX: '.txt'
};
</script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="_static/pyspark.js"></script>
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="pyspark.streaming module" href="pyspark.streaming.html" />
<link rel="prev" title="pyspark package" href="pyspark.html" />
</head>
<body>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="pyspark.streaming.html" title="pyspark.streaming module"
accesskey="N">next</a></li>
<li class="right" >
<a href="pyspark.html" title="pyspark package"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">PySpark 2.2.1 documentation</a> &#187;</li>
<li class="nav-item nav-item-1"><a href="pyspark.html" accesskey="U">pyspark package</a> &#187;</li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="pyspark-sql-module">
<h1>pyspark.sql module<a class="headerlink" href="#pyspark-sql-module" title="Permalink to this headline"></a></h1>
<div class="section" id="module-pyspark.sql">
<span id="module-context"></span><h2>Module Context<a class="headerlink" href="#module-pyspark.sql" title="Permalink to this headline"></a></h2>
<p>Important classes of Spark SQL and DataFrames:</p>
<blockquote>
<div><ul class="simple">
<li><a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.SparkSession</span></code></a>
Main entry point for <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> and SQL functionality.</li>
<li><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.DataFrame</span></code></a>
A distributed collection of data grouped into named columns.</li>
<li><a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.Column</span></code></a>
A column expression in a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</li>
<li><a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.Row</span></code></a>
A row of data in a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</li>
<li><a class="reference internal" href="#pyspark.sql.GroupedData" title="pyspark.sql.GroupedData"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.GroupedData</span></code></a>
Aggregation methods, returned by <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.groupBy()</span></code></a>.</li>
<li><a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions" title="pyspark.sql.DataFrameNaFunctions"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.DataFrameNaFunctions</span></code></a>
Methods for handling missing data (null values).</li>
<li><a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions" title="pyspark.sql.DataFrameStatFunctions"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.DataFrameStatFunctions</span></code></a>
Methods for statistics functionality.</li>
<li><a class="reference internal" href="#module-pyspark.sql.functions" title="pyspark.sql.functions"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.functions</span></code></a>
List of built-in functions available for <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</li>
<li><a class="reference internal" href="#module-pyspark.sql.types" title="pyspark.sql.types"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types</span></code></a>
List of data types available.</li>
<li><a class="reference internal" href="#pyspark.sql.Window" title="pyspark.sql.Window"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.Window</span></code></a>
For working with window functions.</li>
</ul>
</div></blockquote>
<dl class="class">
<dt id="pyspark.sql.SparkSession">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">SparkSession</code><span class="sig-paren">(</span><em>sparkContext</em>, <em>jsparkSession=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession" title="Permalink to this definition"></a></dt>
<dd><p>The entry point to programming Spark with the Dataset and DataFrame API.</p>
<p>A SparkSession can be used to create <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>, register <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as
tables, execute SQL over tables, cache tables, and read parquet files.
To create a SparkSession, use the following builder pattern:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span> \
<span class="gp">... </span> <span class="o">.</span><span class="n">master</span><span class="p">(</span><span class="s2">&quot;local&quot;</span><span class="p">)</span> \
<span class="gp">... </span> <span class="o">.</span><span class="n">appName</span><span class="p">(</span><span class="s2">&quot;Word Count&quot;</span><span class="p">)</span> \
<span class="gp">... </span> <span class="o">.</span><span class="n">config</span><span class="p">(</span><span class="s2">&quot;spark.some.config.option&quot;</span><span class="p">,</span> <span class="s2">&quot;some-value&quot;</span><span class="p">)</span> \
<span class="gp">... </span> <span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
</pre></div>
</div>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.builder">
<code class="descname">builder</code><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition"></a></dt>
<dd><p>A class attribute having a <a class="reference internal" href="#pyspark.sql.SparkSession.Builder" title="pyspark.sql.SparkSession.Builder"><code class="xref py py-class docutils literal"><span class="pre">Builder</span></code></a> to construct <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> instances.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.SparkSession.Builder">
<em class="property">class </em><code class="descname">Builder</code><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.Builder"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.Builder" title="Permalink to this definition"></a></dt>
<dd><p>Builder for <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a>.</p>
<dl class="method">
<dt id="pyspark.sql.SparkSession.Builder.appName">
<code class="descname">appName</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.Builder.appName"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.Builder.appName" title="Permalink to this definition"></a></dt>
<dd><p>Sets a name for the application, which will be shown in the Spark web UI.</p>
<p>If no application name is set, a randomly generated name will be used.</p>
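<p>For example, a minimal sketch (reusing the application name from the builder example above):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>SparkSession.builder.appName(&quot;Word Count&quot;)
<span class="go">&lt;pyspark.sql.session...</span>
</pre></div>
</div>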
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>name</strong> – an application name</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.Builder.config">
<code class="descname">config</code><span class="sig-paren">(</span><em>key=None</em>, <em>value=None</em>, <em>conf=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.Builder.config"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.Builder.config" title="Permalink to this definition"></a></dt>
<dd><p>Sets a config option. Options set using this method are automatically propagated to
both <code class="xref py py-class docutils literal"><span class="pre">SparkConf</span></code> and <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a>’s own configuration.</p>
<p>For an existing SparkConf, use the <cite>conf</cite> parameter.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.conf</span> <span class="k">import</span> <span class="n">SparkConf</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">config</span><span class="p">(</span><span class="n">conf</span><span class="o">=</span><span class="n">SparkConf</span><span class="p">())</span>
<span class="go">&lt;pyspark.sql.session...</span>
</pre></div>
</div>
<p>For a (key, value) pair, you can omit parameter names.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">config</span><span class="p">(</span><span class="s2">&quot;spark.some.config.option&quot;</span><span class="p">,</span> <span class="s2">&quot;some-value&quot;</span><span class="p">)</span>
<span class="go">&lt;pyspark.sql.session...</span>
</pre></div>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>key</strong> – a key name string for configuration property</li>
<li><strong>value</strong> – a value for configuration property</li>
<li><strong>conf</strong> – an instance of <code class="xref py py-class docutils literal"><span class="pre">SparkConf</span></code></li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.Builder.enableHiveSupport">
<code class="descname">enableHiveSupport</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.Builder.enableHiveSupport"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.Builder.enableHiveSupport" title="Permalink to this definition"></a></dt>
<dd><p>Enables Hive support, including connectivity to a persistent Hive metastore, support
for Hive serdes, and Hive user-defined functions.</p>
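<p>For example, a sketch of building a Hive-enabled session (assumes Hive dependencies are available on the classpath):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark = SparkSession.builder.enableHiveSupport().getOrCreate()
</pre></div>
</div>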
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.Builder.getOrCreate">
<code class="descname">getOrCreate</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.Builder.getOrCreate"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.Builder.getOrCreate" title="Permalink to this definition"></a></dt>
<dd><p>Gets an existing <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> or, if there is no existing one, creates a
new one based on the options set in this builder.</p>
<p>This method first checks whether there is a valid global default SparkSession, and if
so, returns that one. If no valid global default SparkSession exists, the method
creates a new SparkSession and assigns the newly created SparkSession as the global
default.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">s1</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">config</span><span class="p">(</span><span class="s2">&quot;k1&quot;</span><span class="p">,</span> <span class="s2">&quot;v1&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">s1</span><span class="o">.</span><span class="n">conf</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&quot;k1&quot;</span><span class="p">)</span> <span class="o">==</span> <span class="n">s1</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">getConf</span><span class="p">()</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&quot;k1&quot;</span><span class="p">)</span> <span class="o">==</span> <span class="s2">&quot;v1&quot;</span>
<span class="go">True</span>
</pre></div>
</div>
<p>In case an existing SparkSession is returned, the config options specified
in this builder will be applied to the existing SparkSession.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">s2</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">config</span><span class="p">(</span><span class="s2">&quot;k2&quot;</span><span class="p">,</span> <span class="s2">&quot;v2&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">s1</span><span class="o">.</span><span class="n">conf</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&quot;k1&quot;</span><span class="p">)</span> <span class="o">==</span> <span class="n">s2</span><span class="o">.</span><span class="n">conf</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&quot;k1&quot;</span><span class="p">)</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">s1</span><span class="o">.</span><span class="n">conf</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&quot;k2&quot;</span><span class="p">)</span> <span class="o">==</span> <span class="n">s2</span><span class="o">.</span><span class="n">conf</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&quot;k2&quot;</span><span class="p">)</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.Builder.master">
<code class="descname">master</code><span class="sig-paren">(</span><em>master</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.Builder.master"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.Builder.master" title="Permalink to this definition"></a></dt>
<dd><p>Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]”
to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone
cluster.</p>
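<p>For example (illustrative, using the local master URL described above):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>SparkSession.builder.master(&quot;local[4]&quot;)
<span class="go">&lt;pyspark.sql.session...</span>
</pre></div>
</div>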
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>master</strong> – a url for spark master</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.catalog">
<code class="descname">catalog</code><a class="headerlink" href="#pyspark.sql.SparkSession.catalog" title="Permalink to this definition"></a></dt>
<dd><p>Interface through which the user may create, drop, alter or query underlying
databases, tables, functions, etc.</p>
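<p>For example, a sketch assuming the <code class="docutils literal"><span class="pre">df</span></code> used in the examples below:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df.createOrReplaceTempView(&quot;table1&quot;)
<span class="gp">&gt;&gt;&gt; </span>&quot;table1&quot; in [t.name for t in spark.catalog.listTables()]
<span class="go">True</span>
</pre></div>
</div>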
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.conf">
<code class="descname">conf</code><a class="headerlink" href="#pyspark.sql.SparkSession.conf" title="Permalink to this definition"></a></dt>
<dd><p>Runtime configuration interface for Spark.</p>
<p>This is the interface through which the user can get and set all Spark and Hadoop
configurations that are relevant to Spark SQL. When getting the value of a config,
this defaults to the value set in the underlying <code class="xref py py-class docutils literal"><span class="pre">SparkContext</span></code>, if any.</p>
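<p>For example (the property and value shown are arbitrary):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.conf.set(&quot;spark.sql.shuffle.partitions&quot;, &quot;50&quot;)
<span class="gp">&gt;&gt;&gt; </span>spark.conf.get(&quot;spark.sql.shuffle.partitions&quot;)
<span class="go">&#39;50&#39;</span>
</pre></div>
</div>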
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.createDataFrame">
<code class="descname">createDataFrame</code><span class="sig-paren">(</span><em>data</em>, <em>schema=None</em>, <em>samplingRatio=None</em>, <em>verifySchema=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.createDataFrame"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.createDataFrame" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> from an <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code>, a list or a <code class="xref py py-class docutils literal"><span class="pre">pandas.DataFrame</span></code>.</p>
<p>When <code class="docutils literal"><span class="pre">schema</span></code> is a list of column names, the type of each column
will be inferred from <code class="docutils literal"><span class="pre">data</span></code>.</p>
<p>When <code class="docutils literal"><span class="pre">schema</span></code> is <code class="docutils literal"><span class="pre">None</span></code>, it will try to infer the schema (column names and types)
from <code class="docutils literal"><span class="pre">data</span></code>, which should be an RDD of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>,
or <code class="xref py py-class docutils literal"><span class="pre">namedtuple</span></code>, or <code class="xref py py-class docutils literal"><span class="pre">dict</span></code>.</p>
<p>When <code class="docutils literal"><span class="pre">schema</span></code> is <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> or a datatype string, it must match
the real data, or an exception will be thrown at runtime. If the given schema is not
<a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a>, it will be wrapped into a
<a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> as its only field, and the field name will be “value”,
each record will also be wrapped into a tuple, which can be converted to row later.</p>
<p>If schema inference is needed, <code class="docutils literal"><span class="pre">samplingRatio</span></code> is used to determine the ratio of
rows used for schema inference. The first row will be used if <code class="docutils literal"><span class="pre">samplingRatio</span></code> is <code class="docutils literal"><span class="pre">None</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>data</strong> – an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean,
etc.), or <code class="xref py py-class docutils literal"><span class="pre">list</span></code>, or <code class="xref py py-class docutils literal"><span class="pre">pandas.DataFrame</span></code>.</li>
<li><strong>schema</strong> – a <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> or a datatype string or a list of
column names, default is <code class="docutils literal"><span class="pre">None</span></code>. The data type string format equals
<a class="reference internal" href="#pyspark.sql.types.DataType.simpleString" title="pyspark.sql.types.DataType.simpleString"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType.simpleString</span></code></a>, except that top level struct type can
omit the <code class="docutils literal"><span class="pre">struct&lt;&gt;</span></code> and atomic types use <code class="docutils literal"><span class="pre">typeName()</span></code> as their format, e.g. use
<code class="docutils literal"><span class="pre">byte</span></code> instead of <code class="docutils literal"><span class="pre">tinyint</span></code> for <a class="reference internal" href="#pyspark.sql.types.ByteType" title="pyspark.sql.types.ByteType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.ByteType</span></code></a>. We can also use
<code class="docutils literal"><span class="pre">int</span></code> as a short name for <code class="docutils literal"><span class="pre">IntegerType</span></code>.</li>
<li><strong>samplingRatio</strong> – the sample ratio of rows used for inferring</li>
<li><strong>verifySchema</strong> – verify data types of every row against schema.</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></p>
</td>
</tr>
</tbody>
</table>
<div class="versionchanged">
<p><span class="versionmodified">Changed in version 2.1: </span>Added verifySchema.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">l</span> <span class="o">=</span> <span class="p">[(</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_1=&#39;Alice&#39;, _2=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">d</span> <span class="o">=</span> <span class="p">[{</span><span class="s1">&#39;name&#39;</span><span class="p">:</span> <span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">:</span> <span class="mi">1</span><span class="p">}]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=1, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">rdd</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_1=&#39;Alice&#39;, _2=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">Person</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">person</span> <span class="o">=</span> <span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">r</span><span class="p">:</span> <span class="n">Person</span><span class="p">(</span><span class="o">*</span><span class="n">r</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">person</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span>
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">)])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">toPandas</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">pandas</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]]))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(0=1, 1=2)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="s2">&quot;a: string, b: int&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(a=&#39;Alice&#39;, b=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rdd</span> <span class="o">=</span> <span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="s2">&quot;int&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(value=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="s2">&quot;boolean&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="o">...</span>
<span class="gr">Py4JJavaError</span>: <span class="n">...</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.newSession">
<code class="descname">newSession</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.newSession"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.newSession" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new SparkSession as a new session, with separate SQLConf,
registered temporary views and UDFs, but a shared SparkContext and
table cache.</p>
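<p>For example, a sketch showing that the SparkContext is shared:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>s = spark.newSession()
<span class="gp">&gt;&gt;&gt; </span>s is spark
<span class="go">False</span>
<span class="gp">&gt;&gt;&gt; </span>s.sparkContext is spark.sparkContext
<span class="go">True</span>
</pre></div>
</div>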
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.range">
<code class="descname">range</code><span class="sig-paren">(</span><em>start</em>, <em>end=None</em>, <em>step=1</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.range"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.range" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with a single <a class="reference internal" href="#pyspark.sql.types.LongType" title="pyspark.sql.types.LongType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.LongType</span></code></a> column named
<code class="docutils literal"><span class="pre">id</span></code>, containing elements in a range from <code class="docutils literal"><span class="pre">start</span></code> to <code class="docutils literal"><span class="pre">end</span></code> (exclusive) with
step value <code class="docutils literal"><span class="pre">step</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>start</strong> – the start value</li>
<li><strong>end</strong> – the end value (exclusive)</li>
<li><strong>step</strong> – the incremental step (default: 1)</li>
<li><strong>numPartitions</strong> – the number of partitions of the DataFrame</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(id=1), Row(id=3), Row(id=5)]</span>
</pre></div>
</div>
<p>If only one argument is specified, it will be used as the end value.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(id=0), Row(id=1), Row(id=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.read">
<code class="descname">read</code><a class="headerlink" href="#pyspark.sql.SparkSession.read" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameReader" title="pyspark.sql.DataFrameReader"><code class="xref py py-class docutils literal"><span class="pre">DataFrameReader</span></code></a> that can be used to read data
in as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrameReader" title="pyspark.sql.DataFrameReader"><code class="xref py py-class docutils literal"><span class="pre">DataFrameReader</span></code></a></td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.readStream">
<code class="descname">readStream</code><a class="headerlink" href="#pyspark.sql.SparkSession.readStream" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">DataStreamReader</span></code> that can be used to read data streams
as a streaming <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><code class="xref py py-class docutils literal"><span class="pre">DataStreamReader</span></code></td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.sparkContext">
<code class="descname">sparkContext</code><a class="headerlink" href="#pyspark.sql.SparkSession.sparkContext" title="Permalink to this definition"></a></dt>
<dd><p>Returns the underlying <code class="xref py py-class docutils literal"><span class="pre">SparkContext</span></code>.</p>
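<p>For example (assuming the global <code class="docutils literal"><span class="pre">sc</span></code> used in the <code class="docutils literal"><span class="pre">createDataFrame</span></code> examples above):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.sparkContext is sc
<span class="go">True</span>
</pre></div>
</div>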
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.sql">
<code class="descname">sql</code><span class="sig-paren">(</span><em>sqlQuery</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.sql"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.sql" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> representing the result of the given query.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT field1 AS f1, field2 as f2 from table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(f1=1, f2=&#39;row1&#39;), Row(f1=2, f2=&#39;row2&#39;), Row(f1=3, f2=&#39;row3&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.stop">
<code class="descname">stop</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.stop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.stop" title="Permalink to this definition"></a></dt>
<dd><p>Stop the underlying <code class="xref py py-class docutils literal"><span class="pre">SparkContext</span></code>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.streams">
<code class="descname">streams</code><a class="headerlink" href="#pyspark.sql.SparkSession.streams" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">StreamingQueryManager</span></code> that allows managing all the
<code class="xref py py-class docutils literal"><span class="pre">StreamingQuery</span></code> instances active on this context.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><code class="xref py py-class docutils literal"><span class="pre">StreamingQueryManager</span></code></td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SparkSession.table">
<code class="descname">table</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/session.html#SparkSession.table"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SparkSession.table" title="Permalink to this definition"></a></dt>
<dd><p>Returns the specified table as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.udf">
<code class="descname">udf</code><a class="headerlink" href="#pyspark.sql.SparkSession.udf" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.UDFRegistration" title="pyspark.sql.UDFRegistration"><code class="xref py py-class docutils literal"><span class="pre">UDFRegistration</span></code></a> for UDF registration.</p>
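<p>For example, a minimal sketch (the function name <code class="docutils literal"><span class="pre">slen</span></code> is arbitrary):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql.types import IntegerType
<span class="gp">&gt;&gt;&gt; </span>spark.udf.register(&quot;slen&quot;, lambda s: len(s), IntegerType())
<span class="gp">&gt;&gt;&gt; </span>spark.sql(&quot;SELECT slen(&#39;abc&#39;) AS n&quot;).collect()
<span class="go">[Row(n=3)]</span>
</pre></div>
</div>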
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.UDFRegistration" title="pyspark.sql.UDFRegistration"><code class="xref py py-class docutils literal"><span class="pre">UDFRegistration</span></code></a></td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SparkSession.version">
<code class="descname">version</code><a class="headerlink" href="#pyspark.sql.SparkSession.version" title="Permalink to this definition"></a></dt>
<dd><p>The version of Spark on which this application is running.</p>
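<p>For example, for this release:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.version
<span class="go">&#39;2.2.1&#39;</span>
</pre></div>
</div>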
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.SQLContext">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">SQLContext</code><span class="sig-paren">(</span><em>sparkContext</em>, <em>sparkSession=None</em>, <em>jsqlContext=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext" title="Permalink to this definition"></a></dt>
<dd><p>The entry point for working with structured data (rows and columns) in Spark, in Spark 1.x.</p>
<p>As of Spark 2.0, this is replaced by <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a>. However, we are keeping the class
here for backward compatibility.</p>
<p>A SQLContext can be used to create <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>, register <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as
tables, execute SQL over tables, cache tables, and read parquet files.</p>
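<p>For example, to construct one from an existing <code class="xref py py-class docutils literal"><span class="pre">SparkContext</span></code> (a sketch, assuming a running <code class="docutils literal"><span class="pre">sc</span></code>):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql import SQLContext
<span class="gp">&gt;&gt;&gt; </span>sqlContext = SQLContext(sc)
</pre></div>
</div>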
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>sparkContext</strong> – The <code class="xref py py-class docutils literal"><span class="pre">SparkContext</span></code> backing this SQLContext.</li>
<li><strong>sparkSession</strong> – The <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> that this SQLContext wraps.</li>
<li><strong>jsqlContext</strong> – An optional JVM Scala SQLContext. If set, we do not instantiate a new
SQLContext in the JVM, instead we make all calls to this object.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<dl class="method">
<dt id="pyspark.sql.SQLContext.cacheTable">
<code class="descname">cacheTable</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.cacheTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.cacheTable" title="Permalink to this definition"></a></dt>
<dd><p>Caches the specified table in-memory.</p>
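<p>For example, a sketch assuming a <code class="docutils literal"><span class="pre">df</span></code> as in the examples below:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df.createOrReplaceTempView(&quot;table1&quot;)
<span class="gp">&gt;&gt;&gt; </span>sqlContext.cacheTable(&quot;table1&quot;)
</pre></div>
</div>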
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.clearCache">
<code class="descname">clearCache</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.clearCache"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.clearCache" title="Permalink to this definition"></a></dt>
<dd><p>Removes all cached tables from the in-memory cache.</p>
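<p>For example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>sqlContext.clearCache()
</pre></div>
</div>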
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.createDataFrame">
<code class="descname">createDataFrame</code><span class="sig-paren">(</span><em>data</em>, <em>schema=None</em>, <em>samplingRatio=None</em>, <em>verifySchema=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.createDataFrame"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.createDataFrame" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> from an <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code>, a list or a <code class="xref py py-class docutils literal"><span class="pre">pandas.DataFrame</span></code>.</p>
<p>When <code class="docutils literal"><span class="pre">schema</span></code> is a list of column names, the type of each column
will be inferred from <code class="docutils literal"><span class="pre">data</span></code>.</p>
<p>When <code class="docutils literal"><span class="pre">schema</span></code> is <code class="docutils literal"><span class="pre">None</span></code>, it will try to infer the schema (column names and types)
from <code class="docutils literal"><span class="pre">data</span></code>, which should be an RDD of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>,
or <code class="xref py py-class docutils literal"><span class="pre">namedtuple</span></code>, or <code class="xref py py-class docutils literal"><span class="pre">dict</span></code>.</p>
<p>When <code class="docutils literal"><span class="pre">schema</span></code> is <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> or a datatype string, it must match
the real data, or an exception will be thrown at runtime. If the given schema is not
<a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a>, it will be wrapped into a
<a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> as its only field, and the field name will be “value”,
each record will also be wrapped into a tuple, which can be converted to row later.</p>
<p>If schema inference is needed, <code class="docutils literal"><span class="pre">samplingRatio</span></code> is used to determine the ratio of
rows used for schema inference. The first row will be used if <code class="docutils literal"><span class="pre">samplingRatio</span></code> is <code class="docutils literal"><span class="pre">None</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>data</strong> – an RDD of any kind of SQL data representation (e.g. <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>,
<code class="xref py py-class docutils literal"><span class="pre">tuple</span></code>, <code class="docutils literal"><span class="pre">int</span></code>, <code class="docutils literal"><span class="pre">boolean</span></code>, etc.), or <code class="xref py py-class docutils literal"><span class="pre">list</span></code>, or
<code class="xref py py-class docutils literal"><span class="pre">pandas.DataFrame</span></code>.</li>
<li><strong>schema</strong> – a <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> or a datatype string or a list of
column names; default is None. The data type string format is the same as
<a class="reference internal" href="#pyspark.sql.types.DataType.simpleString" title="pyspark.sql.types.DataType.simpleString"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType.simpleString</span></code></a>, except that the top-level struct type can
omit the <code class="docutils literal"><span class="pre">struct&lt;&gt;</span></code> wrapper and atomic types use <code class="docutils literal"><span class="pre">typeName()</span></code> as their format, e.g. use
<code class="docutils literal"><span class="pre">byte</span></code> instead of <code class="docutils literal"><span class="pre">tinyint</span></code> for <a class="reference internal" href="#pyspark.sql.types.ByteType" title="pyspark.sql.types.ByteType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.ByteType</span></code></a>.
<code class="docutils literal"><span class="pre">int</span></code> can also be used as a short name for <a class="reference internal" href="#pyspark.sql.types.IntegerType" title="pyspark.sql.types.IntegerType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.IntegerType</span></code></a>.</li>
<li><strong>samplingRatio</strong> – the sampling ratio of rows used when inferring the schema</li>
<li><strong>verifySchema</strong> – whether to verify the data types of every row against the schema (default True).</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></p>
</td>
</tr>
</tbody>
</table>
<div class="versionchanged">
<p><span class="versionmodified">Changed in version 2.0: </span>The <code class="docutils literal"><span class="pre">schema</span></code> parameter can be a <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> or a
datatype string after 2.0.
If it’s not a <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a>, it will be wrapped into a
<a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> and each record will also be wrapped into a tuple.</p>
</div>
<div class="versionchanged">
<p><span class="versionmodified">Changed in version 2.1: </span>Added verifySchema.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">l</span> <span class="o">=</span> <span class="p">[(</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_1=&#39;Alice&#39;, _2=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">d</span> <span class="o">=</span> <span class="p">[{</span><span class="s1">&#39;name&#39;</span><span class="p">:</span> <span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">:</span> <span class="mi">1</span><span class="p">}]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=1, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">rdd</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_1=&#39;Alice&#39;, _2=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">Person</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">person</span> <span class="o">=</span> <span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">r</span><span class="p">:</span> <span class="n">Person</span><span class="p">(</span><span class="o">*</span><span class="n">r</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">person</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span>
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">)])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">toPandas</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">pandas</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]]))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(0=1, 1=2)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="s2">&quot;a: string, b: int&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(a=&#39;Alice&#39;, b=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rdd</span> <span class="o">=</span> <span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="s2">&quot;int&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(value=1)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="s2">&quot;boolean&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="o">...</span>
<span class="gr">Py4JJavaError</span>: <span class="n">...</span>
</pre></div>
</div>
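<p>A minimal sketch (not part of the original doctests) showing <code class="docutils literal"><span class="pre">samplingRatio</span></code>
in action: only about half of the hypothetical rows below are scanned to infer the column types.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; from pyspark.sql import Row
&gt;&gt;&gt; rows = sc.parallelize(range(100)).map(lambda i: Row(name=str(i), age=i))
&gt;&gt;&gt; sqlContext.createDataFrame(rows, samplingRatio=0.5).dtypes
[(&#39;age&#39;, &#39;bigint&#39;), (&#39;name&#39;, &#39;string&#39;)]
</pre></div>
</div>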
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.createExternalTable">
<code class="descname">createExternalTable</code><span class="sig-paren">(</span><em>tableName</em>, <em>path=None</em>, <em>source=None</em>, <em>schema=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.createExternalTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.createExternalTable" title="Permalink to this definition"></a></dt>
<dd><p>Creates an external table based on the dataset in a data source.</p>
<p>It returns the DataFrame associated with the external table.</p>
<p>The data source is specified by the <code class="docutils literal"><span class="pre">source</span></code> and a set of <code class="docutils literal"><span class="pre">options</span></code>.
If <code class="docutils literal"><span class="pre">source</span></code> is not specified, the default data source configured by
<code class="docutils literal"><span class="pre">spark.sql.sources.default</span></code> will be used.</p>
<p>Optionally, a schema can be provided as the schema of the returned <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> and
created external table.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
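<p>A minimal sketch, assuming a Parquet dataset already exists at the hypothetical path
<code class="docutils literal"><span class="pre">/tmp/people</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; people = sqlContext.createExternalTable(&quot;people&quot;, path=&quot;/tmp/people&quot;, source=&quot;parquet&quot;)
&gt;&gt;&gt; &quot;people&quot; in sqlContext.tableNames()
True
</pre></div>
</div>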
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.dropTempTable">
<code class="descname">dropTempTable</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.dropTempTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.dropTempTable" title="Permalink to this definition"></a></dt>
<dd><p>Removes the temporary table from the catalog.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">dropTempTable</span><span class="p">(</span><span class="s2">&quot;table1&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.getConf">
<code class="descname">getConf</code><span class="sig-paren">(</span><em>key</em>, <em>defaultValue=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.getConf"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.getConf" title="Permalink to this definition"></a></dt>
<dd><p>Returns the value of Spark SQL configuration property for the given key.</p>
<p>If the key is not set and <code class="docutils literal"><span class="pre">defaultValue</span></code> is not None, <code class="docutils literal"><span class="pre">defaultValue</span></code> is returned.
If the key is not set and <code class="docutils literal"><span class="pre">defaultValue</span></code> is None,
the system default value is returned.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">getConf</span><span class="p">(</span><span class="s2">&quot;spark.sql.shuffle.partitions&quot;</span><span class="p">)</span>
<span class="go">&#39;200&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">getConf</span><span class="p">(</span><span class="s2">&quot;spark.sql.shuffle.partitions&quot;</span><span class="p">,</span> <span class="sa">u</span><span class="s2">&quot;10&quot;</span><span class="p">)</span>
<span class="go">&#39;10&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">setConf</span><span class="p">(</span><span class="s2">&quot;spark.sql.shuffle.partitions&quot;</span><span class="p">,</span> <span class="sa">u</span><span class="s2">&quot;50&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">getConf</span><span class="p">(</span><span class="s2">&quot;spark.sql.shuffle.partitions&quot;</span><span class="p">,</span> <span class="sa">u</span><span class="s2">&quot;10&quot;</span><span class="p">)</span>
<span class="go">&#39;50&#39;</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.SQLContext.getOrCreate">
<em class="property">classmethod </em><code class="descname">getOrCreate</code><span class="sig-paren">(</span><em>sc</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.getOrCreate"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.getOrCreate" title="Permalink to this definition"></a></dt>
<dd><p>Gets the existing SQLContext or creates a new one with the given SparkContext.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>sc</strong> – SparkContext</td>
</tr>
</tbody>
</table>
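<p>For example, repeated calls return the same instance rather than constructing a new context:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; from pyspark.sql import SQLContext
&gt;&gt;&gt; SQLContext.getOrCreate(sc) is SQLContext.getOrCreate(sc)
True
</pre></div>
</div>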
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.newSession">
<code class="descname">newSession</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.newSession"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.newSession" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new SQLContext as a new session, which has a separate SQLConf and its own
registered temporary views and UDFs, but shares the SparkContext and
table cache.</p>
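<p>A short sketch (using the <code class="docutils literal"><span class="pre">df</span></code> from the examples above): temporary tables
registered in one session are not visible in another.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; new_ctx = sqlContext.newSession()
&gt;&gt;&gt; new_ctx is sqlContext
False
&gt;&gt;&gt; sqlContext.registerDataFrameAsTable(df, &quot;session_local&quot;)
&gt;&gt;&gt; &quot;session_local&quot; in new_ctx.tableNames()
False
</pre></div>
</div>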
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.range">
<code class="descname">range</code><span class="sig-paren">(</span><em>start</em>, <em>end=None</em>, <em>step=1</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.range"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.range" title="Permalink to this definition"></a></dt>
<dd><p>Create a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with single <a class="reference internal" href="#pyspark.sql.types.LongType" title="pyspark.sql.types.LongType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.LongType</span></code></a> column named
<code class="docutils literal"><span class="pre">id</span></code>, containing elements in a range from <code class="docutils literal"><span class="pre">start</span></code> to <code class="docutils literal"><span class="pre">end</span></code> (exclusive) with
step value <code class="docutils literal"><span class="pre">step</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>start</strong> – the start value</li>
<li><strong>end</strong> – the end value (exclusive)</li>
<li><strong>step</strong> – the incremental step (default: 1)</li>
<li><strong>numPartitions</strong> – the number of partitions of the DataFrame</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(id=1), Row(id=3), Row(id=5)]</span>
</pre></div>
</div>
<p>If only one argument is specified, it will be used as the end value.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(id=0), Row(id=1), Row(id=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SQLContext.read">
<code class="descname">read</code><a class="headerlink" href="#pyspark.sql.SQLContext.read" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameReader" title="pyspark.sql.DataFrameReader"><code class="xref py py-class docutils literal"><span class="pre">DataFrameReader</span></code></a> that can be used to read data
in as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrameReader" title="pyspark.sql.DataFrameReader"><code class="xref py py-class docutils literal"><span class="pre">DataFrameReader</span></code></a></td>
</tr>
</tbody>
</table>
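<p>For example, loading a JSON dataset through the returned reader (the path is hypothetical);
the two forms below are equivalent:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = sqlContext.read.json(&quot;/tmp/people.json&quot;)
&gt;&gt;&gt; df = sqlContext.read.format(&quot;json&quot;).load(&quot;/tmp/people.json&quot;)
</pre></div>
</div>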
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SQLContext.readStream">
<code class="descname">readStream</code><a class="headerlink" href="#pyspark.sql.SQLContext.readStream" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">DataStreamReader</span></code> that can be used to read data streams
as a streaming <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><code class="xref py py-class docutils literal"><span class="pre">DataStreamReader</span></code></td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">text_sdf</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">text_sdf</span><span class="o">.</span><span class="n">isStreaming</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.registerDataFrameAsTable">
<code class="descname">registerDataFrameAsTable</code><span class="sig-paren">(</span><em>df</em>, <em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.registerDataFrameAsTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.registerDataFrameAsTable" title="Permalink to this definition"></a></dt>
<dd><p>Registers the given <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as a temporary table in the catalog.</p>
<p>Temporary tables exist only during the lifetime of this instance of <a class="reference internal" href="#pyspark.sql.SQLContext" title="pyspark.sql.SQLContext"><code class="xref py py-class docutils literal"><span class="pre">SQLContext</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s2">&quot;table1&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.registerFunction">
<code class="descname">registerFunction</code><span class="sig-paren">(</span><em>name</em>, <em>f</em>, <em>returnType=StringType</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.registerFunction"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.registerFunction" title="Permalink to this definition"></a></dt>
<dd><p>Registers a Python function (including a lambda function) as a UDF
so it can be used in SQL statements.</p>
<p>In addition to a name and the function itself, the return type can be optionally specified.
When the return type is not given, it defaults to string and the conversion is done
automatically. For any other return type, the produced object must match the specified type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>name</strong> – name of the UDF</li>
<li><strong>f</strong> – python function</li>
<li><strong>returnType</strong> – a <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> object</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerFunction</span><span class="p">(</span><span class="s2">&quot;stringLengthString&quot;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT stringLengthString(&#39;test&#39;)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(stringLengthString(test)=&#39;4&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="n">IntegerType</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerFunction</span><span class="p">(</span><span class="s2">&quot;stringLengthInt&quot;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT stringLengthInt(&#39;test&#39;)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(stringLengthInt(test)=4)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="n">IntegerType</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">udf</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="s2">&quot;stringLengthInt&quot;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT stringLengthInt(&#39;test&#39;)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(stringLengthInt(test)=4)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.2.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.registerJavaFunction">
<code class="descname">registerJavaFunction</code><span class="sig-paren">(</span><em>name</em>, <em>javaClassName</em>, <em>returnType=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.registerJavaFunction"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.registerJavaFunction" title="Permalink to this definition"></a></dt>
<dd><p>Registers a Java UDF so it can be used in SQL statements.</p>
<p>In addition to a name and the fully qualified name of the Java class, the return type can be
optionally specified. When the return type is not specified, it is inferred via reflection.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>name</strong> – name of the UDF</li>
<li><strong>javaClassName</strong> – fully qualified name of the Java class</li>
<li><strong>returnType</strong> – a <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> object</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerJavaFunction</span><span class="p">(</span><span class="s2">&quot;javaStringLength&quot;</span><span class="p">,</span>
<span class="gp">... </span> <span class="s2">&quot;test.org.apache.spark.sql.JavaStringLength&quot;</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT javaStringLength(&#39;test&#39;)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(UDF(test)=4)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerJavaFunction</span><span class="p">(</span><span class="s2">&quot;javaStringLength2&quot;</span><span class="p">,</span>
<span class="gp">... </span> <span class="s2">&quot;test.org.apache.spark.sql.JavaStringLength&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT javaStringLength2(&#39;test&#39;)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(UDF(test)=4)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.setConf">
<code class="descname">setConf</code><span class="sig-paren">(</span><em>key</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.setConf"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.setConf" title="Permalink to this definition"></a></dt>
<dd><p>Sets the given Spark SQL configuration property.</p>
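<p>For example, paired with <code class="docutils literal"><span class="pre">getConf</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; sqlContext.setConf(&quot;spark.sql.shuffle.partitions&quot;, u&quot;10&quot;)
&gt;&gt;&gt; sqlContext.getConf(&quot;spark.sql.shuffle.partitions&quot;)
&#39;10&#39;
</pre></div>
</div>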
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.sql">
<code class="descname">sql</code><span class="sig-paren">(</span><em>sqlQuery</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.sql"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.sql" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> representing the result of the given query.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT field1 AS f1, field2 as f2 from table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(f1=1, f2=&#39;row1&#39;), Row(f1=2, f2=&#39;row2&#39;), Row(f1=3, f2=&#39;row3&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SQLContext.streams">
<code class="descname">streams</code><a class="headerlink" href="#pyspark.sql.SQLContext.streams" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">StreamingQueryManager</span></code> that allows managing all the
<code class="xref py py-class docutils literal"><span class="pre">StreamingQuery</span></code> instances active on <cite>this</cite> context.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
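<p>A short sketch: with no streaming queries started, the manager reports no active queries.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; sqlContext.streams.active
[]
</pre></div>
</div>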
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.table">
<code class="descname">table</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.table"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.table" title="Permalink to this definition"></a></dt>
<dd><p>Returns the specified table or view as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.tableNames">
<code class="descname">tableNames</code><span class="sig-paren">(</span><em>dbName=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.tableNames"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.tableNames" title="Permalink to this definition"></a></dt>
<dd><p>Returns a list of names of tables in the database <code class="docutils literal"><span class="pre">dbName</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>dbName</strong> – string, name of the database to use. Default to the current database.</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">list of table names, in string</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s2">&quot;table1&quot;</span> <span class="ow">in</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">tableNames</span><span class="p">()</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s2">&quot;table1&quot;</span> <span class="ow">in</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">tableNames</span><span class="p">(</span><span class="s2">&quot;default&quot;</span><span class="p">)</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.tables">
<code class="descname">tables</code><span class="sig-paren">(</span><em>dbName=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.tables"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.tables" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing names of tables in the given database.</p>
<p>If <code class="docutils literal"><span class="pre">dbName</span></code> is not specified, the current database will be used.</p>
<p>The returned DataFrame has two columns: <code class="docutils literal"><span class="pre">tableName</span></code> and <code class="docutils literal"><span class="pre">isTemporary</span></code>
(a column with <code class="xref py py-class docutils literal"><span class="pre">BooleanType</span></code> indicating if a table is a temporary one or not).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>dbName</strong> – string, name of the database to use.</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s2">&quot;table1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">tables</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="s2">&quot;tableName = &#39;table1&#39;&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">Row(database=&#39;&#39;, tableName=&#39;table1&#39;, isTemporary=True)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SQLContext.udf">
<code class="descname">udf</code><a class="headerlink" href="#pyspark.sql.SQLContext.udf" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.UDFRegistration" title="pyspark.sql.UDFRegistration"><code class="xref py py-class docutils literal"><span class="pre">UDFRegistration</span></code></a> for UDF registration.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.UDFRegistration" title="pyspark.sql.UDFRegistration"><code class="xref py py-class docutils literal"><span class="pre">UDFRegistration</span></code></a></td>
</tr>
</tbody>
</table>
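<p>For example, registering a UDF through the returned <a class="reference internal" href="#pyspark.sql.UDFRegistration" title="pyspark.sql.UDFRegistration"><code class="xref py py-class docutils literal"><span class="pre">UDFRegistration</span></code></a>
(equivalent to <code class="docutils literal"><span class="pre">registerFunction</span></code> above; the UDF name is hypothetical):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; from pyspark.sql.types import IntegerType
&gt;&gt;&gt; sqlContext.udf.register(&quot;plusOne&quot;, lambda x: x + 1, IntegerType())
&gt;&gt;&gt; sqlContext.sql(&quot;SELECT plusOne(4)&quot;).collect()
[Row(plusOne(4)=5)]
</pre></div>
</div>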
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.uncacheTable">
<code class="descname">uncacheTable</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.uncacheTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.uncacheTable" title="Permalink to this definition"></a></dt>
<dd><p>Removes the specified table from the in-memory cache.</p>
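<p>For example, paired with <code class="docutils literal"><span class="pre">cacheTable</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; sqlContext.registerDataFrameAsTable(df, &quot;table1&quot;)
&gt;&gt;&gt; sqlContext.cacheTable(&quot;table1&quot;)
&gt;&gt;&gt; sqlContext.uncacheTable(&quot;table1&quot;)
</pre></div>
</div>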
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.HiveContext">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">HiveContext</code><span class="sig-paren">(</span><em>sparkContext</em>, <em>jhiveContext=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#HiveContext"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.HiveContext" title="Permalink to this definition"></a></dt>
<dd><p>A variant of Spark SQL that integrates with data stored in Hive.</p>
<p>Configuration for Hive is read from <code class="docutils literal"><span class="pre">hive-site.xml</span></code> on the classpath.
It supports running both SQL and HiveQL commands.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>sparkContext</strong> – The SparkContext to wrap.</li>
<li><strong>jhiveContext</strong> – An optional JVM Scala HiveContext. If set, we do not instantiate a new
<a class="reference internal" href="#pyspark.sql.HiveContext" title="pyspark.sql.HiveContext"><code class="xref py py-class docutils literal"><span class="pre">HiveContext</span></code></a> in the JVM, instead we make all calls to this object.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 2.0.0. Use SparkSession.builder.enableHiveSupport().getOrCreate().</p>
</div>
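<p>For example, the recommended replacement is a Hive-enabled <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; from pyspark.sql import SparkSession
&gt;&gt;&gt; spark = SparkSession.builder.enableHiveSupport().getOrCreate()
</pre></div>
</div>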
<dl class="method">
<dt id="pyspark.sql.HiveContext.refreshTable">
<code class="descname">refreshTable</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#HiveContext.refreshTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.HiveContext.refreshTable" title="Permalink to this definition"></a></dt>
<dd><p>Invalidates and refreshes all the cached metadata of the given
table. For performance reasons, Spark SQL or the external data source
library it uses might cache certain metadata about a table, such as the
location of blocks. When those change outside of Spark SQL, users should
call this function to invalidate the cache.</p>
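<p>A minimal sketch (the table name is hypothetical and assumed to exist in the metastore):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; hive_ctx = HiveContext(sc)  # deprecated; see the note above
&gt;&gt;&gt; hive_ctx.refreshTable(&quot;my_table&quot;)
</pre></div>
</div>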
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.UDFRegistration">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">UDFRegistration</code><span class="sig-paren">(</span><em>sqlContext</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#UDFRegistration"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.UDFRegistration" title="Permalink to this definition"></a></dt>
<dd><p>Wrapper for user-defined function registration.</p>
<dl class="method">
<dt id="pyspark.sql.UDFRegistration.register">
<code class="descname">register</code><span class="sig-paren">(</span><em>name</em>, <em>f</em>, <em>returnType=StringType</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#UDFRegistration.register"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.UDFRegistration.register" title="Permalink to this definition"></a></dt>
<dd><p>Registers a Python function (including a lambda function) as a UDF
so it can be used in SQL statements.</p>
<p>In addition to a name and the function itself, the return type can be optionally specified.
When the return type is not given, it defaults to string and the conversion is done
automatically. For any other return type, the produced object must match the specified type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>name</strong> – name of the UDF</li>
<li><strong>f</strong> – python function</li>
<li><strong>returnType</strong> – a <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> object</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerFunction</span><span class="p">(</span><span class="s2">&quot;stringLengthString&quot;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT stringLengthString(&#39;test&#39;)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(stringLengthString(test)=&#39;4&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="n">IntegerType</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerFunction</span><span class="p">(</span><span class="s2">&quot;stringLengthInt&quot;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT stringLengthInt(&#39;test&#39;)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(stringLengthInt(test)=4)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="n">IntegerType</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">udf</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="s2">&quot;stringLengthInt&quot;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;SELECT stringLengthInt(&#39;test&#39;)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(stringLengthInt(test)=4)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.2.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrame">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrame</code><span class="sig-paren">(</span><em>jdf</em>, <em>sql_ctx</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame" title="Permalink to this definition"></a></dt>
<dd><p>A distributed collection of data grouped into named columns.</p>
<p>A <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> is equivalent to a relational table in Spark SQL,
and can be created using various functions in <a class="reference internal" href="#pyspark.sql.SQLContext" title="pyspark.sql.SQLContext"><code class="xref py py-class docutils literal"><span class="pre">SQLContext</span></code></a>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">people</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s2">&quot;...&quot;</span><span class="p">)</span>
</pre></div>
</div>
<p>Once created, it can be manipulated using the various domain-specific-language
(DSL) functions defined in: <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>, <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>.</p>
<p>To select a column from the data frame, use the apply method:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">ageCol</span> <span class="o">=</span> <span class="n">people</span><span class="o">.</span><span class="n">age</span>
</pre></div>
</div>
<p>A more concrete example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># To create DataFrame using SQLContext</span>
<span class="n">people</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s2">&quot;...&quot;</span><span class="p">)</span>
<span class="n">department</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s2">&quot;...&quot;</span><span class="p">)</span>
<span class="n">people</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">people</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">30</span><span class="p">)</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">department</span><span class="p">,</span> <span class="n">people</span><span class="o">.</span><span class="n">deptId</span> <span class="o">==</span> <span class="n">department</span><span class="o">.</span><span class="n">id</span><span class="p">)</span> \
<span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">department</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s2">&quot;gender&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s2">&quot;salary&quot;</span><span class="p">:</span> <span class="s2">&quot;avg&quot;</span><span class="p">,</span> <span class="s2">&quot;age&quot;</span><span class="p">:</span> <span class="s2">&quot;max&quot;</span><span class="p">})</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrame.agg">
<code class="descname">agg</code><span class="sig-paren">(</span><em>*exprs</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.agg"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.agg" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate on the entire <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> without groups
(shorthand for <code class="docutils literal"><span class="pre">df.groupBy().agg()</span></code>).</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s2">&quot;age&quot;</span><span class="p">:</span> <span class="s2">&quot;max&quot;</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(max(age)=5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(min(age)=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.alias">
<code class="descname">alias</code><span class="sig-paren">(</span><em>alias</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.alias"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.alias" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with an alias set.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df_as1</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;df_as1&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df_as2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;df_as2&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">joined_df</span> <span class="o">=</span> <span class="n">df_as1</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df_as2</span><span class="p">,</span> <span class="n">col</span><span class="p">(</span><span class="s2">&quot;df_as1.name&quot;</span><span class="p">)</span> <span class="o">==</span> <span class="n">col</span><span class="p">(</span><span class="s2">&quot;df_as2.name&quot;</span><span class="p">),</span> <span class="s1">&#39;inner&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">joined_df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&quot;df_as1.name&quot;</span><span class="p">,</span> <span class="s2">&quot;df_as2.name&quot;</span><span class="p">,</span> <span class="s2">&quot;df_as2.age&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Bob&#39;, name=&#39;Bob&#39;, age=5), Row(name=&#39;Alice&#39;, name=&#39;Alice&#39;, age=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.approxQuantile">
<code class="descname">approxQuantile</code><span class="sig-paren">(</span><em>col</em>, <em>probabilities</em>, <em>relativeError</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.approxQuantile"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.approxQuantile" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the approximate quantiles of numerical columns of a
DataFrame.</p>
<p>The result of this algorithm has the following deterministic bound:
If the DataFrame has N elements and if we request the quantile at
probability <cite>p</cite> up to error <cite>err</cite>, then the algorithm will return
a sample <cite>x</cite> from the DataFrame so that the <em>exact</em> rank of <cite>x</cite> is
close to (p * N). More precisely,</p>
<blockquote>
<div>floor((p - err) * N) &lt;= rank(x) &lt;= ceil((p + err) * N).</div></blockquote>
<p>This method implements a variation of the Greenwald-Khanna
algorithm (with some speed optimizations). The algorithm was first
presented in <a class="reference external" href="http://dx.doi.org/10.1145/375663.375670">Space-efficient
Online Computation of Quantile Summaries</a> by Greenwald and Khanna.</p>
<p>Note that null values will be ignored in numerical columns before calculation.
For columns containing only null values, an empty list is returned.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>col</strong> – str, list.
Can be a single column name, or a list of names for multiple columns.</li>
<li><strong>probabilities</strong> – a list of quantile probabilities
Each number must belong to [0, 1].
For example 0 is the minimum, 0.5 is the median, 1 is the maximum.</li>
<li><strong>relativeError</strong> – The relative target precision to achieve
(&gt;= 0). If set to zero, the exact quantiles are computed, which
could be very expensive. Note that values greater than 1 are
accepted but give the same result as 1.</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">the approximate quantiles at the given probabilities. If
the input <cite>col</cite> is a string, the output is a list of floats. If the
input <cite>col</cite> is a list or tuple of strings, the output is also a
list, but each element in it is a list of floats, i.e., the output
is a list of list of floats.</p>
</td>
</tr>
</tbody>
</table>
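<p>A minimal sketch using the two-row <cite>df</cite> from the surrounding examples (ages 2 and 5);
with a relative error of 0.0 the quantiles are exact, and passing a list of column names
(supported since 2.2) returns one list of floats per column:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span># Illustrative sketch: exact median of the 'age' column (relativeError = 0.0).
df.approxQuantile("age", [0.5], 0.0)            # e.g. [2.0] for ages [2, 5]
# Multiple columns: the result is a list of lists of floats.
df.approxQuantile(["age"], [0.25, 0.75], 0.1)
</pre></div>
</div>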
<div class="versionchanged">
<p><span class="versionmodified">Changed in version 2.2: </span>Added support for multiple columns.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.cache">
<code class="descname">cache</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.cache"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cache" title="Permalink to this definition"></a></dt>
<dd><p>Persists the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with the default storage level (<code class="xref py py-class docutils literal"><span class="pre">MEMORY_AND_DISK</span></code>).</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The default storage level has changed to <code class="xref py py-class docutils literal"><span class="pre">MEMORY_AND_DISK</span></code> to match Scala in 2.0.</p>
</div>
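<p>A minimal usage sketch; caching is lazy, so the data is materialized by the first action:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>df.cache()   # marks df for persistence at MEMORY_AND_DISK
df.count()   # the first action computes the data and populates the cache
</pre></div>
</div>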
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.checkpoint">
<code class="descname">checkpoint</code><span class="sig-paren">(</span><em>eager=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.checkpoint"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.checkpoint" title="Permalink to this definition"></a></dt>
<dd><p>Returns a checkpointed version of this DataFrame. Checkpointing can be used to truncate the
logical plan of this DataFrame, which is especially useful in iterative algorithms where the
plan may grow exponentially. It will be saved to files inside the checkpoint
directory set with <code class="xref py py-class docutils literal"><span class="pre">SparkContext.setCheckpointDir()</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>eager</strong> – Whether to checkpoint this DataFrame immediately</td>
</tr>
</tbody>
</table>
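<p>A minimal sketch; the checkpoint directory below is a placeholder path, and a reliable
location (e.g. on HDFS) should be used in practice:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>sc.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder directory
df2 = df.checkpoint()   # with the default eager=True, materializes now and truncates the plan
</pre></div>
</div>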
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.coalesce">
<code class="descname">coalesce</code><span class="sig-paren">(</span><em>numPartitions</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.coalesce"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.coalesce" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> that has exactly <cite>numPartitions</cite> partitions.</p>
<p>Similar to coalesce defined on an <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code>, this operation results in a
narrow dependency, e.g. if you go from 1000 partitions to 100 partitions,
there will not be a shuffle, instead each of the 100 new partitions will
claim 10 of the current partitions. If a larger number of partitions is requested,
it will stay at the current number of partitions.</p>
<p>However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you would like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can call repartition(), as sketched below. This will add a shuffle step, but it means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">coalesce</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">1</span>
</pre></div>
</div>
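<p>By contrast, a sketch of the repartition() alternative mentioned above, which adds a
shuffle so the upstream work still runs in parallel:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>df.repartition(1).rdd.getNumPartitions()   # also 1, but computed via a shuffle
</pre></div>
</div>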
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.collect">
<code class="descname">collect</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.collect"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.collect" title="Permalink to this definition"></a></dt>
<dd><p>Returns all the records as a list of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;), Row(age=5, name=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.columns">
<code class="descname">columns</code><a class="headerlink" href="#pyspark.sql.DataFrame.columns" title="Permalink to this definition"></a></dt>
<dd><p>Returns all column names as a list.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">columns</span>
<span class="go">[&#39;age&#39;, &#39;name&#39;]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.corr">
<code class="descname">corr</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em>, <em>method=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.corr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.corr" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the correlation of two columns of a DataFrame as a double value.
Currently only supports the Pearson Correlation Coefficient.
<a class="reference internal" href="#pyspark.sql.DataFrame.corr" title="pyspark.sql.DataFrame.corr"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.corr()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.corr" title="pyspark.sql.DataFrameStatFunctions.corr"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.corr()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column</li>
<li><strong>col2</strong> – The name of the second column</li>
<li><strong>method</strong> – The correlation method. Currently only supports “pearson”</li>
</ul>
</td>
</tr>
</tbody>
</table>
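<p>A minimal sketch; the DataFrame <cite>df5</cite> and its columns here are hypothetical:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span># Hypothetical two-row DataFrame with numeric columns 'age' and 'height'.
df5 = spark.createDataFrame([(2, 80), (5, 85)], ["age", "height"])
df5.corr("age", "height")   # Pearson coefficient, a float in [-1, 1]; 1.0 here
</pre></div>
</div>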
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.count">
<code class="descname">count</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.count"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.count" title="Permalink to this definition"></a></dt>
<dd><p>Returns the number of rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">2</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.cov">
<code class="descname">cov</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.cov"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cov" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the sample covariance for the given columns, specified by their names, as a
double value. <a class="reference internal" href="#pyspark.sql.DataFrame.cov" title="pyspark.sql.DataFrame.cov"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.cov()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.cov" title="pyspark.sql.DataFrameStatFunctions.cov"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.cov()</span></code></a> are aliases.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column</li>
<li><strong>col2</strong> – The name of the second column</li>
</ul>
</td>
</tr>
</tbody>
</table>
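<p>A minimal sketch, reusing the hypothetical <cite>df5</cite> from the corr() example above:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>df5.cov("age", "height")   # sample covariance as a float; 7.5 for the rows above
</pre></div>
</div>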
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.createGlobalTempView">
<code class="descname">createGlobalTempView</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.createGlobalTempView"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.createGlobalTempView" title="Permalink to this definition"></a></dt>
<dd><p>Creates a global temporary view with this DataFrame.</p>
<p>The lifetime of this temporary view is tied to this Spark application.
Throws <code class="xref py py-class docutils literal"><span class="pre">TempTableAlreadyExistsException</span></code> if the view name already exists in the
catalog.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createGlobalTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;select * from global_temp.people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createGlobalTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="c">...</span>
<span class="gr">AnalysisException</span>: <span class="n">u&quot;Temporary table &#39;people&#39; already exists;&quot;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropGlobalTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.createOrReplaceGlobalTempView">
<code class="descname">createOrReplaceGlobalTempView</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.createOrReplaceGlobalTempView"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.createOrReplaceGlobalTempView" title="Permalink to this definition"></a></dt>
<dd><p>Creates or replaces a global temporary view using the given name.</p>
<p>The lifetime of this temporary view is tied to this Spark application.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceGlobalTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">createOrReplaceGlobalTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;select * from global_temp.people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df3</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropGlobalTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.2.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.createOrReplaceTempView">
<code class="descname">createOrReplaceTempView</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.createOrReplaceTempView"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.createOrReplaceTempView" title="Permalink to this definition"></a></dt>
<dd><p>Creates or replaces a local temporary view with this DataFrame.</p>
<p>The lifetime of this temporary table is tied to the <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a>
that was used to create this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;select * from people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df3</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.createTempView">
<code class="descname">createTempView</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.createTempView"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.createTempView" title="Permalink to this definition"></a></dt>
<dd><p>Creates a local temporary view with this DataFrame.</p>
<p>The lifetime of this temporary table is tied to the <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a>
that was used to create this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.
Throws <code class="xref py py-class docutils literal"><span class="pre">TempTableAlreadyExistsException</span></code> if the view name already exists in the
catalog.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;select * from people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="c">...</span>
<span class="gr">AnalysisException</span>: <span class="n">u&quot;Temporary table &#39;people&#39; already exists;&quot;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.crossJoin">
<code class="descname">crossJoin</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.crossJoin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.crossJoin" title="Permalink to this definition"></a></dt>
<dd><p>Returns the Cartesian product with another <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>other</strong> – Right side of the cartesian product.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;), Row(age=5, name=&#39;Bob&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="s2">&quot;height&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Tom&#39;, height=80), Row(name=&#39;Bob&#39;, height=85)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">crossJoin</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&quot;height&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="s2">&quot;height&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;, height=80), Row(age=2, name=&#39;Alice&#39;, height=85),</span>
<span class="go"> Row(age=5, name=&#39;Bob&#39;, height=80), Row(age=5, name=&#39;Bob&#39;, height=85)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.crosstab">
<code class="descname">crosstab</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.crosstab"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.crosstab" title="Permalink to this definition"></a></dt>
<dd><p>Computes a pairwise frequency table of the given columns, also known as a contingency
table. The number of distinct values for each column should be less than 1e4. At most 1e6
non-zero pair frequencies will be returned.
The first column of each row will be the distinct values of <cite>col1</cite> and the column names
will be the distinct values of <cite>col2</cite>. The name of the first column will be <cite>$col1_$col2</cite>.
Pairs that have no occurrences will have zero as their counts.
<a class="reference internal" href="#pyspark.sql.DataFrame.crosstab" title="pyspark.sql.DataFrame.crosstab"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.crosstab()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.crosstab" title="pyspark.sql.DataFrameStatFunctions.crosstab"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.crosstab()</span></code></a> are aliases.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column. Distinct items will make the first item of
each row.</li>
<li><strong>col2</strong> – The name of the second column. Distinct items will make the column names
of the DataFrame.</li>
</ul>
</td>
</tr>
</tbody>
</table>
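<p>A sketch with a small hypothetical DataFrame; the row order of the output is not guaranteed:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>pairs = spark.createDataFrame([(1, 1), (1, 2), (2, 1)], ["key", "value"])
pairs.crosstab("key", "value").show()
# Expected shape (row order may differ):
# +---------+---+---+
# |key_value|  1|  2|
# +---------+---+---+
# |        1|  1|  1|
# |        2|  1|  0|
# +---------+---+---+
</pre></div>
</div>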
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.cube">
<code class="descname">cube</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.cube"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cube" title="Permalink to this definition"></a></dt>
<dd><p>Create a multi-dimensional cube for the current <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> using
the specified columns, so we can run aggregation on them.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">cube</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="s2">&quot;age&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| name| age|count|</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| null|null| 2|</span>
<span class="go">| null| 2| 1|</span>
<span class="go">| null| 5| 1|</span>
<span class="go">|Alice|null| 1|</span>
<span class="go">|Alice| 2| 1|</span>
<span class="go">| Bob|null| 1|</span>
<span class="go">| Bob| 5| 1|</span>
<span class="go">+-----+----+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.describe">
<code class="descname">describe</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.describe"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.describe" title="Permalink to this definition"></a></dt>
<dd><p>Computes statistics for numeric and string columns.</p>
<p>These include count, mean, stddev, min, and max. If no columns are
given, this function computes statistics for all numerical or string columns.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">([</span><span class="s1">&#39;age&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+------------------+</span>
<span class="go">|summary| age|</span>
<span class="go">+-------+------------------+</span>
<span class="go">| count| 2|</span>
<span class="go">| mean| 3.5|</span>
<span class="go">| stddev|2.1213203435596424|</span>
<span class="go">| min| 2|</span>
<span class="go">| max| 5|</span>
<span class="go">+-------+------------------+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+------------------+-----+</span>
<span class="go">|summary| age| name|</span>
<span class="go">+-------+------------------+-----+</span>
<span class="go">| count| 2| 2|</span>
<span class="go">| mean| 3.5| null|</span>
<span class="go">| stddev|2.1213203435596424| null|</span>
<span class="go">| min| 2|Alice|</span>
<span class="go">| max| 5| Bob|</span>
<span class="go">+-------+------------------+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.distinct">
<code class="descname">distinct</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.distinct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.distinct" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing the distinct rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">distinct</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">2</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.drop">
<code class="descname">drop</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.drop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.drop" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> that drops the specified column(s).
This is a no-op if the schema doesn’t contain the given column name(s).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – a string name of the column to drop, or a
<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> to drop, or a list of string name of the columns to drop.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;), Row(name=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;), Row(name=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s1">&#39;inner&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, height=85, name=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s1">&#39;inner&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;, height=85)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;inner&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;height&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.dropDuplicates">
<code class="descname">dropDuplicates</code><span class="sig-paren">(</span><em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.dropDuplicates"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.dropDuplicates" title="Permalink to this definition"></a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with duplicate rows removed,
optionally only considering certain columns.</p>
<p>For a static batch <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>, it just drops duplicate rows. For a streaming
<a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>, it will keep all data across triggers as intermediate state to drop
duplicate rows. You can use <a class="reference internal" href="#pyspark.sql.DataFrame.withWatermark" title="pyspark.sql.DataFrame.withWatermark"><code class="xref py py-func docutils literal"><span class="pre">withWatermark()</span></code></a> to limit how late the duplicate data can
be, and the system will accordingly limit the state (a streaming sketch follows the examples below).
In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.drop_duplicates" title="pyspark.sql.DataFrame.drop_duplicates"><code class="xref py py-func docutils literal"><span class="pre">drop_duplicates()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.dropDuplicates" title="pyspark.sql.DataFrame.dropDuplicates"><code class="xref py py-func docutils literal"><span class="pre">dropDuplicates()</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span> \
<span class="gp">... </span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> \
<span class="gp">... </span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> \
<span class="gp">... </span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 5| 80|Alice|</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">([</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;height&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 5| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
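<p>A sketch of the streaming case described above; <cite>events</cite> is a hypothetical streaming
DataFrame with an event-time column:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span># Assumes `events` is a streaming DataFrame with columns 'id' and 'eventTime'.
deduped = events.withWatermark("eventTime", "10 minutes") \
                .dropDuplicates(["id", "eventTime"])
# Deduplication state for rows older than the watermark can be dropped by the engine.
</pre></div>
</div>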
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.drop_duplicates">
<code class="descname">drop_duplicates</code><span class="sig-paren">(</span><em>subset=None</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.drop_duplicates" title="Permalink to this definition"></a></dt>
<dd><p><a class="reference internal" href="#pyspark.sql.DataFrame.drop_duplicates" title="pyspark.sql.DataFrame.drop_duplicates"><code class="xref py py-func docutils literal"><span class="pre">drop_duplicates()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.dropDuplicates" title="pyspark.sql.DataFrame.dropDuplicates"><code class="xref py py-func docutils literal"><span class="pre">dropDuplicates()</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.dropna">
<code class="descname">dropna</code><span class="sig-paren">(</span><em>how='any'</em>, <em>thresh=None</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.dropna"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.dropna" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> omitting rows with null values.
<a class="reference internal" href="#pyspark.sql.DataFrame.dropna" title="pyspark.sql.DataFrame.dropna"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.dropna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.drop" title="pyspark.sql.DataFrameNaFunctions.drop"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.drop()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>how</strong> – ‘any’ or ‘all’.
If ‘any’, drop a row if it contains any nulls.
If ‘all’, drop a row only if all its values are null.</li>
<li><strong>thresh</strong> – int, default None.
If specified, drop rows that have fewer than <cite>thresh</cite> non-null values.
This overrides the <cite>how</cite> parameter.</li>
<li><strong>subset</strong> – optional list of column names to consider.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">drop</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.dtypes">
<code class="descname">dtypes</code><a class="headerlink" href="#pyspark.sql.DataFrame.dtypes" title="Permalink to this definition"></a></dt>
<dd><p>Returns all column names and their data types as a list.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;age&#39;, &#39;int&#39;), (&#39;name&#39;, &#39;string&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.explain">
<code class="descname">explain</code><span class="sig-paren">(</span><em>extended=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.explain"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.explain" title="Permalink to this definition"></a></dt>
<dd><p>Prints the (logical and physical) plans to the console for debugging purposes.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>extended</strong> – boolean, default <code class="docutils literal"><span class="pre">False</span></code>. If <code class="docutils literal"><span class="pre">False</span></code>, prints only the physical plan.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">explain</span><span class="p">()</span>
<span class="go">== Physical Plan ==</span>
<span class="go">Scan ExistingRDD[age#0,name#1]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">explain</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span>
<span class="go">== Parsed Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Analyzed Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Optimized Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Physical Plan ==</span>
<span class="gp">...</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.fillna">
<code class="descname">fillna</code><span class="sig-paren">(</span><em>value</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.fillna"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.fillna" title="Permalink to this definition"></a></dt>
<dd><p>Replace null values, alias for <code class="docutils literal"><span class="pre">na.fill()</span></code>.
<a class="reference internal" href="#pyspark.sql.DataFrame.fillna" title="pyspark.sql.DataFrame.fillna"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.fillna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.fill" title="pyspark.sql.DataFrameNaFunctions.fill"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.fill()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>value</strong> – int, long, float, string, or dict.
Value to replace null values with.
If the value is a dict, then <cite>subset</cite> is ignored and <cite>value</cite> must be a mapping
from column name (string) to replacement value. The replacement value must be
an int, long, float, boolean, or string.</li>
<li><strong>subset</strong> – optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if <cite>value</cite> is a string, and subset contains a non-string column,
then the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">| 5| 50| Bob|</span>
<span class="go">| 50| 50| Tom|</span>
<span class="go">| 50| 50| null|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">({</span><span class="s1">&#39;age&#39;</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">:</span> <span class="s1">&#39;unknown&#39;</span><span class="p">})</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-------+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-------+</span>
<span class="go">| 10| 80| Alice|</span>
<span class="go">| 5| null| Bob|</span>
<span class="go">| 50| null| Tom|</span>
<span class="go">| 50| null|unknown|</span>
<span class="go">+---+------+-------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.filter">
<code class="descname">filter</code><span class="sig-paren">(</span><em>condition</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.filter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.filter" title="Permalink to this definition"></a></dt>
<dd><p>Filters rows using the given condition.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.where" title="pyspark.sql.DataFrame.where"><code class="xref py py-func docutils literal"><span class="pre">where()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.filter" title="pyspark.sql.DataFrame.filter"><code class="xref py py-func docutils literal"><span class="pre">filter()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>condition</strong> – a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> of <a class="reference internal" href="#pyspark.sql.types.BooleanType" title="pyspark.sql.types.BooleanType"><code class="xref py py-class docutils literal"><span class="pre">types.BooleanType</span></code></a>
or a SQL expression given as a string.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="s2">&quot;age &gt; 3&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s2">&quot;age = 2&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.first">
<code class="descname">first</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.first"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.first" title="Permalink to this definition"></a></dt>
<dd><p>Returns the first row as a <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">Row(age=2, name=&#39;Alice&#39;)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.foreach">
<code class="descname">foreach</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.foreach"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.foreach" title="Permalink to this definition"></a></dt>
<dd><p>Applies the <code class="docutils literal"><span class="pre">f</span></code> function to all <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a> of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>This is a shorthand for <code class="docutils literal"><span class="pre">df.rdd.foreach()</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">person</span><span class="p">):</span>
<span class="gp">... </span> <span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">foreach</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.foreachPartition">
<code class="descname">foreachPartition</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.foreachPartition"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.foreachPartition" title="Permalink to this definition"></a></dt>
<dd><p>Applies the <code class="docutils literal"><span class="pre">f</span></code> function to each partition of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>This a shorthand for <code class="docutils literal"><span class="pre">df.rdd.foreachPartition()</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">people</span><span class="p">):</span>
<span class="gp">... </span> <span class="k">for</span> <span class="n">person</span> <span class="ow">in</span> <span class="n">people</span><span class="p">:</span>
<span class="gp">... </span> <span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">foreachPartition</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.freqItems">
<code class="descname">freqItems</code><span class="sig-paren">(</span><em>cols</em>, <em>support=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.freqItems"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.freqItems" title="Permalink to this definition"></a></dt>
<dd><p>Finding frequent items for columns, possibly with false positives. Using the
frequent element count algorithm described in
<a class="reference external" href="http://dx.doi.org/10.1145/762471.762473">http://dx.doi.org/10.1145/762471.762473</a>, proposed by Karp, Schenker, and Papadimitriou”.
<a class="reference internal" href="#pyspark.sql.DataFrame.freqItems" title="pyspark.sql.DataFrame.freqItems"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.freqItems()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.freqItems" title="pyspark.sql.DataFrameStatFunctions.freqItems"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.freqItems()</span></code></a> are aliases.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – Names of the columns to calculate frequent items for as a list or tuple of
strings.</li>
<li><strong>support</strong> – The frequency with which to consider an item ‘frequent’. Default is 1%.
The support must be greater than 1e-4.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.groupBy">
<code class="descname">groupBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.groupBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.groupBy" title="Permalink to this definition"></a></dt>
<dd><p>Groups the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> using the specified columns,
so we can run aggregation on them. See <a class="reference internal" href="#pyspark.sql.GroupedData" title="pyspark.sql.GroupedData"><code class="xref py py-class docutils literal"><span class="pre">GroupedData</span></code></a>
for all the available aggregate functions.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.groupby" title="pyspark.sql.DataFrame.groupby"><code class="xref py py-func docutils literal"><span class="pre">groupby()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal"><span class="pre">groupBy()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of columns to group by.
Each element should be a column name (string) or an expression (<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>).</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">avg</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">&#39;age&#39;</span><span class="p">:</span> <span class="s1">&#39;mean&#39;</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(name=&#39;Alice&#39;, avg(age)=2.0), Row(name=&#39;Bob&#39;, avg(age)=5.0)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">avg</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(name=&#39;Alice&#39;, avg(age)=2.0), Row(name=&#39;Bob&#39;, avg(age)=5.0)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">([</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">])</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=2, count=1), Row(name=&#39;Bob&#39;, age=5, count=1)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.groupby">
<code class="descname">groupby</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.groupby" title="Permalink to this definition"></a></dt>
<dd><p><a class="reference internal" href="#pyspark.sql.DataFrame.groupby" title="pyspark.sql.DataFrame.groupby"><code class="xref py py-func docutils literal"><span class="pre">groupby()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal"><span class="pre">groupBy()</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.head">
<code class="descname">head</code><span class="sig-paren">(</span><em>n=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.head"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.head" title="Permalink to this definition"></a></dt>
<dd><p>Returns the first <code class="docutils literal"><span class="pre">n</span></code> rows.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This method should only be used if the resulting array is expected
to be small, as all the data is loaded into the driver’s memory.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>n</strong> – int, default 1. Number of rows to return.</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">If n is greater than 1, return a list of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.
If n is 1, return a single Row.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="go">Row(age=2, name=&#39;Alice&#39;)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.hint">
<code class="descname">hint</code><span class="sig-paren">(</span><em>name</em>, <em>*parameters</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.hint"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.hint" title="Permalink to this definition"></a></dt>
<dd><p>Specifies some hint on the current DataFrame.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>name</strong> – A name of the hint.</li>
<li><strong>parameters</strong> – Optional parameters.</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">hint</span><span class="p">(</span><span class="s2">&quot;broadcast&quot;</span><span class="p">),</span> <span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+---+------+</span>
<span class="go">|name|age|height|</span>
<span class="go">+----+---+------+</span>
<span class="go">| Bob| 5| 85|</span>
<span class="go">+----+---+------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.2.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.intersect">
<code class="descname">intersect</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.intersect"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.intersect" title="Permalink to this definition"></a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing rows only in
both this frame and another frame.</p>
<p>This is equivalent to <cite>INTERSECT</cite> in SQL.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.isLocal">
<code class="descname">isLocal</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.isLocal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.isLocal" title="Permalink to this definition"></a></dt>
<dd><p>Returns <code class="docutils literal"><span class="pre">True</span></code> if the <a class="reference internal" href="#pyspark.sql.DataFrame.collect" title="pyspark.sql.DataFrame.collect"><code class="xref py py-func docutils literal"><span class="pre">collect()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrame.take" title="pyspark.sql.DataFrame.take"><code class="xref py py-func docutils literal"><span class="pre">take()</span></code></a> methods can be run locally
(without any Spark executors).</p>
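<p>An illustrative call; the result depends on how the DataFrame was built, and a frame backed by a distributed RDD, like the example <code class="docutils literal"><span class="pre">df</span></code>, is typically not local:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df.isLocal()
<span class="go">False</span>
</pre></div>
</div>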
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.isStreaming">
<code class="descname">isStreaming</code><a class="headerlink" href="#pyspark.sql.DataFrame.isStreaming" title="Permalink to this definition"></a></dt>
<dd><p>Returns true if this <code class="xref py py-class docutils literal"><span class="pre">Dataset</span></code> contains one or more sources that continuously
return data as it arrives. A <code class="xref py py-class docutils literal"><span class="pre">Dataset</span></code> that reads data from a streaming source
must be executed as a <code class="xref py py-class docutils literal"><span class="pre">StreamingQuery</span></code> using the <code class="xref py py-func docutils literal"><span class="pre">start()</span></code> method in
<code class="xref py py-class docutils literal"><span class="pre">DataStreamWriter</span></code>. Methods that return a single answer, (e.g., <a class="reference internal" href="#pyspark.sql.DataFrame.count" title="pyspark.sql.DataFrame.count"><code class="xref py py-func docutils literal"><span class="pre">count()</span></code></a> or
<a class="reference internal" href="#pyspark.sql.DataFrame.collect" title="pyspark.sql.DataFrame.collect"><code class="xref py py-func docutils literal"><span class="pre">collect()</span></code></a>) will throw an <code class="xref py py-class docutils literal"><span class="pre">AnalysisException</span></code> when there is a streaming
source present.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.join">
<code class="descname">join</code><span class="sig-paren">(</span><em>other</em>, <em>on=None</em>, <em>how=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.join"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.join" title="Permalink to this definition"></a></dt>
<dd><p>Joins with another <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>, using the given join expression.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>other</strong> – Right side of the join</li>
<li><strong>on</strong> – a string for the join column name, a list of column names,
a join expression (Column), or a list of Columns.
If <cite>on</cite> is a string or a list of strings indicating the name of the join column(s),
the column(s) must exist on both sides, and this performs an equi-join.</li>
<li><strong>how</strong> – str, default <code class="docutils literal"><span class="pre">inner</span></code>. Must be one of: <code class="docutils literal"><span class="pre">inner</span></code>, <code class="docutils literal"><span class="pre">cross</span></code>, <code class="docutils literal"><span class="pre">outer</span></code>,
<code class="docutils literal"><span class="pre">full</span></code>, <code class="docutils literal"><span class="pre">full_outer</span></code>, <code class="docutils literal"><span class="pre">left</span></code>, <code class="docutils literal"><span class="pre">left_outer</span></code>, <code class="docutils literal"><span class="pre">right</span></code>, <code class="docutils literal"><span class="pre">right_outer</span></code>,
<code class="docutils literal"><span class="pre">left_semi</span></code>, and <code class="docutils literal"><span class="pre">left_anti</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<p>The following performs a full outer join between <code class="docutils literal"><span class="pre">df1</span></code> and <code class="docutils literal"><span class="pre">df2</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s1">&#39;outer&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df2</span><span class="o">.</span><span class="n">height</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=None, height=80), Row(name=&#39;Bob&#39;, height=85), Row(name=&#39;Alice&#39;, height=None)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;outer&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;height&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Tom&#39;, height=80), Row(name=&#39;Bob&#39;, height=85), Row(name=&#39;Alice&#39;, height=None)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">cond</span> <span class="o">=</span> <span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df3</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="n">df3</span><span class="o">.</span><span class="n">age</span><span class="p">]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df3</span><span class="p">,</span> <span class="n">cond</span><span class="p">,</span> <span class="s1">&#39;outer&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df3</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=2), Row(name=&#39;Bob&#39;, age=5)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df2</span><span class="o">.</span><span class="n">height</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Bob&#39;, height=85)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df4</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Bob&#39;, age=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.limit">
<code class="descname">limit</code><span class="sig-paren">(</span><em>num</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.limit"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.limit" title="Permalink to this definition"></a></dt>
<dd><p>Limits the result count to the number specified.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">limit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">limit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.na">
<code class="descname">na</code><a class="headerlink" href="#pyspark.sql.DataFrame.na" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions" title="pyspark.sql.DataFrameNaFunctions"><code class="xref py py-class docutils literal"><span class="pre">DataFrameNaFunctions</span></code></a> for handling missing values.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.orderBy">
<code class="descname">orderBy</code><span class="sig-paren">(</span><em>*cols</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.orderBy" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> sorted by the specified column(s).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> or column names to sort by.</li>
<li><strong>ascending</strong> – boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, length of the list must equal length of the <cite>cols</cite>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">asc</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;), Row(age=5, name=&#39;Bob&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">),</span> <span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">([</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="s2">&quot;name&quot;</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.persist">
<code class="descname">persist</code><span class="sig-paren">(</span><em>storageLevel=StorageLevel(True</em>, <em>True</em>, <em>False</em>, <em>False</em>, <em>1)</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.persist"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.persist" title="Permalink to this definition"></a></dt>
<dd><p>Sets the storage level to persist the contents of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> across
operations after the first time it is computed. This can only be used to assign
a new storage level if the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> does not have a storage level set yet.
If no storage level is specified, it defaults to <code class="xref py py-class docutils literal"><span class="pre">MEMORY_AND_DISK</span></code>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The default storage level has changed to <code class="xref py py-class docutils literal"><span class="pre">MEMORY_AND_DISK</span></code> to match Scala in 2.0.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.printSchema">
<code class="descname">printSchema</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.printSchema"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.printSchema" title="Permalink to this definition"></a></dt>
<dd><p>Prints out the schema in the tree format.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
<span class="go">root</span>
<span class="go"> |-- age: integer (nullable = true)</span>
<span class="go"> |-- name: string (nullable = true)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.randomSplit">
<code class="descname">randomSplit</code><span class="sig-paren">(</span><em>weights</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.randomSplit"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.randomSplit" title="Permalink to this definition"></a></dt>
<dd><p>Randomly splits this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with the provided weights.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>weights</strong> – list of doubles as weights with which to split the DataFrame. Weights will
be normalized if they don’t sum up to 1.0.</li>
<li><strong>seed</strong> – The seed for sampling.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">splits</span> <span class="o">=</span> <span class="n">df4</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span> <span class="mi">24</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">splits</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">1</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">splits</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">3</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.rdd">
<code class="descname">rdd</code><a class="headerlink" href="#pyspark.sql.DataFrame.rdd" title="Permalink to this definition"></a></dt>
<dd><p>Returns the content as an <a class="reference internal" href="pyspark.html#pyspark.RDD" title="pyspark.RDD"><code class="xref py py-class docutils literal"><span class="pre">pyspark.RDD</span></code></a> of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.registerTempTable">
<code class="descname">registerTempTable</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.registerTempTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.registerTempTable" title="Permalink to this definition"></a></dt>
<dd><p>Registers this RDD as a temporary table using the given name.</p>
<p>The lifetime of this temporary table is tied to the <a class="reference internal" href="#pyspark.sql.SQLContext" title="pyspark.sql.SQLContext"><code class="xref py py-class docutils literal"><span class="pre">SQLContext</span></code></a>
that was used to create this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">&quot;select * from people&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropTempView</span><span class="p">(</span><span class="s2">&quot;people&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 2.0, use createOrReplaceTempView instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.repartition">
<code class="descname">repartition</code><span class="sig-paren">(</span><em>numPartitions</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.repartition"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.repartition" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> partitioned by the given partitioning expressions. The
resulting DataFrame is hash partitioned.</p>
<p><code class="docutils literal"><span class="pre">numPartitions</span></code> can be an int to specify the target number of partitions or a Column.
If it is a Column, it will be used as the first partitioning column. If not specified,
the default number of partitions is used.</p>
<div class="versionchanged">
<p><span class="versionmodified">Changed in version 1.6: </span>Added optional arguments to specify the partitioning columns. Also made numPartitions
optional if partitioning columns are specified.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">| 5| Bob|</span>
<span class="go">| 5| Bob|</span>
<span class="go">| 2|Alice|</span>
<span class="go">| 2|Alice|</span>
<span class="go">+---+-----+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="s2">&quot;age&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">| 2|Alice|</span>
<span class="go">| 5| Bob|</span>
<span class="go">| 2|Alice|</span>
<span class="go">| 5| Bob|</span>
<span class="go">+---+-----+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">7</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="s2">&quot;age&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">| 5| Bob|</span>
<span class="go">| 5| Bob|</span>
<span class="go">| 2|Alice|</span>
<span class="go">| 2|Alice|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.replace">
<code class="descname">replace</code><span class="sig-paren">(</span><em>to_replace</em>, <em>value=None</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.replace"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.replace" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> replacing a value with another value.
<a class="reference internal" href="#pyspark.sql.DataFrame.replace" title="pyspark.sql.DataFrame.replace"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.replace()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.replace" title="pyspark.sql.DataFrameNaFunctions.replace"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.replace()</span></code></a> are
aliases of each other.
Values to_replace and value should contain either all numerics, all booleans,
or all strings. When replacing, the new value will be cast
to the type of the existing column.
For numeric replacements, all values to be replaced should have a unique
floating-point representation. In case of conflicts (for example, with <cite>{42: -1, 42.0: 1}</cite>)
an arbitrary replacement will be used.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>to_replace</strong> – bool, int, long, float, string, list or dict.
Value to be replaced.
If the value is a dict, then <cite>value</cite> is ignored and <cite>to_replace</cite> must be a
mapping between a value and a replacement.</li>
<li><strong>value</strong> – int, long, float, string, or list.
The replacement value must be an int, long, float, or string. If <cite>value</cite> is a
list, <cite>value</cite> should be of the same length and type as <cite>to_replace</cite>.
If <cite>value</cite> is a scalar and <cite>to_replace</cite> is a sequence, then <cite>value</cite> is
used as a replacement for each item in <cite>to_replace</cite>.</li>
<li><strong>subset</strong> – optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if <cite>value</cite> is a string, and subset contains a non-string column,
then the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+-----+</span>
<span class="go">| age|height| name|</span>
<span class="go">+----+------+-----+</span>
<span class="go">| 20| 80|Alice|</span>
<span class="go">| 5| null| Bob|</span>
<span class="go">|null| null| Tom|</span>
<span class="go">|null| null| null|</span>
<span class="go">+----+------+-----+</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">([</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="s1">&#39;Bob&#39;</span><span class="p">],</span> <span class="p">[</span><span class="s1">&#39;A&#39;</span><span class="p">,</span> <span class="s1">&#39;B&#39;</span><span class="p">],</span> <span class="s1">&#39;name&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+----+</span>
<span class="go">| age|height|name|</span>
<span class="go">+----+------+----+</span>
<span class="go">| 10| 80| A|</span>
<span class="go">| 5| null| B|</span>
<span class="go">|null| null| Tom|</span>
<span class="go">|null| null|null|</span>
<span class="go">+----+------+----+</span>
</pre></div>
</div>
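<p>When <cite>to_replace</cite> is a dict, <cite>value</cite> is ignored and each key is replaced by its
mapped value. A minimal sketch, assuming the same <cite>df4</cite> as above; this should be
equivalent to the list form in the previous example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df4.na.replace({&#39;Alice&#39;: &#39;A&#39;, &#39;Bob&#39;: &#39;B&#39;}, subset=&#39;name&#39;).show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|   A|
|   5|  null|   B|
|null|  null| Tom|
|null|  null|null|
+----+------+----+
</pre></div>
</div>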
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.rollup">
<code class="descname">rollup</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.rollup"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.rollup" title="Permalink to this definition"></a></dt>
<dd><p>Create a multi-dimensional rollup for the current <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> using
the specified columns, so we can run aggregation on them.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">rollup</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="s2">&quot;age&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| name| age|count|</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| null|null| 2|</span>
<span class="go">|Alice|null| 1|</span>
<span class="go">|Alice| 2| 1|</span>
<span class="go">| Bob|null| 1|</span>
<span class="go">| Bob| 5| 1|</span>
<span class="go">+-----+----+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.sample">
<code class="descname">sample</code><span class="sig-paren">(</span><em>withReplacement</em>, <em>fraction</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.sample"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sample" title="Permalink to this definition"></a></dt>
<dd><p>Returns a sampled subset of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This is not guaranteed to provide exactly the fraction specified of the total
count of the given <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="kc">False</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mi">42</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">2</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.sampleBy">
<code class="descname">sampleBy</code><span class="sig-paren">(</span><em>col</em>, <em>fractions</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.sampleBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sampleBy" title="Permalink to this definition"></a></dt>
<dd><p>Returns a stratified sample without replacement based on the
fraction given on each stratum.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>col</strong> – column that defines strata</li>
<li><strong>fractions</strong> – sampling fraction for each stratum. If a stratum is not
specified, we treat its fraction as zero.</li>
<li><strong>seed</strong> – random seed</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">a new DataFrame that represents the stratified sample</p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="n">col</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">dataset</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">((</span><span class="n">col</span><span class="p">(</span><span class="s2">&quot;id&quot;</span><span class="p">)</span> <span class="o">%</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sampled</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">sampleBy</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="n">fractions</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">},</span> <span class="n">seed</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sampled</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|key|count|</span>
<span class="go">+---+-----+</span>
<span class="go">| 0| 5|</span>
<span class="go">| 1| 9|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.schema">
<code class="descname">schema</code><a class="headerlink" href="#pyspark.sql.DataFrame.schema" title="Permalink to this definition"></a></dt>
<dd><p>Returns the schema of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as a <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">schema</span>
<span class="go">StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.select">
<code class="descname">select</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.select"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.select" title="Permalink to this definition"></a></dt>
<dd><p>Projects a set of expressions and returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string) or expressions (<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>).
If one of the column names is ‘*’, that column is expanded to include all columns
in the current DataFrame.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;), Row(age=5, name=&#39;Bob&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=2), Row(name=&#39;Bob&#39;, age=5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=&#39;Alice&#39;, age=12), Row(name=&#39;Bob&#39;, age=15)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.selectExpr">
<code class="descname">selectExpr</code><span class="sig-paren">(</span><em>*expr</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.selectExpr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.selectExpr" title="Permalink to this definition"></a></dt>
<dd><p>Projects a set of SQL expressions and returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>This is a variant of <a class="reference internal" href="#pyspark.sql.DataFrame.select" title="pyspark.sql.DataFrame.select"><code class="xref py py-func docutils literal"><span class="pre">select()</span></code></a> that accepts SQL expressions.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s2">&quot;age * 2&quot;</span><span class="p">,</span> <span class="s2">&quot;abs(age)&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.show">
<code class="descname">show</code><span class="sig-paren">(</span><em>n=20</em>, <em>truncate=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.show"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.show" title="Permalink to this definition"></a></dt>
<dd><p>Prints the first <code class="docutils literal"><span class="pre">n</span></code> rows to the console.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>n</strong> – Number of rows to show.</li>
<li><strong>truncate</strong> – If set to True, truncates strings longer than 20 characters by default.
If set to a number greater than one, truncates long strings to length <code class="docutils literal"><span class="pre">truncate</span></code>
and aligns cells right.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span>
<span class="go">DataFrame[age: int, name: string]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">| 2|Alice|</span>
<span class="go">| 5| Bob|</span>
<span class="go">+---+-----+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">truncate</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="go">+---+----+</span>
<span class="go">|age|name|</span>
<span class="go">+---+----+</span>
<span class="go">| 2| Ali|</span>
<span class="go">| 5| Bob|</span>
<span class="go">+---+----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.sort">
<code class="descname">sort</code><span class="sig-paren">(</span><em>*cols</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.sort"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sort" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> sorted by the specified column(s).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> or column names to sort by.</li>
<li><strong>ascending</strong> – boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, the length of the list must equal the length of <cite>cols</cite>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">asc</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;), Row(age=5, name=&#39;Bob&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">),</span> <span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">([</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="s2">&quot;name&quot;</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;), Row(age=2, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.sortWithinPartitions">
<code class="descname">sortWithinPartitions</code><span class="sig-paren">(</span><em>*cols</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.sortWithinPartitions"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sortWithinPartitions" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with each partition sorted by the specified column(s).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> or column names to sort by.</li>
<li><strong>ascending</strong> – boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, the length of the list must equal the length of <cite>cols</cite>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sortWithinPartitions</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">| 2|Alice|</span>
<span class="go">| 5| Bob|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.stat">
<code class="descname">stat</code><a class="headerlink" href="#pyspark.sql.DataFrame.stat" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions" title="pyspark.sql.DataFrameStatFunctions"><code class="xref py py-class docutils literal"><span class="pre">DataFrameStatFunctions</span></code></a> for statistic functions.</p>
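<p>A minimal sketch, assuming <cite>df3</cite> with numeric <cite>age</cite> and <cite>height</cite> columns as in the
<cite>avg()</cite> examples below; with only two increasing points, the Pearson correlation is exactly 1.0:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df3.stat.corr(&#39;age&#39;, &#39;height&#39;)
1.0
</pre></div>
</div>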
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.storageLevel">
<code class="descname">storageLevel</code><a class="headerlink" href="#pyspark.sql.DataFrame.storageLevel" title="Permalink to this definition"></a></dt>
<dd><p>Get the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>’s current storage level.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">storageLevel</span>
<span class="go">StorageLevel(False, False, False, False, 1)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">cache</span><span class="p">()</span><span class="o">.</span><span class="n">storageLevel</span>
<span class="go">StorageLevel(True, True, False, True, 1)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">persist</span><span class="p">(</span><span class="n">StorageLevel</span><span class="o">.</span><span class="n">DISK_ONLY_2</span><span class="p">)</span><span class="o">.</span><span class="n">storageLevel</span>
<span class="go">StorageLevel(True, False, False, False, 2)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.subtract">
<code class="descname">subtract</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.subtract"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.subtract" title="Permalink to this definition"></a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing rows in this frame
but not in another frame.</p>
<p>This is equivalent to <cite>EXCEPT</cite> in SQL.</p>
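<p>A minimal sketch using the <cite>df</cite> from the examples above: subtracting the rows with
<cite>age &gt; 3</cite> (Bob&#8217;s row) leaves only Alice&#8217;s.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.subtract(df.filter(df.age &gt; 3)).collect()
[Row(age=2, name=&#39;Alice&#39;)]
</pre></div>
</div>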
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.take">
<code class="descname">take</code><span class="sig-paren">(</span><em>num</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.take"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.take" title="Permalink to this definition"></a></dt>
<dd><p>Returns the first <code class="docutils literal"><span class="pre">num</span></code> rows as a <code class="xref py py-class docutils literal"><span class="pre">list</span></code> of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;), Row(age=5, name=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.toDF">
<code class="descname">toDF</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.toDF"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toDF" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with the specified column names.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of new column names (string)</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">toDF</span><span class="p">(</span><span class="s1">&#39;f1&#39;</span><span class="p">,</span> <span class="s1">&#39;f2&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(f1=2, f2=&#39;Alice&#39;), Row(f1=5, f2=&#39;Bob&#39;)]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.toJSON">
<code class="descname">toJSON</code><span class="sig-paren">(</span><em>use_unicode=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.toJSON"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toJSON" title="Permalink to this definition"></a></dt>
<dd><p>Converts a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> into an <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code> of strings.</p>
<p>Each row is turned into a JSON document as one element in the returned RDD.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">toJSON</span><span class="p">()</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">&#39;{&quot;age&quot;:2,&quot;name&quot;:&quot;Alice&quot;}&#39;</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.toLocalIterator">
<code class="descname">toLocalIterator</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.toLocalIterator"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toLocalIterator" title="Permalink to this definition"></a></dt>
<dd><p>Returns an iterator that contains all of the rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.
The iterator will consume as much memory as the largest partition in this DataFrame.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">toLocalIterator</span><span class="p">())</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;), Row(age=5, name=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.toPandas">
<code class="descname">toPandas</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.toPandas"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toPandas" title="Permalink to this definition"></a></dt>
<dd><p>Returns the contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as a Pandas <code class="docutils literal"><span class="pre">pandas.DataFrame</span></code>.</p>
<p>This is only available if Pandas is installed.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This method should only be used if the resulting Pandas’s DataFrame is expected
to be small, as all the data is loaded into the driver’s memory.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">toPandas</span><span class="p">()</span>
<span class="go"> age name</span>
<span class="go">0 2 Alice</span>
<span class="go">1 5 Bob</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.union">
<code class="descname">union</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.union"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.union" title="Permalink to this definition"></a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing union of rows in this and another frame.</p>
<p>This is equivalent to <cite>UNION ALL</cite> in SQL. To do a SQL-style set union
(that deduplicates elements), use this function followed by <a class="reference internal" href="#pyspark.sql.DataFrame.distinct" title="pyspark.sql.DataFrame.distinct"><code class="xref py py-func docutils literal"><span class="pre">distinct()</span></code></a>.</p>
<p>Also as standard in SQL, this function resolves columns by position (not by name).</p>
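<p>A minimal sketch using the <cite>df</cite> from the examples above: the union keeps duplicate
rows, and a following <cite>distinct()</cite> removes them.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.union(df).count()
4
&gt;&gt;&gt; df.union(df).distinct().count()
2
</pre></div>
</div>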
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.unionAll">
<code class="descname">unionAll</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.unionAll"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.unionAll" title="Permalink to this definition"></a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing union of rows in this and another frame.</p>
<p>This is equivalent to <cite>UNION ALL</cite> in SQL. To do a SQL-style set union
(that deduplicates elements), use this function followed by <a class="reference internal" href="#pyspark.sql.DataFrame.distinct" title="pyspark.sql.DataFrame.distinct"><code class="xref py py-func docutils literal"><span class="pre">distinct()</span></code></a>.</p>
<p>Also as standard in SQL, this function resolves columns by position (not by name).</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 2.0, use union instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.unpersist">
<code class="descname">unpersist</code><span class="sig-paren">(</span><em>blocking=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.unpersist"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.unpersist" title="Permalink to this definition"></a></dt>
<dd><p>Marks the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as non-persistent, and remove all blocks for it from
memory and disk.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last"><cite>blocking</cite> default has changed to False to match Scala in 2.0.</p>
</div>
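<p>A minimal sketch: <cite>unpersist()</cite> returns the <cite>DataFrame</cite> itself, so it can be chained.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.persist().unpersist()
DataFrame[age: int, name: string]
</pre></div>
</div>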
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.where">
<code class="descname">where</code><span class="sig-paren">(</span><em>condition</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.where" title="Permalink to this definition"></a></dt>
<dd><p><a class="reference internal" href="#pyspark.sql.DataFrame.where" title="pyspark.sql.DataFrame.where"><code class="xref py py-func docutils literal"><span class="pre">where()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.filter" title="pyspark.sql.DataFrame.filter"><code class="xref py py-func docutils literal"><span class="pre">filter()</span></code></a>.</p>
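<p>A minimal sketch using the <cite>df</cite> from the examples above:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.where(df.age == 2).collect()
[Row(age=2, name=&#39;Alice&#39;)]
</pre></div>
</div>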
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.withColumn">
<code class="descname">withColumn</code><span class="sig-paren">(</span><em>colName</em>, <em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.withColumn"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.withColumn" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> by adding a column or replacing the
existing column that has the same name.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>colName</strong> – string, name of the new column.</li>
<li><strong>col</strong> – a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expression for the new column.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s1">&#39;age2&#39;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;, age2=4), Row(age=5, name=&#39;Bob&#39;, age2=7)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.withColumnRenamed">
<code class="descname">withColumnRenamed</code><span class="sig-paren">(</span><em>existing</em>, <em>new</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.withColumnRenamed"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.withColumnRenamed" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> by renaming an existing column.
This is a no-op if the schema doesn’t contain the given column name.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>existing</strong> – string, name of the existing column to rename.</li>
<li><strong>new</strong> – string, new name of the column.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">withColumnRenamed</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;age2&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age2=2, name=&#39;Alice&#39;), Row(age2=5, name=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.withWatermark">
<code class="descname">withWatermark</code><span class="sig-paren">(</span><em>eventTime</em>, <em>delayThreshold</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.withWatermark"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.withWatermark" title="Permalink to this definition"></a></dt>
<dd><p>Defines an event time watermark for this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>. A watermark tracks a point
in time before which we assume no more late data is going to arrive.</p>
<dl class="docutils">
<dt>Spark will use this watermark for several purposes:</dt>
<dd><ul class="first last simple">
<li>To know when a given time window aggregation can be finalized and thus can be emitted
when using output modes that do not allow updates.</li>
<li>To minimize the amount of state that we need to keep for on-going aggregations.</li>
</ul>
</dd>
</dl>
<p>The current watermark is computed by looking at the <cite>MAX(eventTime)</cite> seen across
all of the partitions in the query minus a user specified <cite>delayThreshold</cite>. Due to the cost
of coordinating this value across partitions, the actual watermark used is only guaranteed
to be at least <cite>delayThreshold</cite> behind the actual event time. In some cases we may still
process records that arrive more than <cite>delayThreshold</cite> late.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>eventTime</strong> – the name of the column that contains the event time of the row.</li>
<li><strong>delayThreshold</strong> – the minimum delay to wait for data to arrive late, relative to the
latest record that has been processed in the form of an interval
(e.g. “1 minute” or “5 hours”).</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sdf</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="n">sdf</span><span class="o">.</span><span class="n">time</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="s1">&#39;timestamp&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">withWatermark</span><span class="p">(</span><span class="s1">&#39;time&#39;</span><span class="p">,</span> <span class="s1">&#39;10 minutes&#39;</span><span class="p">)</span>
<span class="go">DataFrame[name: string, time: timestamp]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.write">
<code class="descname">write</code><a class="headerlink" href="#pyspark.sql.DataFrame.write" title="Permalink to this definition"></a></dt>
<dd><p>Interface for saving the content of the non-streaming <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> out into external
storage.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrameWriter" title="pyspark.sql.DataFrameWriter"><code class="xref py py-class docutils literal"><span class="pre">DataFrameWriter</span></code></a></td>
</tr>
</tbody>
</table>
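<p>A minimal sketch; the output path below is hypothetical:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.write.mode(&#39;overwrite&#39;).parquet(&#39;/tmp/example_output&#39;)  # doctest: +SKIP
</pre></div>
</div>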
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.writeStream">
<code class="descname">writeStream</code><a class="headerlink" href="#pyspark.sql.DataFrame.writeStream" title="Permalink to this definition"></a></dt>
<dd><p>Interface for saving the content of the streaming <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> out into external
storage.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><code class="xref py py-class docutils literal"><span class="pre">DataStreamWriter</span></code></td>
</tr>
</tbody>
</table>
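<p>A minimal sketch, assuming <cite>sdf</cite> is a streaming <cite>DataFrame</cite>; the query name below is hypothetical:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; query = sdf.writeStream.format(&#39;memory&#39;).queryName(&#39;example_query&#39;).start()  # doctest: +SKIP
&gt;&gt;&gt; query.stop()  # doctest: +SKIP
</pre></div>
</div>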
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.GroupedData">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">GroupedData</code><span class="sig-paren">(</span><em>jgd</em>, <em>sql_ctx</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData" title="Permalink to this definition"></a></dt>
<dd><p>A set of methods for aggregations on a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>,
created by <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.groupBy()</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.GroupedData.agg">
<code class="descname">agg</code><span class="sig-paren">(</span><em>*exprs</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.agg"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.agg" title="Permalink to this definition"></a></dt>
<dd><p>Compute aggregates and returns the result as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>The available aggregate functions are <cite>avg</cite>, <cite>max</cite>, <cite>min</cite>, <cite>sum</cite>, <cite>count</cite>.</p>
<p>If <code class="docutils literal"><span class="pre">exprs</span></code> is a single <code class="xref py py-class docutils literal"><span class="pre">dict</span></code> mapping from string to string, then the key
is the column to perform aggregation on, and the value is the aggregate function.</p>
<p>Alternatively, <code class="docutils literal"><span class="pre">exprs</span></code> can also be a list of aggregate <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expressions.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>exprs</strong> – a dict mapping from column name (string) to aggregate functions (string),
or a list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">gdf</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">gdf</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s2">&quot;*&quot;</span><span class="p">:</span> <span class="s2">&quot;count&quot;</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(name=&#39;Alice&#39;, count(1)=1), Row(name=&#39;Bob&#39;, count(1)=1)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">gdf</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(name=&#39;Alice&#39;, min(age)=2), Row(name=&#39;Bob&#39;, min(age)=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.avg">
<code class="descname">avg</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.avg"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.avg" title="Permalink to this definition"></a></dt>
<dd><p>Computes average values for each numeric column for each group.</p>
<p><a class="reference internal" href="#pyspark.sql.GroupedData.mean" title="pyspark.sql.GroupedData.mean"><code class="xref py py-func docutils literal"><span class="pre">mean()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.GroupedData.avg" title="pyspark.sql.GroupedData.avg"><code class="xref py py-func docutils literal"><span class="pre">avg()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string). Non-numeric columns are ignored.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">avg</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">avg</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;height&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5, avg(height)=82.5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.count">
<code class="descname">count</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.count"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.count" title="Permalink to this definition"></a></dt>
<dd><p>Counts the number of records for each group.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(age=2, count=1), Row(age=5, count=1)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.max">
<code class="descname">max</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.max"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.max" title="Permalink to this definition"></a></dt>
<dd><p>Computes the max value for each numeric column for each group.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(max(age)=5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;height&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(max(age)=5, max(height)=85)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.mean">
<code class="descname">mean</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.mean"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.mean" title="Permalink to this definition"></a></dt>
<dd><p>Computes the average value for each numeric column for each group.</p>
<p><a class="reference internal" href="#pyspark.sql.GroupedData.mean" title="pyspark.sql.GroupedData.mean"><code class="xref py py-func docutils literal"><span class="pre">mean()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.GroupedData.avg" title="pyspark.sql.GroupedData.avg"><code class="xref py py-func docutils literal"><span class="pre">avg()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string). Non-numeric columns are ignored.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;height&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5, avg(height)=82.5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.min">
<code class="descname">min</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.min"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.min" title="Permalink to this definition"></a></dt>
<dd><p>Computes the min value for each numeric column for each group.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string). Non-numeric columns are ignored.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(min(age)=2)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;height&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(min(age)=2, min(height)=80)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.pivot">
<code class="descname">pivot</code><span class="sig-paren">(</span><em>pivot_col</em>, <em>values=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.pivot"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.pivot" title="Permalink to this definition"></a></dt>
<dd><p>Pivots a column of the current <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list
of distinct values to pivot on, and one that does not. The latter is more concise but less
efficient, because Spark must first compute the list of distinct values internally.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>pivot_col</strong> – Name of the column to pivot.</li>
<li><strong>values</strong> – List of values that will be translated to columns in the output DataFrame.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<p>Compute the sum of earnings for each year by course, with each course as a separate column:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s2">&quot;year&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="s2">&quot;course&quot;</span><span class="p">,</span> <span class="p">[</span><span class="s2">&quot;dotNET&quot;</span><span class="p">,</span> <span class="s2">&quot;Java&quot;</span><span class="p">])</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s2">&quot;earnings&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(year=2012, dotNET=15000, Java=20000), Row(year=2013, dotNET=48000, Java=30000)]</span>
</pre></div>
</div>
<p>Or without specifying the column values (less efficient):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s2">&quot;year&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="s2">&quot;course&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s2">&quot;earnings&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(year=2012, Java=20000, dotNET=15000), Row(year=2013, Java=30000, dotNET=48000)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.sum">
<code class="descname">sum</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.sum"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.sum" title="Permalink to this definition"></a></dt>
<dd><p>Computes the sum for each numeric column for each group.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string). Non-numeric columns are ignored.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(sum(age)=7)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;height&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(sum(age)=7, sum(height)=165)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.Column">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">Column</code><span class="sig-paren">(</span><em>jc</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column" title="Permalink to this definition"></a></dt>
<dd><p>A column in a DataFrame.</p>
<p><a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> instances can be created by:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># 1. Select a column out of a DataFrame</span>
<span class="n">df</span><span class="o">.</span><span class="n">colName</span>
<span class="n">df</span><span class="p">[</span><span class="s2">&quot;colName&quot;</span><span class="p">]</span>
<span class="c1"># 2. Create from an expression</span>
<span class="n">df</span><span class="o">.</span><span class="n">colName</span> <span class="o">+</span> <span class="mi">1</span>
<span class="mi">1</span> <span class="o">/</span> <span class="n">df</span><span class="o">.</span><span class="n">colName</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.Column.alias">
<code class="descname">alias</code><span class="sig-paren">(</span><em>*alias</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.alias"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.alias" title="Permalink to this definition"></a></dt>
<dd><p>Returns this column aliased with a new name or names (in the case of expressions that
return more than one column, such as explode).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>alias</strong> – strings of desired column names (collects all positional arguments passed)</li>
<li><strong>metadata</strong> – a dict of information to be stored in the <code class="docutils literal"><span class="pre">metadata</span></code> attribute of the
corresponding <cite>StructField</cite> (optional, keyword-only argument)</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionchanged">
<p><span class="versionmodified">Changed in version 2.2: </span>Added optional <code class="docutils literal"><span class="pre">metadata</span></code> argument.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;age2&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age2=2), Row(age2=5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;age3&quot;</span><span class="p">,</span> <span class="n">metadata</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;max&#39;</span><span class="p">:</span> <span class="mi">99</span><span class="p">}))</span><span class="o">.</span><span class="n">schema</span><span class="p">[</span><span class="s1">&#39;age3&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">metadata</span><span class="p">[</span><span class="s1">&#39;max&#39;</span><span class="p">]</span>
<span class="go">99</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.asc">
<code class="descname">asc</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.asc" title="Permalink to this definition"></a></dt>
<dd><p>Returns a sort expression based on the ascending order of this column.</p>
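<p>For example, with the example <code class="docutils literal"><span class="pre">df</span></code> used throughout this page:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.orderBy(df.age.asc()).collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
</pre></div>
</div>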
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.astype">
<code class="descname">astype</code><span class="sig-paren">(</span><em>dataType</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.astype" title="Permalink to this definition"></a></dt>
<dd><p><a class="reference internal" href="#pyspark.sql.Column.astype" title="pyspark.sql.Column.astype"><code class="xref py py-func docutils literal"><span class="pre">astype()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.Column.cast" title="pyspark.sql.Column.cast"><code class="xref py py-func docutils literal"><span class="pre">cast()</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.between">
<code class="descname">between</code><span class="sig-paren">(</span><em>lowerBound</em>, <em>upperBound</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.between"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.between" title="Permalink to this definition"></a></dt>
<dd><p>A boolean expression that is evaluated to true if the value of this
expression is between the given lower and upper bounds (inclusive).</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">between</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+---------------------------+</span>
<span class="go">| name|((age &gt;= 2) AND (age &lt;= 4))|</span>
<span class="go">+-----+---------------------------+</span>
<span class="go">|Alice| true|</span>
<span class="go">| Bob| false|</span>
<span class="go">+-----+---------------------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.bitwiseAND">
<code class="descname">bitwiseAND</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.bitwiseAND" title="Permalink to this definition"></a></dt>
<dd><p>Binary operator: computes the bitwise AND of this expression and another expression.</p>
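<p>A minimal sketch, using the page&#8217;s example <code class="docutils literal"><span class="pre">df</span></code> (ages 2 and 5; the alias only gives a readable column name):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.select(df.age.bitwiseAND(2).alias('r')).collect()
[Row(r=2), Row(r=0)]
</pre></div>
</div>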
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.bitwiseOR">
<code class="descname">bitwiseOR</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.bitwiseOR" title="Permalink to this definition"></a></dt>
<dd><p>Binary operator: computes the bitwise OR of this expression and another expression.</p>
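<p>With the same <code class="docutils literal"><span class="pre">df</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.select(df.age.bitwiseOR(2).alias('r')).collect()
[Row(r=2), Row(r=7)]
</pre></div>
</div>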
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.bitwiseXOR">
<code class="descname">bitwiseXOR</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.bitwiseXOR" title="Permalink to this definition"></a></dt>
<dd><p>Binary operator: computes the bitwise XOR of this expression and another expression.</p>
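<p>Again with the same <code class="docutils literal"><span class="pre">df</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.select(df.age.bitwiseXOR(2).alias('r')).collect()
[Row(r=0), Row(r=7)]
</pre></div>
</div>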
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.cast">
<code class="descname">cast</code><span class="sig-paren">(</span><em>dataType</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.cast"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.cast" title="Permalink to this definition"></a></dt>
<dd><p>Convert the column into type <code class="docutils literal"><span class="pre">dataType</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="s2">&quot;string&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;ages&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(ages=&#39;2&#39;), Row(ages=&#39;5&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">StringType</span><span class="p">())</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;ages&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(ages=&#39;2&#39;), Row(ages=&#39;5&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.contains">
<code class="descname">contains</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.contains" title="Permalink to this definition"></a></dt>
<dd><p>Binary operator: returns a Boolean <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> that is true where this string column contains the given substring.</p>
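<p>For instance, with the example <code class="docutils literal"><span class="pre">df</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.filter(df.name.contains('lic')).collect()
[Row(age=2, name='Alice')]
</pre></div>
</div>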
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.desc">
<code class="descname">desc</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.desc" title="Permalink to this definition"></a></dt>
<dd><p>Returns a sort expression based on the descending order of this column.</p>
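<p>For example, the reverse of the <code class="docutils literal"><span class="pre">asc()</span></code> sketch above:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.orderBy(df.age.desc()).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
</pre></div>
</div>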
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.endswith">
<code class="descname">endswith</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.endswith" title="Permalink to this definition"></a></dt>
<dd><p>Return a Boolean <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> based on matching the end of the string.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>other</strong> – string at end of line (do not use a regex <cite>$</cite>)</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">&#39;ice&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">&#39;ice$&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.getField">
<code class="descname">getField</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.getField"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.getField" title="Permalink to this definition"></a></dt>
<dd><p>An expression that gets a field by name out of a struct (<cite>StructType</cite>) column.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="n">Row</span><span class="p">(</span><span class="n">r</span><span class="o">=</span><span class="n">Row</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="s2">&quot;b&quot;</span><span class="p">))])</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">r</span><span class="o">.</span><span class="n">getField</span><span class="p">(</span><span class="s2">&quot;b&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+</span>
<span class="go">|r.b|</span>
<span class="go">+---+</span>
<span class="go">| b|</span>
<span class="go">+---+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">r</span><span class="o">.</span><span class="n">a</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+</span>
<span class="go">|r.a|</span>
<span class="go">+---+</span>
<span class="go">| 1|</span>
<span class="go">+---+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.getItem">
<code class="descname">getItem</code><span class="sig-paren">(</span><em>key</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.getItem"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.getItem" title="Permalink to this definition"></a></dt>
<dd><p>An expression that gets an item at position <code class="docutils literal"><span class="pre">ordinal</span></code> out of a list,
or gets an item by key out of a dict.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="p">{</span><span class="s2">&quot;key&quot;</span><span class="p">:</span> <span class="s2">&quot;value&quot;</span><span class="p">})])</span><span class="o">.</span><span class="n">toDF</span><span class="p">([</span><span class="s2">&quot;l&quot;</span><span class="p">,</span> <span class="s2">&quot;d&quot;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">l</span><span class="o">.</span><span class="n">getItem</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="o">.</span><span class="n">getItem</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+</span>
<span class="go">|l[0]|d[key]|</span>
<span class="go">+----+------+</span>
<span class="go">| 1| value|</span>
<span class="go">+----+------+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">l</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">[</span><span class="s2">&quot;key&quot;</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+</span>
<span class="go">|l[0]|d[key]|</span>
<span class="go">+----+------+</span>
<span class="go">| 1| value|</span>
<span class="go">+----+------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.isNotNull">
<code class="descname">isNotNull</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.isNotNull" title="Permalink to this definition"></a></dt>
<dd><p>True if the current expression is not null. Often combined with
<a class="reference internal" href="#pyspark.sql.DataFrame.filter" title="pyspark.sql.DataFrame.filter"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.filter()</span></code></a> to select rows with non-null values.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Tom&#39;</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="kc">None</span><span class="p">)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">height</span><span class="o">.</span><span class="n">isNotNull</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(height=80, name=&#39;Tom&#39;)]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.isNull">
<code class="descname">isNull</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.isNull" title="Permalink to this definition"></a></dt>
<dd><p>True if the current expression is null. Often combined with
<a class="reference internal" href="#pyspark.sql.DataFrame.filter" title="pyspark.sql.DataFrame.filter"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.filter()</span></code></a> to select rows with null values.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Tom&#39;</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="kc">None</span><span class="p">)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">height</span><span class="o">.</span><span class="n">isNull</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(height=None, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.isin">
<code class="descname">isin</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.isin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.isin" title="Permalink to this definition"></a></dt>
<dd><p>A boolean expression that is evaluated to true if the value of this
expression is contained by the evaluated values of the arguments.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="s2">&quot;Bob&quot;</span><span class="p">,</span> <span class="s2">&quot;Mike&quot;</span><span class="p">)]</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=&#39;Bob&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">isin</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])]</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.like">
<code class="descname">like</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.like" title="Permalink to this definition"></a></dt>
<dd><p>Return a Boolean <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> based on a SQL LIKE match.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>other</strong> – a SQL LIKE pattern</td>
</tr>
</tbody>
</table>
<p>See <a class="reference internal" href="#pyspark.sql.Column.rlike" title="pyspark.sql.Column.rlike"><code class="xref py py-func docutils literal"><span class="pre">rlike()</span></code></a> for a regex version</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">like</span><span class="p">(</span><span class="s1">&#39;Al%&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.name">
<code class="descname">name</code><span class="sig-paren">(</span><em>*alias</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.name" title="Permalink to this definition"></a></dt>
<dd><p><a class="reference internal" href="#pyspark.sql.Column.name" title="pyspark.sql.Column.name"><code class="xref py py-func docutils literal"><span class="pre">name()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.Column.alias" title="pyspark.sql.Column.alias"><code class="xref py py-func docutils literal"><span class="pre">alias()</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.otherwise">
<code class="descname">otherwise</code><span class="sig-paren">(</span><em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.otherwise"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.otherwise" title="Permalink to this definition"></a></dt>
<dd><p>Evaluates a list of conditions and returns one of multiple possible result expressions.
If <a class="reference internal" href="#pyspark.sql.Column.otherwise" title="pyspark.sql.Column.otherwise"><code class="xref py py-func docutils literal"><span class="pre">Column.otherwise()</span></code></a> is not invoked, None is returned for unmatched conditions.</p>
<p>See <a class="reference internal" href="#pyspark.sql.functions.when" title="pyspark.sql.functions.when"><code class="xref py py-func docutils literal"><span class="pre">pyspark.sql.functions.when()</span></code></a> for example usage.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>value</strong> – a literal value, or a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expression.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+-------------------------------------+</span>
<span class="go">| name|CASE WHEN (age &gt; 3) THEN 1 ELSE 0 END|</span>
<span class="go">+-----+-------------------------------------+</span>
<span class="go">|Alice| 0|</span>
<span class="go">| Bob| 1|</span>
<span class="go">+-----+-------------------------------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.over">
<code class="descname">over</code><span class="sig-paren">(</span><em>window</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.over"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.over" title="Permalink to this definition"></a></dt>
<dd><p>Define a windowing column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>window</strong> – a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a></td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">a Column</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Window</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">window</span> <span class="o">=</span> <span class="n">Window</span><span class="o">.</span><span class="n">partitionBy</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">rowsBetween</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="n">rank</span><span class="p">,</span> <span class="nb">min</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># df.select(rank().over(window), min(&#39;age&#39;).over(window))</span>
</pre></div>
</div>
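<p>A concrete sketch of the commented-out line above, using the page&#8217;s example <code class="docutils literal"><span class="pre">df</span></code>: each name forms its own single-row partition here, so the windowed minimum is each row&#8217;s own age (rows are ordered by name for a deterministic result).</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.select('name', min('age').over(window).alias('min_age')).orderBy('name').collect()
[Row(name='Alice', min_age=2), Row(name='Bob', min_age=5)]
</pre></div>
</div>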
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.rlike">
<code class="descname">rlike</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.rlike" title="Permalink to this definition"></a></dt>
<dd><p>Return a Boolean <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> based on a regex match.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>other</strong> – an extended regex expression</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">rlike</span><span class="p">(</span><span class="s1">&#39;ice$&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.startswith">
<code class="descname">startswith</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.startswith" title="Permalink to this definition"></a></dt>
<dd><p>Return a Boolean <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> based on matching the start of the string.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>other</strong> – string at end of line (do not use a regex <cite>^</cite>)</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;Al&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=&#39;Alice&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;^Al&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.substr">
<code class="descname">substr</code><span class="sig-paren">(</span><em>startPos</em>, <em>length</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.substr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.substr" title="Permalink to this definition"></a></dt>
<dd><p>Return a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> which is a substring of the column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>startPos</strong> – start position (int or Column)</li>
<li><strong>length</strong> – length of the substring (int or Column)</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">substr</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;col&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(col=&#39;Ali&#39;), Row(col=&#39;Bob&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.when">
<code class="descname">when</code><span class="sig-paren">(</span><em>condition</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.when"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.when" title="Permalink to this definition"></a></dt>
<dd><p>Evaluates a list of conditions and returns one of multiple possible result expressions.
If <a class="reference internal" href="#pyspark.sql.Column.otherwise" title="pyspark.sql.Column.otherwise"><code class="xref py py-func docutils literal"><span class="pre">Column.otherwise()</span></code></a> is not invoked, None is returned for unmatched conditions.</p>
<p>See <a class="reference internal" href="#pyspark.sql.functions.when" title="pyspark.sql.functions.when"><code class="xref py py-func docutils literal"><span class="pre">pyspark.sql.functions.when()</span></code></a> for example usage.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>condition</strong> – a boolean <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expression.</li>
<li><strong>value</strong> – a literal value, or a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expression.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+------------------------------------------------------------+</span>
<span class="go">| name|CASE WHEN (age &gt; 4) THEN 1 WHEN (age &lt; 3) THEN -1 ELSE 0 END|</span>
<span class="go">+-----+------------------------------------------------------------+</span>
<span class="go">|Alice| -1|</span>
<span class="go">| Bob| 1|</span>
<span class="go">+-----+------------------------------------------------------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.Row">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">Row</code><a class="reference internal" href="_modules/pyspark/sql/types.html#Row"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Row" title="Permalink to this definition"></a></dt>
<dd><p>A row in a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.
The fields in it can be accessed:</p>
<ul class="simple">
<li>like attributes (<code class="docutils literal"><span class="pre">row.key</span></code>)</li>
<li>like dictionary values (<code class="docutils literal"><span class="pre">row[key]</span></code>)</li>
</ul>
<p><code class="docutils literal"><span class="pre">key</span> <span class="pre">in</span> <span class="pre">row</span></code> will search through row keys.</p>
<p>Row can be used to create a row object by using named arguments;
the fields will be sorted by name. It is not allowed to omit
a named argument to represent that a value is None or missing: such a
value must be set to None explicitly.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">row</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s2">&quot;Alice&quot;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">row</span>
<span class="go">Row(age=11, name=&#39;Alice&#39;)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">row</span><span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s1">&#39;age&#39;</span><span class="p">]</span>
<span class="go">(&#39;Alice&#39;, 11)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">row</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">age</span>
<span class="go">(&#39;Alice&#39;, 11)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s1">&#39;name&#39;</span> <span class="ow">in</span> <span class="n">row</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s1">&#39;wrong_key&#39;</span> <span class="ow">in</span> <span class="n">row</span>
<span class="go">False</span>
</pre></div>
</div>
<p>Row can also be used to create another Row-like class, which can then
be used to create Row objects:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">Person</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="s2">&quot;age&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">Person</span>
<span class="go">&lt;Row(name, age)&gt;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s1">&#39;name&#39;</span> <span class="ow">in</span> <span class="n">Person</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s1">&#39;wrong_key&#39;</span> <span class="ow">in</span> <span class="n">Person</span>
<span class="go">False</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">Person</span><span class="p">(</span><span class="s2">&quot;Alice&quot;</span><span class="p">,</span> <span class="mi">11</span><span class="p">)</span>
<span class="go">Row(name=&#39;Alice&#39;, age=11)</span>
</pre></div>
</div>
<dl class="method">
<dt id="pyspark.sql.Row.asDict">
<code class="descname">asDict</code><span class="sig-paren">(</span><em>recursive=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#Row.asDict"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Row.asDict" title="Permalink to this definition"></a></dt>
<dd><p>Returns the Row as a dict.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>recursive</strong> – turns the nested Row as dict (default: False).</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s2">&quot;Alice&quot;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span><span class="o">.</span><span class="n">asDict</span><span class="p">()</span> <span class="o">==</span> <span class="p">{</span><span class="s1">&#39;name&#39;</span><span class="p">:</span> <span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">:</span> <span class="mi">11</span><span class="p">}</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">row</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">row</span><span class="o">.</span><span class="n">asDict</span><span class="p">()</span> <span class="o">==</span> <span class="p">{</span><span class="s1">&#39;key&#39;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;value&#39;</span><span class="p">:</span> <span class="n">Row</span><span class="p">(</span><span class="n">age</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">&#39;a&#39;</span><span class="p">)}</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">row</span><span class="o">.</span><span class="n">asDict</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="s1">&#39;key&#39;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;value&#39;</span><span class="p">:</span> <span class="p">{</span><span class="s1">&#39;name&#39;</span><span class="p">:</span> <span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">:</span> <span class="mi">2</span><span class="p">}}</span>
<span class="go">True</span>
</pre></div>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrameNaFunctions">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrameNaFunctions</code><span class="sig-paren">(</span><em>df</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameNaFunctions"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameNaFunctions" title="Permalink to this definition"></a></dt>
<dd><p>Functionality for working with missing data in <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrameNaFunctions.drop">
<code class="descname">drop</code><span class="sig-paren">(</span><em>how='any'</em>, <em>thresh=None</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameNaFunctions.drop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameNaFunctions.drop" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> omitting rows with null values.
<a class="reference internal" href="#pyspark.sql.DataFrame.dropna" title="pyspark.sql.DataFrame.dropna"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.dropna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.drop" title="pyspark.sql.DataFrameNaFunctions.drop"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.drop()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>how</strong> – ‘any’ or ‘all’.
If ‘any’, drop a row if it contains any nulls.
If ‘all’, drop a row only if all its values are null.</li>
<li><strong>thresh</strong> – int, default None
If specified, drop rows that have less than <cite>thresh</cite> non-null values.
This overwrites the <cite>how</cite> parameter.</li>
<li><strong>subset</strong> – optional list of column names to consider.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">drop</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameNaFunctions.fill">
<code class="descname">fill</code><span class="sig-paren">(</span><em>value</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameNaFunctions.fill"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameNaFunctions.fill" title="Permalink to this definition"></a></dt>
<dd><p>Replace null values, alias for <code class="docutils literal"><span class="pre">na.fill()</span></code>.
<a class="reference internal" href="#pyspark.sql.DataFrame.fillna" title="pyspark.sql.DataFrame.fillna"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.fillna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.fill" title="pyspark.sql.DataFrameNaFunctions.fill"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.fill()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>value</strong> – int, long, float, string, or dict.
Value to replace null values with.
If the value is a dict, then <cite>subset</cite> is ignored and <cite>value</cite> must be a mapping
from column name (string) to replacement value. The replacement value must be
an int, long, float, boolean, or string.</li>
<li><strong>subset</strong> – optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if <cite>value</cite> is a string, and subset contains a non-string column,
then the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">| 5| 50| Bob|</span>
<span class="go">| 50| 50| Tom|</span>
<span class="go">| 50| 50| null|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">({</span><span class="s1">&#39;age&#39;</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">:</span> <span class="s1">&#39;unknown&#39;</span><span class="p">})</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-------+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-------+</span>
<span class="go">| 10| 80| Alice|</span>
<span class="go">| 5| null| Bob|</span>
<span class="go">| 50| null| Tom|</span>
<span class="go">| 50| null|unknown|</span>
<span class="go">+---+------+-------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameNaFunctions.replace">
<code class="descname">replace</code><span class="sig-paren">(</span><em>to_replace</em>, <em>value</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameNaFunctions.replace"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameNaFunctions.replace" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> replacing a value with another value.
<a class="reference internal" href="#pyspark.sql.DataFrame.replace" title="pyspark.sql.DataFrame.replace"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.replace()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.replace" title="pyspark.sql.DataFrameNaFunctions.replace"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.replace()</span></code></a> are
aliases of each other.
Values <cite>to_replace</cite> and <cite>value</cite> should contain either all numerics, all booleans,
or all strings. When replacing, the new value will be cast
to the type of the existing column.
For numeric replacements, all values to be replaced should have a unique
floating-point representation. In case of conflicts (for example with <cite>{42: -1, 42.0: 1}</cite>),
an arbitrary replacement will be used.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>to_replace</strong> – bool, int, long, float, string, list or dict.
Value to be replaced.
If the value is a dict, then <cite>value</cite> is ignored and <cite>to_replace</cite> must be a
mapping between a value and a replacement.</li>
<li><strong>value</strong> – int, long, float, string, or list.
The replacement value must be an int, long, float, or string. If <cite>value</cite> is a
list, <cite>value</cite> should be of the same length and type as <cite>to_replace</cite>.
If <cite>value</cite> is a scalar and <cite>to_replace</cite> is a sequence, then <cite>value</cite> is
used as a replacement for each item in <cite>to_replace</cite>.</li>
<li><strong>subset</strong> – optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if <cite>value</cite> is a string, and subset contains a non-string column,
then the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+-----+</span>
<span class="go">| age|height| name|</span>
<span class="go">+----+------+-----+</span>
<span class="go">| 20| 80|Alice|</span>
<span class="go">| 5| null| Bob|</span>
<span class="go">|null| null| Tom|</span>
<span class="go">|null| null| null|</span>
<span class="go">+----+------+-----+</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">([</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="s1">&#39;Bob&#39;</span><span class="p">],</span> <span class="p">[</span><span class="s1">&#39;A&#39;</span><span class="p">,</span> <span class="s1">&#39;B&#39;</span><span class="p">],</span> <span class="s1">&#39;name&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+----+</span>
<span class="go">| age|height|name|</span>
<span class="go">+----+------+----+</span>
<span class="go">| 10| 80| A|</span>
<span class="go">| 5| null| B|</span>
<span class="go">|null| null| Tom|</span>
<span class="go">|null| null|null|</span>
<span class="go">+----+------+----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrameStatFunctions">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrameStatFunctions</code><span class="sig-paren">(</span><em>df</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions" title="Permalink to this definition"></a></dt>
<dd><p>Functionality for statistic functions with <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.approxQuantile">
<code class="descname">approxQuantile</code><span class="sig-paren">(</span><em>col</em>, <em>probabilities</em>, <em>relativeError</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.approxQuantile"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.approxQuantile" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the approximate quantiles of numerical columns of a
DataFrame.</p>
<p>The result of this algorithm has the following deterministic bound:
If the DataFrame has N elements and if we request the quantile at
probability <cite>p</cite> up to error <cite>err</cite>, then the algorithm will return
a sample <cite>x</cite> from the DataFrame so that the <em>exact</em> rank of <cite>x</cite> is
close to (p * N). More precisely,</p>
<blockquote>
<div>floor((p - err) * N) &lt;= rank(x) &lt;= ceil((p + err) * N).</div></blockquote>
<p>This method implements a variation of the Greenwald-Khanna
algorithm (with some speed optimizations). The algorithm was first
present in [[<a class="reference external" href="http://dx.doi.org/10.1145/375663.375670">http://dx.doi.org/10.1145/375663.375670</a>
Space-efficient Online Computation of Quantile Summaries]]
by Greenwald and Khanna.</p>
<p>Note that null values will be ignored in numerical columns before calculation.
For columns only containing null values, an empty list is returned.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>col</strong> – str, list.
Can be a single column name, or a list of names for multiple columns.</li>
<li><strong>probabilities</strong> – a list of quantile probabilities
Each number must belong to [0, 1].
For example 0 is the minimum, 0.5 is the median, 1 is the maximum.</li>
<li><strong>relativeError</strong> – The relative target precision to achieve
(&gt;= 0). If set to zero, the exact quantiles are computed, which
could be very expensive. Note that values greater than 1 are
accepted but give the same result as 1.</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">the approximate quantiles at the given probabilities. If
the input <cite>col</cite> is a string, the output is a list of floats. If the
input <cite>col</cite> is a list or tuple of strings, the output is also a
list, but each element in it is a list of floats, i.e., the output
is a list of list of floats.</p>
</td>
</tr>
</tbody>
</table>
<div class="versionchanged">
<p><span class="versionmodified">Changed in version 2.2: </span>Added support for multiple columns.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.corr">
<code class="descname">corr</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em>, <em>method=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.corr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.corr" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the correlation of two columns of a DataFrame as a double value.
Currently only supports the Pearson Correlation Coefficient.
<a class="reference internal" href="#pyspark.sql.DataFrame.corr" title="pyspark.sql.DataFrame.corr"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.corr()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.corr" title="pyspark.sql.DataFrameStatFunctions.corr"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.corr()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column</li>
<li><strong>col2</strong> – The name of the second column</li>
<li><strong>method</strong> – The correlation method. Currently only supports “pearson”</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.cov">
<code class="descname">cov</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.cov"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.cov" title="Permalink to this definition"></a></dt>
<dd><p>Calculate the sample covariance for the given columns, specified by their names, as a
double value. <a class="reference internal" href="#pyspark.sql.DataFrame.cov" title="pyspark.sql.DataFrame.cov"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.cov()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.cov" title="pyspark.sql.DataFrameStatFunctions.cov"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.cov()</span></code></a> are aliases.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column</li>
<li><strong>col2</strong> – The name of the second column</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.crosstab">
<code class="descname">crosstab</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.crosstab"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.crosstab" title="Permalink to this definition"></a></dt>
<dd><p>Computes a pair-wise frequency table of the given columns. Also known as a contingency
table. The number of distinct values for each column should be less than 1e4. At most 1e6
non-zero pair frequencies will be returned.
The first column of each row will be the distinct values of <cite>col1</cite> and the column names
will be the distinct values of <cite>col2</cite>. The name of the first column will be <cite>$col1_$col2</cite>.
Pairs that have no occurrences will have zero as their counts.
<a class="reference internal" href="#pyspark.sql.DataFrame.crosstab" title="pyspark.sql.DataFrame.crosstab"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.crosstab()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.crosstab" title="pyspark.sql.DataFrameStatFunctions.crosstab"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.crosstab()</span></code></a> are aliases.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column. Distinct items will make the first item of
each row.</li>
<li><strong>col2</strong> – The name of the second column. Distinct items will make the column names
of the DataFrame.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.freqItems">
<code class="descname">freqItems</code><span class="sig-paren">(</span><em>cols</em>, <em>support=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.freqItems"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.freqItems" title="Permalink to this definition"></a></dt>
<dd><p>Finding frequent items for columns, possibly with false positives. Using the
frequent element count algorithm described in
<a class="reference external" href="http://dx.doi.org/10.1145/762471.762473">http://dx.doi.org/10.1145/762471.762473</a>, proposed by Karp, Schenker, and Papadimitriou”.
<a class="reference internal" href="#pyspark.sql.DataFrame.freqItems" title="pyspark.sql.DataFrame.freqItems"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.freqItems()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.freqItems" title="pyspark.sql.DataFrameStatFunctions.freqItems"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.freqItems()</span></code></a> are aliases.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – Names of the columns to calculate frequent items for as a list or tuple of
strings.</li>
<li><strong>support</strong> – The frequency with which to consider an item ‘frequent’. Default is 1%.
The support must be greater than 1e-4.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.sampleBy">
<code class="descname">sampleBy</code><span class="sig-paren">(</span><em>col</em>, <em>fractions</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.sampleBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.sampleBy" title="Permalink to this definition"></a></dt>
<dd><p>Returns a stratified sample without replacement based on the
fraction given on each stratum.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>col</strong> – column that defines strata</li>
<li><strong>fractions</strong> – sampling fraction for each stratum. If a stratum is not
specified, we treat its fraction as zero.</li>
<li><strong>seed</strong> – random seed</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">a new DataFrame that represents the stratified sample</p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="n">col</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">dataset</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">((</span><span class="n">col</span><span class="p">(</span><span class="s2">&quot;id&quot;</span><span class="p">)</span> <span class="o">%</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sampled</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">sampleBy</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="n">fractions</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">},</span> <span class="n">seed</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sampled</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|key|count|</span>
<span class="go">+---+-----+</span>
<span class="go">| 0| 5|</span>
<span class="go">| 1| 9|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.Window">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">Window</code><a class="reference internal" href="_modules/pyspark/sql/window.html#Window"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Window" title="Permalink to this definition"></a></dt>
<dd><p>Utility functions for defining window in DataFrames.</p>
<p>For example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="c1"># ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">window</span> <span class="o">=</span> <span class="n">Window</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;date&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">rowsBetween</span><span class="p">(</span><span class="n">Window</span><span class="o">.</span><span class="n">unboundedPreceding</span><span class="p">,</span> <span class="n">Window</span><span class="o">.</span><span class="n">currentRow</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="c1"># PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">window</span> <span class="o">=</span> <span class="n">Window</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;date&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">partitionBy</span><span class="p">(</span><span class="s2">&quot;country&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">rangeBetween</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="attribute">
<dt id="pyspark.sql.Window.currentRow">
<code class="descname">currentRow</code><em class="property"> = 0</em><a class="headerlink" href="#pyspark.sql.Window.currentRow" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="staticmethod">
<dt id="pyspark.sql.Window.orderBy">
<em class="property">static </em><code class="descname">orderBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#Window.orderBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Window.orderBy" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a> with the ordering defined.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="staticmethod">
<dt id="pyspark.sql.Window.partitionBy">
<em class="property">static </em><code class="descname">partitionBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#Window.partitionBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Window.partitionBy" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a> with the partitioning defined.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="staticmethod">
<dt id="pyspark.sql.Window.rangeBetween">
<em class="property">static </em><code class="descname">rangeBetween</code><span class="sig-paren">(</span><em>start</em>, <em>end</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#Window.rangeBetween"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Window.rangeBetween" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a> with the frame boundaries defined,
from <cite>start</cite> (inclusive) to <cite>end</cite> (inclusive).</p>
<p>Both <cite>start</cite> and <cite>end</cite> are relative from the current row. For example,
“0” means “current row”, while “-1” means one off before the current row,
and “5” means the five off after the current row.</p>
<p>We recommend users use <code class="docutils literal"><span class="pre">Window.unboundedPreceding</span></code>, <code class="docutils literal"><span class="pre">Window.unboundedFollowing</span></code>,
and <code class="docutils literal"><span class="pre">Window.currentRow</span></code> to specify special boundary values, rather than using integral
values directly.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>start</strong> – boundary start, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">Window.unboundedPreceding</span></code>, or
any value less than or equal to max(-sys.maxsize, -9223372036854775808).</li>
<li><strong>end</strong> – boundary end, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">Window.unboundedFollowing</span></code>, or
any value greater than or equal to min(sys.maxsize, 9223372036854775807).</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="staticmethod">
<dt id="pyspark.sql.Window.rowsBetween">
<em class="property">static </em><code class="descname">rowsBetween</code><span class="sig-paren">(</span><em>start</em>, <em>end</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#Window.rowsBetween"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Window.rowsBetween" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a> with the frame boundaries defined,
from <cite>start</cite> (inclusive) to <cite>end</cite> (inclusive).</p>
<p>Both <cite>start</cite> and <cite>end</cite> are relative positions from the current row.
For example, “0” means “current row”, while “-1” means the row before
the current row, and “5” means the fifth row after the current row.</p>
<p>We recommend users use <code class="docutils literal"><span class="pre">Window.unboundedPreceding</span></code>, <code class="docutils literal"><span class="pre">Window.unboundedFollowing</span></code>,
and <code class="docutils literal"><span class="pre">Window.currentRow</span></code> to specify special boundary values, rather than using integral
values directly.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>start</strong> – boundary start, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">Window.unboundedPreceding</span></code>, or
any value less than or equal to -9223372036854775808.</li>
<li><strong>end</strong> – boundary end, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">Window.unboundedFollowing</span></code>, or
any value greater than or equal to 9223372036854775807.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.Window.unboundedFollowing">
<code class="descname">unboundedFollowing</code><em class="property"> = 9223372036854775807</em><a class="headerlink" href="#pyspark.sql.Window.unboundedFollowing" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.Window.unboundedPreceding">
<code class="descname">unboundedPreceding</code><em class="property"> = -9223372036854775808</em><a class="headerlink" href="#pyspark.sql.Window.unboundedPreceding" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.WindowSpec">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">WindowSpec</code><span class="sig-paren">(</span><em>jspec</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec" title="Permalink to this definition"></a></dt>
<dd><p>A window specification that defines the partitioning, ordering,
and frame boundaries.</p>
<p>Use the static methods in <a class="reference internal" href="#pyspark.sql.Window" title="pyspark.sql.Window"><code class="xref py py-class docutils literal"><span class="pre">Window</span></code></a> to create a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.WindowSpec.orderBy">
<code class="descname">orderBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec.orderBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec.orderBy" title="Permalink to this definition"></a></dt>
<dd><p>Defines the ordering columns in a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – names of columns or expressions</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.WindowSpec.partitionBy">
<code class="descname">partitionBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec.partitionBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec.partitionBy" title="Permalink to this definition"></a></dt>
<dd><p>Defines the partitioning columns in a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – names of columns or expressions</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.WindowSpec.rangeBetween">
<code class="descname">rangeBetween</code><span class="sig-paren">(</span><em>start</em>, <em>end</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec.rangeBetween"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec.rangeBetween" title="Permalink to this definition"></a></dt>
<dd><p>Defines the frame boundaries, from <cite>start</cite> (inclusive) to <cite>end</cite> (inclusive).</p>
<p>Both <cite>start</cite> and <cite>end</cite> are relative from the current row. For example,
“0” means “current row”, while “-1” means one off before the current row,
and “5” means the five off after the current row.</p>
<p>We recommend users use <code class="docutils literal"><span class="pre">Window.unboundedPreceding</span></code>, <code class="docutils literal"><span class="pre">Window.unboundedFollowing</span></code>,
and <code class="docutils literal"><span class="pre">Window.currentRow</span></code> to specify special boundary values, rather than using integral
values directly.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>start</strong> – boundary start, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">Window.unboundedPreceding</span></code>, or
any value less than or equal to max(-sys.maxsize, -9223372036854775808).</li>
<li><strong>end</strong> – boundary end, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">Window.unboundedFollowing</span></code>, or
any value greater than or equal to min(sys.maxsize, 9223372036854775807).</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.WindowSpec.rowsBetween">
<code class="descname">rowsBetween</code><span class="sig-paren">(</span><em>start</em>, <em>end</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec.rowsBetween"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec.rowsBetween" title="Permalink to this definition"></a></dt>
<dd><p>Defines the frame boundaries, from <cite>start</cite> (inclusive) to <cite>end</cite> (inclusive).</p>
<p>Both <cite>start</cite> and <cite>end</cite> are relative positions from the current row.
For example, “0” means “current row”, while “-1” means the row before
the current row, and “5” means the fifth row after the current row.</p>
<p>We recommend users use <code class="docutils literal"><span class="pre">Window.unboundedPreceding</span></code>, <code class="docutils literal"><span class="pre">Window.unboundedFollowing</span></code>,
and <code class="docutils literal"><span class="pre">Window.currentRow</span></code> to specify special boundary values, rather than using integral
values directly.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>start</strong> – boundary start, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">Window.unboundedPreceding</span></code>, or
any value less than or equal to max(-sys.maxsize, -9223372036854775808).</li>
<li><strong>end</strong> – boundary end, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">Window.unboundedFollowing</span></code>, or
any value greater than or equal to min(sys.maxsize, 9223372036854775807).</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrameReader">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrameReader</code><span class="sig-paren">(</span><em>spark</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader" title="Permalink to this definition"></a></dt>
<dd><p>Interface used to load a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> from external storage systems
(e.g. file systems, key-value stores, etc.). Use <code class="xref py py-attr docutils literal"><span class="pre">spark.read</span></code>
to access this.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.csv">
<code class="descname">csv</code><span class="sig-paren">(</span><em>path</em>, <em>schema=None</em>, <em>sep=None</em>, <em>encoding=None</em>, <em>quote=None</em>, <em>escape=None</em>, <em>comment=None</em>, <em>header=None</em>, <em>inferSchema=None</em>, <em>ignoreLeadingWhiteSpace=None</em>, <em>ignoreTrailingWhiteSpace=None</em>, <em>nullValue=None</em>, <em>nanValue=None</em>, <em>positiveInf=None</em>, <em>negativeInf=None</em>, <em>dateFormat=None</em>, <em>timestampFormat=None</em>, <em>maxColumns=None</em>, <em>maxCharsPerColumn=None</em>, <em>maxMalformedLogPerPartition=None</em>, <em>mode=None</em>, <em>columnNameOfCorruptRecord=None</em>, <em>multiLine=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.csv"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.csv" title="Permalink to this definition"></a></dt>
<dd><p>Loads a CSV file and returns the result as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>This function will go through the input once to determine the input schema if
<code class="docutils literal"><span class="pre">inferSchema</span></code> is enabled. To avoid going through the entire data once, disable
<code class="docutils literal"><span class="pre">inferSchema</span></code> option or specify the schema explicitly using <code class="docutils literal"><span class="pre">schema</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – string, or list of strings, for input path(s).</li>
<li><strong>schema</strong> – an optional <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> for the input schema.</li>
<li><strong>sep</strong> – sets the single character as a separator for each field and value.
If None is set, it uses the default value, <code class="docutils literal"><span class="pre">,</span></code>.</li>
<li><strong>encoding</strong> – decodes the CSV files by the given encoding type. If None is set,
it uses the default value, <code class="docutils literal"><span class="pre">UTF-8</span></code>.</li>
<li><strong>quote</strong> – sets the single character used for escaping quoted values where the
separator can be part of the value. If None is set, it uses the default
value, <code class="docutils literal"><span class="pre">&quot;</span></code>. If you would like to turn off quotations, you need to set an
empty string.</li>
<li><strong>escape</strong> – sets the single character used for escaping quotes inside an already
quoted value. If None is set, it uses the default value, <code class="docutils literal"><span class="pre">\</span></code>.</li>
<li><strong>comment</strong> – sets the single character used for skipping lines beginning with this
character. By default (None), it is disabled.</li>
<li><strong>header</strong> – uses the first line as names of columns. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>inferSchema</strong> – infers the input schema automatically from data. It requires one extra
pass over the data. If None is set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>ignoreLeadingWhiteSpace</strong> – A flag indicating whether or not leading whitespaces from
values being read should be skipped. If None is set, it
uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>ignoreTrailingWhiteSpace</strong> – A flag indicating whether or not trailing whitespaces from
values being read should be skipped. If None is set, it
uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>nullValue</strong> – sets the string representation of a null value. If None is set, it uses
the default value, empty string. Since 2.0.1, this <code class="docutils literal"><span class="pre">nullValue</span></code> param
applies to all supported types including the string type.</li>
<li><strong>nanValue</strong> – sets the string representation of a non-number value. If None is set, it
uses the default value, <code class="docutils literal"><span class="pre">NaN</span></code>.</li>
<li><strong>positiveInf</strong> – sets the string representation of a positive infinity value. If None
is set, it uses the default value, <code class="docutils literal"><span class="pre">Inf</span></code>.</li>
<li><strong>negativeInf</strong> – sets the string representation of a negative infinity value. If None
is set, it uses the default value, <code class="docutils literal"><span class="pre">Inf</span></code>.</li>
<li><strong>dateFormat</strong> – sets the string that indicates a date format. Custom date formats
follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>. This
applies to date type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd</span></code>.</li>
<li><strong>timestampFormat</strong> – sets the string that indicates a timestamp format. Custom date
formats follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>.
This applies to timestamp type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd'T'HH:mm:ss.SSSXXX</span></code>.</li>
<li><strong>maxColumns</strong> – defines a hard limit of how many columns a record can have. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">20480</span></code>.</li>
<li><strong>maxCharsPerColumn</strong> – defines the maximum number of characters allowed for any given
value being read. If None is set, it uses the default value,
<code class="docutils literal"><span class="pre">-1</span></code> meaning unlimited length.</li>
<li><strong>maxMalformedLogPerPartition</strong> – this parameter is no longer used since Spark 2.2.0.
If specified, it is ignored.</li>
<li><strong>mode</strong><dl class="docutils">
<dt>allows a mode for dealing with corrupt records during parsing. If None is</dt>
<dd>set, it uses the default value, <code class="docutils literal"><span class="pre">PERMISSIVE</span></code>.</dd>
</dl>
<ul>
<li><code class="docutils literal"><span class="pre">PERMISSIVE</span></code> : sets other fields to <code class="docutils literal"><span class="pre">null</span></code> when it meets a corrupted record, and puts the malformed string into a field configured by <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code>. To keep corrupt records, an user can set a string type field named <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When a length of parsed CSV tokens is shorter than an expected length of a schema, it sets <cite>null</cite> for extra fields.</li>
<li><code class="docutils literal"><span class="pre">DROPMALFORMED</span></code> : ignores the whole corrupted records.</li>
<li><code class="docutils literal"><span class="pre">FAILFAST</span></code> : throws an exception when it meets corrupted records.</li>
</ul>
</li>
<li><strong>columnNameOfCorruptRecord</strong> – allows renaming the new field having malformed string
created by <code class="docutils literal"><span class="pre">PERMISSIVE</span></code> mode. This overrides
<code class="docutils literal"><span class="pre">spark.sql.columnNameOfCorruptRecord</span></code>. If None is set,
it uses the value specified in
<code class="docutils literal"><span class="pre">spark.sql.columnNameOfCorruptRecord</span></code>.</li>
<li><strong>multiLine</strong> – parse records, which may span multiple lines. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/ages.csv&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;_c0&#39;, &#39;string&#39;), (&#39;_c1&#39;, &#39;string&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.format">
<code class="descname">format</code><span class="sig-paren">(</span><em>source</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.format"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.format" title="Permalink to this definition"></a></dt>
<dd><p>Specifies the input data source format.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>source</strong> – string, name of the data source, e.g. ‘json’, ‘parquet’.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;json&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/people.json&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;age&#39;, &#39;bigint&#39;), (&#39;name&#39;, &#39;string&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.jdbc">
<code class="descname">jdbc</code><span class="sig-paren">(</span><em>url</em>, <em>table</em>, <em>column=None</em>, <em>lowerBound=None</em>, <em>upperBound=None</em>, <em>numPartitions=None</em>, <em>predicates=None</em>, <em>properties=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.jdbc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.jdbc" title="Permalink to this definition"></a></dt>
<dd><p>Construct a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> representing the database table named <code class="docutils literal"><span class="pre">table</span></code>
accessible via JDBC URL <code class="docutils literal"><span class="pre">url</span></code> and connection <code class="docutils literal"><span class="pre">properties</span></code>.</p>
<p>Partitions of the table will be retrieved in parallel if either <code class="docutils literal"><span class="pre">column</span></code> or
<code class="docutils literal"><span class="pre">predicates</span></code> is specified. <code class="docutils literal"><span class="pre">lowerBound</span></code>, <code class="docutils literal"><span class="pre">upperBound</span></code> and <code class="docutils literal"><span class="pre">numPartitions</span></code>
are needed when <code class="docutils literal"><span class="pre">column</span></code> is specified.</p>
<p>If both <code class="docutils literal"><span class="pre">column</span></code> and <code class="docutils literal"><span class="pre">predicates</span></code> are specified, <code class="docutils literal"><span class="pre">column</span></code> will be used.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Don’t create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>url</strong> – a JDBC URL of the form <code class="docutils literal"><span class="pre">jdbc:subprotocol:subname</span></code></li>
<li><strong>table</strong> – the name of the table</li>
<li><strong>column</strong> – the name of an integer column that will be used for partitioning;
if this parameter is specified, then <code class="docutils literal"><span class="pre">numPartitions</span></code>, <code class="docutils literal"><span class="pre">lowerBound</span></code>
(inclusive), and <code class="docutils literal"><span class="pre">upperBound</span></code> (exclusive) will form partition strides
for generated WHERE clause expressions used to split the column
<code class="docutils literal"><span class="pre">column</span></code> evenly</li>
<li><strong>lowerBound</strong> – the minimum value of <code class="docutils literal"><span class="pre">column</span></code> used to decide partition stride</li>
<li><strong>upperBound</strong> – the maximum value of <code class="docutils literal"><span class="pre">column</span></code> used to decide partition stride</li>
<li><strong>numPartitions</strong> – the number of partitions</li>
<li><strong>predicates</strong> – a list of expressions suitable for inclusion in WHERE clauses;
each one defines one partition of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></li>
<li><strong>properties</strong> – a dictionary of JDBC database connection arguments. Normally at
least properties “user” and “password” with their corresponding values.
For example { ‘user’ : ‘SYSTEM’, ‘password’ : ‘mypassword’ }</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">a DataFrame</p>
</td>
</tr>
</tbody>
</table>
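<p>A hedged sketch of a partitioned JDBC read; the URL, database, table and credentials are hypothetical and assume a reachable PostgreSQL server with its JDBC driver on the classpath.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; # splits the read into 4 partitions on the integer column 'id'
&gt;&gt;&gt; df = spark.read.jdbc(
...     'jdbc:postgresql://localhost:5432/mydb', 'people',
...     column='id', lowerBound=1, upperBound=1000, numPartitions=4,
...     properties={'user': 'SYSTEM', 'password': 'mypassword'})
</pre></div>
</div>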
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.json">
<code class="descname">json</code><span class="sig-paren">(</span><em>path</em>, <em>schema=None</em>, <em>primitivesAsString=None</em>, <em>prefersDecimal=None</em>, <em>allowComments=None</em>, <em>allowUnquotedFieldNames=None</em>, <em>allowSingleQuotes=None</em>, <em>allowNumericLeadingZero=None</em>, <em>allowBackslashEscapingAnyCharacter=None</em>, <em>mode=None</em>, <em>columnNameOfCorruptRecord=None</em>, <em>dateFormat=None</em>, <em>timestampFormat=None</em>, <em>multiLine=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.json" title="Permalink to this definition"></a></dt>
<dd><p>Loads JSON files and returns the results as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p><a class="reference external" href="http://jsonlines.org/">JSON Lines</a> (newline-delimited JSON) is supported by default.
For JSON (one record per file), set the <code class="docutils literal"><span class="pre">multiLine</span></code> parameter to <code class="docutils literal"><span class="pre">true</span></code>.</p>
<p>If the <code class="docutils literal"><span class="pre">schema</span></code> parameter is not specified, this function goes
through the input once to determine the input schema.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – a string representing a path to the JSON dataset, a list of paths,
or an RDD of Strings storing JSON objects.</li>
<li><strong>schema</strong> – an optional <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> for the input schema.</li>
<li><strong>primitivesAsString</strong> – infers all primitive values as a string type. If None is set,
it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>prefersDecimal</strong> – infers all floating-point values as a decimal type. If the values
do not fit in decimal, then it infers them as doubles. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>allowComments</strong> – ignores Java/C++ style comments in JSON records. If None is set,
it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>allowUnquotedFieldNames</strong> – allows unquoted JSON field names. If None is set,
it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>allowSingleQuotes</strong> – allows single quotes in addition to double quotes. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">true</span></code>.</li>
<li><strong>allowNumericLeadingZero</strong> – allows leading zeros in numbers (e.g. 00012). If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>allowBackslashEscapingAnyCharacter</strong> – allows quoting of all characters
using the backslash quoting mechanism. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>mode</strong><dl class="docutils">
<dt>allows a mode for dealing with corrupt records during parsing. If None is</dt>
<dd>set, it uses the default value, <code class="docutils literal"><span class="pre">PERMISSIVE</span></code>.</dd>
</dl>
<ul>
<li><code class="docutils literal"><span class="pre">PERMISSIVE</span></code> : sets other fields to <code class="docutils literal"><span class="pre">null</span></code> when it meets a corrupted record, and puts the malformed string into a field configured by <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code>. To keep corrupt records, an user can set a string type field named <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code> field in an output schema.</li>
<li><code class="docutils literal"><span class="pre">DROPMALFORMED</span></code> : ignores the whole corrupted records.</li>
<li><code class="docutils literal"><span class="pre">FAILFAST</span></code> : throws an exception when it meets corrupted records.</li>
</ul>
</li>
<li><strong>columnNameOfCorruptRecord</strong> – allows renaming the new field that holds the malformed string
created by <code class="docutils literal"><span class="pre">PERMISSIVE</span></code> mode. This overrides
<code class="docutils literal"><span class="pre">spark.sql.columnNameOfCorruptRecord</span></code>. If None is set,
it uses the value specified in
<code class="docutils literal"><span class="pre">spark.sql.columnNameOfCorruptRecord</span></code>.</li>
<li><strong>dateFormat</strong> – sets the string that indicates a date format. Custom date formats
follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>. This
applies to date type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd</span></code>.</li>
<li><strong>timestampFormat</strong> – sets the string that indicates a timestamp format. Custom date
formats follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>.
This applies to timestamp type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd'T'HH:mm:ss.SSSXXX</span></code>.</li>
<li><strong>multiLine</strong> – parses one record per file, where the record may span multiple lines. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df1</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/people.json&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df1</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;age&#39;, &#39;bigint&#39;), (&#39;name&#39;, &#39;string&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">rdd</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/people.json&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="n">rdd</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;age&#39;, &#39;bigint&#39;), (&#39;name&#39;, &#39;string&#39;)]</span>
</pre></div>
</div>
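<p>A minimal sketch of the <code class="docutils literal"><span class="pre">multiLine</span></code> option; the path is an illustrative assumption and would point to a file containing one JSON object spread over several lines.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; # hypothetical path; each file holds a single, possibly multi-line JSON record
&gt;&gt;&gt; df3 = spark.read.json('path/to/multiline.json', multiLine=True)
</pre></div>
</div>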
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.load">
<code class="descname">load</code><span class="sig-paren">(</span><em>path=None</em>, <em>format=None</em>, <em>schema=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.load"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.load" title="Permalink to this definition"></a></dt>
<dd><p>Loads data from a data source and returns it as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – optional string or a list of strings for file-system backed data sources.</li>
<li><strong>format</strong> – optional string for the format of the data source. Defaults to ‘parquet’.</li>
<li><strong>schema</strong> – optional <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> for the input schema.</li>
<li><strong>options</strong> – all other string options</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/parquet_partitioned&#39;</span><span class="p">,</span> <span class="n">opt1</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">opt2</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">opt3</span><span class="o">=</span><span class="s1">&#39;str&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;name&#39;, &#39;string&#39;), (&#39;year&#39;, &#39;int&#39;), (&#39;month&#39;, &#39;int&#39;), (&#39;day&#39;, &#39;int&#39;)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;json&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">([</span><span class="s1">&#39;python/test_support/sql/people.json&#39;</span><span class="p">,</span>
<span class="gp">... </span> <span class="s1">&#39;python/test_support/sql/people1.json&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;age&#39;, &#39;bigint&#39;), (&#39;aka&#39;, &#39;string&#39;), (&#39;name&#39;, &#39;string&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.option">
<code class="descname">option</code><span class="sig-paren">(</span><em>key</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.option"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.option" title="Permalink to this definition"></a></dt>
<dd><p>Adds an input option for the underlying data source.</p>
<dl class="docutils">
<dt>You can set the following option(s) for reading files:</dt>
<dd><ul class="first last simple">
<li><dl class="first docutils">
<dt><code class="docutils literal"><span class="pre">timeZone</span></code>: sets the string that indicates a timezone to be used to parse timestamps</dt>
<dd>in the JSON/CSV datasources or partition values.
If it isn’t set, it uses the default value, session local timezone.</dd>
</dl>
</li>
</ul>
</dd>
</dl>
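<p>A minimal sketch of setting a read option; the <code class="docutils literal"><span class="pre">timeZone</span></code> value is illustrative.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.read.option('timeZone', 'GMT').json('python/test_support/sql/people.json')
</pre></div>
</div>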
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.options">
<code class="descname">options</code><span class="sig-paren">(</span><em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.options"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.options" title="Permalink to this definition"></a></dt>
<dd><p>Adds input options for the underlying data source.</p>
<dl class="docutils">
<dt>You can set the following option(s) for reading files:</dt>
<dd><ul class="first last simple">
<li><dl class="first docutils">
<dt><code class="docutils literal"><span class="pre">timeZone</span></code>: sets the string that indicates a timezone to be used to parse timestamps</dt>
<dd>in the JSON/CSV datasources or partition values.
If it isn’t set, it uses the default value, session local timezone.</dd>
</dl>
</li>
</ul>
</dd>
</dl>
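<p>A minimal sketch of setting several read options at once; the option values are illustrative (treating the first CSV line as a header and inferring column types).</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.read.options(header='true', inferSchema='true').csv('python/test_support/sql/ages.csv')
</pre></div>
</div>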
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.orc">
<code class="descname">orc</code><span class="sig-paren">(</span><em>path</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.orc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.orc" title="Permalink to this definition"></a></dt>
<dd><p>Loads ORC files, returning the result as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Currently ORC support is only available together with Hive support.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">orc</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/orc_partitioned&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;a&#39;, &#39;bigint&#39;), (&#39;b&#39;, &#39;int&#39;), (&#39;c&#39;, &#39;int&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.parquet">
<code class="descname">parquet</code><span class="sig-paren">(</span><em>*paths</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.parquet"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.parquet" title="Permalink to this definition"></a></dt>
<dd><p>Loads Parquet files, returning the result as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<dl class="docutils">
<dt>You can set the following Parquet-specific option(s) for reading Parquet files:</dt>
<dd><ul class="first last simple">
<li><code class="docutils literal"><span class="pre">mergeSchema</span></code>: sets whether we should merge schemas collected from all Parquet part-files. This will override <code class="docutils literal"><span class="pre">spark.sql.parquet.mergeSchema</span></code>. The default value is specified in <code class="docutils literal"><span class="pre">spark.sql.parquet.mergeSchema</span></code>.</li>
</ul>
</dd>
</dl>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/parquet_partitioned&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;name&#39;, &#39;string&#39;), (&#39;year&#39;, &#39;int&#39;), (&#39;month&#39;, &#39;int&#39;), (&#39;day&#39;, &#39;int&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.schema">
<code class="descname">schema</code><span class="sig-paren">(</span><em>schema</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.schema"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.schema" title="Permalink to this definition"></a></dt>
<dd><p>Specifies the input schema.</p>
<p>Some data sources (e.g. JSON) can infer the input schema automatically from data.
By specifying the schema here, the underlying data source can skip the schema
inference step, and thus speed up data loading.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>schema</strong> – a <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> object</td>
</tr>
</tbody>
</table>
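<p>A minimal sketch, reusing the <code class="docutils literal"><span class="pre">people.json</span></code> test file from the examples above to show that an explicit schema skips inference.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; from pyspark.sql.types import StructType, StructField, StringType, LongType
&gt;&gt;&gt; schema = StructType([StructField('age', LongType(), True),
...                      StructField('name', StringType(), True)])
&gt;&gt;&gt; df = spark.read.schema(schema).json('python/test_support/sql/people.json')
&gt;&gt;&gt; df.dtypes
[('age', 'bigint'), ('name', 'string')]
</pre></div>
</div>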
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.table">
<code class="descname">table</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.table"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.table" title="Permalink to this definition"></a></dt>
<dd><p>Returns the specified table as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>tableName</strong> – string, name of the table.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/parquet_partitioned&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s1">&#39;tmpTable&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s1">&#39;tmpTable&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[(&#39;name&#39;, &#39;string&#39;), (&#39;year&#39;, &#39;int&#39;), (&#39;month&#39;, &#39;int&#39;), (&#39;day&#39;, &#39;int&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.text">
<code class="descname">text</code><span class="sig-paren">(</span><em>paths</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.text"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.text" title="Permalink to this definition"></a></dt>
<dd><p>Loads text files and returns a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> whose schema starts with a
string column named “value”, followed by partitioned columns if there
are any.</p>
<p>Each line in the text file is a new row in the resulting DataFrame.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>paths</strong> – string, or list of strings, for input path(s).</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/text-test.txt&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(value=&#39;hello&#39;), Row(value=&#39;this&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrameWriter">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrameWriter</code><span class="sig-paren">(</span><em>df</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter" title="Permalink to this definition"></a></dt>
<dd><p>Interface used to write a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to external storage systems
(e.g. file systems, key-value stores, etc). Use <a class="reference internal" href="#pyspark.sql.DataFrame.write" title="pyspark.sql.DataFrame.write"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.write()</span></code></a>
to access this.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.csv">
<code class="descname">csv</code><span class="sig-paren">(</span><em>path</em>, <em>mode=None</em>, <em>compression=None</em>, <em>sep=None</em>, <em>quote=None</em>, <em>escape=None</em>, <em>header=None</em>, <em>nullValue=None</em>, <em>escapeQuotes=None</em>, <em>quoteAll=None</em>, <em>dateFormat=None</em>, <em>timestampFormat=None</em>, <em>ignoreLeadingWhiteSpace=None</em>, <em>ignoreTrailingWhiteSpace=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.csv"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.csv" title="Permalink to this definition"></a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> in CSV format at the specified path.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in any Hadoop supported file system</li>
<li><strong>mode</strong><p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>compression</strong> – compression codec to use when saving to file. This can be one of the
known case-insensitive shortened names (none, bzip2, gzip, lz4,
snappy and deflate).</li>
<li><strong>sep</strong> – sets a single character as the separator for each field and value. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">,</span></code>.</li>
<li><strong>quote</strong> – sets a single character used for escaping quoted values where the
separator can be part of the value. If None is set, it uses the default
value, <code class="docutils literal"><span class="pre">&quot;</span></code>. To turn off quoting, set an
empty string.</li>
<li><strong>escape</strong> – sets a single character used for escaping quotes inside an already
quoted value. If None is set, it uses the default value, <code class="docutils literal"><span class="pre">\</span></code>.</li>
<li><strong>escapeQuotes</strong> – a flag indicating whether values containing quotes should always
be enclosed in quotes. If None is set, it uses the default value
<code class="docutils literal"><span class="pre">true</span></code>, escaping all values containing a quote character.</li>
<li><strong>quoteAll</strong> – a flag indicating whether all values should always be enclosed in
quotes. If None is set, it uses the default value <code class="docutils literal"><span class="pre">false</span></code>,
only escaping values containing a quote character.</li>
<li><strong>header</strong> – writes the names of columns as the first line. If None is set, it uses
the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>nullValue</strong> – sets the string representation of a null value. If None is set, it uses
the default value, empty string.</li>
<li><strong>dateFormat</strong> – sets the string that indicates a date format. Custom date formats
follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>. This
applies to date type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd</span></code>.</li>
<li><strong>timestampFormat</strong> – sets the string that indicates a timestamp format. Custom date
formats follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>.
This applies to timestamp type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd'T'HH:mm:ss.SSSXXX</span></code>.</li>
<li><strong>ignoreLeadingWhiteSpace</strong> – a flag indicating whether or not leading whitespaces from
values being written should be skipped. If None is set, it
uses the default value, <code class="docutils literal"><span class="pre">true</span></code>.</li>
<li><strong>ignoreTrailingWhiteSpace</strong> – a flag indicating whether or not trailing whitespaces from
values being written should be skipped. If None is set, it
uses the default value, <code class="docutils literal"><span class="pre">true</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s1">&#39;data&#39;</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.format">
<code class="descname">format</code><span class="sig-paren">(</span><em>source</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.format"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.format" title="Permalink to this definition"></a></dt>
<dd><p>Specifies the underlying output data source.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>source</strong> – string, name of the data source, e.g. ‘json’, ‘parquet’.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;json&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s1">&#39;data&#39;</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.insertInto">
<code class="descname">insertInto</code><span class="sig-paren">(</span><em>tableName</em>, <em>overwrite=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.insertInto"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.insertInto" title="Permalink to this definition"></a></dt>
<dd><p>Inserts the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to the specified table.</p>
<p>It requires that the schema of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> is the same as the
schema of the table.</p>
<p>Optionally, existing data can be overwritten by passing <code class="docutils literal"><span class="pre">overwrite=True</span></code>.</p>
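<p>A hedged sketch, assuming the session can create managed tables; the table name is hypothetical and <code class="docutils literal"><span class="pre">df</span></code> is the DataFrame used in the surrounding examples.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.write.saveAsTable('people_table')                  # hypothetical table name
&gt;&gt;&gt; df.write.insertInto('people_table')                   # appends rows with a matching schema
&gt;&gt;&gt; df.write.insertInto('people_table', overwrite=True)   # replaces existing rows
</pre></div>
</div>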
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.jdbc">
<code class="descname">jdbc</code><span class="sig-paren">(</span><em>url</em>, <em>table</em>, <em>mode=None</em>, <em>properties=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.jdbc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.jdbc" title="Permalink to this definition"></a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to an external database table via JDBC.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Don’t create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>url</strong> – a JDBC URL of the form <code class="docutils literal"><span class="pre">jdbc:subprotocol:subname</span></code></li>
<li><strong>table</strong> – Name of the table in the external database.</li>
<li><strong>mode</strong><p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>properties</strong> – a dictionary of JDBC database connection arguments. Normally at
least properties “user” and “password” with their corresponding values.
For example { ‘user’ : ‘SYSTEM’, ‘password’ : ‘mypassword’ }</li>
</ul>
</td>
</tr>
</tbody>
</table>
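<p>A hedged sketch of writing over JDBC; the URL, table and credentials are hypothetical and assume a reachable PostgreSQL server with its JDBC driver on the classpath.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.write.jdbc('jdbc:postgresql://localhost:5432/mydb', 'people_out',
...               mode='append',
...               properties={'user': 'SYSTEM', 'password': 'mypassword'})
</pre></div>
</div>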
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.json">
<code class="descname">json</code><span class="sig-paren">(</span><em>path</em>, <em>mode=None</em>, <em>compression=None</em>, <em>dateFormat=None</em>, <em>timestampFormat=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.json" title="Permalink to this definition"></a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> in JSON format
(<a class="reference external" href="http://jsonlines.org/">JSON Lines text format or newline-delimited JSON</a>) at the
specified path.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in any Hadoop supported file system</li>
<li><strong>mode</strong><p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>compression</strong> – compression codec to use when saving to file. This can be one of the
known case-insensitive shortened names (none, bzip2, gzip, lz4,
snappy and deflate).</li>
<li><strong>dateFormat</strong> – sets the string that indicates a date format. Custom date formats
follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>. This
applies to date type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd</span></code>.</li>
<li><strong>timestampFormat</strong> – sets the string that indicates a timestamp format. Custom date
formats follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>.
This applies to timestamp type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd'T'HH:mm:ss.SSSXXX</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s1">&#39;data&#39;</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.mode">
<code class="descname">mode</code><span class="sig-paren">(</span><em>saveMode</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.mode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.mode" title="Permalink to this definition"></a></dt>
<dd><p>Specifies the behavior when data or table already exists.</p>
<p>Options include:</p>
<ul class="simple">
<li><cite>append</cite>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><cite>overwrite</cite>: Overwrite existing data.</li>
<li><cite>error</cite>: Throw an exception if data already exists.</li>
<li><cite>ignore</cite>: Silently ignore this operation if data already exists.</li>
</ul>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">mode</span><span class="p">(</span><span class="s1">&#39;append&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s1">&#39;data&#39;</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.option">
<code class="descname">option</code><span class="sig-paren">(</span><em>key</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.option"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.option" title="Permalink to this definition"></a></dt>
<dd><p>Adds an output option for the underlying data source.</p>
<dl class="docutils">
<dt>You can set the following option(s) for writing files:</dt>
<dd><ul class="first last simple">
<li><dl class="first docutils">
<dt><code class="docutils literal"><span class="pre">timeZone</span></code>: sets the string that indicates a timezone to be used to format</dt>
<dd>timestamps in the JSON/CSV datasources or partition values.
If it isn’t set, it uses the default value, session local timezone.</dd>
</dl>
</li>
</ul>
</dd>
</dl>
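<p>A minimal sketch of setting a write option, in the style of the other writer examples; the <code class="docutils literal"><span class="pre">timeZone</span></code> value is illustrative.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.write.option('timeZone', 'GMT').json(os.path.join(tempfile.mkdtemp(), 'data'))
</pre></div>
</div>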
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.options">
<code class="descname">options</code><span class="sig-paren">(</span><em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.options"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.options" title="Permalink to this definition"></a></dt>
<dd><p>Adds output options for the underlying data source.</p>
<dl class="docutils">
<dt>You can set the following option(s) for writing files:</dt>
<dd><ul class="first last simple">
<li><dl class="first docutils">
<dt><code class="docutils literal"><span class="pre">timeZone</span></code>: sets the string that indicates a timezone to be used to format</dt>
<dd>timestamps in the JSON/CSV datasources or partition values.
If it isn’t set, it uses the default value, session local timezone.</dd>
</dl>
</li>
</ul>
</dd>
</dl>
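<p>A minimal sketch of setting several write options at once; <code class="docutils literal"><span class="pre">compression</span></code> here relies on the JSON writer&#8217;s documented codec option, and the values are illustrative.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.write.options(timeZone='GMT', compression='gzip').json(
...     os.path.join(tempfile.mkdtemp(), 'data'))
</pre></div>
</div>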
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.orc">
<code class="descname">orc</code><span class="sig-paren">(</span><em>path</em>, <em>mode=None</em>, <em>partitionBy=None</em>, <em>compression=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.orc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.orc" title="Permalink to this definition"></a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> in ORC format at the specified path.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Currently ORC support is only available together with Hive support.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in any Hadoop supported file system</li>
<li><strong>mode</strong><p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
<li><strong>compression</strong> – compression codec to use when saving to file. This can be one of the
known case-insensitive shortened names (none, snappy, zlib, and lzo).
This will override <code class="docutils literal"><span class="pre">orc.compress</span></code>. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">snappy</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">orc_df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">orc</span><span class="p">(</span><span class="s1">&#39;python/test_support/sql/orc_partitioned&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">orc_df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">orc</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s1">&#39;data&#39;</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.parquet">
<code class="descname">parquet</code><span class="sig-paren">(</span><em>path</em>, <em>mode=None</em>, <em>partitionBy=None</em>, <em>compression=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.parquet"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.parquet" title="Permalink to this definition"></a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> in Parquet format at the specified path.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in any Hadoop supported file system</li>
<li><strong>mode</strong><p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
<li><strong>compression</strong> – compression codec to use when saving to file. This can be one of the
known case-insensitive shortened names (none, snappy, gzip, and lzo).
This will override <code class="docutils literal"><span class="pre">spark.sql.parquet.compression.codec</span></code>. If None
is set, it uses the value specified in
<code class="docutils literal"><span class="pre">spark.sql.parquet.compression.codec</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s1">&#39;data&#39;</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.partitionBy">
<code class="descname">partitionBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.partitionBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.partitionBy" title="Permalink to this definition"></a></dt>
<dd><p>Partitions the output by the given columns on the file system.</p>
<p>If specified, the output is laid out on the file system similarly
to Hive’s partitioning scheme.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – name of columns</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">partitionBy</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">,</span> <span class="s1">&#39;month&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s1">&#39;data&#39;</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.save">
<code class="descname">save</code><span class="sig-paren">(</span><em>path=None</em>, <em>format=None</em>, <em>mode=None</em>, <em>partitionBy=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.save"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.save" title="Permalink to this definition"></a></dt>
<dd><p>Saves the contents of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to a data source.</p>
<p>The data source is specified by the <code class="docutils literal"><span class="pre">format</span></code> and a set of <code class="docutils literal"><span class="pre">options</span></code>.
If <code class="docutils literal"><span class="pre">format</span></code> is not specified, the default data source configured by
<code class="docutils literal"><span class="pre">spark.sql.sources.default</span></code> will be used.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in a Hadoop supported file system</li>
<li><strong>format</strong> – the format used to save</li>
<li><strong>mode</strong><p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
<li><strong>options</strong> – all other string options</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">mode</span><span class="p">(</span><span class="s1">&#39;append&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s1">&#39;data&#39;</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.saveAsTable">
<code class="descname">saveAsTable</code><span class="sig-paren">(</span><em>name</em>, <em>format=None</em>, <em>mode=None</em>, <em>partitionBy=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.saveAsTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.saveAsTable" title="Permalink to this definition"></a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as the specified table.</p>
<p>If the table already exists, the behavior of this function depends on the
save mode, specified by the <cite>mode</cite> function (defaults to throwing an exception).
When <cite>mode</cite> is <cite>overwrite</cite>, the schema of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> does not need to be
the same as that of the existing table.</p>
<ul class="simple">
<li><cite>append</cite>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><cite>overwrite</cite>: Overwrite existing data.</li>
<li><cite>error</cite>: Throw an exception if data already exists.</li>
<li><cite>ignore</cite>: Silently ignore this operation if data already exists.</li>
</ul>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>name</strong> – the table name</li>
<li><strong>format</strong> – the format used to save</li>
<li><strong>mode</strong> – one of <cite>append</cite>, <cite>overwrite</cite>, <cite>error</cite>, <cite>ignore</cite> (default: error)</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
<li><strong>options</strong> – all other string options</li>
</ul>
</td>
</tr>
</tbody>
</table>
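<p>A minimal sketch, assuming the running <code class="docutils literal"><span class="pre">spark</span></code> session used by the other examples on this page and a hypothetical table name <code class="docutils literal"><span class="pre">people</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.write.mode('overwrite').saveAsTable('people')
</pre></div>
</div>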
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.text">
<code class="descname">text</code><span class="sig-paren">(</span><em>path</em>, <em>compression=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.text"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.text" title="Permalink to this definition"></a></dt>
<dd><p>Saves the content of the DataFrame in a text file at the specified path.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in any Hadoop supported file system</li>
<li><strong>compression</strong> – compression codec to use when saving to file. This can be one of the
known case-insensitive shortened names (none, bzip2, gzip, lz4,
snappy, and deflate).</li>
</ul>
</td>
</tr>
</tbody>
</table>
<p>The DataFrame must have only one column that is of string type.
Each row becomes a new line in the output file.</p>
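<p>For example, assuming <code class="docutils literal"><span class="pre">df</span></code> has a string column <code class="docutils literal"><span class="pre">name</span></code>, one could select it before writing (a sketch reusing the os/tempfile pattern of the other examples on this page):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.select(df.name).write.text(os.path.join(tempfile.mkdtemp(), 'data'))
</pre></div>
</div>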
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
</dd></dl>
</div>
<div class="section" id="module-pyspark.sql.types">
<span id="pyspark-sql-types-module"></span><h2>pyspark.sql.types module<a class="headerlink" href="#module-pyspark.sql.types" title="Permalink to this headline"></a></h2>
<dl class="class">
<dt id="pyspark.sql.types.DataType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">DataType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType" title="Permalink to this definition"></a></dt>
<dd><p>Base class for data types.</p>
<dl class="method">
<dt id="pyspark.sql.types.DataType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.fromInternal" title="Permalink to this definition"></a></dt>
<dd><p>Converts an internal SQL object into a native Python object.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.json">
<code class="descname">json</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.json" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.jsonValue" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.needConversion" title="Permalink to this definition"></a></dt>
<dd><p>Does this type need conversion between Python objects and internal SQL objects?</p>
<p>This is used to avoid unnecessary conversion for ArrayType/MapType/StructType.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.toInternal" title="Permalink to this definition"></a></dt>
<dd><p>Converts a Python object into an internal SQL object.</p>
</dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.DataType.typeName">
<em class="property">classmethod </em><code class="descname">typeName</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.typeName"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.typeName" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.NullType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">NullType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#NullType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.NullType" title="Permalink to this definition"></a></dt>
<dd><p>Null type.</p>
<p>The data type representing None, used for the types that cannot be inferred.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.StringType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">StringType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#StringType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StringType" title="Permalink to this definition"></a></dt>
<dd><p>String data type.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.BinaryType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">BinaryType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#BinaryType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.BinaryType" title="Permalink to this definition"></a></dt>
<dd><p>Binary (byte array) data type.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.BooleanType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">BooleanType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#BooleanType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.BooleanType" title="Permalink to this definition"></a></dt>
<dd><p>Boolean data type.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.DateType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">DateType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#DateType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DateType" title="Permalink to this definition"></a></dt>
<dd><p>Date (datetime.date) data type.</p>
<dl class="attribute">
<dt id="pyspark.sql.types.DateType.EPOCH_ORDINAL">
<code class="descname">EPOCH_ORDINAL</code><em class="property"> = 719163</em><a class="headerlink" href="#pyspark.sql.types.DateType.EPOCH_ORDINAL" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DateType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>v</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DateType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DateType.fromInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DateType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DateType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DateType.needConversion" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DateType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>d</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DateType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DateType.toInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.TimestampType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">TimestampType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#TimestampType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.TimestampType" title="Permalink to this definition"></a></dt>
<dd><p>Timestamp (datetime.datetime) data type.</p>
<dl class="method">
<dt id="pyspark.sql.types.TimestampType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>ts</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#TimestampType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.TimestampType.fromInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.TimestampType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#TimestampType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.TimestampType.needConversion" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.TimestampType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>dt</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#TimestampType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.TimestampType.toInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.DecimalType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">DecimalType</code><span class="sig-paren">(</span><em>precision=10</em>, <em>scale=0</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DecimalType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DecimalType" title="Permalink to this definition"></a></dt>
<dd><p>Decimal (decimal.Decimal) data type.</p>
<p>The DecimalType must have fixed precision (the maximum total number of digits)
and scale (the number of digits to the right of the decimal point). For example, (5, 2) can
support values in the range [-999.99, 999.99].</p>
<p>The precision can be up to 38; the scale must be less than or equal to the precision.</p>
<p>When creating a DecimalType, the default precision and scale are (10, 0). When inferring
a schema from decimal.Decimal objects, the type is DecimalType(38, 18).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>precision</strong> – the maximum total number of digits (default: 10)</li>
<li><strong>scale</strong> – the number of digits to the right of the decimal point (default: 0)</li>
</ul>
</td>
</tr>
</tbody>
</table>
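<p>For example, the (5, 2) case above can be constructed directly (a minimal sketch; <code class="docutils literal"><span class="pre">simpleString()</span></code> prints the resulting type):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; DecimalType(5, 2).simpleString()
'decimal(5,2)'
</pre></div>
</div>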
<dl class="method">
<dt id="pyspark.sql.types.DecimalType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DecimalType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DecimalType.jsonValue" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DecimalType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DecimalType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DecimalType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.DoubleType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">DoubleType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#DoubleType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DoubleType" title="Permalink to this definition"></a></dt>
<dd><p>Double data type, representing double precision floats.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.FloatType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">FloatType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#FloatType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.FloatType" title="Permalink to this definition"></a></dt>
<dd><p>Float data type, representing single precision floats.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.ByteType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">ByteType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#ByteType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ByteType" title="Permalink to this definition"></a></dt>
<dd><p>Byte data type, i.e. a signed integer in a single byte.</p>
<dl class="method">
<dt id="pyspark.sql.types.ByteType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ByteType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ByteType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.IntegerType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">IntegerType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#IntegerType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.IntegerType" title="Permalink to this definition"></a></dt>
<dd><p>Int data type, i.e. a signed 32-bit integer.</p>
<dl class="method">
<dt id="pyspark.sql.types.IntegerType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#IntegerType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.IntegerType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.LongType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">LongType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#LongType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.LongType" title="Permalink to this definition"></a></dt>
<dd><p>Long data type, i.e. a signed 64-bit integer.</p>
<p>If the values are beyond the range of [-9223372036854775808, 9223372036854775807],
please use <a class="reference internal" href="#pyspark.sql.types.DecimalType" title="pyspark.sql.types.DecimalType"><code class="xref py py-class docutils literal"><span class="pre">DecimalType</span></code></a>.</p>
<dl class="method">
<dt id="pyspark.sql.types.LongType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#LongType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.LongType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.ShortType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">ShortType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#ShortType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ShortType" title="Permalink to this definition"></a></dt>
<dd><p>Short data type, i.e. a signed 16-bit integer.</p>
<dl class="method">
<dt id="pyspark.sql.types.ShortType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ShortType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ShortType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.ArrayType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">ArrayType</code><span class="sig-paren">(</span><em>elementType</em>, <em>containsNull=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType" title="Permalink to this definition"></a></dt>
<dd><p>Array data type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>elementType</strong><a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">DataType</span></code></a> of each element in the array.</li>
<li><strong>containsNull</strong> – boolean, whether the array can contain null (None) values.</li>
</ul>
</td>
</tr>
</tbody>
</table>
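<p>A minimal sketch of an array-of-strings type (assuming the type classes are in scope, as in the StructType examples below):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; ArrayType(StringType()).simpleString()
'array&lt;string&gt;'
</pre></div>
</div>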
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.fromInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.ArrayType.fromJson">
<em class="property">classmethod </em><code class="descname">fromJson</code><span class="sig-paren">(</span><em>json</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.fromJson"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.fromJson" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.jsonValue" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.needConversion" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.toInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.MapType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">MapType</code><span class="sig-paren">(</span><em>keyType</em>, <em>valueType</em>, <em>valueContainsNull=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType" title="Permalink to this definition"></a></dt>
<dd><p>Map data type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>keyType</strong><a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">DataType</span></code></a> of the keys in the map.</li>
<li><strong>valueType</strong><a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">DataType</span></code></a> of the values in the map.</li>
<li><strong>valueContainsNull</strong> – indicates whether values can contain null (None) values.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<p>Keys in a map data type are not allowed to be null (None).</p>
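<p>A minimal sketch of a string-to-int map type (assuming the type classes are in scope):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; MapType(StringType(), IntegerType()).simpleString()
'map&lt;string,int&gt;'
</pre></div>
</div>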
<dl class="method">
<dt id="pyspark.sql.types.MapType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.fromInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.MapType.fromJson">
<em class="property">classmethod </em><code class="descname">fromJson</code><span class="sig-paren">(</span><em>json</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.fromJson"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.fromJson" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.MapType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.jsonValue" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.MapType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.needConversion" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.MapType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.MapType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.toInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.StructField">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">StructField</code><span class="sig-paren">(</span><em>name</em>, <em>dataType</em>, <em>nullable=True</em>, <em>metadata=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField" title="Permalink to this definition"></a></dt>
<dd><p>A field in <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">StructType</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>name</strong> – string, name of the field.</li>
<li><strong>dataType</strong><a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">DataType</span></code></a> of the field.</li>
<li><strong>nullable</strong> – boolean, whether the field can be null (None) or not.</li>
<li><strong>metadata</strong> – a dict from string to simple type that can be serialized to JSON automatically</li>
</ul>
</td>
</tr>
</tbody>
</table>
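<p>A minimal sketch of a nullable integer field (assuming the type classes are in scope):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; StructField("age", IntegerType(), True).simpleString()
'age:int'
</pre></div>
</div>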
<dl class="method">
<dt id="pyspark.sql.types.StructField.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.fromInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.StructField.fromJson">
<em class="property">classmethod </em><code class="descname">fromJson</code><span class="sig-paren">(</span><em>json</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.fromJson"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.fromJson" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.jsonValue" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.needConversion" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.toInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.typeName">
<code class="descname">typeName</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.typeName"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.typeName" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.StructType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">StructType</code><span class="sig-paren">(</span><em>fields=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType" title="Permalink to this definition"></a></dt>
<dd><p>Struct type, consisting of a list of <a class="reference internal" href="#pyspark.sql.types.StructField" title="pyspark.sql.types.StructField"><code class="xref py py-class docutils literal"><span class="pre">StructField</span></code></a>.</p>
<p>This is the data type representing a <code class="xref py py-class docutils literal"><span class="pre">Row</span></code>.</p>
<p>Iterating a <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">StructType</span></code></a> will iterate over its <a class="reference internal" href="#pyspark.sql.types.StructField" title="pyspark.sql.types.StructField"><code class="xref py py-class docutils literal"><span class="pre">StructField</span></code></a>s.
A contained <a class="reference internal" href="#pyspark.sql.types.StructField" title="pyspark.sql.types.StructField"><code class="xref py py-class docutils literal"><span class="pre">StructField</span></code></a> can be accessed by name or position.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;f1&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">)])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span><span class="p">[</span><span class="s2">&quot;f1&quot;</span><span class="p">]</span>
<span class="go">StructField(f1,StringType,true)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="go">StructField(f1,StringType,true)</span>
</pre></div>
</div>
<dl class="method">
<dt id="pyspark.sql.types.StructType.add">
<code class="descname">add</code><span class="sig-paren">(</span><em>field</em>, <em>data_type=None</em>, <em>nullable=True</em>, <em>metadata=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.add"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.add" title="Permalink to this definition"></a></dt>
<dd><p>Construct a StructType by adding new elements to it to define the schema. The method accepts
either:</p>
<blockquote>
<div><ol class="loweralpha simple">
<li>A single parameter which is a StructField object.</li>
<li>Between 2 and 4 parameters as (name, data_type, nullable (optional),
metadata (optional)). The data_type parameter may be either a String or a
DataType object.</li>
</ol>
</div></blockquote>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">()</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="s2">&quot;f1&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="s2">&quot;f2&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct2</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;f1&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">),</span> \
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;f2&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">,</span> <span class="kc">None</span><span class="p">)])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span> <span class="o">==</span> <span class="n">struct2</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">()</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;f1&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct2</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;f1&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">)])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span> <span class="o">==</span> <span class="n">struct2</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">()</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="s2">&quot;f1&quot;</span><span class="p">,</span> <span class="s2">&quot;string&quot;</span><span class="p">,</span> <span class="kc">True</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct2</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;f1&quot;</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="kc">True</span><span class="p">)])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">struct1</span> <span class="o">==</span> <span class="n">struct2</span>
<span class="go">True</span>
</pre></div>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>field</strong> – Either the name of the field or a StructField object</li>
<li><strong>data_type</strong> – If present, the DataType of the StructField to create</li>
<li><strong>nullable</strong> – Whether the field to add should be nullable (default True)</li>
<li><strong>metadata</strong> – Any additional metadata (default None)</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">a new updated StructType</p>
</td>
</tr>
</tbody>
</table>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.fromInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.StructType.fromJson">
<em class="property">classmethod </em><code class="descname">fromJson</code><span class="sig-paren">(</span><em>json</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.fromJson"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.fromJson" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.jsonValue" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.needConversion" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.simpleString" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.toInternal" title="Permalink to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
</div>
<div class="section" id="module-pyspark.sql.functions">
<span id="pyspark-sql-functions-module"></span><h2>pyspark.sql.functions module<a class="headerlink" href="#module-pyspark.sql.functions" title="Permalink to this headline"></a></h2>
<p>A collection of built-in functions.</p>
<dl class="function">
<dt id="pyspark.sql.functions.abs">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">abs</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.abs" title="Permalink to this definition"></a></dt>
<dd><p>Computes the absolute value.</p>
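<p>A minimal sketch, assuming the running <code class="docutils literal"><span class="pre">spark</span></code> session; <code class="docutils literal"><span class="pre">df2</span></code> is a hypothetical two-row DataFrame:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df2 = spark.createDataFrame([(-1,), (2,)], ['v'])
&gt;&gt;&gt; df2.select(abs(df2.v).alias('a')).collect()
[Row(a=1), Row(a=2)]
</pre></div>
</div>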
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.acos">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">acos</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.acos" title="Permalink to this definition"></a></dt>
<dd><p>Computes the cosine inverse of the given value; the returned angle is in the range 0.0 through pi.</p>
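<p>For example (a sketch assuming the running <code class="docutils literal"><span class="pre">spark</span></code> session; <code class="docutils literal"><span class="pre">lit()</span></code> is also from this module, and acos(1.0) is exactly 0.0):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.range(1).select(acos(lit(1.0)).alias('r')).collect()
[Row(r=0.0)]
</pre></div>
</div>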
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.add_months">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">add_months</code><span class="sig-paren">(</span><em>start</em>, <em>months</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#add_months"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.add_months" title="Permalink to this definition"></a></dt>
<dd><p>Returns the date that is <cite>months</cite> months after <cite>start</cite>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">add_months</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;d&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=datetime.date(2015, 5, 8))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.approxCountDistinct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">approxCountDistinct</code><span class="sig-paren">(</span><em>col</em>, <em>rsd=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#approxCountDistinct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.approxCountDistinct" title="Permalink to this definition"></a></dt>
<dd><div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 2.1, use approx_count_distinct instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.approx_count_distinct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">approx_count_distinct</code><span class="sig-paren">(</span><em>col</em>, <em>rsd=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#approx_count_distinct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.approx_count_distinct" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> for the approximate distinct count of <code class="docutils literal"><span class="pre">col</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">approx_count_distinct</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;c&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.array">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">array</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#array"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.array" title="Permalink to this definition"></a></dt>
<dd><p>Creates a new array column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string) or list of <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expressions that have
the same data type.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">array</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;arr&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(arr=[2, 2]), Row(arr=[5, 5])]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">array</span><span class="p">([</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">])</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;arr&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(arr=[2, 2]), Row(arr=[5, 5])]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.array_contains">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">array_contains</code><span class="sig-paren">(</span><em>col</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#array_contains"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.array_contains" title="Permalink to this definition"></a></dt>
<dd><p>Collection function: returns null if the array is null, true if the array contains the
given value, and false otherwise.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – name of column containing array</li>
<li><strong>value</strong> – value to check for in array</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([([</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">,</span> <span class="s2">&quot;c&quot;</span><span class="p">],),</span> <span class="p">([],)],</span> <span class="p">[</span><span class="s1">&#39;data&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">array_contains</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="s2">&quot;a&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(array_contains(data, a)=True), Row(array_contains(data, a)=False)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.asc">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">asc</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.asc" title="Permalink to this definition"></a></dt>
<dd><p>Returns a sort expression based on the ascending order of the given column name.</p>
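<p>A minimal sketch (added for illustration, not from the original docstring), assuming the two-row example <cite>df</cite> (Alice, age 2; Bob, age 5) used throughout this page:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.sort(asc('age')).collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
</pre></div>
</div>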
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.ascii">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">ascii</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.ascii" title="Permalink to this definition"></a></dt>
<dd><p>Computes the numeric value of the first character of the string column.</p>
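<p>For illustration (not part of the original docstring); ‘S’ has code point 83:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([('Spark',)], ['s']).select(ascii('s').alias('a')).collect()
[Row(a=83)]
</pre></div>
</div>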
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.asin">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">asin</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.asin" title="Permalink to this definition"></a></dt>
<dd><p>Computes the inverse sine of the given value; the returned angle is in the range -pi/2 through pi/2.</p>
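<p>An illustrative sketch (added by the editor); asin(1.0) is pi/2:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(1.0,)], ['x']).select(asin('x').alias('r')).collect()
[Row(r=1.5707963267948966)]
</pre></div>
</div>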
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.atan">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">atan</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.atan" title="Permalink to this definition"></a></dt>
<dd><p>Computes the inverse tangent of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.atan2">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">atan2</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.atan2" title="Permalink to this definition"></a></dt>
<dd><p>Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).</p>
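<p>A hedged sketch for illustration; as in <cite>java.lang.Math.atan2</cite>, the first argument is treated as the y coordinate, so atan2(1.0, 1.0) is pi/4:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(1.0, 1.0)], ['y', 'x']).select(atan2('y', 'x').alias('theta')).collect()
[Row(theta=0.7853981633974483)]
</pre></div>
</div>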
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.avg">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">avg</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.avg" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the average of the values in a group.</p>
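<p>For illustration (not from the original docstring), using the example <cite>df</cite> (ages 2 and 5):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.agg(avg('age').alias('avg_age')).collect()
[Row(avg_age=3.5)]
</pre></div>
</div>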
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.base64">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">base64</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.base64" title="Permalink to this definition"></a></dt>
<dd><p>Computes the BASE64 encoding of a binary column and returns it as a string column.</p>
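<p>A minimal sketch (not from the original docstring); the string column is implicitly cast to binary before encoding:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([('Spark',)], ['s']).select(base64('s').alias('b')).collect()
[Row(b='U3Bhcms=')]
</pre></div>
</div>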
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.bin">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">bin</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#bin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.bin" title="Permalink to this definition"></a></dt>
<dd><p>Returns the string representation of the binary value of the given column.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="nb">bin</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;c&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=&#39;10&#39;), Row(c=&#39;101&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.bitwiseNOT">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">bitwiseNOT</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.bitwiseNOT" title="Permalink to this definition"></a></dt>
<dd><p>Computes the bitwise NOT of the given column.</p>
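<p>For illustration (added by the editor), on the example <cite>df</cite> (ages 2 and 5):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.select(bitwiseNOT(df.age).alias('n')).collect()
[Row(n=-3), Row(n=-6)]
</pre></div>
</div>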
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.broadcast">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">broadcast</code><span class="sig-paren">(</span><em>df</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#broadcast"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.broadcast" title="Permalink to this definition"></a></dt>
<dd><p>Marks a DataFrame as small enough for use in broadcast joins.</p>
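<p>A hedged sketch (the small lookup table here is invented for illustration): the hint tells the optimizer to broadcast the smaller side of the join instead of shuffling both sides.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; small = spark.createDataFrame([('Alice', 'F')], ['name', 'gender'])
&gt;&gt;&gt; df.join(broadcast(small), 'name').collect()
[Row(name='Alice', age=2, gender='F')]
</pre></div>
</div>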
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.bround">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">bround</code><span class="sig-paren">(</span><em>col</em>, <em>scale=0</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#bround"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.bround" title="Permalink to this definition"></a></dt>
<dd><p>Rounds the given value to <cite>scale</cite> decimal places using HALF_EVEN rounding mode if <cite>scale</cite> &gt;= 0,
or at the integral part when <cite>scale</cite> &lt; 0.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mf">2.5</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">bround</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=2.0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.cbrt">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">cbrt</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.cbrt" title="Permalink to this definition"></a></dt>
<dd><p>Computes the cube-root of the given value.</p>
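<p>For illustration (not from the original docstring):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(27.0,)], ['x']).select(cbrt('x').alias('r')).collect()
[Row(r=3.0)]
</pre></div>
</div>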
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.ceil">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">ceil</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.ceil" title="Permalink to this definition"></a></dt>
<dd><p>Computes the ceiling of the given value.</p>
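<p>For illustration (added by the editor); note the result is integral:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(2.5,)], ['x']).select(ceil('x').alias('r')).collect()
[Row(r=3)]
</pre></div>
</div>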
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.coalesce">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">coalesce</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#coalesce"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.coalesce" title="Permalink to this definition"></a></dt>
<dd><p>Returns the first column that is not null.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">cDf</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="kc">None</span><span class="p">),</span> <span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="mi">2</span><span class="p">)],</span> <span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">cDf</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+----+</span>
<span class="go">| a| b|</span>
<span class="go">+----+----+</span>
<span class="go">|null|null|</span>
<span class="go">| 1|null|</span>
<span class="go">|null| 2|</span>
<span class="go">+----+----+</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">cDf</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">coalesce</span><span class="p">(</span><span class="n">cDf</span><span class="p">[</span><span class="s2">&quot;a&quot;</span><span class="p">],</span> <span class="n">cDf</span><span class="p">[</span><span class="s2">&quot;b&quot;</span><span class="p">]))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+--------------+</span>
<span class="go">|coalesce(a, b)|</span>
<span class="go">+--------------+</span>
<span class="go">| null|</span>
<span class="go">| 1|</span>
<span class="go">| 2|</span>
<span class="go">+--------------+</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">cDf</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;*&#39;</span><span class="p">,</span> <span class="n">coalesce</span><span class="p">(</span><span class="n">cDf</span><span class="p">[</span><span class="s2">&quot;a&quot;</span><span class="p">],</span> <span class="n">lit</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+----+----------------+</span>
<span class="go">| a| b|coalesce(a, 0.0)|</span>
<span class="go">+----+----+----------------+</span>
<span class="go">|null|null| 0.0|</span>
<span class="go">| 1|null| 1.0|</span>
<span class="go">|null| 2| 0.0|</span>
<span class="go">+----+----+----------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.col">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">col</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.col" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> based on the given column name.</p>
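<p>An illustrative sketch (not from the original docstring): the returned <cite>Column</cite> can be composed into larger expressions:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.select(col('age') + 1).collect()
[Row((age + 1)=3), Row((age + 1)=6)]
</pre></div>
</div>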
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.collect_list">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">collect_list</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.collect_list" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns a list of objects, with duplicates retained.</p>
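<p>A hedged sketch (the DataFrame here is invented for illustration); note that the order of the collected list is not guaranteed after a shuffle:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df2 = spark.createDataFrame([(2,), (5,), (5,)], ['age'])
&gt;&gt;&gt; df2.agg(collect_list('age').alias('ages')).collect()
[Row(ages=[2, 5, 5])]
</pre></div>
</div>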
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.collect_set">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">collect_set</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.collect_set" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns a set of objects with duplicate elements eliminated.</p>
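<p>A hedged sketch (invented for illustration); the result is sorted on the driver because the order of a collected set is non-deterministic:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df2 = spark.createDataFrame([(2,), (5,), (5,)], ['age'])
&gt;&gt;&gt; sorted(df2.agg(collect_set('age').alias('ages')).collect()[0].ages)
[2, 5]
</pre></div>
</div>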
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.column">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">column</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.column" title="Permalink to this definition"></a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> based on the given column name.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.concat">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">concat</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#concat"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.concat" title="Permalink to this definition"></a></dt>
<dd><p>Concatenates multiple input string columns together into a single string column.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;abcd&#39;</span><span class="p">,</span><span class="s1">&#39;123&#39;</span><span class="p">)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,</span> <span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">concat</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=&#39;abcd123&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.concat_ws">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">concat_ws</code><span class="sig-paren">(</span><em>sep</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#concat_ws"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.concat_ws" title="Permalink to this definition"></a></dt>
<dd><p>Concatenates multiple input string columns together into a single string column,
using the given separator.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;abcd&#39;</span><span class="p">,</span><span class="s1">&#39;123&#39;</span><span class="p">)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,</span> <span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">concat_ws</span><span class="p">(</span><span class="s1">&#39;-&#39;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=&#39;abcd-123&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.conv">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">conv</code><span class="sig-paren">(</span><em>col</em>, <em>fromBase</em>, <em>toBase</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#conv"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.conv" title="Permalink to this definition"></a></dt>
<dd><p>Converts a number in a string column from one base to another.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s2">&quot;010101&quot;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;n&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">conv</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">n</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">16</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;hex&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hex=&#39;15&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.corr">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">corr</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#corr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.corr" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> for the Pearson Correlation Coefficient for <code class="docutils literal"><span class="pre">col1</span></code>
and <code class="docutils literal"><span class="pre">col2</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">a</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">b</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span> <span class="o">*</span> <span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">20</span><span class="p">)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="p">[</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">corr</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;c&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=1.0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.cos">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">cos</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.cos" title="Permalink to this definition"></a></dt>
<dd><p>Computes the cosine of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.cosh">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">cosh</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.cosh" title="Permalink to this definition"></a></dt>
<dd><p>Computes the hyperbolic cosine of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.count">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">count</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.count" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the number of items in a group.</p>
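<p>For illustration on the example <cite>df</cite> (added by the editor); <cite>count(col)</cite> counts the non-null values of that column:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.agg(count(df.age).alias('n')).collect()
[Row(n=2)]
</pre></div>
</div>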
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.countDistinct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">countDistinct</code><span class="sig-paren">(</span><em>col</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#countDistinct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.countDistinct" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> for distinct count of <code class="docutils literal"><span class="pre">col</span></code> or <code class="docutils literal"><span class="pre">cols</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">countDistinct</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;c&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=2)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">countDistinct</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">,</span> <span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;c&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.covar_pop">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">covar_pop</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#covar_pop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.covar_pop" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> for the population covariance of <code class="docutils literal"><span class="pre">col1</span></code>
and <code class="docutils literal"><span class="pre">col2</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">b</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="p">[</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">covar_pop</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;c&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=0.0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.covar_samp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">covar_samp</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#covar_samp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.covar_samp" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> for the sample covariance of <code class="docutils literal"><span class="pre">col1</span></code>
and <code class="docutils literal"><span class="pre">col2</span></code>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">b</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="p">[</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">covar_samp</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;c&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=0.0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.crc32">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">crc32</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#crc32"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.crc32" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the cyclic redundancy check value (CRC32) of a binary column and
returns the value as a bigint.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ABC&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">crc32</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;crc32&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(crc32=2743272264)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.create_map">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">create_map</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#create_map"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.create_map" title="Permalink to this definition"></a></dt>
<dd><p>Creates a new map column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string) or list of <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expressions that grouped
as key-value pairs, e.g. (key1, value1, key2, value2, …).</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">create_map</span><span class="p">(</span><span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;age&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;map&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(map={&#39;Alice&#39;: 2}), Row(map={&#39;Bob&#39;: 5})]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">create_map</span><span class="p">([</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">])</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;map&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(map={&#39;Alice&#39;: 2}), Row(map={&#39;Bob&#39;: 5})]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.cume_dist">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">cume_dist</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.cume_dist" title="Permalink to this definition"></a></dt>
<dd><p>Window function: returns the cumulative distribution of values within a window partition,
i.e. the fraction of rows that are at or below the current row.</p>
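<p>A hedged sketch (added by the editor) using an unpartitioned ordered window over the example <cite>df</cite>; Spark warns that this moves all data to one partition, which is fine for a toy example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; from pyspark.sql import Window
&gt;&gt;&gt; w = Window.orderBy('age')
&gt;&gt;&gt; df.select('age', cume_dist().over(w).alias('cd')).collect()
[Row(age=2, cd=0.5), Row(age=5, cd=1.0)]
</pre></div>
</div>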
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.current_date">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">current_date</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#current_date"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.current_date" title="Permalink to this definition"></a></dt>
<dd><p>Returns the current date as a date column.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.current_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">current_timestamp</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#current_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.current_timestamp" title="Permalink to this definition"></a></dt>
<dd><p>Returns the current timestamp as a timestamp column.</p>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.date_add">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">date_add</code><span class="sig-paren">(</span><em>start</em>, <em>days</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#date_add"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.date_add" title="Permalink to this definition"></a></dt>
<dd><p>Returns the date that is <cite>days</cite> days after <cite>start</cite>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">date_add</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;d&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=datetime.date(2015, 4, 9))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.date_format">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">date_format</code><span class="sig-paren">(</span><em>date</em>, <em>format</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#date_format"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.date_format" title="Permalink to this definition"></a></dt>
<dd><p>Converts a date/timestamp/string to a string value formatted according to the date
format given by the second argument.</p>
<p>A pattern could be for instance <cite>dd.MM.yyyy</cite> and could return a string like ‘18.03.1993’. All
pattern letters of the Java class <cite>java.text.SimpleDateFormat</cite> can be used.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Use when ever possible specialized functions like <cite>year</cite>. These benefit from a
specialized implementation.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">date_format</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s1">&#39;MM/dd/yyy&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=&#39;04/08/2015&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.date_sub">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">date_sub</code><span class="sig-paren">(</span><em>start</em>, <em>days</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#date_sub"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.date_sub" title="Permalink to this definition"></a></dt>
<dd><p>Returns the date that is <cite>days</cite> days before <cite>start</cite>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">date_sub</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;d&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=datetime.date(2015, 4, 7))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.datediff">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">datediff</code><span class="sig-paren">(</span><em>end</em>, <em>start</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#datediff"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.datediff" title="Permalink to this definition"></a></dt>
<dd><p>Returns the number of days from <cite>start</cite> to <cite>end</cite>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,</span><span class="s1">&#39;2015-05-10&#39;</span><span class="p">)],</span> <span class="p">[</span><span class="s1">&#39;d1&#39;</span><span class="p">,</span> <span class="s1">&#39;d2&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">datediff</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">d1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;diff&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(diff=32)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.dayofmonth">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">dayofmonth</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#dayofmonth"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.dayofmonth" title="Permalink to this definition"></a></dt>
<dd><p>Extracts the day of the month of a given date as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">dayofmonth</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;day&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(day=8)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.dayofyear">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">dayofyear</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#dayofyear"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.dayofyear" title="Permalink to this definition"></a></dt>
<dd><p>Extracts the day of the year of a given date as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">dayofyear</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;day&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(day=98)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.decode">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">decode</code><span class="sig-paren">(</span><em>col</em>, <em>charset</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#decode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.decode" title="Permalink to this definition"></a></dt>
<dd><p>Decodes the first argument from binary into a string using the provided character set
(one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).</p>
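<p>A minimal sketch (invented for illustration), decoding a binary column back to a string:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df2 = spark.createDataFrame([(bytearray(b'Spark'),)], ['b'])
&gt;&gt;&gt; df2.select(decode('b', 'UTF-8').alias('s')).collect()
[Row(s='Spark')]
</pre></div>
</div>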
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.degrees">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">degrees</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.degrees" title="Permalink to this definition"></a></dt>
<dd><p>Converts an angle measured in radians to an approximately equivalent angle measured in degrees.</p>
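<p>For illustration (not from the original docstring); pi radians is 180 degrees:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; import math
&gt;&gt;&gt; spark.createDataFrame([(math.pi,)], ['r']).select(degrees('r').alias('deg')).collect()
[Row(deg=180.0)]
</pre></div>
</div>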
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.dense_rank">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">dense_rank</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.dense_rank" title="Permalink to this definition"></a></dt>
<dd><p>Window function: returns the rank of rows within a window partition, without any gaps.</p>
<p>The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking
sequence when there are ties. That is, if you were ranking a competition using dense_rank
and had three people tie for second place, you would say that all three were in second
place and that the next person came in third. Rank, by contrast, gives sequential numbers,
so the person who came in after the ties would register as coming in fifth.</p>
<p>This is equivalent to the DENSE_RANK function in SQL.</p>
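<p>A hedged sketch (the DataFrame is invented for illustration): with two rows tied for first, dense_rank assigns 1, 1, 2, whereas rank would assign 1, 1, 3:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; from pyspark.sql import Window
&gt;&gt;&gt; df2 = spark.createDataFrame([(1,), (1,), (2,)], ['v'])
&gt;&gt;&gt; w = Window.orderBy('v')
&gt;&gt;&gt; df2.select('v', dense_rank().over(w).alias('dr')).collect()
[Row(v=1, dr=1), Row(v=1, dr=1), Row(v=2, dr=2)]
</pre></div>
</div>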
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.desc">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">desc</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.desc" title="Permalink to this definition"></a></dt>
<dd><p>Returns a sort expression based on the descending order of the given column name.</p>
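<p>For illustration on the example <cite>df</cite> (added by the editor):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df.sort(desc('age')).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
</pre></div>
</div>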
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.encode">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">encode</code><span class="sig-paren">(</span><em>col</em>, <em>charset</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#encode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.encode" title="Permalink to this definition"></a></dt>
<dd><p>Encodes the first argument from a string into binary using the provided character set
(one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).</p>
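<p>A minimal sketch (invented for illustration); binary values come back to Python as <cite>bytearray</cite>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([('Spark',)], ['s']).select(encode('s', 'UTF-8').alias('b')).collect()
[Row(b=bytearray(b'Spark'))]
</pre></div>
</div>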
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.exp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">exp</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.exp" title="Permalink to this definition"></a></dt>
<dd><p>Computes the exponential of the given value.</p>
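<p>For illustration (added by the editor); exp(1.0) is Euler’s number:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(1.0,)], ['x']).select(exp('x').alias('e')).collect()
[Row(e=2.718281828459045)]
</pre></div>
</div>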
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.explode">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">explode</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#explode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.explode" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new row for each element in the given array or map.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">eDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([</span><span class="n">Row</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">intlist</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span> <span class="n">mapfield</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;a&quot;</span><span class="p">:</span> <span class="s2">&quot;b&quot;</span><span class="p">})])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">eDF</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">explode</span><span class="p">(</span><span class="n">eDF</span><span class="o">.</span><span class="n">intlist</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;anInt&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(anInt=1), Row(anInt=2), Row(anInt=3)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">eDF</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">explode</span><span class="p">(</span><span class="n">eDF</span><span class="o">.</span><span class="n">mapfield</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|key|value|</span>
<span class="go">+---+-----+</span>
<span class="go">| a| b|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.expm1">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">expm1</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.expm1" title="Permalink to this definition"></a></dt>
<dd><p>Computes the exponential of the given value minus one.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.expr">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">expr</code><span class="sig-paren">(</span><em>str</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#expr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.expr" title="Permalink to this definition"></a></dt>
<dd><p>Parses the expression string into the column that it represents.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">expr</span><span class="p">(</span><span class="s2">&quot;length(name)&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(length(name)=5), Row(length(name)=3)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.factorial">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">factorial</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#factorial"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.factorial" title="Permalink to this definition"></a></dt>
<dd><p>Computes the factorial of the given value.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">5</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;n&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">factorial</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">n</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;f&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(f=120)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.first">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">first</code><span class="sig-paren">(</span><em>col</em>, <em>ignorenulls=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#first"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.first" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the first value in a group.</p>
<p>The function by default returns the first value it sees. It will return the first non-null
value it sees when ignoreNulls is set to true. If all values are null, then null is returned.</p>
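<p>A hedged sketch: the result depends on row order, so this assumes a small local DataFrame whose input order is preserved:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql.types import StructType, StructField, IntegerType
<span class="gp">&gt;&gt;&gt; </span>schema = StructType([StructField("x", IntegerType())])
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(None,), (1,), (2,)], schema)
<span class="gp">&gt;&gt;&gt; </span>df.select(first(df.x).alias('a'), first(df.x, ignorenulls=True).alias('b')).collect()
<span class="go">[Row(a=None, b=1)]</span>
</pre></div>
</div>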
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.floor">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">floor</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.floor" title="Permalink to this definition"></a></dt>
<dd><p>Computes the floor of the given value.</p>
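<p>For illustration (the floor of a double is returned as a long):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(2.5,)], ['x'])
<span class="gp">&gt;&gt;&gt; </span>df.select(floor(df.x).alias('v')).collect()
<span class="go">[Row(v=2)]</span>
</pre></div>
</div>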
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.format_number">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">format_number</code><span class="sig-paren">(</span><em>col</em>, <em>d</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#format_number"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.format_number" title="Permalink to this definition"></a></dt>
<dd><p>Formats the number X to a format like ‘#,--#,--#.--’, rounded to d decimal places
with HALF_EVEN round mode, and returns the result as a string.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – the column name of the numeric value to be formatted</li>
<li><strong>d</strong> – the number of decimal places to round to</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">5</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">format_number</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;v&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(v=&#39;5.0000&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.format_string">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">format_string</code><span class="sig-paren">(</span><em>format</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#format_string"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.format_string" title="Permalink to this definition"></a></dt>
<dd><p>Formats the arguments in printf-style and returns the result as a string column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>format</strong> – string that can contain embedded format tags and used as result column’s value</li>
<li><strong>cols</strong> – list of column names (string) or list of <code class="docutils literal"><span class="pre">Column</span></code> expressions to be used in formatting</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">5</span><span class="p">,</span> <span class="s2">&quot;hello&quot;</span><span class="p">)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">format_string</span><span class="p">(</span><span class="s1">&#39;</span><span class="si">%d</span><span class="s1"> </span><span class="si">%s</span><span class="s1">&#39;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">b</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;v&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(v=&#39;5 hello&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.from_json">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">from_json</code><span class="sig-paren">(</span><em>col</em>, <em>schema</em>, <em>options={}</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#from_json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.from_json" title="Permalink to this definition"></a></dt>
<dd><p>Parses a column containing a JSON string into a <cite>StructType</cite> or <cite>ArrayType</cite>
of <cite>StructType</cite>s with the specified schema. Returns <cite>null</cite> in the case of an unparseable
string.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – string column in json format</li>
<li><strong>schema</strong> – a StructType or ArrayType of StructType to use when parsing the json column</li>
<li><strong>options</strong> – options to control parsing. Accepts the same options as the JSON datasource</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;&#39;&#39;{&quot;a&quot;: 1}&#39;&#39;&#39;</span><span class="p">)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">())])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">from_json</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">value</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;json&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(json=Row(a=1))]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">&#39;&#39;&#39;[{&quot;a&quot;: 1}]&#39;&#39;&#39;</span><span class="p">)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">schema</span> <span class="o">=</span> <span class="n">ArrayType</span><span class="p">(</span><span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">())]))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">from_json</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">value</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;json&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(json=[Row(a=1)])]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.from_unixtime">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">from_unixtime</code><span class="sig-paren">(</span><em>timestamp</em>, <em>format='yyyy-MM-dd HH:mm:ss'</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#from_unixtime"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.from_unixtime" title="Permalink to this definition"></a></dt>
<dd><p>Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string
representing the timestamp of that moment in the current system time zone in the given
format.</p>
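<p>A sketch for illustration; because the result depends on the session time zone, it is pinned here via <code class="docutils literal"><span class="pre">spark.sql.session.timeZone</span></code> so the output is reproducible:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1428476400,)], ['unix_time'])
<span class="gp">&gt;&gt;&gt; </span>df.select(from_unixtime('unix_time').alias('ts')).collect()
<span class="go">[Row(ts='2015-04-08 00:00:00')]</span>
</pre></div>
</div>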
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.from_utc_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">from_utc_timestamp</code><span class="sig-paren">(</span><em>timestamp</em>, <em>tz</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#from_utc_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.from_utc_timestamp" title="Permalink to this definition"></a></dt>
<dd><p>Given a timestamp, which corresponds to a certain time of day in UTC, returns another timestamp
that corresponds to the same time of day in the given timezone.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-28 10:30:00&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;t&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">from_utc_timestamp</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">,</span> <span class="s2">&quot;PST&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;t&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(t=datetime.datetime(1997, 2, 28, 2, 30))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.get_json_object">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">get_json_object</code><span class="sig-paren">(</span><em>col</em>, <em>path</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#get_json_object"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.get_json_object" title="Permalink to this definition"></a></dt>
<dd><p>Extracts a JSON object from a JSON string based on the specified JSON path, and returns a JSON
string of the extracted object. It will return null if the input JSON string is invalid.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – string column in json format</li>
<li><strong>path</strong> – path to the json object to extract</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="s2">&quot;1&quot;</span><span class="p">,</span> <span class="s1">&#39;&#39;&#39;{&quot;f1&quot;: &quot;value1&quot;, &quot;f2&quot;: &quot;value2&quot;}&#39;&#39;&#39;</span><span class="p">),</span> <span class="p">(</span><span class="s2">&quot;2&quot;</span><span class="p">,</span> <span class="s1">&#39;&#39;&#39;{&quot;f1&quot;: &quot;value12&quot;}&#39;&#39;&#39;</span><span class="p">)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="s2">&quot;jstring&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">key</span><span class="p">,</span> <span class="n">get_json_object</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">jstring</span><span class="p">,</span> <span class="s1">&#39;$.f1&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;c0&quot;</span><span class="p">),</span> \
<span class="gp">... </span> <span class="n">get_json_object</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">jstring</span><span class="p">,</span> <span class="s1">&#39;$.f2&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;c1&quot;</span><span class="p">)</span> <span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(key=&#39;1&#39;, c0=&#39;value1&#39;, c1=&#39;value2&#39;), Row(key=&#39;2&#39;, c0=&#39;value12&#39;, c1=None)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.greatest">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">greatest</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#greatest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.greatest" title="Permalink to this definition"></a></dt>
<dd><p>Returns the greatest value of the list of column names, skipping null values.
This function takes at least 2 parameters. It will return null iff all parameters are null.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">,</span> <span class="s1">&#39;c&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">greatest</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">b</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;greatest&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(greatest=4)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.grouping">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">grouping</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#grouping"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.grouping" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated
or not, returning 1 for aggregated and 0 for not aggregated in the result set.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">cube</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">grouping</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">),</span> <span class="nb">sum</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+--------------+--------+</span>
<span class="go">| name|grouping(name)|sum(age)|</span>
<span class="go">+-----+--------------+--------+</span>
<span class="go">| null| 1| 7|</span>
<span class="go">|Alice| 0| 2|</span>
<span class="go">| Bob| 0| 5|</span>
<span class="go">+-----+--------------+--------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.grouping_id">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">grouping_id</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#grouping_id"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.grouping_id" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the level of grouping, equal to</p>
<blockquote>
<div>(grouping(c1) &lt;&lt; (n-1)) + (grouping(c2) &lt;&lt; (n-2)) + … + grouping(cn)</div></blockquote>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The list of columns should match with grouping columns exactly, or empty (means all
the grouping columns).</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">cube</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">grouping_id</span><span class="p">(),</span> <span class="nb">sum</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+-------------+--------+</span>
<span class="go">| name|grouping_id()|sum(age)|</span>
<span class="go">+-----+-------------+--------+</span>
<span class="go">| null| 1| 7|</span>
<span class="go">|Alice| 0| 2|</span>
<span class="go">| Bob| 0| 5|</span>
<span class="go">+-----+-------------+--------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.hash">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">hash</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#hash"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.hash" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the hash code of the given columns, and returns the result as an int column.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ABC&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="nb">hash</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;hash&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hash=-757602832)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.hex">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">hex</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#hex"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.hex" title="Permalink to this definition"></a></dt>
<dd><p>Computes hex value of the given column, which could be <a class="reference internal" href="#pyspark.sql.types.StringType" title="pyspark.sql.types.StringType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StringType</span></code></a>,
<a class="reference internal" href="#pyspark.sql.types.BinaryType" title="pyspark.sql.types.BinaryType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.BinaryType</span></code></a>, <a class="reference internal" href="#pyspark.sql.types.IntegerType" title="pyspark.sql.types.IntegerType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.IntegerType</span></code></a> or
<a class="reference internal" href="#pyspark.sql.types.LongType" title="pyspark.sql.types.LongType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.LongType</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ABC&#39;</span><span class="p">,</span> <span class="mi">3</span><span class="p">)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="nb">hex</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">),</span> <span class="nb">hex</span><span class="p">(</span><span class="s1">&#39;b&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hex(a)=&#39;414243&#39;, hex(b)=&#39;3&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.hour">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">hour</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#hour"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.hour" title="Permalink to this definition"></a></dt>
<dd><p>Extracts the hour of a given timestamp as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08 13:08:15&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">hour</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;hour&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hour=13)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.hypot">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">hypot</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.hypot" title="Permalink to this definition"></a></dt>
<dd><p>Computes <code class="docutils literal"><span class="pre">sqrt(a^2</span> <span class="pre">+</span> <span class="pre">b^2)</span></code> without intermediate overflow or underflow.</p>
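<p>For example, with the classic 3-4-5 right triangle:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(3.0, 4.0)], ['a', 'b'])
<span class="gp">&gt;&gt;&gt; </span>df.select(hypot(df.a, df.b).alias('h')).collect()
<span class="go">[Row(h=5.0)]</span>
</pre></div>
</div>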
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.initcap">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">initcap</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#initcap"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.initcap" title="Permalink to this definition"></a></dt>
<dd><p>Translates the first letter of each word in the sentence to upper case.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ab cd&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">initcap</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;v&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(v=&#39;Ab Cd&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.input_file_name">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">input_file_name</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#input_file_name"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.input_file_name" title="Permalink to this definition"></a></dt>
<dd><p>Creates a string column for the file name of the current Spark task.</p>
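<p>A hedged sketch; the path is hypothetical and the returned URI depends on where the file actually lives:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df = spark.read.text("file:///tmp/people.txt")  # hypothetical input file
<span class="gp">&gt;&gt;&gt; </span>df.select(input_file_name().alias('f')).first()  # file each row was read from
<span class="go">Row(f='file:///tmp/people.txt')</span>
</pre></div>
</div>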
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.instr">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">instr</code><span class="sig-paren">(</span><em>str</em>, <em>substr</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#instr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.instr" title="Permalink to this definition"></a></dt>
<dd><p>Locates the position of the first occurrence of substr in the given string column.
Returns null if either of the arguments is null.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The position is not zero based, but 1 based index. Returns 0 if substr
could not be found in str.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;abcd&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">instr</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.isnan">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">isnan</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#isnan"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.isnan" title="Permalink to this definition"></a></dt>
<dd><p>An expression that returns true iff the column is NaN.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mf">1.0</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;nan&#39;</span><span class="p">)),</span> <span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="s1">&#39;nan&#39;</span><span class="p">),</span> <span class="mf">2.0</span><span class="p">)],</span> <span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">isnan</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;r1&quot;</span><span class="p">),</span> <span class="n">isnan</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;r2&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r1=False, r2=False), Row(r1=True, r2=True)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.isnull">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">isnull</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#isnull"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.isnull" title="Permalink to this definition"></a></dt>
<dd><p>An expression that returns true iff the column is null.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="kc">None</span><span class="p">),</span> <span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="mi">2</span><span class="p">)],</span> <span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">isnull</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;r1&quot;</span><span class="p">),</span> <span class="n">isnull</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;r2&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r1=False, r2=False), Row(r1=True, r2=True)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.json_tuple">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">json_tuple</code><span class="sig-paren">(</span><em>col</em>, <em>*fields</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#json_tuple"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.json_tuple" title="Permalink to this definition"></a></dt>
<dd><p>Creates a new row for a json column according to the given field names.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – string column in json format</li>
<li><strong>fields</strong> – list of fields to extract</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="s2">&quot;1&quot;</span><span class="p">,</span> <span class="s1">&#39;&#39;&#39;{&quot;f1&quot;: &quot;value1&quot;, &quot;f2&quot;: &quot;value2&quot;}&#39;&#39;&#39;</span><span class="p">),</span> <span class="p">(</span><span class="s2">&quot;2&quot;</span><span class="p">,</span> <span class="s1">&#39;&#39;&#39;{&quot;f1&quot;: &quot;value12&quot;}&#39;&#39;&#39;</span><span class="p">)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="s2">&quot;jstring&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">key</span><span class="p">,</span> <span class="n">json_tuple</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">jstring</span><span class="p">,</span> <span class="s1">&#39;f1&#39;</span><span class="p">,</span> <span class="s1">&#39;f2&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(key=&#39;1&#39;, c0=&#39;value1&#39;, c1=&#39;value2&#39;), Row(key=&#39;2&#39;, c0=&#39;value12&#39;, c1=None)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.kurtosis">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">kurtosis</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.kurtosis" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the kurtosis of the values in a group.</p>
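<p>For illustration, assuming Spark’s excess-kurtosis semantics, under which a flat three-point sample gives -1.5:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ['x'])
<span class="gp">&gt;&gt;&gt; </span>df.select(kurtosis(df.x).alias('k')).collect()
<span class="go">[Row(k=-1.5)]</span>
</pre></div>
</div>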
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.lag">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lag</code><span class="sig-paren">(</span><em>col</em>, <em>count=1</em>, <em>default=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#lag"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.lag" title="Permalink to this definition"></a></dt>
<dd><p>Window function: returns the value that is <cite>count</cite> rows before the current row, and
<cite>default</cite> if there are fewer than <cite>count</cite> rows before the current row. For example,
a <cite>count</cite> of one will return the previous row at any given point in the window partition.</p>
<p>This is equivalent to the LAG function in SQL.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – name of column or expression</li>
<li><strong>count</strong> – number of rows back from the current row</li>
<li><strong>default</strong> – default value</li>
</ul>
</td>
</tr>
</tbody>
</table>
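<p>A sketch of windowed usage, assuming a window spec built with <cite>Window</cite>; the third argument fills rows that have no predecessor:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql import Window
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1, 'a'), (2, 'a'), (3, 'b')], ['id', 'grp'])
<span class="gp">&gt;&gt;&gt; </span>w = Window.partitionBy('grp').orderBy('id')
<span class="gp">&gt;&gt;&gt; </span>df.select('id', lag('id', 1, 0).over(w).alias('prev')).orderBy('id').collect()
<span class="go">[Row(id=1, prev=0), Row(id=2, prev=1), Row(id=3, prev=0)]</span>
</pre></div>
</div>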
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.last">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">last</code><span class="sig-paren">(</span><em>col</em>, <em>ignorenulls=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#last"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.last" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the last value in a group.</p>
<p>The function by default returns the last value it sees. It will return the last non-null
value it sees when ignoreNulls is set to true. If all values are null, then null is returned.</p>
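<p>A hedged sketch, mirroring <cite>first</cite> above; again the result depends on row order, so it assumes a small local DataFrame whose input order is preserved:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql.types import StructType, StructField, IntegerType
<span class="gp">&gt;&gt;&gt; </span>schema = StructType([StructField("x", IntegerType())])
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1,), (2,), (None,)], schema)
<span class="gp">&gt;&gt;&gt; </span>df.select(last(df.x).alias('a'), last(df.x, ignorenulls=True).alias('b')).collect()
<span class="go">[Row(a=None, b=2)]</span>
</pre></div>
</div>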
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.last_day">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">last_day</code><span class="sig-paren">(</span><em>date</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#last_day"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.last_day" title="Permalink to this definition"></a></dt>
<dd><p>Returns the last day of the month which the given date belongs to.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-10&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">last_day</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=datetime.date(1997, 2, 28))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.lead">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lead</code><span class="sig-paren">(</span><em>col</em>, <em>count=1</em>, <em>default=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#lead"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.lead" title="Permalink to this definition"></a></dt>
<dd><p>Window function: returns the value that is <cite>count</cite> rows after the current row, and
<cite>default</cite> if there are fewer than <cite>count</cite> rows after the current row. For example,
a <cite>count</cite> of one will return the next row at any given point in the window partition.</p>
<p>This is equivalent to the LEAD function in SQL.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – name of column or expression</li>
<li><strong>count</strong> – number of rows ahead of the current row</li>
<li><strong>default</strong> – default value</li>
</ul>
</td>
</tr>
</tbody>
</table>
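<p>A sketch of windowed usage, analogous to <cite>lag</cite> above; the third argument fills rows that have no successor:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql import Window
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1, 'a'), (2, 'a'), (3, 'b')], ['id', 'grp'])
<span class="gp">&gt;&gt;&gt; </span>w = Window.partitionBy('grp').orderBy('id')
<span class="gp">&gt;&gt;&gt; </span>df.select('id', lead('id', 1, 0).over(w).alias('next')).orderBy('id').collect()
<span class="go">[Row(id=1, next=2), Row(id=2, next=0), Row(id=3, next=0)]</span>
</pre></div>
</div>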
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.least">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">least</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#least"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.least" title="Permalink to this definition"></a></dt>
<dd><p>Returns the least value of the list of column names, skipping null values.
This function takes at least 2 parameters. It will return null iff all parameters are null.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">,</span> <span class="s1">&#39;c&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">least</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">b</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;least&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(least=1)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.length">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">length</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#length"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.length" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the length of a string or binary expression.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ABC&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">length</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;length&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(length=3)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.levenshtein">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">levenshtein</code><span class="sig-paren">(</span><em>left</em>, <em>right</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#levenshtein"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.levenshtein" title="Permalink to this definition"></a></dt>
<dd><p>Computes the Levenshtein distance of the two given strings.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df0</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;kitten&#39;</span><span class="p">,</span> <span class="s1">&#39;sitting&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;l&#39;</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df0</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">levenshtein</span><span class="p">(</span><span class="s1">&#39;l&#39;</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;d&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=3)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.lit">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lit</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.lit" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> of literal value.</p>
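<p>For illustration, attaching constant columns to a one-row DataFrame:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.range(1).select(lit(5).alias('five'), lit('abc').alias('s')).collect()
<span class="go">[Row(five=5, s='abc')]</span>
</pre></div>
</div>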
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.locate">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">locate</code><span class="sig-paren">(</span><em>substr</em>, <em>str</em>, <em>pos=1</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#locate"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.locate" title="Permalink to this definition"></a></dt>
<dd><p>Locates the position of the first occurrence of substr in a string column, after position pos.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The position is not zero based, but 1 based index. Returns 0 if substr
could not be found in str.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>substr</strong> – a string</li>
<li><strong>str</strong> – a Column of <a class="reference internal" href="#pyspark.sql.types.StringType" title="pyspark.sql.types.StringType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StringType</span></code></a></li>
<li><strong>pos</strong> – start position (1 based, consistent with the note above)</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;abcd&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">locate</span><span class="p">(</span><span class="s1">&#39;b&#39;</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.log">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">log</code><span class="sig-paren">(</span><em>arg1</em>, <em>arg2=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#log"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.log" title="Permalink to this definition"></a></dt>
<dd><p>Returns the logarithm of the second argument in the base given by the first argument.</p>
<p>If there is only one argument, then this takes the natural logarithm of the argument.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">log</span><span class="p">(</span><span class="mf">10.0</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;ten&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">ten</span><span class="p">)[:</span><span class="mi">7</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[&#39;0.30102&#39;, &#39;0.69897&#39;]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">log</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;e&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">e</span><span class="p">)[:</span><span class="mi">7</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[&#39;0.69314&#39;, &#39;1.60943&#39;]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.log10">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">log10</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.log10" title="Permalink to this definition"></a></dt>
<dd><p>Computes the logarithm of the given value in Base 10.</p>
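<p>A minimal sketch for illustration (not in the original docstring), using the shared <code class="docutils literal"><span class="pre">spark</span></code> session:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([(100.0,)], ['a']).select(log10('a').alias('r')).collect()
<span class="go">[Row(r=2.0)]</span>
</pre></div>
</div>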
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.log1p">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">log1p</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.log1p" title="Permalink to this definition"></a></dt>
<dd><p>Computes the natural logarithm of the given value plus one.</p>
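<p>An illustrative sketch, not from the original docs; it computes ln(1 + x):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([(0.0,)], ['a']).select(log1p('a').alias('r')).collect()  # ln(1 + 0.0)
<span class="go">[Row(r=0.0)]</span>
</pre></div>
</div>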
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.log2">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">log2</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#log2"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.log2" title="Permalink to this definition"></a></dt>
<dd><p>Returns the base-2 logarithm of the argument.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">4</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">log2</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;log2&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(log2=2.0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.lower">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lower</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.lower" title="Permalink to this definition"></a></dt>
<dd><p>Converts a string column to lower case.</p>
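<p>A usage sketch for illustration (not part of the original docstring):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([('Spark SQL',)], ['s']).select(lower('s').alias('s')).collect()
<span class="go">[Row(s='spark sql')]</span>
</pre></div>
</div>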
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.lpad">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lpad</code><span class="sig-paren">(</span><em>col</em>, <em>len</em>, <em>pad</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#lpad"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.lpad" title="Permalink to this definition"></a></dt>
<dd><p>Left-pad the string column to width <cite>len</cite> with <cite>pad</cite>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;abcd&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">lpad</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="s1">&#39;#&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=&#39;##abcd&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.ltrim">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">ltrim</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.ltrim" title="Permalink to this definition"></a></dt>
<dd><p>Trim the spaces from the left end of the specified string value.</p>
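<p>An illustrative sketch (not in the original docstring):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([('   Spark',)], ['s']).select(ltrim('s').alias('s')).collect()
<span class="go">[Row(s='Spark')]</span>
</pre></div>
</div>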
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.max">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">max</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.max" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the maximum value of the expression in a group.</p>
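<p>A minimal sketch for illustration (not from the original docs); the data is hypothetical:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1,), (2,), (3,)], ['v'])  # hypothetical data
<span class="gp">&gt;&gt;&gt; </span>df.select(max('v').alias('max')).collect()
<span class="go">[Row(max=3)]</span>
</pre></div>
</div>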
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.md5">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">md5</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#md5"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.md5" title="Permalink to this definition"></a></dt>
<dd><p>Calculates the MD5 digest and returns the value as a 32 character hex string.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ABC&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">md5</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;hash&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hash=&#39;902fbdd2b1df0c4f70b4a5d23525e932&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.mean">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">mean</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.mean" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the average of the values in a group.</p>
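<p>For illustration only (not part of the original docstring); note the result is a double even for integer input:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1,), (2,), (3,)], ['v'])  # hypothetical data
<span class="gp">&gt;&gt;&gt; </span>df.select(mean('v').alias('avg')).collect()
<span class="go">[Row(avg=2.0)]</span>
</pre></div>
</div>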
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.min">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">min</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.min" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the minimum value of the expression in a group.</p>
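<p>An illustrative sketch (not in the original docstring); the data is hypothetical:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1,), (2,), (3,)], ['v'])  # hypothetical data
<span class="gp">&gt;&gt;&gt; </span>df.select(min('v').alias('min')).collect()
<span class="go">[Row(min=1)]</span>
</pre></div>
</div>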
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.minute">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">minute</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#minute"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.minute" title="Permalink to this definition"></a></dt>
<dd><p>Extract the minutes of a given date as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08 13:08:15&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">minute</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;minute&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(minute=8)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.monotonically_increasing_id">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">monotonically_increasing_id</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#monotonically_increasing_id"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.monotonically_increasing_id" title="Permalink to this definition"></a></dt>
<dd><p>A column that generates monotonically increasing 64-bit integers.</p>
<p>The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
The current implementation puts the partition ID in the upper 31 bits, and the record number
within each partition in the lower 33 bits. The assumption is that the data frame has
less than 1 billion partitions, and each partition has less than 8 billion records.</p>
<p>As an example, consider a <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code> with two partitions, each with 3 records.
This expression would return the following IDs:
0, 1, 2, 8589934592 (1L &lt;&lt; 33), 8589934593, 8589934594.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df0</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">mapPartitions</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">([</span><span class="s1">&#39;col1&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df0</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">monotonically_increasing_id</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;id&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(id=0), Row(id=1), Row(id=2), Row(id=8589934592), Row(id=8589934593), Row(id=8589934594)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.month">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">month</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#month"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.month" title="Permalink to this definition"></a></dt>
<dd><p>Extract the month of a given date as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">month</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;month&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(month=4)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.months_between">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">months_between</code><span class="sig-paren">(</span><em>date1</em>, <em>date2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#months_between"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.months_between" title="Permalink to this definition"></a></dt>
<dd><p>Returns the number of months between date1 and date2.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-28 10:30:00&#39;</span><span class="p">,</span> <span class="s1">&#39;1996-10-30&#39;</span><span class="p">)],</span> <span class="p">[</span><span class="s1">&#39;t&#39;</span><span class="p">,</span> <span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">months_between</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;months&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(months=3.9495967...)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.nanvl">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">nanvl</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#nanvl"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.nanvl" title="Permalink to this definition"></a></dt>
<dd><p>Returns col1 if it is not NaN, or col2 if col1 is NaN.</p>
<p>Both inputs should be floating point columns (DoubleType or FloatType).</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mf">1.0</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s1">&#39;nan&#39;</span><span class="p">)),</span> <span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="s1">&#39;nan&#39;</span><span class="p">),</span> <span class="mf">2.0</span><span class="p">)],</span> <span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">nanvl</span><span class="p">(</span><span class="s2">&quot;a&quot;</span><span class="p">,</span> <span class="s2">&quot;b&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;r1&quot;</span><span class="p">),</span> <span class="n">nanvl</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">b</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;r2&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.next_day">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">next_day</code><span class="sig-paren">(</span><em>date</em>, <em>dayOfWeek</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#next_day"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.next_day" title="Permalink to this definition"></a></dt>
<dd><p>Returns the first date which is later than the value of the date column and falls on the specified day of the week.</p>
<dl class="docutils">
<dt>The day-of-week parameter is case-insensitive, and accepts:</dt>
<dd>“Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”, “Sun”.</dd>
</dl>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-07-27&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">next_day</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="s1">&#39;Sun&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=datetime.date(2015, 8, 2))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.ntile">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">ntile</code><span class="sig-paren">(</span><em>n</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#ntile"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.ntile" title="Permalink to this definition"></a></dt>
<dd><p>Window function: returns the ntile group id (from 1 to <cite>n</cite> inclusive)
in an ordered window partition. For example, if <cite>n</cite> is 4, the first
quarter of the rows will get value 1, the second quarter will get 2,
the third quarter will get 3, and the last quarter will get 4.</p>
<p>This is equivalent to the NTILE function in SQL.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>n</strong> – an integer</td>
</tr>
</tbody>
</table>
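<p>An illustrative sketch (not from the original docs) using a global ordering window; real workloads would normally add <code class="docutils literal"><span class="pre">partitionBy</span></code> to avoid pulling all rows into one partition:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql import Window
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ['v'])  # hypothetical data
<span class="gp">&gt;&gt;&gt; </span>df.select('v', ntile(2).over(Window.orderBy('v')).alias('n')).collect()
<span class="go">[Row(v=1, n=1), Row(v=2, n=1), Row(v=3, n=2), Row(v=4, n=2)]</span>
</pre></div>
</div>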
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.percent_rank">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">percent_rank</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.percent_rank" title="Permalink to this definition"></a></dt>
<dd><p>Window function: returns the relative rank (i.e. percentile) of rows within a window partition.</p>
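<p>A hedged sketch (not in the original docstring): percent_rank computes (rank - 1) / (rows in partition - 1), shown here over a global ordering window:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql import Window
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1,), (1,), (2,)], ['v'])  # hypothetical data
<span class="gp">&gt;&gt;&gt; </span>df.select('v', percent_rank().over(Window.orderBy('v')).alias('pr')).collect()
<span class="go">[Row(v=1, pr=0.0), Row(v=1, pr=0.0), Row(v=2, pr=1.0)]</span>
</pre></div>
</div>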
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.posexplode">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">posexplode</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#posexplode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.posexplode" title="Permalink to this definition"></a></dt>
<dd><p>Returns a new row for each element with position in the given array or map.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">eDF</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([</span><span class="n">Row</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">intlist</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span> <span class="n">mapfield</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;a&quot;</span><span class="p">:</span> <span class="s2">&quot;b&quot;</span><span class="p">})])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">eDF</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">posexplode</span><span class="p">(</span><span class="n">eDF</span><span class="o">.</span><span class="n">intlist</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(pos=0, col=1), Row(pos=1, col=2), Row(pos=2, col=3)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">eDF</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">posexplode</span><span class="p">(</span><span class="n">eDF</span><span class="o">.</span><span class="n">mapfield</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+---+-----+</span>
<span class="go">|pos|key|value|</span>
<span class="go">+---+---+-----+</span>
<span class="go">| 0| a| b|</span>
<span class="go">+---+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.pow">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">pow</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.pow" title="Permalink to this definition"></a></dt>
<dd><p>Returns the value of the first argument raised to the power of the second argument.</p>
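<p>A minimal sketch for illustration (not part of the original docstring):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([(2.0, 10.0)], ['base', 'exp']).select(pow('base', 'exp').alias('p')).collect()
<span class="go">[Row(p=1024.0)]</span>
</pre></div>
</div>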
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.quarter">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">quarter</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#quarter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.quarter" title="Permalink to this definition"></a></dt>
<dd><p>Extract the quarter of a given date as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">quarter</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;quarter&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(quarter=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.radians">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">radians</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.radians" title="Permalink to this definition"></a></dt>
<dd><p>Converts an angle measured in degrees to an approximately equivalent angle measured in radians.</p>
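<p>An illustrative sketch (not in the original docs): 180 degrees maps to pi radians:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([(180.0,)], ['deg']).select(radians('deg').alias('rad')).collect()
<span class="go">[Row(rad=3.141592653589793)]</span>
</pre></div>
</div>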
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rand">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rand</code><span class="sig-paren">(</span><em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#rand"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.rand" title="Permalink to this definition"></a></dt>
<dd><p>Generates a random column with independent and identically distributed (i.i.d.) samples
from U[0.0, 1.0].</p>
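<p>An illustrative sketch (not from the original docs); the sampled values are nondeterministic, so only the shape of the result is shown:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.range(2).select(rand(seed=42).alias('u')).collect()  # pseudo-random; exact values vary
<span class="go">[Row(u=...), Row(u=...)]</span>
</pre></div>
</div>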
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.randn">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">randn</code><span class="sig-paren">(</span><em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#randn"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.randn" title="Permalink to this definition"></a></dt>
<dd><p>Generates a column with independent and identically distributed (i.i.d.) samples from
the standard normal distribution.</p>
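<p>A hedged sketch (not in the original docstring); as with <code class="docutils literal"><span class="pre">rand</span></code>, the output is nondeterministic and only its shape is shown:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.range(2).select(randn(seed=42).alias('z')).collect()  # pseudo-random; exact values vary
<span class="go">[Row(z=...), Row(z=...)]</span>
</pre></div>
</div>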
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rank">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rank</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.rank" title="Permalink to this definition"></a></dt>
<dd><p>Window function: returns the rank of rows within a window partition.</p>
<p>The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking
sequence when there are ties. That is, if you were ranking a competition using dense_rank
and had three people tie for second place, you would say that all three were in second
place and that the next person came in third. Rank, by contrast, assigns sequential numbers,
so the person who came in after the three-way tie for second would register as coming in fifth.</p>
<p>This is equivalent to the RANK function in SQL.</p>
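<p>An illustrative sketch (not from the original docs) showing the gap after a tie, over a global ordering window:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql import Window
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([(1,), (1,), (2,)], ['v'])  # hypothetical data
<span class="gp">&gt;&gt;&gt; </span>df.select('v', rank().over(Window.orderBy('v')).alias('rank')).collect()
<span class="go">[Row(v=1, rank=1), Row(v=1, rank=1), Row(v=2, rank=3)]</span>
</pre></div>
</div>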
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.regexp_extract">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">regexp_extract</code><span class="sig-paren">(</span><em>str</em>, <em>pattern</em>, <em>idx</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#regexp_extract"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.regexp_extract" title="Permalink to this definition"></a></dt>
<dd><p>Extract a specific group matched by a Java regex, from the specified string column.
If the regex did not match, or the specified group did not match, an empty string is returned.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;100-200&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;str&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">regexp_extract</span><span class="p">(</span><span class="s1">&#39;str&#39;</span><span class="p">,</span> <span class="s1">&#39;(\d+)-(\d+)&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;d&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=&#39;100&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;foo&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;str&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">regexp_extract</span><span class="p">(</span><span class="s1">&#39;str&#39;</span><span class="p">,</span> <span class="s1">&#39;(\d+)&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;d&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=&#39;&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;aaaac&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;str&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">regexp_extract</span><span class="p">(</span><span class="s1">&#39;str&#39;</span><span class="p">,</span> <span class="s1">&#39;(a+)(b)?(c)&#39;</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;d&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=&#39;&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.regexp_replace">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">regexp_replace</code><span class="sig-paren">(</span><em>str</em>, <em>pattern</em>, <em>replacement</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#regexp_replace"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.regexp_replace" title="Permalink to this definition"></a></dt>
<dd><p>Replace all substrings of the specified string value that match <cite>pattern</cite> with <cite>replacement</cite>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;100-200&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;str&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">regexp_replace</span><span class="p">(</span><span class="s1">&#39;str&#39;</span><span class="p">,</span> <span class="s1">&#39;(\d+)&#39;</span><span class="p">,</span> <span class="s1">&#39;--&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;d&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=&#39;-----&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.repeat">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">repeat</code><span class="sig-paren">(</span><em>col</em>, <em>n</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#repeat"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.repeat" title="Permalink to this definition"></a></dt>
<dd><p>Repeats a string column n times, and returns it as a new string column.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ab&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">repeat</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=&#39;ababab&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.reverse">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">reverse</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.reverse" title="Permalink to this definition"></a></dt>
<dd><p>Reverses the string column and returns it as a new string column.</p>
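<p>A minimal sketch for illustration (not part of the original docstring):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([('Spark',)], ['s']).select(reverse('s').alias('s')).collect()
<span class="go">[Row(s='krapS')]</span>
</pre></div>
</div>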
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rint">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rint</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.rint" title="Permalink to this definition"></a></dt>
<dd><p>Returns the double value that is closest in value to the argument and is equal to a mathematical integer.</p>
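<p>An illustrative sketch (not in the original docs); note that ties round to the even neighbor, unlike the HALF_UP mode of <code class="docutils literal"><span class="pre">round</span></code> below:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([(2.5,)], ['a']).select(rint('a').alias('r')).collect()  # ties round to even
<span class="go">[Row(r=2.0)]</span>
</pre></div>
</div>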
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.round">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">round</code><span class="sig-paren">(</span><em>col</em>, <em>scale=0</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#round"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.round" title="Permalink to this definition"></a></dt>
<dd><p>Round the given value to <cite>scale</cite> decimal places using HALF_UP rounding mode if <cite>scale</cite> &gt;= 0
or at the integral part when <cite>scale</cite> &lt; 0.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mf">2.5</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=3.0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.row_number">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">row_number</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.row_number" title="Permalink to this definition"></a></dt>
<dd><p>Window function: returns a sequential number starting at 1 within a window partition.</p>
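<p>A hedged sketch (not from the original docs), again over a global ordering window:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>from pyspark.sql import Window
<span class="gp">&gt;&gt;&gt; </span>df = spark.createDataFrame([('a',), ('b',)], ['v'])  # hypothetical data
<span class="gp">&gt;&gt;&gt; </span>df.select('v', row_number().over(Window.orderBy('v')).alias('rn')).collect()
<span class="go">[Row(v='a', rn=1), Row(v='b', rn=2)]</span>
</pre></div>
</div>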
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rpad">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rpad</code><span class="sig-paren">(</span><em>col</em>, <em>len</em>, <em>pad</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#rpad"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.rpad" title="Permalink to this definition"></a></dt>
<dd><p>Right-pad the string column to width <cite>len</cite> with <cite>pad</cite>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;abcd&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">rpad</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="s1">&#39;#&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=&#39;abcd##&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rtrim">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rtrim</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.rtrim" title="Permalink to this definition"></a></dt>
<dd><p>Trim the spaces from the right end of the specified string value.</p>
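<p>An illustrative sketch (not in the original docstring):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>spark.createDataFrame([('Spark   ',)], ['s']).select(rtrim('s').alias('s')).collect()
<span class="go">[Row(s='Spark')]</span>
</pre></div>
</div>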
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.second">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">second</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#second"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.second" title="Permalink to this definition"></a></dt>
<dd><p>Extract the seconds of a given date as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08 13:08:15&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">second</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;second&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(second=15)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sha1">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sha1</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#sha1"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.sha1" title="Permalink to this definition"></a></dt>
<dd><p>Returns the hex string result of SHA-1.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ABC&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sha1</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;hash&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hash=&#39;3c01bdbb26f358bab27f267924aa2c9a03fcfdb8&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sha2">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sha2</code><span class="sig-paren">(</span><em>col</em>, <em>numBits</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#sha2"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.sha2" title="Permalink to this definition"></a></dt>
<dd><p>Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384,
and SHA-512). The numBits indicates the desired bit length of the result, which must have a
value of 224, 256, 384, 512, or 0 (which is equivalent to 256).</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">digests</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sha2</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="mi">256</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">digests</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="go">Row(s=&#39;3bc51062973c458d5a6f2d8d64a023246354ad7e064b1e4e009ec8a0699a3043&#39;)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">digests</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="go">Row(s=&#39;cd9fb1e148ccd8442e5aa74904cc73bf6fb54d1d54d333bd596aa9bb4bb4e961&#39;)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.shiftLeft">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">shiftLeft</code><span class="sig-paren">(</span><em>col</em>, <em>numBits</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#shiftLeft"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.shiftLeft" title="Permalink to this definition"></a></dt>
<dd><p>Shift the given value numBits left.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">21</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">shiftLeft</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=42)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.shiftRight">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">shiftRight</code><span class="sig-paren">(</span><em>col</em>, <em>numBits</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#shiftRight"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.shiftRight" title="Permalink to this definition"></a></dt>
<dd><p>(Signed) shift the given value numBits right.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">42</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">shiftRight</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=21)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.shiftRightUnsigned">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">shiftRightUnsigned</code><span class="sig-paren">(</span><em>col</em>, <em>numBits</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#shiftRightUnsigned"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.shiftRightUnsigned" title="Permalink to this definition"></a></dt>
<dd><p>(Unsigned) shift the given value numBits right.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="o">-</span><span class="mi">42</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">shiftRightUnsigned</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=9223372036854775787)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.signum">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">signum</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.signum" title="Permalink to this definition"></a></dt>
<dd><p>Computes the signum of the given value.</p>
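<p>An illustrative sketch, assuming the shared <code class="docutils literal"><span class="pre">spark</span></code> session used by the other examples on this page:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(-8.0,), (3.0,)], ['a'])
&gt;&gt;&gt; df.select(signum('a').alias('sign')).collect()
[Row(sign=-1.0), Row(sign=1.0)]
</pre></div>
</div>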
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sin">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sin</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sin" title="Permalink to this definition"></a></dt>
<dd><p>Computes the sine of the given value.</p>
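<p>An illustrative sketch:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(0.0,)], ['a']).select(sin('a').alias('r')).collect()
[Row(r=0.0)]
</pre></div>
</div>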
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sinh">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sinh</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sinh" title="Permalink to this definition"></a></dt>
<dd><p>Computes the hyperbolic sine of the given value.</p>
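<p>An illustrative sketch:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(0.0,)], ['a']).select(sinh('a').alias('r')).collect()
[Row(r=0.0)]
</pre></div>
</div>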
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.size">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">size</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#size"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.size" title="Permalink to this definition"></a></dt>
<dd><p>Collection function: returns the length of the array or map stored in the column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>col</strong> – name of column or expression</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],),([</span><span class="mi">1</span><span class="p">],),([],)],</span> <span class="p">[</span><span class="s1">&#39;data&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">size</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">data</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(size(data)=3), Row(size(data)=1), Row(size(data)=0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.skewness">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">skewness</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.skewness" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the skewness of the values in a group.</p>
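<p>An illustrative sketch; a symmetric sample should have zero skewness:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ['a'])
&gt;&gt;&gt; df.select(skewness('a').alias('sk')).collect()
[Row(sk=0.0)]
</pre></div>
</div>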
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sort_array">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sort_array</code><span class="sig-paren">(</span><em>col</em>, <em>asc=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#sort_array"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.sort_array" title="Permalink to this definition"></a></dt>
<dd><p>Collection function: sorts the input array in ascending or descending order according
to the natural ordering of the array elements.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>col</strong> – name of column or expression</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">],),([</span><span class="mi">1</span><span class="p">],),([],)],</span> <span class="p">[</span><span class="s1">&#39;data&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sort_array</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">data</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=[1, 2, 3]), Row(r=[1]), Row(r=[])]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sort_array</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">asc</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=[3, 2, 1]), Row(r=[1]), Row(r=[])]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.soundex">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">soundex</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#soundex"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.soundex" title="Permalink to this definition"></a></dt>
<dd><p>Returns the SoundEx encoding for a string.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s2">&quot;Peters&quot;</span><span class="p">,),(</span><span class="s2">&quot;Uhrbach&quot;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;name&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">soundex</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;soundex&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(soundex=&#39;P362&#39;), Row(soundex=&#39;U612&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.spark_partition_id">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">spark_partition_id</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#spark_partition_id"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.spark_partition_id" title="Permalink to this definition"></a></dt>
<dd><p>A column for partition ID.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This is indeterministic because it depends on data partitioning and task scheduling.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">spark_partition_id</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;pid&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(pid=0), Row(pid=0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.split">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">split</code><span class="sig-paren">(</span><em>str</em>, <em>pattern</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#split"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.split" title="Permalink to this definition"></a></dt>
<dd><p>Splits str around pattern (pattern is a regular expression).</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">pattern is a string represent the regular expression.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;ab12cd&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">split</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="s1">&#39;[0-9]+&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=[&#39;ab&#39;, &#39;cd&#39;])]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sqrt">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sqrt</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sqrt" title="Permalink to this definition"></a></dt>
<dd><p>Computes the square root of the specified float value.</p>
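<p>An illustrative sketch:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(4.0,)], ['a']).select(sqrt('a').alias('r')).collect()
[Row(r=2.0)]
</pre></div>
</div>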
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.stddev">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">stddev</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.stddev" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the unbiased sample standard deviation of the expression in a group.</p>
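<p>An illustrative sketch; the sample standard deviation of 1.0, 2.0, 3.0 is exactly 1.0:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ['a'])
&gt;&gt;&gt; df.select(stddev('a').alias('sd')).collect()
[Row(sd=1.0)]
</pre></div>
</div>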
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.stddev_pop">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">stddev_pop</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.stddev_pop" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns population standard deviation of the expression in a group.</p>
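<p>An illustrative sketch; the population standard deviation of 0.0 and 2.0 is exactly 1.0:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(0.0,), (2.0,)], ['a'])
&gt;&gt;&gt; df.select(stddev_pop('a').alias('sd')).collect()
[Row(sd=1.0)]
</pre></div>
</div>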
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.stddev_samp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">stddev_samp</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.stddev_samp" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the unbiased sample standard deviation of the expression in a group.</p>
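<p>An illustrative sketch, mirroring <code class="docutils literal"><span class="pre">stddev</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ['a'])
&gt;&gt;&gt; df.select(stddev_samp('a').alias('sd')).collect()
[Row(sd=1.0)]
</pre></div>
</div>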
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.struct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">struct</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#struct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.struct" title="Permalink to this definition"></a></dt>
<dd><p>Creates a new struct column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string) or list of <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expressions</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">struct</span><span class="p">(</span><span class="s1">&#39;age&#39;</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;struct&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(struct=Row(age=2, name=&#39;Alice&#39;)), Row(struct=Row(age=5, name=&#39;Bob&#39;))]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">struct</span><span class="p">([</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">])</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;struct&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(struct=Row(age=2, name=&#39;Alice&#39;)), Row(struct=Row(age=5, name=&#39;Bob&#39;))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.substring">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">substring</code><span class="sig-paren">(</span><em>str</em>, <em>pos</em>, <em>len</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#substring"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.substring" title="Permalink to this definition"></a></dt>
<dd><p>Returns the substring that starts at <cite>pos</cite> and is of length <cite>len</cite> when str is String type,
or the slice of the byte array that starts at <cite>pos</cite> (in bytes) and is of length <cite>len</cite>
when str is Binary type.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;abcd&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">,])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">substring</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=&#39;ab&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.substring_index">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">substring_index</code><span class="sig-paren">(</span><em>str</em>, <em>delim</em>, <em>count</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#substring_index"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.substring_index" title="Permalink to this definition"></a></dt>
<dd><p>Returns the substring from string str before count occurrences of the delimiter delim.
If count is positive, everything to the left of the final delimiter (counting from the left) is
returned. If count is negative, everything to the right of the final delimiter (counting from the
right) is returned. substring_index performs a case-sensitive match when searching for delim.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;a.b.c.d&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;s&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">substring_index</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="s1">&#39;.&#39;</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=&#39;a.b&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">substring_index</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="s1">&#39;.&#39;</span><span class="p">,</span> <span class="o">-</span><span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;s&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=&#39;b.c.d&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sum">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sum</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sum" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the sum of all values in the expression.</p>
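<p>An illustrative sketch:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(1,), (2,), (3,)], ['a'])
&gt;&gt;&gt; df.select(sum('a').alias('total')).collect()
[Row(total=6)]
</pre></div>
</div>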
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sumDistinct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sumDistinct</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sumDistinct" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the sum of distinct values in the expression.</p>
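<p>An illustrative sketch; the duplicated value should be counted once:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(1,), (1,), (2,)], ['a'])
&gt;&gt;&gt; df.select(sumDistinct('a').alias('total')).collect()
[Row(total=3)]
</pre></div>
</div>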
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.tan">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">tan</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.tan" title="Permalink to this definition"></a></dt>
<dd><p>Computes the tangent of the given value.</p>
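<p>An illustrative sketch:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(0.0,)], ['a']).select(tan('a').alias('r')).collect()
[Row(r=0.0)]
</pre></div>
</div>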
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.tanh">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">tanh</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.tanh" title="Permalink to this definition"></a></dt>
<dd><p>Computes the hyperbolic tangent of the given value.</p>
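<p>An illustrative sketch:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([(0.0,)], ['a']).select(tanh('a').alias('r')).collect()
[Row(r=0.0)]
</pre></div>
</div>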
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.toDegrees">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">toDegrees</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.toDegrees" title="Permalink to this definition"></a></dt>
<dd><div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 2.1, use degrees instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.toRadians">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">toRadians</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.toRadians" title="Permalink to this definition"></a></dt>
<dd><div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 2.1, use radians instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.to_date">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">to_date</code><span class="sig-paren">(</span><em>col</em>, <em>format=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#to_date"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.to_date" title="Permalink to this definition"></a></dt>
<dd><p>Converts a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> of <a class="reference internal" href="#pyspark.sql.types.StringType" title="pyspark.sql.types.StringType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StringType</span></code></a> or
<a class="reference internal" href="#pyspark.sql.types.TimestampType" title="pyspark.sql.types.TimestampType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.TimestampType</span></code></a> into <a class="reference internal" href="#pyspark.sql.types.DateType" title="pyspark.sql.types.DateType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DateType</span></code></a>
using the optionally specified format. Default format is ‘yyyy-MM-dd’.
Specify formats according to
<a class="reference external" href="http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html">SimpleDateFormats</a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-28 10:30:00&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;t&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_date</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=datetime.date(1997, 2, 28))]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-28 10:30:00&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;t&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_date</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">,</span> <span class="s1">&#39;yyyy-MM-dd HH:mm:ss&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;date&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=datetime.date(1997, 2, 28))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.2.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.to_json">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">to_json</code><span class="sig-paren">(</span><em>col</em>, <em>options={}</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#to_json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.to_json" title="Permalink to this definition"></a></dt>
<dd><p>Converts a column containing a <code class="xref py py-class docutils literal"><span class="pre">StructType</span></code> or an <code class="xref py py-class docutils literal"><span class="pre">ArrayType</span></code> of
<code class="xref py py-class docutils literal"><span class="pre">StructType</span></code>s into a JSON string. Throws an exception in the case of an unsupported type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – name of column containing the struct or array of the structs</li>
<li><strong>options</strong> – options to control conversion; accepts the same options as the JSON data source</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">2</span><span class="p">))]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_json</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">value</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;json&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(json=&#39;{&quot;age&quot;:2,&quot;name&quot;:&quot;Alice&quot;}&#39;)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="p">[</span><span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Alice&#39;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;Bob&#39;</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">3</span><span class="p">)])]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="s2">&quot;key&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_json</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">value</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;json&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(json=&#39;[{&quot;age&quot;:2,&quot;name&quot;:&quot;Alice&quot;},{&quot;age&quot;:3,&quot;name&quot;:&quot;Bob&quot;}]&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.to_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">to_timestamp</code><span class="sig-paren">(</span><em>col</em>, <em>format=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#to_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.to_timestamp" title="Permalink to this definition"></a></dt>
<dd><p>Converts a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> of <a class="reference internal" href="#pyspark.sql.types.StringType" title="pyspark.sql.types.StringType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StringType</span></code></a> or
<a class="reference internal" href="#pyspark.sql.types.TimestampType" title="pyspark.sql.types.TimestampType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.TimestampType</span></code></a> into <a class="reference internal" href="#pyspark.sql.types.DateType" title="pyspark.sql.types.DateType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DateType</span></code></a>
using the optionally specified format. Default format is ‘yyyy-MM-dd HH:mm:ss’. Specify
formats according to
<a class="reference external" href="http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html">SimpleDateFormats</a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-28 10:30:00&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;t&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_timestamp</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;dt&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-28 10:30:00&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;t&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_timestamp</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">,</span> <span class="s1">&#39;yyyy-MM-dd HH:mm:ss&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;dt&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.2.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.to_utc_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">to_utc_timestamp</code><span class="sig-paren">(</span><em>timestamp</em>, <em>tz</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#to_utc_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.to_utc_timestamp" title="Permalink to this definition"></a></dt>
<dd><p>Given a timestamp, which corresponds to a certain time of day in the given timezone, returns
another timestamp that corresponds to the same time of day in UTC.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-28 10:30:00&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;t&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_utc_timestamp</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">,</span> <span class="s2">&quot;PST&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;t&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(t=datetime.datetime(1997, 2, 28, 18, 30))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.translate">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">translate</code><span class="sig-paren">(</span><em>srcCol</em>, <em>matching</em>, <em>replace</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#translate"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.translate" title="Permalink to this definition"></a></dt>
<dd><p>Translates any character in <cite>srcCol</cite> that occurs in <cite>matching</cite> to the corresponding
character in <cite>replace</cite>. The characters in <cite>replace</cite> correspond positionally to the
characters in <cite>matching</cite>; a character in <cite>matching</cite> that has no counterpart in
<cite>replace</cite> is removed from the string.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;translate&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">translate</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s2">&quot;rnlt&quot;</span><span class="p">,</span> <span class="s2">&quot;123&quot;</span><span class="p">)</span> \
<span class="gp">... </span> <span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;r&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=&#39;1a2s3ae&#39;)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.trim">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">trim</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.trim" title="Permalink to this definition"></a></dt>
<dd><p>Trims the spaces from both ends of the specified string column.</p>
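<p>An illustrative sketch; only leading and trailing spaces should be removed:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([('  ab cd  ',)], ['s'])
&gt;&gt;&gt; df.select(trim(df.s).alias('r')).collect()
[Row(r='ab cd')]
</pre></div>
</div>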
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.trunc">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">trunc</code><span class="sig-paren">(</span><em>date</em>, <em>format</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#trunc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.trunc" title="Permalink to this definition"></a></dt>
<dd><p>Returns date truncated to the unit specified by the format.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>format</strong> – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;1997-02-28&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;d&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">trunc</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="s1">&#39;year&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(year=datetime.date(1997, 1, 1))]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">trunc</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="s1">&#39;mon&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;month&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(month=datetime.date(1997, 2, 1))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.udf">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">udf</code><span class="sig-paren">(</span><em>f=None</em>, <em>returnType=StringType</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#udf"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.udf" title="Permalink to this definition"></a></dt>
<dd><p>Creates a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expression representing a user defined function (UDF).</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The user-defined functions must be deterministic. Due to optimization,
duplicate invocations may be eliminated or the function may even be invoked more times than
it is present in the query.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>f</strong> – Python function, if used as a standalone function</li>
<li><strong>returnType</strong> – a <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.DataType</span></code></a> object</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="k">import</span> <span class="n">IntegerType</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">slen</span> <span class="o">=</span> <span class="n">udf</span><span class="p">(</span><span class="k">lambda</span> <span class="n">s</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">),</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="p">:</span><span class="n">udf</span>
<span class="gp">... </span><span class="k">def</span> <span class="nf">to_upper</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="gp">... </span> <span class="k">if</span> <span class="n">s</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="gp">... </span> <span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">upper</span><span class="p">()</span>
<span class="gp">...</span>
<span class="gp">&gt;&gt;&gt; </span><span class="p">:</span><span class="n">udf</span><span class="p">(</span><span class="n">returnType</span><span class="o">=</span><span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">... </span><span class="k">def</span> <span class="nf">add_one</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="gp">... </span> <span class="k">if</span> <span class="n">x</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="gp">... </span> <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span>
<span class="gp">...</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="s2">&quot;John Doe&quot;</span><span class="p">,</span> <span class="mi">21</span><span class="p">)],</span> <span class="p">(</span><span class="s2">&quot;id&quot;</span><span class="p">,</span> <span class="s2">&quot;name&quot;</span><span class="p">,</span> <span class="s2">&quot;age&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">slen</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;slen(name)&quot;</span><span class="p">),</span> <span class="n">to_upper</span><span class="p">(</span><span class="s2">&quot;name&quot;</span><span class="p">),</span> <span class="n">add_one</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----------+--------------+------------+</span>
<span class="go">|slen(name)|to_upper(name)|add_one(age)|</span>
<span class="go">+----------+--------------+------------+</span>
<span class="go">| 8| JOHN DOE| 22|</span>
<span class="go">+----------+--------------+------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.unbase64">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">unbase64</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.unbase64" title="Permalink to this definition"></a></dt>
<dd><p>Decodes a BASE64 encoded string column and returns it as a binary column.</p>
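<p>An illustrative sketch; note that the result is binary, not string:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([('U3Bhcms=',)], ['e'])
&gt;&gt;&gt; df.select(unbase64(df.e).alias('d')).collect()
[Row(d=bytearray(b'Spark'))]
</pre></div>
</div>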
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.unhex">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">unhex</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#unhex"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.unhex" title="Permalink to this definition"></a></dt>
<dd><p>Inverse of hex. Interprets each pair of characters as a hexadecimal number
and converts it to the byte representation of the number.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;414243&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">unhex</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(unhex(a)=bytearray(b&#39;ABC&#39;))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.unix_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">unix_timestamp</code><span class="sig-paren">(</span><em>timestamp=None</em>, <em>format='yyyy-MM-dd HH:mm:ss'</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#unix_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.unix_timestamp" title="Permalink to this definition"></a></dt>
<dd><p>Converts a time string with the given pattern (‘yyyy-MM-dd HH:mm:ss’ by default)
to a Unix timestamp (in seconds), using the default timezone and the default
locale. Returns null if it fails.</p>
<p>If <cite>timestamp</cite> is None, the current timestamp is returned.</p>
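<p>An illustrative sketch; the numeric result depends on the default timezone, so the output below assumes UTC:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([('2015-04-08 13:08:15',)], ['t'])
&gt;&gt;&gt; df.select(unix_timestamp(df.t).alias('unix')).collect()  # assuming a UTC default timezone
[Row(unix=1428498495)]
</pre></div>
</div>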
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.upper">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">upper</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.upper" title="Permalink to this definition"></a></dt>
<dd><p>Converts a string column to upper case.</p>
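<p>An illustrative sketch:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; spark.createDataFrame([('Spark',)], ['s']).select(upper('s').alias('u')).collect()
[Row(u='SPARK')]
</pre></div>
</div>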
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.var_pop">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">var_pop</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.var_pop" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the population variance of the values in a group.</p>
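<p>An illustrative sketch; the population variance of 0.0 and 2.0 is exactly 1.0:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(0.0,), (2.0,)], ['a'])
&gt;&gt;&gt; df.select(var_pop('a').alias('v')).collect()
[Row(v=1.0)]
</pre></div>
</div>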
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.var_samp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">var_samp</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.var_samp" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the unbiased variance of the values in a group.</p>
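<p>An illustrative sketch; the sample variance of 1.0, 2.0, 3.0 is exactly 1.0:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ['a'])
&gt;&gt;&gt; df.select(var_samp('a').alias('v')).collect()
[Row(v=1.0)]
</pre></div>
</div>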
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.variance">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">variance</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.variance" title="Permalink to this definition"></a></dt>
<dd><p>Aggregate function: returns the unbiased sample variance of the values in a group (alias for <a class="reference internal" href="#pyspark.sql.functions.var_samp" title="pyspark.sql.functions.var_samp"><code class="xref py py-func docutils literal"><span class="pre">var_samp()</span></code></a>).</p>
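<p>An illustrative sketch; the result should match <code class="docutils literal"><span class="pre">var_samp</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ['a'])
&gt;&gt;&gt; df.select(variance('a').alias('v')).collect()
[Row(v=1.0)]
</pre></div>
</div>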
<div class="versionadded">
<p><span class="versionmodified">New in version 1.6.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.weekofyear">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">weekofyear</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#weekofyear"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.weekofyear" title="Permalink to this definition"></a></dt>
<dd><p>Extract the week number of a given date as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">weekofyear</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;week&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(week=15)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.when">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">when</code><span class="sig-paren">(</span><em>condition</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#when"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.when" title="Permalink to this definition"></a></dt>
<dd><p>Evaluates a list of conditions and returns one of multiple possible result expressions.
If <code class="xref py py-func docutils literal"><span class="pre">Column.otherwise()</span></code> is not invoked, None is returned for unmatched conditions.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>condition</strong> – a boolean <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expression.</li>
<li><strong>value</strong> – a literal value, or a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expression.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;age&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=3), Row(age=4)]</span>
</pre></div>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="mi">2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;age&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=3), Row(age=None)]</span>
</pre></div>
</div>
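<p>Because conditions are evaluated in order, several <code class="docutils literal"><span class="pre">when()</span></code> calls can be chained before a final <code class="docutils literal"><span class="pre">otherwise()</span></code>; a sketch (not from the original docstring), assuming the same two-row <code class="docutils literal"><span class="pre">df</span></code> with ages 2 and 5 used above:</p>
<div class="highlight-default"><div class="highlight"><pre>&gt;&gt;&gt; df.select(when(df.age == 2, 'two')
...           .when(df.age == 5, 'five')
...           .otherwise('other').alias('label')).collect()
[Row(label='two'), Row(label='five')]
</pre></div>
</div>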
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.window">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">window</code><span class="sig-paren">(</span><em>timeColumn</em>, <em>windowDuration</em>, <em>slideDuration=None</em>, <em>startTime=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#window"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.window" title="Permalink to this definition"></a></dt>
<dd><p>Bucketize rows into one or more time windows given a timestamp column. Window
starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window
[12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision, but windows
on the order of months are not supported.</p>
<p>The time column must be of <a class="reference internal" href="#pyspark.sql.types.TimestampType" title="pyspark.sql.types.TimestampType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.TimestampType</span></code></a>.</p>
<p>Durations are provided as strings, e.g. ‘1 second’, ‘1 day 12 hours’, ‘2 minutes’. Valid
interval strings are ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’, ‘microsecond’.
If the <code class="docutils literal"><span class="pre">slideDuration</span></code> is not provided, the windows will be tumbling windows.</p>
<p>The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start
window intervals. For example, in order to have hourly tumbling windows that start 15 minutes
past the hour, e.g. 12:15-13:15, 13:15-14:15… provide <cite>startTime</cite> as <cite>15 minutes</cite>.</p>
<p>The output column will be a struct called ‘window’ by default with the nested columns ‘start’
and ‘end’, where ‘start’ and ‘end’ will be of <a class="reference internal" href="#pyspark.sql.types.TimestampType" title="pyspark.sql.types.TimestampType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.TimestampType</span></code></a>.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s2">&quot;2016-03-11 09:00:07&quot;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">(</span><span class="s2">&quot;date&quot;</span><span class="p">,</span> <span class="s2">&quot;val&quot;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">w</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">window</span><span class="p">(</span><span class="s2">&quot;date&quot;</span><span class="p">,</span> <span class="s2">&quot;5 seconds&quot;</span><span class="p">))</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="s2">&quot;val&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;sum&quot;</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">w</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">window</span><span class="o">.</span><span class="n">start</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="s2">&quot;string&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;start&quot;</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">w</span><span class="o">.</span><span class="n">window</span><span class="o">.</span><span class="n">end</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="s2">&quot;string&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">&quot;end&quot;</span><span class="p">),</span> <span class="s2">&quot;sum&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(start=&#39;2016-03-11 09:00:05&#39;, end=&#39;2016-03-11 09:00:10&#39;, sum=1)]</span>
</pre></div>
</div>
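<p>A hedged sketch (not from the original docstring) of sliding windows: passing a <code class="docutils literal"><span class="pre">slideDuration</span></code> shorter than <code class="docutils literal"><span class="pre">windowDuration</span></code> places the same row (here, the 09:00:07 timestamp from the example above) into every overlapping window:</p>
<div class="highlight-default"><div class="highlight"><pre>&gt;&gt;&gt; w2 = df.groupBy(window("date", "10 seconds", "5 seconds")).agg(sum("val").alias("sum"))
&gt;&gt;&gt; sorted(r.start for r in w2.select(w2.window.start.cast("string").alias("start")).collect())
['2016-03-11 09:00:00', '2016-03-11 09:00:05']
</pre></div>
</div>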
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.year">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">year</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#year"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.year" title="Permalink to this definition"></a></dt>
<dd><p>Extract the year of a given date as an integer.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s1">&#39;2015-04-08&#39;</span><span class="p">,)],</span> <span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">year</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">&#39;year&#39;</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(year=2015)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
</div>
<div class="section" id="module-pyspark.sql.streaming">
<span id="pyspark-sql-streaming-module"></span><h2>pyspark.sql.streaming module<a class="headerlink" href="#module-pyspark.sql.streaming" title="Permalink to this headline"></a></h2>
<dl class="class">
<dt id="pyspark.sql.streaming.StreamingQuery">
<em class="property">class </em><code class="descclassname">pyspark.sql.streaming.</code><code class="descname">StreamingQuery</code><span class="sig-paren">(</span><em>jsq</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQuery"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery" title="Permalink to this definition"></a></dt>
<dd><p>A handle to a query that is executing continuously in the background as new data arrives.
All these methods are thread-safe.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
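<p>A minimal lifecycle sketch (not part of the original docs), assuming a streaming DataFrame <code class="docutils literal"><span class="pre">sdf</span></code> as in the method examples below; the query name is hypothetical:</p>
<div class="highlight-default"><div class="highlight"><pre>&gt;&gt;&gt; sq = sdf.writeStream.format('memory').queryName('lifecycle_demo').start()
&gt;&gt;&gt; sq.isActive
True
&gt;&gt;&gt; sq.stop()
&gt;&gt;&gt; sq.isActive
False
</pre></div>
</div>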
<dl class="method">
<dt id="pyspark.sql.streaming.StreamingQuery.awaitTermination">
<code class="descname">awaitTermination</code><span class="sig-paren">(</span><em>timeout=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQuery.awaitTermination"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.awaitTermination" title="Permalink to this definition"></a></dt>
<dd><p>Waits for the termination of <cite>this</cite> query, either by <code class="xref py py-func docutils literal"><span class="pre">query.stop()</span></code> or by an
exception. If the query has terminated with an exception, then the exception will be thrown.
If <cite>timeout</cite> is set, it returns whether the query has terminated within <cite>timeout</cite> seconds.</p>
<p>If the query has terminated, then all subsequent calls to this method will either return
immediately (if the query was terminated by <a class="reference internal" href="#pyspark.sql.streaming.StreamingQuery.stop" title="pyspark.sql.streaming.StreamingQuery.stop"><code class="xref py py-func docutils literal"><span class="pre">stop()</span></code></a>), or throw the exception
immediately (if the query has terminated with exception).</p>
<p>Throws <code class="xref py py-class docutils literal"><span class="pre">StreamingQueryException</span></code> if <cite>this</cite> query has terminated with an exception.</p>
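<p>A hedged sketch (not from the original docstring) of the timeout form, which returns a boolean instead of blocking indefinitely; the query name is hypothetical:</p>
<div class="highlight-default"><div class="highlight"><pre>&gt;&gt;&gt; sq = sdf.writeStream.format('memory').queryName('await_demo').start()
&gt;&gt;&gt; sq.awaitTermination(1)  # still running after ~1 second, so this returns False
False
&gt;&gt;&gt; sq.stop()
</pre></div>
</div>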
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.StreamingQuery.exception">
<code class="descname">exception</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQuery.exception"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.exception" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body">the StreamingQueryException if the query was terminated by an exception, or None.</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.StreamingQuery.explain">
<code class="descname">explain</code><span class="sig-paren">(</span><em>extended=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQuery.explain"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.explain" title="Permalink to this definition"></a></dt>
<dd><p>Prints the (logical and physical) plans to the console for debugging purposes.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>extended</strong> – boolean, default <code class="docutils literal"><span class="pre">False</span></code>. If <code class="docutils literal"><span class="pre">False</span></code>, prints only the physical plan.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;memory&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">queryName</span><span class="p">(</span><span class="s1">&#39;query_explain&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">processAllAvailable</span><span class="p">()</span> <span class="c1"># Wait a bit to generate the runtime plans.</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">explain</span><span class="p">()</span>
<span class="go">== Physical Plan ==</span>
<span class="gp">...</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">explain</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span>
<span class="go">== Parsed Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Analyzed Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Optimized Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Physical Plan ==</span>
<span class="gp">...</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.streaming.StreamingQuery.id">
<code class="descname">id</code><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.id" title="Permalink to this definition"></a></dt>
<dd><p>Returns the unique id of this query that persists across restarts from checkpoint data.
That is, this id is generated when a query is started for the first time, and
will be the same every time it is restarted from checkpoint data.
There can only be one query with the same id active in a Spark cluster.
See also <cite>runId</cite>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.streaming.StreamingQuery.isActive">
<code class="descname">isActive</code><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.isActive" title="Permalink to this definition"></a></dt>
<dd><p>Whether this streaming query is currently active or not.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.streaming.StreamingQuery.lastProgress">
<code class="descname">lastProgress</code><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.lastProgress" title="Permalink to this definition"></a></dt>
<dd><p>Returns the most recent <code class="xref py py-class docutils literal"><span class="pre">StreamingQueryProgress</span></code> update of this streaming query
as a map (dict), or None if there have been no progress updates.</p>
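<p>A hedged sketch (not from the original docstring); the exact keys in the returned dict depend on the source and sink, so only the type is asserted here:</p>
<div class="highlight-default"><div class="highlight"><pre>&gt;&gt;&gt; sq = sdf.writeStream.format('memory').queryName('progress_demo').start()
&gt;&gt;&gt; sq.processAllAvailable()
&gt;&gt;&gt; lp = sq.lastProgress
&gt;&gt;&gt; lp is None or isinstance(lp, dict)
True
&gt;&gt;&gt; sq.stop()
</pre></div>
</div>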
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.streaming.StreamingQuery.name">
<code class="descname">name</code><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.name" title="Permalink to this definition"></a></dt>
<dd><p>Returns the user-specified name of the query, or None if not specified.
This name can be specified in the <cite>org.apache.spark.sql.streaming.DataStreamWriter</cite>
as <cite>dataframe.writeStream.queryName(“query”).start()</cite>.
This name, if set, must be unique across all active queries.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.StreamingQuery.processAllAvailable">
<code class="descname">processAllAvailable</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQuery.processAllAvailable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.processAllAvailable" title="Permalink to this definition"></a></dt>
<dd><p>Blocks until all available data in the source has been processed and committed to the
sink. This method is intended for testing.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">In the case of continually arriving data, this method may block forever.
Additionally, this method is only guaranteed to block until data that has been
synchronously appended data to a stream source prior to invocation.
(i.e. <cite>getOffset</cite> must immediately reflect the addition).</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.streaming.StreamingQuery.recentProgress">
<code class="descname">recentProgress</code><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.recentProgress" title="Permalink to this definition"></a></dt>
<dd><p>Returns an array of the most recent <code class="xref py py-class docutils literal"><span class="pre">StreamingQueryProgress</span></code> updates for this query.
The number of progress updates retained for each stream is configured by Spark session
configuration <cite>spark.sql.streaming.numRecentProgressUpdates</cite>.</p>
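<p>A hedged sketch (not from the original docstring), assuming the default retention of 100 progress updates for <cite>spark.sql.streaming.numRecentProgressUpdates</cite>:</p>
<div class="highlight-default"><div class="highlight"><pre>&gt;&gt;&gt; rp = sq.recentProgress  # sq: an active StreamingQuery handle
&gt;&gt;&gt; isinstance(rp, list) and len(rp) &lt;= 100
True
</pre></div>
</div>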
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.streaming.StreamingQuery.runId">
<code class="descname">runId</code><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.runId" title="Permalink to this definition"></a></dt>
<dd><p>Returns the unique id of this query that does not persist across restarts. That is, every
query that is started (or restarted from checkpoint) will have a different runId.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.streaming.StreamingQuery.status">
<code class="descname">status</code><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.status" title="Permalink to this definition"></a></dt>
<dd><p>Returns the current status of the query.</p>
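<p>A hedged sketch (not from the original docstring); the returned dict carries a human-readable message plus activity flags, and the exact message varies from run to run:</p>
<div class="highlight-default"><div class="highlight"><pre>&gt;&gt;&gt; sq.status  # representative output only
{'message': 'Waiting for data to arrive', 'isDataAvailable': False, 'isTriggerActive': False}
</pre></div>
</div>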
<div class="versionadded">
<p><span class="versionmodified">New in version 2.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.StreamingQuery.stop">
<code class="descname">stop</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQuery.stop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQuery.stop" title="Permalink to this definition"></a></dt>
<dd><p>Stop this streaming query.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.streaming.StreamingQueryManager">
<em class="property">class </em><code class="descclassname">pyspark.sql.streaming.</code><code class="descname">StreamingQueryManager</code><span class="sig-paren">(</span><em>jsqm</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQueryManager"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQueryManager" title="Permalink to this definition"></a></dt>
<dd><p>A class to manage all the active <a class="reference internal" href="#pyspark.sql.streaming.StreamingQuery" title="pyspark.sql.streaming.StreamingQuery"><code class="xref py py-class docutils literal"><span class="pre">StreamingQuery</span></code></a> instances.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
<dl class="attribute">
<dt id="pyspark.sql.streaming.StreamingQueryManager.active">
<code class="descname">active</code><a class="headerlink" href="#pyspark.sql.streaming.StreamingQueryManager.active" title="Permalink to this definition"></a></dt>
<dd><p>Returns a list of active queries associated with this SQLContext.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;memory&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">queryName</span><span class="p">(</span><span class="s1">&#39;this_query&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sqm</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">streams</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># get the list of active streaming queries</span>
<span class="gp">&gt;&gt;&gt; </span><span class="p">[</span><span class="n">q</span><span class="o">.</span><span class="n">name</span> <span class="k">for</span> <span class="n">q</span> <span class="ow">in</span> <span class="n">sqm</span><span class="o">.</span><span class="n">active</span><span class="p">]</span>
<span class="go">[&#39;this_query&#39;]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination">
<code class="descname">awaitAnyTermination</code><span class="sig-paren">(</span><em>timeout=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQueryManager.awaitAnyTermination"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination" title="Permalink to this definition"></a></dt>
<dd><p>Wait until any of the queries on the associated SQLContext has terminated since the
creation of the context, or since <a class="reference internal" href="#pyspark.sql.streaming.StreamingQueryManager.resetTerminated" title="pyspark.sql.streaming.StreamingQueryManager.resetTerminated"><code class="xref py py-func docutils literal"><span class="pre">resetTerminated()</span></code></a> was called. If any query was
terminated with an exception, then the exception will be thrown.
If <cite>timeout</cite> is set, it returns whether any query has terminated within <cite>timeout</cite> seconds.</p>
<p>If a query has terminated, then subsequent calls to <a class="reference internal" href="#pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination" title="pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination"><code class="xref py py-func docutils literal"><span class="pre">awaitAnyTermination()</span></code></a> will
either return immediately (if the query was terminated by <code class="xref py py-func docutils literal"><span class="pre">query.stop()</span></code>),
or throw the exception immediately (if the query was terminated with exception). Use
<a class="reference internal" href="#pyspark.sql.streaming.StreamingQueryManager.resetTerminated" title="pyspark.sql.streaming.StreamingQueryManager.resetTerminated"><code class="xref py py-func docutils literal"><span class="pre">resetTerminated()</span></code></a> to clear past terminations and wait for new terminations.</p>
<p>In the case where multiple queries have terminated since <code class="xref py py-func docutils literal"><span class="pre">resetTerminated()</span></code>
was called, if any query has terminated with an exception, then <a class="reference internal" href="#pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination" title="pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination"><code class="xref py py-func docutils literal"><span class="pre">awaitAnyTermination()</span></code></a>
will throw any one of those exceptions. To correctly report exceptions across multiple
queries, users need to stop all of them after any of them terminates with an exception, and
then check <cite>query.exception()</cite> for each query.</p>
<p>Throws <code class="xref py py-class docutils literal"><span class="pre">StreamingQueryException</span></code> if any query has terminated with an exception.</p>
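<p>A hedged sketch (not from the original docs) of the stop-all-then-inspect pattern described above; <code class="docutils literal"><span class="pre">my_queries</span></code> is a hypothetical list of handles collected when the queries were started:</p>
<div class="highlight-default"><div class="highlight"><pre>&gt;&gt;&gt; try:
...     spark.streams.awaitAnyTermination()   # throws if a query failed
... except Exception:
...     for q in my_queries:                  # my_queries: hypothetical handles from start()
...         q.stop()
...         if q.exception() is not None:
...             print(q.name, q.exception())
...     spark.streams.resetTerminated()       # clear past terminations before waiting again
</pre></div>
</div>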
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.StreamingQueryManager.get">
<code class="descname">get</code><span class="sig-paren">(</span><em>id</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQueryManager.get"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQueryManager.get" title="Permalink to this definition"></a></dt>
<dd><p>Returns an active query from this SQLContext, or throws an exception if an active query
with this id doesn’t exist.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;memory&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">queryName</span><span class="p">(</span><span class="s1">&#39;this_query&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">name</span>
<span class="go">&#39;this_query&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">streams</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">sq</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">isActive</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">streams</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">sq</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">isActive</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.StreamingQueryManager.resetTerminated">
<code class="descname">resetTerminated</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#StreamingQueryManager.resetTerminated"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.StreamingQueryManager.resetTerminated" title="Permalink to this definition"></a></dt>
<dd><p>Forget about past terminated queries so that <a class="reference internal" href="#pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination" title="pyspark.sql.streaming.StreamingQueryManager.awaitAnyTermination"><code class="xref py py-func docutils literal"><span class="pre">awaitAnyTermination()</span></code></a> can be used
again to wait for new terminations.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">streams</span><span class="o">.</span><span class="n">resetTerminated</span><span class="p">()</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.streaming.DataStreamReader">
<em class="property">class </em><code class="descclassname">pyspark.sql.streaming.</code><code class="descname">DataStreamReader</code><span class="sig-paren">(</span><em>spark</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader" title="Permalink to this definition"></a></dt>
<dd><p>Interface used to load a streaming <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code> from external storage systems
(e.g. file systems, key-value stores, etc.). Use <code class="xref py py-attr docutils literal"><span class="pre">spark.readStream</span></code>
to access this.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.csv">
<code class="descname">csv</code><span class="sig-paren">(</span><em>path</em>, <em>schema=None</em>, <em>sep=None</em>, <em>encoding=None</em>, <em>quote=None</em>, <em>escape=None</em>, <em>comment=None</em>, <em>header=None</em>, <em>inferSchema=None</em>, <em>ignoreLeadingWhiteSpace=None</em>, <em>ignoreTrailingWhiteSpace=None</em>, <em>nullValue=None</em>, <em>nanValue=None</em>, <em>positiveInf=None</em>, <em>negativeInf=None</em>, <em>dateFormat=None</em>, <em>timestampFormat=None</em>, <em>maxColumns=None</em>, <em>maxCharsPerColumn=None</em>, <em>maxMalformedLogPerPartition=None</em>, <em>mode=None</em>, <em>columnNameOfCorruptRecord=None</em>, <em>multiLine=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.csv"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.csv" title="Permalink to this definition"></a></dt>
<dd><p>Loads a CSV file stream and returns the result as a <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code>.</p>
<p>This function will go through the input once to determine the input schema if
<code class="docutils literal"><span class="pre">inferSchema</span></code> is enabled. To avoid going through the entire data once, disable
<code class="docutils literal"><span class="pre">inferSchema</span></code> option or specify the schema explicitly using <code class="docutils literal"><span class="pre">schema</span></code>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – string, or list of strings, for input path(s).</li>
<li><strong>schema</strong> – an optional <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> for the input schema.</li>
<li><strong>sep</strong> – sets the single character as a separator for each field and value.
If None is set, it uses the default value, <code class="docutils literal"><span class="pre">,</span></code>.</li>
<li><strong>encoding</strong> – decodes the CSV files by the given encoding type. If None is set,
it uses the default value, <code class="docutils literal"><span class="pre">UTF-8</span></code>.</li>
<li><strong>quote</strong> – sets the single character used for escaping quoted values where the
separator can be part of the value. If None is set, it uses the default
value, <code class="docutils literal"><span class="pre">&quot;</span></code>. If you would like to turn off quotations, you need to set an
empty string.</li>
<li><strong>escape</strong> – sets the single character used for escaping quotes inside an already
quoted value. If None is set, it uses the default value, <code class="docutils literal"><span class="pre">\</span></code>.</li>
<li><strong>comment</strong> – sets the single character used for skipping lines beginning with this
character. By default (None), it is disabled.</li>
<li><strong>header</strong> – uses the first line as names of columns. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>inferSchema</strong> – infers the input schema automatically from data. It requires one extra
pass over the data. If None is set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>ignoreLeadingWhiteSpace</strong> – a flag indicating whether or not leading whitespaces from
values being read should be skipped. If None is set, it
uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>ignoreTrailingWhiteSpace</strong> – a flag indicating whether or not trailing whitespaces from
values being read should be skipped. If None is set, it
uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>nullValue</strong> – sets the string representation of a null value. If None is set, it uses
the default value, empty string. Since 2.0.1, this <code class="docutils literal"><span class="pre">nullValue</span></code> param
applies to all supported types including the string type.</li>
<li><strong>nanValue</strong> – sets the string representation of a non-number value. If None is set, it
uses the default value, <code class="docutils literal"><span class="pre">NaN</span></code>.</li>
<li><strong>positiveInf</strong> – sets the string representation of a positive infinity value. If None
is set, it uses the default value, <code class="docutils literal"><span class="pre">Inf</span></code>.</li>
<li><strong>negativeInf</strong> – sets the string representation of a negative infinity value. If None
is set, it uses the default value, <code class="docutils literal"><span class="pre">-Inf</span></code>.</li>
<li><strong>dateFormat</strong> – sets the string that indicates a date format. Custom date formats
follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>. This
applies to date type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd</span></code>.</li>
<li><strong>timestampFormat</strong> – sets the string that indicates a timestamp format. Custom date
formats follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>.
This applies to timestamp type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd'T'HH:mm:ss.SSSXXX</span></code>.</li>
<li><strong>maxColumns</strong> – defines a hard limit of how many columns a record can have. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">20480</span></code>.</li>
<li><strong>maxCharsPerColumn</strong> – defines the maximum number of characters allowed for any given
value being read. If None is set, it uses the default value,
<code class="docutils literal"><span class="pre">-1</span></code> meaning unlimited length.</li>
<li><strong>maxMalformedLogPerPartition</strong> – this parameter is no longer used since Spark 2.2.0.
If specified, it is ignored.</li>
<li><strong>mode</strong><dl class="docutils">
<dt>allows a mode for dealing with corrupt records during parsing. If None is</dt>
<dd>set, it uses the default value, <code class="docutils literal"><span class="pre">PERMISSIVE</span></code>.</dd>
</dl>
<ul>
<li><code class="docutils literal"><span class="pre">PERMISSIVE</span></code> : sets other fields to <code class="docutils literal"><span class="pre">null</span></code> when it meets a corrupted record, and puts the malformed string into a field configured by <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code>. To keep corrupt records, an user can set a string type field named <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When a length of parsed CSV tokens is shorter than an expected length of a schema, it sets <cite>null</cite> for extra fields.</li>
<li><code class="docutils literal"><span class="pre">DROPMALFORMED</span></code> : ignores the whole corrupted records.</li>
<li><code class="docutils literal"><span class="pre">FAILFAST</span></code> : throws an exception when it meets corrupted records.</li>
</ul>
</li>
<li><strong>columnNameOfCorruptRecord</strong> – allows renaming the new field having malformed string
created by <code class="docutils literal"><span class="pre">PERMISSIVE</span></code> mode. This overrides
<code class="docutils literal"><span class="pre">spark.sql.columnNameOfCorruptRecord</span></code>. If None is set,
it uses the value specified in
<code class="docutils literal"><span class="pre">spark.sql.columnNameOfCorruptRecord</span></code>.</li>
<li><strong>multiLine</strong> – parse one record, which may span multiple lines. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">csv_sdf</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">csv</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">sdf_schema</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">csv_sdf</span><span class="o">.</span><span class="n">isStreaming</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">csv_sdf</span><span class="o">.</span><span class="n">schema</span> <span class="o">==</span> <span class="n">sdf_schema</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.format">
<code class="descname">format</code><span class="sig-paren">(</span><em>source</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.format"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.format" title="Permalink to this definition"></a></dt>
<dd><p>Specifies the input data source format.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>source</strong> – string, name of the data source, e.g. ‘json’, ‘parquet’.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">s</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">&quot;text&quot;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.json">
<code class="descname">json</code><span class="sig-paren">(</span><em>path</em>, <em>schema=None</em>, <em>primitivesAsString=None</em>, <em>prefersDecimal=None</em>, <em>allowComments=None</em>, <em>allowUnquotedFieldNames=None</em>, <em>allowSingleQuotes=None</em>, <em>allowNumericLeadingZero=None</em>, <em>allowBackslashEscapingAnyCharacter=None</em>, <em>mode=None</em>, <em>columnNameOfCorruptRecord=None</em>, <em>dateFormat=None</em>, <em>timestampFormat=None</em>, <em>multiLine=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.json" title="Permalink to this definition"></a></dt>
<dd><p>Loads a JSON file stream and returns the results as a <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code>.</p>
<p><a class="reference external" href="http://jsonlines.org/">JSON Lines</a> (newline-delimited JSON) is supported by default.
For JSON (one record per file), set the <code class="docutils literal"><span class="pre">multiLine</span></code> parameter to <code class="docutils literal"><span class="pre">true</span></code>.</p>
<p>If the <code class="docutils literal"><span class="pre">schema</span></code> parameter is not specified, this function goes
through the input once to determine the input schema.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – string representing the path to the JSON dataset.</li>
<li><strong>schema</strong> – an optional <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> for the input schema.</li>
<li><strong>primitivesAsString</strong> – infers all primitive values as a string type. If None is set,
it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>prefersDecimal</strong> – infers all floating-point values as a decimal type. If the values
do not fit in decimal, then it infers them as doubles. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>allowComments</strong> – ignores Java/C++ style comments in JSON records. If None is set,
it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>allowUnquotedFieldNames</strong> – allows unquoted JSON field names. If None is set,
it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>allowSingleQuotes</strong> – allows single quotes in addition to double quotes. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">true</span></code>.</li>
<li><strong>allowNumericLeadingZero</strong> – allows leading zeros in numbers (e.g. 00012). If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>allowBackslashEscapingAnyCharacter</strong> – allows accepting quoting of all characters
using the backslash quoting mechanism. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
<li><strong>mode</strong><dl class="docutils">
<dt>allows a mode for dealing with corrupt records during parsing. If None is</dt>
<dd>set, it uses the default value, <code class="docutils literal"><span class="pre">PERMISSIVE</span></code>.</dd>
</dl>
<ul>
<li><code class="docutils literal"><span class="pre">PERMISSIVE</span></code> : sets other fields to <code class="docutils literal"><span class="pre">null</span></code> when it meets a corrupted record, and puts the malformed string into a field configured by <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code>. To keep corrupt records, an user can set a string type field named <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code class="docutils literal"><span class="pre">columnNameOfCorruptRecord</span></code> field in an output schema.</li>
<li><code class="docutils literal"><span class="pre">DROPMALFORMED</span></code> : ignores the whole corrupted records.</li>
<li><code class="docutils literal"><span class="pre">FAILFAST</span></code> : throws an exception when it meets corrupted records.</li>
</ul>
</li>
<li><strong>columnNameOfCorruptRecord</strong> – allows renaming the new field having malformed string
created by <code class="docutils literal"><span class="pre">PERMISSIVE</span></code> mode. This overrides
<code class="docutils literal"><span class="pre">spark.sql.columnNameOfCorruptRecord</span></code>. If None is set,
it uses the value specified in
<code class="docutils literal"><span class="pre">spark.sql.columnNameOfCorruptRecord</span></code>.</li>
<li><strong>dateFormat</strong> – sets the string that indicates a date format. Custom date formats
follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>. This
applies to date type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd</span></code>.</li>
<li><strong>timestampFormat</strong> – sets the string that indicates a timestamp format. Custom date
formats follow the formats at <code class="docutils literal"><span class="pre">java.text.SimpleDateFormat</span></code>.
This applies to timestamp type. If None is set, it uses the
default value, <code class="docutils literal"><span class="pre">yyyy-MM-dd'T'HH:mm:ss.SSSXXX</span></code>.</li>
<li><strong>multiLine</strong> – parse one record, which may span multiple lines, per file. If None is
set, it uses the default value, <code class="docutils literal"><span class="pre">false</span></code>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">json_sdf</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="n">schema</span> <span class="o">=</span> <span class="n">sdf_schema</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">json_sdf</span><span class="o">.</span><span class="n">isStreaming</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">json_sdf</span><span class="o">.</span><span class="n">schema</span> <span class="o">==</span> <span class="n">sdf_schema</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.load">
<code class="descname">load</code><span class="sig-paren">(</span><em>path=None</em>, <em>format=None</em>, <em>schema=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.load"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.load" title="Permalink to this definition"></a></dt>
<dd><p>Loads a data stream from a data source and returns it as a <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – optional string for file-system backed data sources.</li>
<li><strong>format</strong> – optional string for format of the data source. Defaults to ‘parquet’.</li>
<li><strong>schema</strong> – optional <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> for the input schema.</li>
<li><strong>options</strong> – all other string options.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">json_sdf</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">&quot;json&quot;</span><span class="p">)</span> \
<span class="gp">... </span> <span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="n">sdf_schema</span><span class="p">)</span> \
<span class="gp">... </span> <span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">json_sdf</span><span class="o">.</span><span class="n">isStreaming</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">json_sdf</span><span class="o">.</span><span class="n">schema</span> <span class="o">==</span> <span class="n">sdf_schema</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.option">
<code class="descname">option</code><span class="sig-paren">(</span><em>key</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.option"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.option" title="Permalink to this definition"></a></dt>
<dd><p>Adds an input option for the underlying data source.</p>
<dl class="docutils">
<dt>You can set the following option(s) for reading files:</dt>
<dd><ul class="first last simple">
<li><dl class="first docutils">
<dt><code class="docutils literal"><span class="pre">timeZone</span></code>: sets the string that indicates a timezone to be used to parse timestamps</dt>
<dd>in the JSON/CSV datasources or partition values.
If it isn’t set, it uses the default value, session local timezone.</dd>
</dl>
</li>
</ul>
</dd>
</dl>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">s</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">option</span><span class="p">(</span><span class="s2">&quot;x&quot;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.options">
<code class="descname">options</code><span class="sig-paren">(</span><em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.options"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.options" title="Permalink to this definition"></a></dt>
<dd><p>Adds input options for the underlying data source.</p>
<dl class="docutils">
<dt>You can set the following option(s) for reading files:</dt>
<dd><ul class="first last simple">
<li><dl class="first docutils">
<dt><code class="docutils literal"><span class="pre">timeZone</span></code>: sets the string that indicates a timezone to be used to parse timestamps</dt>
<dd>in the JSON/CSV datasources or partition values.
If it isn’t set, it uses the default value, session local timezone.</dd>
</dl>
</li>
</ul>
</dd>
</dl>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">s</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">options</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s2">&quot;1&quot;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</pre></div>
</div>
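<p>Likewise, the <code class="docutils literal"><span class="pre">timeZone</span></code> option described above can be passed as a keyword argument (an illustrative sketch):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; s = spark.readStream.options(timeZone=&quot;GMT&quot;)
</pre></div>
</div>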
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.parquet">
<code class="descname">parquet</code><span class="sig-paren">(</span><em>path</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.parquet"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.parquet" title="Permalink to this definition"></a></dt>
<dd><p>Loads a Parquet file stream, returning the result as a <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code>.</p>
<dl class="docutils">
<dt>You can set the following Parquet-specific option(s) for reading Parquet files:</dt>
<dd><ul class="first last simple">
<li><code class="docutils literal"><span class="pre">mergeSchema</span></code>: sets whether we should merge schemas collected from all Parquet part-files. This will override <code class="docutils literal"><span class="pre">spark.sql.parquet.mergeSchema</span></code>. The default value is specified in <code class="docutils literal"><span class="pre">spark.sql.parquet.mergeSchema</span></code>.</li>
</ul>
</dd>
</dl>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">parquet_sdf</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="n">sdf_schema</span><span class="p">)</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">parquet_sdf</span><span class="o">.</span><span class="n">isStreaming</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">parquet_sdf</span><span class="o">.</span><span class="n">schema</span> <span class="o">==</span> <span class="n">sdf_schema</span>
<span class="go">True</span>
</pre></div>
</div>
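<p>A sketch of supplying the <code class="docutils literal"><span class="pre">mergeSchema</span></code> option described above (the value is passed as a string; the rest mirrors the example):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; parquet_sdf = spark.readStream.option(&quot;mergeSchema&quot;, &quot;true&quot;) \
...     .schema(sdf_schema).parquet(tempfile.mkdtemp())
</pre></div>
</div>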
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.schema">
<code class="descname">schema</code><span class="sig-paren">(</span><em>schema</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.schema"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.schema" title="Permalink to this definition"></a></dt>
<dd><p>Specifies the input schema.</p>
<p>Some data sources (e.g. JSON) can infer the input schema automatically from data.
By specifying the schema here, the underlying data source can skip the schema
inference step, and thus speed up data loading.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>schema</strong> – a <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types.StructType</span></code></a> object</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">s</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="n">sdf_schema</span><span class="p">)</span>
</pre></div>
</div>
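<p>The <code class="docutils literal"><span class="pre">sdf_schema</span></code> used throughout these examples could be built like this (a minimal sketch with a single string column, assumed for illustration):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; from pyspark.sql.types import StructType, StructField, StringType
&gt;&gt;&gt; sdf_schema = StructType([StructField(&quot;data&quot;, StringType(), True)])
</pre></div>
</div>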
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamReader.text">
<code class="descname">text</code><span class="sig-paren">(</span><em>path</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamReader.text"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamReader.text" title="Permalink to this definition"></a></dt>
<dd><p>Loads a text file stream and returns a <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code> whose schema starts with a
string column named “value”, followed by partition columns if there
are any.</p>
<p>Each line in the text file is a new row in the resulting DataFrame.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>paths</strong> – string, or list of strings, for input path(s).</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">text_sdf</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">readStream</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">())</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">text_sdf</span><span class="o">.</span><span class="n">isStreaming</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s2">&quot;value&quot;</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">text_sdf</span><span class="o">.</span><span class="n">schema</span><span class="p">)</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.streaming.DataStreamWriter">
<em class="property">class </em><code class="descclassname">pyspark.sql.streaming.</code><code class="descname">DataStreamWriter</code><span class="sig-paren">(</span><em>df</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter" title="Permalink to this definition"></a></dt>
<dd><p>Interface used to write a streaming <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code> to external storage systems
(e.g. file systems, key-value stores). Use <code class="xref py py-func docutils literal"><span class="pre">DataFrame.writeStream()</span></code>
to access this.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamWriter.format">
<code class="descname">format</code><span class="sig-paren">(</span><em>source</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter.format"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter.format" title="Permalink to this definition"></a></dt>
<dd><p>Specifies the underlying output data source.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>source</strong> – string, name of the data source, which for now can be ‘parquet’.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">writer</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;json&#39;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamWriter.option">
<code class="descname">option</code><span class="sig-paren">(</span><em>key</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter.option"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter.option" title="Permalink to this definition"></a></dt>
<dd><p>Adds an output option for the underlying data source.</p>
<dl class="docutils">
<dt>You can set the following option(s) for writing files:</dt>
<dd><ul class="first last simple">
<li><dl class="first docutils">
<dt><code class="docutils literal"><span class="pre">timeZone</span></code>: sets the string that indicates a timezone to be used to format</dt>
<dd>timestamps in the JSON/CSV datasources or partition values.
If it isn’t set, it uses the default value, session local timezone.</dd>
</dl>
</li>
</ul>
</dd>
</dl>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
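<p>For example, the <cite>checkpointLocation</cite> option mentioned under <code class="xref py py-func docutils literal"><span class="pre">start()</span></code> can be set here (a sketch; the path is hypothetical):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; writer = sdf.writeStream.option(&quot;checkpointLocation&quot;, &quot;/tmp/checkpoints&quot;)
</pre></div>
</div>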
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamWriter.options">
<code class="descname">options</code><span class="sig-paren">(</span><em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter.options"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter.options" title="Permalink to this definition"></a></dt>
<dd><p>Adds output options for the underlying data source.</p>
<dl class="docutils">
<dt>You can set the following option(s) for writing files:</dt>
<dd><ul class="first last simple">
<li><dl class="first docutils">
<dt><code class="docutils literal"><span class="pre">timeZone</span></code>: sets the string that indicates a timezone to be used to format</dt>
<dd>timestamps in the JSON/CSV data sources or partition values.
If it isn’t set, the session local timezone is used by default.</dd>
</dl>
</li>
</ul>
</dd>
</dl>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
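<p>Several options can be set at once as keyword arguments (a sketch; the values are illustrative):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; writer = sdf.writeStream.options(checkpointLocation=&quot;/tmp/checkpoints&quot;, timeZone=&quot;GMT&quot;)
</pre></div>
</div>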
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamWriter.outputMode">
<code class="descname">outputMode</code><span class="sig-paren">(</span><em>outputMode</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter.outputMode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter.outputMode" title="Permalink to this definition"></a></dt>
<dd><p>Specifies how data of a streaming DataFrame/Dataset is written to a streaming sink.</p>
<blockquote>
<div><p>Options include:</p>
<ul class="simple">
<li><dl class="first docutils">
<dt><cite>append</cite>: Only the new rows in the streaming DataFrame/Dataset will be written to</dt>
<dd>the sink.</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt><cite>complete</cite>: All the rows in the streaming DataFrame/Dataset will be written to the sink</dt>
<dd>every time there are some updates.</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt><cite>update</cite>: Only the rows that were updated in the streaming DataFrame/Dataset will be</dt>
<dd>written to the sink every time there are some updates. If the query doesn’t contain
aggregations, it will be equivalent to <cite>append</cite> mode.</dd>
</dl>
</li>
</ul>
</div></blockquote>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p>Evolving.</p>
<div class="last highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">writer</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">outputMode</span><span class="p">(</span><span class="s1">&#39;append&#39;</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamWriter.partitionBy">
<code class="descname">partitionBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter.partitionBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter.partitionBy" title="Permalink to this definition"></a></dt>
<dd><p>Partitions the output by the given columns on the file system.</p>
<p>If specified, the output is laid out on the file system in a manner
similar to Hive’s partitioning scheme.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – name of columns</td>
</tr>
</tbody>
</table>
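<p>A minimal sketch, assuming the streaming DataFrame has <code class="docutils literal"><span class="pre">year</span></code> and <code class="docutils literal"><span class="pre">month</span></code> columns:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; writer = sdf.writeStream.partitionBy(&quot;year&quot;, &quot;month&quot;)
</pre></div>
</div>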
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamWriter.queryName">
<code class="descname">queryName</code><span class="sig-paren">(</span><em>queryName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter.queryName"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter.queryName" title="Permalink to this definition"></a></dt>
<dd><p>Specifies the name of the <a class="reference internal" href="#pyspark.sql.streaming.StreamingQuery" title="pyspark.sql.streaming.StreamingQuery"><code class="xref py py-class docutils literal"><span class="pre">StreamingQuery</span></code></a> that can be started with
<a class="reference internal" href="#pyspark.sql.streaming.DataStreamWriter.start" title="pyspark.sql.streaming.DataStreamWriter.start"><code class="xref py py-func docutils literal"><span class="pre">start()</span></code></a>. This name must be unique among all the currently active queries
in the associated SparkSession.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>queryName</strong> – unique name for the query</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">writer</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">queryName</span><span class="p">(</span><span class="s1">&#39;streaming_query&#39;</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamWriter.start">
<code class="descname">start</code><span class="sig-paren">(</span><em>path=None</em>, <em>format=None</em>, <em>outputMode=None</em>, <em>partitionBy=None</em>, <em>queryName=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter.start"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter.start" title="Permalink to this definition"></a></dt>
<dd><p>Streams the contents of the <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code> to a data source.</p>
<p>The data source is specified by the <code class="docutils literal"><span class="pre">format</span></code> and a set of <code class="docutils literal"><span class="pre">options</span></code>.
If <code class="docutils literal"><span class="pre">format</span></code> is not specified, the default data source configured by
<code class="docutils literal"><span class="pre">spark.sql.sources.default</span></code> will be used.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in a Hadoop supported file system</li>
<li><strong>format</strong> – the format used to save</li>
<li><strong>outputMode</strong><dl class="docutils">
<dt>specifies how data of a streaming DataFrame/Dataset is written to a</dt>
<dd>streaming sink.</dd>
</dl>
<ul>
<li><cite>append</cite>: Only the new rows in the streaming DataFrame/Dataset will be written to the
sink.</li>
<li><dl class="first docutils">
<dt><cite>complete</cite>: All the rows in the streaming DataFrame/Dataset will be written to the sink</dt>
<dd>every time there are some updates.</dd>
</dl>
</li>
<li><cite>update</cite>: Only the rows that were updated in the streaming DataFrame/Dataset will be
written to the sink every time there are some updates. If the query doesn’t contain
aggregations, it will be equivalent to <cite>append</cite> mode.</li>
</ul>
</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
<li><strong>queryName</strong> – unique name for the query</li>
<li><strong>options</strong> – All other string options. You will typically want to provide a <cite>checkpointLocation</cite>
for most streams; it is not required for a <cite>memory</cite> stream.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;memory&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">queryName</span><span class="p">(</span><span class="s1">&#39;this_query&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">isActive</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">name</span>
<span class="go">&#39;this_query&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">isActive</span>
<span class="go">False</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">trigger</span><span class="p">(</span><span class="n">processingTime</span><span class="o">=</span><span class="s1">&#39;5 seconds&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">start</span><span class="p">(</span>
<span class="gp">... </span> <span class="n">queryName</span><span class="o">=</span><span class="s1">&#39;that_query&#39;</span><span class="p">,</span> <span class="n">outputMode</span><span class="o">=</span><span class="s2">&quot;append&quot;</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s1">&#39;memory&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">name</span>
<span class="go">&#39;that_query&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">isActive</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sq</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span>
</pre></div>
</div>
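<p>A file sink would typically be started with an explicit path and checkpoint location; the following sketch assumes hypothetical paths:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>&gt;&gt;&gt; sq = sdf.writeStream.outputMode(&quot;append&quot;).start(
...     path=&quot;/tmp/out&quot;, format=&quot;parquet&quot;, checkpointLocation=&quot;/tmp/cp&quot;)
</pre></div>
</div>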
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.streaming.DataStreamWriter.trigger">
<code class="descname">trigger</code><span class="sig-paren">(</span><em>processingTime=None</em>, <em>once=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/streaming.html#DataStreamWriter.trigger"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.streaming.DataStreamWriter.trigger" title="Permalink to this definition"></a></dt>
<dd><p>Sets the trigger for the stream query. If this is not set, the query will run as fast
as possible, which is equivalent to setting the trigger to <code class="docutils literal"><span class="pre">processingTime='0</span> <span class="pre">seconds'</span></code>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Evolving.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>processingTime</strong> – a processing time interval as a string, e.g. ‘5 seconds’, ‘1 minute’.</td>
</tr>
</tbody>
</table>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="c1"># trigger the query for execution every 5 seconds</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">writer</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">trigger</span><span class="p">(</span><span class="n">processingTime</span><span class="o">=</span><span class="s1">&#39;5 seconds&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># trigger the query for just once batch of data</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">writer</span> <span class="o">=</span> <span class="n">sdf</span><span class="o">.</span><span class="n">writeStream</span><span class="o">.</span><span class="n">trigger</span><span class="p">(</span><span class="n">once</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 2.0.</span></p>
</div>
</dd></dl>
</dd></dl>
</div>
</div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<p class="logo"><a href="index.html">
<img class="logo" src="_static/spark-logo-hd.png" alt="Logo"/>
</a></p>
<h3><a href="index.html">Table Of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">pyspark.sql module</a><ul>
<li><a class="reference internal" href="#module-pyspark.sql">Module Context</a></li>
<li><a class="reference internal" href="#module-pyspark.sql.types">pyspark.sql.types module</a></li>
<li><a class="reference internal" href="#module-pyspark.sql.functions">pyspark.sql.functions module</a></li>
<li><a class="reference internal" href="#module-pyspark.sql.streaming">pyspark.sql.streaming module</a></li>
</ul>
</li>
</ul>
<h4>Previous topic</h4>
<p class="topless"><a href="pyspark.html"
title="previous chapter">pyspark package</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="pyspark.streaming.html"
title="next chapter">pyspark.streaming module</a></p>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="_sources/pyspark.sql.rst.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<div id="searchbox" style="display: none" role="search">
<h3>Quick search</h3>
<form class="search" action="search.html" method="get">
<div><input type="text" name="q" /></div>
<div><input type="submit" value="Go" /></div>
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="pyspark.streaming.html" title="pyspark.streaming module"
>next</a></li>
<li class="right" >
<a href="pyspark.html" title="pyspark package"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">PySpark 2.2.1 documentation</a> &#187;</li>
<li class="nav-item nav-item-1"><a href="pyspark.html" >pyspark package</a> &#187;</li>
</ul>
</div>
<div class="footer" role="contentinfo">
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.6.5.
</div>
</body>
</html>