<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Upgrading PySpark &#8212; PySpark 3.5.0 documentation</title>
<link href="../_static/styles/theme.css?digest=1999514e3f237ded88cf" rel="stylesheet">
<link href="../_static/styles/pydata-sphinx-theme.css?digest=1999514e3f237ded88cf" rel="stylesheet">
<link rel="stylesheet"
href="../_static/vendor/fontawesome/5.13.0/css/all.min.css">
<link rel="preload" as="font" type="font/woff2" crossorigin
href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2">
<link rel="preload" as="font" type="font/woff2" crossorigin
href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2">
<link rel="stylesheet" href="../_static/styles/pydata-sphinx-theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" type="text/css" href="../_static/css/pyspark.css" />
<link rel="preload" as="script" href="../_static/scripts/pydata-sphinx-theme.js?digest=1999514e3f237ded88cf">
<script id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script src="../_static/jquery.js"></script>
<script src="../_static/underscore.js"></script>
<script src="../_static/doctools.js"></script>
<script src="../_static/language_data.js"></script>
<script src="../_static/copybutton.js"></script>
<script crossorigin="anonymous" integrity="sha256-Ae2Vz/4ePdIu6ZyI/5ZGsYnb+m0JlOmKPjt6XZ9JJkA=" src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.4/require.min.js"></script>
<script async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({"tex2jax": {"inlineMath": [["$", "$"], ["\\(", "\\)"]], "processEscapes": true, "ignoreClass": "document", "processClass": "math|output_area"}})</script>
<link rel="canonical" href="https://spark.apache.org/docs/latest/api/python/migration_guide/pyspark_upgrade.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Migrating from Koalas to pandas API on Spark" href="koalas_to_pyspark.html" />
<link rel="prev" title="Migration Guides" href="index.html" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="docsearch:language" content="None">
</head>
<body data-spy="scroll" data-target="#bd-toc-nav" data-offset="80">
<div class="container-fluid" id="banner"></div>
<nav class="navbar navbar-light navbar-expand-lg bg-light fixed-top bd-navbar" id="navbar-main"><div class="container-xl">
<div id="navbar-start">
<a class="navbar-brand" href="../index.html">
<img src="../_static/spark-logo-reverse.png" class="logo" alt="logo">
</a>
</div>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbar-collapsible" aria-controls="navbar-collapsible" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar-collapsible" class="col-lg-9 collapse navbar-collapse">
<div id="navbar-center" class="mr-auto">
<div class="navbar-center-item">
<ul id="navbar-main-elements" class="navbar-nav">
<li class="toctree-l1 nav-item">
<a class="reference internal nav-link" href="../index.html">
Overview
</a>
</li>
<li class="toctree-l1 nav-item">
<a class="reference internal nav-link" href="../getting_started/index.html">
Getting Started
</a>
</li>
<li class="toctree-l1 nav-item">
<a class="reference internal nav-link" href="../user_guide/index.html">
User Guides
</a>
</li>
<li class="toctree-l1 nav-item">
<a class="reference internal nav-link" href="../reference/index.html">
API Reference
</a>
</li>
<li class="toctree-l1 nav-item">
<a class="reference internal nav-link" href="../development/index.html">
Development
</a>
</li>
<li class="toctree-l1 current active nav-item">
<a class="reference internal nav-link" href="index.html">
Migration Guides
</a>
</li>
</ul>
</div>
</div>
<div id="navbar-end">
<div class="navbar-end-item">
<ul id="navbar-icon-links" class="navbar-nav" aria-label="Icon Links">
</ul>
</div>
</div>
</div>
</div>
</nav>
<div class="container-xl">
<div class="row">
<!-- Only show if we have sidebars configured, else just a small margin -->
<div class="col-12 col-md-3 bd-sidebar">
<div class="sidebar-start-items"><form class="bd-search d-flex align-items-center" action="../search.html" method="get">
<i class="icon fas fa-search"></i>
<input type="search" class="form-control" name="q" id="search-input" placeholder="Search the docs ..." aria-label="Search the docs ..." autocomplete="off" >
</form><nav class="bd-links" id="bd-docs-nav" aria-label="Main navigation">
<div class="bd-toc-item active">
<ul class="current nav bd-sidenav">
<li class="toctree-l1 current active">
<a class="current reference internal" href="#">
Upgrading PySpark
</a>
</li>
</ul>
<ul class="nav bd-sidenav">
<li class="toctree-l1">
<a class="reference internal" href="koalas_to_pyspark.html">
Migrating from Koalas to pandas API on Spark
</a>
</li>
</ul>
</div>
</nav>
</div>
<div class="sidebar-end-items">
</div>
</div>
<div class="d-none d-xl-block col-xl-2 bd-toc">
<div class="toc-item">
<div class="tocsection onthispage pt-5 pb-3">
<i class="fas fa-list"></i> On this page
</div>
<nav id="bd-toc-nav">
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-3-3-to-3-4">
Upgrading from PySpark 3.3 to 3.4
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-3-2-to-3-3">
Upgrading from PySpark 3.2 to 3.3
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-3-1-to-3-2">
Upgrading from PySpark 3.1 to 3.2
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-2-4-to-3-0">
Upgrading from PySpark 2.4 to 3.0
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-2-3-to-2-4">
Upgrading from PySpark 2.3 to 2.4
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-2-3-0-to-2-3-1-and-above">
Upgrading from PySpark 2.3.0 to 2.3.1 and above
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-2-2-to-2-3">
Upgrading from PySpark 2.2 to 2.3
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-1-4-to-1-5">
Upgrading from PySpark 1.4 to 1.5
</a>
</li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#upgrading-from-pyspark-1-0-1-2-to-1-3">
Upgrading from PySpark 1.0-1.2 to 1.3
</a>
</li>
</ul>
</nav>
</div>
<div class="toc-item">
</div>
</div>
<main class="col-12 col-md-9 col-xl-7 py-md-5 pl-md-5 pr-md-4 bd-content" role="main">
<div>
<div class="section" id="upgrading-pyspark">
<h1>Upgrading PySpark<a class="headerlink" href="#upgrading-pyspark" title="Permalink to this headline"></a></h1>
<div class="section" id="upgrading-from-pyspark-3-3-to-3-4">
<h2>Upgrading from PySpark 3.3 to 3.4<a class="headerlink" href="#upgrading-from-pyspark-3-3-to-3-4" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><p>In Spark 3.4, the schema of an array column is inferred by merging the schemas of all elements in the array. To restore the previous behavior where the schema is only inferred from the first element, you can set <code class="docutils literal notranslate"><span class="pre">spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled</span></code> to <code class="docutils literal notranslate"><span class="pre">true</span></code>.</p></li>
<li><p>In Spark 3.4, if the return type of the Pandas on Spark API <code class="docutils literal notranslate"><span class="pre">Groupby.apply</span></code>’s <code class="docutils literal notranslate"><span class="pre">func</span></code> parameter is not specified and <code class="docutils literal notranslate"><span class="pre">compute.shortcut_limit</span></code> is set to 0, the number of sampled rows is set to 2 (so that at least 2 rows are always sampled) to make sure the inferred schema is accurate.</p></li>
<li><p>In Spark 3.4, if the Pandas on Spark API <code class="docutils literal notranslate"><span class="pre">Index.insert</span></code> receives an out-of-bounds location, it raises an IndexError with <code class="docutils literal notranslate"><span class="pre">index</span> <span class="pre">{}</span> <span class="pre">is</span> <span class="pre">out</span> <span class="pre">of</span> <span class="pre">bounds</span> <span class="pre">for</span> <span class="pre">axis</span> <span class="pre">0</span> <span class="pre">with</span> <span class="pre">size</span> <span class="pre">{}</span></code> to follow pandas 1.4 behavior.</p></li>
<li><p>In Spark 3.4, the series name will be preserved in Pandas on Spark API <code class="docutils literal notranslate"><span class="pre">Series.mode</span></code> to follow pandas 1.4 behavior.</p></li>
<li><p>In Spark 3.4, the Pandas on Spark API <code class="docutils literal notranslate"><span class="pre">Index.__setitem__</span></code> first checks whether <code class="docutils literal notranslate"><span class="pre">value</span></code> is of <code class="docutils literal notranslate"><span class="pre">Column</span></code> type, to avoid raising an unexpected <code class="docutils literal notranslate"><span class="pre">ValueError</span></code> in <code class="docutils literal notranslate"><span class="pre">is_list_like</span></code>, such as <cite>Cannot convert column into bool: please use ‘&amp;’ for ‘and’, ‘|’ for ‘or’, ‘~’ for ‘not’ when building DataFrame boolean expressions.</cite></p></li>
<li><p>In Spark 3.4, the Pandas on Spark API <code class="docutils literal notranslate"><span class="pre">astype('category')</span></code> also refreshes <code class="docutils literal notranslate"><span class="pre">categories.dtype</span></code> according to the original data’s <code class="docutils literal notranslate"><span class="pre">dtype</span></code> to follow pandas 1.4 behavior.</p></li>
<li><p>In Spark 3.4, the Pandas on Spark API supports groupby positional indexing in <code class="docutils literal notranslate"><span class="pre">GroupBy.head</span></code> and <code class="docutils literal notranslate"><span class="pre">GroupBy.tail</span></code> to follow pandas 1.4. Negative arguments now work correctly and result in ranges relative to the end and start of each group; previously, negative arguments returned empty frames.</p></li>
<li><p>In Spark 3.4, the schema inference of <code class="docutils literal notranslate"><span class="pre">groupby.apply</span></code> in Pandas on Spark first infers the pandas type to ensure the accuracy of the pandas <code class="docutils literal notranslate"><span class="pre">dtype</span></code> as much as possible.</p></li>
<li><p>In Spark 3.4, the <code class="docutils literal notranslate"><span class="pre">Series.concat</span></code> sort parameter will be respected to follow pandas 1.4 behaviors.</p></li>
<li><p>In Spark 3.4, <code class="docutils literal notranslate"><span class="pre">DataFrame.__setitem__</span></code> makes a copy and replaces pre-existing arrays rather than overwriting them in place, to follow pandas 1.4 behavior.</p></li>
<li><p>In Spark 3.4, <code class="docutils literal notranslate"><span class="pre">SparkSession.sql</span></code> and the Pandas on Spark API <code class="docutils literal notranslate"><span class="pre">sql</span></code> have a new parameter <code class="docutils literal notranslate"><span class="pre">args</span></code>, which provides binding of named parameters to their SQL literals (see the sketch after this list).</p></li>
<li><p>In Spark 3.4, Pandas API on Spark follows pandas 2.0, and some APIs were deprecated or removed in Spark 3.4 according to the changes made in pandas 2.0. Please refer to the <a class="reference external" href="https://pandas.pydata.org/docs/dev/whatsnew/">release notes of pandas</a> for more details.</p></li>
<li><p>In Spark 3.4, the custom monkey-patch of <code class="docutils literal notranslate"><span class="pre">collections.namedtuple</span></code> was removed, and <code class="docutils literal notranslate"><span class="pre">cloudpickle</span></code> was used by default. To restore the previous behavior for any relevant pickling issue of <code class="docutils literal notranslate"><span class="pre">collections.namedtuple</span></code>, set <code class="docutils literal notranslate"><span class="pre">PYSPARK_ENABLE_NAMEDTUPLE_PATCH</span></code> environment variable to <code class="docutils literal notranslate"><span class="pre">1</span></code>.</p></li>
</ul>
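<p>The following is a minimal sketch of the new <code class="docutils literal notranslate"><span class="pre">args</span></code> parameter, assuming an active <code class="docutils literal notranslate"><span class="pre">SparkSession</span></code> named <code class="docutils literal notranslate"><span class="pre">spark</span></code>; the <code class="docutils literal notranslate"><span class="pre">:name</span></code> parameter-marker style is an assumption here, so check the <code class="docutils literal notranslate"><span class="pre">SparkSession.sql</span></code> API reference for the exact syntax supported by your version.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Sketch: bind a named parameter to a SQL literal (Spark 3.4+).
# Assumes an active SparkSession named `spark`; the :min_id marker style
# is an assumption - see the SparkSession.sql API docs for your version.
df = spark.sql(
    "SELECT * FROM range(10) WHERE id &gt; :min_id",
    args={"min_id": 7},
)
df.show()
</pre></div></div>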
</div>
<div class="section" id="upgrading-from-pyspark-3-2-to-3-3">
<h2>Upgrading from PySpark 3.2 to 3.3<a class="headerlink" href="#upgrading-from-pyspark-3-2-to-3-3" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><p>In Spark 3.3, the <code class="docutils literal notranslate"><span class="pre">pyspark.pandas.sql</span></code> method follows <a class="reference external" href="https://docs.python.org/3/library/string.html#format-string-syntax">the standard Python string formatter</a>. To restore the previous behavior, set the <code class="docutils literal notranslate"><span class="pre">PYSPARK_PANDAS_SQL_LEGACY</span></code> environment variable to <code class="docutils literal notranslate"><span class="pre">1</span></code>.</p></li>
<li><p>In Spark 3.3, the <code class="docutils literal notranslate"><span class="pre">drop</span></code> method of the pandas API on Spark DataFrame supports dropping rows by <code class="docutils literal notranslate"><span class="pre">index</span></code>, and drops by index instead of by column by default.</p></li>
<li><p>In Spark 3.3, PySpark upgrades its pandas dependency: the minimum required version changes from 0.23.2 to 1.0.5.</p></li>
<li><p>In Spark 3.3, the <code class="docutils literal notranslate"><span class="pre">repr</span></code> return values of SQL DataTypes have been changed to yield an object with the same value when passed to <code class="docutils literal notranslate"><span class="pre">eval</span></code> (see the sketch after this list).</p></li>
</ul>
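<p>A small sketch of the <code class="docutils literal notranslate"><span class="pre">repr</span></code> round trip:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([StructField("name", StringType(), True)])

# In Spark 3.3+, repr() yields a string that eval() turns back into an equal object.
assert eval(repr(schema)) == schema
</pre></div></div>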
</div>
<div class="section" id="upgrading-from-pyspark-3-1-to-3-2">
<h2>Upgrading from PySpark 3.1 to 3.2<a class="headerlink" href="#upgrading-from-pyspark-3-1-to-3-2" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><p>In Spark 3.2, the PySpark methods from the sql, ml, and spark_on_pandas modules raise <code class="docutils literal notranslate"><span class="pre">TypeError</span></code> instead of <code class="docutils literal notranslate"><span class="pre">ValueError</span></code> when they are applied to a parameter of inappropriate type.</p></li>
<li><p>In Spark 3.2, the traceback from Python UDFs, pandas UDFs, and pandas function APIs is simplified by default, without the traceback from the internal Python workers. In Spark 3.1 or earlier, the traceback from Python workers was printed out. To restore the behavior before Spark 3.2, you can set <code class="docutils literal notranslate"><span class="pre">spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled</span></code> to <code class="docutils literal notranslate"><span class="pre">false</span></code>.</p></li>
<li><p>In Spark 3.2, pinned thread mode is enabled by default to map each Python thread to a corresponding JVM thread. Previously,
one JVM thread could be reused for multiple Python threads, which resulted in one JVM thread local being shared across multiple Python threads.
Also, note that <code class="docutils literal notranslate"><span class="pre">pyspark.InheritableThread</span></code> or <code class="docutils literal notranslate"><span class="pre">pyspark.inheritable_thread_target</span></code> is now recommended for Python threads
so that they properly inherit inheritable attributes such as local properties from the JVM thread, and to avoid a potential resource leak (see the sketch after this list).
To restore the behavior before Spark 3.2, you can set the <code class="docutils literal notranslate"><span class="pre">PYSPARK_PIN_THREAD</span></code> environment variable to <code class="docutils literal notranslate"><span class="pre">false</span></code>.</p></li>
</ul>
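<p>A minimal sketch of running a background job with <code class="docutils literal notranslate"><span class="pre">pyspark.InheritableThread</span></code> under pinned thread mode, assuming an active <code class="docutils literal notranslate"><span class="pre">SparkSession</span></code> named <code class="docutils literal notranslate"><span class="pre">spark</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
from pyspark import InheritableThread

def background_job():
    # Assumes an active SparkSession named `spark`.
    # Inheritable attributes such as local properties set on the parent
    # thread are carried over, and the paired JVM thread is cleaned up
    # when this Python thread exits.
    spark.range(100).count()

t = InheritableThread(target=background_job)
t.start()
t.join()
</pre></div></div>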
</div>
<div class="section" id="upgrading-from-pyspark-2-4-to-3-0">
<h2>Upgrading from PySpark 2.4 to 3.0<a class="headerlink" href="#upgrading-from-pyspark-2-4-to-3-0" title="Permalink to this headline"></a></h2>
<ul>
<li><p>In Spark 3.0, PySpark requires a pandas version of 0.23.2 or higher to use pandas related functionality, such as <code class="docutils literal notranslate"><span class="pre">toPandas</span></code>, <code class="docutils literal notranslate"><span class="pre">createDataFrame</span></code> from pandas DataFrame, and so on.</p></li>
<li><p>In Spark 3.0, PySpark requires a PyArrow version of 0.12.1 or higher to use PyArrow related functionality, such as <code class="docutils literal notranslate"><span class="pre">pandas_udf</span></code>, <code class="docutils literal notranslate"><span class="pre">toPandas</span></code> and <code class="docutils literal notranslate"><span class="pre">createDataFrame</span></code> with “spark.sql.execution.arrow.enabled=true”, etc.</p></li>
<li><p>In PySpark, when creating a <code class="docutils literal notranslate"><span class="pre">SparkSession</span></code> with <code class="docutils literal notranslate"><span class="pre">SparkSession.builder.getOrCreate()</span></code>, if there is an existing <code class="docutils literal notranslate"><span class="pre">SparkContext</span></code>, the builder used to try to update the <code class="docutils literal notranslate"><span class="pre">SparkConf</span></code> of the existing <code class="docutils literal notranslate"><span class="pre">SparkContext</span></code> with configurations specified to the builder; however, the <code class="docutils literal notranslate"><span class="pre">SparkContext</span></code> is shared by all <code class="docutils literal notranslate"><span class="pre">SparkSession</span></code>s, so it should not be updated. In Spark 3.0, the builder no longer updates the configurations. This is the same behavior as the Java/Scala API in 2.3 and above. If you want to update them, you need to do so prior to creating a <code class="docutils literal notranslate"><span class="pre">SparkSession</span></code>.</p></li>
<li><p>In PySpark, when Arrow optimization is enabled, if Arrow version is higher than 0.11.0, Arrow can perform safe type conversion when converting pandas.Series to an Arrow array during serialization. Arrow raises errors when detecting unsafe type conversions like overflow. You enable it by setting <code class="docutils literal notranslate"><span class="pre">spark.sql.execution.pandas.convertToArrowArraySafely</span></code> to true. The default setting is false. PySpark behavior for Arrow versions is illustrated in the following table:</p>
<blockquote>
<div><table class="table">
<colgroup>
<col style="width: 49%" />
<col style="width: 20%" />
<col style="width: 31%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>PyArrow version</p></th>
<th class="head"><p>Integer overflow</p></th>
<th class="head"><p>Floating point truncation</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>0.11.0 and below</p></td>
<td><p>Raise error</p></td>
<td><p>Silently allows</p></td>
</tr>
<tr class="row-odd"><td><p>&gt; 0.11.0, arrowSafeTypeConversion=false</p></td>
<td><p>Silent overflow</p></td>
<td><p>Silently allows</p></td>
</tr>
<tr class="row-even"><td><p>&gt; 0.11.0, arrowSafeTypeConversion=true</p></td>
<td><p>Raise error</p></td>
<td><p>Raise error</p></td>
</tr>
</tbody>
</table>
</div></blockquote>
</li>
<li><p>In Spark 3.0, <code class="docutils literal notranslate"><span class="pre">createDataFrame(...,</span> <span class="pre">verifySchema=True)</span></code> validates LongType as well in PySpark. Previously, LongType was not verified and resulted in None in case the value overflows. To restore this behavior, verifySchema can be set to False to disable the validation.</p></li>
<li><p>As of Spark 3.0, <code class="docutils literal notranslate"><span class="pre">Row</span></code> field names are no longer sorted alphabetically when constructed with named arguments on Python 3.6 and above, and the order of fields matches the order in which they were entered (see the sketch after this list). To enable sorted fields by default, as in Spark 2.4, set the environment variable <code class="docutils literal notranslate"><span class="pre">PYSPARK_ROW_FIELD_SORTING_ENABLED</span></code> to true for both executors and driver; this environment variable must be consistent on all executors and the driver, otherwise it may cause failures or incorrect answers. For Python versions lower than 3.6, the field names are always sorted alphabetically.</p></li>
<li><p>In Spark 3.0, <code class="docutils literal notranslate"><span class="pre">pyspark.ml.param.shared.Has*</span></code> mixins do not provide any <code class="docutils literal notranslate"><span class="pre">set*(self,</span> <span class="pre">value)</span></code> setter methods anymore, use the respective <code class="docutils literal notranslate"><span class="pre">self.set(self.*,</span> <span class="pre">value)</span></code> instead. See <a class="reference external" href="https://issues.apache.org/jira/browse/SPARK-29093">SPARK-29093</a> for details.</p></li>
</ul>
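<p>A short sketch of the <code class="docutils literal notranslate"><span class="pre">Row</span></code> field-ordering change:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
from pyspark.sql import Row

# In Spark 3.0+ on Python 3.6+, fields keep the order in which they are passed.
row = Row(name="Alice", age=11)
print(row)   # Row(name='Alice', age=11)

# In Spark 2.4, the same call produced alphabetically sorted fields:
# Row(age=11, name='Alice')
</pre></div></div>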
</div>
<div class="section" id="upgrading-from-pyspark-2-3-to-2-4">
<h2>Upgrading from PySpark 2.3 to 2.4<a class="headerlink" href="#upgrading-from-pyspark-2-3-to-2-4" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><p>In PySpark, when Arrow optimization is enabled, <code class="docutils literal notranslate"><span class="pre">toPandas</span></code> previously just failed when Arrow optimization could not be used, whereas <code class="docutils literal notranslate"><span class="pre">createDataFrame</span></code> from a pandas DataFrame allowed falling back to the non-optimized path. Now, both <code class="docutils literal notranslate"><span class="pre">toPandas</span></code> and <code class="docutils literal notranslate"><span class="pre">createDataFrame</span></code> from a pandas DataFrame allow the fallback by default, which can be switched off via <code class="docutils literal notranslate"><span class="pre">spark.sql.execution.arrow.fallback.enabled</span></code> (see the sketch after this list).</p></li>
</ul>
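<p>A minimal sketch of switching the fallback off, assuming an active <code class="docutils literal notranslate"><span class="pre">SparkSession</span></code> named <code class="docutils literal notranslate"><span class="pre">spark</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Assumes an active SparkSession named `spark`.
# Disable the silent fallback so Arrow failures surface as errors
# instead of quietly switching to the non-Arrow path.
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")

pdf = spark.range(5).toPandas()  # fails loudly if Arrow cannot be used
</pre></div></div>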
</div>
<div class="section" id="upgrading-from-pyspark-2-3-0-to-2-3-1-and-above">
<h2>Upgrading from PySpark 2.3.0 to 2.3.1 and above<a class="headerlink" href="#upgrading-from-pyspark-2-3-0-to-2-3-1-and-above" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><p>As of version 2.3.1, Arrow functionality, including <code class="docutils literal notranslate"><span class="pre">pandas_udf</span></code> and <code class="docutils literal notranslate"><span class="pre">toPandas()</span></code>/<code class="docutils literal notranslate"><span class="pre">createDataFrame()</span></code> with <code class="docutils literal notranslate"><span class="pre">spark.sql.execution.arrow.enabled</span></code> set to <code class="docutils literal notranslate"><span class="pre">True</span></code>, has been marked as experimental (see the sketch after this list). These features are still evolving and are not currently recommended for use in production.</p></li>
</ul>
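<p>For reference, a sketch of opting in to the (then experimental) Arrow path, assuming an active <code class="docutils literal notranslate"><span class="pre">SparkSession</span></code> named <code class="docutils literal notranslate"><span class="pre">spark</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Assumes an active SparkSession named `spark`.
# Opt in to the Arrow-based conversions used by toPandas()/createDataFrame().
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = spark.range(10).toPandas()
</pre></div></div>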
</div>
<div class="section" id="upgrading-from-pyspark-2-2-to-2-3">
<h2>Upgrading from PySpark 2.2 to 2.3<a class="headerlink" href="#upgrading-from-pyspark-2-2-to-2-3" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><p>In PySpark, pandas 0.19.2 or higher is now required if you want to use pandas related functionalities, such as <code class="docutils literal notranslate"><span class="pre">toPandas</span></code>, <code class="docutils literal notranslate"><span class="pre">createDataFrame</span></code> from a pandas DataFrame, etc.</p></li>
<li><p>In PySpark, the behavior of timestamp values for pandas related functionalities was changed to respect the session timezone. If you want to use the old behavior, you need to set the configuration <code class="docutils literal notranslate"><span class="pre">spark.sql.execution.pandas.respectSessionTimeZone</span></code> to False. See <a class="reference external" href="https://issues.apache.org/jira/browse/SPARK-22395">SPARK-22395</a> for details.</p></li>
<li><p>In PySpark, <code class="docutils literal notranslate"><span class="pre">na.fill()</span></code> or <code class="docutils literal notranslate"><span class="pre">fillna</span></code> also accepts a boolean and replaces nulls with booleans (see the sketch after this list). In prior Spark versions, PySpark just ignored it and returned the original Dataset/DataFrame.</p></li>
<li><p>In PySpark, <code class="docutils literal notranslate"><span class="pre">df.replace</span></code> no longer allows omitting value when <code class="docutils literal notranslate"><span class="pre">to_replace</span></code> is not a dictionary. Previously, value could be omitted in the other cases and defaulted to None, which was counterintuitive and error-prone.</p></li>
</ul>
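<p>A minimal sketch of the new boolean handling in <code class="docutils literal notranslate"><span class="pre">na.fill()</span></code>, assuming an active <code class="docutils literal notranslate"><span class="pre">SparkSession</span></code> named <code class="docutils literal notranslate"><span class="pre">spark</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame([(True,), (None,)], "flag boolean")

# In Spark 2.3+, the null boolean is replaced with False;
# earlier versions silently returned df unchanged.
df.na.fill(False).show()
</pre></div></div>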
</div>
<div class="section" id="upgrading-from-pyspark-1-4-to-1-5">
<h2>Upgrading from PySpark 1.4 to 1.5<a class="headerlink" href="#upgrading-from-pyspark-1-4-to-1-5" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><p>Resolution of strings to columns in Python now supports using dots (.) to qualify a column or access nested values, for example <code class="docutils literal notranslate"><span class="pre">df['table.column.nestedField']</span></code>. However, this means that if your column name contains any dots, you must now escape them using backticks (e.g., <code class="docutils literal notranslate"><span class="pre">table.`column.with.dots`.nested</span></code>); see the sketch after this list.</p></li>
<li><p>The DataFrame.withColumn method in PySpark supports adding a new column or replacing an existing column of the same name.</p></li>
</ul>
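<p>A short sketch of the escaping rule, assuming a DataFrame <code class="docutils literal notranslate"><span class="pre">df</span></code> that has a nested struct column <code class="docutils literal notranslate"><span class="pre">address</span></code> as well as a column literally named <code class="docutils literal notranslate"><span class="pre">a.b</span></code> (both hypothetical):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Hypothetical columns: a struct column `address` and a column named "a.b".
# Dots traverse nested fields...
df.select("address.city")

# ...so a column whose name itself contains dots must be escaped with backticks.
df.select("`a.b`")
</pre></div></div>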
</div>
<div class="section" id="upgrading-from-pyspark-1-0-1-2-to-1-3">
<h2>Upgrading from PySpark 1.0-1.2 to 1.3<a class="headerlink" href="#upgrading-from-pyspark-1-0-1-2-to-1-3" title="Permalink to this headline"></a></h2>
<ul class="simple">
<li><p>When using DataTypes in Python, you will need to construct them (e.g. <code class="docutils literal notranslate"><span class="pre">StringType()</span></code>) instead of referencing a singleton (see the sketch after this list).</p></li>
</ul>
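<p>A minimal sketch of the change:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
from pyspark.sql.types import StringType, StructField, StructType

# Since 1.3, data types must be instantiated...
schema = StructType([StructField("name", StringType(), True)])

# ...rather than referenced as singletons, as was possible in 1.0-1.2:
# StructType([StructField("name", StringType, True)])  # no longer valid
</pre></div></div>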
</div>
</div>
</div>
<!-- Previous / next buttons -->
<div class='prev-next-area'>
<a class='left-prev' id="prev-link" href="index.html" title="previous page">
<i class="fas fa-angle-left"></i>
<div class="prev-next-info">
<p class="prev-next-subtitle">previous</p>
<p class="prev-next-title">Migration Guides</p>
</div>
</a>
<a class='right-next' id="next-link" href="koalas_to_pyspark.html" title="next page">
<div class="prev-next-info">
<p class="prev-next-subtitle">next</p>
<p class="prev-next-title">Migrating from Koalas to pandas API on Spark</p>
</div>
<i class="fas fa-angle-right"></i>
</a>
</div>
</main>
</div>
</div>
<script src="../_static/scripts/pydata-sphinx-theme.js?digest=1999514e3f237ded88cf"></script>
<footer class="footer mt-5 mt-md-0">
<div class="container">
<div class="footer-item">
<p class="copyright">
&copy; Copyright.<br>
</p>
</div>
<div class="footer-item">
<p class="sphinx-version">
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 3.0.4.<br>
</p>
</div>
</div>
</footer>
</body>
</html>