| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "id": "4fa81d13", |
| "metadata": {}, |
| "source": [ |
| "# ANSI Migration Guide - Pandas API on Spark\n", |
| "ANSI mode is now on by default for Pandas API on Spark. This guide helps you understand the key behavior differences you’ll see.\n", |
| "In short, with ANSI mode on, Pandas API on Spark matches native pandas behavior in cases where it previously diverged with ANSI mode off." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "6e1c7952", |
| "metadata": {}, |
| "source": [ |
| "## Behavior Change\n", |
| "### String Number Comparison\n", |
| "**ANSI off:** Spark implicitly casts numbers and strings, so `1` and `'1'` are considered equal.\n", |
| "\n", |
| "**ANSI on:** behaves like pandas: `1 == '1'` evaluates to `False`." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "69474e28-c1cd-40fe-8ec6-7373b56c4dee", |
| "metadata": {}, |
| "source": [ |
| "For example:\n", |
| "\n", |
| "```python\n", |
| ">>> pdf = pd.DataFrame({\"int\": [1, 2], \"str\": [\"1\", \"2\"]})\n", |
| ">>> psdf = ps.from_pandas(pdf)\n", |
| "\n", |
| "# ANSI on\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", |
| ">>> psdf[\"int\"] == psdf[\"str\"]\n", |
| "0 False\n", |
| "1 False\n", |
| "dtype: bool\n", |
| "\n", |
| "# ANSI off\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", |
| ">>> psdf[\"int\"] == psdf[\"str\"]\n", |
| "0 True\n", |
| "1 True\n", |
| "dtype: bool\n", |
| "\n", |
| "# Pandas\n", |
| ">>> pdf[\"int\"] == pdf[\"str\"]\n", |
| "0 False\n", |
| "1 False\n", |
| "dtype: bool\n", |
| "```" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "90a4ea8d", |
| "metadata": {}, |
| "source": [ |
| "### Strict Casting\n", |
| "**ANSI off:** invalid casts (e.g., `'a' → int`) silently produce NULL.\n", |
| "\n", |
| "**ANSI on:** the same casts raise errors." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "b361febc-4435-4bd1-9ee1-4874413d770c", |
| "metadata": {}, |
| "source": [ |
| "For example:\n", |
| "\n", |
| "```python\n", |
| ">>> pdf = pd.DataFrame({\"str\": [\"a\"]})\n", |
| ">>> psdf = ps.from_pandas(pdf)\n", |
| "\n", |
| "# ANSI on\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", |
| ">>> psdf[\"str\"].astype(int)\n", |
| "Traceback (most recent call last):\n", |
| "...\n", |
| "pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] ...\n", |
| "\n", |
| "# ANSI off\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", |
| ">>> psdf[\"str\"].astype(int)\n", |
| "0 NaN\n", |
| "Name: str, dtype: float64\n", |
| "\n", |
| "# Pandas\n", |
| ">>> pdf[\"str\"].astype(int)\n", |
| "Traceback (most recent call last):\n", |
| "...\n", |
| "ValueError: invalid literal for int() with base 10: 'a'\n", |
| "```" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "e11583e2", |
| "metadata": {}, |
| "source": [ |
| "### `MultiIndex.to_series` Return Type\n", |
| "**ANSI off:** Each row is returned as an `ArrayType` value, e.g. `[1, red]`.\n", |
| "\n", |
| "**ANSI on:** Each row is returned as a `StructType` value, which appears as a tuple (e.g., `(1, red)`) if the Runtime SQL Configuration `spark.sql.execution.pandas.structHandlingMode` is set to `'row'`. Otherwise, the result may vary depending on whether Arrow is used. See more in the [Spark Runtime SQL Configuration docs](https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration)." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "4671a895-ed40-4bc4-b1bc-fa9fbb86cc18", |
| "metadata": {}, |
| "source": [ |
| "For example:\n", |
| "\n", |
| "```python\n", |
| ">>> arrays = [[1, 2], [\"red\", \"blue\"]]\n", |
| ">>> pidx = pd.MultiIndex.from_arrays(arrays, names=(\"number\", \"color\"))\n", |
| ">>> psidx = ps.from_pandas(pidx)\n", |
| "\n", |
| "# ANSI on\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", |
| ">>> spark.conf.set(\"spark.sql.execution.pandas.structHandlingMode\", \"row\")\n", |
| ">>> psidx.to_series()\n", |
| "number color\n", |
| "1 red (1, red)\n", |
| "2 blue (2, blue)\n", |
| "dtype: object\n", |
| "\n", |
| "# ANSI off\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", |
| ">>> psidx.to_series()\n", |
| "number color\n", |
| "1 red [1, red]\n", |
| "2 blue [2, blue]\n", |
| "dtype: object\n", |
| "\n", |
| "# Pandas\n", |
| ">>> pidx.to_series()\n", |
| "number color\n", |
| "1 red (1, red)\n", |
| "2 blue (2, blue)\n", |
| "dtype: object\n", |
| "```" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "a9ceb6cb-3bc4-4c23-b74b-84e60fd64e11", |
| "metadata": {}, |
| "source": [ |
| "### Invalid Mixed-Type Operations\n", |
| "**ANSI off:** Spark implicitly coerces the operands, so these operations succeed.\n", |
| "\n", |
| "**ANSI on:** behaves like pandas: such operations are disallowed and raise errors.\n", |
| "\n", |
| "Operation types that show behavior changes under ANSI mode:\n", |
| "\n", |
| "- **Decimal–Float Arithmetic**: `/`, `//`, `*`, `%` \n", |
| "- **Boolean vs. None**: `|`, `&`, `^`" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "2a8d5705-11ea-458c-8528-c7b1b7c88472", |
| "metadata": {}, |
| "source": [ |
| "Example: Decimal–Float Arithmetic\n", |
| "```python\n", |
| ">>> import decimal\n", |
| ">>> pser = pd.Series([decimal.Decimal(1), decimal.Decimal(2)])\n", |
| ">>> psser = ps.from_pandas(pser)\n", |
| "\n", |
| "# ANSI on\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", |
| ">>> psser * 0.1\n", |
| "Traceback (most recent call last):\n", |
| "...\n", |
| "TypeError: Multiplication can not be applied to given types.\n", |
| "\n", |
| "# ANSI off\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", |
| ">>> psser * 0.1\n", |
| "0 0.1\n", |
| "1 0.2\n", |
| "dtype: float64\n", |
| "\n", |
| "# Pandas\n", |
| ">>> pser * 0.1\n", |
| "Traceback (most recent call last):\n", |
| "...\n", |
| "TypeError: unsupported operand type(s) for *: 'decimal.Decimal' and 'float'\n", |
| "```" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "0d2b8268-4b98-4239-95db-5269f9c658d2", |
| "metadata": {}, |
| "source": [ |
| "Example: Boolean vs. None\n", |
| "```python\n", |
| "# ANSI on\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", |
| ">>> ps.Series([True, False]) | None\n", |
| "Traceback (most recent call last):\n", |
| "...\n", |
| "TypeError: OR can not be applied to given types.\n", |
| "\n", |
| "# ANSI off\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", |
| ">>> ps.Series([True, False]) | None\n", |
| "0 False\n", |
| "1 False\n", |
| "dtype: bool\n", |
| "\n", |
| "# Pandas\n", |
| ">>> pd.Series([True, False]) | None\n", |
| "Traceback (most recent call last):\n", |
| "...\n", |
| "TypeError: unsupported operand type(s) for |: 'bool' and 'NoneType'\n", |
| "```" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "fe146afd", |
| "metadata": {}, |
| "source": [ |
| "## Related Configurations\n", |
| "\n", |
| "### `spark.sql.ansi.enabled` (Spark config)\n", |
| "- Native Spark setting that controls ANSI mode. \n", |
| "- The overarching config: it governs both SQL and pandas API behavior. \n", |
| "- If set to **False**, Spark reverts to the legacy behavior, and the options below have no effect.\n", |
| "\n", |
| "### `compute.ansi_mode_support` (Pandas API on Spark option)\n", |
| "- Controls whether pandas API on Spark applies its ANSI-aware implementations. \n", |
| "- Effective only when ANSI is enabled. \n", |
| "- If set to **False**, pandas API on Spark may produce unexpected results or errors. \n", |
| "- Default is **True**.\n", |
| "\n", |
| "### `compute.fail_on_ansi_mode` (Pandas API on Spark option)\n", |
| "- Controls whether pandas API on Spark fails immediately when ANSI mode is enabled. \n", |
| "- Effective only when ANSI is enabled and `compute.ansi_mode_support` is **False**. \n", |
| "- If set to **False**, pandas API on Spark falls back to the old (non-ANSI) behavior instead of failing, even when ANSI is enabled." |
| ] |
| } |
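| , |
| { |
| "cell_type": "markdown", |
| "id": "c3f1a2b4-0000-4000-8000-ansi-config-ex", |
| "metadata": {}, |
| "source": [ |
| "A minimal sketch of how these configurations can be combined. It assumes an active `SparkSession` bound to `spark`; the option names are those listed above, and `ps.set_option` is the standard way to set Pandas API on Spark options:\n", |
| "\n", |
| "```python\n", |
| ">>> import pyspark.pandas as ps\n", |
| "\n", |
| "# Keep ANSI mode on (the default) with full ANSI-aware behavior.\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", True)\n", |
| ">>> ps.set_option(\"compute.ansi_mode_support\", True)\n", |
| "\n", |
| "# Opt out of ANSI-aware behavior without failing immediately:\n", |
| "# fall back to the old behavior even though ANSI is enabled.\n", |
| ">>> ps.set_option(\"compute.ansi_mode_support\", False)\n", |
| ">>> ps.set_option(\"compute.fail_on_ansi_mode\", False)\n", |
| "\n", |
| "# Disable ANSI mode entirely; the two options above then have no effect.\n", |
| ">>> spark.conf.set(\"spark.sql.ansi.enabled\", False)\n", |
| "```" |
| ] |
| } |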
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": "Python 3 (ipykernel)", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.11.13" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 5 |
| } |