| --- |
| layout: global |
| title: Reference |
| displayTitle: Reference |
| --- |
| |
| * Table of contents |
| {:toc} |
| |
| ## Data Types |
| |
| Spark SQL and DataFrames support the following data types: |
| |
| * Numeric types |
| - `ByteType`: Represents 1-byte signed integer numbers. |
| The range of numbers is from `-128` to `127`. |
| - `ShortType`: Represents 2-byte signed integer numbers. |
| The range of numbers is from `-32768` to `32767`. |
| - `IntegerType`: Represents 4-byte signed integer numbers. |
| The range of numbers is from `-2147483648` to `2147483647`. |
| - `LongType`: Represents 8-byte signed integer numbers. |
| The range of numbers is from `-9223372036854775808` to `9223372036854775807`. |
| - `FloatType`: Represents 4-byte single-precision floating point numbers. |
| - `DoubleType`: Represents 8-byte double-precision floating point numbers. |
| - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale. |
| * String type |
| - `StringType`: Represents character string values. |
| * Binary type |
| - `BinaryType`: Represents byte sequence values. |
| * Boolean type |
| - `BooleanType`: Represents boolean values. |
| * Datetime type |
| - `TimestampType`: Represents values comprising values of fields year, month, day, |
| hour, minute, and second. |
| - `DateType`: Represents values comprising values of fields year, month, day. |
| * Complex types |
| - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of |
| elements with the type of `elementType`. `containsNull` is used to indicate if |
| elements in a `ArrayType` value can have `null` values. |
| - `MapType(keyType, valueType, valueContainsNull)`: |
| Represents values comprising a set of key-value pairs. The data type of keys are |
| described by `keyType` and the data type of values are described by `valueType`. |
| For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull` |
| is used to indicate if values of a `MapType` value can have `null` values. |
| - `StructType(fields)`: Represents values with the structure described by |
| a sequence of `StructField`s (`fields`). |
| * `StructField(name, dataType, nullable)`: Represents a field in a `StructType`. |
| The name of a field is indicated by `name`. The data type of a field is indicated |
| by `dataType`. `nullable` is used to indicate if values of this fields can have |
| `null` values. |
| |
| <div class="codetabs"> |
| <div data-lang="scala" markdown="1"> |
| |
| All data types of Spark SQL are located in the package `org.apache.spark.sql.types`. |
| You can access them by doing |
| |
| {% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} |
| |
| <table class="table"> |
| <tr> |
| <th style="width:20%">Data type</th> |
| <th style="width:40%">Value type in Scala</th> |
| <th>API to access or create a data type</th></tr> |
| <tr> |
| <td> <b>ByteType</b> </td> |
| <td> Byte </td> |
| <td> |
| ByteType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ShortType</b> </td> |
| <td> Short </td> |
| <td> |
| ShortType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>IntegerType</b> </td> |
| <td> Int </td> |
| <td> |
| IntegerType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>LongType</b> </td> |
| <td> Long </td> |
| <td> |
| LongType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>FloatType</b> </td> |
| <td> Float </td> |
| <td> |
| FloatType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DoubleType</b> </td> |
| <td> Double </td> |
| <td> |
| DoubleType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DecimalType</b> </td> |
| <td> java.math.BigDecimal </td> |
| <td> |
| DecimalType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StringType</b> </td> |
| <td> String </td> |
| <td> |
| StringType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BinaryType</b> </td> |
| <td> Array[Byte] </td> |
| <td> |
| BinaryType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BooleanType</b> </td> |
| <td> Boolean </td> |
| <td> |
| BooleanType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>TimestampType</b> </td> |
| <td> java.sql.Timestamp </td> |
| <td> |
| TimestampType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DateType</b> </td> |
| <td> java.sql.Date </td> |
| <td> |
| DateType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ArrayType</b> </td> |
| <td> scala.collection.Seq </td> |
| <td> |
| ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>MapType</b> </td> |
| <td> scala.collection.Map </td> |
| <td> |
| MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructType</b> </td> |
| <td> org.apache.spark.sql.Row </td> |
| <td> |
| StructType(<i>fields</i>)<br /> |
| <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same |
| name are not allowed. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructField</b> </td> |
| <td> The value type in Scala of the data type of this field |
| (For example, Int for a StructField with the data type IntegerType) </td> |
| <td> |
| StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br /> |
| <b>Note:</b> The default value of <i>nullable</i> is <i>true</i>. |
| </td> |
| </tr> |
| </table> |
| |
| </div> |
| |
| <div data-lang="java" markdown="1"> |
| |
| All data types of Spark SQL are located in the package of |
| `org.apache.spark.sql.types`. To access or create a data type, |
| please use factory methods provided in |
| `org.apache.spark.sql.types.DataTypes`. |
| |
| <table class="table"> |
| <tr> |
| <th style="width:20%">Data type</th> |
| <th style="width:40%">Value type in Java</th> |
| <th>API to access or create a data type</th></tr> |
| <tr> |
| <td> <b>ByteType</b> </td> |
| <td> byte or Byte </td> |
| <td> |
| DataTypes.ByteType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ShortType</b> </td> |
| <td> short or Short </td> |
| <td> |
| DataTypes.ShortType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>IntegerType</b> </td> |
| <td> int or Integer </td> |
| <td> |
| DataTypes.IntegerType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>LongType</b> </td> |
| <td> long or Long </td> |
| <td> |
| DataTypes.LongType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>FloatType</b> </td> |
| <td> float or Float </td> |
| <td> |
| DataTypes.FloatType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DoubleType</b> </td> |
| <td> double or Double </td> |
| <td> |
| DataTypes.DoubleType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DecimalType</b> </td> |
| <td> java.math.BigDecimal </td> |
| <td> |
| DataTypes.createDecimalType()<br /> |
| DataTypes.createDecimalType(<i>precision</i>, <i>scale</i>). |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StringType</b> </td> |
| <td> String </td> |
| <td> |
| DataTypes.StringType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BinaryType</b> </td> |
| <td> byte[] </td> |
| <td> |
| DataTypes.BinaryType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BooleanType</b> </td> |
| <td> boolean or Boolean </td> |
| <td> |
| DataTypes.BooleanType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>TimestampType</b> </td> |
| <td> java.sql.Timestamp </td> |
| <td> |
| DataTypes.TimestampType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DateType</b> </td> |
| <td> java.sql.Date </td> |
| <td> |
| DataTypes.DateType |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ArrayType</b> </td> |
| <td> java.util.List </td> |
| <td> |
| DataTypes.createArrayType(<i>elementType</i>)<br /> |
| <b>Note:</b> The value of <i>containsNull</i> will be <i>true</i><br /> |
| DataTypes.createArrayType(<i>elementType</i>, <i>containsNull</i>). |
| </td> |
| </tr> |
| <tr> |
| <td> <b>MapType</b> </td> |
| <td> java.util.Map </td> |
| <td> |
| DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>)<br /> |
| <b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br /> |
| DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)<br /> |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructType</b> </td> |
| <td> org.apache.spark.sql.Row </td> |
| <td> |
| DataTypes.createStructType(<i>fields</i>)<br /> |
| <b>Note:</b> <i>fields</i> is a List or an array of StructFields. |
| Also, two fields with the same name are not allowed. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructField</b> </td> |
| <td> The value type in Java of the data type of this field |
| (For example, int for a StructField with the data type IntegerType) </td> |
| <td> |
| DataTypes.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>) |
| </td> |
| </tr> |
| </table> |
| |
| </div> |
| |
| <div data-lang="python" markdown="1"> |
| |
| All data types of Spark SQL are located in the package of `pyspark.sql.types`. |
| You can access them by doing |
| {% highlight python %} |
| from pyspark.sql.types import * |
| {% endhighlight %} |
| |
| <table class="table"> |
| <tr> |
| <th style="width:20%">Data type</th> |
| <th style="width:40%">Value type in Python</th> |
| <th>API to access or create a data type</th></tr> |
| <tr> |
| <td> <b>ByteType</b> </td> |
| <td> |
| int or long <br /> |
| <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of -128 to 127. |
| </td> |
| <td> |
| ByteType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ShortType</b> </td> |
| <td> |
| int or long <br /> |
| <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of -32768 to 32767. |
| </td> |
| <td> |
| ShortType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>IntegerType</b> </td> |
| <td> int or long </td> |
| <td> |
| IntegerType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>LongType</b> </td> |
| <td> |
| long <br /> |
| <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of |
| -9223372036854775808 to 9223372036854775807. |
| Otherwise, please convert data to decimal.Decimal and use DecimalType. |
| </td> |
| <td> |
| LongType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>FloatType</b> </td> |
| <td> |
| float <br /> |
| <b>Note:</b> Numbers will be converted to 4-byte single-precision floating |
| point numbers at runtime. |
| </td> |
| <td> |
| FloatType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DoubleType</b> </td> |
| <td> float </td> |
| <td> |
| DoubleType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DecimalType</b> </td> |
| <td> decimal.Decimal </td> |
| <td> |
| DecimalType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StringType</b> </td> |
| <td> string </td> |
| <td> |
| StringType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BinaryType</b> </td> |
| <td> bytearray </td> |
| <td> |
| BinaryType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BooleanType</b> </td> |
| <td> bool </td> |
| <td> |
| BooleanType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>TimestampType</b> </td> |
| <td> datetime.datetime </td> |
| <td> |
| TimestampType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DateType</b> </td> |
| <td> datetime.date </td> |
| <td> |
| DateType() |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ArrayType</b> </td> |
| <td> list, tuple, or array </td> |
| <td> |
| ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>MapType</b> </td> |
| <td> dict </td> |
| <td> |
| MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructType</b> </td> |
| <td> list or tuple </td> |
| <td> |
| StructType(<i>fields</i>)<br /> |
| <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same |
| name are not allowed. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructField</b> </td> |
| <td> The value type in Python of the data type of this field |
| (For example, Int for a StructField with the data type IntegerType) </td> |
| <td> |
| StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br /> |
| <b>Note:</b> The default value of <i>nullable</i> is <i>True</i>. |
| </td> |
| </tr> |
| </table> |
| |
| </div> |
| |
| <div data-lang="r" markdown="1"> |
| |
| <table class="table"> |
| <tr> |
| <th style="width:20%">Data type</th> |
| <th style="width:40%">Value type in R</th> |
| <th>API to access or create a data type</th></tr> |
| <tr> |
| <td> <b>ByteType</b> </td> |
| <td> |
| integer <br /> |
| <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of -128 to 127. |
| </td> |
| <td> |
| "byte" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ShortType</b> </td> |
| <td> |
| integer <br /> |
| <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of -32768 to 32767. |
| </td> |
| <td> |
| "short" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>IntegerType</b> </td> |
| <td> integer </td> |
| <td> |
| "integer" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>LongType</b> </td> |
| <td> |
| integer <br /> |
| <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime. |
| Please make sure that numbers are within the range of |
| -9223372036854775808 to 9223372036854775807. |
| Otherwise, please convert data to decimal.Decimal and use DecimalType. |
| </td> |
| <td> |
| "long" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>FloatType</b> </td> |
| <td> |
| numeric <br /> |
| <b>Note:</b> Numbers will be converted to 4-byte single-precision floating |
| point numbers at runtime. |
| </td> |
| <td> |
| "float" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DoubleType</b> </td> |
| <td> numeric </td> |
| <td> |
| "double" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DecimalType</b> </td> |
| <td> Not supported </td> |
| <td> |
| Not supported |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StringType</b> </td> |
| <td> character </td> |
| <td> |
| "string" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BinaryType</b> </td> |
| <td> raw </td> |
| <td> |
| "binary" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>BooleanType</b> </td> |
| <td> logical </td> |
| <td> |
| "bool" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>TimestampType</b> </td> |
| <td> POSIXct </td> |
| <td> |
| "timestamp" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>DateType</b> </td> |
| <td> Date </td> |
| <td> |
| "date" |
| </td> |
| </tr> |
| <tr> |
| <td> <b>ArrayType</b> </td> |
| <td> vector or list </td> |
| <td> |
| list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>containsNull</i> is <i>TRUE</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>MapType</b> </td> |
| <td> environment </td> |
| <td> |
| list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br /> |
| <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>TRUE</i>. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructType</b> </td> |
| <td> named list</td> |
| <td> |
| list(type="struct", fields=<i>fields</i>)<br /> |
| <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same |
| name are not allowed. |
| </td> |
| </tr> |
| <tr> |
| <td> <b>StructField</b> </td> |
| <td> The value type in R of the data type of this field |
| (For example, integer for a StructField with the data type IntegerType) </td> |
| <td> |
| list(name=<i>name</i>, type=<i>dataType</i>, nullable=[<i>nullable</i>])<br /> |
| <b>Note:</b> The default value of <i>nullable</i> is <i>TRUE</i>. |
| </td> |
| </tr> |
| </table> |
| |
| </div> |
| |
| </div> |
| |
| ## NaN Semantics |
| |
| There is specially handling for not-a-number (NaN) when dealing with `float` or `double` types that |
| does not exactly match standard floating point semantics. |
| Specifically: |
| |
| - NaN = NaN returns true. |
| - In aggregations, all NaN values are grouped together. |
| - NaN is treated as a normal value in join keys. |
| - NaN values go last when in ascending order, larger than any other numeric value. |
| |
| ## Arithmetic operations |
| |
| Operations performed on numeric types (with the exception of `decimal`) are not checked for overflow. |
| This means that in case an operation causes an overflow, the result is the same that the same operation |
| returns in a Java/Scala program (eg. if the sum of 2 integers is higher than the maximum value representable, |
| the result is a negative number). |