Quick Start

Preparation

Paimon supports the following Spark versions with their respective Java and Scala compatibility. We recommend using the latest Spark version for a better experience.

Spark 4.x (including 4.0): Pre-built with Java 17 and Scala 2.13
Spark 3.x (including 3.5, 3.4, 3.3, 3.2): Pre-built with Java 8 and Scala 2.12/2.13

Download the jar file for the corresponding version.

Version	Jar (Scala 2.12)	Jar (Scala 2.13)
Spark 4.0	-	[paimon-spark-4.0_2.13-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-4.0_2.13/{{< version >}}/paimon-spark-4.0_2.13-{{< version >}}.jar)
Spark 3.5	[paimon-spark-3.5_2.12-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.5_2.12/{{< version >}}/paimon-spark-3.5_2.12-{{< version >}}.jar)	[paimon-spark-3.5_2.13-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.5_2.13/{{< version >}}/paimon-spark-3.5_2.13-{{< version >}}.jar)
Spark 3.4	[paimon-spark-3.4_2.12-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.4_2.12/{{< version >}}/paimon-spark-3.4_2.12-{{< version >}}.jar)	[paimon-spark-3.4_2.13-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.4_2.13/{{< version >}}/paimon-spark-3.4_2.13-{{< version >}}.jar)
Spark 3.3	[paimon-spark-3.3_2.12-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.3_2.12/{{< version >}}/paimon-spark-3.3_2.12-{{< version >}}.jar)	[paimon-spark-3.3_2.13-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.3_2.13/{{< version >}}/paimon-spark-3.3_2.13-{{< version >}}.jar)
Spark 3.2	[paimon-spark-3.2_2.12-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.2_2.12/{{< version >}}/paimon-spark-3.2_2.12-{{< version >}}.jar)	[paimon-spark-3.2_2.13-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.2_2.13/{{< version >}}/paimon-spark-3.2_2.13-{{< version >}}.jar)

Snapshot versions are available from the Apache snapshot repository:

Version	Jar (Scala 2.12)	Jar (Scala 2.13)
Spark 4.0	-	[paimon-spark-4.0_2.13-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-4.0_2.13/{{< version >}}/)
Spark 3.5	[paimon-spark-3.5_2.12-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.5_2.12/{{< version >}}/)	[paimon-spark-3.5_2.13-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.5_2.13/{{< version >}}/)
Spark 3.4	[paimon-spark-3.4_2.12-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.4_2.12/{{< version >}}/)	[paimon-spark-3.4_2.13-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.4_2.13/{{< version >}}/)
Spark 3.3	[paimon-spark-3.3_2.12-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.3_2.12/{{< version >}}/)	[paimon-spark-3.3_2.13-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.3_2.13/{{< version >}}/)
Spark 3.2	[paimon-spark-3.2_2.12-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.2_2.12/{{< version >}}/)	[paimon-spark-3.2_2.13-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.2_2.13/{{< version >}}/)
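The release links in the first table follow a regular pattern, so the download URL can be derived from the Spark version, the Scala version, and the Paimon version. The following is a minimal sketch under that assumption; the function name paimon_spark_jar_url is hypothetical, and the Paimon version must be supplied by you.

```shell
# Print the Maven Central URL of a released Paimon Spark bundled jar.
# Arguments: Spark version (e.g. 3.5), Scala version (2.12 or 2.13), Paimon version.
# The URL layout is copied from the release table above; the function itself is illustrative.
paimon_spark_jar_url() {
    spark="$1"; scala="$2"; version="$3"
    case "$spark" in
        3.2|3.3|3.4|3.5|4.0) ;;
        *) echo "unsupported Spark version: $spark" >&2; return 1 ;;
    esac
    case "$scala" in
        2.12|2.13) ;;
        *) echo "unsupported Scala version: $scala" >&2; return 1 ;;
    esac
    # The table lists no Scala 2.12 jar for Spark 4.0.
    if [ "$spark" = "4.0" ] && [ "$scala" != "2.13" ]; then
        echo "Spark 4.0 is only available with Scala 2.13" >&2
        return 1
    fi
    artifact="paimon-spark-${spark}_${scala}"
    echo "https://repo.maven.apache.org/maven2/org/apache/paimon/${artifact}/${version}/${artifact}-${version}.jar"
}
```

For example, `curl -fLO "$(paimon_spark_jar_url 3.5 2.12 <paimon-version>)"` would download the Spark 3.5 / Scala 2.12 jar, with `<paimon-version>` replaced by a released version.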

You can also build the bundled jar manually from the source code.

To build from source code, [clone the git repository]({{< github_repo >}}), then build the bundled jar with the following command.

# build paimon spark 3.5 with scala 2.12
mvn clean package -DskipTests -pl paimon-spark/paimon-spark-3.5 -am

# build paimon spark 3.5 with scala 2.13
mvn clean package -DskipTests -pl paimon-spark/paimon-spark-3.5 -am -Pscala-2.13

# build paimon spark 4.0
mvn clean package -DskipTests -pl paimon-spark/paimon-spark-4.0 -am -Pspark4

For Spark 3.5, you can find the bundled jar in ./paimon-spark/paimon-spark-3.5/target/paimon-spark-3.5_2.12-{{< version >}}.jar.
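The three build commands above differ only in the module name and the Maven profile. As a rough sketch, they can be generated by a small helper; the function name paimon_build_cmd is hypothetical, and it deliberately covers only the combinations shown above rather than guessing the flags for other Spark versions.

```shell
# Print the Maven command for one of the documented Spark/Scala combinations.
# Arguments: Spark version (3.5 or 4.0), Scala version (2.12 or 2.13).
paimon_build_cmd() {
    spark="$1"; scala="${2:-}"
    base="mvn clean package -DskipTests -pl paimon-spark/paimon-spark-${spark} -am"
    case "${spark}/${scala}" in
        3.5/2.12) echo "$base" ;;                  # default build
        3.5/2.13) echo "$base -Pscala-2.13" ;;     # Scala 2.13 profile
        4.0/2.13) echo "$base -Pspark4" ;;         # Spark 4 profile
        *)
            echo "combination not covered here: Spark ${spark}, Scala ${scala}" >&2
            return 1
            ;;
    esac
}
```

Run the printed command from the root of the cloned repository, for example with `eval "$(paimon_build_cmd 3.5 2.13)"`.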

Setup

If you are using HDFS, make sure that the environment variable HADOOP_HOME or HADOOP_CONF_DIR is set.
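A quick pre-flight check can catch a missing Hadoop environment before spark-sql is started. This is only a sketch: the function name hadoop_env_ready is hypothetical, and it checks that one of the two variables is non-empty, not that it points to a valid installation.

```shell
# Succeed if at least one of HADOOP_HOME / HADOOP_CONF_DIR is set and non-empty.
hadoop_env_ready() {
    [ -n "${HADOOP_HOME:-}" ] || [ -n "${HADOOP_CONF_DIR:-}" ]
}

# Warn (without aborting) when neither variable is available.
hadoop_env_ready || echo "Set HADOOP_HOME or HADOOP_CONF_DIR before using HDFS" >&2
```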

Step 1: Specify Paimon Jar File

Append the path to the Paimon jar file to the --jars argument when starting spark-sql.

spark-sql ... --jars /path/to/paimon-spark-3.5_2.12-{{< version >}}.jar

Or use the --packages option.

spark-sql ... --packages org.apache.paimon:paimon-spark-3.5_2.12:{{< version >}}

Alternatively, you can copy paimon-spark-3.5_2.12-{{< version >}}.jar under spark/jars in your Spark installation directory.

Step 2: Specify Paimon Catalog

When starting spark-sql, use the following command to register Paimon’s Spark catalog with the name paimon. The table files of the warehouse are stored under /tmp/paimon.

spark-sql ... \
    --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
    --conf spark.sql.catalog.paimon.warehouse=file:/tmp/paimon \
    --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions

Catalogs are configured using properties under spark.sql.catalog.(catalog_name). In the above case, 'paimon' is the catalog name; you can change it to any catalog name you prefer.
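To illustrate how the catalog name enters the property keys, the following sketch prints the --conf arguments for an arbitrary catalog name and warehouse path. The function name paimon_catalog_confs is hypothetical; the property keys and class names are the ones used in the command above.

```shell
# Print the spark-sql --conf arguments that register a Paimon catalog.
# Arguments: catalog name, warehouse path. One argument is printed per line.
paimon_catalog_confs() {
    name="$1"; warehouse="$2"
    printf '%s\n' \
        "--conf" "spark.sql.catalog.${name}=org.apache.paimon.spark.SparkCatalog" \
        "--conf" "spark.sql.catalog.${name}.warehouse=${warehouse}" \
        "--conf" "spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions"
}
```

With a catalog named my_catalog, the call would be `spark-sql ... $(paimon_catalog_confs my_catalog file:/tmp/paimon)`, followed by `USE my_catalog;` in the SQL shell. The unquoted substitution relies on word splitting, so this only works when neither the name nor the path contains whitespace.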

After the spark-sql command line has started, run the following SQL to switch to the paimon catalog and its default database.

USE paimon;
USE default;

After switching to the catalog ('USE paimon'), Spark's existing tables are no longer directly accessible. Use spark_catalog.${database_name}.${table_name} to access Spark tables.

When starting spark-sql, use the following command to register Paimon’s Spark generic catalog in place of Spark's default catalog, spark_catalog. The default warehouse is Spark's spark.sql.warehouse.dir.

Currently, SparkGenericCatalog is only recommended when using a Hive metastore. Paimon infers the Hive conf from the Spark session, so you only need to configure Spark's Hive conf.

spark-sql ... \
    --conf spark.sql.catalog.spark_catalog=org.apache.paimon.spark.SparkGenericCatalog \
    --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions

With SparkGenericCatalog, you can use both Paimon tables and non-Paimon tables, such as Spark's CSV, Parquet, and Hive tables, in the same catalog.

Create Table

-- with the Paimon catalog (SparkCatalog)
create table my_table (
    k int,
    v string
) tblproperties (
    'primary-key' = 'k'
);

-- with SparkGenericCatalog, specify USING paimon
create table my_table (
    k int,
    v string
) USING paimon
tblproperties (
    'primary-key' = 'k'
);

Insert Table

INSERT INTO my_table VALUES (1, 'Hi'), (2, 'Hello');

// with the Scala DataFrame API, you can use
Seq((1, "Hi"), (2, "Hello")).toDF("k", "v")
  .write.format("paimon").mode("append").saveAsTable("my_table")

// or
Seq((1, "Hi"), (2, "Hello")).toDF("k", "v")
  .write.format("paimon").mode("append").save("file:/tmp/paimon/default.db/my_table")

Query Table

SELECT * FROM my_table;

/*
1	Hi
2	Hello
*/

// with the Scala DataFrame API, you can use
spark.read.format("paimon").table("my_table").show()

// or
spark.read.format("paimon").load("file:/tmp/paimon/default.db/my_table").show()

/*
+---+------+
|  k|     v|
+---+------+
|  1|    Hi|
|  2| Hello|
+---+------+
*/

Spark Type Conversion

This section lists all supported type conversions between Spark and Paimon. All of Spark's data types are available in the package org.apache.spark.sql.types.

{{< hint warning >}} Due to an earlier design decision, in Spark 3.3 and below Paimon maps both Paimon's TimestampType and LocalZonedTimestamp to Spark's TimestampType, and only handles TimestampType correctly.

Therefore, when Spark 3.3 or below reads a Paimon table with a LocalZonedTimestamp column written by another engine, such as Flink, the query result for that column carries a time zone offset, which must be adjusted manually.

With Spark 3.4 and above, all timestamp types are parsed correctly. {{< /hint >}}