Apache Spark Connect Client for Swift

Apache Spark™ Connect for Swift is a subproject of Apache Spark. It aims to provide a modern Swift library that lets Swift developers use Apache Spark for distributed data processing, machine learning, and analytical workloads directly from their Swift applications. For example, a user can develop and ship a lightweight Swift-based SparkPi app.

Docker Image Size

Name                                      Image Size
apache/spark:4.1.1-python3-based SparkPi  [Docker Image Size badge]
pyspark-connect-based SparkPi             [Docker Image Size badge]
Swift-based SparkPi                       [Docker Image Size badge]

Resources

Requirement

So far, this library project tracks the upstream changes to the Apache Arrow project's Swift support.

How to use in your apps

Create a Swift project.

mkdir SparkConnectSwiftApp
cd SparkConnectSwiftApp
swift package init --name SparkConnectSwiftApp --type executable

Add the SparkConnect package as a dependency, as in the following Package.swift.

$ cat Package.swift
// swift-tools-version: 6.3
import PackageDescription

let package = Package(
  name: "SparkConnectSwiftApp",
  platforms: [
    .macOS(.v15)
  ],
  dependencies: [
    .package(url: "https://github.com/apache/spark-connect-swift.git", branch: "main")
  ],
  targets: [
    .executableTarget(
      name: "SparkConnectSwiftApp",
      dependencies: [.product(name: "SparkConnect", package: "spark-connect-swift")]
    )
  ]
)

Use SparkSession from the SparkConnect module in your Swift code.

$ cat Sources/main.swift

import SparkConnect

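// Connect to the Spark Connect server and report its version.
let spark = try await SparkSession.builder.getOrCreate()
print("Connected to Apache Spark \(await spark.version) Server")

// SQL statements that recreate and populate table `t`.
let statements = [
  "DROP TABLE IF EXISTS t",
  "CREATE TABLE IF NOT EXISTS t(a INT) USING ORC",
  "INSERT INTO t VALUES (1), (2), (3)",
]

// Run each statement; count() triggers execution and its result is discarded.
for s in statements {
  print("EXECUTE: \(s)")
  _ = try await spark.sql(s).count()
}
print("SELECT * FROM t")
try await spark.sql("SELECT * FROM t").cache().show()

// Write the even ids from range(10) as ORC, then read them back.
try await spark.range(10).filter("id % 2 == 0").write.mode("overwrite").orc("/tmp/orc")
try await spark.read.orc("/tmp/orc").show()

// Stop the session.
await spark.stop()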

Run your Swift application.

$ swift run
...
Connected to Apache Spark 4.1.1 Server
EXECUTE: DROP TABLE IF EXISTS t
EXECUTE: CREATE TABLE IF NOT EXISTS t(a INT) USING ORC
EXECUTE: INSERT INTO t VALUES (1), (2), (3)
SELECT * FROM t
+---+
|  a|
+---+
|  1|
|  3|
|  2|
+---+

+---+
| id|
+---+
|  6|
|  8|
|  4|
|  2|
|  0|
+---+

You can find more complete examples, including a Spark SQL REPL, a web server, and streaming applications, in the Examples directory.

This library also supports the SPARK_REMOTE environment variable, which specifies the Spark Connect connection string and lets you pass additional connection options.
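
To illustrate what a Spark Connect connection string carries, the sketch below reads SPARK_REMOTE and splits it into a host, a port, and `;`-separated parameters. It assumes the `sc://host:port/;key=value` form used by Spark Connect clients and the conventional default port 15002. The fallback value used when the variable is unset is also an assumption. `RemoteEndpoint`, `RemoteParseError`, and `parseRemote` are hypothetical names for this example only. They are not part of the SparkConnect module, which handles the connection string itself.

```swift
import Foundation

// Hypothetical value type for this sketch; not part of SparkConnect.
struct RemoteEndpoint: Equatable {
  var host: String
  var port: Int
  var parameters: [String: String]
}

// Hypothetical error type for this sketch.
enum RemoteParseError: Error {
  case invalidScheme
  case missingHost
  case invalidPort(String)
}

// Splits "sc://host[:port][/;key=value;...]" into its parts.
// Simplification: bracketed IPv6 hosts are not handled.
func parseRemote(_ remote: String) throws -> RemoteEndpoint {
  let scheme = "sc://"
  guard remote.hasPrefix(scheme) else { throw RemoteParseError.invalidScheme }
  let rest = remote.dropFirst(scheme.count)

  // Separate the "host:port" authority from the optional parameter tail.
  let authority: Substring
  let tail: Substring
  if let slash = rest.firstIndex(of: "/") {
    authority = rest[..<slash]
    tail = rest[rest.index(after: slash)...]
  } else {
    authority = rest
    tail = ""
  }

  let hostAndPort = authority.split(
    separator: ":", maxSplits: 1, omittingEmptySubsequences: false)
  guard let hostPart = hostAndPort.first, !hostPart.isEmpty else {
    throw RemoteParseError.missingHost
  }

  // Assumed default: 15002, the conventional Spark Connect port.
  var port = 15002
  if hostAndPort.count == 2 {
    guard let parsed = Int(hostAndPort[1]) else {
      throw RemoteParseError.invalidPort(String(hostAndPort[1]))
    }
    port = parsed
  }

  // Collect "key=value" pairs; entries without "=" are ignored.
  var parameters: [String: String] = [:]
  for pair in tail.split(separator: ";") {
    let keyValue = pair.split(
      separator: "=", maxSplits: 1, omittingEmptySubsequences: false)
    if keyValue.count == 2 {
      parameters[String(keyValue[0])] = String(keyValue[1])
    }
  }
  return RemoteEndpoint(host: String(hostPart), port: port, parameters: parameters)
}

// Read SPARK_REMOTE; the fallback string is an assumption for this sketch.
let remote = ProcessInfo.processInfo.environment["SPARK_REMOTE"] ?? "sc://localhost:15002"
let endpoint = try parseRemote(remote)
print("host=\(endpoint.host) port=\(endpoint.port) parameters=\(endpoint.parameters)")
```

With a value such as `sc://spark.example.com:15003/;use_ssl=true;token=abc` (a made-up host), this sketch yields host `spark.example.com`, port 15003, and the two parameters `use_ssl` and `token`. In a real application, exporting SPARK_REMOTE before `swift run` is enough; the parsing above only shows how the string decomposes.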