[SPARK-52183] Update `SparkSQLRepl` example to show up to 10k rows

### What changes were proposed in this pull request?

This PR aims to update `SparkSQLRepl` example to show up to 10k rows.

### Why are the changes needed?

Currently, `SparkSQLRepl` uses `show()` with the default parameters. Although we cannot handle a large number of rows due to `grpc_max_message_size`, we had better use a more reasonable default value.

```SQL
spark-sql (default)> SELECT * FROM RANGE(21);
+---+
| id|
+---+
|  0|
|...|
| 19|
+---+
only showing top 20 rows
Time taken: 118 ms
```
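The fix amounts to passing an explicit row count to `show()`. A minimal sketch, assuming the Swift client's `show` accepts a leading row-count argument like its Scala and Python counterparts (the exact signature here is an assumption, not taken from this PR's diff):

```swift
// Hypothetical sketch: request up to 10,000 rows instead of the default 20.
// Assumes show(numRows) mirrors the Scala/Python DataFrame.show API.
try await spark.sql("SELECT * FROM RANGE(10001)").show(10000)
```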

### Does this PR introduce _any_ user-facing change?

No. This only changes an example.

### How was this patch tested?

Manual test.

```SQL
$ swift run
spark-sql (default)> SELECT * FROM RANGE(10001);
...
only showing top 10000 rows
Time taken: 142 ms
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #159 from dongjoon-hyun/SPARK-52183.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

# Apache Spark Connect Client for Swift


This is an experimental Swift library to show how to connect to a remote Apache Spark Connect Server and run SQL statements to manipulate remote data.

So far, this library project tracks upstream changes such as the Apache Spark 4.0.0 RC6 release and the Apache Arrow project's Swift support.

## Resources

## Requirement

## How to use in your apps

Create a Swift project.

```bash
mkdir SparkConnectSwiftApp
cd SparkConnectSwiftApp
swift package init --name SparkConnectSwiftApp --type executable
```

Add the `SparkConnect` package to the dependencies like the following:

```swift
// Package.swift
import PackageDescription

let package = Package(
  name: "SparkConnectSwiftApp",
  platforms: [
    .macOS(.v15)
  ],
  dependencies: [
    .package(url: "https://github.com/apache/spark-connect-swift.git", branch: "main")
  ],
  targets: [
    .executableTarget(
      name: "SparkConnectSwiftApp",
      dependencies: [.product(name: "SparkConnect", package: "spark-connect-swift")]
    )
  ]
)
```
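The `branch: "main"` dependency always builds the latest commit. For reproducible builds, SwiftPM also lets you pin the package to a fixed revision instead; a sketch, where the commit hash is a placeholder you would replace with a real one:

```swift
// Alternative dependency entry pinned to a specific commit (placeholder hash).
dependencies: [
  .package(url: "https://github.com/apache/spark-connect-swift.git",
           revision: "REPLACE_WITH_COMMIT_HASH")
]
```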

Use `SparkSession` of the `SparkConnect` module in Swift.

```swift
// Sources/main.swift
import SparkConnect

let spark = try await SparkSession.builder.getOrCreate()
print("Connected to Apache Spark \(await spark.version) Server")

let statements = [
  "DROP TABLE IF EXISTS t",
  "CREATE TABLE IF NOT EXISTS t(a INT) USING ORC",
  "INSERT INTO t VALUES (1), (2), (3)",
]

for s in statements {
  print("EXECUTE: \(s)")
  _ = try await spark.sql(s).count()
}
print("SELECT * FROM t")
try await spark.sql("SELECT * FROM t").cache().show()

try await spark.range(10).filter("id % 2 == 0").write.mode("overwrite").orc("/tmp/orc")
try await spark.read.orc("/tmp/orc").show()

await spark.stop()
```

Run your Swift application.

```bash
$ swift run
...
Connected to Apache Spark 4.0.0 Server
EXECUTE: DROP TABLE IF EXISTS t
EXECUTE: CREATE TABLE IF NOT EXISTS t(a INT) USING ORC
EXECUTE: INSERT INTO t VALUES (1), (2), (3)
SELECT * FROM t
+---+
| a |
+---+
| 2 |
| 1 |
| 3 |
+---+
+----+
| id |
+----+
| 2  |
| 6  |
| 0  |
| 8  |
| 4  |
+----+
```
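Building on the example above, the ORC files written by `write...orc` can be read back and filtered on the read side as well; a small variation using only the APIs already shown (the path and predicate are illustrative):

```swift
// Read the ORC output back and keep only a subset of rows.
try await spark.read.orc("/tmp/orc").filter("id > 2").show()
```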

You can find more complete examples including Web Server and Streaming applications in the Examples directory.

## How to use Spark SQL REPL via Spark Connect for Swift

This project also provides a Spark SQL REPL. You can run it directly from this repository.

```bash
$ swift run
...
Build of product 'SparkSQLRepl' complete! (2.33s)
Connected to Apache Spark 4.0.0 Server
spark-sql (default)> SHOW DATABASES;
+---------+
|namespace|
+---------+
|  default|
+---------+

Time taken: 30 ms
spark-sql (default)> CREATE DATABASE db1;
++
||
++
++

Time taken: 31 ms
spark-sql (default)> USE db1;
++
||
++
++

Time taken: 27 ms
spark-sql (db1)> CREATE TABLE t1 AS SELECT * FROM RANGE(10);
++
||
++
++

Time taken: 99 ms
spark-sql (db1)> SELECT * FROM t1;
+---+
| id|
+---+
|  1|
|  5|
|  3|
|  0|
|  6|
|  9|
|  4|
|  8|
|  7|
|  2|
+---+

Time taken: 80 ms
spark-sql (db1)> USE default;
++
||
++
++

Time taken: 26 ms
spark-sql (default)> DROP DATABASE db1 CASCADE;
++
||
++
++
spark-sql (default)> exit;
```

Apache Spark 4 supports SQL Pipe Syntax.

```bash
$ swift run
...
Build of product 'SparkSQLRepl' complete! (2.33s)
Connected to Apache Spark 4.0.0 Server
spark-sql (default)>
FROM ORC.`/opt/spark/examples/src/main/resources/users.orc`
|> AGGREGATE COUNT(*) cnt
   GROUP BY name
|> ORDER BY cnt DESC, name ASC
;
+------+---+
|  name|cnt|
+------+---+
|Alyssa|  1|
|   Ben|  1|
+------+---+

Time taken: 159 ms
```
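For comparison, the pipe query above can also be written in standard SQL; the two forms produce the same result:

```SQL
SELECT name, COUNT(*) cnt
FROM ORC.`/opt/spark/examples/src/main/resources/users.orc`
GROUP BY name
ORDER BY cnt DESC, name ASC;
```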

You can use the `SPARK_REMOTE` environment variable to specify the Spark Connect connection string in order to provide more options.

```bash
SPARK_REMOTE=sc://localhost swift run
```
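The connection string follows the Spark Connect client format, `sc://host:port/;param=value;...`. A few example forms (the host name and token value below are placeholders):

```bash
# Explicit host and the default Spark Connect port
SPARK_REMOTE=sc://spark-server:15002 swift run

# TLS plus a bearer token (placeholder value)
SPARK_REMOTE="sc://spark-server:443/;use_ssl=true;token=REPLACE_ME" swift run
```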