| commit | b6b089978defbf68d688273db525930a36fd0cca |
|---|---|
| author | Martin Grund <martin.grund@databricks.com>, Thu Oct 03 16:19:02 2024 -0700 |
| committer | GitHub <noreply@github.com>, Fri Oct 04 08:19:02 2024 +0900 |
| tree | b274e252a62567c82f967de27bdea62810ec3e1a |
| parent | 8432dd0ed2849aa4add905962b139470e7806610 |
[#73] Replace usage of map[string]Convertible in withColumns with a sequential alternative

### What changes were proposed in this pull request?

The Go map type does not provide insertion-order-stable iteration. This means that when a user calls:

```golang
df, err = df.WithColumns(ctx, map[string]column.Convertible{
	"newCol1": functions.Lit(1),
	"newCol2": functions.Lit(2),
})
```

there is no guarantee about the order in which the columns are added. In PySpark, however, the order is preserved. For that reason, we need a different way of adding multiple columns that preserves the order.

This patch changes the interface of the `withColumns` function so that the argument is a new type called `column.Alias`, which implements the `column.Convertible` interface. This allows for the following method signature:

```golang
WithColumns(ctx context.Context, alias ...column.Alias) (DataFrame, error)
```

In user code, the function can then be used as follows:

```golang
df, err = df.WithColumns(
	ctx,
	column.WithAlias("newCol1", functions.Lit(1)),
	column.WithAlias("newCol2", functions.Lit(2)))
```

This provides a similarly convenient way of creating a sequence of new columns for the function.

### Why are the changes needed?

Correctness.

### Does this PR introduce _any_ user-facing change?

Yes, it changes the signature of the `WithColumns` function.

### How was this patch tested?

Added unit tests and fixed the integration test.
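To make the motivation concrete, here is a minimal, self-contained Go sketch (not part of the patch) contrasting iteration over a map, whose order is deliberately randomized by the runtime, with iteration over a slice, which preserves insertion order; the latter is the property the new variadic signature relies on.

```golang
package main

import "fmt"

func main() {
	// Iterating over a Go map does not follow insertion order; the runtime
	// intentionally randomizes iteration order, so this loop can print the
	// keys in a different order on every run.
	byName := map[string]int{"newCol1": 1, "newCol2": 2, "newCol3": 3}
	for name := range byName {
		fmt.Println("map:", name)
	}

	// A slice preserves the order in which its elements were appended, which
	// is why an ordered sequence of aliases is a safer argument type than a map.
	ordered := []string{"newCol1", "newCol2", "newCol3"}
	for _, name := range ordered {
		fmt.Println("slice:", name)
	}
}
```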
This project houses the experimental client for Spark Connect for Apache Spark written in Golang.
Currently, the Spark Connect client for Golang is highly experimental and should not be used in any production setting. In addition, the PMC of the Apache Spark project reserves the right to withdraw and abandon the development of this project if it is not sustainable.
This section explains how to run Spark Connect Go locally.
Step 1: Install Golang: https://go.dev/doc/install.
Step 2: Ensure you have the buf CLI installed (see the buf documentation for installation instructions).
Step 3: Run the following commands to set up the Spark Connect client.
git clone https://github.com/apache/spark-connect-go.git
git submodule update --init --recursive
make gen && make test
Step 4: Set up the Spark Driver on localhost.
Download a Spark distribution (3.5.0+) and unzip the package.
Start the Spark Connect server with the following command (make sure to use a package version that matches your Spark distribution):
sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.2
Step 5: Run the example Go application.
go run cmd/spark-connect-example-spark-session/main.go
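For orientation, the following is a minimal sketch of what such an application roughly looks like. It is modeled on the bundled example, but the module path, builder methods, and connection URL shown here are assumptions that may differ between releases; treat the code under cmd/spark-connect-example-spark-session as the authoritative reference.

```golang
package main

import (
	"context"
	"log"

	// Assumed module path; check the go.mod of the release you are using.
	"github.com/apache/spark-connect-go/v35/spark/sql"
)

func main() {
	ctx := context.Background()

	// Connect to the Spark Connect server started in Step 4
	// (the default port is 15002).
	spark, err := sql.NewSessionBuilder().Remote("sc://localhost:15002").Build(ctx)
	if err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer spark.Stop()

	// Run a simple query and print the result.
	df, err := spark.Sql(ctx, "SELECT 'hello' AS greeting, 42 AS answer")
	if err != nil {
		log.Fatalf("query failed: %v", err)
	}
	if err := df.Show(ctx, 10, false); err != nil {
		log.Fatalf("show failed: %v", err)
	}
}
```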
The overall goal of the design is to strike a good balance between the principle of least surprise for developers who are familiar with the Apache Spark APIs and idiomatic Go usage. The high-level package structure roughly follows the PySpark guidance, but with Go idioms.
Please review the Contribution to Spark guide for information on how to get started contributing to the project.