[SPARK-44681] Fix issues when writing Go application code using Spark Connect Go client library

### What changes were proposed in this pull request?

When a Go application that uses the Spark Connect Go client lives in the user's own repo (e.g. https://github.com/user-foo/my-go-application), it cannot resolve the client module from `https://github.com/apache/spark-connect-go` correctly, because of two issues:

1. The module name `github.com/apache/spark-connect-go/v_3_4` cannot be resolved: Go complains that it cannot find a `go.mod` file under `github.com/apache/spark-connect-go/v_3_4`. Renaming the module to `github.com/apache/spark-connect-go/v34` fixes this.

2. The Go application also needs the generated Go protobuf code, so the generated code has to be committed to the `github.com/apache/spark-connect-go` repo.
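
A likely explanation for the first issue, based on Go's documented module path rules rather than anything stated in this PR, is that only a final path element of the form `vN` (N >= 2, no leading zeros) is treated as a major version suffix. Any other final element, such as `v_3_4`, would be read as a subdirectory that must contain its own `go.mod`. The sketch below approximates that rule with a regular expression; `hasMajorVersionSuffix` is a hypothetical helper, not part of the client library or the Go toolchain.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// majorSuffix approximates the Go modules rule for major version suffixes:
// "vN" where N >= 2 and N has no leading zeros. This is an illustration only.
var majorSuffix = regexp.MustCompile(`^v([2-9]|[1-9][0-9]+)$`)

// hasMajorVersionSuffix (hypothetical helper) reports whether the last
// element of a module path looks like a major version suffix.
func hasMajorVersionSuffix(modulePath string) bool {
	last := modulePath[strings.LastIndex(modulePath, "/")+1:]
	return majorSuffix.MatchString(last)
}

func main() {
	paths := []string{
		"github.com/apache/spark-connect-go/v_3_4", // old name from this PR
		"github.com/apache/spark-connect-go/v34",   // new name from this PR
	}
	for _, p := range paths {
		fmt.Printf("%s -> major version suffix: %t\n", p, hasMajorVersionSuffix(p))
	}
}
```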

### Why are the changes needed?

See above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by running a Go application that uses the Spark Connect Go client.

Closes #14 from hiboyang/bo-dev-05.

Authored-by: hiboyang <14280154+hiboyang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
20 files changed
tree: b1e35d0d053da81d80bb9431b5f51570b3fe7d81
  1. .github/
  2. client/
  3. cmd/
  4. internal/
  5. .asf.yaml
  6. .gitignore
  7. .gitmodules
  8. buf.gen.yaml
  9. buf.work.yaml
  10. CONTRIBUTING.md
  11. go.mod
  12. go.sum
  13. LICENSE
  14. Makefile
  15. merge_connect_go_pr.py
  16. README.md
README.md

Apache Spark Connect Client for Golang

This project houses the experimental client for Spark Connect for Apache Spark written in Golang.

Current State of the Project

Currently, the Spark Connect client for Golang is highly experimental and should not be used in any production setting. In addition, the PMC of the Apache Spark project reserves the right to withdraw and abandon the development of this project if it is not sustainable.

Getting started

git clone https://github.com/apache/spark-connect-go.git
cd spark-connect-go
git submodule update --init --recursive

make gen && make test

Ensure the buf CLI is installed before running these commands.

Spark Connect Go Application Example

A very simple example in Go looks like the following:

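// Assumes "log" and this repo's sql package are imported.
func main() {
	remote := "localhost:15002" // address of the Spark Connect server
	spark, err := sql.SparkSession.Builder.Remote(remote).Build()
	if err != nil {
		log.Fatalf("Failed to create Spark session: %s", err)
	}
	defer spark.Stop()

	df, err := spark.Sql("select 'apple' as word, 123 as count union all select 'orange' as word, 456 as count")
	if err != nil {
		log.Fatalf("Failed to run query: %s", err)
	}
	df.Show(100, false)
}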

High Level Design

The following diagram shows the main components of the current prototype:

    +-------------------+                                                                              
    |                   |                                                                              
    |   dataFrameImpl   |                                                                              
    |                   |                                                                              
    +-------------------+                                                                              
              |                                                                                        
              |                                                                                        
              +                                                                                        
    +-------------------+                                                                              
    |                   |                                                                              
    | sparkSessionImpl  |                                                                              
    |                   |                                                                              
    +-------------------+                                                                              
              |                                                                                        
              |                                                                                        
              +                                                                                        
+---------------------------+               +----------------+                                         
|                           |               |                |                                         
| SparkConnectServiceClient |--------------+|  Spark Driver  |                                         
|                           |               |                |                                         
+---------------------------+               +----------------+

SparkConnectServiceClient is the gRPC client that talks to the Spark driver. sparkSessionImpl creates dataFrameImpl instances. Each dataFrameImpl uses the gRPC client held by its sparkSessionImpl to communicate with the Spark driver.

We will mimic the logic of the Spark Connect Scala implementation and adopt common Go practices, e.g. returning an error object for error handling.
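
The layering and the error-returning convention described above can be sketched roughly as follows. This is not the project's code: `queryExecutor`, `stubDriver`, `session`, `dataFrame`, and `Collect` are hypothetical names, and the stub stands in for the generated gRPC client so the sketch runs without a server.

```go
package main

import (
	"errors"
	"fmt"
)

// queryExecutor is a hypothetical stand-in for SparkConnectServiceClient.
type queryExecutor interface {
	Execute(query string) ([]string, error)
}

// stubDriver fakes the Spark driver so the sketch runs without a server.
type stubDriver struct{ available bool }

func (d stubDriver) Execute(query string) ([]string, error) {
	if !d.available {
		return nil, errors.New("driver unavailable")
	}
	return []string{"result of: " + query}, nil
}

// session plays the role of sparkSessionImpl: it owns the client.
type session struct{ client queryExecutor }

// dataFrame plays the role of dataFrameImpl: it keeps a reference to the
// session and reaches the driver through the session's client.
type dataFrame struct {
	session *session
	query   string
}

// Sql returns an error value instead of throwing, following Go convention.
func (s *session) Sql(query string) (*dataFrame, error) {
	if query == "" {
		return nil, errors.New("query must not be empty")
	}
	return &dataFrame{session: s, query: query}, nil
}

// Collect runs the query through the session's client and wraps any failure.
func (df *dataFrame) Collect() ([]string, error) {
	rows, err := df.session.client.Execute(df.query)
	if err != nil {
		return nil, fmt.Errorf("collect: %w", err)
	}
	return rows, nil
}

func main() {
	s := &session{client: stubDriver{available: true}}
	df, err := s.Sql("select 1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	rows, err := df.Collect()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rows)
}
```

Keeping the client behind a small interface, as assumed here, would also make the session and data frame layers testable without a running Spark driver.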

How to Run Spark Connect Go Application

  1. Install Golang: https://go.dev/doc/install.

  2. Download a Spark distribution (3.4.0+) and unzip it.

  3. Start the Spark Connect server by running the following command:

sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0

  4. In this repo, run the Go application:

go run cmd/spark-connect-example-spark-session/main.go

Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.