Improve Package Structure

### What changes were proposed in this pull request?
This patch fixes several issues in the current code and in how the interfaces are used and exposed to other packages. In particular, all DataFrame functions now take a `context.Context`, both for API consistency and so that schemas and similar metadata can be resolved dynamically.
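For illustration, here is a minimal sketch of this context-threading convention, assuming a `DataFrame` interface in a `sql` package; the method set and names are inferred from the description, not the actual definitions:

```go
package sql

import "context"

// DataFrame sketches the convention described above: every function takes a
// context.Context so that schemas and similar metadata can be resolved
// dynamically against the server. The method list is illustrative only.
type DataFrame interface {
	// Schema resolves the DataFrame's schema using the given context.
	Schema(ctx context.Context) (*StructType, error)
	// Show prints up to numRows rows, optionally truncating wide columns.
	Show(ctx context.Context, numRows int, truncate bool) error
}

// StructType stands in for the real schema type.
type StructType struct{}
```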

Second, all columns and expressions now implement the `column.Convertible` interface to indicate that they can be converted to Proto. Third, the column package adds a `HasSchema`-like interface that allows passing a DataFrame without exposing the actual interface type.

Lastly, the `ToPlan` method on `Column` / `Expression` / `Convertible` is renamed to `ToProto` for clarity.
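A rough sketch of how these pieces could fit together; the signatures and stand-in types below are assumptions inferred from the description rather than the actual package contents:

```go
package column

import "context"

// Expression stands in for the generated Spark Connect proto message type.
type Expression struct{}

// StructType stands in for the real schema type.
type StructType struct{}

// Convertible marks columns and expressions that can be converted to their
// Proto form; ToProto is the method formerly called ToPlan.
type Convertible interface {
	ToProto(ctx context.Context) (*Expression, error)
}

// HasSchema allows a DataFrame to be passed in for schema resolution
// without depending on the concrete DataFrame interface type.
type HasSchema interface {
	Schema(ctx context.Context) (*StructType, error)
}
```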

### Why are the changes needed?
Ease of use.

### Does this PR introduce _any_ user-facing change?
Slight package-level changes, but the project is still pre-release.

### How was this patch tested?
Existing tests.

Closes #65 from grundprinzip/package_refactorign.

Authored-by: Martin Grund <martin.grund@databricks.com>
Signed-off-by: Martin Grund <martin.grund@databricks.com>

# Apache Spark Connect Client for Golang

This project houses the experimental Spark Connect client for Apache Spark, written in Golang.

## Current State of the Project

Currently, the Spark Connect client for Golang is highly experimental and should not be used in any production setting. In addition, the PMC of the Apache Spark project reserves the right to withdraw and abandon the development of this project if it is not sustainable.

## Getting started

This section explains how to run Spark Connect Go locally.

Step 1: Install Golang: https://go.dev/doc/install.

Step 2: Ensure you have the buf CLI installed; see the buf documentation for installation instructions.

Step 3: Run the following commands to set up the Spark Connect client.

git clone https://github.com/apache/spark-connect-go.git
git submodule update --init --recursive

make gen && make test

Step 4: Set up the Spark Driver on localhost.

  1. Download a Spark distribution (3.5.0+) and unzip the package.

  2. Start the Spark Connect server with the following command (make sure to use a package version that matches your Spark distribution):

sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.2

Step 5: Run the example Go application.

go run cmd/spark-connect-example-spark-session/main.go

## How to write a Spark Connect Go application in your own project

See the Quick Start Guide (quick-start.md).
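For orientation, a minimal application might look roughly like the following; the import path and builder API are assumptions modeled on the bundled example under cmd/spark-connect-example-spark-session and may differ between releases:

```go
package main

import (
	"context"
	"log"

	"github.com/apache/spark-connect-go/v35/spark/sql"
)

func main() {
	ctx := context.Background()

	// Connect to the Spark Connect server started in Step 4 above.
	spark, err := sql.NewSessionBuilder().Remote("sc://localhost:15002").Build(ctx)
	if err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer spark.Stop()

	// Run a simple query and print the result.
	df, err := spark.Sql(ctx, "SELECT 'hello' AS greeting")
	if err != nil {
		log.Fatalf("query failed: %v", err)
	}
	if err := df.Show(ctx, 10, false); err != nil {
		log.Fatalf("show failed: %v", err)
	}
}
```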

## High Level Design

The overall goal of the design is to strike a good balance between the principle of least surprise for developers familiar with the Apache Spark APIs and idiomatic Go usage. The high-level package structure roughly follows the PySpark guidance, adapted to Go idioms.

## Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.