feat(table): add fanout partition writer and rolling data writer (#524)

# Partitioned Fanout Writer with Rolling Data File Support (Append Mode)

This PR completes the implementation of partitioned writing with support
for rolling data files in append mode. It enables efficient,
parallelized ingestion into partitioned tables while maintaining
manifest and snapshot correctness.

**Slack Thread Discussion**:
[Link](https://apache-iceberg.slack.com/archives/C05J3MJ42BD/p1751002533414969)
**Proposal Document**: [Google
Drive](https://drive.google.com/file/d/18CwR9nhwkThs-Q-JZZvisBEaDICvp5Z7/view?usp=drive_link)


### Details

* Introduced parallel processing of `arrow.Record` using a user-defined
number of goroutines.
* Each goroutine maintains its own hash table to map partition keys to
row indices.
* After partitioning, `compute.Take()` is used to extract per-partition
data slices.
* Integrated dedicated rolling writers per partition to manage data file
size thresholds and output constraints.
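
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `row`, `rollingWriter`, and `fanout` are hypothetical names, plain slices stand in for `arrow.Record`, a simple gather loop stands in for `compute.Take()`, and file size is approximated by string length rather than by the real size threshold.

```go
package main

import (
	"fmt"
	"sync"
)

// row is a hypothetical stand-in for one row of an arrow.Record.
type row struct {
	partition string // partition key, assumed already transformed
	value     string
}

// rollingWriter (hypothetical) starts a new "file" before a write that would
// push the current one past targetSize bytes.
type rollingWriter struct {
	mu               sync.Mutex
	targetSize, size int
	current          []string
	files            [][]string
}

func (w *rollingWriter) write(values []string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for _, v := range values {
		if w.size > 0 && w.size+len(v) > w.targetSize {
			w.files, w.current, w.size = append(w.files, w.current), nil, 0
		}
		w.current = append(w.current, v)
		w.size += len(v)
	}
}

// finish flushes the last partial file; call it only after all workers stop.
func (w *rollingWriter) finish() [][]string {
	if len(w.current) > 0 {
		w.files, w.current = append(w.files, w.current), nil
	}
	return w.files
}

// fanout spreads batches over `workers` goroutines and returns the files
// written for each partition key.
func fanout(batches [][]row, workers, targetSize int) map[string][][]string {
	if workers < 1 {
		workers = 1
	}
	var (
		mu      sync.Mutex // guards writers
		wg      sync.WaitGroup
		writers = map[string]*rollingWriter{}
		in      = make(chan []row)
	)
	for n := 0; n < workers; n++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range in {
				// Worker-private hash table: partition key -> row indices.
				indices := map[string][]int{}
				for i, r := range batch {
					indices[r.partition] = append(indices[r.partition], i)
				}
				for key, idx := range indices {
					// Gather this partition's rows (the role compute.Take plays).
					vals := make([]string, 0, len(idx))
					for _, i := range idx {
						vals = append(vals, batch[i].value)
					}
					mu.Lock()
					w, ok := writers[key]
					if !ok {
						w = &rollingWriter{targetSize: targetSize}
						writers[key] = w
					}
					mu.Unlock()
					w.write(vals)
				}
			}
		}()
	}
	for _, b := range batches {
		in <- b
	}
	close(in)
	wg.Wait()
	out := map[string][][]string{}
	for key, w := range writers {
		out[key] = w.finish()
	}
	return out
}

func main() {
	batch := []row{{"a", "r1"}, {"b", "r2"}, {"a", "r3"}, {"a", "r5"}}
	files := fanout([][]row{batch}, 2, 4)
	fmt.Println(len(files["a"]), len(files["b"])) // prints "2 1"
}
```

In this sketch the index maps are private to each worker, so only the lookup of the shared per-partition writer needs a lock. Row order within a partition's files can vary when several batches are processed concurrently.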


### Tests Performed

* [x] Compatible with all partition transforms
* [x] Handled null values in partition columns
* [x] Validated compatibility with partition spec evolution
* [x] Verified correctness for non-linear transformation cases
* [x] Confirmed schema evolution compatibility
* [x] Partition pruning verified

---

@zeroshade — would appreciate your review when you get a chance!

---------

Signed-off-by: badalprasadsingh <badal@datazip.io>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
16 files changed
tree: 163883fd0a1e9c4c5971fe0fde574c61a18a7321
README.md

Iceberg Golang

iceberg is a Golang implementation of the Iceberg table spec.

Build From Source

Prerequisites

  • Go 1.23 or later

Build

$ git clone https://github.com/apache/iceberg-go.git
$ cd iceberg-go/cmd/iceberg && go build .

Feature Support / Roadmap

FileSystem Support

| Filesystem Type      | Supported |
|:---------------------|:---------:|
| S3                   |     X     |
| Google Cloud Storage |     X     |
| Azure Blob Storage   |     X     |
| Local Filesystem     |     X     |

Metadata

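| Operation              | Supported |
|:-----------------------|:---------:|
| Get Schema             |     X     |
| Get Snapshots          |     X     |
| Get Sort Orders        |     X     |
| Get Partition Specs    |     X     |
| Get Manifests          |     X     |
| Create New Manifests   |     X     |
| Plan Scan              |     X     |
| Plan Scan for Snapshot |     X     |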

Catalog Support

OperationRESTHiveGlueSQL
Load TableXXX
List TablesXXX
Create TableXXX
Register TableXX
Update Current SnapshotXXX
Create New SnapshotXXX
Rename TableXXX
Drop TableXXX
Alter TableXXX
Check Table ExistsXXX
Set Table PropertiesXXX
List NamespacesXXX
Create NamespaceXXX
Check Namespace ExistsXXX
Drop NamespaceXXX
Update Namespace PropertiesXXX
Create ViewXX
Load ViewX
List ViewXX
Drop ViewXX
Check View ExistsXX

Read/Write Data Support

  • Data can currently be read as an Arrow Table or as a stream of Arrow record batches.

Supported Write Operations

As long as the FileSystem is supported and the Catalog supports altering the table, the following tracks the current write support:

| Operation         | Supported |
|:------------------|:---------:|
| Append Stream     |     X     |
| Append Data Files |     X     |
| Rewrite Files     |           |
| Rewrite manifests |           |
| Overwrite Files   |           |
| Write Pos Delete  |           |
| Write Eq Delete   |           |
| Row Delta         |           |

CLI Usage

Run `go build ./cmd/iceberg` from the root of this repository to build the CLI executable. Alternatively, run `go install github.com/apache/iceberg-go/cmd/iceberg` to install it to the bin directory of your GOPATH.

Usage of the iceberg CLI is very similar to that of the pyiceberg CLI.
You can pass the catalog URI with the `--uri` argument.

Example: You can start the Iceberg REST API Docker image, which listens on port 8181 by default:

docker pull apache/iceberg-rest-fixture:latest
docker run -p 8181:8181 apache/iceberg-rest-fixture:latest

Then run the iceberg CLI, pointing it at the REST API server:

 ./iceberg --uri http://0.0.0.0:8181 list
┌─────┐
| IDs |
| --- |
└─────┘

Create Namespace

./iceberg --uri http://0.0.0.0:8181 create namespace taxitrips

List Namespace

 ./iceberg --uri http://0.0.0.0:8181 list
┌───────────┐
| IDs       |
| --------- |
| taxitrips |
└───────────┘


Get in Touch