commit	3b627022f496616a0d4e7e44fe3ca4fc4f1131e9
author	Sreeram Ramachandran <sramachandran@linkedin.com>	Wed Feb 16 09:20:05 2022 -0800
committer	GitHub <noreply@github.com>	Wed Feb 16 09:20:05 2022 -0800
tree	f94f1bab42433e52a4f22f389cd6f2fd4147e706
parent	cae40b87ce7846779380c0d000513ade31a4da0b

Use helper-all v0.2.74 to solve issues around default values. (#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality using their .toString() representations. Doing
   it the old way works for two out of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` piece to recursive_schema_1_converted.avsc.
   This is harmless, and is caused by the fact that the compatibility
   helper always adds it if it's a valid default for the schema. See
   the comments for FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
$ for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
> git show 928e0180c471fc4b7a6caee041b001b5b34e1cc6:$file > /tmp/before
> diff <(jq . </tmp/before) <(jq . <$file)
> done
```
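The loop above still reports the `"default":null` line from exception 2 as a difference. A minimal sketch of a comparison that ignores it, assuming jq 1.5 or later (for `walk`); `normalize_avsc` and `same_avsc` are hypothetical helper names, not part of the repository:

```shell
# Hypothetical helpers, not part of the Gobblin repository.
# Assumes jq >= 1.5 is installed (needed for the walk builtin).

# Print a schema file as canonical JSON: object keys sorted, and every
# "default": null entry dropped so that it is ignored when comparing.
normalize_avsc() {
  jq -S 'walk(if type == "object" and has("default") and .default == null
              then del(.default) else . end)' "$1"
}

# Exit 0 when two schema files are equal after normalization,
# 1 when they differ, and 2 when jq cannot parse either file.
same_avsc() {
  _a=$(normalize_avsc "$1") || return 2
  _b=$(normalize_avsc "$2") || return 2
  [ "$_a" = "$_b" ]
}

# Possible use inside the loop above:
#   git show 928e0180c471fc4b7a6caee041b001b5b34e1cc6:"$file" > /tmp/before
#   same_avsc /tmp/before "$file" || echo "unexpected difference: $file"
```

Note that this treats a field with `"default":null` and a field with no default as the same, which is only appropriate for this particular check.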

7 files changed


README.md

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
Data Organization within the lake (e.g. compaction, partitioning, deduplication)
Lifecycle Management of data within the lake (e.g. data retention)
Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

Battle tested at scale: Runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, etc.
Supports stream and batch execution modes
Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
Integrate external vendor APIs (e.g. Salesforce, Dynamics, etc.) with data stores (HDFS, Couchbase, etc.)
Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to Spark, Hive, etc.
A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

Java >= 1.8

If building the distribution with tests turned on:

Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

Extract the archive file to your local directory.
Run ./gradlew rat. The report will be generated at build/rat/rat-report.html

Instructions to build the distribution

Extract the archive file to your local directory.
Skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain; the distribution will be created in the build/gobblin-distribution/distributions directory. (or)
Run tests and build the distribution (requires Maven): run ./gradlew build; the distribution will be created in the build/gobblin-distribution/distributions directory.

Quick Links