Use helper-all v0.2.74 to solve issues around default values. (#3469)

The latest version of helper-all fixes the issues seen before w.r.t.
default values, so we can now revert the code and the *.avsc files back
to how they used to be, with two minor exceptions:

1. Check Schema equality by comparing the .toString() representations.
   The old comparison works for two of the three instances, but one of
   them fails, for reasons I haven't figured out yet.

2. Add a `"default":null` entry to recursive_schema_1_converted.avsc.
   This is harmless: the compatibility helper always adds it when it
   is a valid default for the schema. See the comments for
   FieldBuilder19.setDefault():
   https://github.com/linkedin/avro-util/blob/b9e89c55980ea8e5fd3c8d8da362d7195dd2a99c/helper/impls/helper-impl-19/src/main/java/com/linkedin/avroutil1/compatibility/avro19/FieldBuilder19.java#L69

To verify that the files are otherwise the same as before:
```
# Compare each schema file with its earlier version, ignoring JSON formatting.
for file in gobblin-core-base/src/test/resources/converter/*.avsc; do
  git show "928e0180c471fc4b7a6caee041b001b5b34e1cc6:$file" > /tmp/before
  diff <(jq . </tmp/before) <(jq . <"$file")
done
```
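
The same idea can be sketched without git or jq. This is only an illustration: the two schema fragments below are made up and are not the real contents of the *.avsc files, and `normalized` and `added_keys` are hypothetical helper names. Re-serializing both documents the same way removes formatting differences, so whatever still differs (here, an added `"default": null`) is a real change.

```python
import json


def normalized(schema_text: str) -> str:
    """Re-serialize JSON with fixed indentation, keeping key order (similar to `jq .`)."""
    return json.dumps(json.loads(schema_text), indent=2)


def added_keys(before, after, path=""):
    """Yield paths of object keys that exist in `after` but not in `before`."""
    if isinstance(before, dict) and isinstance(after, dict):
        for key, value in after.items():
            child = f"{path}/{key}"
            if key not in before:
                yield child
            else:
                yield from added_keys(before[key], value, child)
    elif isinstance(before, list) and isinstance(after, list):
        for index, (old, new) in enumerate(zip(before, after)):
            yield from added_keys(old, new, f"{path}/{index}")


# Hypothetical schema fragments (assumption: NOT the real test resources).
BEFORE = '{"type":"record","name":"Node","fields":[{"name":"next","type":["null","Node"]}]}'
AFTER = """
{
  "type": "record",
  "name": "Node",
  "fields": [
    {"name": "next", "type": ["null", "Node"], "default": null}
  ]
}
"""

if __name__ == "__main__":
    # Whether the two documents match once formatting is normalized.
    print(normalized(BEFORE) == normalized(AFTER))
    # Which keys were added in the newer document.
    print(list(added_keys(json.loads(BEFORE), json.loads(AFTER))))
```

`added_keys` only reports added object keys; changed or removed values would still show up in a plain diff of the normalized text.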
7 files changed
tree: f94f1bab42433e52a4f22f389cd6f2fd4147e706
README.md

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

  • Battle tested at scale: Runs in production at petabyte-scale at companies like LinkedIn, PayPal, Verizon etc.
  • Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance etc.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrate external vendor APIs (e.g. Salesforce, Dynamics etc.) with data stores (HDFS, Couchbase etc.)
  • Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. Report will be generated under build/rat/rat-report.html

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. Skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain (the distribution will be created in the build/gobblin-distribution/distributions directory), or
  3. Run tests and build the distribution (requires Maven): run ./gradlew build (the distribution will be created in the build/gobblin-distribution/distributions directory).

Quick Links