[GOBBLIN-1823] Improving Container Calculation and Allocation Methodology (#3692)

* address comments

* use connectionmanager when httpclient is not closeable

* [GOBBLIN-1823] Improving Container Calculation and Allocation Methodology

* improve code style

* add more un-retriable status that we want to log out

* address comments

* add log when mismatch happens for debuggability

---------

Co-authored-by: Zihan Li <zihli@zihli-mn2.linkedin.biz>
2 files changed
tree: abfc4c66fb1829b2beedf7bc6baa0d0ea255e4d8
  1. .github/
  2. bin/
  3. buildSrc/
  4. conf/
  5. config/
  6. dev/
  7. gobblin-admin/
  8. gobblin-all/
  9. gobblin-api/
  10. gobblin-audit/
  11. gobblin-aws/
  12. gobblin-binary-management/
  13. gobblin-cluster/
  14. gobblin-compaction/
  15. gobblin-completeness/
  16. gobblin-config-management/
  17. gobblin-core/
  18. gobblin-core-base/
  19. gobblin-data-management/
  20. gobblin-distribution/
  21. gobblin-docker/
  22. gobblin-docs/
  23. gobblin-example/
  24. gobblin-hive-registration/
  25. gobblin-iceberg/
  26. gobblin-kubernetes/
  27. gobblin-metastore/
  28. gobblin-metrics-libs/
  29. gobblin-modules/
  30. gobblin-oozie/
  31. gobblin-rest-service/
  32. gobblin-restli/
  33. gobblin-runtime/
  34. gobblin-runtime-hadoop/
  35. gobblin-salesforce/
  36. gobblin-service/
  37. gobblin-test/
  38. gobblin-test-harness/
  39. gobblin-test-utils/
  40. gobblin-tunnel/
  41. gobblin-utility/
  42. gobblin-yarn/
  43. gradle/
  44. ligradle/
  45. maven-nexus/
  46. maven-sonatype/
  47. .asf.yaml
  48. .codecov_bash
  49. .dockerignore
  50. .gitignore
  51. build.gradle
  52. CHANGELOG.md
  53. defaultEnvironment.gradle
  54. gobblin-flavored-build.gradle
  55. gradle.properties
  56. gradlew
  57. gradlew.bat
  58. HEADER
  59. LICENSE
  60. mkdocs.yml
  61. NOTICE
  62. query_github_issues.py
  63. README.md
  64. readthedocs.yml
  65. settings.gradle
README.md

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

  • Battle-tested at scale: Runs in production at petabyte scale at companies like LinkedIn, PayPal, Verizon, etc.
  • Feature-rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, etc.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrate external vendor APIs (e.g. Salesforce, Dynamics, etc.) with data stores (HDFS, Couchbase, etc.)
  • Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to Spark, Hive, etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3
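
The Java requirement above can be checked before starting a build. The following is a minimal sketch, not part of Gobblin: the function name java_version_ok is hypothetical, and it assumes the version string is either in the legacy "1.x" form (for example 1.8.0_292) or the newer form (for example 11.0.2). Strings with non-numeric leading components are simply rejected.

```shell
# Hypothetical helper: returns 0 if the given Java version string is >= 1.8.
# Handles the legacy "1.x" scheme (1.8.0_292) and the newer scheme (11.0.2, 17).
java_version_ok() {
  version="$1"
  major="${version%%.*}"        # text before the first dot
  if [ "$major" = "1" ]; then
    rest="${version#1.}"        # drop the leading "1."
    minor="${rest%%.*}"         # second component, e.g. "8"
    [ "$minor" -ge 8 ]
  else
    [ "$major" -ge 9 ]          # Java 9 and later report their major version first
  fi
}
```

One way to feed it the installed version is to extract the quoted string from the first line printed by java -version, for example java_version_ok "$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')".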

Instructions to download gradle wrapper

If you are building Gobblin from the source distribution, run one of the following commands to download gradle-wrapper.jar from the Gobblin git repository into the gradle/wrapper directory (replace GOBBLIN_VERSION in the URL with the version you downloaded).

wget --no-check-certificate -P gradle/wrapper https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

(or)

curl --insecure -L https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar

Alternatively, you can download it manually from: https://github.com/apache/gobblin/blob/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

Make sure that you download it to the gradle/wrapper directory.
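
The download steps above can also be wrapped in a small script. This is a sketch under stated assumptions: the function names wrapper_url and download_wrapper are hypothetical, and unlike the wget and curl commands above it leaves TLS certificate verification enabled, which may fail in environments where those flags were needed.

```shell
# Hypothetical helper: print the raw-download URL of gradle-wrapper.jar
# for the Gobblin version passed as the first argument.
wrapper_url() {
  printf 'https://github.com/apache/gobblin/raw/%s/gradle/wrapper/gradle-wrapper.jar\n' "$1"
}

# Hypothetical helper: download the jar into gradle/wrapper, creating the
# directory if it does not exist. --fail makes curl exit non-zero on HTTP errors
# instead of saving an error page as the jar.
download_wrapper() {
  mkdir -p gradle/wrapper &&
    curl --fail --location --output gradle/wrapper/gradle-wrapper.jar "$(wrapper_url "$1")"
}
```

With GOBBLIN_VERSION set to the version you downloaded, the call would be download_wrapper "$GOBBLIN_VERSION", run from the root of the extracted source tree.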

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. Skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory.
  3. Alternatively, run tests and build the distribution (requires Maven): run ./gradlew build. The distribution will be created in the build/gobblin-distribution/distributions directory.
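
The two build variants above differ only in the tasks they exclude. A minimal sketch of selecting between them from a script follows; the function name build_args and the "skip" keyword are hypothetical and not part of the Gobblin build.

```shell
# Hypothetical helper: print the Gradle arguments for building the distribution.
# Pass "skip" to exclude tests and static checks; any other value gives the full build.
build_args() {
  if [ "$1" = "skip" ]; then
    echo "build -x findbugsMain -x test -x rat -x checkstyleMain"
  else
    echo "build"
  fi
}
```

It would be used as ./gradlew $(build_args skip), with the command substitution left unquoted so that the shell splits the output into separate arguments.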

Quick Links