[GOBBLIN-1496] Initialize arrays and maps related to gobblin as a service with an in… (#3339)

* optimize some maps and arrays that can be initialized with a fixed size

* use newHashmapWithExpectedSize

* fix checkstyle

* fix bug

* undo accidental change getTimeUnit
11 files changed
tree: b64fdcbe61ea37d721450fdb0d971b70b2f70b63
README.md

Apache Gobblin


Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grain data deletions)

Highlights

  • Battle-tested at scale: runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
  • Feature-rich: supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, and more.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integration of external vendor APIs (e.g. Salesforce, Dynamics) with data stores (e.g. HDFS, Couchbase)
  • Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to engines such as Spark or Hive.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, or Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. To skip tests and build the distribution, run: ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain
     The distribution will be created in the build/gobblin-distribution/distributions directory.
  3. Alternatively, to run tests and build the distribution (requires Maven), run: ./gradlew build
     The distribution will be created in the build/gobblin-distribution/distributions directory.
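
The two build options above can be wrapped in a small shell helper. This is only a sketch: the function name gobblin_build_cmd and the mode names skip-tests and with-tests are hypothetical and not part of Gobblin. The Gradle invocations themselves are copied from the steps above. The helper only prints the command, so it can be inspected before anything is run.

```shell
#!/bin/sh
# Hypothetical helper (not part of Gobblin): prints the Gradle command
# for the requested build mode, using the invocations listed above.
gobblin_build_cmd() {
  case "$1" in
    skip-tests)
      # Build the distribution without findbugs, tests, RAT, or checkstyle.
      echo "./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain"
      ;;
    with-tests)
      # Full build including tests (requires Maven, per the Requirements section).
      echo "./gradlew build"
      ;;
    *)
      echo "usage: gobblin_build_cmd skip-tests|with-tests" >&2
      return 1
      ;;
  esac
}
```

To use it, change into the directory where the archive was extracted and run the printed command, for example sh -c "$(gobblin_build_cmd skip-tests)". In either mode the distribution is expected under build/gobblin-distribution/distributions.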

Quick Links