[GOBBLIN-1667] Create new predicate - ExistingPartitionSkipPredicate (#3526)
Currently the hive.dataset.existing.entity.policy.ABORT will not
abort if there is an existing partition. One option to resolve this
is to support the ABORT configuration, but that might be backwards
incompatible, so this change introduces a new skip predicate called
ExistingPartitionSkipPredicate that skips any partition that
already exists in the target table.
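
The idea behind such a skip predicate can be illustrated with a minimal sketch. This is not the code added by the commit: the class name `ExistingPartitionSkipSketch`, the helper `partitionsToCopy`, and the use of plain partition-name strings with `java.util.function.Predicate` are all assumptions made for illustration. It also assumes the set of partitions already present in the target table has been fetched beforehand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch only; not the actual Gobblin ExistingPartitionSkipPredicate.
class ExistingPartitionSkipSketch implements Predicate<String> {
    // Names of partitions already present in the target table
    // (assumed to have been looked up before the copy is planned).
    private final Set<String> existingTargetPartitions;

    ExistingPartitionSkipSketch(Set<String> existingTargetPartitions) {
        this.existingTargetPartitions = new HashSet<>(existingTargetPartitions);
    }

    // Returns true when the partition should be skipped,
    // i.e. when it already exists in the target table.
    @Override
    public boolean test(String partitionName) {
        return existingTargetPartitions.contains(partitionName);
    }

    // Keeps only the source partitions that the skip predicate does not match.
    static List<String> partitionsToCopy(List<String> sourcePartitions, Predicate<String> skip) {
        return sourcePartitions.stream()
                .filter(skip.negate())
                .collect(Collectors.toList());
    }
}
```

With a target table that already holds one partition, passing a list of two source partitions to `partitionsToCopy` would leave only the partition that is missing from the target; nothing is aborted and existing partitions are left untouched.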
3 files changed
tree: 067842ed4343094758b62f785a40fe2a9f2f6ccd
- .github/
- bin/
- buildSrc/
- conf/
- config/
- dev/
- gobblin-admin/
- gobblin-all/
- gobblin-api/
- gobblin-audit/
- gobblin-aws/
- gobblin-binary-management/
- gobblin-cluster/
- gobblin-compaction/
- gobblin-completeness/
- gobblin-config-management/
- gobblin-core/
- gobblin-core-base/
- gobblin-data-management/
- gobblin-distribution/
- gobblin-docker/
- gobblin-docs/
- gobblin-example/
- gobblin-hive-registration/
- gobblin-iceberg/
- gobblin-kubernetes/
- gobblin-metastore/
- gobblin-metrics-libs/
- gobblin-modules/
- gobblin-oozie/
- gobblin-rest-service/
- gobblin-restli/
- gobblin-runtime/
- gobblin-runtime-hadoop/
- gobblin-salesforce/
- gobblin-service/
- gobblin-test/
- gobblin-test-harness/
- gobblin-test-utils/
- gobblin-tunnel/
- gobblin-utility/
- gobblin-yarn/
- gradle/
- ligradle/
- maven-nexus/
- maven-sonatype/
- .asf.yaml
- .codecov_bash
- .dockerignore
- .gitignore
- build.gradle
- CHANGELOG.md
- defaultEnvironment.gradle
- gobblin-flavored-build.gradle
- gradle.properties
- gradlew
- gradlew.bat
- HEADER
- LICENSE
- mkdocs.yml
- NOTICE
- query_github_issues.py
- README.md
- readthedocs.yml
- settings.gradle
README.md
Apache Gobblin
Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.
Capabilities
- Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
- Data Organization within the lake (e.g. compaction, partitioning, deduplication)
- Lifecycle Management of data within the lake (e.g. data retention)
- Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)
Highlights
- Battle tested at scale: Runs in production at petabyte-scale at companies like LinkedIn, PayPal, Verizon etc.
- Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance etc.
- Supports stream and batch execution modes
- Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.
Common Patterns used in production
- Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
- Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
- Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
- Integrating external vendor APIs (e.g. Salesforce, Dynamics etc.) with data stores (HDFS, Couchbase etc.)
- Enforcing Data retention policies and GDPR deletion on HDFS / ADLS
Apache Gobblin is NOT
- A general purpose data transformation engine like Spark or Flink. Gobblin can delegate complex-data processing tasks to Spark, Hive etc.
- A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
- A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.
Requirements
If building the distribution with tests turned on:
Instructions to run Apache RAT (Release Audit Tool)
- Extract the archive file to your local directory.
- Run `./gradlew rat`. The report will be generated under build/rat/rat-report.html
Instructions to build the distribution
- Extract the archive file to your local directory.
- Skip tests and build the distribution: Run `./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain`. The distribution will be created in the build/gobblin-distribution/distributions directory.
(or)
- Run tests and build the distribution (requires Maven): Run `./gradlew build`. The distribution will be created in the build/gobblin-distribution/distributions directory.
Quick Links