[GOBBLIN-1438] Only load failed dags at time of resume

Closes #3272 from jack-moseley/failed-dag-memory
8 files changed
README.md

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

  • Battle-tested at scale: Runs in production at petabyte scale at companies like LinkedIn, PayPal, Verizon, etc.
  • Feature-rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, etc.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrating external vendor APIs (e.g. Salesforce, Dynamics, etc.) with data stores (HDFS, Couchbase, etc.)
  • Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive, etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, or Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html.

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. Either skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory.
  3. Or run tests and build the distribution (requires Maven): run ./gradlew build. The distribution will be created in the same build/gobblin-distribution/distributions directory.
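
The two build variants above can be wrapped in a small shell helper. This is only a sketch: gobblin_build_args is a hypothetical function name, not something shipped with Gobblin, and it simply prints the Gradle arguments quoted in the steps above.

```shell
#!/usr/bin/env bash
# Sketch only: gobblin_build_args is a hypothetical helper, not part of Gobblin.
# It prints the Gradle arguments for the two build variants listed above.
gobblin_build_args() {
  case "$1" in
    skip-tests)
      # Build while excluding the findbugsMain, test, rat and checkstyleMain tasks.
      echo "build -x findbugsMain -x test -x rat -x checkstyleMain"
      ;;
    with-tests)
      # Full build with tests included (the README notes this requires Maven).
      echo "build"
      ;;
    *)
      echo "usage: gobblin_build_args skip-tests|with-tests" >&2
      return 1
      ;;
  esac
}

# Usage, from the extracted source directory
# (the unquoted substitution is intentional so the arguments are split):
#   ./gradlew $(gobblin_build_args skip-tests)
#   ls build/gobblin-distribution/distributions
```

Either variant should leave the distribution under build/gobblin-distribution/distributions, as stated in the steps above.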

Quick Links