Skip directory entries in RecursiveCopyableDataset to fix IOException on empty source dirs (#4181)

* [GOBBLIN-XXXX] Skip directory entries in RecursiveCopyableDataset to avoid IOException on empty source dirs

When source.path is an empty directory, FileListUtils.listFilesToCopyAtPath
(includeEmptyDirectories=true) returns the directory itself as the sole
FileStatus entry. The subsequent call to
resolveReplicatedOwnerAndPermissionsRecursively passes file.getPath().getParent()
as fromPath — which is *above* replacedPrefix — inverting the ancestry check
and throwing an IOException.

Skip FileStatus entries where isDirectory=true; empty source directories
produce no copy work units by design. Log a warning so operators can
diagnose misconfigured source paths.
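
As a rough illustration of the skip described above, the following sketch filters directory entries out of a listing and logs a warning for each one. It is not the Gobblin code: `SkipDirectoriesSketch`, `Entry`, and `copyableEntries` are hypothetical names, and `Entry` is a minimal stand-in for Hadoop's FileStatus so the example runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

public class SkipDirectoriesSketch {
  private static final Logger LOG = Logger.getLogger(SkipDirectoriesSketch.class.getName());

  /** Hypothetical stand-in for Hadoop's FileStatus: just a path and a directory flag. */
  public static final class Entry {
    private final String path;
    private final boolean directory;

    public Entry(String path, boolean directory) {
      this.path = path;
      this.directory = directory;
    }

    public String getPath() {
      return path;
    }

    public boolean isDirectory() {
      return directory;
    }
  }

  /** Returns only the non-directory entries, warning about each directory that is dropped. */
  public static List<Entry> copyableEntries(List<Entry> listed) {
    List<Entry> files = new ArrayList<>();
    for (Entry entry : listed) {
      if (entry.isDirectory()) {
        // An empty source directory shows up as a directory entry; no work unit is built for it.
        LOG.warning("Skipping directory entry " + entry.getPath() + "; check the configured source path");
        continue;
      }
      files.add(entry);
    }
    return files;
  }
}
```

With this filter, a listing that contains only the empty source directory yields an empty result, so no copy work units are created.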

* [ETL-19035] Fix ancestor path resolution for empty source directories in RecursiveCopyableDataset

When includeEmptyDirectories=true and the source root is empty, FileListUtils
returns the root directory itself as a FileStatus entry. Calling .getParent()
on it produces a path above replacedPrefix, breaking the ancestry check in
resolveReplicatedOwnerAndPermissionsRecursively.

Fix: guard with isAncestor(replacedPrefix, parentPath) — fall back to the
file's own path only when parentPath is above the dataset root. This preserves
correct ancestor permission replication for nested empty subdirs (where
.getParent() is still within replacedPrefix) and ensures the empty root
directory is replicated at the destination rather than skipped.
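
The guard can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this change: it uses java.nio.file.Path instead of Hadoop's Path so it runs standalone, and `AncestorGuardSketch`, `resolveFromPath`, and this `isAncestor` are hypothetical stand-ins for the real helpers.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class AncestorGuardSketch {

  /** Stand-in ancestry check: true when candidate equals root or lies beneath it. */
  public static boolean isAncestor(Path root, Path candidate) {
    return candidate != null && candidate.normalize().startsWith(root.normalize());
  }

  /** Chooses the path from which owner and permission resolution starts. */
  public static Path resolveFromPath(Path replacedPrefix, Path file) {
    Path parent = file.getParent();
    // Fall back to the entry's own path only when its parent is above the dataset root.
    return isAncestor(replacedPrefix, parent) ? parent : file;
  }

  public static void main(String[] args) {
    Path root = Paths.get("/data/source");

    // Empty source root: its parent (/data) is above the root, so the root itself is used.
    System.out.println(resolveFromPath(root, root));

    // Nested empty subdirectory: its parent is still inside the root, so the parent is used.
    System.out.println(resolveFromPath(root, root.resolve("a/empty")));
  }
}
```

The fallback applies only to the dataset root, so nested entries keep resolving from their parent exactly as before.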

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2 files changed
tree: e9befd91661221432b9e14f050d2069dea54c00f
  1. .github/
  2. bin/
  3. buildSrc/
  4. conf/
  5. config/
  6. dev/
  7. gobblin-admin/
  8. gobblin-all/
  9. gobblin-api/
  10. gobblin-audit/
  11. gobblin-aws/
  12. gobblin-binary-management/
  13. gobblin-cluster/
  14. gobblin-compaction/
  15. gobblin-completeness/
  16. gobblin-config-management/
  17. gobblin-core/
  18. gobblin-core-base/
  19. gobblin-data-management/
  20. gobblin-distribution/
  21. gobblin-docker/
  22. gobblin-docs/
  23. gobblin-example/
  24. gobblin-hive-registration/
  25. gobblin-iceberg/
  26. gobblin-kubernetes/
  27. gobblin-metastore/
  28. gobblin-metrics-libs/
  29. gobblin-modules/
  30. gobblin-oozie/
  31. gobblin-rest-service/
  32. gobblin-restli/
  33. gobblin-runtime/
  34. gobblin-runtime-hadoop/
  35. gobblin-salesforce/
  36. gobblin-service/
  37. gobblin-temporal/
  38. gobblin-test/
  39. gobblin-test-harness/
  40. gobblin-test-utils/
  41. gobblin-tunnel/
  42. gobblin-utility/
  43. gobblin-yarn/
  44. gradle/
  45. ligradle/
  46. maven-nexus/
  47. maven-sonatype/
  48. .asf.yaml
  49. .codecov_bash
  50. .dockerignore
  51. .gitignore
  52. build.gradle
  53. CHANGELOG.md
  54. defaultEnvironment.gradle
  55. FlowTriggerHandlerTest.java
  56. gobblin-flavored-build.gradle
  57. gradle.properties
  58. gradlew
  59. gradlew.bat
  60. HEADER
  61. LICENSE
  62. mkdocs.yml
  63. NOTICE
  64. query_github_issues.py
  65. README.md
  66. readthedocs.yml
  67. settings.gradle
README.md

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

  • Battle tested at scale: Runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
  • Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, and more.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrating external vendor APIs (e.g. Salesforce, Dynamics) with data stores (e.g. HDFS, Couchbase)
  • Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to engines such as Spark or Hive.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to download gradle wrapper

If you are going to build Gobblin from the source distribution, run one of the following commands to download gradle-wrapper.jar from the Gobblin git repository into the gradle/wrapper directory (replace GOBBLIN_VERSION in the URL with the version you downloaded).

wget --no-check-certificate -P gradle/wrapper https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

(or)

curl --insecure -L https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar

Alternatively, you can download it manually from: https://github.com/apache/gobblin/blob/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

Make sure that you download it to the gradle/wrapper directory.

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. Skip tests and build the distribution. Run:
     ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain
     The distribution will be created in the build/gobblin-distribution/distributions directory.
     (or)
  3. Run tests and build the distribution (requires Maven). Run:
     ./gradlew build
     The distribution will be created in the build/gobblin-distribution/distributions directory.

Quick Links