The Gobblin Compliance module allows for data purging to meet regulatory compliance requirements. The module includes purging, retention and restore functionality for datasets.
Purging is performed using Hive, which means datasets can be purged in any format that Hive can read from and write to, including, for example, ORC and Parquet. Furthermore, the purger is built on top of the Gobblin framework, so it takes full advantage of the fault tolerance, scalability, and flexibility that Gobblin provides.
The User Guide describes how to onboard a dataset for purging.
The elements of the Compliance design are:
A dataset is onboarded to the Purger with these steps:
The purger iterates over all whitelisted tables and, for each of them, looks for a dataset descriptor that specifies the information required to proceed with the purge. With this information, the purger iterates over the partitions of each table that needs to be purged and purges each partition individually.
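
The whitelist and the descriptor are supplied at onboarding time; as one illustration, a descriptor can be attached to a Hive table as a table property. In the sketch below, the property key `dataset.descriptor` and the JSON field `identifierField` are assumed names for illustration only; the User Guide documents the actual onboarding configuration.

```sql
-- A minimal sketch of attaching a dataset descriptor to a Hive table as a
-- table property. The key 'dataset.descriptor' and the JSON field
-- 'identifierField' are assumed names for illustration; the User Guide
-- documents the actual onboarding configuration.
ALTER TABLE tracking.event SET TBLPROPERTIES (
  'dataset.descriptor' = '{"identifierField": "metadata.guestid"}'
);
```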
The purger code is mostly in the `gobblin.compliance.purger` package.
The elements of the purger are:
The Gobblin constructs that make up the Purger are:

* `HivePurgerSource` generates a WorkUnit per partition that needs to be purged
* `HivePurgerExtractor` instantiates a `PurgeableHivePartitionDataset` object that encapsulates all the information required to purge the partition
* `HivePurgerConverter` populates the purge queries into the `PurgeableHivePartitionDataset` object
* `HivePurgerWriter` executes the purge queries
* `HivePurgerPublisher` moves successful WorkUnits to the `COMMITTED` state

The purging process operates as follows:

* A staging table is created using the `LIKE` construct on the current table that is being purged
* A `LEFT OUTER JOIN` of the original table against the table containing the ids whose data is to be purged `INSERT OVERWRITE`s the retained rows into the staging table, and thereby into its location. Once this query returns, that location contains the purged data
* Before `ALTER`ing the original partition location to the new staging table location, we preserve the current/original location of the partition by creating a backup table pointing to it. We do not move the partition immediately, to avoid breaking any in-flight queries
* `ALTER` the partition location to the location containing the purged data
* `DROP` the staging table; this only drops the metadata and not the data

Taking as an example a `tracking.event` table and the `datepartition=2017-02-16-00/is_guest=0` partition, the purge process would be the following (the steps are consolidated into a HiveQL sketch after the list):

* The `tracking.event` table is located at `/user/tracking/event/`
* Per Hive, the partition is `tracking@event@datepartition=2017-02-16-00/is_guest=0`, and let's assume its data is located at `/user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0/`
* A staging table `tracking.event_staging_1234567890123` (`1234567890123` is the example timestamp we will use for clarity; a real timestamp looks more like `1487154972824`) is created `LIKE tracking.event` with the location `/user/tracking/event/1234567890123/datepartition=2017-02-16-00/is_guest=0/`. This would be within the original table location
* The purge query is executed: `INSERT OVERWRITE TABLE tracking.event_staging_1234567890123 PARTITION (datepartition='2017-02-16-00', is_guest='0') SELECT /*+MAPJOIN(b) */ a.metadata.guestid, a.col_a, a.col_b FROM tracking.event a LEFT JOIN u_purger.guestids b ON a.metadata.guestid=b.guestid WHERE b.guestid IS NULL AND a.datepartition='2017-02-16-00' AND a.is_guest='0'`
* A backup table `tracking.event_backup_1234567890123` is created with `PARTITION datepartition=2017-02-16-00,is_guest=0` pointing to the original location `/user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0`
* The location of partition `tracking@event@datepartition=2017-02-16-00/is_guest=0` is updated to `/user/tracking/event/1234567890123/datepartition=2017-02-16-00/is_guest=0`
* The staging table `tracking.event_staging_1234567890123` is dropped
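
Putting the walkthrough together, the sequence below sketches the corresponding HiveQL. It is illustrative rather than the purger's literal output: the purger generates its queries programmatically, and the exact `CREATE`/`ALTER` forms shown here (table-level `LOCATION`, `ADD PARTITION ... LOCATION`, `SET LOCATION`) are one plausible rendering of the steps above.

```sql
-- Illustrative sketch of the purge sequence for the example above; the
-- purger generates these statements programmatically, so exact forms differ.

-- 1. Staging table, created LIKE the original, under the timestamped location
CREATE TABLE tracking.event_staging_1234567890123
  LIKE tracking.event
  LOCATION '/user/tracking/event/1234567890123';

-- 2. Write the rows that survive the purge into the staging partition
INSERT OVERWRITE TABLE tracking.event_staging_1234567890123
  PARTITION (datepartition='2017-02-16-00', is_guest='0')
SELECT /*+MAPJOIN(b) */ a.metadata.guestid, a.col_a, a.col_b
FROM tracking.event a
LEFT JOIN u_purger.guestids b ON a.metadata.guestid = b.guestid
WHERE b.guestid IS NULL
  AND a.datepartition = '2017-02-16-00'
  AND a.is_guest = '0';

-- 3. Backup table pointing at the original (pre-purge) partition data
CREATE TABLE tracking.event_backup_1234567890123 LIKE tracking.event;
ALTER TABLE tracking.event_backup_1234567890123
  ADD PARTITION (datepartition='2017-02-16-00', is_guest='0')
  LOCATION '/user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0';

-- 4. Repoint the live partition at the purged data
ALTER TABLE tracking.event
  PARTITION (datepartition='2017-02-16-00', is_guest='0')
  SET LOCATION '/user/tracking/event/1234567890123/datepartition=2017-02-16-00/is_guest=0';

-- 5. Drop the staging table; this removes only the metadata, not the data
DROP TABLE tracking.event_staging_1234567890123;
```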

The retention code is mostly in the `gobblin.compliance.retention` package.
The retention process builds on top of Gobblin Retention and performs the following operations:
The restore code is mostly in the `gobblin.compliance.restore` package.
The restore process allows a dataset to be restored to a backup dataset if required.
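
Mechanically, a restore follows from the purge design above: the backup table still points at the pre-purge partition location, so restoring amounts to repointing the live partition at that location. The statement below is a sketch of that idea using the earlier example's names, not the module's literal restore query.

```sql
-- A minimal sketch, reusing the example above: point the live partition
-- back at the pre-purge data preserved by the backup table. The actual
-- restore queries are generated by the gobblin.compliance.restore code.
ALTER TABLE tracking.event
  PARTITION (datepartition='2017-02-16-00', is_guest='0')
  SET LOCATION '/user/tracking/event/original/datepartition=2017-02-16-00/is_guest=0';
```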