---+ HDFS Snapshot based Mirroring
---++ Overview
HDFS snapshots are very cost effective to create (the cost is O(1), excluding inode lookup time). Once a snapshot is
created, it is very efficient to find the modifications made relative to it and copy over only those modifications for
disaster recovery (DR). This makes for cost effective HDFS mirroring.
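As an illustration of the underlying mechanism, here is roughly what a snapshot-diff based copy looks like with plain HDFS and DistCp commands. The paths, snapshot names, and NameNode addresses below are hypothetical; the extension automates these steps.
<verbatim>
# Take a snapshot, let the data change, then take another snapshot
hdfs dfs -createSnapshot /data/source s1
hdfs dfs -createSnapshot /data/source s2

# List only the changes between the two snapshots
hdfs snapshotDiff /data/source s1 s2

# Copy just the delta (Hadoop 2.7.0+; the target must already hold
# snapshot s1 and must not have been modified since)
hadoop distcp -update -diff s1 s2 hdfs://source-nn:8020/data/source hdfs://target-nn:8020/data/target
</verbatim>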
---++ Prerequisites
The following are the prerequisites for using HDFS snapshot based mirroring:
* Hadoop version 2.7.0 or higher.
* The user submitting and scheduling the Falcon snapshot based mirroring job should have permission to create and manage snapshots on both the source and target directories.
---++ Use Case
Create and manage snapshots on the source and target directories, mirror data from source to target for disaster
recovery using these snapshots, and perform retention on the snapshots created on both clusters.
---++ Usage
---+++ Setup
* Submit the source cluster and target cluster entities to Falcon.
<verbatim>
$FALCON_HOME/bin/falcon entity -submit -type cluster -file source-cluster-definition.xml
$FALCON_HOME/bin/falcon entity -submit -type cluster -file target-cluster-definition.xml
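
# Optionally, verify that both cluster entities were accepted
# (uses the standard entity listing option of the Falcon CLI)
$FALCON_HOME/bin/falcon entity -list -type cluster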
</verbatim>
* Ensure that the source directory on the source cluster and the target directory on the target cluster exist.
* Ensure that these directories are snapshottable by the user submitting the extension, as shown in the example below. You can find more [[https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html][information on snapshots here]].
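For example, an HDFS administrator can make the directories snapshottable with the standard HDFS commands (the directory path below is illustrative):
<verbatim>
# Run as an HDFS administrator on each cluster
hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/source

# Verify: lists the snapshottable directories visible to the current user
hdfs lsSnapshottableDir
</verbatim>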
---+++ HDFS Snapshot based mirroring extension properties
Extension artifacts are expected to be installed on HDFS at the path specified by "extension.store.uri" in the startup properties.
The hdfs-snapshot-mirroring-properties.json file, located at "<extension.store.uri>/hdfs-snapshot-mirroring/META/hdfs-snapshot-mirroring-properties.json",
lists all the required and optional parameters/arguments for scheduling the mirroring job.
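For example, the file can be viewed directly from HDFS (substitute your configured extension.store.uri):
<verbatim>
hdfs dfs -cat <extension.store.uri>/hdfs-snapshot-mirroring/META/hdfs-snapshot-mirroring-properties.json
</verbatim>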
Here is a sample set of properties:
<verbatim>
## Job Properties
jobName=hdfs-snapshot-test
jobClusterName=backupCluster
jobValidityStart=2016-01-01T00:00Z
jobValidityEnd=2016-04-01T00:00Z
jobFrequency=hours(12)
jobTimezone=UTC
jobTags=consumer=consumer@xyz.com
jobRetryPolicy=periodic
jobRetryDelay=minutes(30)
jobRetryAttempts=3
## Job owner
jobAclOwner=ambari-qa
jobAclGroup=users
jobAclPermission=*
## Source information
sourceCluster=primaryCluster
sourceSnapshotDir=/apps/falcon/snapshots/source/
sourceSnapshotRetentionPolicy=delete
sourceSnapshotRetentionAgeLimit=days(15)
sourceSnapshotRetentionNumber=10
## Target information
targetCluster=backupCluster
targetSnapshotDir=/apps/falcon/snapshots/target/
targetSnapshotRetentionPolicy=delete
targetSnapshotRetentionAgeLimit=months(6)
targetSnapshotRetentionNumber=20
## Distcp properties
distcpMaxMaps=1
distcpMapBandwidth=100
tdeEncryptionEnabled=false
</verbatim>
With the above properties, the Falcon HDFS snapshot based mirroring extension does the following every 12 hours:
* Create a snapshot of the directory /apps/falcon/snapshots/source/ on primaryCluster.
* DistCp the data from /apps/falcon/snapshots/source/ on primaryCluster to /apps/falcon/snapshots/target/ on backupCluster.
* Create a snapshot of the directory /apps/falcon/snapshots/target/ on backupCluster.
* Perform the retention job on source and target, as sketched below:
   * Maintain at least the configured number of latest snapshots (sourceSnapshotRetentionNumber / targetSnapshotRetentionNumber) and delete all other snapshots older than the specified age limit.
   * Currently, "delete" is the only supported snapshot retention policy.
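Conceptually, the retention step corresponds to the following manual HDFS operations (the snapshot name is hypothetical; the extension performs this automatically):
<verbatim>
# Snapshots are visible under the .snapshot subdirectory
hdfs dfs -ls /apps/falcon/snapshots/source/.snapshot

# Delete a snapshot that is older than the age limit and not
# among the N most recent ones
hdfs dfs -deleteSnapshot /apps/falcon/snapshots/source snapshot-name
</verbatim>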
*Note:*
When TDE encryption is enabled on the source/target directories, DistCp ignores the snapshots and performs a regular
replication. While the user does not get the performance benefit of snapshot based DistCp, the extension is still useful
for creating and maintaining snapshots.
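To check whether a source or target directory falls inside a TDE encryption zone, an HDFS administrator can list the zones:
<verbatim>
hdfs crypto -listZones
</verbatim>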
---+++ Submit and schedule HDFS snapshot mirroring extension
Users can submit the extension using the CLI or the REST API. The CLI command looks as follows:
<verbatim>
$FALCON_HOME/bin/falcon extension -submitAndSchedule -extensionName hdfs-snapshot-mirroring -file properties-file.txt
</verbatim>
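Once scheduled, the job and its instances can be inspected from the CLI; for example (assuming the standard extension options of the Falcon CLI and the jobName from the sample properties above):
<verbatim>
$FALCON_HOME/bin/falcon extension -list -extensionName hdfs-snapshot-mirroring
$FALCON_HOME/bin/falcon extension -instances -jobName hdfs-snapshot-test
</verbatim>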
Please refer to [[falconcli/FalconCLI][Falcon CLI]] and [[restapi/ResourceList][REST API]] for more details on the usage of the CLI and REST APIs.