<!DOCTYPE html>
<!--
| Generated by Apache Maven Doxia at 2018-03-12
| Rendered using Apache Maven Fluido Skin 1.3.0
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="Date-Revision-yyyymmdd" content="20180312" />
<meta http-equiv="Content-Language" content="en" />
<title>Falcon - Data Replication between On-premise Hadoop Clusters and Azure Cloud</title>
<link rel="stylesheet" href="./css/apache-maven-fluido-1.3.0.min.css" />
<link rel="stylesheet" href="./css/site.css" />
<link rel="stylesheet" href="./css/print.css" media="print" />
<script type="text/javascript" src="./js/apache-maven-fluido-1.3.0.min.js"></script>
<script type="text/javascript">$( document ).ready( function() { $( '.carousel' ).carousel( { interval: 3500 } ) } );</script>
</head>
<body class="topBarDisabled">
<div class="container">
<div id="banner">
<div class="pull-left">
<div id="bannerLeft">
<img src="images/falcon-logo.png" alt="Apache Falcon" width="200px" height="45px"/>
</div>
</div>
<div class="pull-right"> </div>
<div class="clear"><hr/></div>
</div>
<div id="breadcrumbs">
<ul class="breadcrumb">
<li class="">
<a href="index.html" title="Falcon">
Falcon</a>
</li>
<li class="divider ">/</li>
<li class="">Data Replication between On-premise Hadoop Clusters and Azure Cloud</li>
<li id="publishDate" class="pull-right">Last Published: 2018-03-12</li> <li class="divider pull-right">|</li>
<li id="projectVersion" class="pull-right">Version: 0.11</li>
</ul>
</div>
<div id="bodyColumn" >
<div class="section">
<h2>Data Replication between On-premise Hadoop Clusters and Azure Cloud<a name="Data_Replication_between_On-premise_Hadoop_Clusters_and_Azure_Cloud"></a></h2></div>
<div class="section">
<h3>Overview<a name="Overview"></a></h3>
<p>Falcon provides an easy way to replicate data between on-premise Hadoop clusters and Azure cloud. With this feature, users can build a hybrid data pipeline, e.g. processing sensitive data on-premises for privacy and compliance reasons while leveraging the cloud for elastic scale and online services (e.g. Azure machine learning) with non-sensitive data.</p></div>
<div class="section">
<h3>Use Case<a name="Use_Case"></a></h3>
<ol>
<li>Copy data from on-premise Hadoop clusters to Azure cloud.</li>
<li>Copy data from Azure cloud to on-premise Hadoop clusters.</li>
<li>Copy data within Azure cloud (i.e. from one Azure location to another).</li>
</ol></div>
<div class="section">
<h3>Usage<a name="Usage"></a></h3></div>
<div class="section">
<h4>Set Up Azure Blob Credentials<a name="Set_Up_Azure_Blob_Credentials"></a></h4>
<p>To move data to or from Azure blobs, the Azure blob credentials must be added to the HDFS configuration. One way is to add the credential property through the Ambari HDFS configs and then restart HDFS. Alternatively, you can add the credential property to core-site.xml directly, but in that case restart HDFS from the command line rather than from Ambari. Otherwise, Ambari will apply its previous HDFS configuration, which does not contain your Azure blob credentials.</p>
<div class="source">
<pre>
&lt;property&gt;
    &lt;name&gt;fs.azure.account.key.{AZURE_BLOB_ACCOUNT_NAME}.blob.core.windows.net&lt;/name&gt;
    &lt;value&gt;{AZURE_BLOB_ACCOUNT_KEY}&lt;/value&gt;
&lt;/property&gt;
</pre></div>
<p>To verify that the Azure credentials are set up properly, check that you can access the Azure blob through HDFS, e.g.</p>
<div class="source">
<pre>
hadoop fs -ls wasb://{AZURE_BLOB_CONTAINER}@{AZURE_BLOB_ACCOUNT_NAME}.blob.core.windows.net/
</pre></div></div>
<div class="section">
<h4>Replication Feed<a name="Replication_Feed"></a></h4>
<p>A <a href="./EntitySpecification.html">Falcon replication feed</a> can be used for data replication to or from Azure cloud. You can specify a WASB (Windows Azure Storage Blob) URL in the source or target locations. The example below replicates data from a Hadoop cluster to an Azure blob. Note that the source and target clusters must be different. Analogously, to copy data from an Azure blob, add the Azure blob location to the source.</p>
<div class="source">
<pre>
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;feed name=&quot;AzureReplication&quot; xmlns=&quot;uri:falcon:feed:0.1&quot;&gt;
    &lt;frequency&gt;months(1)&lt;/frequency&gt;
    &lt;clusters&gt;
        &lt;cluster name=&quot;SampleCluster1&quot; type=&quot;source&quot;&gt;
            &lt;validity start=&quot;2010-06-01T00:00Z&quot; end=&quot;2010-06-02T00:00Z&quot;/&gt;
            &lt;retention limit=&quot;days(90)&quot; action=&quot;delete&quot;/&gt;
        &lt;/cluster&gt;
        &lt;cluster name=&quot;SampleCluster2&quot; type=&quot;target&quot;&gt;
            &lt;validity start=&quot;2010-06-01T00:00Z&quot; end=&quot;2010-06-02T00:00Z&quot;/&gt;
            &lt;retention limit=&quot;days(90)&quot; action=&quot;delete&quot;/&gt;
            &lt;locations&gt;
                &lt;location type=&quot;data&quot; path=&quot;wasb://replication-test@mystorage.blob.core.windows.net/replicated-${YEAR}-${MONTH}&quot;/&gt;
            &lt;/locations&gt;
        &lt;/cluster&gt;
    &lt;/clusters&gt;
    &lt;locations&gt;
        &lt;location type=&quot;data&quot; path=&quot;/apps/falcon/demo/data-${YEAR}-${MONTH}&quot; /&gt;
    &lt;/locations&gt;
    &lt;ACL owner=&quot;ambari-qa&quot; group=&quot;users&quot; permission=&quot;0755&quot;/&gt;
    &lt;schema location=&quot;hcat&quot; provider=&quot;hcat&quot;/&gt;
&lt;/feed&gt;
</pre></div></div>
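<div class="section">
<p>Replication from an Azure blob to a Hadoop cluster can be sketched by mirroring the feed above. The following is a minimal, untested sketch rather than a verified configuration. The feed name AzureToHadoopReplication is hypothetical, and the cluster names, storage account, container, paths, validity and retention values are placeholders reused from the example above. It assumes that a location declared inside a cluster element overrides the feed-level location, as the example above does for its target. Here the WASB location is attached to the source cluster, and the target cluster falls back to the feed-level HDFS path.</p>
<div class="source">
<pre>
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;!-- Hypothetical feed name; all other values are placeholders taken from the example above. --&gt;
&lt;feed name=&quot;AzureToHadoopReplication&quot; xmlns=&quot;uri:falcon:feed:0.1&quot;&gt;
    &lt;frequency&gt;months(1)&lt;/frequency&gt;
    &lt;clusters&gt;
        &lt;!-- Source: data is read from the Azure blob location declared on this cluster. --&gt;
        &lt;cluster name=&quot;SampleCluster1&quot; type=&quot;source&quot;&gt;
            &lt;validity start=&quot;2010-06-01T00:00Z&quot; end=&quot;2010-06-02T00:00Z&quot;/&gt;
            &lt;retention limit=&quot;days(90)&quot; action=&quot;delete&quot;/&gt;
            &lt;locations&gt;
                &lt;location type=&quot;data&quot; path=&quot;wasb://replication-test@mystorage.blob.core.windows.net/replicated-${YEAR}-${MONTH}&quot;/&gt;
            &lt;/locations&gt;
        &lt;/cluster&gt;
        &lt;!-- Target: no cluster-level location, so the feed-level HDFS path below is assumed to apply. --&gt;
        &lt;cluster name=&quot;SampleCluster2&quot; type=&quot;target&quot;&gt;
            &lt;validity start=&quot;2010-06-01T00:00Z&quot; end=&quot;2010-06-02T00:00Z&quot;/&gt;
            &lt;retention limit=&quot;days(90)&quot; action=&quot;delete&quot;/&gt;
        &lt;/cluster&gt;
    &lt;/clusters&gt;
    &lt;locations&gt;
        &lt;location type=&quot;data&quot; path=&quot;/apps/falcon/demo/data-${YEAR}-${MONTH}&quot; /&gt;
    &lt;/locations&gt;
    &lt;ACL owner=&quot;ambari-qa&quot; group=&quot;users&quot; permission=&quot;0755&quot;/&gt;
    &lt;schema location=&quot;hcat&quot; provider=&quot;hcat&quot;/&gt;
&lt;/feed&gt;
</pre></div>
<p>Because the retention element now sits on the cluster that holds the Azure location, review the retention limit and action before using a feed like this, so that data in the blob container is not removed unintentionally.</p></div>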
</div>
</div>
<hr/>
<footer>
<div class="container">
<div class="row span12">Copyright &copy; 2013-2018
<a href="http://www.apache.org">Apache Software Foundation</a>.
All Rights Reserved.
</div>
<p id="poweredBy" class="pull-right">
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img class="builtBy" alt="Built by Maven" src="./images/logos/maven-feather.png" />
</a>
</p>
</div>
</footer>
</body>
</html>