<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
<header>
<title>Hadoop Archive Guide</title>
</header>
<body>
<section>
<title> What are Hadoop archives?</title>
    <p> Hadoop archives are special format archives. The main use case for
    archives is to reduce the namespace load on the NameNode. A Hadoop
    Archive collapses a set of files into a smaller number of files and
    provides an efficient and easy interface to access the
    collapsed files. A Hadoop archive maps to an HDFS directory and
    always has a *.har extension.</p>
</section>
<section>
<title> How to create an archive?</title>
<p>
<code>Usage: hadoop archive -archiveName name -p &lt;parent&gt; &lt;src&gt;* &lt;dest&gt;</code>
</p>
        <p>
        -archiveName is the name of the archive you would like to create,
        for example foo.har. The name should have a *.har extension.
        The -p argument specifies the parent path relative to which the
        source files are archived. For example:
        </p><p><code> -p /foo/bar a/b/c e/f/g </code></p><p>
        Here /foo/bar is the parent path, and a/b/c and e/f/g are paths
        relative to that parent.
        Note that the archive is created by a Map/Reduce job, so you need a
        Map/Reduce cluster to run the command. For a detailed example, see
        the later sections. </p>
        <p> If you want to archive a single directory /foo/bar, then you
        can just use </p>
<p><code> hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
</code></p>
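        <p> Putting the pieces together, a complete command for the -p
        example above might look like the following (the output directory
        /outputdir is chosen purely for illustration): </p>
        <p><code> hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /outputdir </code></p>
        <p> This archives /foo/bar/a/b/c and /foo/bar/e/f/g into
        /outputdir/foo.har. </p>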
</section>
<section>
<title> How to look up files in archives? </title>
        <p>
        An archive exposes itself as a file system layer, so all of the fs
        shell commands work on archives, but with a different URI. Also note
        that archives are immutable, so renames, deletes, and creates return
        an error. The URI for Hadoop Archives is
        </p><p><code>har://scheme-hostname:port/archivepath/fileinarchive</code>
        </p><p>
        If no scheme is provided, the underlying filesystem is assumed.
        In that case the URI would look like </p>
<p><code>har:///archivepath/fileinarchive</code></p>
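        <p> For example, assuming an archive at /user/zoo/foo.har that
        contains a file dir1/somefile.txt (hypothetical paths), you could
        read that file through the har scheme with: </p>
        <p><code>hadoop dfs -cat har:///user/zoo/foo.har/dir1/somefile.txt</code></p>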
</section>
<section>
<title> Example on creating and looking up archives </title>
<p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2
/user/zoo </code></p>
        <p>
        The above example creates an archive using /user/hadoop as the
        parent (relative) directory.
        The directories /user/hadoop/dir1 and /user/hadoop/dir2 are
        archived into the file system directory /user/zoo/foo.har.
        Archiving does not delete the input files. If you want to delete
        the input files after creating the archive (to reduce namespace),
        you have to do it yourself.
        </p>
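        <p> If you do want to reclaim the namespace afterwards, you could
        remove the originals once you have verified the archive, for example
        with the paths from the example above: </p>
        <p><code>hadoop dfs -rmr /user/hadoop/dir1 /user/hadoop/dir2</code></p>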
<section>
<title> Looking up files and understanding the -p option </title>
        <p> Looking up files in hadoop archives is as easy as doing an ls on the
        filesystem. After you have archived the directories /user/hadoop/dir1 and
        /user/hadoop/dir2 as in the example above, to see all the files in the
        archive you can just run: </p>
<p><code>hadoop dfs -lsr har:///user/zoo/foo.har/</code></p>
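        <p> Assuming, for illustration, that dir1 contains a file called
        file1 and dir2 contains a file called file2, the recursive listing
        would show entries such as: </p>
        <source>
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir1/file1
har:///user/zoo/foo.har/dir2
har:///user/zoo/foo.har/dir2/file2
        </source>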
        <p> To understand the significance of the -p argument, let's go through
        the above example again. If you just do
        an ls (not lsr) on the hadoop archive using </p>
<p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p>
<p>The output should be:</p>
<source>
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
</source>
        <p> As you may recall, the archive was created with the following command </p>
<p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2
/user/zoo </code></p>
<p> If we were to change the command to: </p>
        <p><code>hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2
        /user/zoo </code></p>
        <p> then an ls on the hadoop archive using </p>
<p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p>
<p>would give you</p>
<source>
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
</source>
<p>
Notice that the archived files have been archived relative to
/user/ rather than /user/hadoop.
</p>
</section>
</section>
<section>
<title> Using Hadoop Archive with Map Reduce </title>
        <p>Using a Hadoop Archive in Map/Reduce is as easy as specifying a
        different input filesystem than the default file system.
        If you have a hadoop archive stored in HDFS at /user/zoo/foo.har,
        then to use this archive for Map/Reduce input, all
        you need to do is specify the input directory as har:///user/zoo/foo.har.
        Since a Hadoop Archive is exposed as a file system, Map/Reduce is able
        to use all the logical input files in the archive as input.</p>
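        <p> For example, a job could be launched against the archive just as
        it would be against any other input directory; in the sketch below
        the jar name, main class, and output path are placeholders: </p>
        <p><code>hadoop jar my-job.jar MyJob har:///user/zoo/foo.har /user/zoo/output</code></p>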
</section>
<section>
<title> File Replication and Permissions of Hadoop Archive </title>
        <p>
        Hadoop Archive currently does not preserve the metadata (such as
        permissions and replication) that the files had before they were
        archived. The files in a Hadoop Archive are created with the default
        permissions that the user creates files with. The replication of data
        files in a Hadoop Archive is set to 3, and the metadata (index) files
        have a replication factor of 5. You can increase these by increasing
        the replication factor of the files under the har file directory.
        When restoring hadoop archive files using something like:
        </p>
<p><code>
        hadoop distcp har:///path_to_har dest_path
</code></p>
<p>
the restored files will not have the permissions/replication of the original
files that were archived.
</p>
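        <p> To increase the replication mentioned above, you can set it
        recursively on the archive directory itself; for example (the path
        and replication factor are illustrative): </p>
        <p><code>hadoop dfs -setrep -R 5 /user/zoo/foo.har</code></p>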
</section>
<section>
<title> Creating different block size and part size Hadoop Archive </title>
        <p>
        You can specify a different block size and part file size for the
        archive using the following options:
</p> <p><code>
bin/hadoop archive -Dhar.block.size=512 -Dhar.partfile.size=1024 -archiveName ...
</code></p>
        <p>
        The above example sets a block size of 512 bytes for the part files
        and a part file size of 1 KB. These numbers are only examples; using
        such low values for the block size and part file size is not advisable.
        </p>
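        <p> A more realistic invocation, with a 512 MB block size and a 1 GB
        part file size (values and paths chosen purely for illustration),
        might be: </p>
        <p><code>
        bin/hadoop archive -Dhar.block.size=536870912 -Dhar.partfile.size=1073741824 -archiveName foo.har -p /user/hadoop dir1 /user/zoo
        </code></p>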
</section>
<section>
<title> Limitations of Hadoop Archive </title>
        <p>
        Currently, Hadoop archives do not support input paths that contain
        spaces; an exception is thrown in such cases. You can, however, create
        archives in which spaces in file names are replaced by a valid character.
        Below is an example:
        </p>
<p>
<code>
bin/hadoop archive -Dhar.space.replacement.enable=true -Dhar.space.replacement="_"
-archiveName ......
</code>
</p>
        <p> The above example replaces spaces with "_" in the archived file names.
        </p>
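        <p> A complete command using these options might look like the
        following; the paths are illustrative, and note the quoting of the
        source path that contains a space: </p>
        <p><code>
        bin/hadoop archive -Dhar.space.replacement.enable=true -Dhar.space.replacement="_" -archiveName foo.har -p /user/hadoop "dir 1" /user/zoo
        </code></p>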
</section>
<section>
<title>Internals of Hadoop Archive </title>
<p>
        A Hadoop Archive directory contains metadata files (_index and
        _masterindex) and data files (part-*). The _index file contains
        the names of the files that are part of the archive and their
        locations within the part files. The _masterindex file stores offsets
        into the _index file so that lookups in the _index file are faster.
        </p>
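        <p> Listing an archive's directory on the underlying file system
        (rather than through the har scheme), for example with
        <code>hadoop dfs -ls /user/zoo/foo.har</code>, shows this layout.
        You would see something like the following (the part file name is
        illustrative): </p>
        <source>
/user/zoo/foo.har/_index
/user/zoo/foo.har/_masterindex
/user/zoo/foo.har/part-0
        </source>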
</section>
</body>
</document>