<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
#set ( $H3 = '###' )
Hadoop Archives Guide
=====================
- [Overview](#Overview)
- [How to Create an Archive](#How_to_Create_an_Archive)
- [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
- [How to Unarchive an Archive](#How_to_Unarchive_an_Archive)
- [Archives Examples](#Archives_Examples)
- [Creating an Archive](#Creating_an_Archive)
- [Looking Up Files](#Looking_Up_Files)
- [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
Overview
--------
Hadoop archives are special format archives. A Hadoop archive maps to a file
system directory. A Hadoop archive always has a \*.har extension. A Hadoop
archive directory contains metadata (in the form of _index and _masterindex)
and data (part-\*) files. The _index file contains the names of the files that
are part of the archive and their locations within the part files.
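As an illustrative sketch (the archive path /user/zoo/foo.har below is an assumption, and the number of part files varies with the amount of data archived), listing the har directory on the underlying filesystem shows these metadata and data files directly:
```
# List the archive directory on the underlying filesystem (not through the har:// scheme).
hdfs dfs -ls /user/zoo/foo.har
# Typical contents: _index, _masterindex and one or more part-* data files.
```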
How to Create an Archive
------------------------
`Usage: hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>`
-archiveName is the name of the archive you would like to create. An example
would be foo.har. The name should have a \*.har extension. The parent argument
specifies the path relative to which the files should be archived.
For example:
`-p /foo/bar a/b/c e/f/g`
Here /foo/bar is the parent path and a/b/c, e/f/g are paths relative to the
parent. Note that the archive is created by a MapReduce job, so you need a
MapReduce cluster to run this command. For a detailed example, see the later
sections.
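Putting this together, a complete command for the example above might look like the following (the destination directory /user/outputdir is an assumption):
```
# Archive /foo/bar/a/b/c and /foo/bar/e/f/g into /user/outputdir/foo.har.
# a/b/c and e/f/g are resolved relative to the parent path /foo/bar.
hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /user/outputdir
```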
-r indicates the desired replication factor; if this optional argument is
not specified, a replication factor of 3 will be used.
If you just want to archive a single directory /foo/bar then you can just use
`hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir`
If you specify source files that are in an encryption zone, they will be
decrypted and written into the archive. If the har file is not located in an
encryption zone, they will be stored in clear (decrypted) form. If the
har file is located in an encryption zone, they will be stored in encrypted form.
How to Look Up Files in Archives
--------------------------------
The archive exposes itself as a file system layer, so all the fs shell
commands work on archives, just with a different URI. Also note that
archives are immutable, so renames, deletes and creates return an error.
URI for Hadoop Archives is
`har://scheme-hostname:port/archivepath/fileinarchive`
If no scheme is provided, the underlying filesystem is assumed. In that case
the URI would look like
`har:///archivepath/fileinarchive`
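For instance, assuming an archive at /user/zoo/foo.har that contains a file dir1/file1 (both hypothetical names), files can be listed and read through the har scheme:
```
# List a directory inside the archive, letting the underlying filesystem be assumed.
hdfs dfs -ls har:///user/zoo/foo.har/dir1
# Read a single file from the archive.
hdfs dfs -cat har:///user/zoo/foo.har/dir1/file1
# The same lookup with the underlying scheme, hostname and port spelled out
# (hdfs-namenode:8020 is only a placeholder for your NameNode address).
hdfs dfs -ls har://hdfs-namenode:8020/user/zoo/foo.har/dir1
```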
How to Unarchive an Archive
---------------------------
Since all the fs shell commands in the archives work transparently,
unarchiving is just a matter of copying.
To unarchive sequentially:
`hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`
To unarchive in parallel, use DistCp:
`hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`
Archives Examples
-----------------
$H3 Creating an Archive
`hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/zoo`
The above example creates an archive using /user/hadoop as the relative
archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
will be archived in the following file system directory -- /user/zoo/foo.har.
Archiving does not delete the input files. If you want to delete the input
files after creating the archives (to reduce namespace), you will have to do
it on your own. In this example, because `-r 3` is specified, a replication
factor of 3 will be used.
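If you do want to reclaim that namespace, one cautious sketch (using the paths from the example above) is to verify the archived copies first and only then remove the originals yourself:
```
# Verify that the archived copies are readable before deleting anything.
hdfs dfs -ls -R har:///user/zoo/foo.har/
# Then remove the original input directories; this manual step is irreversible.
hdfs dfs -rm -r /user/hadoop/dir1 /user/hadoop/dir2
```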
$H3 Looking Up Files
Looking up files in hadoop archives is as easy as doing an ls on the
filesystem. After you have archived the directories /user/hadoop/dir1 and
/user/hadoop/dir2 as in the example above, to see all the files in the
archives you can just run:
`hdfs dfs -ls -R har:///user/zoo/foo.har/`
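Assuming dir1 and dir2 each contain a single file (file1 and file2 are hypothetical names), the recursive listing would show entries along these lines:
```
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir1/file1
har:///user/zoo/foo.har/dir2
har:///user/zoo/foo.har/dir2/file2
```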
To understand the significance of the -p argument, let's go through the above
example again. If you just do an ls (not a recursive ls -R) on the hadoop archive using
`hdfs dfs -ls har:///user/zoo/foo.har`
The output should be:
```
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
```
As you can recall, the archives were created with the following command:
`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
If we were to change the command to:
`hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
then an ls on the hadoop archive using
`hdfs dfs -ls har:///user/zoo/foo.har`
would give you
```
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
```
Notice that the archived files have been archived relative to /user/ rather
than /user/hadoop.
Hadoop Archives and MapReduce
-----------------------------
Using Hadoop Archives in MapReduce is as easy as specifying a different input
file system than the default file system. If you have a hadoop archive stored
in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input,
all you need to do is specify the input directory as har:///user/zoo/foo.har. Since
Hadoop Archives are exposed as a file system, MapReduce will be able to use all
the logical input files in Hadoop Archives as input.
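For example, a sketch of running the bundled wordcount example over the archive (the examples jar path and the output directory /user/zoo/wordcount-out are assumptions that depend on your installation):
```
# Use a directory inside the archive as MapReduce input; the output goes to a regular HDFS path.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wordcount-out
```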