| <?xml version="1.0"?> |
| <!-- |
| Copyright 2002-2004 The Apache Software Foundation |
| |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| <document> |
| <header> |
| <title>Hadoop Archives Guide</title> |
| </header> |
| <body> |
| <section> |
| <title>Overview</title> |
| <p> |
        Hadoop archives are special format archives that map to a file system
        directory. A Hadoop archive always has a *.har extension. A Hadoop
        archive directory contains metadata (in the form of _index and
        _masterindex files) and data (part-*) files. The _index file contains
        the names of the files that are part of the archive and their locations
        within the part files.
| </p> |
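        <p>
        For example, listing a hypothetical archive /user/zoo/foo.har directly on the
        underlying file system (not through the har scheme) shows these internal files;
        the exact listing format depends on your Hadoop version:
        </p>
<source>
# direct listing of the archive directory (illustrative; output details vary by version)
hadoop dfs -ls /user/zoo/foo.har
# typically shows entries such as:
#   /user/zoo/foo.har/_index
#   /user/zoo/foo.har/_masterindex
#   /user/zoo/foo.har/part-0
</source>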
| </section> |
| |
| <section> |
| <title>How to Create an Archive</title> |
| <p> |
        <code>Usage: hadoop archive -archiveName name -p &lt;parent&gt; &lt;src&gt;* &lt;dest&gt;</code>
| </p> |
| <p> |
        -archiveName is the name of the archive you would like to create,
        for example foo.har. The name should have a *.har extension.
        The parent argument specifies the relative path to which the files
        should be archived. For example:
        </p><p><code> -p /foo/bar a/b/c e/f/g </code></p><p>
        Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative
        to the parent. Note that it is a Map/Reduce job that creates the archives,
        so you need a Map/Reduce cluster to run it. See the later sections for a
        detailed example, and the sketch just below for how a full command might look. </p>
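        <p> Putting the pieces together, the full command for the example above could look
        like the following; /outputdir is simply the destination directory in which foo.har
        will be created: </p>
        <p><code> hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /outputdir </code></p>
        <p> This archives /foo/bar/a/b/c and /foo/bar/e/f/g into /outputdir/foo.har. </p>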
        <p> If you just want to archive a single directory /foo/bar, you can simply use </p>
| <p><code> hadoop archive -archiveName zoo.har -p /foo/bar /outputdir </code></p> |
| </section> |
| |
| <section> |
| <title>How to Look Up Files in Archives</title> |
| <p> |
        The archive exposes itself as a file system layer, so all the fs shell
        commands work on archives, but with a different URI. Also, note that
        archives are immutable, so renames, deletes, and creates return
        an error. The URI for Hadoop Archives is
        </p><p><code>har://scheme-hostname:port/archivepath/fileinarchive</code></p><p>
        If no scheme is provided, the underlying file system is assumed.
        In that case the URI looks like </p>
| <p><code>har:///archivepath/fileinarchive</code></p> |
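        <p> For example, for an archive stored in HDFS, the two forms could look like the
        following sketch; the NameNode host and port are only placeholders, and filea is a
        hypothetical file inside dir1: </p>
<source>
# fully qualified, assuming the archive lives in HDFS on a NameNode at namenode:8020 (placeholders)
har://hdfs-namenode:8020/user/zoo/foo.har/dir1/filea
# the same file, relying on the default (underlying) file system
har:///user/zoo/foo.har/dir1/filea
</source>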
| </section> |
| |
| <section> |
| <title>Archives Examples</title> |
| <section> |
| <title>Creating an Archive</title> |
| |
| <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo </code></p> |
| <p> |
        The above example creates an archive using /user/hadoop as the relative archive directory.
        The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be
        archived in the file system directory /user/zoo/foo.har. Archiving does not delete the input
        files. If you want to delete the input files after creating the archive (to reduce namespace), you
        will have to do it on your own, for example as sketched below.
| </p> |
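        <p> A minimal sketch of such a clean-up, once you have verified the contents of the
        archive (this is irreversible, so check first): </p>
        <p><code>hadoop dfs -rmr /user/hadoop/dir1 /user/hadoop/dir2</code></p>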
| </section> |
| <section> |
| <title> Looking Up Files</title> |
| <p> Looking up files in hadoop archives is as easy as doing an ls on the filesystem. After you have |
| archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the example above, to see all |
| the files in the archives you can just run: </p> |
| <p><code>hadoop dfs -lsr har:///user/zoo/foo.har/</code></p> |
        <p> To understand the significance of the -p argument, let's go through the above example again. If you just do
        an ls (not lsr) on the hadoop archive using </p>
| <p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p> |
| <p>The output should be:</p> |
| <source> |
| har:///user/zoo/foo.har/dir1 |
| har:///user/zoo/foo.har/dir2 |
| </source> |
        <p> As you may recall, the archive was created with the following command: </p>
| <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo </code></p> |
| <p> If we were to change the command to: </p> |
| <p><code>hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo </code></p> |
        <p> then an ls on the hadoop archive using </p>
| <p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p> |
| <p>would give you</p> |
| <source> |
| har:///user/zoo/foo.har/hadoop/dir1 |
| har:///user/zoo/foo.har/hadoop/dir2 |
| </source> |
| <p> |
| Notice that the archived files have been archived relative to /user/ rather than /user/hadoop. |
| </p> |
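        <p> Individual files inside an archive can be read the same way. Going back to the
        first archive (created with -p /user/hadoop), and assuming dir1 contains a file named
        filea (a hypothetical name used only for illustration), you could run: </p>
<source>
# filea is only an illustrative file name inside dir1
hadoop dfs -cat har:///user/zoo/foo.har/dir1/filea
</source>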
| </section> |
| </section> |
| |
| <section> |
| <title>Hadoop Archives and MapReduce </title> |
    <p>Using Hadoop Archives in MapReduce is as easy as specifying a different input file system than the default file system.
       If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input, all
       you need to do is specify the input directory as har:///user/zoo/foo.har. Since a Hadoop Archive is exposed as a file system,
       MapReduce is able to use all the logical input files in the archive as input.</p>
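    <p>For example, the wordcount example job could be pointed at a directory inside the archive;
       the examples jar name and the output path below are illustrative and depend on your installation:</p>
<source>
# illustrative only: the examples jar name and the output path depend on your Hadoop distribution
hadoop jar hadoop-*-examples.jar wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wordcount-out
</source>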
| </section> |
| </body> |
| </document> |