| <?xml version="1.0"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| <document> |
| <header> |
| <title>Hadoop Archive Guide</title> |
| </header> |
| <body> |
| <section> |
| <title> What are Hadoop archives?</title> |
| <p> Hadoop archives are special format archives. The main use case of |
| using archives is to reduce the namespace of the NameNode. Hadoop |
| Archives collapses a set of files into a smaller number of files and |
| provides a very efficient and easy interface to access these |
| collapsed files. A Hadoop archive maps to a HDFS directory. A Hadoop |
| archive always has a *.har extension.</p> |
| </section> |
| <section> |
| <title> How to create an archive?</title> |
| <p> |
| <code>Usage: hadoop archive -archiveName name -p <parent> <src>* <dest></code> |
| </p> |
| <p> |
| -archiveName is the name of the archive you would like to create. |
| An example would be foo.har. The name should have a *.har extension. |
| The parent argument is to specify the relative path to which the files |
| should be |
| archived to. Example would be : |
| </p><p><code> -p /foo/bar a/b/c e/f/g </code></p><p> |
| Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to |
| parent. |
| Note that this is a Map/Reduce job that creates the archives. You would |
| need a map reduce cluster to run this. For a detailed example the later |
| sections. </p> |
| <p> If you just want to archive a single directory /foo/bar then you |
| can just use </p> |
| <p><code> hadoop archive -archiveName zoo.har -p /foo/bar /outputdir |
| </code></p> |
| </section> |
| |
| <section> |
| <title> How to look up files in archives? </title> |
| <p> |
| The archive exposes itself as a file system layer. So all the fs shell |
| commands in the archives work but with a different URI. Also, note that |
| archives are immutable. So, rename's, deletes and creates return |
| an error. URI for Hadoop Archives is |
| </p><p><code>har://scheme-hostname:port/archivepath/fileinarchive</code> |
| </p><p> |
| If no scheme is provided it assumes the underlying filesystem. |
| In that case the URI would look like </p> |
| <p><code>har:///archivepath/fileinarchive</code></p> |
| </section> |
| |
| <section> |
| <title> Example on creating and looking up archives </title> |
| <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 |
| /user/zoo </code></p> |
| <p> |
| The above example is creating an archive using /user/hadoop as the |
| relative archive directory. |
| The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be |
| archived in the following file system directory -- /user/zoo/foo.har. |
| Archiving does not delete the input files. If you want to delete |
| the input files after creating the archives (to reduce namespace), you |
| will have to do it on your own. |
| </p> |
| |
| <section> |
| <title> Looking up files and understanding the -p option </title> |
| <p> Looking up files in hadoop archives is as easy as doing an ls on the |
| filesystem. After you have archived the directories /user/hadoop/dir1 and |
| /user/hadoop/dir2 as in the exmaple above, to see all the files in the |
| archives you can just run: </p> |
| <p><code>hadoop dfs -lsr har:///user/zoo/foo.har/</code></p> |
| <p> To understand the significance of the -p argument, lets go through the |
| above example again. If you just do |
| an ls (not lsr) on the hadoop archive using </p> |
| <p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p> |
| <p>The output should be:</p> |
| <source> |
| har:///user/zoo/foo.har/dir1 |
| har:///user/zoo/foo.har/dir2 |
| </source> |
| <p> As you can recall the archives were created with the following command </p> |
| <p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 |
| /user/zoo </code></p> |
| <p> If we were to change the command to: </p> |
| <p><code>hadoop archive -archiveName foo.har -p /user/ |
| hadoop/dir1 hadoop/dir2 /user/zoo </code></p> |
| <p> then a ls on the hadoop archive using </p> |
| <p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p> |
| <p>would give you</p> |
| <source> |
| har:///user/zoo/foo.har/hadoop/dir1 |
| har:///user/zoo/foo.har/hadoop/dir2 |
| </source> |
| <p> |
| Notice that the archived files have been archived relative to |
| /user/ rather than /user/hadoop. |
| </p> |
| </section> |
| </section> |
| |
| <section> |
| <title> Using Hadoop Archive with Map Reduce </title> |
| <p>Using Hadoop Archive in Map Reduce is as easy as specifying a |
| different input filesystem than the default file system. |
| If you have a hadoop archive stored in HDFS in /user/zoo/foo.har |
| then for using this archive for Map Reduce input, all |
| you need to specify the input directory as har:///user/zoo/foo.har. |
| Since Hadoop Archives is exposed as a file system Map Reduce will be able |
| to use all the logical input files in Hadoop Archives as input.</p> |
| </section> |
| |
| <section> |
| <title> File Replication and Permissions of Hadoop Archive </title> |
| <p> |
| Hadoop Archive currently does not store the file information metadata |
| that the files had before they were archived. The file permissions of |
| Hadoop Archive created is the default permissions that a user creates |
| file with. The file replication of data files in Hadoop Archive is set to |
| 3 and the metadata information files have a replication factor of 5. You |
| can increase this by increasing the replication factor of files under the har |
| file directory. |
| On restoration of hadoop archive files using something like: |
| </p> |
| <p><code> |
| distcp har:///path_to_har dest_path |
| </code></p> |
| <p> |
| the restored files will not have the permissions/replication of the original |
| files that were archived. |
| </p> |
| </section> |
| |
| <section> |
| <title> Creating different block size and part size Hadoop Archive </title> |
| <p> |
| You can create different hadoop block size and part size using the following |
| options: |
| </p> <p><code> |
| bin/hadoop archive -Dhar.block.size=512 -Dhar.partfile.size=1024 -archiveName ... |
| </code></p> |
| <p> |
| The above example allows you to set a block size of 512 bytes for part files |
| and part file size of 1K. These numbers are only as examples. Using such low |
| number for block size and part file size is not advisable at all! |
| </p> |
| </section> |
| |
| <section> |
| <title> Limitations of Hadoop Archive </title> |
| <p> |
| Currently Hadoop archive do not support input paths with spaces in it. It |
| throws out an exception in such a case. You can create archives with names |
| in which space can be replaced by a valid character. |
| Below is an example: |
| </p> |
| <p> |
| <code> |
| bin/hadoop archive -Dhar.space.replacement.enable=true -Dhar.space.replacement="_" |
| -archiveName ...... |
| </code> |
| </p> |
| <p> The above example replaces space with "_" in the archived file names. |
| </p> |
| </section> |
| |
| <section> |
| <title>Internals of Hadoop Archive </title> |
| <p> |
| A Hadoop Archive directory contains metadata (in the form |
| of _index and _masterindex) and data (part-*) files. The _index file contains |
| the name of the files that are part of the archive and the location |
| within the part files. The _masterindex file stores offsets into the _index |
| file to make it easier to seek into the _index file for faster lookups. |
| </p> |
| </section> |
| |
| </body> |
| </document> |