<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
<header>
<title>Hadoop Archive Guide</title>
</header>
<body>
<section>
<title> What are Hadoop archives?</title>
    <p> Hadoop archives are special format archives. The main use case for
    archives is to reduce the namespace load on the NameNode. A Hadoop
    Archive collapses a set of files into a smaller number of files and
    provides an efficient and easy interface to access the
    collapsed files. A Hadoop archive maps to an HDFS directory and
    always has a *.har extension.</p>
</section>
<section>
<title> How to create an archive?</title>
<p>
<code>Usage: hadoop archive -archiveName name -p &lt;parent&gt; &lt;src&gt;* &lt;dest&gt;</code>
</p>
        <p>
        -archiveName is the name of the archive you would like to create,
        for example foo.har. The name should have a *.har extension.
        The -p argument specifies the parent path relative to which the
        source files are archived. For example:
        </p><p><code> -p /foo/bar a/b/c e/f/g </code></p><p>
        Here /foo/bar is the parent path, and a/b/c and e/f/g are paths
        relative to that parent.
        Note that the archive is created by a Map/Reduce job, so you need a
        Map/Reduce cluster to run the command. For a detailed example, see
        the later sections. </p>
        <p> If you want to archive a single directory /foo/bar, then you
        can just use </p>
<p><code> hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
</code></p>
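        <p> Putting the pieces together, a complete command for the -p
        example above might look like the following (the output directory
        /outputdir is chosen purely for illustration): </p>
        <p><code> hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /outputdir </code></p>
        <p> This archives /foo/bar/a/b/c and /foo/bar/e/f/g into
        /outputdir/foo.har. </p>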
</section>
<section>
<title> How to look up files in archives? </title>
        <p>
        An archive exposes itself as a file system layer, so all of the fs
        shell commands work on archives, but with a different URI. Also note
        that archives are immutable, so renames, deletes, and creates return
        an error. The URI for Hadoop Archives is
        </p><p><code>har://scheme-hostname:port/archivepath/fileinarchive</code>
        </p><p>
        If no scheme is provided, the underlying filesystem is assumed.
        In that case the URI would look like </p>
<p><code>har:///archivepath/fileinarchive</code></p>
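        <p> For example, assuming an archive at /user/zoo/foo.har that
        contains a file dir1/somefile.txt (hypothetical paths), you could
        read that file through the har scheme with: </p>
        <p><code>hadoop dfs -cat har:///user/zoo/foo.har/dir1/somefile.txt</code></p>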
</section>
<section>
<title> Example on creating and looking up archives </title>
<p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2
/user/zoo </code></p>
        <p>
        The above example creates an archive using /user/hadoop as the
        parent (relative) directory.
        The directories /user/hadoop/dir1 and /user/hadoop/dir2 are
        archived into the file system directory /user/zoo/foo.har.
        Archiving does not delete the input files. If you want to delete
        the input files after creating the archive (to reduce namespace),
        you have to do it yourself.
        </p>
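        <p> If you do want to reclaim the namespace afterwards, you could
        remove the originals once you have verified the archive, for example
        with the paths from the example above: </p>
        <p><code>hadoop dfs -rmr /user/hadoop/dir1 /user/hadoop/dir2</code></p>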
<section>
<title> Looking up files and understanding the -p option </title>
        <p> Looking up files in hadoop archives is as easy as doing an ls on the
        filesystem. After you have archived the directories /user/hadoop/dir1 and
        /user/hadoop/dir2 as in the example above, to see all the files in the
        archive you can just run: </p>
<p><code>hadoop dfs -lsr har:///user/zoo/foo.har/</code></p>
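        <p> Assuming, for illustration, that dir1 contains a file called
        file1 and dir2 contains a file called file2, the recursive listing
        would show entries such as: </p>
        <source>
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir1/file1
har:///user/zoo/foo.har/dir2
har:///user/zoo/foo.har/dir2/file2
        </source>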
        <p> To understand the significance of the -p argument, let's go through
        the above example again. If you just do
        an ls (not lsr) on the hadoop archive using </p>
<p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p>
<p>The output should be:</p>
<source>
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
</source>
        <p> As you may recall, the archive was created with the following command </p>
<p><code>hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2
/user/zoo </code></p>
<p> If we were to change the command to: </p>
        <p><code>hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2
        /user/zoo </code></p>
        <p> then an ls on the hadoop archive using </p>
<p><code>hadoop dfs -ls har:///user/zoo/foo.har</code></p>
<p>would give you</p>
<source>
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
</source>
<p>
Notice that the archived files have been archived relative to
/user/ rather than /user/hadoop.
</p>
</section>
</section>
<section>
<title> Using Hadoop Archive with Map Reduce </title>
        <p>Using a Hadoop Archive in Map/Reduce is as easy as specifying a
        different input filesystem than the default file system.
        If you have a hadoop archive stored in HDFS at /user/zoo/foo.har,
        then to use this archive for Map/Reduce input, all
        you need to do is specify the input directory as har:///user/zoo/foo.har.
        Since a Hadoop Archive is exposed as a file system, Map/Reduce is able
        to use all the logical input files in the archive as input.</p>
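        <p> For example, a job could be launched against the archive just as
        it would be against any other input directory; in the sketch below
        the jar name, main class, and output path are placeholders: </p>
        <p><code>hadoop jar my-job.jar MyJob har:///user/zoo/foo.har /user/zoo/output</code></p>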
</section>
<section>
<title> File Replication and Permissions of Hadoop Archive </title>
        <p>
        Hadoop Archive currently does not preserve the metadata (such as
        permissions and replication) that the files had before they were
        archived. The files in a Hadoop Archive are created with the default
        permissions that the user creates files with. The replication of data
        files in a Hadoop Archive is set to 3, and the metadata (index) files
        have a replication factor of 5. You can increase these by increasing
        the replication factor of the files under the har file directory.
        When restoring hadoop archive files using something like:
        </p>
<p><code>
        hadoop distcp har:///path_to_har dest_path
</code></p>
<p>
the restored files will not have the permissions/replication of the original
files that were archived.
</p>
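        <p> To increase the replication mentioned above, you can set it
        recursively on the archive directory itself; for example (the path
        and replication factor are illustrative): </p>
        <p><code>hadoop dfs -setrep -R 5 /user/zoo/foo.har</code></p>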
</section>
<section>
<title> Creating different block size and part size Hadoop Archive </title>
        <p>
        You can specify a different block size and part file size for the
        archive using the following options:
</p> <p><code>
bin/hadoop archive -Dhar.block.size=512 -Dhar.partfile.size=1024 -archiveName ...
</code></p>
        <p>
        The above example sets a block size of 512 bytes for the part files
        and a part file size of 1 KB. These numbers are only examples; using
        such low values for the block size and part file size is not advisable.
        </p>
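        <p> A more realistic invocation, with a 512 MB block size and a 1 GB
        part file size (values and paths chosen purely for illustration),
        might be: </p>
        <p><code>
        bin/hadoop archive -Dhar.block.size=536870912 -Dhar.partfile.size=1073741824 -archiveName foo.har -p /user/hadoop dir1 /user/zoo
        </code></p>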
</section>
<section>
<title> Limitations of Hadoop Archive </title>
        <p>
        Currently, Hadoop archives do not support input paths that contain
        spaces; an exception is thrown in such cases. You can, however, create
        archives in which spaces in file names are replaced by a valid character.
        Below is an example:
        </p>
<p>
<code>
bin/hadoop archive -Dhar.space.replacement.enable=true -Dhar.space.replacement="_"
-archiveName ......
</code>
</p>
        <p> The above example replaces spaces with "_" in the archived file names.
        </p>
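        <p> A complete command using these options might look like the
        following; the paths are illustrative, and note the quoting of the
        source path that contains a space: </p>
        <p><code>
        bin/hadoop archive -Dhar.space.replacement.enable=true -Dhar.space.replacement="_" -archiveName foo.har -p /user/hadoop "dir 1" /user/zoo
        </code></p>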
</section>
<section>
<title>Internals of Hadoop Archive </title>
<p>
        A Hadoop Archive directory contains metadata files (_index and
        _masterindex) and data files (part-*). The _index file contains
        the names of the files that are part of the archive and their
        locations within the part files. The _masterindex file stores offsets
        into the _index file so that lookups in the _index file are faster.
        </p>
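        <p> Listing an archive's directory on the underlying file system
        (rather than through the har scheme), for example with
        <code>hadoop dfs -ls /user/zoo/foo.har</code>, shows this layout.
        You would see something like the following (the part file name is
        illustrative): </p>
        <source>
/user/zoo/foo.har/_index
/user/zoo/foo.har/_masterindex
/user/zoo/foo.har/part-0
        </source>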
</section>
</body>
</document>