| <!--- |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. See accompanying LICENSE file. |
| --> |
| |
| #set ( $H3 = '###' ) |
| |
| Hadoop Archives Guide |
| ===================== |
| |
| - [Overview](#Overview) |
| - [How to Create an Archive](#How_to_Create_an_Archive) |
| - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives) |
| - [How to Unarchive an Archive](#How_to_Unarchive_an_Archive) |
| - [Archives Examples](#Archives_Examples) |
| - [Creating an Archive](#Creating_an_Archive) |
| - [Looking Up Files](#Looking_Up_Files) |
| - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce) |
| |
| Overview |
| -------- |
| |
| Hadoop archives are special format archives. A Hadoop archive maps to a file |
| system directory. A Hadoop archive always has a \*.har extension. A Hadoop |
| archive directory contains metadata (in the form of _index and _masterindex) |
| and data (part-\*) files. The _index file contains the name of the files that |
| are part of the archive and the location within the part files. |
| |
| How to Create an Archive |
| ------------------------ |
| |
| `Usage: hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>` |
| |
| -archiveName is the name of the archive you would like to create. An example |
| would be foo.har. The name should have a \*.har extension. The parent argument |
| is to specify the relative path to which the files should be archived to. |
| Example would be : |
| |
| `-p /foo/bar a/b/c e/f/g` |
| |
| Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to |
| parent. Note that this is a Map/Reduce job that creates the archives. You |
| would need a map reduce cluster to run this. For a detailed example the later |
| sections. |
| |
| -r indicates the desired replication factor; if this optional argument is |
| not specified, a replication factor of 3 will be used. |
| |
| If you just want to archive a single directory /foo/bar then you can just use |
| |
| `hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir` |
| |
| If you specify source files that are in an encryption zone, they will be |
| decrypted and written into the archive. If the har file is not located in an |
| encryption zone, then they will be stored in clear (decrypted) form. If the |
| har file is located in an encryption zone they will stored in encrypted form. |
| |
| How to Look Up Files in Archives |
| -------------------------------- |
| |
| The archive exposes itself as a file system layer. So all the fs shell |
| commands in the archives work but with a different URI. Also, note that |
| archives are immutable. So, rename's, deletes and creates return an error. |
| URI for Hadoop Archives is |
| |
| `har://scheme-hostname:port/archivepath/fileinarchive` |
| |
| If no scheme is provided it assumes the underlying filesystem. In that case |
| the URI would look like |
| |
| `har:///archivepath/fileinarchive` |
| |
| How to Unarchive an Archive |
| --------------------------- |
| |
| Since all the fs shell commands in the archives work transparently, |
| unarchiving is just a matter of copying. |
| |
| To unarchive sequentially: |
| |
| `hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir` |
| |
| To unarchive in parallel, use DistCp: |
| |
| `hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir` |
| |
| Archives Examples |
| ----------------- |
| |
| $H3 Creating an Archive |
| |
| `hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/zoo` |
| |
| The above example is creating an archive using /user/hadoop as the relative |
| archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 |
| will be archived in the following file system directory -- /user/zoo/foo.har. |
| Archiving does not delete the input files. If you want to delete the input |
| files after creating the archives (to reduce namespace), you will have to do |
| it on your own. In this example, because `-r 3` is specified, a replication |
| factor of 3 will be used. |
| |
| $H3 Looking Up Files |
| |
| Looking up files in hadoop archives is as easy as doing an ls on the |
| filesystem. After you have archived the directories /user/hadoop/dir1 and |
| /user/hadoop/dir2 as in the example above, to see all the files in the |
| archives you can just run: |
| |
| `hdfs dfs -ls -R har:///user/zoo/foo.har/` |
| |
| To understand the significance of the -p argument, lets go through the above |
| example again. If you just do an ls (not lsr) on the hadoop archive using |
| |
| `hdfs dfs -ls har:///user/zoo/foo.har` |
| |
| The output should be: |
| |
| ``` |
| har:///user/zoo/foo.har/dir1 |
| har:///user/zoo/foo.har/dir2 |
| ``` |
| |
| As you can recall the archives were created with the following command |
| |
| `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo` |
| |
| If we were to change the command to: |
| |
| `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo` |
| |
| then a ls on the hadoop archive using |
| |
| `hdfs dfs -ls har:///user/zoo/foo.har` |
| |
| would give you |
| |
| ``` |
| har:///user/zoo/foo.har/hadoop/dir1 |
| har:///user/zoo/foo.har/hadoop/dir2 |
| ``` |
| |
| Notice that the archived files have been archived relative to /user/ rather |
| than /user/hadoop. |
| |
| Hadoop Archives and MapReduce |
| ----------------------------- |
| |
| Using Hadoop Archives in MapReduce is as easy as specifying a different input |
| filesystem than the default file system. If you have a hadoop archive stored |
| in HDFS in /user/zoo/foo.har then for using this archive for MapReduce input, |
| all you need is to specify the input directory as har:///user/zoo/foo.har. Since |
| Hadoop Archives is exposed as a file system MapReduce will be able to use all |
| the logical input files in Hadoop Archives as input. |