| <?xml version="1.0"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| |
| <document> |
| |
| <header> |
| <title>Offline Image Viewer Guide</title> |
| </header> |
| |
| <body> |
| |
| <section> |
| <title>Overview</title> |
| |
| <p>The Offline Image Viewer is a tool to dump the contents of HDFS |
| fsimage files to human-readable formats in order to allow offline analysis |
| and examination of a Hadoop cluster's namespace. The tool is able to |
| process very large image files relatively quickly, converting them to |
| one of several output formats. The tool handles the layout formats that |
| were included with Hadoop versions 16 and up. If the tool is not able to |
| process an image file, it will exit cleanly. The Offline Image Viewer does not require |
| a Hadoop cluster to be running; it is entirely offline in its operation.</p> |
| |
| <p>The Offline Image Viewer provides several output processors:</p> |
| <ol> |
| <li><strong>Ls</strong> is the default output processor. It closely mimics the format of |
| the <code>lsr</code> command. It includes the same fields, in the same order, as |
| <code>lsr</code>: directory or file flag, permissions, replication, owner, group, |
| file size, modification date, and full path. Unlike the <code>lsr</code> command, |
| the root path is included. One important difference between the output |
| of the <code>lsr</code> command and this processor is that this output is not sorted |
| by directory name and contents. Rather, the files are listed in the |
| order in which they are stored in the fsimage file. Therefore, it is |
| not possible to directly compare the output of the <code>lsr</code> command and |
| this tool. The Ls processor uses information contained within the Inode blocks to |
| calculate file sizes and ignores the <code>-skipBlocks</code> option.</li> |
| <li><strong>Indented</strong> provides a more complete view of the fsimage's contents, |
| including all of the information included in the image, such as image |
| version, generation stamp and inode- and block-specific listings. This |
| processor uses indentation to organize the output in a hierarchical manner. |
| This format is suitable for easy human comprehension.</li> |
| <li><strong>Delimited</strong> provides one file per line consisting of the path, |
| replication, modification time, access time, block size, number of blocks, file size, |
| namespace quota, diskspace quota, permissions, username and group name. If run against |
| an fsimage that does not contain any of these fields, the field's column will be included, |
| but no data recorded. The default record delimiter is a tab, but this may be changed |
| via the <code>-delimiter</code> command line argument (see the example invocations after |
| this list). This processor is designed to |
| create output that is easily analyzed by other tools, such as <a href="http://hadoop.apache.org/pig/">Apache Pig</a>. |
| See the <a href="#analysis">Analyzing Results</a> section |
| for further information on using this processor to analyze the contents of fsimage files.</li> |
| <li><strong>XML</strong> creates an XML document of the fsimage and includes all of the |
| information within the fsimage, similar to the Indented processor. The output |
| of this processor is amenable to automated processing and analysis with XML tools. |
| Due to the verbosity of the XML syntax, this processor will also generate |
| the largest amount of output.</li> |
| <li><strong>FileDistribution</strong> is the tool for analyzing file |
| sizes in the namespace image. In order to run the tool one should |
| define a range of integers <code>[0, maxSize]</code> by specifying |
| <code>maxSize</code> and a <code>step</code>. |
| The range of integers is divided into segments of size |
| <code>step</code>: |
| <code>[0, s</code><sub>1</sub><code>, ..., s</code><sub>n-1</sub><code>, maxSize]</code>, |
| and the processor calculates how many files in the system fall into |
| each segment <code>[s</code><sub>i-1</sub><code>, s</code><sub>i</sub><code>)</code>. |
| Note that files larger than <code>maxSize</code> always fall into |
| the very last segment. |
| The output file is formatted as a tab-separated two-column table: |
| Size and NumFiles, where Size represents the start of the segment |
| and NumFiles is the number of files from the image whose size falls |
| into that segment (see the example invocations after this list).</li> |
| </ol> |
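| |
| <p>As an illustration of the last two processors, the following sketches show how they |
| might be invoked. The output file names are placeholders, and the <code>-maxSize</code> and |
| <code>-step</code> switches are assumed to correspond to the <code>maxSize</code> and |
| <code>step</code> parameters described above:</p> |
| |
| <p><code>bash$ bin/hdfs oiv -i fsimage -o fsimage.csv -p Delimited -delimiter ','</code><br/> |
| <code>bash$ bin/hdfs oiv -i fsimage -o fsimage.fd -p FileDistribution -maxSize 134217728 -step 1048576</code></p> |
| |
| <p>If the assumed switches are available, the second command would bucket file sizes into |
| 1 MB segments up to a maximum of 128 MB.</p> |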
| |
| </section> <!-- overview --> |
| |
| <section> |
| <title>Usage</title> |
| |
| <section> |
| <title>Basic</title> |
| <p>The simplest usage of the Offline Image Viewer is to provide just an input and output |
| file, via the <code>-i</code> and <code>-o</code> command-line switches:</p> |
| |
| <p><code>bash$ bin/hdfs oiv -i fsimage -o fsimage.txt</code><br/></p> |
| |
| <p>This will create a file named fsimage.txt in the current directory using |
| the Ls output processor. For very large image files, this process may take |
| several minutes.</p> |
| |
| <p>One can specify which output processor to use via the command-line switch <code>-p</code>. |
| For instance:</p> |
| <p><code>bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML</code><br/></p> |
| |
| <p>or</p> |
| |
| <p><code>bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented</code><br/></p> |
| |
| <p>This will run the tool using either the XML or Indented output processor, |
| respectively.</p> |
| |
| <p>One command-line option worth considering is <code>-skipBlocks</code>, which |
| prevents the tool from explicitly enumerating all of the blocks that make up |
| a file in the namespace. This is useful for file systems that have very large |
| files. Enabling this option can significantly decrease the size of the resulting |
| output, as individual blocks are not included. Note, however, that the Ls processor |
| needs to enumerate the blocks and so overrides this option.</p> |
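| |
| <p>For example, to produce a smaller XML dump of a namespace containing very large files, the |
| switch can be combined with the XML processor (the file names here are placeholders):</p> |
| |
| <p><code>bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML -skipBlocks</code><br/></p> |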
| |
| </section> <!-- Basic --> |
| <section id="Example"> |
| <title>Example</title> |
| |
| <p>Consider the following contrived namespace:</p> |
| <source> |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 21:17 /anotherDir |
| -rw-r--r-- 3 theuser supergroup 286631664 2009-03-16 21:15 /anotherDir/biggerfile |
| -rw-r--r-- 3 theuser supergroup 8754 2009-03-16 21:17 /anotherDir/smallFile |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem |
| drwx-wx-wx - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 21:12 /one |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 21:12 /one/two |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 21:16 /user |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 21:19 /user/theuser |
| </source> |
| |
| <p>Applying the Offline Image Viewer against the fsimage of this namespace with default options would result in the following output:</p> |
| <source> |
| machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -o fsimage.txt |
| |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:16 / |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:17 /anotherDir |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:12 /one |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:16 /user |
| -rw-r--r-- 3 theuser supergroup 286631664 2009-03-16 14:15 /anotherDir/biggerfile |
| -rw-r--r-- 3 theuser supergroup 8754 2009-03-16 14:17 /anotherDir/smallFile |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem |
| drwx-wx-wx - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:12 /one/two |
| drwxr-xr-x - theuser supergroup 0 2009-03-16 14:19 /user/theuser |
| </source> |
| |
| <p>Similarly, applying the Indented processor would generate output that begins with:</p> |
| <source> |
| machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt |
| |
| FSImage |
|   ImageVersion = -19 |
|   NamespaceID = 2109123098 |
|   GenerationStamp = 1003 |
|   INodes [NumInodes = 12] |
|     Inode |
|       INodePath = |
|       Replication = 0 |
|       ModificationTime = 2009-03-16 14:16 |
|       AccessTime = 1969-12-31 16:00 |
|       BlockSize = 0 |
|       Blocks [NumBlocks = -1] |
|       NSQuota = 2147483647 |
|       DSQuota = -1 |
|       Permissions |
|         Username = theuser |
|         GroupName = supergroup |
|         PermString = rwxr-xr-x |
| ...remaining output omitted... |
| </source> |
| |
| </section> <!-- example--> |
| |
| </section> |
| |
| <section id="options"> |
| <title>Options</title> |
| |
| <section> |
| <title>Option Index</title> |
| <table> |
| <tr><th> Flag </th><th> Description </th></tr> |
| <tr><td><code>[-i|--inputFile] <input file></code></td> |
| <td>Specify the input fsimage file to process. Required.</td></tr> |
| <tr><td><code>[-o|--outputFile] <output file></code></td> |
| <td>Specify the output filename, if the specified output processor |
| generates one. If the specified file already exists, it is silently overwritten. Required. |
| </td></tr> |
| <tr><td><code>[-p|--processor] <processor></code></td> |
| <td>Specify the image processor to apply against the image file. Currently |
| valid options are Ls (default), XML, Delimited, Indented, and FileDistribution. |
| </td></tr> |
| <tr><td><code>-skipBlocks</code></td> |
| <td>Do not enumerate individual blocks within files. This may save processing time |
| and output file space on namespaces with very large files. The <code>Ls</code> processor reads |
| the blocks to correctly determine file sizes and ignores this option.</td></tr> |
| <tr><td><code>-printToScreen</code></td> |
| <td>Pipe output of processor to console as well as specified file. On extremely |
| large namespaces, this may increase processing time by an order of magnitude.</td></tr> |
| <tr><td><code>-delimiter <arg></code></td> |
| <td>When used in conjunction with the Delimited processor, replaces the default |
| tab delimiter with the string specified by <code>arg</code>.</td></tr> |
| <tr><td><code>[-h|--help]</code></td> |
| <td>Display the tool usage and help information and exit.</td></tr> |
| </table> |
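| <p>As a sketch of how these switches combine (the file names are placeholders), the following |
| command would produce a comma-delimited dump while also echoing the processor's output to the |
| console:</p> |
| <p><code>bash$ bin/hdfs oiv -i fsimage -o fsimage.csv -p Delimited -delimiter ',' -printToScreen</code><br/></p> |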
| </section> <!-- options --> |
| </section> |
| |
| <section id="analysis"> |
| <title>Analyzing Results</title> |
| <p>The Offline Image Viewer makes it easy to gather large amounts of data about the HDFS namespace. |
| This information can then be used to explore file system usage patterns or find |
| specific files that match arbitrary criteria, along with other types of namespace analysis. The Delimited |
| image processor in particular creates |
| output that is amenable to further processing by tools such as <a href="http://hadoop.apache.org/pig/">Apache Pig</a>. Pig is a particularly |
| good choice for analyzing these data, as it can deal with the output generated from a small fsimage |
| but also scales up to consume data from extremely large file systems.</p> |
| <p>The Delimited image processor generates lines of text separated, by default, by tabs and includes |
| all of the fields that are common between completed files and files that were still under construction |
| when the fsimage was generated. Example scripts are provided demonstrating how to use this output to |
| accomplish three tasks: determine the number of files each user has created on the file system, |
| find files that were created but have never been accessed, and find probable duplicates of large files |
| by comparing the size of each file.</p> |
| <p>Each of the following scripts assumes you have generated an output file using the Delimited processor named |
| <code>foo</code> and will be storing the results of the Pig analysis in a file named <code>results</code>.</p> |
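| <p>For reference, such a file could be generated with a command along these lines (the input |
| fsimage name is a placeholder):</p> |
| <p><code>bash$ bin/hdfs oiv -i fsimage -o foo -p Delimited</code><br/></p> |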
| <section> |
| <title>Total Number of Files for Each User</title> |
| <p>This script processes each path within the namespace, groups them by the file owner and determines the total |
| number of files each user owns.</p> |
| <p><strong>numFilesOfEachUser.pig:</strong></p> |
| <source> |
| -- This script determines the total number of files each user has in |
| -- the namespace. Its output is of the form: |
| -- username, totalNumFiles |
| |
| -- Load all of the fields from the file |
| A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray, |
| replication:int, |
| modTime:chararray, |
| accessTime:chararray, |
| blockSize:long, |
| numBlocks:int, |
| fileSize:long, |
| NamespaceQuota:int, |
| DiskspaceQuota:int, |
| perms:chararray, |
| username:chararray, |
| groupname:chararray); |
| |
| -- Grab just the path and username |
| B = FOREACH A GENERATE path, username; |
| |
| -- Generate the sum of the number of paths for each user |
| C = FOREACH (GROUP B BY username) GENERATE group, COUNT(B.path); |
| |
| -- Save results |
| STORE C INTO '$outputFile'; |
| </source> |
| <p>This script can be run with Pig using the following command:</p> |
| <p><code>bin/pig -x local -param inputFile=../foo -param outputFile=../results ../numFilesOfEachUser.pig</code><br/></p> |
| <p>The output file's content will be similar to that below:</p> |
| <p> |
| <code>bart 1</code><br/> |
| <code>lisa 16</code><br/> |
| <code>homer 28</code><br/> |
| <code>marge 2456</code><br/> |
| </p> |
| </section> |
| |
| <section><title>Files That Have Never Been Accessed</title> |
| <p>This script finds files that were created but whose access times were never changed, meaning they were never opened or viewed.</p> |
| <p><strong>neverAccessed.pig:</strong></p> |
| <source> |
| -- This script generates a list of files that were created but never |
| -- accessed, based on their AccessTime |
| |
| -- Load all of the fields from the file |
| A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray, |
| replication:int, |
| modTime:chararray, |
| accessTime:chararray, |
| blockSize:long, |
| numBlocks:int, |
| fileSize:long, |
| NamespaceQuota:int, |
| DiskspaceQuota:int, |
| perms:chararray, |
| username:chararray, |
| groupname:chararray); |
| |
| -- Grab just the path and last time the file was accessed |
| B = FOREACH A GENERATE path, accessTime; |
| |
| -- Drop all the paths that don't have the default assigned last-access time |
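| -- (1969-12-31 16:00 is the Unix epoch rendered in US Pacific time; the exact |
| -- string depends on the timezone of the machine that generated the file) |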
| C = FILTER B BY accessTime == '1969-12-31 16:00'; |
| |
| -- Drop the accessTimes, since they're all the same |
| D = FOREACH C GENERATE path; |
| |
| -- Save results |
| STORE D INTO '$outputFile'; |
| </source> |
| <p>This script can be run with Pig using the following command; its output file will contain a list of files that were created but never viewed afterwards.</p> |
| <p><code>bin/pig -x local -param inputFile=../foo -param outputFile=../results ../neverAccessed.pig</code><br/></p> |
| </section> |
| <section><title>Probable Duplicated Files Based on File Size</title> |
| <p>This script groups files together based on their size, drops any that are less than 100 MB and returns a list of the file size, the number of files found and a tuple of the file paths. This can be used to find likely duplicates within the filesystem namespace.</p> |
| |
| <p><strong>probableDuplicates.pig:</strong></p> |
| <source> |
| -- This script finds probable duplicate files greater than 100 MB by |
| -- grouping together files based on their byte size. Files of this size |
| -- with exactly the same number of bytes can be considered probable |
| -- duplicates, but should be checked further, either by comparing the |
| -- contents directly or by another proxy, such as a hash of the contents. |
| -- The script's output is of the form: |
| -- fileSize numProbableDuplicates {(probableDup1), (probableDup2)} |
| |
| -- Load all of the fields from the file |
| A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray, |
| replication:int, |
| modTime:chararray, |
| accessTime:chararray, |
| blockSize:long, |
| numBlocks:int, |
| fileSize:long, |
| NamespaceQuota:int, |
| DiskspaceQuota:int, |
| perms:chararray, |
| username:chararray, |
| groupname:chararray); |
| |
| -- Grab the pathname and filesize |
| B = FOREACH A GENERATE path, fileSize; |
| |
| -- Drop files smaller than 100 MB |
| C = FILTER B BY fileSize > 100L * 1024L * 1024L; |
| |
| -- Gather all the files of the same byte size |
| D = GROUP C BY fileSize; |
| |
| -- Generate the file size, the number of duplicates, and the list of duplicate paths |
| E = FOREACH D GENERATE group AS fileSize, COUNT(C) AS numDupes, C.path AS files; |
| |
| -- Drop all the groups that contain only a single file |
| F = FILTER E BY numDupes > 1L; |
| |
| -- Sort by the size of the files |
| G = ORDER F BY fileSize; |
| |
| -- Save results |
| STORE G INTO '$outputFile'; |
| </source> |
| <p>This script can be run with Pig using the following command:</p> |
| <p><code>bin/pig -x local -param inputFile=../foo -param outputFile=../results ../probableDuplicates.pig</code><br/></p> |
| <p> The output file's content will be similar to that below:</p> |
| |
| <source> |
| 1077288632 2 {(/user/tennant/work1/part-00501),(/user/tennant/work1/part-00993)} |
| 1077288664 4 {(/user/tennant/work0/part-00567),(/user/tennant/work0/part-03980),(/user/tennant/work1/part-00725),(/user/eccelston/output/part-03395)} |
| 1077288668 3 {(/user/tennant/work0/part-03705),(/user/tennant/work0/part-04242),(/user/tennant/work1/part-03839)} |
| 1077288698 2 {(/user/tennant/work0/part-00435),(/user/eccelston/output/part-01382)} |
| 1077288702 2 {(/user/tennant/work0/part-03864),(/user/eccelston/output/part-03234)} |
| </source> |
| <p>Each line includes the file size in bytes that was found to be duplicated, the number of duplicates found, and a list of the duplicated paths. |
| Files smaller than 100 MB are ignored; above that size, files sharing an exact byte count are reasonably likely to be duplicates.</p> |
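| <p>As a sketch of the follow-up check suggested in the script's comments (the paths are taken |
| from the sample output above, and hashing with <code>md5sum</code> is just one illustrative |
| proxy for comparing contents), candidate duplicates could be verified like so:</p> |
| <source> |
| bash$ bin/hdfs dfs -cat /user/tennant/work1/part-00501 | md5sum |
| bash$ bin/hdfs dfs -cat /user/tennant/work1/part-00993 | md5sum |
| </source> |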
| </section> |
| </section> |
| |
| |
| </body> |
| |
| </document> |