hadoop-tools/hadoop-distcp/src/site/xdoc/architecture.xml - hadoop - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
   Copyright 2002-2004 The Apache Software Foundation

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
 -->
 <document xmlns="http://maven.apache.org/XDOC/2.0"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 http://maven.apache.org/xsd/xdoc-2.0.xsd">
     <head>
         <title>Architecture of DistCp</title>
     </head>
     <body>
       <section name="Architecture">

         <p>The components of the new DistCp may be classified into the following
            categories: </p>

         <ul>

           <li>DistCp Driver</li>
           <li>Copy-listing generator</li>
           <li>Input-formats and Map-Reduce components</li>

         </ul>

         <subsection name="DistCp Driver">
           <p>The DistCp Driver components are responsible for:</p>

           <ul>
             <li>Parsing the arguments passed to the DistCp command on the
                 command-line, via:
               <ul>
                 <li>OptionsParser, and</li>
                 <li>DistCpOptionsSwitch</li>
               </ul>
             </li>
             <li>Assembling the command arguments into an appropriate
                 DistCpOptions object, and initializing DistCp. These arguments
                 include:
               <ul>
                 <li>Source-paths</li>
                 <li>Target location</li>
                 <li>Copy options (e.g. whether to update-copy, overwrite, which
                     file-attributes to preserve, etc.)</li>
               </ul>
             </li>
             <li>Orchestrating the copy operation by:
               <ul>
                 <li>Invoking the copy-listing-generator to create the list of
                     files to be copied.</li>
                 <li>Setting up and launching the Hadoop Map-Reduce Job to carry
                     out the copy.</li>
                 <li>Based on the options, either returning a handle to the
                     Hadoop MR Job immediately, or waiting till completion.</li>
               </ul>
             </li>
           </ul>
           <br/>

           <p>The parser-elements are exercised only from the command-line (or if
              DistCp::run() is invoked). The DistCp class may also be used
              programmatically, by constructing the DistCpOptions object, and
              initializing a DistCp object appropriately.</p>

         </subsection>

         <subsection name="Copy-listing generator">

           <p>The copy-listing-generator classes are responsible for creating the
              list of files/directories to be copied from source. They examine
              the contents of the source-paths (files/directories, including
              wild-cards), and record all paths that need copy into a sequence-
              file, for consumption by the DistCp Hadoop Job. The main classes in
              this module include:</p>

           <ol>

             <li>CopyListing: The interface that should be implemented by any
                 copy-listing-generator implementation. Also provides the factory
                 method by which the concrete CopyListing implementation is
                 chosen.</li>

             <li>SimpleCopyListing: An implementation of CopyListing that accepts
                 multiple source paths (files/directories), and recursively lists
                 all the individual files and directories under each, for
                 copy.</li>

             <li>GlobbedCopyListing: Another implementation of CopyListing that
                 expands wild-cards in the source paths.</li>

             <li>FileBasedCopyListing: An implementation of CopyListing that
                 reads the source-path list from a specified file.</li>

           </ol>
           <p/>

           <p>Based on whether a source-file-list is specified in the
              DistCpOptions, the source-listing is generated in one of the
              following ways:</p>

           <ol>

             <li>If there's no source-file-list, the GlobbedCopyListing is used.
                 All wild-cards are expanded, and all the expansions are
                 forwarded to the SimpleCopyListing, which in turn constructs the
                 listing (via recursive descent of each path). </li>

             <li>If a source-file-list is specified, the FileBasedCopyListing is
                 used. Source-paths are read from the specified file, and then
                 forwarded to the GlobbedCopyListing. The listing is then
                 constructed as described above.</li>

           </ol>

           <br/>

           <p>One may customize the method by which the copy-listing is
              constructed by providing a custom implementation of the CopyListing
              interface. The behaviour of DistCp differs here from the legacy
              DistCp, in how paths are considered for copy. </p>

           <p>The legacy implementation only lists those paths that must
              definitely be copied on to target.
              E.g. if a file already exists at the target (and -overwrite isn't
              specified), the file isn't even considered in the Map-Reduce Copy
              Job. Determining this during setup (i.e. before the Map-Reduce Job)
              involves file-size and checksum-comparisons that are potentially
              time-consuming.</p>

           <p>The new DistCp postpones such checks until the Map-Reduce Job, thus
              reducing setup time. Performance is enhanced further since these
              checks are parallelized across multiple maps.</p>

         </subsection>

         <subsection name="Input-formats and Map-Reduce components">

           <p> The Input-formats and Map-Reduce components are responsible for
               the actual copy of files and directories from the source to the
               destination path. The listing-file created during copy-listing
               generation is consumed at this point, when the copy is carried
               out. The classes of interest here include:</p>

           <ul>
             <li><strong>UniformSizeInputFormat:</strong> This implementation of
                 org.apache.hadoop.mapreduce.InputFormat provides equivalence
                 with Legacy DistCp in balancing load across maps.
                 The aim of the UniformSizeInputFormat is to make each map copy
                 roughly the same number of bytes. Apropos, the listing file is
                 split into groups of paths, such that the sum of file-sizes in
                 each InputSplit is nearly equal to every other map. The splitting
                 isn't always perfect, but its trivial implementation keeps the
                 setup-time low.</li>

             <li><strong>DynamicInputFormat and DynamicRecordReader:</strong>
                 <p> The DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat,
                 and is new to DistCp. The listing-file is split into several
                 "chunk-files", the exact number of chunk-files being a multiple
                 of the number of maps requested for in the Hadoop Job. Each map
                 task is "assigned" one of the chunk-files (by renaming the chunk
                 to the task's id), before the Job is launched.</p>

                 <p>Paths are read from each chunk using the DynamicRecordReader,
                 and processed in the CopyMapper. After all the paths in a chunk
                 are processed, the current chunk is deleted and a new chunk is
                 acquired. The process continues until no more chunks are
                 available.</p>
                 <p>This "dynamic" approach allows faster map-tasks to consume
                 more paths than slower ones, thus speeding up the DistCp job
                 overall. </p>
             </li>

             <li><strong>CopyMapper:</strong> This class implements the physical
                 file-copy. The input-paths are checked against the input-options
                 (specified in the Job's Configuration), to determine whether a
                 file needs copy. A file will be copied only if at least one of
                 the following is true:
               <ul>
                 <li>A file with the same name doesn't exist at target.</li>
                 <li>A file with the same name exists at target, but has a
                     different file size.</li>
                 <li>A file with the same name exists at target, but has a
                     different checksum, and -skipcrccheck isn't mentioned.</li>
                 <li>A file with the same name exists at target, but -overwrite
                     is specified.</li>
                 <li>A file with the same name exists at target, but differs in
                     block-size (and block-size needs to be preserved.</li>
               </ul>
             </li>

             <li><strong>CopyCommitter:</strong>
                 This class is responsible for the commit-phase of the DistCp
                 job, including:
               <ul>
                 <li>Preservation of directory-permissions (if specified in the
                     options)</li>
                 <li>Clean-up of temporary-files, work-directories, etc.</li>
               </ul>
             </li>
           </ul>
         </subsection>
       </section>
     </body>
 </document>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Copyright 2002-2004 The Apache Software Foundation

	Licensed under the Apache License, Version 2.0 (the "License");
	you may not use this file except in compliance with the License.
	You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->
	<document xmlns="http://maven.apache.org/XDOC/2.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 http://maven.apache.org/xsd/xdoc-2.0.xsd">
	<head>
	<title>Architecture of DistCp</title>
	</head>
	<body>
	<section name="Architecture">

	<p>The components of the new DistCp may be classified into the following
	categories: </p>

	<ul>

	<li>DistCp Driver</li>
	<li>Copy-listing generator</li>
	<li>Input-formats and Map-Reduce components</li>

	</ul>

	<subsection name="DistCp Driver">
	<p>The DistCp Driver components are responsible for:</p>

	<ul>
	<li>Parsing the arguments passed to the DistCp command on the
	command-line, via:
	<ul>
	<li>OptionsParser, and</li>
	<li>DistCpOptionsSwitch</li>
	</ul>
	</li>
	<li>Assembling the command arguments into an appropriate
	DistCpOptions object, and initializing DistCp. These arguments
	include:
	<ul>
	<li>Source-paths</li>
	<li>Target location</li>
	<li>Copy options (e.g. whether to update-copy, overwrite, which
	file-attributes to preserve, etc.)</li>
	</ul>
	</li>
	<li>Orchestrating the copy operation by:
	<ul>
	<li>Invoking the copy-listing-generator to create the list of
	files to be copied.</li>
	<li>Setting up and launching the Hadoop Map-Reduce Job to carry
	out the copy.</li>
	<li>Based on the options, either returning a handle to the
	Hadoop MR Job immediately, or waiting till completion.</li>
	</ul>
	</li>
	</ul>
	<br/>

	<p>The parser-elements are exercised only from the command-line (or if
	DistCp::run() is invoked). The DistCp class may also be used
	programmatically, by constructing the DistCpOptions object, and
	initializing a DistCp object appropriately.</p>

	</subsection>

	<subsection name="Copy-listing generator">

	<p>The copy-listing-generator classes are responsible for creating the
	list of files/directories to be copied from source. They examine
	the contents of the source-paths (files/directories, including
	wild-cards), and record all paths that need copy into a sequence-
	file, for consumption by the DistCp Hadoop Job. The main classes in
	this module include:</p>

	<ol>

	<li>CopyListing: The interface that should be implemented by any
	copy-listing-generator implementation. Also provides the factory
	method by which the concrete CopyListing implementation is
	chosen.</li>

	<li>SimpleCopyListing: An implementation of CopyListing that accepts
	multiple source paths (files/directories), and recursively lists
	all the individual files and directories under each, for
	copy.</li>

	<li>GlobbedCopyListing: Another implementation of CopyListing that
	expands wild-cards in the source paths.</li>

	<li>FileBasedCopyListing: An implementation of CopyListing that
	reads the source-path list from a specified file.</li>

	</ol>
	<p/>

	<p>Based on whether a source-file-list is specified in the
	DistCpOptions, the source-listing is generated in one of the
	following ways:</p>

	<ol>

	<li>If there's no source-file-list, the GlobbedCopyListing is used.
	All wild-cards are expanded, and all the expansions are
	forwarded to the SimpleCopyListing, which in turn constructs the
	listing (via recursive descent of each path). </li>

	<li>If a source-file-list is specified, the FileBasedCopyListing is
	used. Source-paths are read from the specified file, and then
	forwarded to the GlobbedCopyListing. The listing is then
	constructed as described above.</li>

	</ol>

	<br/>

	<p>One may customize the method by which the copy-listing is
	constructed by providing a custom implementation of the CopyListing
	interface. The behaviour of DistCp differs here from the legacy
	DistCp, in how paths are considered for copy. </p>

	<p>The legacy implementation only lists those paths that must
	definitely be copied on to target.
	E.g. if a file already exists at the target (and -overwrite isn't
	specified), the file isn't even considered in the Map-Reduce Copy
	Job. Determining this during setup (i.e. before the Map-Reduce Job)
	involves file-size and checksum-comparisons that are potentially
	time-consuming.</p>

	<p>The new DistCp postpones such checks until the Map-Reduce Job, thus
	reducing setup time. Performance is enhanced further since these
	checks are parallelized across multiple maps.</p>

	</subsection>

	<subsection name="Input-formats and Map-Reduce components">

	<p> The Input-formats and Map-Reduce components are responsible for
	the actual copy of files and directories from the source to the
	destination path. The listing-file created during copy-listing
	generation is consumed at this point, when the copy is carried
	out. The classes of interest here include:</p>

	<ul>
	<li><strong>UniformSizeInputFormat:</strong> This implementation of
	org.apache.hadoop.mapreduce.InputFormat provides equivalence
	with Legacy DistCp in balancing load across maps.
	The aim of the UniformSizeInputFormat is to make each map copy
	roughly the same number of bytes. Apropos, the listing file is
	split into groups of paths, such that the sum of file-sizes in
	each InputSplit is nearly equal to every other map. The splitting
	isn't always perfect, but its trivial implementation keeps the
	setup-time low.</li>

	<li><strong>DynamicInputFormat and DynamicRecordReader:</strong>
	<p> The DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat,
	and is new to DistCp. The listing-file is split into several
	"chunk-files", the exact number of chunk-files being a multiple
	of the number of maps requested for in the Hadoop Job. Each map
	task is "assigned" one of the chunk-files (by renaming the chunk
	to the task's id), before the Job is launched.</p>

	<p>Paths are read from each chunk using the DynamicRecordReader,
	and processed in the CopyMapper. After all the paths in a chunk
	are processed, the current chunk is deleted and a new chunk is
	acquired. The process continues until no more chunks are
	available.</p>
	<p>This "dynamic" approach allows faster map-tasks to consume
	more paths than slower ones, thus speeding up the DistCp job
	overall. </p>
	</li>

	<li><strong>CopyMapper:</strong> This class implements the physical
	file-copy. The input-paths are checked against the input-options
	(specified in the Job's Configuration), to determine whether a
	file needs copy. A file will be copied only if at least one of
	the following is true:
	<ul>
	<li>A file with the same name doesn't exist at target.</li>
	<li>A file with the same name exists at target, but has a
	different file size.</li>
	<li>A file with the same name exists at target, but has a
	different checksum, and -skipcrccheck isn't mentioned.</li>
	<li>A file with the same name exists at target, but -overwrite
	is specified.</li>
	<li>A file with the same name exists at target, but differs in
	block-size (and block-size needs to be preserved.</li>
	</ul>
	</li>

	<li><strong>CopyCommitter:</strong>
	This class is responsible for the commit-phase of the DistCp
	job, including:
	<ul>
	<li>Preservation of directory-permissions (if specified in the
	options)</li>
	<li>Clean-up of temporary-files, work-directories, etc.</li>
	</ul>
	</li>
	</ul>
	</subsection>
	</section>
	</body>
	</document>