| <?xml version="1.0" encoding="ISO-8859-1" ?> |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <faqs xmlns="http://maven.apache.org/FML/1.0.1" |
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
| xsi:schemaLocation="http://maven.apache.org/FML/1.0.1 http://maven.apache.org/xsd/fml-1.0.1.xsd" |
| title="Frequently Asked Questions"> |
| <part id="General"> |
| <title>General</title> |
| |
| <faq id="Update"> |
| <question>Why does -update not create the parent source-directory under |
| a pre-existing target directory?</question> |
| <answer>The behaviour of <code>-update</code> and <code>-overwrite</code> |
| is described in detail in the Usage section of this document. In short, |
| if either option is used with a pre-existing destination directory, the |
| <strong>contents</strong> of each source directory is copied over, rather |
| than the source-directory itself. |
| This behaviour is consistent with the legacy DistCp implementation as well. |
| </answer> |
| </faq> |
| |
| <faq id="Deviation"> |
| <question>How does the new DistCp differ in semantics from the Legacy |
| DistCp?</question> |
| <answer> |
| <ul> |
| <li>Files that are skipped during copy used to also have their |
| file-attributes (permissions, owner/group info, etc.) unchanged, |
| when copied with Legacy DistCp. These are now updated, even if |
| the file-copy is skipped.</li> |
| <li>Empty root directories among the source-path inputs were not |
| created at the target, in Legacy DistCp. These are now created.</li> |
| </ul> |
| </answer> |
| </faq> |
| |
| <faq id="nMaps"> |
| <question>Why does the new DistCp use more maps than legacy DistCp?</question> |
| <answer> |
| <p>Legacy DistCp works by figuring out what files need to be actually |
| copied to target <strong>before</strong> the copy-job is launched, and then |
| launching as many maps as required for copy. So if a majority of the files |
| need to be skipped (because they already exist, for example), fewer maps |
| will be needed. As a consequence, the time spent in setup (i.e. before the |
| M/R job) is higher.</p> |
| <p>The new DistCp calculates only the contents of the source-paths. It |
| doesn't try to filter out what files can be skipped. That decision is put- |
| off till the M/R job runs. This is much faster (vis-a-vis execution-time), |
| but the number of maps launched will be as specified in the <code>-m</code> |
| option, or 20 (default) if unspecified.</p> |
| </answer> |
| </faq> |
| |
| <faq id="more_maps"> |
| <question>Why does DistCp not run faster when more maps are specified?</question> |
| <answer> |
| <p>At present, the smallest unit of work for DistCp is a file. i.e., |
| a file is processed by only one map. Increasing the number of maps to |
| a value exceeding the number of files would yield no performance |
| benefit. The number of maps lauched would equal the number of files.</p> |
| </answer> |
| </faq> |
| |
| <faq id="client_mem"> |
| <question>Why does DistCp run out of memory?</question> |
| <answer> |
| <p>If the number of individual files/directories being copied from |
| the source path(s) is extremely large (e.g. 1,000,000 paths), DistCp might |
| run out of memory while determining the list of paths for copy. This is |
| not unique to the new DistCp implementation.</p> |
| <p>To get around this, consider changing the <code>-Xmx</code> JVM |
| heap-size parameters, as follows:</p> |
| <p><code>bash$ export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"</code></p> |
| <p><code>bash$ hadoop distcp /source /target</code></p> |
| </answer> |
| </faq> |
| |
| </part> |
| </faqs> |