| <?xml version="1.0" encoding="UTF-8"?> |
| <document xmlns="http://maven.apache.org/XDOC/2.0" |
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
| xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 http://maven.apache.org/xsd/xdoc-2.0.xsd"> |
| <head> |
| <title>Appendix</title> |
| </head> |
| <body> |
| <section name="Map sizing"> |
| |
| <p> By default, DistCp makes an attempt to size each map comparably so |
| that each copies roughly the same number of bytes. Note that files are the |
| finest level of granularity, so increasing the number of simultaneous |
| copiers (i.e. maps) may not always increase the number of |
| simultaneous copies nor the overall throughput.</p> |
| |
| <p> The new DistCp also provides a strategy to "dynamically" size maps, |
| allowing faster data-nodes to copy more bytes than slower nodes. Using |
| <code>-strategy dynamic</code> (explained in the Architecture), rather |
| than to assign a fixed set of source-files to each map-task, files are |
| instead split into several sets. The number of sets exceeds the number of |
| maps, usually by a factor of 2-3. Each map picks up and copies all files |
| listed in a chunk. When a chunk is exhausted, a new chunk is acquired and |
| processed, until no more chunks remain.</p> |
| |
| <p> By not assigning a source-path to a fixed map, faster map-tasks (i.e. |
| data-nodes) are able to consume more chunks, and thus copy more data, |
| than slower nodes. While this distribution isn't uniform, it is |
| <strong>fair</strong> with regard to each mapper's capacity.</p> |
| |
| <p>The dynamic-strategy is implemented by the DynamicInputFormat. It |
| provides superior performance under most conditions. </p> |
| |
| <p>Tuning the number of maps to the size of the source and |
| destination clusters, the size of the copy, and the available |
| bandwidth is recommended for long-running and regularly run jobs.</p> |
| |
| </section> |
| |
| <section name="Copying between versions of HDFS"> |
| |
| <p>For copying between two different versions of Hadoop, one will |
| usually use HftpFileSystem. This is a read-only FileSystem, so DistCp |
| must be run on the destination cluster (more specifically, on |
| TaskTrackers that can write to the destination cluster). Each source is |
| specified as <code>hftp://<dfs.http.address>/<path></code> |
| (the default <code>dfs.http.address</code> is |
| <namenode>:50070).</p> |
| |
| </section> |
| |
| <section name="Map/Reduce and other side-effects"> |
| |
| <p>As has been mentioned in the preceding, should a map fail to copy |
| one of its inputs, there will be several side-effects.</p> |
| |
| <ul> |
| |
| <li>Unless <code>-overwrite</code> is specified, files successfully |
| copied by a previous map on a re-execution will be marked as |
| "skipped".</li> |
| |
| <li>If a map fails <code>mapred.map.max.attempts</code> times, the |
| remaining map tasks will be killed (unless <code>-i</code> is |
| set).</li> |
| |
| <li>If <code>mapred.speculative.execution</code> is set set |
| <code>final</code> and <code>true</code>, the result of the copy is |
| undefined.</li> |
| |
| </ul> |
| |
| </section> |
| |
| <section name="SSL Configurations for HSFTP sources:"> |
| |
| <p>To use an HSFTP source (i.e. using the hsftp protocol), a Map-Red SSL |
| configuration file needs to be specified (via the <code>-mapredSslConf</code> |
| option). This must specify 3 parameters:</p> |
| |
| <ul> |
| <li><code>ssl.client.truststore.location</code>: The local-filesystem |
| location of the trust-store file, containing the certificate for |
| the namenode.</li> |
| |
| <li><code>ssl.client.truststore.type</code>: (Optional) The format of |
| the trust-store file.</li> |
| |
| <li><code>ssl.client.truststore.password</code>: (Optional) Password |
| for the trust-store file.</li> |
| |
| </ul> |
| |
| <p>The following is an example of the contents of the contents of |
| a Map-Red SSL Configuration file:</p> |
| |
| <p> <br/> <code> <configuration> </code> </p> |
| |
| <p> <br/> <code><property> </code> </p> |
| <p> <code><name>ssl.client.truststore.location</name> </code> </p> |
| <p> <code><value>/work/keystore.jks</value> </code> </p> |
| <p> <code><description>Truststore to be used by clients like distcp. Must be specified. </description></code> </p> |
| <p> <br/> <code></property> </code> </p> |
| |
| <p><code> <property> </code> </p> |
| <p> <code><name>ssl.client.truststore.password</name> </code> </p> |
| <p> <code><value>changeme</value> </code> </p> |
| <p> <code><description>Optional. Default value is "". </description> </code> </p> |
| <p> <code></property> </code> </p> |
| |
| <p> <br/> <code> <property> </code> </p> |
| <p> <code> <name>ssl.client.truststore.type</name></code> </p> |
| <p> <code> <value>jks</value></code> </p> |
| <p> <code> <description>Optional. Default value is "jks". </description></code> </p> |
| <p> <code> </property> </code> </p> |
| |
| <p> <code> <br/> </configuration> </code> </p> |
| |
| <p><br/>The SSL configuration file must be in the class-path of the |
| DistCp program.</p> |
| |
| </section> |
| |
| </body> |
| </document> |