| <html> |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <body> |
| |
<p>A software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.</p>
| |
<p>A Map-Reduce <i>job</i> usually splits the input data-set into independent
chunks which are processed by the <i>map</i> tasks in a completely parallel
manner, followed by the <i>reduce</i> tasks which aggregate their output.
Typically both the input and the output of the job are stored in a
{@link org.apache.hadoop.fs.FileSystem}. The framework takes care of monitoring
tasks and re-executing failed ones. Since the compute nodes and the storage
nodes are usually the same, i.e. Hadoop's Map-Reduce framework and Distributed
FileSystem run on the same set of nodes, tasks are effectively scheduled on
the nodes where the data is already present, resulting in very high aggregate
bandwidth across the cluster.</p>
| |
| <p>The Map-Reduce framework operates exclusively on <tt><key, value></tt> |
| pairs i.e. the input to the job is viewed as a set of <tt><key, value></tt> |
| pairs and the output as another, possibly different, set of |
| <tt><key, value></tt> pairs. The <tt>key</tt>s and <tt>value</tt>s have to |
| be serializable as {@link org.apache.hadoop.io.Writable}s and additionally the |
| <tt>key</tt>s have to be {@link org.apache.hadoop.io.WritableComparable}s in |
| order to facilitate grouping by the framework.</p> |
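<p>For illustration, here is a minimal sketch of a custom key type (the class
name and field are hypothetical, not part of the framework); it implements both
the serialization contract of <tt>Writable</tt> and the ordering contract of
<tt>WritableComparable</tt>:</p>
<pre><tt>
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class YearWritable implements WritableComparable<YearWritable> {

  private int year;

  public YearWritable() {}                  // <i>Writables need a no-arg constructor</i>
  public YearWritable(int year) { this.year = year; }

  // <i>serialization contract (Writable)</i>
  public void write(DataOutput out) throws IOException { out.writeInt(year); }
  public void readFields(DataInput in) throws IOException { year = in.readInt(); }

  // <i>ordering contract (WritableComparable): defines how the framework</i>
  // <i>sorts and groups the keys</i>
  public int compareTo(YearWritable other) {
    return (year < other.year) ? -1 : ((year == other.year) ? 0 : 1);
  }

  public int hashCode() { return year; }    // <i>keeps hash-partitioning consistent</i>
  public boolean equals(Object o) {
    return (o instanceof YearWritable) && (((YearWritable) o).year == year);
  }
}
</tt></pre>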
| |
| <p>Data flow:</p> |
<pre>
(input) <tt><k1, v1></tt>
           |
           V
         <b>map</b>
           |
           V
      <tt><k2, v2></tt>
           |
           V
       <b>combine</b>
           |
           V
      <tt><k2, v2></tt>
           |
           V
        <b>reduce</b>
           |
           V
<tt><k3, v3></tt> (output)
</pre>
| |
| <p>Applications typically implement |
| {@link org.apache.hadoop.mapred.Mapper#map(Object, Object, OutputCollector, Reporter)} |
| and |
| {@link org.apache.hadoop.mapred.Reducer#reduce(Object, Iterator, OutputCollector, Reporter)} |
methods. The application writer also specifies various facets of the job, such
as the input and output locations and the <tt>Partitioner</tt>, <tt>InputFormat</tt>
and <tt>OutputFormat</tt> implementations to be used, via a
{@link org.apache.hadoop.mapred.JobConf}. The client program,
{@link org.apache.hadoop.mapred.JobClient}, then submits the job to the framework
and optionally monitors it.</p>
| |
| <p>The framework spawns one map task per |
| {@link org.apache.hadoop.mapred.InputSplit} generated by the |
| {@link org.apache.hadoop.mapred.InputFormat} of the job and calls |
| {@link org.apache.hadoop.mapred.Mapper#map(Object, Object, OutputCollector, Reporter)} |
| with each <key, value> pair read by the |
| {@link org.apache.hadoop.mapred.RecordReader} from the <tt>InputSplit</tt> for |
the task. The intermediate outputs of the maps are then grouped by <tt>key</tt>
and optionally aggregated by the <i>combiner</i>. The key space of the
intermediate outputs is partitioned by the
{@link org.apache.hadoop.mapred.Partitioner}, where the number of partitions is
exactly the number of reduce tasks for the job.</p>
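<p>For illustration, a minimal sketch of a custom <tt>Partitioner</tt> (the
class name is hypothetical) that mirrors the behaviour of the default
hash-partitioning, i.e. hashing the key modulo the number of reduce tasks:</p>
<pre><tt>
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashKeyPartitioner implements Partitioner<Text, LongWritable> {

  public void configure(JobConf job) {}     // <i>nothing to configure</i>

  public int getPartition(Text key, LongWritable value, int numPartitions) {
    // <i>mask the sign bit so the result is always a valid partition index</i>
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
</tt></pre>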
| |
<p>The reduce tasks fetch the sorted intermediate outputs of the maps via HTTP,
merge the <key, value> pairs and call
{@link org.apache.hadoop.mapred.Reducer#reduce(Object, Iterator, OutputCollector, Reporter)}
for each <key, list of values> pair. The output of the reduce tasks is
stored on the <tt>FileSystem</tt> by the
{@link org.apache.hadoop.mapred.RecordWriter} provided by the
{@link org.apache.hadoop.mapred.OutputFormat} of the job.</p>
| |
| <p>Map-Reduce application to perform a distributed <i>grep</i>:</p> |
| <pre><tt> |
| public class Grep extends Configured implements Tool { |
| |
| // <i>map: Search for the pattern specified by 'grep.mapper.regex' &</i> |
| // <i>'grep.mapper.regex.group'</i> |
| |
public static class GrepMapper<K>
extends MapReduceBase implements Mapper<K, Text, Text, LongWritable> {
| |
| private Pattern pattern; |
| private int group; |
| |
| public void configure(JobConf job) { |
| pattern = Pattern.compile(job.get("grep.mapper.regex")); |
| group = job.getInt("grep.mapper.regex.group", 0); |
| } |
| |
| public void map(K key, Text value, |
| OutputCollector<Text, LongWritable> output, |
| Reporter reporter) |
| throws IOException { |
| String text = value.toString(); |
| Matcher matcher = pattern.matcher(text); |
| while (matcher.find()) { |
| output.collect(new Text(matcher.group(group)), new LongWritable(1)); |
| } |
| } |
| } |
| |
| // <i>reduce: Count the number of occurrences of the pattern</i> |
| |
public static class GrepReducer<K> extends MapReduceBase
implements Reducer<K, LongWritable, K, LongWritable> {
| |
| public void reduce(K key, Iterator<LongWritable> values, |
| OutputCollector<K, LongWritable> output, |
| Reporter reporter) |
| throws IOException { |
| |
| // sum all values for this key |
| long sum = 0; |
| while (values.hasNext()) { |
| sum += values.next().get(); |
| } |
| |
| // output sum |
| output.collect(key, new LongWritable(sum)); |
| } |
| } |
| |
| public int run(String[] args) throws Exception { |
| if (args.length < 3) { |
| System.out.println("Grep <inDir> <outDir> <regex> [<group>]"); |
| ToolRunner.printGenericCommandUsage(System.out); |
| return -1; |
| } |
| |
| JobConf grepJob = new JobConf(getConf(), Grep.class); |
| |
| grepJob.setJobName("grep"); |
| |
| FileInputFormat.setInputPaths(grepJob, new Path(args[0])); |
FileOutputFormat.setOutputPath(grepJob, new Path(args[1]));
| |
| grepJob.setMapperClass(GrepMapper.class); |
| grepJob.setCombinerClass(GrepReducer.class); |
| grepJob.setReducerClass(GrepReducer.class); |
| |
grepJob.set("grep.mapper.regex", args[2]);
if (args.length == 4)
grepJob.set("grep.mapper.regex.group", args[3]);
| |
| grepJob.setOutputFormat(SequenceFileOutputFormat.class); |
| grepJob.setOutputKeyClass(Text.class); |
| grepJob.setOutputValueClass(LongWritable.class); |
| |
| JobClient.runJob(grepJob); |
| |
| return 0; |
| } |
| |
| public static void main(String[] args) throws Exception { |
| int res = ToolRunner.run(new Configuration(), new Grep(), args); |
| System.exit(res); |
| } |
| |
| } |
| </tt></pre> |
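<p>Assuming the above class is compiled and packaged into a job jar (the jar
name and paths below are hypothetical), it can be run from the command line as:</p>
<pre>
bin/hadoop jar grep.jar Grep in-dir out-dir 'dfs[a-z.]+'
</pre>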
| |
<p>Notice how the data flow of the above grep job closely resembles the
equivalent Unix pipeline:</p>
| |
| <pre> |
cat input/* | grep <regex> | sort | uniq -c > out
| </pre> |
| |
| <pre> |
| input | map | shuffle | reduce > out |
| </pre> |
| |
| <p>Hadoop Map-Reduce applications need not be written in |
| Java<small><sup>TM</sup></small> only. |
| <a href="../streaming/package-summary.html">Hadoop Streaming</a> is a utility |
| which allows users to create and run jobs with any executables (e.g. shell |
| utilities) as the mapper and/or the reducer. |
| <a href="pipes/package-summary.html">Hadoop Pipes</a> is a |
| <a href="http://www.swig.org/">SWIG</a>-compatible <em>C++ API</em> to implement |
| Map-Reduce applications (non JNI<small><sup>TM</sup></small> based).</p> |
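<p>For example, a minimal Streaming job (the streaming jar's exact location
varies by installation) can use stock shell utilities as the mapper and the
reducer:</p>
<pre>
bin/hadoop jar hadoop-streaming.jar \
    -input in-dir -output out-dir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
</pre>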
| |
| <p>See <a href="http://labs.google.com/papers/mapreduce.html">Google's original |
| Map/Reduce paper</a> for background information.</p> |
| |
| <p><i>Java and JNI are trademarks or registered trademarks of |
| Sun Microsystems, Inc. in the United States and other countries.</i></p> |
| |
| </body> |
| </html> |