<html>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<body>
Run <a href="http://hadoop.apache.org/">Hadoop</a> MapReduce jobs over
Avro data, with map and reduce functions written in Java.
<p>Avro data files contain a sequence of values rather than the
key/value pairs that Hadoop's MapReduce API expects. This package
therefore provides a layer on top of Hadoop's MapReduce API that adapts
Avro data to it.</p>
<p>In all cases, input and output paths are set and jobs are submitted
as with standard Hadoop jobs:
<ul>
<li>Specify input files with {@link
org.apache.hadoop.mapred.FileInputFormat#setInputPaths}.</li>
<li>Specify an output directory with {@link
org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}.</li>
<li>Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}.</li>
</ul>
</p>
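<p>The three steps above can be sketched as follows. The class
<code>JobPaths</code>, its method names, and the path arguments are
hypothetical and exist only for illustration; the three Hadoop calls are
the ones named in the list.</p>

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper class; not part of Avro or Hadoop.
class JobPaths {

  /** Sets the input files and the output directory on an existing job. */
  static JobConf withPaths(JobConf job, String input, String output) {
    FileInputFormat.setInputPaths(job, new Path(input));
    FileOutputFormat.setOutputPath(job, new Path(output));
    return job;
  }

  /** Submits the job and blocks until it completes. */
  static void run(JobConf job) throws IOException {
    JobClient.runJob(job);
  }
}
```

<p>The Avro-specific configuration described in the following sections
would be applied to the same <code>JobConf</code> before it is
submitted.</p>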
<p>For jobs whose input and output are Avro data files:
<ul>
<li>Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} and
{@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
job's input and output schemas.</li>
<li>Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
this as your job's mapper with {@link
org.apache.avro.mapred.AvroJob#setMapperClass}.</li>
<li>Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
this as your job's reducer and perhaps combiner, with {@link
org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
org.apache.avro.mapred.AvroJob#setCombinerClass}.</li>
</ul>
</p>
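<p>As an illustration of the steps above, here is a minimal word-count
sketch. It assumes that each input datum is an Avro string holding one
line of text and that the output is a file of
<code>org.apache.avro.mapred.Pair</code> records mapping a word to its
count. The class names <code>AvroWordCount</code>,
<code>TokenMapper</code> and <code>SumReducer</code> are hypothetical.</p>

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example class; not part of Avro.
class AvroWordCount {

  /** Emits a (word, 1) pair for every token of an input line. */
  public static class TokenMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  /** Sums the counts seen for one word; also used as the combiner. */
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
                       AvroCollector<Pair<Utf8, Long>> collector,
                       Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  /** Applies the Avro-specific settings; paths are set separately. */
  static JobConf configure(JobConf job) {
    Schema text = Schema.create(Schema.Type.STRING);
    Schema wordCount = Pair.getPairSchema(text, Schema.create(Schema.Type.LONG));

    AvroJob.setInputSchema(job, text);
    AvroJob.setOutputSchema(job, wordCount);
    AvroJob.setMapperClass(job, TokenMapper.class);
    AvroJob.setCombinerClass(job, SumReducer.class);
    AvroJob.setReducerClass(job, SumReducer.class);
    return job;
  }
}
```

<p>Because the reducer here emits the same <code>Pair</code> type that
the mapper emits, the sketch reuses it as the combiner. A reducer whose
output type differs from the map output would need a separate combiner
class.</p>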
<p>For jobs whose input is an Avro data file and which use an {@link
org.apache.avro.mapred.AvroMapper}, but whose reducer is a non-Avro
{@link org.apache.hadoop.mapred.Reducer} and whose output is a
non-Avro format:
<ul>
<li>Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} with your
job's input schema.</li>
<li>Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
this as your job's mapper with {@link
org.apache.avro.mapred.AvroJob#setMapperClass}.</li>
<li>Implement {@link org.apache.hadoop.mapred.Reducer} and specify
this as your job's reducer and perhaps combiner, with {@link
org.apache.hadoop.mapred.JobConf#setReducerClass} and {@link
org.apache.hadoop.mapred.JobConf#setCombinerClass}. The reducer's input key
and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link
org.apache.avro.mapred.AvroValue}.</li>
<li>Specify your job's output key and value types with {@link
org.apache.hadoop.mapred.JobConf#setOutputKeyClass} and {@link
org.apache.hadoop.mapred.JobConf#setOutputValueClass}.</li>
<li>Specify your job's output format with {@link
org.apache.hadoop.mapred.JobConf#setOutputFormat}.</li>
</ul>
</p>
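<p>A sketch of this mixed configuration, again as a word count: the
mapper reads Avro strings, and a plain Hadoop reducer writes text output.
The class names are hypothetical. The sketch also calls
<code>AvroJob.setMapOutputSchema</code>, which the list above does not
mention; this rests on the assumption that, with no Avro output schema
set, the shuffle needs the schema of the pairs the mapper emits.</p>

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

// Hypothetical example class; not part of Avro.
class AvroToTextWordCount {

  /** Avro mapper: emits a (word, 1) pair for every token of a line. */
  public static class TokenMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
    @Override
    public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  /** Hadoop reducer: receives AvroKey/AvroValue, emits Writable types. */
  public static class SumToTextReducer extends MapReduceBase
      implements Reducer<AvroKey<Utf8>, AvroValue<Long>, Text, LongWritable> {
    @Override
    public void reduce(AvroKey<Utf8> word, Iterator<AvroValue<Long>> counts,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().datum();
      }
      output.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  /** Applies the job settings; input and output paths are set separately. */
  static JobConf configure(JobConf job) {
    Schema text = Schema.create(Schema.Type.STRING);

    AvroJob.setInputSchema(job, text);
    // Assumption: needed because no Avro output schema is set for this job.
    AvroJob.setMapOutputSchema(job,
        Pair.getPairSchema(text, Schema.create(Schema.Type.LONG)));
    AvroJob.setMapperClass(job, TokenMapper.class);

    job.setReducerClass(SumToTextReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormat(TextOutputFormat.class);
    return job;
  }
}
```

<p>No combiner is configured in this sketch, because the reducer's
output types differ from the map output types.</p>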
<p>For jobs whose input is a non-Avro data file and which use a
non-Avro {@link org.apache.hadoop.mapred.Mapper}, but whose reducer
is an {@link org.apache.avro.mapred.AvroReducer} and whose output is
an Avro data file:
<ul>
<li>Set your input file format with {@link
org.apache.hadoop.mapred.JobConf#setInputFormat}.</li>
<li>Implement {@link org.apache.hadoop.mapred.Mapper} and specify
your job's mapper with {@link
org.apache.hadoop.mapred.JobConf#setMapperClass}. The mapper's output key
and value types should be {@link org.apache.avro.mapred.AvroKey} and
{@link org.apache.avro.mapred.AvroValue}.</li>
<li>Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
this as your job's reducer and perhaps combiner, with {@link
org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
org.apache.avro.mapred.AvroJob#setCombinerClass}.</li>
<li>Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
job's output schema.</li>
</ul>
</p>
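<p>A sketch of the reverse configuration: a plain Hadoop mapper reads
lines of text and an Avro reducer writes an Avro data file of word
counts. The class names are hypothetical, and the choice of
<code>TextInputFormat</code> as the input format is an assumption made
for the example.</p>

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical example class; not part of Avro.
class TextToAvroWordCount {

  /** Hadoop mapper: reads text lines, emits AvroKey/AvroValue pairs. */
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, AvroKey<Utf8>, AvroValue<Long>> {
    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<AvroKey<Utf8>, AvroValue<Long>> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        output.collect(new AvroKey<Utf8>(new Utf8(tokens.nextToken())),
                       new AvroValue<Long>(1L));
      }
    }
  }

  /** Avro reducer: sums the counts of one word into a Pair record. */
  public static class SumReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
    @Override
    public void reduce(Utf8 word, Iterable<Long> counts,
                       AvroCollector<Pair<Utf8, Long>> collector,
                       Reporter reporter) throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair<Utf8, Long>(word, sum));
    }
  }

  /** Applies the job settings; input and output paths are set separately. */
  static JobConf configure(JobConf job) {
    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(LineMapper.class);

    AvroJob.setReducerClass(job, SumReducer.class);
    AvroJob.setOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING),
                           Schema.create(Schema.Type.LONG)));
    return job;
  }
}
```

<p>In this sketch the map output and the final output share the same
pair schema, so only the output schema is set.</p>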
</body>
</html>