<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<html>
<head>
<title>Accumulo Bulk Ingest</title>
<link rel='stylesheet' type='text/css' href='documentation.css' media='screen'/>
</head>
<body>
<h1>Apache Accumulo Documentation : Bulk Ingest</h1>
<p>Accumulo supports importing map files produced by an
external process into an online table. It is often much faster to churn
through large amounts of data with map/reduce and produce the map files,
which can then be incorporated into Accumulo using bulk ingest.
<p>An important caveat is that the map/reduce job must
use a range partitioner instead of the default hash partitioner.
The range partitioner uses the current split points of the
Accumulo table you want to ingest data into. To bulk ingest data
using map/reduce, take the following high-level steps; a driver
sketch follows the list.
<ul>
<li>Construct an <code>org.apache.accumulo.core.client.Connector</code> instance</li>
<li>Call <code>connector.tableOperations().getSplits()</code></li>
<li>Run a map/reduce job using <a href='apidocs/org/apache/accumulo/core/client/mapreduce/lib/partition/RangePartitioner.html'>RangePartitioner</a> with splits from the previous step</li>
<li>Call <code>connector.tableOperations().importDirectory()</code> passing the output directory of the map/reduce job</li>
</ul>
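<p>The following is a minimal sketch of those driver steps, assuming the
<code>Connector</code> API shown above; the instance name, zookeeper hosts,
credentials, table name, and HDFS paths are placeholders, and method
signatures vary slightly between Accumulo versions.
<pre>
import java.util.Collection;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.hadoop.io.Text;

public class BulkIngestDriver {
  public static void main(String[] args) throws Exception {
    Connector connector = new ZooKeeperInstance("myInstance", "zkhost:2181")
        .getConnector("user", "secret".getBytes());

    // Fetch the table's current split points; these drive the RangePartitioner.
    Collection&lt;Text&gt; splits = connector.tableOperations().getSplits("mytable");

    // ... write the splits to a file in HDFS and run the map/reduce job
    // that produces map files under /bulk/output ...

    // Import the map files produced by the job; files that cannot be
    // imported are moved to the failure directory for inspection.
    connector.tableOperations().importDirectory("mytable", "/bulk/output",
        "/bulk/failures", false);
  }
}
</pre>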
<p>A complete example is available in <a href='examples/README.bulkIngest'>README.bulkIngest</a>
<p>The reason hash partitioning is not recommended is that it can
place a lot of load on the system. Accumulo will look at
each map file and determine the tablets to which it should be
assigned. With hash partitioning, every map file could get assigned
to every tablet. If a tablet has too many map files, it will not allow
them to be opened for a query (opening too many map files can kill a
Hadoop Data Node). So queries would be disabled until major
compactions reduced the number of map files on the tablet. However,
when range partitioning using a table's splits, each tablet should
get only one map file.
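<p>For reference, wiring the <code>RangePartitioner</code> into the job might
look like the following sketch; the paths, reducer count, and the use of
<code>AccumuloFileOutputFormat</code> are assumptions to keep the example
short, and the split file must be written in the format
<code>RangePartitioner</code> expects (see README.bulkIngest).
<pre>
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class BulkJobSetup {
  public static Job configure(int numSplits) throws Exception {
    Job job = new Job(new Configuration(), "accumulo bulk ingest");

    // Partition reducer input by the table's split points so each tablet
    // should receive at most one map file from this job.
    job.setPartitionerClass(RangePartitioner.class);
    RangePartitioner.setSplitFile(job, "/bulk/splits.txt");

    // One reducer per split, plus one for the range after the last split.
    job.setNumReduceTasks(numSplits + 1);

    // Reducers write Accumulo map files directly.
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/bulk/output"));
    return job;
  }
}
</pre>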
<p>Any set of cut points for range partitioning can be used in a
map/reduce job, but using Accumulo's current splits is usually the
best choice. However, in some cases there may be too many
splits. For example, if there are 2000 splits, you would need to run
2001 reducers. To overcome this problem, use the
<code>connector.tableOperations().getSplits(&lt;table name&gt;,&lt;max
splits&gt;)</code> method. This method will not return more than
<code>&lt;max splits&gt;</code> splits, but the splits it returns
will optimally partition the data for Accumulo.
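<p>A small sketch of that call, reusing the <code>connector</code> and imports
from the driver sketch above; the table name and the cap of 100 splits are
placeholders.
<pre>
// Ask Accumulo for at most 100 split points that still partition the
// table's data evenly; the job then needs one reducer per split plus one.
Collection&lt;Text&gt; splits = connector.tableOperations().getSplits("mytable", 100);
int numReducers = splits.size() + 1;
</pre>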
<p>Remember that Accumulo never splits rows across tablets.
Therefore the range partitioner only considers rows when partitioning.
<p>An alternative to bulk ingest is to have a map/reduce job use
<code>AccumuloOutputFormat</code>, which can support billions of inserts per
hour, depending on the size of your cluster. This is sufficient for
most users, but bulk ingest remains the fastest way to incorporate
data into Accumulo. In addition, bulk ingest has one advantage over
<code>AccumuloOutputFormat</code>: there is no duplicate data insertion. When one uses
map/reduce to output data to Accumulo, restarted jobs may re-insert
data from previous failed attempts. Generally, this only matters when
there are aggregators. With bulk ingest, reducers write to new
map files, so duplicates are not an issue. If a reducer fails, it simply creates a new
map file. When all reducers finish, you bulk ingest the map files
into Accumulo. The disadvantage of bulk ingest compared to
<code>AccumuloOutputFormat</code> is that it is tightly coupled to
Accumulo internals. Therefore, a bulk ingest user may need to make
more changes to their code to switch to a new Accumulo version.
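<p>For comparison, a minimal sketch of a reducer that writes live mutations
through <code>AccumuloOutputFormat</code>; the table name, column family, and
qualifier are illustrative placeholders, and the job would also need
<code>AccumuloOutputFormat</code> configured with connection information.
<pre>
import java.io.IOException;

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LiveIngestReducer extends Reducer&lt;Text, Text, Text, Mutation&gt; {
  @Override
  protected void reduce(Text row, Iterable&lt;Text&gt; values, Context context)
      throws IOException, InterruptedException {
    Mutation m = new Mutation(row);
    for (Text value : values) {
      // Queue one cell update per input value under a placeholder column.
      m.put(new Text("cf"), new Text("cq"), new Value(value.toString().getBytes()));
    }
    // The output key names the destination table for AccumuloOutputFormat.
    context.write(new Text("mytable"), m);
  }
}
</pre>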
</body>
</html>