blob: 9e9896e24f81f36e102f3b8da235278647fe1c98 [file] [log] [blame]
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
See the License for the specific language governing permissions and
limitations under the License.
<title>Accumulo Bulk Ingest</title>
<link rel='stylesheet' type='text/css' href='documentation.css' media='screen'/>
<h1>Apache Accumulo Documentation : Bulk Ingest</h2>
<p>Accumulo supports the ability to import sorted files produced by an
external process into an online table. Often, it is much faster to churn
through large amounts of data using map/reduce to produce the these files.
The new files can be incorporated into Accumulo using bulk ingest.
<li>Construct an <code>org.apache.accumulo.core.client.Connector</code> instance</li>
<li>Call <code>connector.tableOperations().getSplits()</code></li>
<li>Run a map/reduce job using <code>RangePartitioner</code>
with splits from the previous step</li>
<li>Call <code>connector.tableOperations().importDirectory()</code> passing the output directory of the MapReduce job</li>
<p>Files can also be imported using the "importdirectory" shell command.
<p>A complete example is available in <a href='examples/README.bulkIngest'>README.bulkIngest</a>
<p>Importing data using whole files of sorted data can be very efficient, but it differs
from live ingest in the following ways:
<li>Table constraints are not applied against they data in the file.
<li>Adding new files to tables are likely to trigger major compactions.
<li>The timestamp in the file could contain strange values. Accumulo can be asked to use the ingest timestamp for all values if this is a concern.
<li>It is possible to create invalid visibility values (for example "&|"). This will cause errors when the data is accessed.
<li>Bulk imports do not effect the entry counts in the monitor page until the files are compacted.
<h2>Best Practices</h2>
<p>Consider two approaches to creating ingest files using map/reduce.
<li>A large file containing the Key/Value pairs for only a single tablet.
<li>A set of small files containing Key/Value pairs for every tablet.
<p>In the first case, adding the file requires telling a single tablet server about a single file. Even if the file
is 20G in size, it is one call to the tablet server. The tablet server makes one extra file entry in the
tablet's metadata, and the data is now part of the tablet.
<p>In the second case, an request must be made for each tablet for each file to be added. If there
100 files and 100 tablets, this will be 10K requests, and the number of files needed to be opened
for scans on these tablets will be very large. Major compactions will most likely start which will eventually
fix the problem, but a lot more work needs to be done by accumulo to read these files.
<p>Getting good, fast, bulk import performance depends on creating files like the first, and avoiding files like
the second.
<p>For this reason, a RangePartitioner should be used to create files when
writing with the AccumuloFileOutputFormat.
<p>Hash partition is not recommended because it will put keys in random
groups, exactly like our bad approach.
<P>Any set of cut points for range partitioning can be used in a map
reduce job, but using Accumulo's current splits is probably the most
optimal thing to do. However in some cases there may be too many
splits. For example if there are 2000 splits, you would need to run
2001 reducers. To overcome this problem use the
<code>connector.tableOperations.getSplits(&lt;table name&gt;,&lt;max
splits&gt;)</code> method. This method will not return more than
<code> &lt;max splits&gt; </code> splits, but the splits it returns
will optimally partition the data for Accumulo.
<p>Remember that Accumulo never splits rows across tablets.
Therefore the range partitioner only considers rows when partitioning.
<p>When bulk importing many files into a new table, it might be good to pre-split the table to bring
additional resources to accepting the data. For example, if you know your data is indexed based on the
date, pre-creating splits for each day will allow files to fall into natural splits. Having more tablets
accept the new data means that more resources can be used to import the data right away.
<p>An alternative to bulk ingest is to have a map/reduce job use
<code>AccumuloOutputFormat</code>, which can support billions of inserts per
hour, depending on the size of your cluster. This is sufficient for
most users, but bulk ingest remains the fastest way to incorporate
data into Accumulo. In addition, bulk ingest has one advantage over
AccumuloOutputFormat: there is no duplicate data insertion. When one uses
map/reduce to output data to accumulo, restarted jobs may re-enter
data from previous failed attempts. Generally, this only matters when
there are aggregators. With bulk ingest, reducers are writing to new
map files, so it does not matter. If a reduce fails, you create a new
map file. When all reducers finish, you bulk ingest the map files
into Accumulo. The disadvantage to bulk ingest over <code>AccumuloOutputFormat</code> is
greater latency: the entire map/reduce job must complete
before any data is available.