| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <html> |
| <head> |
| <title>Accumulo Bulk Ingest</title> |
| <link rel='stylesheet' type='text/css' href='documentation.css' media='screen'/> |
| </head> |
| <body> |
| |
| <h1>Apache Accumulo Documentation : Bulk Ingest</h1> |
| |
| <p>Accumulo can import sorted files produced by an |
| external process into an online table. It is often much faster to process |
| large amounts of data with map/reduce to produce these files. |
| The new files can then be incorporated into Accumulo using bulk ingest, as follows: |
| |
| <ul> |
| <li>Construct an <code>org.apache.accumulo.core.client.Connector</code> instance</li> |
| <li>Call <code>connector.tableOperations().getSplits()</code></li> |
| <li>Run a map/reduce job using <code>RangePartitioner</code> |
| with splits from the previous step</li> |
| <li>Call <code>connector.tableOperations().importDirectory()</code>, passing the output directory of the map/reduce job</li> |
| </ul> |
| |
| <p>Files can also be imported using the "importdirectory" shell command. |
| |
| <p>A complete example is available in <a href='examples/README.bulkIngest'>README.bulkIngest</a>. |
| |
| <p>Importing data using whole files of sorted data can be very efficient, but it differs |
| from live ingest in the following ways: |
| <ul> |
| <li>Table constraints are not applied to the data in the file. |
| <li>Adding new files to a table is likely to trigger major compactions. |
| <li>Timestamps in the file could contain unexpected values. Accumulo can be asked to use the ingest timestamp for all values if this is a concern. |
| <li>It is possible to create invalid visibility values (for example "&amp;|"). This will cause errors when the data is accessed. |
| <li>Bulk imports do not affect the entry counts on the monitor page until the files are compacted. |
| </ul> |
| |
| <h2>Best Practices</h2> |
| |
| <p>Consider two approaches to creating ingest files using map/reduce. |
| |
| <ol> |
| <li>A large file containing the Key/Value pairs for only a single tablet. |
| <li>A set of small files containing Key/Value pairs for every tablet. |
| </ol> |
| |
| <p>In the first case, adding the file requires telling a single tablet server about a single file. Even if the file |
| is 20 GB in size, it is one call to the tablet server. The tablet server makes one extra file entry in the |
| tablet's metadata, and the data is now part of the tablet. |
| |
| <p>In the second case, a request must be made for each tablet for each file to be added. If there |
| are 100 files and 100 tablets, this will be 10,000 requests, and the number of files that must be opened |
| for scans on these tablets will be very large. Major compactions will most likely start, which will eventually |
| fix the problem, but Accumulo has to do much more work to read these files. |
| |
| <p>Getting good, fast, bulk import performance depends on creating files like the first, and avoiding files like |
| the second. |
| |
| <p>For this reason, a <code>RangePartitioner</code> should be used to create files when |
| writing with the <code>AccumuloFileOutputFormat</code>. |
| |
| <p>Hash partitioning is not recommended because it puts keys in random |
| groups, exactly like the second approach above. |
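<p>The difference between the two partitioning schemes can be sketched in plain Java, without
any Accumulo or Hadoop classes. This is only an illustration under stated assumptions: the
class <code>PartitionComparison</code>, its methods, and the sample rows are hypothetical,
<code>String.hashCode()</code> stands in for the hash of a real key type, and the hash formula
follows the style of Hadoop's <code>HashPartitioner</code>. It counts the distinct
(output file, tablet) pairs, which is the number of assignment requests a bulk import would need.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PartitionComparison {
    /** Index of the tablet (and range partition) holding a row, given sorted split points. */
    static int rangePartition(List<String> splits, String row) {
        int idx = Collections.binarySearch(splits, row);
        // A row equal to a split point belongs to the tablet that ends at that split.
        return idx >= 0 ? idx : -(idx + 1);
    }

    /** Hash partitioning in the style of Hadoop's HashPartitioner. */
    static int hashPartition(String row, int numPartitions) {
        return (row.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Counts distinct (output file, tablet) pairs, one per assignment request. */
    static int countRequests(List<String> rows, List<String> splits, boolean useRange) {
        int numPartitions = splits.size() + 1; // one reducer per tablet
        Set<Long> pairs = new HashSet<>();
        for (String row : rows) {
            int tablet = rangePartition(splits, row);
            int file = useRange ? tablet : hashPartition(row, numPartitions);
            pairs.add((long) file * numPartitions + tablet);
        }
        return pairs.size();
    }

    /** Hypothetical sample data: zero-padded rows, so they sort lexicographically. */
    static List<String> sampleRows(int n) {
        List<String> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            rows.add(String.format("row%05d", i));
        }
        return rows;
    }

    /** Places a split after every rowsPerTablet rows, leaving the last tablet open-ended. */
    static List<String> sampleSplits(List<String> rows, int rowsPerTablet) {
        List<String> splits = new ArrayList<>();
        for (int i = rowsPerTablet - 1; i < rows.size() - 1; i += rowsPerTablet) {
            splits.add(rows.get(i));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> rows = sampleRows(10000);
        List<String> splits = sampleSplits(rows, 100); // 99 splits, 100 tablets
        System.out.println("range partitioning requests: " + countRequests(rows, splits, true));
        System.out.println("hash partitioning requests:  " + countRequests(rows, splits, false));
    }
}
```

<p>With range partitioning each output file maps to exactly one tablet, so the sample data needs
one request per tablet. With hash partitioning each file holds rows from many tablets, so the
request count is many times larger.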
| |
| <p>Any set of cut points for range partitioning can be used in a |
| map/reduce job, but using Accumulo's current splits is usually the |
| best choice. However, in some cases there may be too many |
| splits. For example, if there are 2000 splits, you would need to run |
| 2001 reducers. To overcome this problem, use the |
| <code>connector.tableOperations().getSplits(&lt;table name&gt;, &lt;max |
| splits&gt;)</code> method. This method will not return more than |
| <code>&lt;max splits&gt;</code> splits, but the splits it returns |
| will optimally partition the data for Accumulo. |
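<p>The idea of capping the number of splits can be sketched as follows. This is not Accumulo's
implementation; it is a minimal illustration that assumes tablets hold roughly equal amounts of
data and simply picks evenly spaced split points. The class <code>SplitSubset</code> and the
method <code>limitSplits</code> are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSubset {
    /** Returns at most maxSplits split points, spaced evenly through the sorted input. */
    static List<String> limitSplits(List<String> splits, int maxSplits) {
        if (maxSplits <= 0) {
            return new ArrayList<>();
        }
        if (splits.size() <= maxSplits) {
            return new ArrayList<>(splits);
        }
        int tablets = splits.size() + 1;
        int groups = maxSplits + 1;
        List<String> subset = new ArrayList<>(maxSplits);
        for (int j = 1; j <= maxSplits; j++) {
            // Pick the split that ends the j-th of (maxSplits + 1) groups of tablets.
            int idx = (int) ((long) j * tablets / groups) - 1;
            subset.add(splits.get(idx));
        }
        return subset;
    }

    public static void main(String[] args) {
        List<String> splits = new ArrayList<>();
        for (int i = 1; i <= 2000; i++) {
            splits.add(String.format("row%05d", i));
        }
        List<String> limited = limitSplits(splits, 100);
        // Reducers needed = number of splits + 1.
        System.out.println((splits.size() + 1) + " reducers reduced to " + (limited.size() + 1));
    }
}
```

<p>Each reducer then writes one file that spans a contiguous group of tablets, so the file count
stays manageable while every file still touches only a few tablets.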
| |
| <p>Remember that Accumulo never splits rows across tablets. |
| Therefore the range partitioner only considers rows when partitioning. |
| |
| <p>When bulk importing many files into a new table, consider pre-splitting the table so that |
| more resources are available to accept the data. For example, if you know your data is indexed by |
| date, pre-creating splits for each day will allow files to fall into natural splits. Having more tablets |
| accept the new data means that more resources can be used to import the data right away. |
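<p>As a sketch of the date example, the following generates one split point per day. It assumes
rows begin with a date in <code>yyyyMMdd</code> form, which is an illustrative choice and not
something Accumulo requires. The class <code>DailySplits</code> is a hypothetical name. The
resulting strings would still have to be converted to the key type your client uses and added to
the table, for example through the table operations' <code>addSplits</code> call.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

final class DailySplits {
    /** One split point per day from first to last inclusive, formatted as yyyyMMdd. */
    static List<String> dailySplits(LocalDate first, LocalDate last) {
        List<String> splits = new ArrayList<>();
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            splits.add(d.format(DateTimeFormatter.BASIC_ISO_DATE));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Prints [20120228, 20120229, 20120301]
        System.out.println(dailySplits(LocalDate.of(2012, 2, 28), LocalDate.of(2012, 3, 1)));
    }
}
```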
| |
| <p>An alternative to bulk ingest is to have a map/reduce job use |
| <code>AccumuloOutputFormat</code>, which can support billions of inserts per |
| hour, depending on the size of your cluster. This is sufficient for |
| most users, but bulk ingest remains the fastest way to incorporate |
| data into Accumulo. In addition, bulk ingest has one advantage over |
| <code>AccumuloOutputFormat</code>: there is no duplicate data insertion. When |
| map/reduce writes data directly to Accumulo, restarted jobs may re-insert |
| data from previous failed attempts. Generally, this only matters when |
| there are aggregators. With bulk ingest, reducers write to new |
| map files, so it does not matter. If a reducer fails, a new |
| map file is created. When all reducers finish, you bulk ingest the map files |
| into Accumulo. The disadvantage of bulk ingest compared to <code>AccumuloOutputFormat</code> is |
| greater latency: the entire map/reduce job must complete |
| before any data is available. |
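<p>The duplicate-insertion point can be illustrated with a small simulation. It is only a sketch
under stated assumptions: a <code>TreeMap</code> with a summing merge stands in for a table with
a summing aggregator, each write contributes a count of 1, and the class
<code>RetryDuplicates</code> and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RetryDuplicates {
    /** Adds a count of 1 for each row, summing with any value already present. */
    static void apply(Map<String, Long> table, List<String> rows) {
        for (String row : rows) {
            table.merge(row, 1L, Long::sum);
        }
    }

    /** Live ingest: the first attempt dies after failAfter writes, then is rerun in full. */
    static Map<String, Long> liveWithRetry(List<String> rows, int failAfter) {
        Map<String, Long> table = new TreeMap<>();
        apply(table, rows.subList(0, failAfter)); // writes from the failed attempt stay in the table
        apply(table, rows);                       // the retry writes everything again
        return table;
    }

    /** Bulk ingest: the failed attempt's partial file is discarded and never imported. */
    static Map<String, Long> bulkWithRetry(List<String> rows, int failAfter) {
        Map<String, Long> table = new TreeMap<>();
        List<String> completedFile = rows; // only the retry's complete file is imported
        apply(table, completedFile);
        return table;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c");
        System.out.println("live: " + liveWithRetry(rows, 2)); // live: {a=2, b=2, c=1}
        System.out.println("bulk: " + bulkWithRetry(rows, 2)); // bulk: {a=1, b=1, c=1}
    }
}
```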
| |
| </body> |
| </html> |