| <!-- |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| https://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| --> |
| |
| # Running a bulk ingest test |
| |
Continuous ingest supports bulk ingest in addition to live ingest. A MapReduce
job can be run that generates rfiles aligned with the table's splits. Running
that job in a loop like the following continually bulk imports data.
| |
| ```bash |
| # create the ci table if necessary |
| ./bin/cingest createtable |
| |
| # Optionally, consider lowering the split threshold to make splits happen more |
# frequently while the test runs. Choose a threshold based on the amount of data
| # being imported and the desired number of splits. |
| # |
| # accumulo shell -u root -p secret -e 'config -t ci -s table.split.threshold=32M' |
| |
| for i in $(seq 1 10); do |
| # run map reduce job to generate data for bulk import |
| ./bin/cingest bulk /tmp/bt/$i |
| # ask accumulo to import generated data |
| echo -e "table ci\nimportdirectory /tmp/bt/$i/files true" | accumulo shell -u root -p secret |
| done |
| ./bin/cingest verify |
| ``` |
| |
Another way to use this in testing is to generate a lot of data and then bulk import it all at once, as follows.
| |
| ```bash |
| for i in $(seq 1 10); do |
| ./bin/cingest bulk /tmp/bt/$i |
| done |
| |
| # Optionally, copy data before importing. This can be useful in debugging problems. |
| hadoop distcp hdfs://$NAMENODE/tmp/bt hdfs://$NAMENODE/tmp/bt-copy |
| |
| for i in $(seq 1 10); do |
| ( |
| echo table ci |
| echo "importdirectory /tmp/bt/$i/files true" |
| ) | accumulo shell -u root -p secret |
| sleep 5 |
| done |
| |
| ./bin/cingest verify |
| ``` |
| |
Bulk ingest can be run concurrently with live ingest into the same table. It
can also be run while the agitator is running.
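
For example, a live ingest client could be started before the bulk loop and
stopped after it finishes. A minimal sketch, assuming the standard `cingest
ingest` client and the same credentials as above; note that with live ingest
running, the expected verify counts described below apply only to pure bulk
runs, since live entries are counted as well:

```bash
# Start a live ingest client in the background (more clients could be
# started the same way).
./bin/cingest ingest &
LIVE_PID=$!

# Run the bulk import loop from above while live ingest is writing.
for i in $(seq 1 10); do
  ./bin/cingest bulk /tmp/bt/$i
  echo -e "table ci\nimportdirectory /tmp/bt/$i/files true" | accumulo shell -u root -p secret
done

# Stop the live ingest client before verifying.
kill $LIVE_PID
./bin/cingest verify
```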
| |
After the bulk imports complete, run the following commands in the Accumulo shell
to check for any BLIP (bulk load in progress) or load markers. There should
not be any.
| |
| ``` |
| scan -t accumulo.metadata -b ~blip -e ~blip~ |
| scan -t accumulo.metadata -c loaded |
| ``` |
| |
Additionally, check that no rfiles remain in the source directory.
| |
| ```bash |
| hadoop fs -ls -R /tmp/bt | grep rf |
| ``` |
| |
The referenced counts output by `cingest verify` should equal:
| |
| ``` |
test.ci.bulk.map.task * (test.ci.bulk.map.nodes - 1) * num_bulk_generate_jobs
| ``` |
| |
The unreferenced counts output by `cingest verify` should equal:
| |
| ``` |
| test.ci.bulk.map.task * num_bulk_generate_jobs |
| ``` |
| |
It's possible the counts could be slightly smaller because of collisions. However,
collisions are unlikely with the default settings, given there are 63 bits of
randomness in the row and 30 bits in the column, for a total of 93 bits of
randomness per key.
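
As a rough illustration (a standard birthday-bound estimate, not something the
test reports), the expected number of colliding key pairs among N generated
keys is approximately:

```
N^2 / 2^94
```

Even for a run generating ten billion keys, this works out to roughly 5e-9
expected collisions.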
| |