tour/batch-scanner.md

title: Batch Scanner

Running on a single thread, a Scanner will retrieve a single Range of data and return Keys in sorted order. A [BatchScanner] will retrieve multiple Ranges of data using multiple threads. A BatchScanner can be more efficient but does not guarantee Keys will be returned in sorted order.

For this exercise, we need to generate a bunch of data to test BatchScanner. Execute the code below to create our data set.

jshell> try (BatchWriter writer = client.createBatchWriter("GothamPD")) {
   ...>   for (int i = 0; i < 10_000; i++) {
   ...>     Mutation m = new Mutation(String.format("id%04d", i));
   ...>     m.put("villain", "alias", "henchman" + i);
   ...>     m.put("villain", "yearsOfService", "" + (new Random().nextInt(50)));
   ...>     m.put("villain", "wearsCape?", "false");
   ...>     writer.addMutation(m);
   ...>   }
   ...> }

We want to calculate the average years of service from a sample of 2000 villains. A BatchScanner would be good for this task because we don't need the returned keys to be sorted. Follow these steps to efficiently scan the table with 10,000 entries.

After the above code, create a BatchScanner with five query threads. Similar to a Scanner, use the [createBatchScanner] method.
Create an ArrayList of two sample Ranges (id1000 to id1999 and id9000 to id9999) and set the ranges of the [BatchScanner] using setRanges.
We can make the scan more efficient by only bringing back the columns we want. Use [fetchColumn] to get the villain family and yearsOfService qualifier.
Finally, use the BatchScanner to calculate the average years of service of 2000 villains.

[BatchScanner]: {% jurl org.apache.accumulo.core.client.BatchScanner %} [createBatchScanner]: {% jurl org.apache.accumulo.core.client.Connector#createBatchScanner-java.lang.String-org.apache.accumulo.core.security.Authorizations-int- %} [fetchColumn]: {% jurl org.apache.accumulo.core.client.ScannerBase#fetchColumn-org.apache.hadoop.io.Text-org.apache.hadoop.io.Text- %}