Apache Accumulo Customizing the Compaction Strategy

This tutorial uses the following Java classes, which can be found in org.apache.accumulo.tserver.compaction:

  • DefaultCompactionStrategy.java - determines which files to compact based on table.compaction.major.ratio and table.file.max
  • EverythingCompactionStrategy.java - compacts all files
  • SizeLimitCompactionStrategy.java - compacts files no bigger than table.majc.compaction.strategy.opts.sizeLimit
  • TwoTierCompactionStrategy.java - uses default compression for smaller files and table.majc.compaction.strategy.opts.file.large.compress.type for larger files

This is an example of how to configure a compaction strategy. By default, Accumulo always uses the DefaultCompactionStrategy unless these steps are taken to change the configuration. Use the strategy and settings that best fit your Accumulo setup. This example shows how to configure and test one of the more complicated strategies, the TwoTierCompactionStrategy. Note that this example requires Hadoop native libraries built with snappy in order to use snappy compression.

To begin, run the command to create a table for testing:

$ accumulo shell -u root -p secret -e "createtable test1"

The command below sets the compression for smaller files and minor compactions for that table.

$ accumulo shell -u root -p secret -e "config -s table.file.compress.type=snappy -t test1"

The commands below will configure the TwoTierCompactionStrategy to use gz compression for files larger than 1M.

$ accumulo shell -u root -p secret -e "config -s table.majc.compaction.strategy.opts.file.large.compress.threshold=1M -t test1"
$ accumulo shell -u root -p secret -e "config -s table.majc.compaction.strategy.opts.file.large.compress.type=gz -t test1"
$ accumulo shell -u root -p secret -e "config -s table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.TwoTierCompactionStrategy -t test1"
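The effect of the threshold configured above can be pictured as a simple size comparison. The following is only an illustrative sketch, not the strategy's actual code: the function names to_bytes and choose_compression are hypothetical, the 1M suffix is assumed to mean a binary multiple (1024 * 1024 bytes), and how Accumulo measures the size it compares against the threshold is not covered here.

```shell
#!/bin/sh
# Convert a size such as 1M to bytes.
# Assumption: K, M and G are binary multiples (1024, 1024^2, 1024^3).
to_bytes() {
  case "$1" in
    *K) echo $(( ${1%K} * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

# Pick the compression type for a given size in bytes.
# $1 = size in bytes, $2 = threshold (defaults to the 1M used in this example).
choose_compression() {
  size_bytes=$1
  threshold_bytes=$(to_bytes "${2:-1M}")
  if [ "$size_bytes" -gt "$threshold_bytes" ]; then
    echo gz       # table.majc.compaction.strategy.opts.file.large.compress.type
  else
    echo snappy   # table.file.compress.type
  fi
}

choose_compression 2097152 1M
choose_compression 512 1M
```

Under these assumptions, a 2 MiB input is reported as gz and a 512-byte input as snappy.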

Generate some data and files in order to test the strategy:

$ ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num 10000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
$ accumulo shell -u root -p secret -e "flush -t test1"
$ ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num 11000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
$ accumulo shell -u root -p secret -e "flush -t test1"
$ ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num 12000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
$ accumulo shell -u root -p secret -e "flush -t test1"
$ ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num 13000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
$ accumulo shell -u root -p secret -e "flush -t test1"
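The four ingest-and-flush rounds above can also be scripted as a loop. The following is a minimal sketch and not part of the Accumulo examples: the DRY_RUN switch and the helper names run and generate_rounds are hypothetical. With DRY_RUN=1 (the default here) the commands are only printed, so the script can be checked before it touches a table.

```shell
#!/bin/sh
# DRY_RUN=1 prints each command instead of executing it (hypothetical switch).
DRY_RUN="${DRY_RUN:-1}"

# Print or execute a command, depending on DRY_RUN.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One ingest followed by one flush for each entry count used in this example.
generate_rounds() {
  for num in 10000 11000 12000 13000; do
    run ./bin/runex client.SequentialBatchWriter -t test1 --start 0 --num "$num" \
      --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20
    run accumulo shell -u root -p secret -e "flush -t test1"
  done
}

generate_rounds
```

Set DRY_RUN=0 to execute the commands. Note that the dry-run output drops the quotes around the flush command because it is printed with echo.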

View the tserver log in <accumulo_home>/logs for the compaction and find the name of the <rfile> that was compacted for your table. Print info about this file using the PrintInfo tool:

$ accumulo rfile-info <rfile>

Details about the rfile will be printed and the compression type should match the type used in the compaction...

Meta block     : RFile.index
      Raw size             : 512 bytes
      Compressed size      : 278 bytes
      Compression type     : gz
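To check the compression type without reading the whole report, the output can be filtered. The following is a small sketch that assumes the "Compression type : <value>" line format shown in this example; the function name compression_types is hypothetical, and the sample text stands in for real rfile-info output.

```shell
#!/bin/sh
# List the distinct compression types found in rfile-info output read from stdin.
compression_types() {
  awk -F' *: *' '/Compression type/ { print $2 }' | sort -u
}

# Sample input copied from the output shown in this example.
# In practice, pipe the real tool instead:
#   accumulo rfile-info <rfile> | compression_types
sample='Meta block     : RFile.index
      Raw size             : 512 bytes
      Compressed size      : 278 bytes
      Compression type     : gz'

printf '%s\n' "$sample" | compression_types
```

For the sample input this prints gz. More than one line of output would mean the file reports blocks with different compression types.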