Apache Accumulo Customizing the Compaction Strategy

This is an example of how to configure a non-default compaction strategy. By default, Accumulo uses the DefaultCompactionStrategy unless steps like the ones below are taken to change the configuration. Use the strategy and settings that best fit your Accumulo setup. Note that this example requires Hadoop native libraries built with snappy in order to use snappy compression. Within this example, commands starting with user@uno> are run from within the Accumulo shell, whereas commands beginning with $ are executed from a command-line terminal.

Start by creating a table that will be used for the compactions.

user@uno> createnamespace examples
user@uno> createtable examples.test1

Take note of the TableID for examples.test1. This will be needed later. The TableID can be found by running:

user@uno> tables -l
accumulo.metadata    =>        !0
accumulo.replication =>      +rep
accumulo.root        =>        +r
examples.test1       =>         2
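
The TableID can also be looked up programmatically through the public Java client API. Below is a minimal sketch; the path to the client properties file is an assumption, so adjust it for your installation.

import java.util.Map;
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;

public class PrintTableId {
  public static void main(String[] args) throws Exception {
    // The properties path below is an assumption; point it at your client config.
    try (AccumuloClient client = Accumulo.newClient()
        .from("/path/to/accumulo-client.properties").build()) {
      // tableIdMap() returns the same table-name-to-TableID mapping
      // that the shell's `tables -l` command prints.
      Map<String, String> ids = client.tableOperations().tableIdMap();
      System.out.println("examples.test1 => " + ids.get("examples.test1"));
    }
  }
}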

The commands below will configure the desired compaction strategy. The goals are:

  • Avoid compacting files over 250M.
  • Compact files over 100M using gz.
  • Compact files less than 100M using snappy.
  • Limit the compaction throughput to 40MB/s.

Create a compaction service named cs1 that has three executors. The first executor, named small, has 8 threads and runs compactions whose combined input is less than 16M. The second executor, medium, runs compactions less than 128M with 4 threads. The last executor, large, runs all remaining compactions with 2 threads.

user@uno> config -s tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
user@uno> config -s 'tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]'

Create a compaction service named cs2 that also has three executors. It has a similar configuration to cs1, but its executors have fewer threads. For service cs2, files over 250M are not compacted. The service also limits the total I/O of all its compactions to 40MB/s.

user@uno> config -s tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
user@uno> config -s 'tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]' 
user@uno> config -s tserver.compaction.major.service.cs2.rate.limit=40M
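
These system-wide properties can also be set from the Java client API instead of the shell. The sketch below uses InstanceOperations.setProperty, which is the API equivalent of config -s; the properties path is again an assumption.

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;

public class ConfigureCompactionService {
  public static void main(String[] args) throws Exception {
    // The properties path is an assumption; adjust for your installation.
    try (AccumuloClient client = Accumulo.newClient()
        .from("/path/to/accumulo-client.properties").build()) {
      // instanceOperations().setProperty applies a system-wide property,
      // just like `config -s` in the shell.
      client.instanceOperations().setProperty(
          "tserver.compaction.major.service.cs2.rate.limit", "40M");
    }
  }
}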

Configurations can be verified for correctness with the check-compaction-config tool in Accumulo. Place your compaction configuration into a file and run the tool. For example, if you create a file myconfig that contains the following:

tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]
tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]
tserver.compaction.major.service.cs2.rate.limit=40M

The following command would check the configuration for errors:

$ accumulo check-compaction-config /path/to/myconfig

With the compaction services configured, next set the table-specific properties.

Configure the compression for table examples.test1. Files over 100M will be compressed using gz. All others will be compressed via snappy.

user@uno> config -t examples.test1 -s table.compaction.configurer=org.apache.accumulo.core.client.admin.compaction.CompressionConfigurer
user@uno> config -t examples.test1 -s table.compaction.configurer.opts.large.compress.threshold=100M
user@uno> config -t examples.test1 -s table.compaction.configurer.opts.large.compress.type=gz
user@uno> config -t examples.test1 -s table.file.compress.type=snappy
user@uno> config -t examples.test1 -s table.compaction.dispatcher=org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher

Set table examples.test1 to use compaction service cs1 for system compactions and service cs2 for user compactions.

user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service=cs1
user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service.user=cs2
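
Table properties such as the dispatcher options can likewise be set via TableOperations.setProperty, the API counterpart of config -t ... -s. A minimal sketch, with the properties path assumed:

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;

public class ConfigureDispatcher {
  public static void main(String[] args) throws Exception {
    // The properties path is an assumption; adjust for your installation.
    try (AccumuloClient client = Accumulo.newClient()
        .from("/path/to/accumulo-client.properties").build()) {
      // Equivalent to: config -t examples.test1 -s table.compaction.dispatcher.opts.service=cs1
      client.tableOperations().setProperty("examples.test1",
          "table.compaction.dispatcher.opts.service", "cs1");
      // Equivalent to: config -t examples.test1 -s table.compaction.dispatcher.opts.service.user=cs2
      client.tableOperations().setProperty("examples.test1",
          "table.compaction.dispatcher.opts.service.user", "cs2");
    }
  }
}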

If needed, a service for chop compactions (used during merge operations) can also be configured.

user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service.chop=cs2

Generate some data and files in order to test the strategy:

$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 1000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 2000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

$ accumulo shell -u <username> -p <password> -e "compact -t examples.test1 -w"

View the tserver log in <accumulo_home>/logs for the compaction and find the name of the rfile that was compacted for your table. Print info about this file using the rfile-info tool, replacing the TableID in the path with the TableID noted above. Note that your filenames will differ from the ones in this example.

$ accumulo rfile-info hdfs:///accumulo/tables/2/default_tablet/A000000a.rf

Details about the rfile will be printed. The compression type should match the type used in the compaction. In this case, snappy is used since the size is less than 100M.

Meta block     : RFile.index
      Raw size             : 168 bytes
      Compressed size      : 127 bytes
      Compression type     : snappy

Continue to add additional data.

$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 1000000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 1000000 --num 1000000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 2000000 --num 1000000 --size 50
$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

$ accumulo shell -u <username> -p <password> -e "compact -t examples.test1 -w"

Again, view the tserver log in <accumulo_home>/logs for the compaction and find the name of the rfile that was compacted for your table. Print info about this file using the rfile-info tool:

$ accumulo rfile-info hdfs:///accumulo/tables/2/default_tablet/A000000o.rf

In this case, the compression type should be gz.

Meta block     : RFile.index
      Raw size             : 56,044 bytes
      Compressed size      : 21,460 bytes
      Compression type     : gz

Examining the size of A000000o.rf in HDFS should confirm that the rfile is larger than 100M.

$ hdfs dfs -ls -h /accumulo/tables/2/default_tablet/A000000o.rf