<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Apache Accumulo Customizing the Compaction Strategy

This is an example of how to configure a compaction strategy. By default, Accumulo uses the DefaultCompactionStrategy
unless the steps below are taken to change the configuration. Use the strategy and settings that best fit your Accumulo
setup. This example shows how to configure a non-default strategy. Note that this example requires Hadoop native
libraries built with snappy support in order to use snappy compression. Within this example, commands starting with
`user@uno>` are run from within the Accumulo shell, whereas commands beginning with `$` are executed from the command
line terminal.
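
A quick way to confirm that snappy is available, assuming the `hadoop` command is on your path, is Hadoop's
`checknative` subcommand, which reports which native codecs Hadoop can load:

    $ hadoop checknative -a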

Start by creating a table that will be used for the compactions.

    user@uno> createnamespace examples
    user@uno> createtable examples.test1

Take note of the TableID for `examples.test1`; it will be needed later. The TableID can be found by running:

    user@uno> tables -l
    accumulo.metadata => !0
    accumulo.replication => +rep
    accumulo.root => +r
    examples.test1 => 2

The commands below will configure the desired compaction strategy. The goals are:

- Avoid compacting files over 250M.
- Compact files over 100M using gz.
- Compact files less than 100M using snappy.
- Limit the compaction throughput to 40MB/s.

Create a compaction service named `cs1` that has three executors. The first executor, named `small`, has
8 threads and runs compactions smaller than 16M. The second executor, `medium`, runs compactions smaller than
128M with 4 threads. The last executor, `large`, runs all other compactions with 2 threads.

    user@uno> config -s tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
    user@uno> config -s 'tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]'
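
To verify the service was configured as intended, the shell's `config` command can filter properties by name
with its `-f` option:

    user@uno> config -f tserver.compaction.major.service.cs1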

Create a compaction service named `cs2` that also has three executors. Its configuration is similar to `cs1`, but its
executors have fewer threads. Because every executor in `cs2` specifies a `maxSize`, compactions larger than the
largest `maxSize` (250M) are never assigned to the service, so files over 250M are not compacted. The service also
limits the total I/O of all of its compactions to 40MB/s.

    user@uno> config -s tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
    user@uno> config -s 'tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]'
    user@uno> config -s tserver.compaction.major.service.cs2.rate.limit=40M

Configurations can be verified for correctness with the `check-compaction-config` tool in Accumulo. Place your
compaction configuration into a file and run the tool. For example, if you create a file `myconfig` that contains
the following:

    tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
    tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]
    tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
    tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]
    tserver.compaction.major.service.cs2.rate.limit=40M

The following command would check the configuration for errors:

    $ accumulo check-compaction-config /path/to/myconfig

With the compaction services configured, next configure the table-specific properties.

Configure the compression for table `examples.test1`. Files over 100M will be compressed using `gz`. All
others will be compressed via `snappy`.

    user@uno> config -t examples.test1 -s table.compaction.configurer=org.apache.accumulo.core.client.admin.compaction.CompressionConfigurer
    user@uno> config -t examples.test1 -s table.compaction.configurer.opts.large.compress.threshold=100M
    user@uno> config -t examples.test1 -s table.compaction.configurer.opts.large.compress.type=gz
    user@uno> config -t examples.test1 -s table.file.compress.type=snappy
    user@uno> config -t examples.test1 -s table.compaction.dispatcher=org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher

Set table `examples.test1` to use compaction service `cs1` for system compactions and service `cs2`
for user compactions.

    user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service=cs1
    user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service.user=cs2
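
To double-check the table overrides, the same `-f` filter can be applied to the table's configuration:

    user@uno> config -t examples.test1 -f table.compaction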

If needed, `chop` compactions (which occur during table merge operations) can also be configured.

    user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service.chop=cs2

Generate some data and files in order to test the strategy:

    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 1000 --size 50
    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 2000 --size 50
    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

    $ accumulo shell -u <username> -p <password> -e "compact -t examples.test1 -w"
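
While a compaction is in progress, the active compactions can be listed from another shell session:

    user@uno> listcompactions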

View the `tserver` log in `<accumulo_home>/logs` for the compaction and find the name of the `rfile` that was
compacted for your table. Print info about this file using the `rfile-info` tool, replacing the TableID with
the TableID from above. Note that your filenames will differ from the ones within this example.
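
Alternatively, listing the tablet directory in HDFS shows the candidate files (this path assumes the TableID `2`
from above; the file names in your directory will differ):

    $ hdfs dfs -ls /accumulo/tables/2/default_tablet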

    $ accumulo rfile-info hdfs:///accumulo/tables/2/default_tablet/A000000a.rf

Details about the rfile will be printed. The compression type should match the type used in the compaction.
In this case, `snappy` is used since the size is less than 100M.

```bash
Meta block : RFile.index
Raw size : 168 bytes
Compressed size : 127 bytes
Compression type : snappy
```
Continue to add data so that the next compacted file exceeds the 100M threshold.

    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 1000000 --size 50
    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 1000000 --num 1000000 --size 50
    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 2000000 --num 1000000 --size 50
    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"

    $ accumulo shell -u <username> -p <password> -e "compact -t examples.test1 -w"
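
Once the compaction completes, the shell's `du` command reports the disk space used by the table's files, which
gives a quick indication of whether the 100M threshold was crossed:

    user@uno> du examples.test1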

Again, view the `tserver` log in `<accumulo_home>/logs` for the compaction and find the name of the `rfile` that was
compacted for your table. Print info about this file using the `rfile-info` tool:

    $ accumulo rfile-info hdfs:///accumulo/tables/2/default_tablet/A000000o.rf

In this case, the compression type should be `gz`.

```bash
Meta block : RFile.index
Raw size : 56,044 bytes
Compressed size : 21,460 bytes
Compression type : gz
```

Examining the size of `A000000o.rf` within HDFS should verify that the rfile is greater than 100M.

    $ hdfs dfs -ls -h /accumulo/tables/2/default_tablet/A000000o.rf