Apache Accumulo 1.6.0 adds some major new features and fixes many bugs. This release contains changes from 609 issues contributed by 36 contributors and committers.
Below are resources for this release:
Accumulo 1.6.0 runs on Hadoop 1, however Hadoop 2 with HA namenode is recommended for production systems. In addition to HA, Hadoop 2 also offers better data durability guarantees, in the case when nodes lose power, than Hadoop 1.
BigTable's design allows for its internal metadata to automatically spread across multiple nodes. Accumulo has followed this design and scales very well as a result. There is one impediment to scaling though, and this is the HDFS namenode. There are two problems with the namenode when it comes to scaling. First, the namenode stores all of its filesystem metadata in memory on a single machine. This introduces an upper bound on the number of files Accumulo can have. Second, there is an upper bound on the number of file operations per second that a single namenode can support. For example, a namenode can only support a few thousand delete or create file request per second.
To overcome this bottleneck, support for multiple namenodes was added under ACCUMULO-118. This change allows Accumulo to store its files across multiple namenodes. To use this feature, place comma separated list of namenode URIs in the new instance.volumes configuration property in accumulo-site.xml. When upgrading to 1.6.0 and multiple namenode support is desired, modify this setting only after a successful upgrade.
Administering an Accumulo instance with many tables is cumbersome. To ease this, ACCUMULO-802 introduced table namespaces which allow tables to be grouped into logical collections. This allows configuration and permission changes to made to a namespace, which will apply to all of its tables.
Accumulo now offers a way to make atomic read,modify,write row changes from the client side. Atomic test and set row operations make this possible. ACCUMULO-1000 added conditional mutations and a conditional writer. A conditional mutation has tests on columns that must pass before any changes are made. These test are executed in server processes while a row lock is held. Below is a simple example of making atomic row changes using conditional mutations.
The only built in test that conditional mutations support are equality and isNull. However, iterators can be configured on a conditional mutation to run before these test. This makes it possible to implement any number of test such as less than, greater than, contains, etc.
Encryption is still an experimental feature, but much progress has been made since 1.5.0. Support for encrypting rfiles and write ahead logs were added in ACCUMULO-958 and ACCUMULO-980. Support for encrypting data over the wire using SSL was added in ACCUMULO-1009.
When a tablet server fails, its write ahead logs are sorted and stored in HDFS. In 1.6.0, encrypting these sorted write ahead logs is not supported. ACCUMULO-981 is open to address this issue.
One of the key elements of the BigTable design is use of the Log Structured Merge Tree. This entails sorting data in memory, writing out sorted files, and then later merging multiple sorted files into a single file. These automatic merges happen in the background and Accumulo decides when to merge files based comparing relative sizes of files to a compaction ratio. Before 1.6.0 adjusting the compaction ratio was the only way a user could control this process. ACCUMULO-1451 introduces pluggable compaction strategies which allow users to choose when and what files to compact. ACCUMULO-1808 adds a compaction strategy that prevents compaction of files over a configurable size.
Accumulo only sorts data lexicographically. Getting something like a pair of (String,Integer) to sort correctly in Accumulo is tricky. It‘s tricky because you only want to compare the integers if the strings are equal. It’s possible to make this sort properly in Accumulo if the data is encoded properly, but can be difficult. To make this easier ACCUMULO-1336 added Lexicoders to the Accumulo API. Lexicoders provide an easy way to serialize data so that it sorts properly lexicographically. Below is a simple example.
PairLexicoder plex = new PairLexicoder(new StringLexicoder(), new IntegerLexicoder()); byte[] ba1 = plex.encode(new ComparablePair<String, Integer>("b",1)); byte[] ba2 = plex.encode(new ComparablePair<String, Integer>("aa",1)); byte[] ba3 = plex.encode(new ComparablePair<String, Integer>("a",2)); byte[] ba4 = plex.encode(new ComparablePair<String, Integer>("a",1)); byte[] ba5 = plex.encode(new ComparablePair<String, Integer>("aa",-3)); //sorting ba1,ba2,ba3,ba4, and ba5 lexicographically will result in the same order as sorting the ComparablePairs
In cases where a very small amount of data is stored in a locality group one would expect fast scans over that locality group. However this was not always the case because recently written data stored in memory was not partitioned by locality group. Therefore if a table had 100GB of data in memory and 1MB of that was in locality group A, then scanning A would have required reading all 100GB. ACCUMULO-112 changes this and partitions data by locality group as its written.
Previous versions of Accumulo always used IP addresses internally. This could be problematic in virtual machine environments where IP addresses change. In ACCUMULO-1585 this was changed, now Accumulo uses the exact hostnames from its config files for internal addressing.
All Accumulo processes running on a cluster are locatable via zookeeper. Therefore using well known ports is not really required. ACCUMULO-1664 makes it possible to for all Accumulo processes to use random ports. This makes it easier to run multiple Accumulo instances on a single node.
While Hadoop does not support IPv6 networks, attempting to run on a system that does not have IPv6 completely disabled can cause strange failures. ACCUMULO-2262 invokes the JVM-provided configuration parameter at process startup to prefer IPv4 over IPv6.
Multiple bug-fixes were made to support running Accumulo over multiple HDFS instances using ViewFS. ACCUMULO-2047 is the parent ticket that contains numerous fixes to enable this support.
This version of Accumulo is accompanied by a new maven plugin for testing client apps (ACCUMULO-1030). You can execute the accumulo-maven-plugin inside your project by adding the following to your pom.xml's build plugins section:
<plugin> <groupId>org.apache.accumulo</groupId> <artifactId>accumulo-maven-plugin</artifactId> <version>1.6.0</version> <configuration> <instanceName>plugin-it-instance</instanceName> <rootPassword>ITSecret</rootPassword> </configuration> <executions> <execution> <id>run-plugin</id> <goals> <goal>start</goal> <goal>stop</goal> </goals> </execution> </executions> </plugin>
This plugin is designed to work in conjunction with the maven-failsafe-plugin. A small test instance of Accumulo will run during the pre-integration-test phase of the Maven build lifecycle, and will be stopped in the post-integration-test phase. Your integration tests, executed by maven-failsafe-plugin can access this instance with a MiniAccumuloInstance connector (the plugin uses MiniAccumuloInstance, internally), as in the following example:
private static Connector conn; @BeforeClass public static void setUp() throws Exception { String instanceName = "plugin-it-instance"; Instance instance = new MiniAccumuloInstance(instanceName, new File("target/accumulo-maven-plugin/" + instanceName)); conn = instance.getConnector("root", new PasswordToken("ITSecret")); }
This plugin is quite limited, currently only supporting an instance name and a root user password as configuration parameters. Improvements are expected in future releases, so feedback is welcome and appreciated (file bugs/requests under the “maven-plugin” component in the Accumulo JIRA).
One notable change that was made to the binary tarball is the purposeful omission of a pre-built copy of the Accumulo “native map” library. This shared library is used at ingest time to implement an off-JVM-heap sorted map that greatly increases ingest throughput while side-stepping issues such as JVM garbage collection pauses. In earlier releases, a pre-built copy of this shared library was included in the binary tarball; however, the decision was made to omit this due to the potential variance in toolchains on the target system.
It is recommended that users invoke the provided build_native_library.sh before running Accumulo:
$ACCUMULO_HOME/bin/build_native_library.sh
Be aware that you will need a C++ compiler/toolchain installed to build this library. Check your GNU/Linux distribution documentation for the package manager command.
A Constraint is an interface that can determine if a Mutation should be applied or rejected server-side. After ACCUMULO-466, new tables that are created in 1.6.0 will automatically have the DefaultKeySizeConstraint
set. As performance can suffer when large Keys are inserted into a table, this Constraint will reject any Key that is larger than 1MB. If this constraint is undesired, it can be removed using the constraint
shell command. See the help message on the command for more information.
bin/accumulo admin --help
When using Accumulo 1.6 and Hadoop 2, Accumulo will call hsync() on HDFS. Calling hsync improves durability by ensuring data is on disk (where other older Hadoop versions might lose data in the face of power failure); however, calling hsync frequently does noticeably slow writes. A simple work around is to increase the value of the tserver.mutation.queue.max configuration parameter via accumulo-site.xml.
A value of “4M” is a better recommendation, and memory consumption will increase by the number of concurrent writers to that TabletServer. For example, a value of 4M with 50 concurrent writers would equate to approximately 200M of Java heap being used for mutation queues.
For more information, see ACCUMULO-1950 and this comment.
Another possible cause of slower writes is the change in write ahead log replication between 1.4 and 1.5. Accumulo 1.4. defaulted to two loggers servers. Accumulo 1.5 and 1.6 store write ahead logs in HDFS and default to using three datanodes.
If a BatchWriter
fails with MutationsRejectedException
and the message contains "# server errors 1"
then it may be ACCUMULO-2388. To confirm this look in the tablet server logs for org.apache.accumulo.tserver.HoldTimeoutException
around the time the BatchWriter
failed. If this is happening often a possible work around is to set general.rpc.timeout
to 240s
.
The following deprecated methods were removed in ACCUMULO-1533
SecurityErrorCode o.a.a.core.client.AccumuloSecurityException.getErrorCode()
deprecated in ACCUMULO-970Connector o.a.a.core.client.Instance.getConnector(AuthInfo)
deprecated in ACCUMULO-1024Connector o.a.a.core.client.ZooKeeperInstance.getConnector(AuthInfo)
deprecated in ACCUMULO-1024static String o.a.a.core.client.ZooKeeperInstance.getInstanceIDFromHdfs(Path)
deprecated in ACCUMULO-1static String ZooKeeperInstance.lookupInstanceName (ZooCache,UUID)
deprecated in ACCUMULO-765void o.a.a.core.client.ColumnUpdate.setSystemTimestamp(long)
deprecated in ACCUMULO-786Below is a list of all platforms that 1.6.0 was tested against by developers. Each Apache Accumulo release has a set of tests that must be run before the candidate is capable of becoming an official release. That list includes the following:
Each unit and functional test only runs on a single node, while the RandomWalk and Continuous Ingest tests run on any number of nodes. Agitation refers to randomly restarting Accumulo processes and Hadoop Datanode processes, and, in HDFS High-Availability instances, forcing NameNode failover.
The following acronyms are used in the test testing table.
mvn verify
{: #release_notes_testing .table } | OS | Java | Hadoop | Nodes | ZooKeeper | HDFS HA | Version/Commit hash | Tests | |------------|----------------------------|-----------------------------------|--------------|--------------|---------|----------------------------------|--------------------------------------------------------------------| | CentOS 6.5 | CentOS OpenJDK 1.7 | Apache 2.2.0 | 20 EC2 nodes | Apache 3.4.5 | No | 1.6.0 RC1 + ACCUMULO_2668 patch | 24-hour CI w/o agitation. Verified. | | CentOS 6.5 | CentOS OpenJDK 1.7 | Apache 2.2.0 | 20 EC2 nodes | Apache 3.4.5 | No | 1.6.0 RC2 | 24-hour RW (Conditional.xml module) w/o agitation | | CentOS 6.5 | CentOS OpenJDK 1.7 | Apache 2.2.0 | 20 EC2 nodes | Apache 3.4.5 | No | 1.6.0 RC5 | 24-hour CI w/ agitation. Verified. | | CentOS 6.5 | CentOS OpenJDK 1.6 and 1.7 | Apache 1.2.1, 2.2.0 | Single | Apache 3.3.6 | No | 1.6.0 RC5 | All unit and ITs w/ -Dhadoop.profile=2
and -Dhadoop.profile=1
| | Gentoo | Sun JDK 1.6.0_45 | Apache 1.2.1, 2.2.0, 2.3.0, 2.4.0 | Single | Apache 3.4.5 | No | 1.6.0 RC5 | All unit and ITs. 2B entries ingested/verified with CI | | CentOS 6.4 | Sun JDK 1.6.0_31 | CDH 4.5.0 | 7 | CDH 4.5.0 | Yes | 1.6.0 RC4 and RC5 | 24-hour RW (LongClean) with and without agitation | | CentOS 6.4 | Sun JDK 1.6.0_31 | CDH 4.5.0 | 7 | CDH 4.5.0 | Yes | 3a1b38 | 72-hour CI with and without agitation. Verified. | | CentOS 6.4 | Sun JDK 1.6.0_31 | CDH 4.5.0 | 7 | CDH 4.5.0 | Yes | 1.6.0 RC2 | 24-hour CI without agitation. Verified. | | CentOS 6.4 | Sun JDK 1.6.0_31 | CDH 4.5.0 | 7 | CDH 4.5.0 | Yes | 1.6.0 RC3 | 24-hour CI with agitation. Verified. |