| Title: Apache Accumulo |
| Notice: Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| . |
| http://www.apache.org/licenses/LICENSE-2.0 |
| . |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| ****************************************************************************** |
| 0. Introduction |
| |
| Apache Accumulo is a sorted, distributed key/value store based on Google's |
| BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It |
| features a few novel improvements on the BigTable design in the form of |
| cell-level access labels and a server-side programming mechanism that can modify |
| key/value pairs at various points in the data management process. |
| |
| ****************************************************************************** |
| 1. Building |
| |
In the normal tarball release of Accumulo, everything is built and
ready to go on x86 GNU/Linux: there is no build step.
| |
| However, if you only have source code, or you wish to make changes, you need to |
| have maven configured to get Accumulo prerequisites from repositories. See |
| the pom.xml file for the necessary components. |
| |
You can build an Accumulo binary distribution, which is created in the
assemble/target directory, using the following command. Note that Maven 3
is required starting with Accumulo 1.5.0. The resulting artifacts should be
compatible with Apache Hadoop 1.2.x or Apache Hadoop 2.2.x releases.
| |
| mvn package -P assemble |
| |
| By default, Accumulo compiles against Apache Hadoop 2.2.0. To compile against |
| a different Hadoop 2-compatible version, specify the profile and version, |
| e.g. "-Dhadoop.version=0.23.5". |
| |
| To compile against Apache Hadoop 1.2.1, or a different version that is compatible |
| with Hadoop 1, specify hadoop.profile and hadoop.version on the command line, |
| e.g. "-Dhadoop.profile=1 -Dhadoop.version=0.20.205.0" or |
| "-Dhadoop.profile=1 -Dhadoop.version=1.1.0". |
| |
If you are running on another Unix-like operating system (OS X, etc.), then
| you may wish to build the native libraries. They are not strictly necessary |
| but having them available suppresses a runtime warning and enables Accumulo |
| to run faster. You can execute the following script to automatically unpack |
| and install the native library. Be sure to have a JDK and a C++ compiler |
| installed with the JAVA_HOME environment variable set. |
| |
| $ ./bin/build_native_library.sh |
| |
| If your system's default compiler options are insufficient, you can add |
| additional compiler options to the command line, such as options for the |
| architecture. These will be passed to the Makefile in the environment variable |
| USERFLAGS: |
| |
| $ ./bin/build_native_library.sh -m32 |
| |
Alternatively, you can manually unpack the accumulo-native tarball in the
$ACCUMULO_HOME/lib directory, change into the resulting accumulo-native
directory, and run `make`. Then, copy the resulting 'libaccumulo' library
into $ACCUMULO_HOME/lib/native/map:
| |
| $ mkdir -p $ACCUMULO_HOME/lib/native/map |
| $ cp libaccumulo.* $ACCUMULO_HOME/lib/native/map |
| |
| |
| Building Documentation |
| |
Use the following command to build the User Manual (docs/target/accumulo_user_manual.pdf)
and the configuration HTML page (docs/target/config.html):
| |
| mvn package -P docs -DskipTests |
| |
| ****************************************************************************** |
| 2. Deployment |
| |
| Copy the accumulo tar file produced by mvn package from the assemble/target/ |
| directory to the desired destination, then untar it (e.g. |
| tar xzf accumulo-1.6.0-bin.tar.gz). |
| |
| Another option is to package Accumulo directly to a working directory. For example, |
| |
| mvn package -DskipTests -DDEV_ACCUMULO_HOME=/var/tmp |
| |
| The above command would create a directory with a name similar to |
| /var/tmp/accumulo-1.6.0-dev/accumulo-1.6.0/, containing all the contents |
| that are normally contained in accumulo-1.6.0-bin.tar.gz, but already unpacked. |
If the DEV_ACCUMULO_HOME parameter is not specified, this directory is created
in assemble/target, where it is subject to deletion by the 'mvn clean' command;
a directory outside the build tree is not affected by 'mvn clean'. When the
command is executed more than once, newer files overwrite older ones, and files
a user has added (such as configuration files in conf/) are left alone.
| |
| If HDFS and Zookeeper are running, you can run Accumulo directly from this |
| working directory. See the 'Running Apache Accumulo' section later in this document. |
| |
| You can avoid specifying the working directory each time you compile by adding |
| a profile to maven's settings.xml file. Below is an example of $HOME/.m2/settings.xml |
| |
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
| <profiles> |
| <profile> |
| <id>inject-accumulo-home</id> |
| <properties> |
| <DEV_ACCUMULO_HOME>/var/tmp</DEV_ACCUMULO_HOME> |
| </properties> |
| </profile> |
| </profiles> |
| <activeProfiles> |
| <activeProfile>inject-accumulo-home</activeProfile> |
| </activeProfiles> |
| </settings> |
| |
| ****************************************************************************** |
| 3. Upgrading |
| |
| 3.1. From 1.5 to 1.6 |
| |
| This happens automatically the first time Accumulo 1.6 is started. |
| |
| If your instance previously upgraded from 1.4 to 1.5, you must verify that your |
| 1.5 instance has no outstanding local write ahead logs. You can do this by ensuring |
| either: |
| |
| - All of your tables are online and the Monitor shows all tablets hosted |
| - The directory for write ahead logs (logger.dir.walog) from 1.4 has no files remaining |
| on any tablet server / logger hosts |
| |
| To upgrade from 1.5 to 1.6 you must: |
| |
| * Verify that there are no outstanding FATE operations |
| - Under 1.5 you can list what's in FATE by running |
| $ACCUMULO_HOME/bin/accumulo org.apache.accumulo.server.fate.Admin print |
| - Note that operations in any state will prevent an upgrade. It is safe |
| to delete operations with status SUCCESSFUL. For others, you should restart |
| your 1.5 cluster and allow them to finish. |
| * Stop the 1.5 instance. |
| * Configure 1.6 to use the hdfs directory and zookeepers that 1.5 was using. |
| * Copy other 1.5 configuration options as needed. |
| * Start Accumulo 1.6. |
| |
| The upgrade process must make changes to Accumulo's internal state in both ZooKeeper and |
| the table metadata. This process may take some time if Tablet Servers have to go through |
| recovery. During this time, the Monitor will claim that the Master is down and some |
| services may send the Monitor log messages about failure to communicate with each other. |
| These messages are safe to ignore. If you need detail on the upgrade's progress you should |
| view the local logs on the Tablet Servers and active Master. |
| |
| 3.2. From 1.4 to 1.6 |
| |
| To upgrade from 1.4 to 1.6 you must perform a manual initial step. |
| |
| Prior to upgrading you must: |
| * Verify that there are no outstanding FATE operations |
| - Under 1.4 you can list what's in FATE by running |
| $ACCUMULO_HOME/bin/accumulo org.apache.accumulo.server.fate.Admin print |
| - Note that operations in any state will prevent an upgrade. It is safe |
| to delete operations with status SUCCESSFUL. For others, you should restart |
| your 1.4 cluster and allow them to finish. |
| * Stop the 1.4 instance. |
| * Configure 1.6 to use the hdfs directory, walog directories, and zookeepers |
| that 1.4 was using. |
| * Copy other 1.4 configuration options as needed. |
| |
| Prior to starting the 1.6 instance you will need to run the LocalWALRecovery tool |
| on each node that previously ran an instance of the Logger role. |
| |
| $ACCUMULO_HOME/bin/accumulo org.apache.accumulo.tserver.log.LocalWALRecovery |
| |
| The recovery tool will rewrite the 1.4 write ahead logs into a format that 1.6 can read. |
| After this step has completed on all nodes, start the 1.6 cluster to continue the upgrade. |
| |
| The upgrade process must make changes to Accumulo's internal state in both ZooKeeper and |
| the table metadata. This process may take some time if Tablet Servers have to go through |
| recovery. During this time, the Monitor will claim that the Master is down and some |
| services may send the Monitor log messages about failure to communicate with each other. |
| While the upgrade is in progress, the Garbage Collector may complain about invalid paths. |
| The Master may also complain about failure to create the trace table because it already |
| exists. These messages are safe to ignore. If other error messages occur, you should seek |
| out support before continuing to use Accumulo. If you need detail on the upgrade's progress |
| you should view the local logs on the Tablet Servers and active Master. |
| |
| Note that the LocalWALRecovery tool does not delete the local files. Once you confirm that |
| 1.6 is successfully running, you should delete these files on the local filesystem. |
| |
| ****************************************************************************** |
| 4. Configuring |
| |
| Apache Accumulo has two prerequisites, hadoop and zookeeper. Zookeeper must be |
| at least version 3.3.0. Both of these must be installed and configured. Some |
| versions of Zookeeper may only allow 10 connections from one computer by default. |
| On a single-host install, this number is a little too low. Add the following to |
| the $ZOOKEEPER_HOME/conf/zoo.cfg file: |
| |
| maxClientCnxns=100 |
| |
Ensure you (or some special hadoop user account) have accounts on all of
| the machines in the cluster and that hadoop and accumulo install files can be |
| found in the same location on every machine in the cluster. You will need to |
| have password-less ssh set up as described in the hadoop documentation. |
| |
| You will need to have hadoop installed and configured on your system. Accumulo |
| 1.6.0 has been tested with hadoop version 1.0.4. To avoid data loss, |
| you must enable HDFS durable sync. How you enable this depends on your version |
| of Hadoop. Please consult the table below for information regarding your version. |
If you need to set the configuration, please be sure to restart HDFS. See
| ACCUMULO-623 and ACCUMULO-1637 for more information. |
| |
| The following releases of Apache Hadoop require special configuration to ensure |
| that data is not inadvertently lost; however, in all releases of Apache Hadoop, |
| `dfs.durable.sync` and `dfs.support.append` should *not* be configured as `false`. |
| |
| VERSION NAME=VALUE |
| 0.20.205.0 - dfs.support.append=true |
| 1.0.x - dfs.support.append=true |
| |
| Additionally, it is strongly recommended that you enable 'dfs.datanode.synconclose' |
| (only available in Apache Hadoop >=1.1.1 or >=0.23) in your hdfs-site.xml configuration |
| file to ensure that, in the face of unexpected power loss to a datanode, files are |
| wholly synced to disk. |
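As a sketch, the hdfs-site.xml entries discussed above might look like the
following (the values shown are illustrative; dfs.support.append is only
needed on the Hadoop releases listed in the table):

```xml
<!-- hdfs-site.xml (sketch): durable-sync settings discussed above -->
<property>
  <!-- only required on Apache Hadoop 0.20.205.0 / 1.0.x -->
  <name>dfs.support.append</name>
  <value>true</value>
</property>
<property>
  <!-- available in Apache Hadoop >=1.1.1 or >=0.23 -->
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>
```

Remember to restart HDFS after changing these settings, as noted above.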
| |
| Accumulo's own configuration files can be bootstrapped with the |
| $ACCUMULO_HOME/bin/bootstrap_config.sh script. This script will allow you to |
| select options which correspond closely to your particular environment. The |
| configuration files produced by this script are examples. You should always |
| inspect any configuration files you use to ensure they are appropriate for your |
| environment, and to tailor them to your needs, as they are not guaranteed to be |
| suitable for all users and all environments. |
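The script prompts for options interactively and takes no required arguments.
A guarded invocation sketch (the install path below is a hypothetical example):

```shell
# Run the interactive bootstrap script if an Accumulo install is present.
# ACCUMULO_HOME here is a hypothetical path; adjust it for your install.
ACCUMULO_HOME="${ACCUMULO_HOME:-/opt/accumulo}"
if [ -x "$ACCUMULO_HOME/bin/bootstrap_config.sh" ]; then
  "$ACCUMULO_HOME/bin/bootstrap_config.sh"
else
  echo "bootstrap_config.sh not found under $ACCUMULO_HOME/bin" >&2
fi
```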
| |
| Some example accumulo configuration files are placed in directories based on the |
| memory footprint for the accumulo processes. These are pre-generated from |
| particular selections from the bootstrap_config.sh script for your convenience. |
If you are using native libraries for your tablet server in-memory map, then you
| can use the files in "native-standalone". If you get warnings about not being |
| able to load the native libraries, you can use the configuration files in |
| "standalone". |
| |
| For testing on a single computer, use a fairly small configuration: |
| |
| $ cp conf/examples/512MB/native-standalone/* conf |
| |
| Please note that the footprints are for only the Accumulo system processes, so |
| ample space should be left for other processes like hadoop, zookeeper, and the |
| accumulo client code. These directories must be at the same location on every |
| node in the cluster. |
| |
If you are configuring a larger cluster, you will need to create the configuration
files yourself and propagate the changes to the $ACCUMULO_CONF_DIR directories:
| |
| Create a "slaves" file in $ACCUMULO_CONF_DIR/. This is a list of machines |
| where tablet servers and loggers will run. |
| |
| Create a "masters" file in $ACCUMULO_CONF_DIR/. This is a list of |
| machines where the master server will run. |
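The two files above are plain lists of hostnames, one per line. A sketch
using hypothetical hostnames and a scratch directory:

```shell
# Write example host lists; the hostnames and directory are hypothetical.
ACCUMULO_CONF_DIR="${ACCUMULO_CONF_DIR:-/tmp/accumulo-conf-example}"
mkdir -p "$ACCUMULO_CONF_DIR"
# One master host, one per line
printf 'master1.example.com\n' > "$ACCUMULO_CONF_DIR/masters"
# Tablet server / logger hosts, one per line
printf 'worker1.example.com\nworker2.example.com\n' > "$ACCUMULO_CONF_DIR/slaves"
```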
| |
| Create conf/accumulo-env.sh following the template of |
| example/3GB/native-standalone/accumulo-env.sh. |
| |
| However you create your configuration files, you will need to set |
JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME in conf/accumulo-env.sh.
| |
| Note that zookeeper client jar files must be installed on every machine, but |
| the server should not be run on every machine. |
| |
| Create the $ACCUMULO_LOG_DIR on every machine in the slaves file. |
| |
| * Note that you will be specifying the Java heap space in accumulo-env.sh. |
| You should make sure that the total heap space used for the accumulo tserver, |
| logger and the hadoop datanode and tasktracker is less than the available |
| memory on each slave node in the cluster. On large clusters, it is recommended |
| that the accumulo master, hadoop namenode, secondary namenode, and hadoop |
| jobtracker all be run on separate machines to allow them to use more heap |
| space. If you are running these on the same machine on a small cluster, make |
| sure their heap space settings fit within the available memory. The zookeeper |
| instances are also time sensitive and should be on machines that will not be |
| heavily loaded, or over-subscribed for memory. |
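Heap sizes are set through the per-process *_OPTS variables in accumulo-env.sh.
A fragment with illustrative values for a small node (the 1g figures are
examples, not recommendations):

```shell
# accumulo-env.sh fragment (sketch): illustrative heap settings only.
# These variables are read by the start scripts; size them so the total
# fits in the memory available on each node.
if [ -z "$ACCUMULO_TSERVER_OPTS" ]; then
  export ACCUMULO_TSERVER_OPTS="-Xmx1g -Xms1g"
fi
if [ -z "$ACCUMULO_MASTER_OPTS" ]; then
  export ACCUMULO_MASTER_OPTS="-Xmx1g -Xms1g"
fi
```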
| |
| Edit conf/accumulo-site.xml. You must set the zookeeper servers in this |
| file (instance.zookeeper.host). Look at the "Configuration Management" section |
| of the user manual to see what additional variables you can modify and what |
| the defaults are. |
| |
| It is advisable to change the instance secret (instance.secret) to some new |
| value. Also ensure that the accumulo-site.xml file is not readable by other |
| users on the machine. |
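For illustration, the relevant accumulo-site.xml entries might look like
the following (the hostnames and the secret are placeholders you must
replace with your own values):

```xml
<!-- accumulo-site.xml (sketch): placeholder values, replace before use -->
<property>
  <name>instance.zookeeper.host</name>
  <value>zk1.example.com:2181,zk2.example.com:2181</value>
</property>
<property>
  <name>instance.secret</name>
  <value>CHANGE_ME</value>
</property>
```

Something like `chmod 600 conf/accumulo-site.xml` can be used to keep the
file from being readable by other users.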
| |
| Synchronize your accumulo conf directory across the cluster. As a precaution |
| against mis-configured systems, servers using different configuration files |
| will not communicate with the rest of the cluster. |
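One way to synchronize the conf directory is to loop over the slaves file
with rsync. The function below only prints the commands it would run (a
dry-run sketch; the paths are hypothetical, and it assumes password-less
ssh as described earlier). Remove the echo to actually execute them:

```shell
# Print an rsync command for every host listed in a slaves file (dry run).
sync_conf() {
  conf_dir="$1"
  slaves_file="$2"
  while read -r host; do
    if [ -n "$host" ]; then    # skip blank lines
      echo "rsync -az --delete $conf_dir/ $host:$conf_dir/"
    fi
  done < "$slaves_file"
}
```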
| |
| Accumulo requires the hadoop "commons-io" java package. This is normally |
| distributed with hadoop. However, it was not distributed with hadoop-0.20. |
| If your hadoop distribution does not provide this package, you will need |
| to obtain it and put the commons-io jar file in $ACCUMULO_HOME/lib. See the |
| pom.xml file for version information. |
| |
| ****************************************************************************** |
| 5. Running Apache Accumulo |
| |
| Make sure hadoop is configured on all of the machines in the cluster, including |
| access to a shared hdfs instance. Make sure hdfs is running. |
| |
| Make sure zookeeper is configured and running on at least one machine in the |
| cluster. |
| |
| Run "bin/accumulo init" to create the hdfs directory structure |
| (hdfs:///accumulo/*) and initial zookeeper settings. This will also allow you |
| to also configure the initial root password. Only do this once. |
| |
| Start accumulo using the bin/start-all.sh script. |
| |
| Use the "bin/accumulo shell -u <username>" command to run an accumulo shell |
| interpreter. Within this interpreter, run "createtable <tablename>" to create |
| a table, and run "table <tablename>" followed by "scan" to scan a table. |
| |
| In the example below a table is created, data is inserted, and the table is |
| scanned. |
| |
| $ ./bin/accumulo shell -u root |
| Enter current password for 'root'@'accumulo': ****** |
| |
| Shell - Apache Accumulo Interactive Shell |
| - |
| - version: 1.5.0 |
| - instance name: accumulo |
| - instance id: f5947fe6-081e-41a8-9877-43730c4dfc6f |
| - |
| - type 'help' for a list of available commands |
| - |
| root@ac> createtable foo |
| root@ac foo> insert row1 colf1 colq1 val1 |
| root@ac foo> insert row1 colf1 colq2 val2 |
| root@ac foo> scan |
| row1 colf1:colq1 [] val1 |
| row1 colf1:colq2 [] val2 |
| |
The example below starts the shell, switches to table foo, and scans for a
certain column.
| |
| $ ./bin/accumulo shell -u root |
| Enter current password for 'root'@'accumulo': ****** |
| |
| Shell - Apache Accumulo Interactive Shell |
| - |
| - version: 1.5.0 |
| - instance name: accumulo |
| - instance id: f5947fe6-081e-41a8-9877-43730c4dfc6f |
| - |
| - type 'help' for a list of available commands |
| - |
| root@ac> table foo |
| root@ac foo> scan -c colf1:colq2 |
| row1 colf1:colq2 [] val2 |
| |
| |
| If you are running on top of hdfs with kerberos enabled, then you need to do |
| some extra work. First, create an Accumulo principal |
| |
| kadmin.local -q "addprinc -randkey accumulo/<host.domain.name>" |
| |
| where <host.domain.name> is replaced by a fully qualified domain name. Export |
| the principals to a keytab file. It is safer to create a unique keytab file for each |
| server, but you can also glob them if you wish. |
| |
| kadmin.local -q "xst -k accumulo.keytab -glob accumulo*" |
| |
Place this file in $ACCUMULO_CONF_DIR for every host. It should be owned by
the accumulo user and chmodded to 400. Add the following to accumulo-env.sh:
| |
| kinit -kt $ACCUMULO_HOME/conf/accumulo.keytab accumulo/`hostname -f` |
| |
| In the accumulo-site.xml file on each node, add settings for general.kerberos.keytab |
| and general.kerberos.principal, where the keytab setting is the absolute path |
| to the keytab file ($ACCUMULO_HOME is valid to use) and principal is set to |
| accumulo/_HOST@<REALM>, where REALM is set to your kerberos realm. You may use |
| _HOST in lieu of your individual host names. |
| |
| <property> |
| <name>general.kerberos.keytab</name> |
| <value>$ACCUMULO_CONF_DIR/accumulo.keytab</value> |
| </property> |
| |
| <property> |
| <name>general.kerberos.principal</name> |
| <value>accumulo/_HOST@MYREALM</value> |
| </property> |
| |
| You can then start up Accumulo as you would with the accumulo user, and it will |
| automatically handle the kerberos keys needed to access hdfs. |
| |
| Please Note: You may have issues initializing Accumulo while running kerberos HDFS. |
| You can resolve this by temporarily granting the accumulo user write access to the |
| hdfs root directory, running init, and then revoking write permission in the root |
| directory (be sure to maintain access to the /accumulo directory). |
| |
| ****************************************************************************** |
| 6. Monitoring Apache Accumulo |
| |
You can point your browser to the master host on port 50095 to see the status
| of accumulo across the cluster. You can even do this with the text-based |
| browser "links": |
| |
| $ links http://localhost:50095 |
| |
From this GUI, you can ensure that tablets are assigned, tables are online,
and tablet servers are up. You can monitor query and ingest rates across the
| cluster. |
| |
| ****************************************************************************** |
| 7. Stopping Apache Accumulo |
| |
| Do not kill the tabletservers or run bin/tdown.sh unless absolutely necessary. |
Recovery from a catastrophic loss of servers can take a long time. To shut down
cleanly, run "bin/stop-all.sh" and the master will orchestrate the shutdown of
| all the tablet servers. Shutdown waits for all writes to finish, so it may |
| take some time for particular configurations. |
| |
| ****************************************************************************** |
| 8. Logging |
| |
| DEBUG and above are logged to the logs/ dir. To modify this behavior change |
| the scripts in conf/. To change the logging dir, set ACCUMULO_LOG_DIR in |
| conf/accumulo-env.sh. Stdout and stderr of each accumulo process is |
| redirected to the log dir. |
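For example, the log directory can be overridden with a single line in
conf/accumulo-env.sh (the path shown is a hypothetical choice):

```shell
# conf/accumulo-env.sh fragment (sketch): send logs to a custom location.
export ACCUMULO_LOG_DIR=/var/log/accumulo
```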
| |
| ****************************************************************************** |
| 9. API |
| |
The public Accumulo API is composed of:

* All public classes and interfaces in the org.apache.accumulo.core.client
  package, as well as all of its subpackages excluding those named "impl".
* Key, Mutation, Value, Range, Condition, and ConditionalMutation in
  org.apache.accumulo.core.data.
* All public classes and interfaces in the org.apache.accumulo.minicluster
  package, as well as all of its subpackages excluding those named "impl".
* Anything with public or protected access within any class or interface that
  is in the public API. This includes, but is not limited to: methods, member
  classes, interfaces, and enums.
| |
| The Accumulo project maintains binary compatibility across this API within a major |
| release, as defined in the Java Language Specification 3rd ed. API changes should |
| only be made on major releases, with continued support of deprecated API elements |
| for at least one major revision. |
| |
To get started using Accumulo, review the examples and the javadoc for the
packages and classes mentioned above.
| |
| ****************************************************************************** |
| 10. Performance Tuning |
| |
| Apache Accumulo has exposed several configuration properties that can be |
| changed. These properties and configuration management are described in detail |
| in the user manual. While the default value is usually optimal, there are |
| cases where a change can increase query and ingest performance. |
| |
| Before changing a property from its default in a production system, you should |
| develop a good understanding of the property and consider creating a test to |
| prove the increased performance. |
| |
| ****************************************************************************** |
| |
| 11. Export Control |
| |
| This distribution includes cryptographic software. The country in which you |
| currently reside may have restrictions on the import, possession, use, and/or |
| re-export to another country, of encryption software. BEFORE using any |
| encryption software, please check your country's laws, regulations and |
| policies concerning the import, possession, or use, and re-export of encryption |
| software, to see if this is permitted. See <http://www.wassenaar.org/> for more |
| information. |
| |
| The U.S. Government Department of Commerce, Bureau of Industry and Security |
| (BIS), has classified this software as Export Commodity Control Number (ECCN) |
| 5D002.C.1, which includes information security software using or performing |
| cryptographic functions with asymmetric algorithms. The form and manner of this |
| Apache Software Foundation distribution makes it eligible for export under the |
| License Exception ENC Technology Software Unrestricted (TSU) exception (see the |
| BIS Export Administration Regulations, Section 740.13) for both object code and |
| source code. |
| |
| The following provides more details on the included cryptographic software: ... |
| |
Apache Accumulo uses the built-in java cryptography libraries in its RFile
encryption implementation. See
http://www.oracle.com/us/products/export/export-regulations-345813.html
for more details on Java's cryptography features. Apache Accumulo also uses
the bouncycastle library for some cryptographic technology. See
http://www.bouncycastle.org/wiki/display/JA1/Frequently+Asked+Questions for
more details on bouncycastle's cryptography features.
| |
| ****************************************************************************** |