
 ******************************************************************************
 0. Introduction

 Apache Accumulo is a sorted, distributed key/value store based on Google's
 BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It
 features a few novel improvements on the BigTable design in the form of
 cell-level access labels and a server-side programming mechanism that can modify
 key/value pairs at various points in the data management process.

 ******************************************************************************
 1. Building

 In the normal tarball or RPM release of Apache Accumulo, everything is built and
 ready to go: there is no build step.

 However, if you only have source code, or you wish to make changes, you need to
 have maven configured to get Accumulo prerequisites from repositories.  See
 the pom.xml file for the necessary components.

 The libthrift 0.3 jar is no longer available from a repository.  This jar will
 be automatically built from the thrift tag and installed into your local maven
 repository during the Accumulo build process via the
 src/assemble/install-thrift-jar.sh script.

 Run the following commands to build Accumulo.

     tar xvzf accumulo-1.3.6-src.tar.gz
     cd accumulo-1.3.6
     mvn package && mvn assembly:single

 ******************************************************************************
 2. Deployment

 Copy the accumulo tar file produced by "mvn package && mvn assembly:single" from
 the target/ directory to the desired destination, then untar it (e.g.
 tar xvzf accumulo-1.3.6-dist.tar.gz).

 If you are using the RPM, install the RPM on every machine that will run
 accumulo.

 ******************************************************************************
 3. Configuring

 Apache Accumulo has two prerequisites, Hadoop and Zookeeper. Zookeeper must be
 at least version 3.3.0. Both of these must be installed and configured.

 Ensure you (or some special hadoop user account) have accounts on all of
 the machines in the cluster and that hadoop and accumulo install files can be
 found in the same location on every machine in the cluster.  You will need to
 have password-less ssh set up as described in the hadoop documentation.

 Apache Accumulo 1.3.6 has been tested with hadoop versions 0.20.1 and 0.20.2.

 Create a "slaves" file in $ACCUMULO_HOME/conf/.  This is a list of machines
 where tablet servers and loggers will run.

 Create a "masters" file in $ACCUMULO_HOME/conf/.  This is a list of
 machines where the master server will run.
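
 As a minimal sketch of the two files described above, the snippet below writes
 a "slaves" and a "masters" file.  The host names are hypothetical placeholders,
 and one host name per line is assumed.  The files are written to a scratch
 directory so that nothing in a real install is overwritten; copy them into
 $ACCUMULO_HOME/conf/ once reviewed.

```shell
#!/bin/sh
# Sketch only: host names are hypothetical placeholders.
# A scratch directory stands in for $ACCUMULO_HOME/conf.
CONF_DIR="$(mktemp -d)"

# Machines that will run tablet servers and loggers, one per line.
cat > "$CONF_DIR/slaves" <<'EOF'
worker1.example.com
worker2.example.com
worker3.example.com
EOF

# Machine(s) that will run the master server.
cat > "$CONF_DIR/masters" <<'EOF'
master1.example.com
EOF

slave_count=$(wc -l < "$CONF_DIR/slaves" | tr -d ' ')
echo "wrote $slave_count slaves and the masters file to $CONF_DIR"
```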

 Create conf/accumulo-env.sh following the template of
 conf/accumulo-env.sh.example.  Set JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME.
 These directories must be at the same location on every node in the cluster.
 Note that zookeeper must be installed on every machine, but it should not be
 run on every machine.
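
 The three settings might look like the lines below, followed by a small check
 that can be run on each node to confirm the directories exist there.  The
 fallback paths are hypothetical examples, not defaults shipped with accumulo.

```shell
#!/bin/sh
# Sketch only: the fallback paths are hypothetical; use the locations
# that exist on every node of your cluster.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java}"
export HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
export ZOOKEEPER_HOME="${ZOOKEEPER_HOME:-/opt/zookeeper}"

# Report each variable whose directory is missing on this machine.
missing=0
for name in JAVA_HOME HADOOP_HOME ZOOKEEPER_HOME; do
    eval "dir=\$$name"
    if [ ! -d "$dir" ]; then
        echo "$name points at a missing directory: $dir"
        missing=$((missing + 1))
    fi
done
echo "$missing of 3 directories missing"
```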

 * Note that you will be specifying the Java heap space in accumulo-env.sh.
 You should make sure that the total heap space used for the accumulo tserver,
 logger and the hadoop datanode and tasktracker is less than the available
 memory on each slave node in the cluster.  On large clusters, it is recommended
 that the accumulo master, hadoop namenode, secondary namenode, and hadoop
 jobtracker all be run on separate machines to allow them to use more heap
 space.  If you are running these on the same machine on a small cluster, make
 sure their heap space settings fit within the available memory.  The zookeeper
 instances are also time sensitive and should be on machines that will not be
 heavily loaded, or over-subscribed for memory.
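
 The heap budget for a slave node can be checked with simple arithmetic.  The
 sketch below uses made-up heap sizes and a made-up memory size, all in
 megabytes; substitute the values from your own accumulo-env.sh and hadoop
 configuration.

```shell
#!/bin/sh
# Sketch only: every number below is a hypothetical example, in megabytes.
tserver_heap_mb=2048
logger_heap_mb=512
datanode_heap_mb=1024
tasktracker_heap_mb=1024
node_memory_mb=8192   # physical memory on one slave node

total_heap_mb=$((tserver_heap_mb + logger_heap_mb + datanode_heap_mb + tasktracker_heap_mb))

if [ "$total_heap_mb" -lt "$node_memory_mb" ]; then
    echo "ok: $total_heap_mb MB of heap fits in $node_memory_mb MB"
else
    echo "over-subscribed: $total_heap_mb MB of heap, $node_memory_mb MB available"
fi
```

 Leave some headroom below the physical memory size, since the operating
 system and each JVM also use memory outside the Java heap.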

 Create conf/accumulo-site.xml.  You must set the zookeeper servers in this
 file (instance.zookeeper.host).  Look at docs/config.html to see what
 additional variables you can modify and what the defaults are.
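
 A minimal accumulo-site.xml might look like the file written by the sketch
 below.  It assumes the hadoop-style configuration/property/name/value layout,
 and the zookeeper host names and port are hypothetical.  The file is written to
 a scratch directory rather than to a real conf directory.

```shell
#!/bin/sh
# Sketch only: zookeeper host names and port are hypothetical, and the
# hadoop-style XML layout is an assumption. A scratch directory stands
# in for $ACCUMULO_HOME/conf.
CONF_DIR="$(mktemp -d)"

cat > "$CONF_DIR/accumulo-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>instance.zookeeper.host</name>
    <value>zoo1.example.com:2181,zoo2.example.com:2181</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/accumulo-site.xml"
```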

 Create the write-ahead log directory on all slaves.  The directory is set in
 accumulo-site.xml as the "logger.dir.walog" parameter.  It is a local
 directory that holds the log of updates used for recovery in the event of
 tablet server failure, so it is important that it have sufficient space and
 reliability.
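
 One way to create the directory on every slave is to loop over the host list
 with ssh.  The sketch below is a dry run that only prints the commands; the
 directory path and host names are hypothetical, and the path must match the
 "logger.dir.walog" value in accumulo-site.xml.

```shell
#!/bin/sh
# Sketch only: the walog path and host names are hypothetical.
WALOG_DIR="/var/accumulo/walogs"
SLAVES_FILE="$(mktemp)"
printf '%s\n' worker1.example.com worker2.example.com > "$SLAVES_FILE"

# DRY_RUN=1 prints the commands instead of running them over ssh.
DRY_RUN=1
planned=0
while read -r host; do
    [ -z "$host" ] && continue
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh $host mkdir -p $WALOG_DIR"
    else
        # -n keeps ssh from consuming the rest of the host list on stdin.
        ssh -n "$host" "mkdir -p '$WALOG_DIR'"
    fi
    planned=$((planned + 1))
done < "$SLAVES_FILE"
echo "$planned hosts processed"
```

 To run it for real, point SLAVES_FILE at $ACCUMULO_HOME/conf/slaves and set
 DRY_RUN=0.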

 Synchronize your accumulo conf directory across the cluster.  As a precaution
 against mis-configured systems, servers using different configuration files
 will not communicate with the rest of the cluster.

 ******************************************************************************
 4. Running Apache Accumulo

 Make sure hadoop is configured on all of the machines in the cluster, including
 access to a shared hdfs instance.  Make sure hdfs is running.

 Make sure zookeeper is configured and running on at least one machine in the
 cluster.

 Run "bin/accumulo init" to create the hdfs directory structure
 (hdfs:///accumulo/*) and initial zookeeper settings. This will also allow you
 to configure the initial root password. Only do this once.

 Start accumulo using the bin/start-all.sh script.

 Use the "bin/accumulo shell -u <username>" command to run an accumulo shell
 interpreter.  Within this interpreter, run "createtable <tablename>" to create
 a table, and run "table <tablename>" followed by "scan" to scan a table.

 In the example below a table is created, data is inserted, and the table is
 scanned.

     $ ./bin/accumulo shell -u root
     Enter current password for 'root'@'acu13': ******

     Shell - Apache Accumulo Interactive Shell
     -
     - version: 1.3.6
     - instance name: acu13
     - instance id: f5947fe6-081e-41a8-9877-43730c4dfc6f
     -
     - type 'help' for a list of available commands
     -
     root@ac> createtable foo
     root@ac foo> insert row1 colf1 colq1 val1
     root@ac foo> insert row1 colf1 colq2 val2
     root@ac foo> scan
     row1 colf1:colq1 []    val1
     row1 colf1:colq2 []    val2

 The example below starts the shell, switches to table foo, and scans for a
 certain column.

     $ ./bin/accumulo shell -u root
     Enter current password for 'root'@'acu13': ******

     Shell - Apache Accumulo Interactive Shell
     -
     - version: 1.3.6
     - instance name: acu13
     - instance id: f5947fe6-081e-41a8-9877-43730c4dfc6f
     -
     - type 'help' for a list of available commands
     -
     root@ac> table foo
     root@ac foo> scan -c colf1:colq2
     row1 colf1:colq2 []    val2


 ******************************************************************************
 5. Monitoring Apache Accumulo

 You can point your browser to the master host on port 50095 to see the status
 of accumulo across the cluster.  You can even do this with the text-based
 browser "links":

     $ links http://localhost:50095

 From this GUI, you can ensure that tablets are assigned, tables are online,
 and tablet servers are up. You can also monitor query and ingest rates across
 the cluster.

 ******************************************************************************
 6. Stopping Apache Accumulo

 Do not kill the tablet servers or run bin/tdown.sh unless absolutely necessary.
 Recovery from a catastrophic loss of servers can take a long time. To shut down
 cleanly, run "bin/stop-all.sh" and the master will orchestrate the shutdown of
 all the tablet servers.  Shutdown waits for all writes to finish, so it may
 take some time for particular configurations.

 ******************************************************************************
 7. Logging

 Messages at level DEBUG and above are logged to the logs/ dir.  To modify this
 behavior, change the scripts in conf/.  To change the logging dir, set
 ACCUMULO_LOG_DIR in conf/accumulo-env.sh.  Stdout and stderr of each accumulo
 process are redirected to the log dir.
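
 As a small sketch, the lines below set ACCUMULO_LOG_DIR the way it might
 appear in conf/accumulo-env.sh and then list the largest files in it.  The
 directory is a hypothetical scratch location chosen so the snippet is safe to
 try.

```shell
#!/bin/sh
# Sketch only: the directory is a hypothetical scratch location.
export ACCUMULO_LOG_DIR="$(mktemp -d)/accumulo-logs"
mkdir -p "$ACCUMULO_LOG_DIR"

# List the largest files first to spot a process that logs heavily.
ls -S "$ACCUMULO_LOG_DIR" | head -n 5
echo "log dir: $ACCUMULO_LOG_DIR"
```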

 ******************************************************************************
 8. API

 The public accumulo API is composed of everything in the
 org.apache.accumulo.core.client package (excluding the
 org.apache.accumulo.core.client.impl package) and the following classes from
 org.apache.accumulo.core.data: Key, Mutation, Value, and Range.  To get started
 using accumulo, review the example and the javadoc for the packages and classes
 mentioned above.

 ******************************************************************************
 9. Performance Tuning

 Accumulo exposes several configuration properties that can be changed.
 These properties and configuration management are described in detail in
 docs/config.html.  While the default values are usually optimal, there are
 cases where a change can increase query and ingest performance.

 Before changing a property from its default in a production system, you should
 develop a good understanding of the property and consider creating a test to
 prove the increased performance.

 ******************************************************************************