docs/src/main/asciidoc/chapters/design.txt - accumulo - Git at Google

 // Licensed to the Apache Software Foundation (ASF) under one or more
 // contributor license agreements.  See the NOTICE file distributed with
 // this work for additional information regarding copyright ownership.
 // The ASF licenses this file to You under the Apache License, Version 2.0
 // (the "License"); you may not use this file except in compliance with
 // the License.  You may obtain a copy of the License at
 //
 //     http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.

 == Accumulo Design

 === Data Model

 Accumulo provides a richer data model than simple key-value stores, but is not a
 fully relational database. Data is represented as key-value pairs, where the key and
 value are comprised of the following elements:

 [width="75%",cols="^,^,^,^,^,^"]
 |===========================================================================
     5+|Key                                                      .3+^.^|Value
 .2+^.^|Row ID 3+|Column                        .2+^.^|Timestamp
                 |Family |Qualifier |Visibility
 |===========================================================================

 All elements of the Key and the Value are represented as byte arrays except for
 Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically
 in ascending order. Timestamps are sorted in descending order so that later
 versions of the same Key appear first in a sequential scan. Tables consist of a set of
 sorted key-value pairs.

 === Architecture

 Accumulo is a distributed data storage and retrieval system and as such consists of
 several architectural components, some of which run on many individual servers.
 Much of the work Accumulo does involves maintaining certain properties of the
 data, such as organization, availability, and integrity, across many commodity-class
 machines.

 === Components

 An instance of Accumulo includes many TabletServers, one Garbage Collector process,
 one Master server and many Clients.

 ==== Tablet Server

 The TabletServer manages some subset of all the tablets (partitions of tables). This includes receiving writes from clients, persisting writes to a
 write-ahead log, sorting new key-value pairs in memory, periodically
 flushing sorted key-value pairs to new files in HDFS, and responding
 to reads from clients, forming a merge-sorted view of all keys and
 values from all the files it has created and the sorted in-memory
 store.

 TabletServers also perform recovery of a tablet
 that was previously on a server that failed, reapplying any writes
 found in the write-ahead log to the tablet.

 ==== Garbage Collector

 Accumulo processes will share files stored in HDFS. Periodically, the Garbage
 Collector will identify files that are no longer needed by any process, and
 delete them. Multiple garbage collectors can be run to provide hot-standby support.
 They will perform leader election among themselves to choose a single active instance.

 ==== Master

 The Accumulo Master is responsible for detecting and responding to TabletServer
 failure. It tries to balance the load across TabletServer by assigning tablets carefully
 and instructing TabletServers to unload tablets when necessary. The Master ensures all
 tablets are assigned to one TabletServer each, and handles table creation, alteration,
 and deletion requests from clients. The Master also coordinates startup, graceful
 shutdown and recovery of changes in write-ahead logs when Tablet servers fail.

 Multiple masters may be run. The masters will choose among themselves a single master,
 and the others will become backups if the master should fail.

 ==== Tracer

 The Accumulo Tracer process supports the distributed timing API provided by Accumulo.
 One to many of these processes can be run on a cluster which will write the timing
 information to a given Accumulo table for future reference. Seeing the section on
 Tracing for more information on this support.

 ==== Monitor

 The Accumulo Monitor is a web application that provides a wealth of information about
 the state of an instance. The Monitor shows graphs and tables which contain information
 about read/write rates, cache hit/miss rates, and Accumulo table information such as scan
 rate and active/queued compactions. Additionally, the Monitor should always be the first
 point of entry when attempting to debug an Accumulo problem as it will show high-level problems
 in addition to aggregated errors from all nodes in the cluster. See the section on <<monitoring>>
 for more information.

 Multiple Monitors can be run to provide hot-standby support in the face of failure. Due to the
 forwarding of logs from remote hosts to the Monitor, only one Monitor process should be active
 at one time. Leader election will be performed internally to choose the active Monitor.

 ==== Client

 Accumulo includes a client library that is linked to every application. The client
 library contains logic for finding servers managing a particular tablet, and
 communicating with TabletServers to write and retrieve key-value pairs.

 === Data Management

 Accumulo stores data in tables, which are partitioned into tablets. Tablets are
 partitioned on row boundaries so that all of the columns and values for a particular
 row are found together within the same tablet. The Master assigns Tablets to one
 TabletServer at a time. This enables row-level transactions to take place without
 using distributed locking or some other complicated synchronization mechanism. As
 clients insert and query data, and as machines are added and removed from the
 cluster, the Master migrates tablets to ensure they remain available and that the
 ingest and query load is balanced across the cluster.

 image::data_distribution.png[width=500]

 === Tablet Service


 When a write arrives at a TabletServer it is written to a Write-Ahead Log and
 then inserted into a sorted data structure in memory called a MemTable. When the
 MemTable reaches a certain size, the TabletServer writes out the sorted
 key-value pairs to a file in HDFS called a Relative Key File (RFile), which is a
 kind of Indexed Sequential Access Method (ISAM) file. This process is called a
 minor compaction. A new MemTable is then created and the fact of the compaction
 is recorded in the Write-Ahead Log.

 When a request to read data arrives at a TabletServer, the TabletServer does a
 binary search across the MemTable as well as the in-memory indexes associated
 with each RFile to find the relevant values. If clients are performing a scan,
 several key-value pairs are returned to the client in order from the MemTable
 and the set of RFiles by performing a merge-sort as they are read.

 === Compactions

 In order to manage the number of files per tablet, periodically the TabletServer
 performs Major Compactions of files within a tablet, in which some set of RFiles
 are combined into one file. The previous files will eventually be removed by the
 Garbage Collector. This also provides an opportunity to permanently remove
 deleted key-value pairs by omitting key-value pairs suppressed by a delete entry
 when the new file is created.

 === Splitting

 When a table is created it has one tablet. As the table grows its initial
 tablet eventually splits into two tablets. Its likely that one of these
 tablets will migrate to another tablet server. As the table continues to grow,
 its tablets will continue to split and be migrated. The decision to
 automatically split a tablet is based on the size of a tablets files. The
 size threshold at which a tablet splits is configurable per table. In addition
 to automatic splitting, a user can manually add split points to a table to
 create new tablets. Manually splitting a new table can parallelize reads and
 writes giving better initial performance without waiting for automatic
 splitting.

 As data is deleted from a table, tablets may shrink. Over time this can lead
 to small or empty tablets. To deal with this, merging of tablets was
 introduced in Accumulo 1.4. This is discussed in more detail later.

 === Fault-Tolerance

 If a TabletServer fails, the Master detects it and automatically reassigns the tablets
 assigned from the failed server to other servers. Any key-value pairs that were in
 memory at the time the TabletServer fails are automatically reapplied from the Write-Ahead
 Log(WAL) to prevent any loss of data.

 Tablet servers write their WALs directly to HDFS so the logs are available to all tablet
 servers for recovery. To make the recovery process efficient, the updates within a log are
 grouped by tablet.  TabletServers can quickly apply the mutations from the sorted logs
 that are destined for the tablets they have now been assigned.

 TabletServer failures are noted on the Master's monitor page, accessible via
 +http://master-address:9995/monitor+.

 image::failure_handling.png[width=500]
	// Licensed to the Apache Software Foundation (ASF) under one or more
	// contributor license agreements. See the NOTICE file distributed with
	// this work for additional information regarding copyright ownership.
	// The ASF licenses this file to You under the Apache License, Version 2.0
	// (the "License"); you may not use this file except in compliance with
	// the License. You may obtain a copy of the License at
	//
	// http://www.apache.org/licenses/LICENSE-2.0
	//
	// Unless required by applicable law or agreed to in writing, software
	// distributed under the License is distributed on an "AS IS" BASIS,
	// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	// See the License for the specific language governing permissions and
	// limitations under the License.

	== Accumulo Design

	=== Data Model

	Accumulo provides a richer data model than simple key-value stores, but is not a
	fully relational database. Data is represented as key-value pairs, where the key and
	value are comprised of the following elements:

	[width="75%",cols="^,^,^,^,^,^"]
	\|===========================================================================
	5+\|Key .3+^.^\|Value
	.2+^.^\|Row ID 3+\|Column .2+^.^\|Timestamp
	\|Family \|Qualifier \|Visibility
	\|===========================================================================

	All elements of the Key and the Value are represented as byte arrays except for
	Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically
	in ascending order. Timestamps are sorted in descending order so that later
	versions of the same Key appear first in a sequential scan. Tables consist of a set of
	sorted key-value pairs.

	=== Architecture

	Accumulo is a distributed data storage and retrieval system and as such consists of
	several architectural components, some of which run on many individual servers.
	Much of the work Accumulo does involves maintaining certain properties of the
	data, such as organization, availability, and integrity, across many commodity-class
	machines.

	=== Components

	An instance of Accumulo includes many TabletServers, one Garbage Collector process,
	one Master server and many Clients.

	==== Tablet Server

	The TabletServer manages some subset of all the tablets (partitions of tables). This includes receiving writes from clients, persisting writes to a
	write-ahead log, sorting new key-value pairs in memory, periodically
	flushing sorted key-value pairs to new files in HDFS, and responding
	to reads from clients, forming a merge-sorted view of all keys and
	values from all the files it has created and the sorted in-memory
	store.

	TabletServers also perform recovery of a tablet
	that was previously on a server that failed, reapplying any writes
	found in the write-ahead log to the tablet.

	==== Garbage Collector

	Accumulo processes will share files stored in HDFS. Periodically, the Garbage
	Collector will identify files that are no longer needed by any process, and
	delete them. Multiple garbage collectors can be run to provide hot-standby support.
	They will perform leader election among themselves to choose a single active instance.

	==== Master

	The Accumulo Master is responsible for detecting and responding to TabletServer
	failure. It tries to balance the load across TabletServer by assigning tablets carefully
	and instructing TabletServers to unload tablets when necessary. The Master ensures all
	tablets are assigned to one TabletServer each, and handles table creation, alteration,
	and deletion requests from clients. The Master also coordinates startup, graceful
	shutdown and recovery of changes in write-ahead logs when Tablet servers fail.

	Multiple masters may be run. The masters will choose among themselves a single master,
	and the others will become backups if the master should fail.

	==== Tracer

	The Accumulo Tracer process supports the distributed timing API provided by Accumulo.
	One to many of these processes can be run on a cluster which will write the timing
	information to a given Accumulo table for future reference. Seeing the section on
	Tracing for more information on this support.

	==== Monitor

	The Accumulo Monitor is a web application that provides a wealth of information about
	the state of an instance. The Monitor shows graphs and tables which contain information
	about read/write rates, cache hit/miss rates, and Accumulo table information such as scan
	rate and active/queued compactions. Additionally, the Monitor should always be the first
	point of entry when attempting to debug an Accumulo problem as it will show high-level problems
	in addition to aggregated errors from all nodes in the cluster. See the section on <<monitoring>>
	for more information.

	Multiple Monitors can be run to provide hot-standby support in the face of failure. Due to the
	forwarding of logs from remote hosts to the Monitor, only one Monitor process should be active
	at one time. Leader election will be performed internally to choose the active Monitor.

	==== Client

	Accumulo includes a client library that is linked to every application. The client
	library contains logic for finding servers managing a particular tablet, and
	communicating with TabletServers to write and retrieve key-value pairs.

	=== Data Management

	Accumulo stores data in tables, which are partitioned into tablets. Tablets are
	partitioned on row boundaries so that all of the columns and values for a particular
	row are found together within the same tablet. The Master assigns Tablets to one
	TabletServer at a time. This enables row-level transactions to take place without
	using distributed locking or some other complicated synchronization mechanism. As
	clients insert and query data, and as machines are added and removed from the
	cluster, the Master migrates tablets to ensure they remain available and that the
	ingest and query load is balanced across the cluster.

	image::data_distribution.png[width=500]

	=== Tablet Service


	When a write arrives at a TabletServer it is written to a Write-Ahead Log and
	then inserted into a sorted data structure in memory called a MemTable. When the
	MemTable reaches a certain size, the TabletServer writes out the sorted
	key-value pairs to a file in HDFS called a Relative Key File (RFile), which is a
	kind of Indexed Sequential Access Method (ISAM) file. This process is called a
	minor compaction. A new MemTable is then created and the fact of the compaction
	is recorded in the Write-Ahead Log.

	When a request to read data arrives at a TabletServer, the TabletServer does a
	binary search across the MemTable as well as the in-memory indexes associated
	with each RFile to find the relevant values. If clients are performing a scan,
	several key-value pairs are returned to the client in order from the MemTable
	and the set of RFiles by performing a merge-sort as they are read.

	=== Compactions

	In order to manage the number of files per tablet, periodically the TabletServer
	performs Major Compactions of files within a tablet, in which some set of RFiles
	are combined into one file. The previous files will eventually be removed by the
	Garbage Collector. This also provides an opportunity to permanently remove
	deleted key-value pairs by omitting key-value pairs suppressed by a delete entry
	when the new file is created.

	=== Splitting

	When a table is created it has one tablet. As the table grows its initial
	tablet eventually splits into two tablets. Its likely that one of these
	tablets will migrate to another tablet server. As the table continues to grow,
	its tablets will continue to split and be migrated. The decision to
	automatically split a tablet is based on the size of a tablets files. The
	size threshold at which a tablet splits is configurable per table. In addition
	to automatic splitting, a user can manually add split points to a table to
	create new tablets. Manually splitting a new table can parallelize reads and
	writes giving better initial performance without waiting for automatic
	splitting.

	As data is deleted from a table, tablets may shrink. Over time this can lead
	to small or empty tablets. To deal with this, merging of tablets was
	introduced in Accumulo 1.4. This is discussed in more detail later.

	=== Fault-Tolerance

	If a TabletServer fails, the Master detects it and automatically reassigns the tablets
	assigned from the failed server to other servers. Any key-value pairs that were in
	memory at the time the TabletServer fails are automatically reapplied from the Write-Ahead
	Log(WAL) to prevent any loss of data.

	Tablet servers write their WALs directly to HDFS so the logs are available to all tablet
	servers for recovery. To make the recovery process efficient, the updates within a log are
	grouped by tablet. TabletServers can quickly apply the mutations from the sorted logs
	that are destined for the tablets they have now been assigned.

	TabletServer failures are noted on the Master's monitor page, accessible via
	+http://master-address:9995/monitor+.

	image::failure_handling.png[width=500]