blob: 6c77cb6408ab0c4940fde7e7bc0676e3c1648700 [file] [log] [blame]
// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
== Accumulo Design
=== Data Model
Accumulo provides a richer data model than simple key-value stores, but is not a
fully relational database. Data is represented as key-value pairs, where the key and
value are comprised of the following elements:
[width="75%",cols="^,^,^,^,^,^"]
|===========================================================================
5+|Key .3+^.^|Value
.2+^.^|Row ID 3+|Column .2+^.^|Timestamp
|Family |Qualifier |Visibility
|===========================================================================
All elements of the Key and the Value are represented as byte arrays except for
Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically
in ascending order. Timestamps are sorted in descending order so that later
versions of the same Key appear first in a sequential scan. Tables consist of a set of
sorted key-value pairs.
=== Architecture
Accumulo is a distributed data storage and retrieval system and as such consists of
several architectural components, some of which run on many individual servers.
Much of the work Accumulo does involves maintaining certain properties of the
data, such as organization, availability, and integrity, across many commodity-class
machines.
=== Components
An instance of Accumulo includes many TabletServers, one Garbage Collector process,
one Master server and many Clients.
==== Tablet Server
The TabletServer manages some subset of all the tablets (partitions of tables). This includes receiving writes from clients, persisting writes to a
write-ahead log, sorting new key-value pairs in memory, periodically
flushing sorted key-value pairs to new files in HDFS, and responding
to reads from clients, forming a merge-sorted view of all keys and
values from all the files it has created and the sorted in-memory
store.
TabletServers also perform recovery of a tablet
that was previously on a server that failed, reapplying any writes
found in the write-ahead log to the tablet.
==== Garbage Collector
Accumulo processes will share files stored in HDFS. Periodically, the Garbage
Collector will identify files that are no longer needed by any process, and
delete them. Multiple garbage collectors can be run to provide hot-standby support.
They will perform leader election among themselves to choose a single active instance.
==== Master
The Accumulo Master is responsible for detecting and responding to TabletServer
failure. It tries to balance the load across TabletServer by assigning tablets carefully
and instructing TabletServers to unload tablets when necessary. The Master ensures all
tablets are assigned to one TabletServer each, and handles table creation, alteration,
and deletion requests from clients. The Master also coordinates startup, graceful
shutdown and recovery of changes in write-ahead logs when Tablet servers fail.
Multiple masters may be run. The masters will choose among themselves a single master,
and the others will become backups if the master should fail.
==== Tracer
The Accumulo Tracer process supports the distributed timing API provided by Accumulo.
One to many of these processes can be run on a cluster which will write the timing
information to a given Accumulo table for future reference. Seeing the section on
Tracing for more information on this support.
==== Monitor
The Accumulo Monitor is a web application that provides a wealth of information about
the state of an instance. The Monitor shows graphs and tables which contain information
about read/write rates, cache hit/miss rates, and Accumulo table information such as scan
rate and active/queued compactions. Additionally, the Monitor should always be the first
point of entry when attempting to debug an Accumulo problem as it will show high-level problems
in addition to aggregated errors from all nodes in the cluster. See the section on <<monitoring>>
for more information.
Multiple Monitors can be run to provide hot-standby support in the face of failure. Due to the
forwarding of logs from remote hosts to the Monitor, only one Monitor process should be active
at one time. Leader election will be performed internally to choose the active Monitor.
==== Client
Accumulo includes a client library that is linked to every application. The client
library contains logic for finding servers managing a particular tablet, and
communicating with TabletServers to write and retrieve key-value pairs.
=== Data Management
Accumulo stores data in tables, which are partitioned into tablets. Tablets are
partitioned on row boundaries so that all of the columns and values for a particular
row are found together within the same tablet. The Master assigns Tablets to one
TabletServer at a time. This enables row-level transactions to take place without
using distributed locking or some other complicated synchronization mechanism. As
clients insert and query data, and as machines are added and removed from the
cluster, the Master migrates tablets to ensure they remain available and that the
ingest and query load is balanced across the cluster.
image::data_distribution.png[width=500]
=== Tablet Service
When a write arrives at a TabletServer it is written to a Write-Ahead Log and
then inserted into a sorted data structure in memory called a MemTable. When the
MemTable reaches a certain size, the TabletServer writes out the sorted
key-value pairs to a file in HDFS called a Relative Key File (RFile), which is a
kind of Indexed Sequential Access Method (ISAM) file. This process is called a
minor compaction. A new MemTable is then created and the fact of the compaction
is recorded in the Write-Ahead Log.
When a request to read data arrives at a TabletServer, the TabletServer does a
binary search across the MemTable as well as the in-memory indexes associated
with each RFile to find the relevant values. If clients are performing a scan,
several key-value pairs are returned to the client in order from the MemTable
and the set of RFiles by performing a merge-sort as they are read.
=== Compactions
In order to manage the number of files per tablet, periodically the TabletServer
performs Major Compactions of files within a tablet, in which some set of RFiles
are combined into one file. The previous files will eventually be removed by the
Garbage Collector. This also provides an opportunity to permanently remove
deleted key-value pairs by omitting key-value pairs suppressed by a delete entry
when the new file is created.
=== Splitting
When a table is created it has one tablet. As the table grows its initial
tablet eventually splits into two tablets. Its likely that one of these
tablets will migrate to another tablet server. As the table continues to grow,
its tablets will continue to split and be migrated. The decision to
automatically split a tablet is based on the size of a tablets files. The
size threshold at which a tablet splits is configurable per table. In addition
to automatic splitting, a user can manually add split points to a table to
create new tablets. Manually splitting a new table can parallelize reads and
writes giving better initial performance without waiting for automatic
splitting.
As data is deleted from a table, tablets may shrink. Over time this can lead
to small or empty tablets. To deal with this, merging of tablets was
introduced in Accumulo 1.4. This is discussed in more detail later.
=== Fault-Tolerance
If a TabletServer fails, the Master detects it and automatically reassigns the tablets
assigned from the failed server to other servers. Any key-value pairs that were in
memory at the time the TabletServer fails are automatically reapplied from the Write-Ahead
Log(WAL) to prevent any loss of data.
Tablet servers write their WALs directly to HDFS so the logs are available to all tablet
servers for recovery. To make the recovery process efficient, the updates within a log are
grouped by tablet. TabletServers can quickly apply the mutations from the sorted logs
that are destined for the tablets they have now been assigned.
TabletServer failures are noted on the Master's monitor page, accessible via
+http://master-address:9995/monitor+.
image::failure_handling.png[width=500]