// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
== Writing Accumulo Clients
=== Running Client Code
There are multiple ways to run Java code that uses Accumulo. Below is a list
of the different ways to execute client code.
* using java executable
* using the accumulo script
* using the tool script
In order to run client code written to run against Accumulo, you will need to
include the jars that Accumulo depends on in your classpath. Accumulo client
code depends on Hadoop and Zookeeper. For Hadoop add the hadoop client jar, all
of the jars in the Hadoop lib directory, and the conf directory to the
classpath. For recent Zookeeper versions, you only need to add the Zookeeper jar, and not
what is in the Zookeeper lib directory. You can run the following command on a
configured Accumulo system to see what it is using for its classpath.
$ACCUMULO_HOME/bin/accumulo classpath
Another option for running your code is to put a jar file in
+$ACCUMULO_HOME/lib/ext+. After doing this you can use the accumulo
script to execute your code. For example, if you create a jar containing the
class +com.foo.Client+ and place it in +lib/ext+, then you can use the command
+$ACCUMULO_HOME/bin/accumulo com.foo.Client+ to execute your code.
If you are writing a MapReduce job that accesses Accumulo, then you can use the
+bin/tool.sh+ script to run those jobs. See the MapReduce example.
=== Connecting
All clients must first identify the Accumulo instance to which they will be
communicating. Code to do this is as follows:
[source,java]
----
String instanceName = "myinstance";
String zooServers = "zooserver-one,zooserver-two";
Instance inst = new ZooKeeperInstance(instanceName, zooServers);
Connector conn = inst.getConnector("user", new PasswordToken("passwd"));
----
The PasswordToken is the most common implementation of an `AuthenticationToken`.
This general interface allows authentication as an Accumulo user to come from
a variety of sources or means. The CredentialProviderToken leverages the Hadoop
CredentialProviders (new in Hadoop 2.6).
For example, the CredentialProviderToken can be used in conjunction with a Java
KeyStore to avoid storing passwords in cleartext. When stored in HDFS, a single
KeyStore can be used across an entire instance. Be aware that KeyStores stored on
the local filesystem must be made available to all nodes in the Accumulo cluster.
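A minimal sketch of this approach is shown below. The alias, keystore location, and
user name are placeholder assumptions; the keystore is assumed to have been created
beforehand with the +hadoop credential create+ command.
[source,java]
----
// assumes a credential with alias "root.password" was stored in the keystore, e.g.
//   hadoop credential create root.password -provider jceks://hdfs/accumulo/accumulo.jceks
String alias = "root.password";
String providers = "jceks://hdfs/accumulo/accumulo.jceks";
AuthenticationToken token = new CredentialProviderToken(alias, providers);
Connector conn = inst.getConnector("root", token);
----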
[source,java]
----
KerberosToken token = new KerberosToken();
Connector conn = inst.getConnector(token.getPrincipal(), token);
----
The KerberosToken can be provided to use the authentication provided by Kerberos.
Using Kerberos requires external setup and additional configuration, but provides
a single point of authentication through HDFS, YARN and ZooKeeper, and allows
for password-less authentication with Accumulo.
=== Writing Data
Data are written to Accumulo by creating Mutation objects that represent all the
changes to the columns of a single row. The changes are made atomically in the
TabletServer. Clients then add Mutations to a BatchWriter which submits them to
the appropriate TabletServers.
Mutations can be created thus:
[source,java]
----
Text rowID = new Text("row1");
Text colFam = new Text("myColFam");
Text colQual = new Text("myColQual");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();
Value value = new Value("myValue".getBytes());
Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);
----
==== BatchWriter
The BatchWriter is highly optimized to send Mutations to multiple TabletServers
and automatically batches Mutations destined for the same TabletServer to
amortize network overhead. Care must be taken to avoid changing the contents of
any Object passed to the BatchWriter since it keeps objects in memory while
batching.
Mutations are added to a BatchWriter thus:
[source,java]
----
// BatchWriterConfig has reasonable defaults
BatchWriterConfig config = new BatchWriterConfig();
config.setMaxMemory(10000000L); // bytes available to batchwriter for buffering mutations
BatchWriter writer = conn.createBatchWriter("table", config);
writer.addMutation(mutation);
writer.close();
----
An example of using the batch writer can be found at
+accumulo/docs/examples/README.batch+.
==== ConditionalWriter
The ConditionalWriter enables efficient, atomic read-modify-write operations on
rows. The ConditionalWriter writes special Mutations which have a list of per
column conditions that must all be met before the mutation is applied. The
conditions are checked in the tablet server while a row lock is
held (Mutations written by the BatchWriter will not obtain a row
lock). The conditions that can be checked for a column are equality and
absence. For example a conditional mutation can require that column A is
absent inorder to be applied. Iterators can be applied when checking
conditions. Using iterators, many other operations besides equality and
absence can be checked. For example, using an iterator that converts values
less than 5 to 0 and everything else to 1, it is possible to only apply a
mutation when a column is less than 5.
In the case when a tablet server dies after a client sent a conditional
mutation, it is not known whether the mutation was applied. When this happens
the ConditionalWriter reports a status of UNKNOWN for the ConditionalMutation.
In many cases this situation can be dealt with by simply reading the row again
and possibly sending another conditional mutation. If this is not sufficient,
then a higher level of abstraction can be built by storing transactional
information within a row.
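The following is a minimal sketch of a conditional update. The table name, row, and
column names are placeholder assumptions; a condition with no value requires the
column to be absent before the new value is written.
[source,java]
----
ConditionalWriter cw = conn.createConditionalWriter("table", new ConditionalWriterConfig());

// only apply the update if the column is currently absent
ConditionalMutation cm = new ConditionalMutation("row1");
cm.addCondition(new Condition("myColFam", "myColQual"));
cm.put(new Text("myColFam"), new Text("myColQual"), new Value("myValue".getBytes()));

ConditionalWriter.Result result = cw.write(cm);
ConditionalWriter.Status status = result.getStatus(); // ACCEPTED, REJECTED, or UNKNOWN
cw.close();
----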
An example of using the conditional writer can be found at
+accumulo/docs/examples/README.reservations+.
==== Durability
By default, Accumulo writes out any updates to the Write-Ahead Log (WAL). Every change
goes into a file in HDFS and is sync'd to disk for maximum durability. In
the event of a failure, writes held in memory are replayed from the WAL. Like
all files in HDFS, this file is also replicated. Sending updates to the
replicas, and waiting for a permanent sync to disk, can significantly reduce write speeds.
Accumulo allows users to use less tolerant forms of durability when writing.
These levels are:
* none: no durability guarantees are made, the WAL is not used
* log: the WAL is used, but not flushed; loss of the server probably means recent writes are lost
* flush: updates are written to the WAL, and flushed out to replicas; loss of a single server is unlikely to result in data loss.
* sync: updates are written to the WAL, and synced to disk on all replicas before the write is acknowledged. Data will not be lost even if the entire cluster suddenly loses power.
The user can set the default durability of a table in the shell. When
writing, the user can configure the BatchWriter or ConditionalWriter to use
a different level of durability for the session. This will override the
default durability setting.
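For example, the table default could be set from the Accumulo shell (the table name
+mytable+ here is a placeholder):
config -t mytable -s table.durability=flush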
[source,java]
----
BatchWriterConfig cfg = new BatchWriterConfig();
// We don't care about data loss with these writes:
// This is DANGEROUS:
cfg.setDurability(Durability.NONE);
Connector conn = ... ;
BatchWriter bw = conn.createBatchWriter(table, cfg);
----
=== Reading Data
Accumulo is optimized to quickly retrieve the value associated with a given key, and
to efficiently return ranges of consecutive keys and their associated values.
==== Scanner
To retrieve data, Clients use a Scanner, which acts like an Iterator over
keys and values. Scanners can be configured to start and stop at particular keys, and
to return a subset of the columns available.
[source,java]
----
// specify which visibilities we are allowed to see
Authorizations auths = new Authorizations("public");
Scanner scan = conn.createScanner("table", auths);
scan.setRange(new Range("harry","john"));
scan.fetchColumnFamily(new Text("attributes"));

for(Entry<Key,Value> entry : scan) {
    Text row = entry.getKey().getRow();
    Value value = entry.getValue();
}
----
==== Isolated Scanner
Accumulo supports the ability to present an isolated view of rows when
scanning. There are three possible ways that a row could change in Accumulo:
* a mutation applied to a table
* iterators executed as part of a minor or major compaction
* bulk import of new files
Isolation guarantees that either all or none of the changes made by these
operations on a row are seen. Use the IsolatedScanner to obtain an isolated
view of an Accumulo table. When using the regular scanner it is possible to see
a non-isolated view of a row. For example, if a mutation modifies three
columns, it is possible that you will only see two of those modifications.
With the isolated scanner, either all three of the changes are seen or none.
The IsolatedScanner buffers rows on the client side so a large row will not
crash a tablet server. By default rows are buffered in memory, but the user
can easily supply their own buffer if they wish to buffer to disk when rows are
large.
For an example, look at
+examples/simple/src/main/java/org/apache/accumulo/examples/simple/isolation/InterferenceTest.java+
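Below is a minimal sketch of wrapping a regular scanner; the table name, authorizations,
and range are placeholder assumptions.
[source,java]
----
Scanner scanner = conn.createScanner("table", auths);
// wrap the scanner to get row-level isolation; rows are buffered on the client
Scanner isolated = new IsolatedScanner(scanner);
isolated.setRange(new Range("harry", "john"));
for(Entry<Key,Value> entry : isolated) {
  // each row seen here reflects either all or none of a mutation's changes
}
----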
==== BatchScanner
For some types of access, it is more efficient to retrieve several ranges
simultaneously. This arises, for example, when accessing a set of non-consecutive
rows whose IDs have been retrieved from a secondary index.
The BatchScanner is configured similarly to the Scanner; it can be configured to
retrieve a subset of the columns available, but rather than passing a single Range,
BatchScanners accept a set of Ranges. It is important to note that the keys returned
by a BatchScanner are not in sorted order since the keys streamed are from multiple
TabletServers in parallel.
[source,java]
----
ArrayList<Range> ranges = new ArrayList<Range>();
// populate list of ranges ...
BatchScanner bscan = conn.createBatchScanner("table", auths, 10);
bscan.setRanges(ranges);
bscan.fetchColumnFamily(new Text("attributes"));

for(Entry<Key,Value> entry : bscan) {
    System.out.println(entry.getValue());
}
----
An example of the BatchScanner can be found at
+accumulo/docs/examples/README.batch+.
=== Proxy
The proxy API allows interaction with Accumulo from languages other than Java.
A proxy server is provided in the codebase, and a client can then be generated.
The proxy API can also be used instead of the traditional ZooKeeperInstance class to
provide a single TCP port by which clients can be securely routed through a firewall,
without requiring access to all tablet servers in the cluster.
==== Prerequisites
The proxy server can live on any node on which the basic client API would work. That
means it must be able to communicate with the Master, ZooKeepers, NameNode, and the
DataNodes. A proxy client only needs the ability to communicate with the proxy server.
==== Configuration
The configuration options for the proxy server live inside of a properties file. At
the very least, you need to supply the following properties:
----
protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory
tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken
port=42424
instance=test
zookeepers=localhost:2181
----
You can find a sample configuration file in your distribution:
$ACCUMULO_HOME/proxy/proxy.properties.
This sample configuration file further demonstrates an ability to back the proxy server
by MockAccumulo or the MiniAccumuloCluster.
==== Running the Proxy Server
After the properties file holding the configuration is created, the proxy server
can be started using the following command in the Accumulo distribution (assuming
your properties file is named +config.properties+):
$ACCUMULO_HOME/bin/accumulo proxy -p config.properties
==== Creating a Proxy Client
Aside from installing the Thrift compiler, you will also need the language-specific library
for Thrift installed to generate client code in that language. Typically, your operating
system's package manager will be able to automatically install these for you in an expected
location such as +/usr/lib/python/site-packages/thrift+.
You can find the thrift file for generating the client:
$ACCUMULO_HOME/proxy/proxy.thrift.
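For example, a Python client could be generated with a command like the following
(assuming the Thrift compiler is installed and on your path):
thrift --gen py $ACCUMULO_HOME/proxy/proxy.thrift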
After a client is generated, the port specified in the configuration properties above will be
used to connect to the server.
==== Using a Proxy Client
The following examples have been written in Java and the method signatures may be
slightly different depending on the language specified when generating the client with
the Thrift compiler. After initiating a connection to the Proxy (see Apache Thrift's
documentation for examples of connecting to a Thrift service), the methods on the
proxy client will be available. The first thing to do is log in:
[source,java]
Map<String,String> password = new HashMap<String,String>();
password.put("password", "secret");
ByteBuffer token = client.login("root", password);
Once logged in, the token returned will be used for most subsequent calls to the client.
Let's create a table, add some data, scan the table, and delete it.
First, create a table.
[source,java]
client.createTable(token, "myTable", true, TimeType.MILLIS);
Next, add some data:
[source,java]
----
// first, create a writer on the server
String writer = client.createWriter(token, "myTable", new WriterOptions());
// row ID
ByteBuffer rowid = ByteBuffer.wrap("UUID".getBytes());
// a ColumnUpdate is the proxy's mutation-like class
ColumnUpdate cu = new ColumnUpdate();
cu.setColFamily("MyFamily".getBytes());
cu.setColQualifier("MyQualifier".getBytes());
cu.setColVisibility("VisLabel".getBytes());
cu.setValue("Some Value.".getBytes());
List<ColumnUpdate> updates = new ArrayList<ColumnUpdate>();
updates.add(cu);
// build column updates
Map<ByteBuffer, List<ColumnUpdate>> cellsToUpdate = new HashMap<ByteBuffer, List<ColumnUpdate>>();
cellsToUpdate.put(rowid, updates);
// send updates to the server and flush them
client.update(writer, cellsToUpdate);
client.flush(writer);
client.closeWriter(writer);
----
Scan for the data and batch the return of the results on the server:
[source,java]
----
String scanner = client.createScanner(token, "myTable", new ScanOptions());
ScanResult results = client.nextK(scanner, 100);
for (KeyValue keyValue : results.getResults()) {
    // do something with results
}
client.closeScanner(scanner);
----