| // Licensed to the Apache Software Foundation (ASF) under one or more |
| // contributor license agreements. See the NOTICE file distributed with |
| // this work for additional information regarding copyright ownership. |
| // The ASF licenses this file to You under the Apache License, Version 2.0 |
| // (the "License"); you may not use this file except in compliance with |
| // the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, software |
| // distributed under the License is distributed on an "AS IS" BASIS, |
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| // See the License for the specific language governing permissions and |
| // limitations under the License. |
| |
| == Writing Accumulo Clients |
| |
| === Running Client Code |
| |
| There are multiple ways to run Java code that uses Accumulo. Below is a list |
| of the different ways to execute client code. |
| |
| * using java executable |
| * using the accumulo script |
| * using the tool script |
| |
| In order to run client code written to run against Accumulo, you will need to |
| include the jars that Accumulo depends on in your classpath. Accumulo client |
| code depends on Hadoop and Zookeeper. For Hadoop add the hadoop client jar, all |
| of the jars in the Hadoop lib directory, and the conf directory to the |
| classpath. For recent Zookeeper versions, you only need to add the Zookeeper jar, and not |
| what is in the Zookeeper lib directory. You can run the following command on a |
| configured Accumulo system to see what its using for its classpath. |
| |
| $ACCUMULO_HOME/bin/accumulo classpath |
| |
| Another option for running your code is to put a jar file in |
| +$ACCUMULO_HOME/lib/ext+. After doing this you can use the accumulo |
| script to execute your code. For example if you create a jar containing the |
| class +com.foo.Client+ and placed that in +lib/ext+, then you could use the command |
| +$ACCUMULO_HOME/bin/accumulo com.foo.Client+ to execute your code. |
| |
| If you are writing map reduce job that access Accumulo, then you can use the |
| bin/tool.sh script to run those jobs. See the map reduce example. |
| |
| === Connecting |
| |
| All clients must first identify the Accumulo instance to which they will be |
| communicating. Code to do this is as follows: |
| |
| [source,java] |
| ---- |
| String instanceName = "myinstance"; |
| String zooServers = "zooserver-one,zooserver-two" |
| Instance inst = new ZooKeeperInstance(instanceName, zooServers); |
| |
| Connector conn = inst.getConnector("user", new PasswordToken("passwd")); |
| ---- |
| |
| The PasswordToken is the most common implementation of an `AuthenticationToken`. |
| This general interface allow authentication as an Accumulo user to come from |
| a variety of sources or means. The CredentialProviderToken leverages the Hadoop |
| CredentialProviders (new in Hadoop 2.6). |
| |
| For example, the CredentialProviderToken can be used in conjunction with a Java |
| KeyStore to alleviate passwords stored in cleartext. When stored in HDFS, a single |
| KeyStore can be used across an entire instance. Be aware that KeyStores stored on |
| the local filesystem must be made available to all nodes in the Accumulo cluster. |
| |
| [source,java] |
| ---- |
| KerberosToken token = new KerberosToken(); |
| Connector conn = inst.getConnector(token.getPrincipal(), token); |
| ---- |
| |
| The KerberosToken can be provided to use the authentication provided by Kerberos. |
| Using Kerberos requires external setup and additional configuration, but provides |
| a single point of authentication through HDFS, YARN and ZooKeeper and allowing |
| for password-less authentication with Accumulo. |
| |
| === Writing Data |
| |
| Data are written to Accumulo by creating Mutation objects that represent all the |
| changes to the columns of a single row. The changes are made atomically in the |
| TabletServer. Clients then add Mutations to a BatchWriter which submits them to |
| the appropriate TabletServers. |
| |
| Mutations can be created thus: |
| |
| [source,java] |
| ---- |
| Text rowID = new Text("row1"); |
| Text colFam = new Text("myColFam"); |
| Text colQual = new Text("myColQual"); |
| ColumnVisibility colVis = new ColumnVisibility("public"); |
| long timestamp = System.currentTimeMillis(); |
| |
| Value value = new Value("myValue".getBytes()); |
| |
| Mutation mutation = new Mutation(rowID); |
| mutation.put(colFam, colQual, colVis, timestamp, value); |
| ---- |
| |
| ==== BatchWriter |
| The BatchWriter is highly optimized to send Mutations to multiple TabletServers |
| and automatically batches Mutations destined for the same TabletServer to |
| amortize network overhead. Care must be taken to avoid changing the contents of |
| any Object passed to the BatchWriter since it keeps objects in memory while |
| batching. |
| |
| Mutations are added to a BatchWriter thus: |
| |
| [source,java] |
| ---- |
| // BatchWriterConfig has reasonable defaults |
| BatchWriterConfig config = new BatchWriterConfig(); |
| config.setMaxMemory(10000000L); // bytes available to batchwriter for buffering mutations |
| |
| BatchWriter writer = conn.createBatchWriter("table", config) |
| |
| writer.addMutation(mutation); |
| |
| writer.close(); |
| ---- |
| |
| An example of using the batch writer can be found at |
| +accumulo/docs/examples/README.batch+. |
| |
| ==== ConditionalWriter |
| The ConditionalWriter enables efficient, atomic read-modify-write operations on |
| rows. The ConditionalWriter writes special Mutations which have a list of per |
| column conditions that must all be met before the mutation is applied. The |
| conditions are checked in the tablet server while a row lock is |
| held (Mutations written by the BatchWriter will not obtain a row |
| lock). The conditions that can be checked for a column are equality and |
| absence. For example a conditional mutation can require that column A is |
| absent inorder to be applied. Iterators can be applied when checking |
| conditions. Using iterators, many other operations besides equality and |
| absence can be checked. For example, using an iterator that converts values |
| less than 5 to 0 and everything else to 1, its possible to only apply a |
| mutation when a column is less than 5. |
| |
| In the case when a tablet server dies after a client sent a conditional |
| mutation, its not known if the mutation was applied or not. When this happens |
| the ConditionalWriter reports a status of UNKNOWN for the ConditionalMutation. |
| In many cases this situation can be dealt with by simply reading the row again |
| and possibly sending another conditional mutation. If this is not sufficient, |
| then a higher level of abstraction can be built by storing transactional |
| information within a row. |
| |
| An example of using the batch writer can be found at |
| +accumulo/docs/examples/README.reservations+. |
| |
| ==== Durability |
| |
| By default, Accumulo writes out any updates to the Write-Ahead Log (WAL). Every change |
| goes into a file in HDFS and is sync'd to disk for maximum durability. In |
| the event of a failure, writes held in memory are replayed from the WAL. Like |
| all files in HDFS, this file is also replicated. Sending updates to the |
| replicas, and waiting for a permanent sync to disk can significantly write speeds. |
| |
| Accumulo allows users to use less tolerant forms of durability when writing. |
| These levels are: |
| |
| * none: no durability guarantees are made, the WAL is not used |
| * log: the WAL is used, but not flushed; loss of the server probably means recent writes are lost |
| * flush: updates are written to the WAL, and flushed out to replicas; loss of a single server is unlikely to result in data loss. |
| * sync: updates are written to the WAL, and synced to disk on all replicas before the write is acknowledge. Data will not be lost even if the entire cluster suddenly loses power. |
| |
| The user can set the default durability of a table in the shell. When |
| writing, the user can configure the BatchWriter or ConditionalWriter to use |
| a different level of durability for the session. This will override the |
| default durability setting. |
| |
| [source,java] |
| ---- |
| BatchWriterConfig cfg = new BatchWriterConfig(); |
| // We don't care about data loss with these writes: |
| // This is DANGEROUS: |
| cfg.setDurability(Durability.NONE); |
| |
| Connection conn = ... ; |
| BatchWriter bw = conn.createBatchWriter(table, cfg); |
| |
| ---- |
| |
| === Reading Data |
| |
| Accumulo is optimized to quickly retrieve the value associated with a given key, and |
| to efficiently return ranges of consecutive keys and their associated values. |
| |
| ==== Scanner |
| |
| To retrieve data, Clients use a Scanner, which acts like an Iterator over |
| keys and values. Scanners can be configured to start and stop at particular keys, and |
| to return a subset of the columns available. |
| |
| [source,java] |
| ---- |
| // specify which visibilities we are allowed to see |
| Authorizations auths = new Authorizations("public"); |
| |
| Scanner scan = |
| conn.createScanner("table", auths); |
| |
| scan.setRange(new Range("harry","john")); |
| scan.fetchColumnFamily(new Text("attributes")); |
| |
| for(Entry<Key,Value> entry : scan) { |
| Text row = entry.getKey().getRow(); |
| Value value = entry.getValue(); |
| } |
| ---- |
| |
| ==== Isolated Scanner |
| |
| Accumulo supports the ability to present an isolated view of rows when |
| scanning. There are three possible ways that a row could change in Accumulo : |
| |
| * a mutation applied to a table |
| * iterators executed as part of a minor or major compaction |
| * bulk import of new files |
| |
| Isolation guarantees that either all or none of the changes made by these |
| operations on a row are seen. Use the IsolatedScanner to obtain an isolated |
| view of an Accumulo table. When using the regular scanner it is possible to see |
| a non isolated view of a row. For example if a mutation modifies three |
| columns, it is possible that you will only see two of those modifications. |
| With the isolated scanner either all three of the changes are seen or none. |
| |
| The IsolatedScanner buffers rows on the client side so a large row will not |
| crash a tablet server. By default rows are buffered in memory, but the user |
| can easily supply their own buffer if they wish to buffer to disk when rows are |
| large. |
| |
| For an example, look at the following |
| |
| examples/simple/src/main/java/org/apache/accumulo/examples/simple/isolation/InterferenceTest.java |
| |
| ==== BatchScanner |
| |
| For some types of access, it is more efficient to retrieve several ranges |
| simultaneously. This arises when accessing a set of rows that are not consecutive |
| whose IDs have been retrieved from a secondary index, for example. |
| |
| The BatchScanner is configured similarly to the Scanner; it can be configured to |
| retrieve a subset of the columns available, but rather than passing a single Range, |
| BatchScanners accept a set of Ranges. It is important to note that the keys returned |
| by a BatchScanner are not in sorted order since the keys streamed are from multiple |
| TabletServers in parallel. |
| |
| [source,java] |
| ---- |
| ArrayList<Range> ranges = new ArrayList<Range>(); |
| // populate list of ranges ... |
| |
| BatchScanner bscan = |
| conn.createBatchScanner("table", auths, 10); |
| bscan.setRanges(ranges); |
| bscan.fetchColumnFamily("attributes"); |
| |
| for(Entry<Key,Value> entry : bscan) { |
| System.out.println(entry.getValue()); |
| } |
| ---- |
| |
| An example of the BatchScanner can be found at |
| +accumulo/docs/examples/README.batch+. |
| |
| === Proxy |
| |
| The proxy API allows the interaction with Accumulo with languages other than Java. |
| A proxy server is provided in the codebase and a client can further be generated. |
| The proxy API can also be used instead of the traditional ZooKeeperInstance class to |
| provide a single TCP port in which clients can be securely routed through a firewall, |
| without requiring access to all tablet servers in the cluster. |
| |
| ==== Prerequisites |
| |
| The proxy server can live on any node in which the basic client API would work. That |
| means it must be able to communicate with the Master, ZooKeepers, NameNode, and the |
| DataNodes. A proxy client only needs the ability to communicate with the proxy server. |
| |
| |
| ==== Configuration |
| |
| The configuration options for the proxy server live inside of a properties file. At |
| the very least, you need to supply the following properties: |
| |
| protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory |
| tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken |
| port=42424 |
| instance=test |
| zookeepers=localhost:2181 |
| |
| You can find a sample configuration file in your distribution: |
| |
| $ACCUMULO_HOME/proxy/proxy.properties. |
| |
| This sample configuration file further demonstrates an ability to back the proxy server |
| by MockAccumulo or the MiniAccumuloCluster. |
| |
| ==== Running the Proxy Server |
| |
| After the properties file holding the configuration is created, the proxy server |
| can be started using the following command in the Accumulo distribution (assuming |
| your properties file is named +config.properties+): |
| |
| $ACCUMULO_HOME/bin/accumulo proxy -p config.properties |
| |
| ==== Creating a Proxy Client |
| |
| Aside from installing the Thrift compiler, you will also need the language-specific library |
| for Thrift installed to generate client code in that language. Typically, your operating |
| system's package manager will be able to automatically install these for you in an expected |
| location such as +/usr/lib/python/site-packages/thrift+. |
| |
| You can find the thrift file for generating the client: |
| |
| $ACCUMULO_HOME/proxy/proxy.thrift. |
| |
| After a client is generated, the port specified in the configuration properties above will be |
| used to connect to the server. |
| |
| ==== Using a Proxy Client |
| |
| The following examples have been written in Java and the method signatures may be |
| slightly different depending on the language specified when generating client with |
| the Thrift compiler. After initiating a connection to the Proxy (see Apache Thrift's |
| documentation for examples of connecting to a Thrift service), the methods on the |
| proxy client will be available. The first thing to do is log in: |
| |
| [source,java] |
| Map password = new HashMap<String,String>(); |
| password.put("password", "secret"); |
| ByteBuffer token = client.login("root", password); |
| |
| Once logged in, the token returned will be used for most subsequent calls to the client. |
| Let's create a table, add some data, scan the table, and delete it. |
| |
| |
| First, create a table. |
| |
| [source,java] |
| client.createTable(token, "myTable", true, TimeType.MILLIS); |
| |
| |
| Next, add some data: |
| |
| [source,java] |
| ---- |
| // first, create a writer on the server |
| String writer = client.createWriter(token, "myTable", new WriterOptions()); |
| |
| //rowid |
| ByteBuffer rowid = ByteBuffer.wrap("UUID".getBytes()); |
| |
| //mutation like class |
| ColumnUpdate cu = new ColumnUpdate(); |
| cu.setColFamily("MyFamily".getBytes()); |
| cu.setColQualifier("MyQualifier".getBytes()); |
| cu.setColVisibility("VisLabel".getBytes()); |
| cu.setValue("Some Value.".getBytes()); |
| |
| List<ColumnUpdate> updates = new ArrayList<ColumnUpdate>(); |
| updates.add(cu); |
| |
| // build column updates |
| Map<ByteBuffer, List<ColumnUpdate>> cellsToUpdate = new HashMap<ByteBuffer, List<ColumnUpdate>>(); |
| cellsToUpdate.put(rowid, updates); |
| |
| // send updates to the server |
| client.updateAndFlush(writer, "myTable", cellsToUpdate); |
| |
| client.closeWriter(writer); |
| ---- |
| |
| |
| Scan for the data and batch the return of the results on the server: |
| |
| [source,java] |
| ---- |
| String scanner = client.createScanner(token, "myTable", new ScanOptions()); |
| ScanResult results = client.nextK(scanner, 100); |
| |
| for(KeyValue keyValue : results.getResultsIterator()) { |
| // do something with results |
| } |
| |
| client.closeScanner(scanner); |
| ---- |