layout: docs
title: Using Map Storage
permalink: /docs/map-storage/

Connection

The AccumuloStorage class reads data from an Accumulo table. The string argument to the LOAD command is a URI that carries the Accumulo connection information and any query options. The URI scheme is always “accumulo” and the path is the name of the Accumulo table to read from. The query string supplies the connection and query options. The connection options below are the same whether AccumuloStorage is used for reading or writing.

  • instance - The Accumulo instance name
  • user - The Accumulo user
  • password - The password for the Accumulo user
  • zookeepers - A comma-separated list of ZooKeeper hosts for the Accumulo instance
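As a rough illustration of how these options fit into the URI, the following Python sketch assembles one from its parts. The helper name `build_accumulo_uri`, the table name `flights`, and the host and credential values are all hypothetical. Whether AccumuloStorage decodes percent-encoded values is not stated here, so the sketch keeps commas and colons literal.

```python
from urllib.parse import urlencode


def build_accumulo_uri(table, instance, user, password, zookeepers, **options):
    """Assemble an accumulo:// URI (hypothetical helper, not part of AccumuloStorage)."""
    params = {
        "instance": instance,
        "user": user,
        "password": password,
        # ZooKeeper hosts are passed as a single comma-separated value.
        "zookeepers": ",".join(zookeepers),
    }
    # Optional read or write options, e.g. begin="a" or write_threads=4.
    params.update(options)
    # Keep commas and colons literal so list-valued options stay readable.
    query = urlencode(params, safe=",:")
    return f"accumulo://{table}?{query}"


uri = build_accumulo_uri("flights", "accumulo", "root", "secret", ["zk1", "zk2"])
print(uri)
# accumulo://flights?instance=accumulo&user=root&password=secret&zookeepers=zk1,zk2
```

The resulting string is what would be passed as the argument to LOAD or STORE.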

Reading

Some basic Accumulo read parameters are exposed for use. All of the following are optional.

  • fetch_columns - A comma-separated list of column families, each optionally followed by a colon and a column qualifier, e.g. foo:bar,column1,column5. Default: all columns.
  • begin - The row to begin scanning from. Default: beginning of the table (null).
  • end - The row to stop scanning at. Default: end of the table (null).
  • auths - A comma-separated list of Authorizations to use for the provided user. Default: all authorizations the user has.
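To make the fetch_columns format concrete, here is a small Python sketch that splits such a value into family and qualifier pairs. The function `parse_fetch_columns` is a hypothetical helper for illustration and is not AccumuloStorage's own parser.

```python
def parse_fetch_columns(spec):
    """Split a fetch_columns value into (family, qualifier) pairs.

    Illustrative only: a missing qualifier is represented as None.
    """
    pairs = []
    for element in spec.split(","):
        family, sep, qualifier = element.partition(":")
        pairs.append((family, qualifier if sep else None))
    return pairs


print(parse_fetch_columns("foo:bar,column1,column5"))
# [('foo', 'bar'), ('column1', None), ('column5', None)]
```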

AccumuloStorage returns data in the following schema.

Each key in the map is a column (family and qualifier) within the provided rowkey, and each value is the Accumulo value for that rowkey and column. By default, the map key has a colon separator between the column family and the column qualifier. A boolean argument can be provided to the AccumuloStorage constructor. If this boolean is true, the map key is composed of only the column qualifier, and each map collects the columns of one column family within the row. For example:

By default, this will generate the following tuple:

If the previously mentioned boolean argument is provided as true, the following will be generated instead:
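The two shapes can be modelled with a short Python sketch. This is not AccumuloStorage's code: the function `row_to_tuple`, the sample row, and the ordering of the per-family maps are assumptions made for illustration.

```python
from collections import defaultdict


def row_to_tuple(rowkey, entries, qualifier_only=False):
    """Model the tuple built for one row.

    entries is an iterable of (family, qualifier, value) triples.
    """
    if not qualifier_only:
        # Default: a single map keyed by "family:qualifier".
        return (rowkey, {f"{cf}:{cq}": value for cf, cq, value in entries})
    # Boolean set to true: one map per column family, keyed by qualifier only.
    by_family = defaultdict(dict)
    for cf, cq, value in entries:
        by_family[cf][cq] = value
    # Sorting the families is an assumption of this sketch.
    return (rowkey, *(by_family[cf] for cf in sorted(by_family)))


# Hypothetical row used only for this example.
entries = [
    ("name", "first", "Ada"),
    ("name", "last", "Lovelace"),
    ("info", "born", "1815"),
]

default_shape = row_to_tuple("row1", entries)
per_family_shape = row_to_tuple("row1", entries, qualifier_only=True)
print(default_shape)
print(per_family_shape)
```

In the default shape the tuple has two elements, the rowkey and one map. In the per-family shape it has one map for each column family found in the row.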

Writing

Some basic Accumulo write parameters are exposed for use. Like read operations, all of the following are optional.

  • write_buffer_size - The size, in bytes, to buffer Mutations before sending to an Accumulo server. Default: 10,000,000 (10MB).
  • write_threads - The number of threads to use when sending Mutations to Accumulo servers. Default: 10.
  • write_latency_ms - The number of milliseconds to wait before forcibly flushing Mutations to Accumulo. Default: 10,000 (10 seconds).
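As a sketch of how these options and their defaults combine, the Python below reads the write options from a URI query string and falls back to the defaults listed above. The function `write_options` and the table name `people` are hypothetical, and the parsing is a model rather than AccumuloStorage's own logic.

```python
from urllib.parse import parse_qsl, urlsplit

# Defaults as listed above.
WRITE_DEFAULTS = {
    "write_buffer_size": 10_000_000,  # bytes
    "write_threads": 10,
    "write_latency_ms": 10_000,  # milliseconds
}


def write_options(uri):
    """Return the effective write options for a URI (illustrative model)."""
    query = dict(parse_qsl(urlsplit(uri).query))
    return {
        name: int(query.get(name, default))
        for name, default in WRITE_DEFAULTS.items()
    }


opts = write_options("accumulo://people?write_threads=4&write_latency_ms=500")
print(opts)
# {'write_buffer_size': 10000000, 'write_threads': 4, 'write_latency_ms': 500}
```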

The AccumuloStorage class can accept data in a few different formats. A String argument may be provided to the AccumuloStorage constructor to supply a column mapping, which is a comma-separated list (described in the sections below). In every case, the first entry in the Tuple is treated as the rowkey and must be castable to a chararray. For each of elements 1 through N in the Tuple, one of the following two cases applies.

Data as map

When the Tuple entry is a map, we can naturally treat it as a column-to-value mapping, placing the map key in the column family and the map value in the Accumulo value. When the column mapping provided to the AccumuloStorage constructor has a non-empty value for this entry, the _N_th value in that CSV is used as the column family for the _N_th entry in the Tuple, and the map key is placed in the column qualifier instead.

If the entry from the column mapping contains a colon, AccumuloStorage splits the entry on the colon. The characters before the colon are used as the column family, and the characters following the colon are placed in the column qualifier with the map key appended to them.

Concretely, let's say we have the following Tuple:

With an empty (or null) column mapping, AccumuloStorage will generate the following Key-Value pairs:

With a column mapping of “information”:

And, with a column mapping of “information:person_”:
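The three cases can be modelled with the Python sketch below. It follows the rules described in this section but is not AccumuloStorage's implementation. The function `map_to_key_values`, the rowkey `row1`, and the sample map are hypothetical, and each result is written as a (row, family, qualifier, value) tuple.

```python
def map_to_key_values(rowkey, data, mapping_entry=None):
    """Model how one map entry of a Tuple becomes Accumulo Key-Value pairs."""
    # Split "family:prefix" on the first colon; both parts may be empty.
    family, _, prefix = (mapping_entry or "").partition(":")
    pairs = []
    for key, value in data.items():
        if family:
            # Mapping present: family from the mapping, map key in the qualifier.
            pairs.append((rowkey, family, prefix + key, value))
        else:
            # No mapping: the map key itself is the column family.
            pairs.append((rowkey, key, "", value))
    return pairs


# Hypothetical data used only for this example.
person = {"name": "Alice", "age": "30"}

no_mapping = map_to_key_values("row1", person)
family_only = map_to_key_values("row1", person, "information")
family_and_prefix = map_to_key_values("row1", person, "information:person_")

print(no_mapping)         # [('row1', 'name', '', 'Alice'), ('row1', 'age', '', '30')]
print(family_only)        # [('row1', 'information', 'name', 'Alice'), ('row1', 'information', 'age', '30')]
print(family_and_prefix)  # [('row1', 'information', 'person_name', 'Alice'), ('row1', 'information', 'person_age', '30')]
```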

Data as fields

When an entry in the Tuple is not a map, we require a non-empty column mapping to use in coordination with the current field. The same colon-delimiter logic that was described when handling a Map in a tuple applies with other fields.

With an empty (or null) column mapping, AccumuloStorage will generate the following Key-Value pairs:

With a column mapping of “information”:

And, with a column mapping of “information:person_”:
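For a non-map field, the same colon logic can be sketched in Python as follows. The function `field_to_key_value` and the sample values are hypothetical. Raising an error for an empty mapping is an assumption based on the stated requirement for a non-empty column mapping, not confirmed behavior of AccumuloStorage.

```python
def field_to_key_value(rowkey, value, mapping_entry):
    """Model how one non-map field of a Tuple becomes a Key-Value pair."""
    if not mapping_entry:
        # Assumption: the requirement for a non-empty mapping is modelled as an error.
        raise ValueError("a non-map field needs a non-empty column mapping entry")
    # Text before the colon is the family; text after it (if any) is the qualifier.
    family, _, qualifier = mapping_entry.partition(":")
    return (rowkey, family, qualifier, value)


family_only_field = field_to_key_value("row1", "Alice", "information")
family_and_qualifier_field = field_to_key_value("row1", "Alice", "information:person_")

print(family_only_field)           # ('row1', 'information', '', 'Alice')
print(family_and_qualifier_field)  # ('row1', 'information', 'person_', 'Alice')
```

Unlike the map case, there is no map key to append, so the text after the colon is used as the whole qualifier in this sketch.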

In addition to this row-with-columns approach, you can also read and write Accumulo data with Pig in terms of keys and values.