The Apache Rya rya.mapreduce
project contains a set of classes facilitating the use of Accumulo-backed Rya as the input source or output destination of Hadoop MapReduce jobs.
RyaStatementWritable wraps a statement in a WritableComparable object, so triples can be used as keys or values in MapReduce tasks. Statements are considered equal if they contain equivalent triples and equivalent Accumulo metadata (visibility, timestamp, etc.).
Input formats are provided for reading triple data from Rya or from RDF files:
RdfFileInputFormat will read and parse RDF files of any format. Format must be explicitly specified. Reading and parsing is done asynchronously, enabling large input files depending on how much information the openrdf parser itself needs to hold in memory in order to parse the file. (For example, large N-Triples files can be handled easily, but large XML files might require you to allocate more memory for the Map task.) Handles multiple files if given a directory as input, as long as all files are the specified format. Files will only be split if the format is set to N-Triples or N-Quads; otherwise, the number of input files will be the number of splits. Output pairs are <LongWritable, RyaStatementWritable>
, where the former is the number of the statement within the input split and the latter is the statement itself.
RyaInputFormat will read statements directly from a Rya table in Accumulo. Extends Accumulo‘s AbstractInputFormat and uses that class’s configuration methods to configure the connection to Accumulo. The table scanned should be one of the Rya core tables (spo, po, or osp), and whichever is used should be specified using RyaInputFormat.setTableLayout
, so the input format can deserialize the statements correctly. Choice of table may influence parallelization if the tables are split differently in Accumulo. (The number of splits in Accumulo will be the number of input splits in Hadoop and therefore the number of Mappers.) Output pairs are <Text, RyaStatementWritable>
, where the former is the Accumulo row ID and the latter is the statement itself.
An output format is provided for writing triple data to Rya:
<Writable, RyaStatementWritable>
. Keys are ignored and values are written to Rya.MRUtils defines constant configuration parameter names used for passing Accumulo connection information, Rya prefix and table layout, RDF format, etc., as well as some convenience methods for getting and setting these values with respect to a given Configuration.
AbstractAccumuloMRTool can be used as a base class for Rya MapReduce Tools using the ToolRunner API. It extracts necessary parameters from the configuration and provides methods for setting input and/or output formats and configuring them accordingly. To use, extend this class and implement run
. In the run method, call init
to extract and validate configuration values from the Hadoop Configuration. Then use setup*(Input/Output)
methods as needed to configure input and output for MapReduce jobs using the stored parameters. (Input and output formats can then be configured directly, if necessary.)
Expects parameters to be specified in the configuration using the names defined in MRUtils, or for secondary indexers, the names in org.apache.rya.indexing.accumulo.ConfigUtils
.
See the examples
subpackage for examples of how to use the interface, and the tools
subpackage for some individual MapReduce applications.