<!DOCTYPE html>
<html lang="en">
<head>
<title>Apache Jena - Apache Jena Elephas</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
<link href="/css/bootstrap-extension.css" rel="stylesheet" type="text/css">
<link href="/css/jena.css" rel="stylesheet" type="text/css">
<link rel="shortcut icon" href="/images/favicon.ico" />
<script src="https://code.jquery.com/jquery-2.2.4.min.js"
integrity="sha256-BbhdlvQf/xTY9gja0Dq3HiwQF8LaCRTXxZKRutelT44="
crossorigin="anonymous"></script>
<script src="/js/jena-navigation.js" type="text/javascript"></script>
<script src="/js/bootstrap.min.js" type="text/javascript"></script>
<script src="/js/improve.js" type="text/javascript"></script>
</head>
<body>
<nav class="navbar navbar-default" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-ex1-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/index.html">
<img class="logo-menu" src="/images/jena-logo/jena-logo-notext-small.png" alt="jena logo">Apache Jena</a>
</div>
<div class="collapse navbar-collapse navbar-ex1-collapse">
<ul class="nav navbar-nav">
<li id="homepage"><a href="/index.html"><span class="glyphicon glyphicon-home"></span> Home</a></li>
<li id="download"><a href="/download/index.cgi"><span class="glyphicon glyphicon-download-alt"></span> Download</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown"><span class="glyphicon glyphicon-book"></span> Learn <b class="caret"></b></a>
<ul class="dropdown-menu">
<li class="dropdown-header">Tutorials</li>
<li><a href="/tutorials/index.html">Overview</a></li>
<li><a href="/documentation/fuseki2/index.html">Fuseki Triplestore</a></li>
<li><a href="/documentation/notes/index.html">How-To's</a></li>
<li><a href="/documentation/query/manipulating_sparql_using_arq.html">Manipulating SPARQL using ARQ</a></li>
<li><a href="/tutorials/rdf_api.html">RDF core API tutorial</a></li>
<li><a href="/tutorials/sparql.html">SPARQL tutorial</a></li>
<li><a href="/tutorials/using_jena_with_eclipse.html">Using Jena with Eclipse</a></li>
<li class="divider"></li>
<li class="dropdown-header">References</li>
<li><a href="/documentation/index.html">Overview</a></li>
<li><a href="/documentation/query/index.html">ARQ (SPARQL)</a></li>
<li><a href="/documentation/assembler/index.html">Assembler</a></li>
<li><a href="/documentation/tools/index.html">Command-line tools</a></li>
<li><a href="/documentation/rdfs/">Data with RDFS Inferencing</a></li>
<li><a href="/documentation/geosparql/index.html">GeoSPARQL</a></li>
<li><a href="/documentation/inference/index.html">Inference API</a></li>
<li><a href="/documentation/javadoc.html">Javadoc</a></li>
<li><a href="/documentation/ontology/">Ontology API</a></li>
<li><a href="/documentation/permissions/index.html">Permissions</a></li>
<li><a href="/documentation/extras/querybuilder/index.html">Query Builder</a></li>
<li><a href="/documentation/rdf/index.html">RDF API</a></li>
<li><a href="/documentation/rdfconnection/">RDF Connection - SPARQL API</a></li>
<li><a href="/documentation/io/">RDF I/O</a></li>
<li><a href="/documentation/rdfstar/index.html">RDF-star</a></li>
<li><a href="/documentation/shacl/index.html">SHACL</a></li>
<li><a href="/documentation/shex/index.html">ShEx</a></li>
<li><a href="/documentation/jdbc/index.html">SPARQL over JDBC</a></li>
<li><a href="/documentation/tdb/index.html">TDB</a></li>
<li><a href="/documentation/tdb2/index.html">TDB2</a></li>
<li><a href="/documentation/query/text-query.html">Text Search</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown"><span class="glyphicon glyphicon-book"></span> Javadoc <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="/documentation/javadoc.html">All Javadoc</a></li>
<li><a href="/documentation/javadoc/arq/">ARQ</a></li>
<li><a href="/documentation/javadoc_elephas.html">Elephas</a></li>
<li><a href="/documentation/javadoc/fuseki2/">Fuseki</a></li>
<li><a href="/documentation/javadoc/geosparql/">GeoSPARQL</a></li>
<li><a href="/documentation/javadoc/jdbc/">JDBC</a></li>
<li><a href="/documentation/javadoc/jena/">Jena Core</a></li>
<li><a href="/documentation/javadoc/permissions/">Permissions</a></li>
<li><a href="/documentation/javadoc/extras/querybuilder/">Query Builder</a></li>
<li><a href="/documentation/javadoc/shacl/">SHACL</a></li>
<li><a href="/documentation/javadoc/tdb/">TDB</a></li>
<li><a href="/documentation/javadoc/text/">Text Search</a></li>
</ul>
</li>
<li id="ask"><a href="/help_and_support/index.html"><span class="glyphicon glyphicon-question-sign"></span> Ask</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown"><span class="glyphicon glyphicon-bullhorn"></span> Get involved <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="/getting_involved/index.html">Contribute</a></li>
<li><a href="/help_and_support/bugs_and_suggestions.html">Report a bug</a></li>
<li class="divider"></li>
<li class="dropdown-header">Project</li>
<li><a href="/about_jena/about.html">About Jena</a></li>
<li><a href="/about_jena/architecture.html">Architecture</a></li>
<li><a href="/about_jena/citing.html">Citing</a></li>
<li><a href="/about_jena/team.html">Project team</a></li>
<li><a href="/about_jena/contributions.html">Related projects</a></li>
<li><a href="/about_jena/roadmap.html">Roadmap</a></li>
<li class="divider"></li>
<li class="dropdown-header">ASF</li>
<li><a href="http://www.apache.org/">Apache Software Foundation</a></li>
<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
<li><a href="http://www.apache.org/security/">Security</a></li>
<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
</ul>
</li>
</ul>
</div>
</div>
</nav>
<div class="container">
<div class="row">
<div class="col-md-12">
<h1 class="title">Apache Jena Elephas</h1>
<p>Apache Jena Elephas is a set of libraries that provide the basic building blocks for writing Apache Hadoop applications that work with RDF data.</p>
<p>Historically there has been no serious support for RDF within the Hadoop ecosystem, and what support has existed has
often been limited and task-specific. These libraries aim to be as generic as possible and to provide the
infrastructure that lets developers write their application-specific logic without worrying about the
underlying plumbing.</p>
<h2 id="beta">Beta</h2>
<p>These modules are currently considered to be in a <strong>Beta</strong> state. They have been under active development for about a year but have not yet been widely deployed, and they may contain as-yet-undiscovered bugs.</p>
<p>Please see the <a href="../../help_and_support/bugs_and_suggestions.html">How to Report a Bug</a> page for how to report any bugs you may encounter.</p>
<h2 id="documentation">Documentation</h2>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#getting-started">Getting Started</a></li>
<li>APIs
<ul>
<li><a href="common.html">Common</a></li>
<li><a href="io.html">IO</a></li>
<li><a href="mapred.html">Map/Reduce</a></li>
<li><a href="../javadoc/elephas/">Javadoc</a></li>
</ul>
</li>
<li>Examples
<ul>
<li><a href="demo.html">RDF Stats Demo</a></li>
</ul>
</li>
<li><a href="artifacts.html">Maven Artifacts</a></li>
</ul>
<h2 id="overview">Overview</h2>
<p>Apache Jena Elephas is published as a set of Maven modules via its <a href="artifacts.html">Maven artifacts</a>. The source for these libraries
may be <a href="/download/index.cgi">downloaded</a> as part of the source distribution. These modules are built against the Hadoop 2.x APIs and no
backwards compatibility for 1.x is provided.</p>
<p>The core aim of these libraries is to provide the basic building blocks that allow users to start writing Hadoop applications that
work with RDF. They are mostly fairly low-level components, designed so that users and developers can
focus on actual application logic rather than on the low-level plumbing.</p>
<p>Firstly, at the lowest level, they provide <code>Writable</code> implementations that allow the basic RDF primitives (nodes, triples and quads)
to be represented and exchanged within Hadoop applications. This support is provided by the <a href="common.html">Common</a> library.</p>
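<p>To illustrate what a <code>Writable</code> has to do, the following is a minimal sketch in plain Java with no Hadoop or Jena dependency. The class <code>SimpleTripleWritable</code> and its helper methods are hypothetical stand-ins, not the Elephas types; it only mirrors the <code>write(DataOutput)</code> and <code>readFields(DataInput)</code> pair that Hadoop's <code>Writable</code> interface defines, and it treats nodes as plain strings.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable that carries one triple.
// It is NOT the Elephas TripleWritable; nodes are plain strings here.
final class SimpleTripleWritable {
    private String subject = "";
    private String predicate = "";
    private String object = "";

    // Hadoop instantiates Writables reflectively, so a no-arg constructor is conventional
    SimpleTripleWritable() { }

    SimpleTripleWritable(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    // Same shape as Writable.write(DataOutput): serialise the fields in a fixed order
    void write(DataOutput out) throws IOException {
        out.writeUTF(subject);
        out.writeUTF(predicate);
        out.writeUTF(object);
    }

    // Same shape as Writable.readFields(DataInput): read the fields back in the same order
    void readFields(DataInput in) throws IOException {
        subject = in.readUTF();
        predicate = in.readUTF();
        object = in.readUTF();
    }

    String getSubject() { return subject; }
    String getPredicate() { return predicate; }
    String getObject() { return object; }

    // Serialise to bytes, roughly what happens when a record is exchanged between tasks
    static byte[] toBytes(SimpleTripleWritable triple) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            triple.write(out);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild an instance from bytes by populating an empty object
    static SimpleTripleWritable fromBytes(byte[] bytes) {
        try {
            SimpleTripleWritable triple = new SimpleTripleWritable();
            triple.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return triple;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimpleTripleWritable original = new SimpleTripleWritable(
                "http://example.org/alice", "http://example.org/knows", "http://example.org/bob");
        SimpleTripleWritable copy = fromBytes(toBytes(original));
        System.out.println(copy.getSubject() + " " + copy.getPredicate() + " " + copy.getObject());
    }
}
```

<p>In a real job you would use the types from the Common library rather than writing your own; the sketch is only meant to show the round trip that any such type must support.</p>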
<p>Secondly, they provide support for all the RDF serialisations which Jena supports, as both input and output formats, subject to the specific
limitations of those serialisations. This support is provided by the <a href="io.html">IO</a> library in the form of standard <code>InputFormat</code> and
<code>OutputFormat</code> implementations.</p>
<p>There is also a set of basic <code>Mapper</code> and <code>Reducer</code> implementations, provided by the <a href="mapred.html">Map/Reduce</a> library, which
enable various common Hadoop tasks such as counting, filtering, splitting and grouping to be carried out on RDF data. Typically these
will be used as a starting point to build more complex RDF processing applications.</p>
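<p>As a rough illustration of what filtering and grouping mean for RDF data, here is a plain Java sketch that runs in memory without Hadoop or Jena. The class <code>FilterAndGroupSketch</code> and its methods are hypothetical, and each triple is assumed to be a three-element string array of subject, predicate and object; the actual library classes are described on the Map/Reduce page.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; not part of the Elephas Map/Reduce library.
final class FilterAndGroupSketch {

    // Filtering: keep only triples whose predicate (index 1) equals the given value
    static List<String[]> filterByPredicate(List<String[]> triples, String predicate) {
        List<String[]> kept = new ArrayList<>();
        for (String[] triple : triples) {
            if (triple[1].equals(predicate)) {
                kept.add(triple);
            }
        }
        return kept;
    }

    // Grouping: collect triples under their subject (index 0), in first-seen order
    static Map<String, List<String[]>> groupBySubject(List<String[]> triples) {
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] triple : triples) {
            groups.computeIfAbsent(triple[0], key -> new ArrayList<>()).add(triple);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
                new String[] { "ex:alice", "ex:knows", "ex:bob" },
                new String[] { "ex:alice", "ex:name", "\"Alice\"" },
                new String[] { "ex:bob", "ex:knows", "ex:carol" });

        System.out.println("knows triples: " + filterByPredicate(triples, "ex:knows").size());
        for (Map.Entry<String, List<String[]>> group : groupBySubject(triples).entrySet()) {
            System.out.println(group.getKey() + " has " + group.getValue().size() + " triple(s)");
        }
    }
}
```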
<p>Finally, there is an <a href="demo.html">RDF Stats Demo</a>, a runnable Hadoop job JAR file that demonstrates using these libraries to calculate
a number of basic statistics over arbitrary RDF data.</p>
<h2 id="getting-started">Getting Started</h2>
<p>To get started you will need to add the relevant dependencies to your project. The exact dependencies will depend
on what you are trying to do, but you will typically need at least the IO library and possibly the Map/Reduce library:</p>
<pre><code>&lt;dependency&gt;
  &lt;groupId&gt;org.apache.jena&lt;/groupId&gt;
  &lt;artifactId&gt;jena-elephas-io&lt;/artifactId&gt;
  &lt;version&gt;x.y.z&lt;/version&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.jena&lt;/groupId&gt;
  &lt;artifactId&gt;jena-elephas-mapreduce&lt;/artifactId&gt;
  &lt;version&gt;x.y.z&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<p>Our libraries depend on the relevant Hadoop libraries, but since those are typically provided by the Hadoop cluster, the dependencies are marked as <code>provided</code> and thus are not transitive. This means that you will typically also need to add the following dependencies:</p>
<pre><code>&lt;!-- Hadoop Dependencies --&gt;
&lt;!--
  Note these will be provided on the Hadoop cluster hence the provided
  scope
--&gt;
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.hadoop&lt;/groupId&gt;
  &lt;artifactId&gt;hadoop-common&lt;/artifactId&gt;
  &lt;version&gt;2.6.0&lt;/version&gt;
  &lt;scope&gt;provided&lt;/scope&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
  &lt;groupId&gt;org.apache.hadoop&lt;/groupId&gt;
  &lt;artifactId&gt;hadoop-mapreduce-client-common&lt;/artifactId&gt;
  &lt;version&gt;2.6.0&lt;/version&gt;
  &lt;scope&gt;provided&lt;/scope&gt;
&lt;/dependency&gt;
</code></pre>
<p>You can then write code to launch a Map/Reduce job that works with RDF. For example, let us consider an RDF variation of the classic Hadoop
word count example. In this example, which we call node count, we do the following:</p>
<ul>
<li>Take in some RDF triples</li>
<li>Split them up into their constituent nodes, i.e. the URIs, blank nodes and literals</li>
<li>Assign an initial count of one to each node</li>
<li>Group by node and sum up the counts</li>
<li>Output the nodes and their usage counts</li>
</ul>
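<p>Before looking at the Hadoop classes, the steps above can be sketched in plain Java to show the data flow. This is an in-memory illustration only: the class <code>NodeCountSketch</code> and its <code>map</code> and <code>reduce</code> methods are hypothetical names, triples are assumed to be three-element string arrays, and no Hadoop or Jena types are involved.</p>

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the node count data flow; not Elephas code.
final class NodeCountSketch {

    // "Map" step: split each triple into its nodes and pair every node with a count of one
    static List<Map.Entry<String, Long>> map(List<String[]> triples) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String[] triple : triples) {
            for (String node : triple) {
                pairs.add(new AbstractMap.SimpleEntry<>(node, 1L));
            }
        }
        return pairs;
    }

    // "Shuffle" and "reduce" steps combined: group by node and sum the counts
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
                new String[] { "ex:alice", "ex:knows", "ex:bob" },
                new String[] { "ex:bob", "ex:knows", "ex:carol" },
                new String[] { "ex:alice", "ex:name", "\"Alice\"" });

        // Output the nodes and their usage counts
        for (Map.Entry<String, Long> entry : reduce(map(triples)).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

<p>In a real job the grouping is done by the Hadoop shuffle between the mapper and the reducer rather than by a single in-memory map.</p>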
<p>We will start with our <code>Mapper</code> implementation. As you can see, this simply takes in a triple and splits it into its constituent nodes. It
then outputs each node with an initial count of 1:</p>
<pre><code>package org.apache.jena.hadoop.rdf.mapreduce.count;

import org.apache.jena.graph.Triple;
import org.apache.jena.hadoop.rdf.types.NodeWritable;
import org.apache.jena.hadoop.rdf.types.TripleWritable;

/**
 * A mapper for counting node usages within triples designed primarily for use
 * in conjunction with {@link NodeCountReducer}
 *
 * @param &lt;TKey&gt; Key type
 */
public class TripleNodeCountMapper&lt;TKey&gt; extends AbstractNodeTupleNodeCountMapper&lt;TKey, Triple, TripleWritable&gt; {

    @Override
    protected NodeWritable[] getNodes(TripleWritable tuple) {
        Triple t = tuple.get();
        // One NodeWritable per position: subject, predicate, object
        return new NodeWritable[] { new NodeWritable(t.getSubject()),
                                    new NodeWritable(t.getPredicate()),
                                    new NodeWritable(t.getObject()) };
    }
}
</code></pre>
<p>Next comes our <code>Reducer</code> implementation. This takes in the data grouped by node and sums up the counts, outputting the node and the final count:</p>
<pre><code>package org.apache.jena.hadoop.rdf.mapreduce.count;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.jena.hadoop.rdf.types.NodeWritable;

/**
 * A reducer which takes node keys with a sequence of longs representing counts
 * as the values and sums the counts together into pairs consisting of a node
 * key and a count value.
 */
public class NodeCountReducer extends Reducer&lt;NodeWritable, LongWritable, NodeWritable, LongWritable&gt; {

    @Override
    protected void reduce(NodeWritable key, Iterable&lt;LongWritable&gt; values, Context context)
            throws IOException, InterruptedException {
        // Sum every partial count emitted for this node
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
</code></pre>
<p>Finally, we need to define an actual Hadoop job that we can submit to run this. Here we take advantage of the <a href="io.html">IO</a> library to provide
us with support for our desired RDF input format:</p>
<pre><code>package org.apache.jena.hadoop.rdf.stats;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat;
import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesNodeOutputFormat;
import org.apache.jena.hadoop.rdf.mapreduce.count.NodeCountReducer;
import org.apache.jena.hadoop.rdf.mapreduce.count.TripleNodeCountMapper;
import org.apache.jena.hadoop.rdf.types.NodeWritable;

public class RdfMapReduceExample {

    public static void main(String[] args) {
        try {
            // Get Hadoop configuration
            Configuration config = new Configuration(true);

            // Create job
            Job job = Job.getInstance(config);
            job.setJarByClass(RdfMapReduceExample.class);
            job.setJobName(&quot;RDF Triples Node Usage Count&quot;);

            // Map/Reduce classes
            job.setMapperClass(TripleNodeCountMapper.class);
            job.setMapOutputKeyClass(NodeWritable.class);
            job.setMapOutputValueClass(LongWritable.class);
            job.setReducerClass(NodeCountReducer.class);

            // Input and Output
            job.setInputFormatClass(TriplesInputFormat.class);
            job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
            FileInputFormat.setInputPaths(job, new Path(&quot;/example/input/&quot;));
            FileOutputFormat.setOutputPath(job, new Path(&quot;/example/output/&quot;));

            // Launch the job and await completion
            job.submit();
            if (job.monitorAndPrintJob()) {
                // OK
                System.out.println(&quot;Completed&quot;);
            } else {
                // Failed
                System.err.println(&quot;Failed&quot;);
            }
        } catch (Throwable e) {
            e.printStackTrace();
        }
    }
}
</code></pre>
<p>So this really is no different from configuring any other Hadoop job: we simply have to point to the relevant input and output formats and provide our mapper and reducer. Note that here we use the <code>TriplesInputFormat</code>, which can handle RDF in any Jena-supported format. If you know your RDF is in a specific format, it is usually more efficient to use a more specific input format. Please see the <a href="io.html">IO</a> page for more detail on the available input formats and the differences between them.</p>
<p>We recommend that you next take a look at our <a href="demo.html">RDF Stats Demo</a> which shows how to do some more complex computations by chaining multiple jobs together.</p>
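<p>To give a feel for what chaining means, the following plain Java sketch feeds the output of one stage into the next. It is not taken from the demo: the class <code>ChainedJobsSketch</code>, its method names and the choice of a count histogram as the second stage are all assumptions made for illustration, and everything runs in memory rather than as Hadoop jobs.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of two chained stages; not the RDF Stats Demo code.
final class ChainedJobsSketch {

    // Stage 1: node -> number of times the node is used across all triples
    static Map<String, Long> countNodes(List<String[]> triples) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] triple : triples) {
            for (String node : triple) {
                counts.merge(node, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Stage 2: consumes the stage 1 output; usage count -> number of nodes with that count
    static Map<Long, Long> histogram(Map<String, Long> nodeCounts) {
        Map<Long, Long> histogram = new TreeMap<>();
        for (long count : nodeCounts.values()) {
            histogram.merge(count, 1L, Long::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
                new String[] { "ex:alice", "ex:knows", "ex:bob" },
                new String[] { "ex:bob", "ex:knows", "ex:carol" });

        // The second stage only starts once the first has produced its complete output
        Map<String, Long> stageOne = countNodes(triples);
        Map<Long, Long> stageTwo = histogram(stageOne);
        for (Map.Entry<Long, Long> entry : stageTwo.entrySet()) {
            System.out.println(entry.getValue() + " node(s) used " + entry.getKey() + " time(s)");
        }
    }
}
```

<p>With real Hadoop jobs the hand-off would typically go through files: the second job reads the output path written by the first.</p>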
<h2 id="apis">APIs</h2>
<p>There are three main libraries, each with its own API:</p>
<ul>
<li><a href="common.html">Common</a> - this provides the basic data model for representing RDF data within Hadoop</li>
<li><a href="io.html">IO</a> - this provides support for reading and writing RDF</li>
<li><a href="mapred.html">Map/Reduce</a> - this provides support for writing Map/Reduce jobs that work with RDF</li>
</ul>
</div>
</div>
</div>
<footer class="footer">
<div class="container" style="font-size:80%" >
<p>
Copyright &copy; 2011&ndash;2022 The Apache Software Foundation, Licensed under the
<a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
</p>
<p>
Apache Jena, Jena, the Apache Jena project logo, Apache and the Apache feather logos are trademarks of
The Apache Software Foundation.
<br/>
<a href="https://privacy.apache.org/policies/privacy-policy-public.html"
>Apache Software Foundation Privacy Policy</a>.
</p>
</div>
</footer>
<script type="text/javascript">
var link = $('a[href="' + this.location.pathname + '"]');
if (link != undefined)
link.parents('li,ul').addClass('active');
</script>
</body>
</html>