<!DOCTYPE html>
<html lang="en">
<head>
<title>Apache Jena - SDB Loading data</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
<link href="/css/bootstrap-extension.css" rel="stylesheet" type="text/css">
<link href="/css/jena.css" rel="stylesheet" type="text/css">
<link rel="shortcut icon" href="/images/favicon.ico" />
<script src="https://code.jquery.com/jquery-2.2.4.min.js"
integrity="sha256-BbhdlvQf/xTY9gja0Dq3HiwQF8LaCRTXxZKRutelT44="
crossorigin="anonymous"></script>
<script src="/js/jena-navigation.js" type="text/javascript"></script>
<script src="/js/bootstrap.min.js" type="text/javascript"></script>
<script src="/js/improve.js" type="text/javascript"></script>
</head>
<body>
<nav class="navbar navbar-default" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-ex1-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/index.html">
<img class="logo-menu" src="/images/jena-logo/jena-logo-notext-small.png" alt="jena logo">Apache Jena</a>
</div>
<div class="collapse navbar-collapse navbar-ex1-collapse">
<ul class="nav navbar-nav">
<li id="homepage"><a href="/index.html"><span class="glyphicon glyphicon-home"></span> Home</a></li>
<li id="download"><a href="/download/index.cgi"><span class="glyphicon glyphicon-download-alt"></span> Download</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown"><span class="glyphicon glyphicon-book"></span> Learn <b class="caret"></b></a>
<ul class="dropdown-menu">
<li class="dropdown-header">Tutorials</li>
<li><a href="/tutorials/index.html">Overview</a></li>
<li><a href="/documentation/fuseki2/index.html">Fuseki Triplestore</a></li>
<li><a href="/documentation/notes/index.html">How-To's</a></li>
<li><a href="/documentation/query/manipulating_sparql_using_arq.html">Manipulating SPARQL using ARQ</a></li>
<li><a href="/tutorials/rdf_api.html">RDF core API tutorial</a></li>
<li><a href="/tutorials/sparql.html">SPARQL tutorial</a></li>
<li><a href="/tutorials/using_jena_with_eclipse.html">Using Jena with Eclipse</a></li>
<li class="divider"></li>
<li class="dropdown-header">References</li>
<li><a href="/documentation/index.html">Overview</a></li>
<li><a href="/documentation/query/index.html">ARQ (SPARQL)</a></li>
<li><a href="/documentation/assembler/index.html">Assembler</a></li>
<li><a href="/documentation/tools/index.html">Command-line tools</a></li>
<li><a href="/documentation/rdfs/">Data with RDFS Inferencing</a></li>
<li><a href="/documentation/geosparql/index.html">GeoSPARQL</a></li>
<li><a href="/documentation/inference/index.html">Inference API</a></li>
<li><a href="/documentation/javadoc.html">Javadoc</a></li>
<li><a href="/documentation/ontology/">Ontology API</a></li>
<li><a href="/documentation/permissions/index.html">Permissions</a></li>
<li><a href="/documentation/extras/querybuilder/index.html">Query Builder</a></li>
<li><a href="/documentation/rdf/index.html">RDF API</a></li>
<li><a href="/documentation/rdfconnection/">RDF Connection - SPARQL API</a></li>
<li><a href="/documentation/io/">RDF I/O</a></li>
<li><a href="/documentation/rdfstar/index.html">RDF-star</a></li>
<li><a href="/documentation/shacl/index.html">SHACL</a></li>
<li><a href="/documentation/shex/index.html">ShEx</a></li>
<li><a href="/documentation/jdbc/index.html">SPARQL over JDBC</a></li>
<li><a href="/documentation/tdb/index.html">TDB</a></li>
<li><a href="/documentation/tdb2/index.html">TDB2</a></li>
<li><a href="/documentation/query/text-query.html">Text Search</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown"><span class="glyphicon glyphicon-book"></span> Javadoc <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="/documentation/javadoc.html">All Javadoc</a></li>
<li><a href="/documentation/javadoc/arq/">ARQ</a></li>
<li><a href="/documentation/javadoc_elephas.html">Elephas</a></li>
<li><a href="/documentation/javadoc/fuseki2/">Fuseki</a></li>
<li><a href="/documentation/javadoc/geosparql/">GeoSPARQL</a></li>
<li><a href="/documentation/javadoc/jdbc/">JDBC</a></li>
<li><a href="/documentation/javadoc/jena/">Jena Core</a></li>
<li><a href="/documentation/javadoc/permissions/">Permissions</a></li>
<li><a href="/documentation/javadoc/extras/querybuilder/">Query Builder</a></li>
<li><a href="/documentation/javadoc/shacl/">SHACL</a></li>
<li><a href="/documentation/javadoc/tdb/">TDB</a></li>
<li><a href="/documentation/javadoc/text/">Text Search</a></li>
</ul>
</li>
<li id="ask"><a href="/help_and_support/index.html"><span class="glyphicon glyphicon-question-sign"></span> Ask</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown"><span class="glyphicon glyphicon-bullhorn"></span> Get involved <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="/getting_involved/index.html">Contribute</a></li>
<li><a href="/help_and_support/bugs_and_suggestions.html">Report a bug</a></li>
<li class="divider"></li>
<li class="dropdown-header">Project</li>
<li><a href="/about_jena/about.html">About Jena</a></li>
<li><a href="/about_jena/architecture.html">Architecture</a></li>
<li><a href="/about_jena/citing.html">Citing</a></li>
<li><a href="/about_jena/team.html">Project team</a></li>
<li><a href="/about_jena/contributions.html">Related projects</a></li>
<li><a href="/about_jena/roadmap.html">Roadmap</a></li>
<li class="divider"></li>
<li class="dropdown-header">ASF</li>
<li><a href="http://www.apache.org/">Apache Software Foundation</a></li>
<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
<li><a href="http://www.apache.org/security/">Security</a></li>
<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
</ul>
</li>
</ul>
</div>
</div>
</nav>
<div class="container">
<div class="row">
<div class="col-md-12">
<h1 class="title">SDB Loading data</h1>
<p>There are three ways to load data into SDB:</p>
<ol>
<li>Use the command-line utility
<a href="commands.html#Loading_data" title="SDB/Commands">sdbload</a></li>
<li>Use one of the Jena <code>Model.read</code> operations</li>
<li>Use the Jena <code>Model.add</code> operations</li>
</ol>
<p>The last one of these requires the application to signal the
beginning and end of batches.</p>
<h2 id="loading-with-modelread">Loading with <code>Model.read</code></h2>
<p>A Jena Model obtained from SDB via:</p>
<pre><code>SDBFactory.connectModel(store)
</code></pre>
<p>will automatically bulk load data for each call of one of the
<code>Model.read</code> operations.</p>
<h2 id="loading-with-modeladd">Loading with <code>Model.add</code></h2>
<p>The <code>Model.add</code> operations, in any form or combination of forms,
whether loading a single statement, a list of statements, or another
model, will invoke the bulk loader only if the model has been notified
of the start of a bulk operation before the add.</p>
<p>To delimit bulk operations explicitly, notify the model at the start and at the end:</p>
<pre><code>model.notifyEvent(GraphEvents.startRead);   // signal the start of the batch
... do add/remove operations ...
model.notifyEvent(GraphEvents.finishRead);  // signal the end of the batch
</code></pre>
<p><strong>Failing to notify the end of the operations will result in data loss</strong>.</p>
<p>A try/finally block can ensure that the finish is notified.</p>
<pre><code>model.notifyEvent(GraphEvents.startRead);
try {
    ... do add/remove operations ...
} finally {
    model.notifyEvent(GraphEvents.finishRead);
}
</code></pre>
<p>The <code>model.read</code> operations do this automatically.</p>
<p>The bulk loader will automatically chunk large sequences of
additions to sizes appropriate to the underlying database. The bulk
loader is threaded and double-buffered: loading to the database
happens in parallel with the application thread and any RDF parsing.</p>
<h2 id="how-the-loader-works">How the loader works</h2>
<p>Loading consists of two phases: one in the Java VM, and one on the database
itself. The SDB loader takes incoming triples and breaks them down
into components ready for the database. These prepared triples are
added to a queue for the database phase, which (by default) takes
place on a separate thread. When the number of queued triples reaches a
limit (default 20,000), or the end of the update is signalled, the triples
are passed to the database.</p>
<p>You can configure whether to use threading and the &lsquo;chunk size&rsquo; &ndash;
the number of triples per load event &ndash; via <code>StoreLoader</code>.</p>
<pre><code>Store store; // SDB Store
...
store.getLoader().setChunkSize(5000);      // Triples per load event
store.getLoader().setUseThreading(false);  // Don't thread
</code></pre>
<p>You should set these <em>before</em> the loader has been used.</p>
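<p>The hand-off described above can be illustrated with a small, self-contained sketch. It is not SDB's loader: the class <code>ChunkedLoaderSketch</code>, its <code>load</code> method and the use of a <code>SynchronousQueue</code> are assumptions made for illustration, and the "database work" is reduced to recording each chunk's size. The application thread fills one buffer while a writer thread handles the previous one.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.SynchronousQueue;

/** Illustrative only (hypothetical class, not part of SDB). */
final class ChunkedLoaderSketch {
    // Sentinel chunk telling the writer thread to stop; compared by reference.
    private static final List<String> STOP = new ArrayList<>();

    /**
     * Buffers the input into chunks of at most chunkSize items, hands each
     * chunk to a writer thread, and returns the chunk sizes in written order.
     */
    static List<Integer> load(List<String> triples, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // A SynchronousQueue holds nothing itself: the producer fills one
        // buffer while the writer works on the other (double buffering).
        SynchronousQueue<List<String>> handOff = new SynchronousQueue<>();
        List<Integer> written = new ArrayList<>(); // read only after join()

        Thread writer = new Thread(() -> {
            try {
                for (List<String> chunk = handOff.take(); chunk != STOP; chunk = handOff.take()) {
                    written.add(chunk.size()); // stand-in for the database phase
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        try {
            List<String> buffer = new ArrayList<>();
            for (String triple : triples) {
                buffer.add(triple);
                if (buffer.size() == chunkSize) { // chunk limit reached
                    handOff.put(buffer);
                    buffer = new ArrayList<>();
                }
            }
            if (!buffer.isEmpty()) {
                handOff.put(buffer); // flush the remainder when input ends
            }
            handOff.put(STOP);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        return written;
    }
}
```

<p>With seven input items and a chunk size of 3, <code>load</code> returns the sizes 3, 3 and 1: two full chunks, then the remainder flushed at the end.</p>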
<p>Each loader sets up two temporary tables (<code>NNode</code> and <code>NTrip</code>) that
mirror the <code>Nodes</code> and <code>Triples</code> tables. The temporary tables are virtually
identical to the tables they mirror, except that (a) they are not indexed and (b) for the index
variant there is no index column for nodes.</p>
<p>When loading, prepared triples (triples that have been broken down
ready for the database) are passed to the loader core, which normally
runs on a different thread. When the chunk size is reached, or
there are no more triples, the following happens:</p>
<ul>
<li>Prepared nodes are added in one go to <code>NNode</code>. Duplicate nodes
within a chunk are suppressed on the Java side (this is worth doing
since they are quite common, e.g. properties).</li>
<li>Prepared triples are added in one go to <code>NTrip</code>.</li>
<li>New nodes are added to the node table (duplicate suppression is
explained below).</li>
<li>New triples are added to the triple table (once again
suppressing duplicates). For the index case this involves joining on the
node table to do a hash to index lookup.</li>
<li>We commit.</li>
<li>If anything goes wrong the transaction (the chunk) is rolled
back, and an exception is thrown (or readied for throwing on the
calling thread).</li>
</ul>
<p>Thus there are five calls to the database for every chunk. The
database handles almost all of the work uninterrupted (duplicate
suppression, hash to index lookup), which makes loading reasonably
quick.</p>
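<p>The Java-side suppression of duplicate nodes within a chunk can be pictured with a minimal sketch. The class <code>ChunkNodeDedupSketch</code> and the representation of a triple as a three-element <code>String[]</code> are assumptions made for illustration; SDB's own prepared-triple types are not shown here.</p>

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only (hypothetical class, not part of SDB). */
final class ChunkNodeDedupSketch {
    /**
     * Returns every distinct node that occurs in the chunk, once each,
     * in first-seen order. Each triple is a {subject, predicate, object} array.
     */
    static Set<String> distinctNodes(List<String[]> chunk) {
        Set<String> nodes = new LinkedHashSet<>(); // a set drops repeated nodes
        for (String[] triple : chunk) {
            for (String node : triple) {
                nodes.add(node);
            }
        }
        return nodes;
    }
}
```

<p>For a chunk in which many triples share a property such as <code>rdf:type</code>, the property appears once in the result, so it would be sent to the staging table only once.</p>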
<h2 id="duplicate-suppression">Duplicate Suppression</h2>
<p>MySQL has a very useful <code>INSERT IGNORE</code>, which will keep going,
skipping an offending row if a uniqueness constraint is violated.
For other databases we need something else.</p>
<p>Having tried a number of options, the best approach seems to be to <code>INSERT</code>
new items by doing a <code>LEFT JOIN</code> from the new items to the existing items, then
filtering with <code>WHERE (existing item feature) IS NULL</code>. Specifically,
for the triple hash case (where no id lookups are needed):</p>
<pre><code>INSERT INTO Triples
SELECT DISTINCT NTrip.s, NTrip.p, NTrip.o -- DISTINCT because new triples may contain duplicates (not so for nodes)
FROM NTrip LEFT JOIN Triples ON (NTrip.s=Triples.s AND NTrip.p=Triples.p AND NTrip.o=Triples.o)
WHERE Triples.s IS NULL OR Triples.p IS NULL OR Triples.o IS NULL
</code></pre>
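<p>The effect of this statement can be modelled in memory to make the semantics concrete. The sketch below is an assumption-laden illustration, not SDB code: the class <code>AntiJoinInsertSketch</code> is hypothetical, each row is a single string standing in for an (s, p, o) tuple, and a <code>Set</code> stands in for the <code>Triples</code> table.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only (hypothetical class, not part of SDB). */
final class AntiJoinInsertSketch {
    /**
     * Adds to 'existing' each staged row that is not already present,
     * once per distinct row, and returns the rows that were inserted.
     */
    static List<String> insertNew(Set<String> existing, List<String> staged) {
        List<String> inserted = new ArrayList<>();
        // LinkedHashSet plays the role of SELECT DISTINCT over the staged rows.
        for (String row : new LinkedHashSet<>(staged)) {
            // No match in 'existing' corresponds to the joined side being NULL.
            if (!existing.contains(row)) {
                inserted.add(row);
            }
        }
        existing.addAll(inserted);
        return inserted;
    }
}
```

<p>Running the same staged rows through <code>insertNew</code> a second time inserts nothing, which is the behaviour wanted from duplicate suppression.</p>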
</div>
</div>
</div>
<footer class="footer">
<div class="container" style="font-size:80%" >
<p>
Copyright &copy; 2011&ndash;2022 The Apache Software Foundation, Licensed under the
<a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
</p>
<p>
Apache Jena, Jena, the Apache Jena project logo, Apache and the Apache feather logos are trademarks of
The Apache Software Foundation.
<br/>
<a href="https://privacy.apache.org/policies/privacy-policy-public.html"
>Apache Software Foundation Privacy Policy</a>.
</p>
</div>
</footer>
<script type="text/javascript">
var link = $('a[href="' + this.location.pathname + '"]');
if (link != undefined)
link.parents('li,ul').addClass('active');
</script>
</body>
</html>