TDB xloader (“x” for external) is a bulk loader for very large datasets. The goal is stability and reliability for long-running loads on modest hardware. It can be used to load a database on rotating disk or SSD.
xloader is not a replacement for the regular TDB1 and TDB2 loaders. It is for very large datasets.
There are two scripts to load data using the xloader subsystem: tdb1.xloader and tdb2.xloader.
“tdb1.xloader”, formerly called “tdbloader2”, has some improvements.
It is not as fast as other TDB loaders on datasets where the general loaders work without encountering progressive slowdown.
The xloaders for TDB1 and TDB2 are not identical. The TDB2 xloader is more capable; it is based on the same design approach, with further refinements to building the node table and to reducing the total amount of temporary file space used.
The xloader does not run on MS Windows. It uses an external sort program from Unix, sort(1).
The xloader only builds a fresh database from empty. It cannot be used to load data into an existing database.
tdb2.xloader --loc DIRECTORY FILE...
or
tdb1.xloader --loc DIRECTORY FILE...
Additionally, there is an argument --tmpdir to use a different directory for temporary files.
FILE is a file in any RDF syntax supported by Jena. Syntax is determined by the file extension, which can include an additional “.gz” or “.bz2” for compressed files.
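As an illustration of the arguments described above, the following sketch assembles a tdb2.xloader command line with a separate temporary directory and compressed input files. The paths and file names are placeholders, and the `xload` wrapper with its `RUN` switch is a hypothetical helper, not part of Jena. By default it only prints the command so the sketch can be tried without starting a load.

```shell
#!/bin/sh
# Placeholder locations: adjust to your own system.
DB_DIR=${DB_DIR:-/data/tdb2/DB}
TMP_DIR=${TMP_DIR:-/scratch/xloader-tmp}

# Hypothetical wrapper: prints the command unless RUN=1 is set.
xload() {
    if [ "${RUN:-0}" = 1 ]; then
        tdb2.xloader --loc "$DB_DIR" --tmpdir "$TMP_DIR" "$@"
    else
        echo tdb2.xloader --loc "$DB_DIR" --tmpdir "$TMP_DIR" "$@"
    fi
}

# Syntax is taken from the extension; .gz and .bz2 inputs are accepted.
xload data-part1.nt.gz data-part2.ttl.bz2
```

Setting `RUN=1` in the environment would run the load for real; remember that the database directory must not already contain a database.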
tdb2.xloader also supports the argument --threads to set the number of threads to use with sort(1). The default is 2. The recommendation for an initial setting is to set it to the number of cores (not hardware threads) minus 1. This is sensitive to the hardware environment. Experimentation may show a different, better setting.
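The "cores minus 1" starting point can be computed rather than guessed. The sketch below assumes a Linux system where `lscpu` is available to count physical cores. Where it is not, the sketch falls back to the logical CPU count, which may overstate the number of cores on machines with hardware threads. The function names are hypothetical.

```shell
#!/bin/sh
# Count physical cores: one unique (core, socket) pair per core.
# Assumes util-linux lscpu; otherwise falls back to logical CPUs.
physical_cores() {
    if command -v lscpu >/dev/null 2>&1; then
        lscpu --parse=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' '
    else
        getconf _NPROCESSORS_ONLN
    fi
}

# Cores minus 1, but never less than 1.
xloader_threads() {
    cores=$(physical_cores)
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}

THREADS=$(xloader_threads)
echo "suggested --threads value: $THREADS"
# Example use (DB and data.nt.gz are placeholder names):
#   tdb2.xloader --loc DB --threads "$THREADS" data.nt.gz
```

Treat the result as an initial value only; timing a few runs with different settings is the way to find what suits a given machine.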
To avoid a load failing due to a syntax or other data error, it is advisable to run riot --check on the data first. Parsing is faster than loading.
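A pre-load check over several input files might look like the sketch below. It assumes that `riot` is on the PATH and that it exits with a non-zero status when it finds errors. The `check_all` function and the `RIOT` variable are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Command used for checking; override RIOT to point at a specific install.
RIOT=${RIOT:-riot}

# Run "riot --check" on each file; stop at the first file that fails.
# Parsed output is discarded: only the exit status and messages matter here.
check_all() {
    for f in "$@"; do
        if ! "$RIOT" --check "$f" >/dev/null; then
            echo "check failed: $f" >&2
            return 1
        fi
    done
    return 0
}

# Example use (file names are placeholders):
#   check_all data-part1.nt.gz data-part2.ttl.bz2 && echo "all files parsed"
```

Warnings and errors are expected to appear on standard error, so they remain visible even though the parsed output is discarded.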
The TDB databases will take up a lot of disk space and, in addition, xloader uses a significant amount of temporary disk space during loading.
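Since a load that runs out of disk space fails late, it can be worth checking free space on both the database and temporary directories before starting. The following sketch uses POSIX `df`. The helper names are hypothetical, and the threshold `MIN_KB` is a value you must choose yourself, as no figures are given here for how much space a particular dataset needs.

```shell
#!/bin/sh
# Print the available space, in KiB, on the filesystem holding directory $1.
# Assumes POSIX "df -P" output with the available space in column 4.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if directory $1 has at least $2 KiB available.
has_space() {
    avail=$(free_kb "$1")
    [ "$avail" -ge "$2" ]
}

# Placeholder values: current directory and a zero threshold.
CHECK_DIR=${CHECK_DIR:-.}
MIN_KB=${MIN_KB:-0}

if has_space "$CHECK_DIR" "$MIN_KB"; then
    echo "ok: $(free_kb "$CHECK_DIR") KiB available in $CHECK_DIR"
else
    echo "not enough space in $CHECK_DIR" >&2
fi
```

Run the check once for the `--loc` directory and once for the `--tmpdir` directory if they are on different filesystems.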
If desired, the data can also be converted to RDF Thrift by adding --stream rdf-thrift to the riot checking run. Parsing RDF Thrift is faster than parsing N-Triples, although the bulk of the loading process is not limited by parser speed.
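Combining the check and the conversion in one pass could be sketched as follows. This assumes that `riot` writes the converted stream to standard output and returns a non-zero status on errors. The `to_thrift` function and `RIOT` variable are hypothetical, and the `.rt` extension in the usage comment is an assumption about how Jena recognizes RDF Thrift files.

```shell
#!/bin/sh
RIOT=${RIOT:-riot}

# Check the source file $1 and write it as RDF Thrift to $2.
# A partial output file is removed if the check fails.
to_thrift() {
    if "$RIOT" --check --stream rdf-thrift "$1" > "$2"; then
        return 0
    fi
    rm -f "$2"
    return 1
}

# Example use (file names are placeholders):
#   to_thrift data.nt.gz data.rt && tdb2.xloader --loc DB data.rt
```

The converted file takes additional disk space, so this is only worthwhile where that space is available.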
Do not capture the bulk loader output in a file on the same disk as the database or temporary directory; it slows loading down.
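One way to guard against this is to compare the filesystems behind the log directory and the database or temporary directory before starting. The sketch below is only an approximation: two filesystems can still share one physical disk (for example, as partitions), which this check does not detect. The helper names and directory variables are hypothetical.

```shell
#!/bin/sh
# Print the filesystem (device) name that holds directory $1.
fs_of() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Succeed if both directories are on the same filesystem.
same_filesystem() {
    [ "$(fs_of "$1")" = "$(fs_of "$2")" ]
}

# Placeholder directories; both default to the current directory.
LOG_DIR=${LOG_DIR:-.}
DB_DIR=${DB_DIR:-.}

if same_filesystem "$LOG_DIR" "$DB_DIR"; then
    echo "warning: log directory shares a filesystem with the database" >&2
fi
# Example use once the directories differ (names are placeholders):
#   tdb2.xloader --loc "$DB_DIR" data.nt.gz > "$LOG_DIR/load.log" 2>&1
```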