Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# Default Indexing Tool for RDF
This tool provides a default configuration for creating a Solr index of RDF
files (e.g. a SKOS export of a thesaurus or a set of FOAF files).
## Building
If not yet built during the build process of the Entityhub, call

    mvn install

to build the JAR with all the dependencies used later for indexing.

If the build succeeds, go to the /target directory and copy the

    org.apache.stanbol.entityhub.indexing.genericrdf-*.jar

to the directory in which you would like to start the indexing.
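For example (the target directory is an illustrative choice):

    cp target/org.apache.stanbol.entityhub.indexing.genericrdf-*.jar ~/myindex/
    cd ~/myindex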
## Indexing
### (1) Initialize the configuration
The default configuration is initialized by calling
java -jar org.apache.stanbol.entityhub.indexing.genericrdf-*.jar init
This will create a sub-folder "indexing" in the current directory.
Within this folder all the
* configurations (indexing/config)
* source files (indexing/resources)
* created files (indexing/destination)
* distribution files (indexing/dist)
will be located.
### (2) Adapt the configuration
The configuration is located within the
indexing/config
directory.
The indexer supports two indexing modes:

1. Iterate over the data and look up the scores for entities (default).
For this mode an "entityDataIterable" and an "entityScoreProvider" MUST BE
configured. If no entity scores are available, a default entityScoreProvider
that provides no scores can be used. This mode is typically used to index all
entities of a dataset.
2. Iterate over the entity IDs and scores and look up the data. For this mode an
"entityIdIterator" and an "entityDataProvider" MUST BE configured. This mode is
typically used if only a small subset of a large dataset is indexed. This might
be the case if entity scores are available and one only wants to index e.g. the
10,000 most important entities, or if a dataset contains entities of many different
types but only entities of a specific type (e.g. species in DBpedia) should be
included.
The configuration of the mentioned components is contained in the main indexing
configuration file explained below.
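As a quick orientation, the two modes differ in which keys of that file are set.
A hypothetical sketch (the values in angle brackets are placeholders for the
implementations listed in the generated configuration):

    # Mode 1 (default): iterate over the RDF data and look up entity scores
    entityDataIterable=<IndexingSource-implementation>,source:rdfdata
    entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider

    # Mode 2: iterate over entity IDs/scores and look up the data
    #entityIdIterator=<EntityIterator-implementation>,source:<score-file>
    #entityDataProvider=<EntityDataProvider-implementation>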
#### Main indexing configuration (indexing.properties)
This file contains the main configuration for the indexing process.
* the "name" property MUST BE set to the name of the referenced site to be created
by the indexing process
* the "entityDataIterable" is used to configure the component iterating over the
RDF data to be indexed. The "source" parameter refers to the directory the RDF
files to be indexed are searched. The RDF files can be compressed with 'gz',
'bz2' or 'zip'. It is even supported to load multiple RDF files contained in a
single ZIP archive.
* the "entityScoreProvider" is used to provide the ranking for entities. A
typical example is the number of incoming links. Such rankings are typically
used to weight recommendations and sort result lists. (e.g. by a query for
"Paris" it is much more likely that a user refers to Paris in France as to one
of the two Paris in Texas). If no rankings are available you should use the
"org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider".
* the "scoreNormalizer" is only useful in case entity scores are available.
This component is used to normalize rankings or also to filter entities with
low rankings.
* the "entityProcessor" is used to process (map, convert, filter) information
of entities before indexing. The mapping configuration is provided in an separate
file (default "mapping.txt").
* the "entityPostProcessor" is used to process already indexed entities in a
2nd iteration. This has the advantage, that processors used in the post-processing
can assume that all raw data are already present within IndexingDestination.
For this step the IndexingDestination is used for both source and destination.
See also [STANBOL-591](https://issues.apache.org/jira/browse/STANBOL-591)
* indexes need to provide the configuration used to store entities. The
"fieldConfiguration" property allows this to be specified. Typically it is the
same mapping file as used for the "entityProcessor"; however, this is not a
requirement.
* the "indexingDestination" property is used to configure the target for the
indexing. Currently there is only a single implementation that stores the indexed
data within a SolrYard. The "boosts" parameter can be used to boost (see Solr
Documentation for details) specific fields (typically labels) for full text
searches.
* all properties starting with "org.apache.stanbol.entityhub.site." are used for
the configuration of the referenced site.
Please also note the documentation within the "indexing.properties" file for details.
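Putting these properties together, a fragment of an "indexing.properties" for
mode 1 might look like the following sketch (the site name and the values in
angle brackets are illustrative placeholders; the generated default file
documents the actual implementations and parameters):

    name=myThesaurus
    entityDataIterable=<IndexingSource-implementation>,source:rdfdata
    entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider
    fieldConfiguration=mappings.txt
    indexingDestination=<SolrYard-destination-implementation>,boosts:<boost-config>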
#### Mapping configuration (mappings.txt)
Mappings are used for three different purposes:
1. During the indexing process by the "entityProcessor" to process the
information of each entity
2. At runtime by the local Cache to process single Entities that are updated in the cache.
3. At runtime by the Entityhub when importing an Entity from a referenced Site.
The configurations for (1) and (2) are typically identical. For (3) one might
want to use a different configuration. The default configuration uses the same
file (mappings.txt) for (1) and (2) and no specific configuration for (3).
The default mappings.txt already includes mappings for popular ontologies
such as Dublin Core, SKOS and FOAF. Domain-specific mappings can be added to
this configuration.
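For illustration, a few mapping lines in the style used by the default file (the
"myont" prefix is hypothetical; a plain field name keeps that property, while '>'
maps a source property onto a target property):

    # keep all FOAF properties
    foaf:*
    # map a domain specific title property onto rdfs:label
    myont:title > rdfs:label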
#### Score Normalizer configuration
The default configuration also provides examples for the configuration of the
different score normalisers. However, by default they are not used.

* "minscore.properties": Example of how to configure a minimum score for Entities
to be indexed
* "scorerange.properties": Example of how to normalise the maximum/minimum score
of Entities to the configured range.
NOTE:
* To use score normalisation, scores need to be provided for Entities. This means
an "entityScoreProvider" or an "entityIdIterator" needs to be configured
(indexing.properties).
* Multiple score normalisers can be used. The call order is determined by the
configuration of the "scoreNormalizer" property (indexing.properties).
### (3) Provide the RDF files to be indexed
All sources for the indexing process need to be located within the

    indexing/resources

directory. By default the RDF files need to be located within

    indexing/resources/rdfdata

however this can be changed via the "source" parameter of the "entityDataIterable"
or "entityDataProvider" property in the main indexing configuration (indexing.properties).
Supported RDF formats are:

* RDF/XML (by using one of "rdf", "owl", "xml" as extension): Note that this
format is not well suited for importing large RDF datasets.
* N-Triples (by using "nt" as extension): This is the preferred format for
importing (especially large) RDF datasets.
* Turtle (by using "ttl" as extension)
* N3 (by using "n3" as extension)
* N-Quads (by using "nq" as extension): Note that all named graphs will be
imported into the same index.
* TriG (by using "trig" as extension)
Supported compression formats are:

* "gz" and "bz2" files: One needs to use double file extensions to indicate both
the used compression and the RDF file format (e.g. myDump.nt.bz2)
* "zip": For ZIP archives, all files within the archive are treated separately.
That means that even if a ZIP archive contains multiple RDF files, all of them
will be imported.
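For example, to make a compressed N-Triples dump (the file name is illustrative)
available to the indexer:

    cp myDump.nt.bz2 indexing/resources/rdfdata/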
### (4) Create the Index
java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*.jar index
Note that calling the utility with the option -h will print the help.
## Use the created index with the Entityhub
After the indexing completes, the distribution folder

    indexing/dist

will contain two files:
1. org.apache.stanbol.data.site.{name}-{version}.jar: This is a bundle that can
be installed in any OSGi environment running the Apache Stanbol Entityhub. When
started, it will create and configure
* a "ReferencedSite" accessible at "http://{host}/{root}/entityhub/site/{name}"
* a "Cache" used to connect the ReferencedSite with your data and
* a "SolrYard" that manages the data indexed by this utility.
When installing this bundle the site will not yet work, because the bundle
does not contain the indexed data but only the configuration for the Solr index.
2. {name}.solrindex.zip: This is the ZIP archive with the indexed data. This
file will be requested by the Apache Stanbol Data File Provider after installing
the bundle described above. To install the data you need to copy this file to the
"/sling/datafiles" folder within the working directory of your Stanbol server.
If you copy the ZIP archive before installing the bundle, the data will be
picked up automatically during the installation of the bundle. If you provide
the file afterwards, you will also need to restart the SolrYard installed by the
bundle.
{name} denotes the value you configured for the "name" property within the
"indexing.properties" file.
### A note about blank nodes
If your input data sets contain large numbers of blank nodes, you may find that
you run out of heap space during indexing. This is because Jena (like many
semantic stores) keeps a store of blank nodes in memory while importing. Keeping
in mind that the Entityhub does not support the use of blank nodes, there is
nonetheless a means of indexing such data sets: convert the blank nodes to named
nodes and then index. There is a convenient tool packaged with Stanbol for this
purpose, called "Urify" (org.apache.stanbol.entityhub.indexing.Urify).
It is available in the runnable JAR file built by this indexer. To use it, put that
JAR on your classpath and execute Urify, giving it a list of files to process.
Use the "-h" or "--help" flag to see the options for Urify:
java -Xmx1024m -cp org.apache.stanbol.entityhub.indexing.genericrdf-*.jar \
org.apache.stanbol.entityhub.indexing.Urify --help
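A typical call then simply lists the files to process (the file name is
illustrative; the help output documents the available options):

    java -Xmx1024m -cp org.apache.stanbol.entityhub.indexing.genericrdf-*.jar \
        org.apache.stanbol.entityhub.indexing.Urify myDump.nt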