This tool creates local indexes of DBpedia to be used with the Stanbol Entityhub.
If the jar was not yet built by the build process of the Entityhub, call
mvn install
to build the jar with all the dependencies used later for indexing.
If the build succeeds, go to the /target directory and copy the
org.apache.stanbol.entityhub.indexing.dbpedia-*.jar
to the directory in which you would like to start the indexing.
The configuration can be initialized with the defaults by calling
java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*.jar init
This will create a sub-folder with the name indexing in the current directory. Within this folder all configuration files and resources used for the indexing will be located.
The indexing itself can be started by
java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*.jar index
but before doing so please note points (2), (3) and (4) below.
All RDF dumps need to be copied to the directory:
indexing/resources/rdfdata
The files do not need to be decompressed. The raw “.nt.bz2” files from DBpedia can be downloaded to that folder directly.
At the time of writing, version 3.6 is the latest release. All available archives are referenced on this page:
http://wiki.dbpedia.org/Downloads36
The RDF dump of DBpedia.org is split up into a number of different files. The actual files needed depend on the configuration of the mappings (indexing/config/mappings.txt). Generally one needs to make sure that all the RDF dumps containing the source data for the specified mappings are available. A best practice is to use the previews of the dumps to check whether the data of a dump is required or not. Here is the list of typically interesting dump files:
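Whichever dumps you select, a quick sanity check that they are all in place might look like the following sketch (the three file names are example names from the DBpedia 3.6 download page, not a required set):

```shell
# Sketch: report which of the expected dump files are still missing
# from a given directory.
check_dumps() {
    dir=$1; shift
    for f in "$@"; do
        [ -f "$dir/$f" ] || echo "missing: $f"
    done
}

# Example invocation with a few common dump file names:
check_dumps indexing/resources/rdfdata \
    labels_en.nt.bz2 instance_types_en.nt.bz2 short_abstracts_en.nt.bz2
```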
During the first part of the indexing (a.k.a. the initialisation step) all RDF files within the “indexing/resources/rdfdata” directory will be imported into a Jena TDB RDF triple store. The imported data is stored under:
indexing/resources/tdb
and can be reused for subsequent indexing processes.
To avoid (re)importing already imported resources, one needs to remove such RDF files from “indexing/resources/rdfdata” or - typically the better option - rename the “rdfdata” folder after the initial run.
It is also safe to:
cancel the indexing process after the initialization has completed (as soon as the log says that the indexing has started).
load additional RDF dumps by putting additional RDF files into the “rdfdata” directory. These files will be added to the others on the next start of the indexing tool.
The DBpedia.org indexer uses the incoming links from other wiki pages to calculate the rank of entities. Entities with more incoming links get a higher rank. An RDF dump containing all outgoing wiki links is available on DBpedia (page_links_en.nt.bz2). This file needs to be processed with the following command to produce a file containing an ordered list of incoming-link counts and the local names of the entities.
curl http://downloads.dbpedia.org/3.6/en/page_links_en.nt.bz2 \
  | bzcat \
  | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
  | sort \
  | uniq -c \
  | sort -nr > incoming_links.txt
Note: replace “3.6” with the latest release version of DBpedia in the above command line to get up-to-date data.
Depending on the machine and the download speed for the source file, the execution of this command will take several hours.
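What the pipeline does can be tried offline on a few sample triples instead of the full dump (the three triples below are illustrative only): the sed expression keeps the local name of the link target, and sort | uniq -c | sort -nr counts and ranks them.

```shell
# Offline sketch of the incoming-link pipeline on three sample triples.
printf '%s\n' \
  '<http://dbpedia.org/resource/A> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Berlin> .' \
  '<http://dbpedia.org/resource/B> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Berlin> .' \
  '<http://dbpedia.org/resource/C> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Paris> .' \
  | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
  | sort \
  | uniq -c \
  | sort -nr
# Berlin (2 incoming links) is listed before Paris (1 incoming link).
```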
Important NOTES:
It is strongly suggested to:
first check whether Category: is still used as prefix (e.g. by checking whether http://dbpedia.org/page/Category:Political_culture is still valid)
second, if that is the case, replace all occurrences of “CAT:” with “Category:”
sed -i 's/CAT:/Category:/g' incoming_links.txt
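The replacement can be tried on a small sample file first; the file name and contents below are illustrative only. Note that BSD/macOS sed needs `sed -i ''` instead of plain `sed -i`.

```shell
# Illustrative sample: one CAT:-prefixed category and one plain entity.
printf '%s\n' '   42 CAT:Political_culture' '   17 Berlin' > sample_links.txt
sed -i 's/CAT:/Category:/g' sample_links.txt
cat sample_links.txt
# the first line now reads "   42 Category:Political_culture"
```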
The resulting file MUST BE copied to
indexing/resources/incoming_links.txt
There is also the possibility to download a precomputed file from:
TODO: add download location
The configurations are contained within the “indexing/config” folder:
After the indexing completes, the distribution folder
/indexing/dist
will contain two files:
dbpedia.solrindex.zip
: This is the ZIP archive with the indexed data. This file will be requested by the Apache Stanbol Data File Provider after installing the bundle described below. To install the data you need to copy this file to the “/sling/datafiles” folder within the working directory of your Stanbol server.
If you are currently using the default data for DBPedia.org (as included in the Stanbol launchers) copying this file to the datafiles directory will automatically trigger the replacement of the default data. The new data will be available as soon as the copying of the data and the initialisation of the index completes. In the meantime queries will still use the old data.
To update from one “dbpedia.solrindex.zip” to another, you first need to delete the existing dbpedia.solrindex.zip from the datafiles folder. Secondly you need to stop the “org.apache.stanbol.data.site.dbpedia-{version}” bundle. This should trigger the deletion of the current dbpedia index. After the deletion you need to reactivate the bundle and copy the new “dbpedia.solrindex.zip” into the “datafiles” folder. You may already copy the new “dbpedia.solrindex.zip” to the datafiles folder while the bundle is deactivated.
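The file-handling part of these update steps can be sketched as a small helper function; the bundle stop/start (steps 2 and 4) cannot be scripted here and stays a manual step in the Felix Web Console. The paths are arguments, not fixed locations.

```shell
# Hedged sketch of the update steps above; only the datafile handling
# is automated, the bundle lifecycle remains manual.
update_dbpedia_index() {
    new_zip=$1
    datafiles=$2   # e.g. <stanbol-working-dir>/sling/datafiles
    # 1) delete the existing archive from the datafiles folder
    rm -f "$datafiles/dbpedia.solrindex.zip"
    # 2) stop the org.apache.stanbol.data.site.dbpedia-{version} bundle
    #    (manually, via the Felix Web Console)
    # 3) copy the new archive (allowed while the bundle is deactivated)
    cp "$new_zip" "$datafiles/dbpedia.solrindex.zip"
    # 4) re-activate the bundle (manually, via the Felix Web Console)
}

# Example: update_dbpedia_index /path/to/new/dbpedia.solrindex.zip \
#              /opt/stanbol/sling/datafiles
```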
TODO: updating from one dbpedia.solrindex.zip to another is currently too complex! This could be simplified by also setting “Synchronized=true” for DBpedia.org, or by providing a management user interface for the ManagedSolrServer (e.g. a tab within the Apache Felix Web Console).
org.apache.stanbol.data.site.dbpedia-{version}.jar
:
NOTE: The installation of this bundle is no longer required, because all current Stanbol launchers already include it. If you are using a launcher configuration that does not include this bundle, the steps described below are still required.
This is a bundle that can be installed in any OSGi environment running the Apache Stanbol Entityhub (for instance via the Apache Felix Web Console under http://server:port/system/console - account admin / admin by default).
When started it will create and configure:
In case you install the bundle before copying the “dbpedia.solrindex.zip” to “/sling/datafiles”, you will need to restart the dbpedia “SolrYard” instance.
This describes the default configuration as initialized during the first start of the indexing tool.
The default configuration creates an index with the following features:
By default English, German, French and Italian literals, as well as all literals without any language information, are indexed. Please note that one also needs to provide the RDF dumps for those languages.
DBpedia.org uses “rdfs:label” for labels. Short descriptions are stored within “rdfs:comment” and longer versions in “dbp-ont:abstract”. For both labels and descriptions, generic language analyzers are used for indexing. Term vectors are also stored so that “More Like This” queries can be used on these fields. Abstracts are only indexed and not stored in the index. This means that their values can be searched but not retrieved.
The types of the entities (Person, Organisation, Places, ...) are stored in “rdf:type”. Values are URLs as defined mainly by the DBpedia.org ontology.
The geo locations are indexed within “geo:lat”, “geo:long” and “geo:alt”. The mappings ensure that lat/long values are doubles and the altitude values are integers.
DBpedia also contains categories. Entities are linked to categories by the “skos:subject” and/or the “dcterms:subject” property. During the import all values of “dcterms:subject” are copied to “skos:subject”. Categories themselves are hierarchical; parent categories can be reached by following “skos:broader” relations, e.g.
Berlin -> skos:subject -> Category:City-states -> skos:broader -> Category:Cities -> skos:broader -> Category:Populated_places -> skos:broader -> Category:Human_habitats ...
All properties defined by SKOS (http://www.w3.org/TR/skos-reference/) are indexed and stored.
All properties of the DBpedia.org Ontology are indexed and stored in the index (see http://wiki.dbpedia.org/Ontology).
Properties are fields/values taken directly from the infoboxes on the right side of Wikipedia pages. Field names may depend on the language, and the data type of the values may differ from entity to entity. Because of this, such properties are not indexed by the default configuration. It is possible to include some or all of these properties by changing mappings.txt. Note that in that case it is also required to include the RDF dump containing this data.
DBpedia uses FOAF (http://www.foaf-project.org/) to provide additional information for persons. Some properties such as foaf:homepage are also used for entities of other types. All properties defined by FOAF are indexed and stored.
DC Elements and DC Terms metadata are indexed and stored. All DC Element properties are mapped to their DC Terms counterparts.