tree: 881cebb0631fee82e476a07cb87939b8108016d9 [path history] [tgz]

entityhub/indexing/dbpedia/README.md

Indexer for the DBpedia dataset (see http://dbpedia.org/)

This tool creates local indexes of DBpedia to be used with the Stanbol Entityhub.

Building:

If not yet build by the built process of the entityhub call

mvn install

to build the jar with all the dependencies used later for indexing.

If the build succeeds go to the /target directory and copy the

org.apache.stanbol.entityhub.indexing.dbpedia-*.jar

to the directory you would like to start the indexing.

Index:

(1) Initialize the configuration

The configuration can be initialized with the defaults by calling

java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*.jar init

This will create a sub-folder with the name indexing in the current directory. Within this folder all the

configurations (indexing/config)
source files (indexing/resources)
created files (indexing/destination)
distribution files (indexing/distribution)

will be located.

The indexing itself can be started by

java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*.jar index

but before doing this please note the points (2), (3) and (4)

(2) Download the dbPedia Dump Files:

All RDF dumps need to be copied to the directory:

indexing/resources/rdfdata

The files do not need to be decompressed. The raw “.nt.bz2” files from DBpedia can downloaded to that folder directly.

At the time of writing, version 3.6 is the latest release. All available archives are referenced on this page:

http://wiki.dbpedia.org/Downloads36

The RDF dump of DBpedia.org is splitted up in a number of different files. The actual files needed depend on the configuration of the mappings (indexing/config/mappings.txt). Generally one need to make sure that all the RDF dumps with the source data for the specified mappings are available. A best is to use the previews of the dumps to check if the data of a dump is required or not. Here is the list of typically interesting dump files:

During the first part of the indexing (a.k.a. the initialisation step) all the RDF files within the “indexing/resources/rdfdata” directory will be imported to an Jena TDB RDF triple store. The imported data are stored under:

indexing/resources/tdb

and can be reused for subsequent indexing processes.

To avoid (re)importing of already imported resources one need to remove such RDF files from the “indexing/resources/rdfdata” or - typically the better option - rename the “rdfdata” folder after the initial run.

It is also safe to:

cancel the indexing process after the initialization has completed (as soon as the log says that the indexing has started).
load additional RDF dumps by putting additional RDF files to the “rdfdata” directory. This files will be added to the others on the next start of the indexing tool.

(3) Entity Scores

The DBpedia.org indexer uses the incoming links from other wikipages to calculate the rank of entities. Entities with more incoming links get an higher rank. A RDF dump containing all outgoing wiki links is available on DBpedia (page_links_en.nt.bz2). This file need to be processed with the following command to get an file containing an ordered list of incoming count and the local name of the entity.

curl http://downloads.dbpedia.org/3.6/en/page_links_en.nt.bz2 \
    | bzcat \
    | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
    | sort \
    | uniq -c  \
    | sort -nr > incoming_links.txt

Note: replace “3.6” by the latest release version of DBpedia in the above command line to get up to date data.

Depending on the machine and the download speed for the source file the execution of this command will take several hours.

Important NOTES:

Links to Categories use wrong URLs in the current version (3.6) of the page_links_en.nt.bz2 dump. All categories start with “CAT:{categoryName}” but the correct local name would be “Category:{categoryName}”. because of this categories would not be indexed.

It is strongly suggested to:

first check if still Category: is used as prefix (e.g. by checking if http://dbpedia.org/page/Category:Political_culture is still valid)
second if that is the case replace all occurrences of “CAT:” to “Category:”
sed -i ‘s/CAT:/Category:/g’ incoming_links.txt

The resulting file MUST BE copied to

indexing/resources/incoming_links.txt

There is also the possibility do download a precomputed file from:

TODO: add download location

(4) Configuration of the Index

The configurations are contained within the “indexing/config” folder:

indexing.properties: Main configuration for the indexing process. It defines the used components and there configurations. Usually no need to make any changes.
mapping.txt: Define the fields, data type requirements and languages to be indexed. Note: It is also important that the dumps containing the RDF data are available.
dbpedia/conf/schema.xml: Defines the schema used by Solr to store the data. This can be used to configure e.g. if values are stored (available for retrieval) or only indexed. See the comments within the file for details
fieldBoosts.properties: Can be used to set boost factors for fields.
minincoming.properties: Can be used to define the minimum number of incoming links (to an Wiki page from other Wiki pages) so that an entity is indexed. Higher values will cause less entities to be indexed. A value of 0 will result in all entities to be indexed.
scoreRange.properties: Can be use to set the upper bound for entities score. The entities with the most incoming links will get this score. Entities with no incoming links would get a score of zero.

(5) Using the precomputed Index:

Use the created index with the Entityhub

After the indexing completes the distribution folder

/indexing/dist

will contain two files

dbpedia.solrindex.zip: This is the ZIP archive with the indexed data. This file will be requested by the Apache Stanbol Data File Provider after installing the Bundle described above. To install the data you need copy this file to the “/sling/datafiles” folder within the working directory of your Stanbol Server.
If you are currently using the default data for DBPedia.org (as included in the Stanbol launchers) copying this file to the datafiles directory will automatically trigger the replacement of the default data. The new data will be available as soon as the copying of the data and the initialisation of the index completes. In the meantime queries will still use the old data.
To update from one “dbpedia.solrindex.zip” to an other you need first to delete the existing dbpedia.solrindex.zip from the datafiles folder. Secondly you need to stop the “org.apache.stanbol.data.site.dbpedia-{version}” bundle. This should trigger the deletion of the current dbpedia index. After the deletion you need to reactivate the bundle and copy the new “dbpedia.solrindex.zip” into the “datafiles” folder. You can already copy the new “dbpedia.solrindex.zip” to the datafile folder while the bundle is deactivated.
TODO: updating from one dbpedia.solrindex.zip to an other one is currently to complex! This could be simplified if we set “Synchronized=true” also for DBPedia.org, or by providing a management user interface for the ManagedSolrServer (e.g. Tab within the Apache Felix Web Console).
org.apache.stanbol.data.site.dbpedia-{version}.jar:
NOTE: The installation of this bundle is no longer required, because all current Stanbol launchers do already include this bundle. If you are using an launcher configuration that does not include this bundle the steps as described below are still required.
This is a Bundle that can be installed to any OSGI environment running the Apache Stanbol Entityhub (for instance using the Apache Felix web console under http://server:port/system/console - with account admin / admin by default).
When started it will create and configure:
- a “ReferencedSite” accessible at “http://{host}/{root}/entityhub/site/dbpedia”
- a “Cache” used to connect the ReferencedSite with your Data and
- a “SolrYard” that managed the data indexed by this utility.

In case you install the bundle before copying the “dbpedia.solrindex.zip” to “/sling/datafiles” you will need to restart the dbpedia “SolrYard” instance.

The used Default configuration:

This describes the default configuration as initialized during the first start of the indexing tool.

The default configuration stores creates an index with the following features:

Languages:

By default English, German, France and Italian and all literals without any language information are indexed. Please also note that one needs to provide also the RDF dumps for those languages.

Labels and Descriptions:

DBpedia.org uses “rdfs:label” for labels. Short description are stored within “rdfs:comment” and a longer version in “dbp-ont:abstract”. For both labels and descriptions generic language analyzer are used for indexing. Also Term Vectors are stored so that “More Like This” queries can be used on such fields. Abstracts are only indexed and not stored in the index. This means that values can be searched but not retrieved.

Entity types:

The types of the entities (Person, Organisation, Places, ...) are stored in “rdf:type”. Values are URLs as defined mainly by the DBpedia.org ontology.

Spatial Information:

The geo locations are indexed within “geo:lat”, “geo:long” and “geo:alt”. The mappings ensure that lat/long values are doubles and the altitude are integers.

Categories:

DBpedia contains also categories. Entities are linked to categories by the “skos:subject” and/or the “dcterms:subject” property. During the import all values defined by “dcterms:subject” are copied to “skos:subject”. Categories itself are hierarchical. Parent categories can be used by following “skos:broader” relations. e.g.

Berlin -> skos:subject
    -> Category:City-states -> skos:broader
        -> Category:Cities -> skos:broader
            -> Category:Populated_places -> skos:broader
                -> Category:Human_habitats ...

All properties defined by SKOS (http://www.w3.org/TR/skos-reference/) are indexed and stored.

DBpedia Ontology:

All properties of the DBpedia.org Ontology are indexed and stored in the index. see http://wiki.dbpedia.org/Ontology

DBpedia Properties:

Properties are field/values directly taken from the information boxes on the right side of Wikipedia pages. Fieldnames may depend on the language and also the data type of the values may be different from entity to entity. Because of this such entities are not indexed by the default configuration. It is possible to include some/all such properties by changing the mappings.txt. Note that in such cases it is also required do include the RDF dump containing this data.

Person related Properties:

DBpedia uses FOAF (http://www.foaf-project.org/) to provide additional information for persons. Some properties such as foaf:homepage are also used for entities of other types. All properties defined by FOAF are indexed and stored.

Dublin Core (DC) Metadata:

DC Elements and DC Terms metadata are indexed and stored. All DC Element properties are mapped to there DC Terms counterpart.