This tool creates local indexes of Freebase to be used with the Stanbol Entityhub.
If the tool has not yet been built by the build process of the Entityhub, call
mvn install
to build the jar with all the dependencies used later for indexing.
If the build succeeds, go to the /target directory and copy the
org.apache.stanbol.entityhub.indexing.freebase-*.jar
to the directory from which you would like to run the indexing.
The configuration can be initialized with the defaults by calling
java -jar org.apache.stanbol.entityhub.indexing.freebase-*.jar init
This will create a sub-folder with the name indexing in the current directory. All the files needed for the indexing will be located within this folder.
The indexing itself can be started by
java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-*.jar index
but before doing this, please read the notes below.
NOTEs:
Freebase provided full RDF dumps at
https://developers.google.com/freebase/data
You will need to download the dump and store it in the ‘indexing/resources/rdfdata’ folder.
The Entityhub Indexing tool supports the use of index time boosts. Those boosts can be set based on the number of references an Entity has within the Freebase knowledge base by calling
gunzip -c ${FB_DUMP} \
| grep "^ns:m\..*\t.*\tns:m\." \
| cut -f 3 | sed 's/.$//' \
| sort -S $MAX_SORT_MEM \
| uniq -c \
| sort -nr -S $MAX_SORT_MEM > $INCOMING_FILE
NOTE: Ubuntu requires a different syntax for grep e.g.
grep $'^ns:m\..*\t.*\tns:m\.'
See also the [fbranking.sh] script in the same directory. The $INCOMING_FILE needs to be copied to ‘indexing/resources/incoming_links.txt’.
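To make the output of the counting pipeline easier to picture, here is a minimal sketch that runs the same steps on a tiny, made-up sample. The helper name count_incoming, the sample triples and the ns:example.* properties are assumptions for illustration only. The tab character is passed to grep through a shell variable, which avoids the grep syntax difference mentioned above, and the -S option of sort is left out to keep the sketch portable.

```shell
#!/bin/sh
# Sketch only: count incoming links per entity from tab-separated lines on stdin.
# count_incoming is a hypothetical helper name, not part of the indexing tool.
TAB=$(printf '\t')

count_incoming() {
  # Keep statements whose subject and object are both ns:m.* entities,
  # take the object column, strip the trailing '.', then count and rank.
  grep "^ns:m\..*${TAB}.*${TAB}ns:m\." \
    | cut -f 3 \
    | sed 's/.$//' \
    | sort \
    | uniq -c \
    | sort -nr
}

# Made-up sample: three entity-to-entity links and one literal value.
{
  printf 'ns:m.01\tns:example.spouse\tns:m.02.\n'
  printf 'ns:m.03\tns:example.name\t"Alice"@en.\n'
  printf 'ns:m.03\tns:example.film\tns:m.02.\n'
  printf 'ns:m.02\tns:example.alias\tns:m.01.\n'
} | count_incoming
```

Each output line holds a count followed by an entity id, with the most referenced entity first. In this sample ns:m.02 has two incoming links and ns:m.01 has one; the literal statement is ignored.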
As of March 2013, some statements within the Freebase RDF dump were corrupted. In such cases you will encounter RiotExceptions (Jena RDF parser exceptions) while importing the dump to Jena TDB.
Luckily, Andy Seaborne has created a Perl script that is able to correct all those issues. You can download this script from
http://people.apache.org/~andy/Freebase20121223/
and use it to process the dump like
gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
NOTE that the entity score script shown above (the gunzip/grep pipeline) will no longer work on the fixed version, as ‘fixit’ replaces ‘\t’ with ' '. If you want to run it on the fixed dump, you will need to adapt the ‘grep’ part of the script (and the tab-delimited ‘cut’) accordingly.
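One possible adaptation for the fixed dump is sketched below. It assumes that the only relevant change made by ‘fixit’ is the tab-to-space substitution, that subjects and predicates contain no spaces, and that the trailing ‘.’ is still attached to the object; none of this has been checked against real ‘fixit’ output. The helper name count_incoming_fixed and the sample lines are hypothetical.

```shell
#!/bin/sh
# Sketch only: variant of the counting pipeline for space-separated statements.
# Assumes 'fixit' only turned tabs into single spaces (unverified assumption).
count_incoming_fixed() {
  # [^ ]* keeps subject and predicate to exactly one space-free field each,
  # so literals that contain spaces cannot shift the object column.
  grep "^ns:m\.[^ ]* [^ ]* ns:m\." \
    | cut -d ' ' -f 3 \
    | sed 's/.$//' \
    | sort \
    | uniq -c \
    | sort -nr
}

# Made-up sample in the assumed space-separated layout.
{
  printf 'ns:m.01 ns:example.spouse ns:m.02.\n'
  printf 'ns:m.03 ns:example.name "Alice Smith"@en.\n'
  printf 'ns:m.03 ns:example.film ns:m.02.\n'
} | count_incoming_fixed
```

If the fixed dump separates the final ‘.’ from the object with a space, the sed step would need to be dropped or changed.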
The tool comes with a default configuration that will:
Configuration Files
This section provides information on the configuration files in the ‘indexing/config’ folder.
indexing.properties: Main configuration for the indexing process. It defines the components used and their configurations. Except for users who want to add or remove components (e.g. EntityProcessors), there is usually no need to make any changes to this file.
indexingsource.properties: This is the configuration for the Jena TDB indexing source. This configuration is important, as it defines the subset of RDF triples that will be imported from the massive Freebase RDF dump file, which contains more than 1,300 million RDF triples. Reducing the number of imported triples can considerably reduce the indexing time. By default, two ‘import-filter’ entries are configured:
mapping.ldpath: This file uses LDPath to transform information provided by Freebase. NOTE that with the currently used LDPath version, full URIs need to be used for Freebase properties, as the parser does not support ‘{ns}:{localname}’ for '{localname}'s that contain ‘.’.
mapping.txt: This defines the properties included in the generated index. In addition, it is used for data type transformation and for copying fields. While those things could also be done using LDPath, it is more efficient to use this configuration for them.
entityTypes.properties: This allows indexing only Entities of specific types. By default only Freebase topics are indexed (as those are similar to what Entities are in Apache Stanbol). However, this can also be used to index only specific types (e.g. Persons, Organizations and Places).
fieldboosts.properties: Contains index time boosts for specific fields. By default labels are boosted over alternate labels and comments.
namespaceprefix.mappings: Defines extra namespace prefixes used within the indexing configuration. The Stanbol default mappings are always present. If the host has internet connectivity, mappings from prefix.cc will be loaded as well. It is important that this file maps the prefix ‘ns’ to the Freebase namespace.
minincoming.properties: The minimum number of incoming links an entity must have within the Freebase knowledge base to be indexed. Higher values will reduce the number of indexed Entities. ‘1’ will include all Entities.
iditerator.properties: Ensures that the ‘indexing/resources/incoming_links.txt’ file is correctly processed. No need to change this file.
scorerange.properties: This file ensures that the rankings of Entities (based on the number of incoming links) are correctly mapped to the [0..1] range. No need to change this file.
freebase/**: This is the Solr Core configuration used for indexing.
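Before choosing a value for minincoming.properties, it can help to estimate how many Entities would remain for a given threshold. The sketch below assumes that the incoming links file has the "count id" layout produced by ‘uniq -c’ in the counting pipeline; the helper name count_indexable is hypothetical and not part of the tool.

```shell
#!/bin/sh
# Sketch only: count the entities whose incoming link count is >= a threshold.
# Assumes each line of the file looks like "<count> <entity id>" (uniq -c output).
count_indexable() {
  min=$1    # candidate minimum number of incoming links
  file=$2   # path to the incoming links file
  awk -v min="$min" '$1 + 0 >= min + 0 { n++ } END { print n + 0 }' "$file"
}

# Example call (the path is the one used in this document):
#   count_indexable 2 indexing/resources/incoming_links.txt
```

Comparing the result for a few thresholds gives a rough idea of how much a higher value shrinks the index.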
After the indexing completes, the distribution folder
/indexing/dist
will contain two files:
freebase.solrindex.zip
: This is the ZIP archive with the indexed data. This file will be requested by the Apache Stanbol Data File Provider after installing the Bundle described below. To install the data you need to copy this file to the “/stanbol/datafiles” folder within the working directory of your Stanbol Server.
org.apache.stanbol.data.site.freebase-{version}.jar
: This is a Bundle that can be installed to any OSGi environment running the Apache Stanbol Entityhub. This can be done by using the Apache Felix Webconsole or by copying the bundle to the ‘{stanbol-working-dir}/stanbol/fileinstall’ folder.
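The two copy steps can be sketched as a small shell helper. The function name install_freebase_index and its arguments are hypothetical; the target folders are the ones named above. Creating the folders with ‘mkdir -p’ is an added assumption, since a running Stanbol instance normally provides them already.

```shell
#!/bin/sh
# Sketch only: copy both indexing results into a Stanbol working directory.
# install_freebase_index is a hypothetical helper, not shipped with Stanbol.
install_freebase_index() {
  dist_dir=$1      # the distribution folder, e.g. indexing/dist
  stanbol_dir=$2   # the {stanbol-working-dir}

  # Assumption: create the target folders if they do not exist yet.
  mkdir -p "$stanbol_dir/stanbol/datafiles" "$stanbol_dir/stanbol/fileinstall" || return 1

  # The Solr index archive is picked up by the Data File Provider.
  cp "$dist_dir/freebase.solrindex.zip" "$stanbol_dir/stanbol/datafiles/" || return 1

  # The bundle is installed by dropping it into the fileinstall folder.
  cp "$dist_dir"/org.apache.stanbol.data.site.freebase-*.jar \
     "$stanbol_dir/stanbol/fileinstall/" || return 1
}

# Example call (both paths are placeholders):
#   install_freebase_index indexing/dist /path/to/stanbol-working-dir
```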