tree: 38a0ae775f4dc27837b708cca32714890467e93b [path history] [tgz]
  1. src/
  2. pom.xml
  3. README.md
entityhub/indexing/geonames/README.md

Indexing utility for the geonames.org dataset.

With STANBOL-835 this tool was fully ported to the Entityhub Indexing Tool. Please also consider the documentation of that tool as this will only cover geonames.org details.

Building and Indexing

Built the utility:

mvn install
mvn assembly:single

after this completes you will find the runable jar at

target/org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar

It is strongly recommended to copy this file in an dedicated folder used for indexing. Within this folder you need than to call

java -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar init

this will initialize the indexing directory based on the default configuration included in the tool.

Configuration of the Tool

This chapter only covers genomes specific stuff. Users that are new to the Entityhub Indexing Tool should also have a look at the documentation provided with the genericrdf indexing tool.

Geonames IndexingSource

The geonamames.org indexing tool provides an own indexing source that operates on the database dump file provided by geonames.org. Users that want to index all Geonames.org entities will want to use the allCountries archive as source. However Geonames.org also provide country specific as well as files only containing Cities with a population higher than x.

Users that do want to use several files as indexing source should create an own folder in the resources directory (the “./indexing/resources” folder) and add all sums they want to index to that folder. If the indexing source configuration points to that folder all files within that folder will be indexed.

The following example shows an configuration of the indexing source within the indexing.properties file that assumes that the “dump” folder created. The “dump” folder can contain as many genomes archives as needed (e.g. DE.zip, AT, CH and cities15000.zip.

entityDataIterable=org.apache.stanbol.entityhub.indexing.geonames.GeonamesIndexingSource,source:allCountries.zip

Support for alternate labels

Alternate labels are provided by the alternateNames.zip. That means that those labels are not available form the Geonames IndexingSource. Because of that those labels are added during the Entity processing step by the AlternateLabelProcessor.

To use this EntityProcessor users need to add it the the list of EntityProcessors as configured in the indexing.properties file. It is activated by the default configuration of the tool.

by default the AlternateLabelProcessor assumes the alternateNames.zip to be present in the Resource Directory (./indexing/resources)

Support for Hierarchy

Geonames.org defines different two sources of hierarchies: (1) via the administrative regions and (2) the hierarchy.zip. For details please see the Geonames Dump Readme file.

As this information are not part of the geonames.org main table those information are not provided by the Geonames IndexingSource but added by the HierarchyProcessor. This processor consumes the following data:

By default all those files are expected in the Resource directory (./indexing/resources). File names and location can be adapted by the configuration provided in the indexing.properties file.

Indexing

To start the indexing process make sure that all the required files are in the Resource Folder (./indexing/resources). After that you need to call the tool with

java -Xmx4g -server -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar index

The 4GByte of memory are required because hierarchy and alternate labels are loaded in-memory for the indexing process. If you do not use those EntityProcessors the memory footprint should be less than 500MByte.

Advanded Options

  • LDPath: The GeonamesIndexingSource does not support LDPath. Because of that the LdpathSourceProcessor can not be used. Users that want to use LDPath programs with a path length > 1 can however use the LdPathPostProcessor. This will use the IndexingDestination (SolrYard) as both source and target in the Post-Processing phase of the Indexing Tool. The indexing.properties files contains some examples for that. Users might also want to see the examples of the genericrdf indexing tool.
  • FieldMappings: The default configurations does use some simple mapping rules. The UTF-8 labels of genomes are copied to rdfs:label and the genomes:parentFeature relation is used to store links to all parent features (transitive closure over the hierarchy).