With STANBOL-835 this tool was fully ported to the Entityhub Indexing Tool. Please also consult the documentation of that tool, as this document covers only the geonames.org details.
Build the utility:
mvn install
mvn assembly:single
After this completes you will find the runnable jar at
target/org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar
It is strongly recommended to copy this file to a dedicated folder used for indexing. Within this folder you then need to call
java -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar init
This will initialize the indexing directory based on the default configuration included in the tool.
This chapter only covers geonames-specific details. Users who are new to the Entityhub Indexing Tool should also have a look at the documentation provided with the genericrdf indexing tool.
The geonames.org indexing tool provides its own indexing source that operates on the database dump files provided by geonames.org. Users who want to index all Geonames.org entities should use the allCountries archive as the source. However, Geonames.org also provides country-specific files as well as files containing only cities with a population above a given threshold.
Users who want to use several files as the indexing source should create a separate folder in the resources directory (the “./indexing/resources” folder) and add all dump files they want to index to that folder. If the indexing source configuration points to that folder, all files within it will be indexed.
The following example shows a configuration of the indexing source within the indexing.properties file that assumes that the “dump” folder has been created. The “dump” folder can contain as many geonames archives as needed (e.g. DE.zip, AT.zip, CH.zip and cities15000.zip).
entityDataIterable=org.apache.stanbol.entityhub.indexing.geonames.GeonamesIndexingSource,source:allCountries.zip
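The line above points the source at a single archive. For the multi-file setup described earlier, the following is a minimal sketch of how the archives could be collected in a “dump” folder and how the matching configuration line might look. The function names stage_dumps and source_line are hypothetical helpers, and the value “source:dump” (a folder name relative to the resources directory) is an assumption derived from the description above, not something verified against the tool.

```shell
#!/bin/sh
# Sketch only: the folder name "dump" and the "source:dump" value are
# assumptions based on the text above, not verified against the tool.

# stage_dumps TARGET_DIR ARCHIVE...
# Copies the given geonames.org archives into TARGET_DIR, creating it if needed.
# Archives that do not exist are reported on stderr and skipped.
stage_dumps() {
    target_dir=$1
    shift
    mkdir -p "$target_dir" || return 1
    for archive in "$@"; do
        if [ -f "$archive" ]; then
            cp "$archive" "$target_dir/"
        else
            echo "skipping missing archive: $archive" >&2
        fi
    done
}

# source_line FOLDER
# Prints an entityDataIterable line whose source points at FOLDER.
source_line() {
    printf 'entityDataIterable=%s,source:%s\n' \
        'org.apache.stanbol.entityhub.indexing.geonames.GeonamesIndexingSource' "$1"
}

# Example (not executed here):
#   stage_dumps ./indexing/resources/dump DE.zip AT.zip CH.zip cities15000.zip
#   source_line dump   # prints the line to place in indexing.properties
```

Keeping the archives in their own folder means the set of indexed files can be changed by adding or removing archives, without touching the configuration again.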
Alternate labels are provided by alternateNames.zip. This means that those labels are not available from the Geonames IndexingSource. Instead, they are added during the entity processing step by the AlternateLabelProcessor.
To use this EntityProcessor, users need to add it to the list of EntityProcessors configured in the indexing.properties file. It is activated by the default configuration of the tool.
By default, the AlternateLabelProcessor expects alternateNames.zip to be present in the Resource Directory (./indexing/resources).
Geonames.org defines two different sources of hierarchies: (1) the administrative regions and (2) the hierarchy.zip. For details please see the Geonames Dump Readme file.
As this information is not part of the geonames.org main table, it is not provided by the Geonames IndexingSource but added by the HierarchyProcessor. This processor consumes the data files for both hierarchy sources.
By default all of these files are expected in the Resource directory (./indexing/resources). File names and locations can be adapted via the configuration provided in the indexing.properties file.
To start the indexing process, make sure that all the required files are in the Resource Folder (./indexing/resources). After that, call the tool with
java -Xmx4g -server -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar index
The 4 GByte of memory is required because the hierarchy and the alternate labels are loaded into memory for the indexing process. If you do not use those EntityProcessors, the memory footprint should be less than 500 MByte.
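Because an indexing run can take a long time, it may be worth checking that the expected files are in place before starting it. The following is a minimal sketch of such a check. The function name missing_resources is hypothetical, and the file list in the commented example uses only the archive names mentioned in this document. Adapt it to the files your configuration actually references.

```shell
#!/bin/sh
# Sketch only: which files are required depends on your indexing.properties.

# missing_resources RESOURCE_DIR FILE...
# Prints every FILE that is not found in RESOURCE_DIR.
# Returns 0 if all files are present and 1 if at least one is missing.
missing_resources() {
    resource_dir=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -e "$resource_dir/$name" ]; then
            echo "$name"
            status=1
        fi
    done
    return $status
}

# Example (not executed here): only start indexing if nothing is missing.
#   if missing_resources ./indexing/resources allCountries.zip alternateNames.zip hierarchy.zip; then
#       java -Xmx4g -server -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar index
#   fi
```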