| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| # Indexing utility for the [geonames.org](http://www.geonames.org) dataset. |
| |
| With [STANBOL-835](https://issues.apache.org/jira/browse/STANBOL-835) this tool was fully ported to the Entityhub Indexing Tool. Please also consult the documentation of that tool, as this document covers only the geonames.org-specific details. |
| |
| |
| ## Building and Indexing |
| |
| Build the utility: |
| |
| mvn install |
| mvn assembly:single |
| |
| After this completes, you will find the runnable jar at |
| |
| target/org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar |
| |
| It is strongly recommended to copy this file into a dedicated folder used for indexing. Within this folder you then need to call |
| |
| java -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar init |
| |
| This will initialize the indexing directory based on the default configuration included in the tool. |
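
The copy step described above can be sketched as a small shell function. The helper name `prepare_indexing_dir` and both directory arguments are assumptions made for illustration; only the jar name pattern and the `init` argument come from this document.

```shell
# prepare_indexing_dir (hypothetical helper, not part of the tool):
# copy the assembled jar from the build output into a dedicated folder.
prepare_indexing_dir() {
    target_dir="$1"   # assumed: folder the build wrote the jar to, e.g. ./target
    work_dir="$2"     # assumed: dedicated folder used for indexing
    mkdir -p "$work_dir" || return 1
    for jar in "$target_dir"/org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar; do
        # If the glob matched nothing, the unexpanded pattern is left in $jar.
        [ -f "$jar" ] || { echo "no jar found in $target_dir" >&2; return 1; }
        cp "$jar" "$work_dir"/ || return 1
    done
}

# Usage (not run here):
#   prepare_indexing_dir target "$HOME/geonames-index"
#   cd "$HOME/geonames-index"
#   java -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar init
```

Keeping the jar in its own folder means the `indexing` directory created by `init` does not end up inside the build tree.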
| |
| ## Configuration of the Tool |
| |
| This chapter covers only the geonames.org-specific configuration. Users who are new to the Entityhub Indexing Tool should also have a look at the documentation provided with the genericrdf indexing tool. |
| |
| ### Geonames IndexingSource |
| |
| The geonames.org indexing tool provides its own indexing source that operates on the database dump files provided by geonames.org. Users who want to index all geonames.org entities will want to use the [allCountries](http://download.geonames.org/export/dump/allCountries.zip) archive as the source. However, geonames.org also provides country-specific files as well as files containing only cities with a population above a given threshold. |
| |
| Users who want to use several files as the indexing source should create a separate folder in the resources directory (the "./indexing/resources" folder) and add all dumps they want to index to that folder. If the indexing source configuration points to that folder, all files within it will be indexed. |
| |
| The following example shows a configuration of the indexing source within the indexing.properties file. As written, it points to the single allCountries.zip archive; to index a "dump" folder created as described above, set the source parameter to the folder name instead. The "dump" folder can contain as many geonames.org archives as needed (e.g. [DE.zip](http://download.geonames.org/export/dump/DE.zip), [AT.zip](http://download.geonames.org/export/dump/AT.zip), [CH.zip](http://download.geonames.org/export/dump/CH.zip) and [cities15000.zip](http://download.geonames.org/export/dump/cities15000.zip)). |
| |
| entityDataIterable=org.apache.stanbol.entityhub.indexing.geonames.GeonamesIndexingSource,source:allCountries.zip |
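
Populating a folder with several archives might look like the sketch below. The helper `dump_urls` is hypothetical; the base URL and the archive names are the ones linked in this document. The download itself appears only as a commented usage line and assumes `curl` is installed.

```shell
# Base URL of the geonames.org dump files, as linked in this document.
GEONAMES_DUMP_BASE="http://download.geonames.org/export/dump"

# dump_urls (hypothetical helper): print one download URL per archive name.
dump_urls() {
    for name in "$@"; do
        printf '%s/%s\n' "$GEONAMES_DUMP_BASE" "$name"
    done
}

# Usage (not run here; assumes curl is available):
#   mkdir -p indexing/resources/dump
#   dump_urls DE.zip AT.zip CH.zip cities15000.zip |
#       while read -r url; do (cd indexing/resources/dump && curl -O "$url"); done
```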
| |
| ### Support for alternate labels |
| |
| Alternate labels are provided by the [alternateNames.zip](http://download.geonames.org/export/dump/alternateNames.zip) archive. This means that those labels are not available from the Geonames IndexingSource. Instead, they are added during the entity processing step by the AlternateLabelProcessor. |
| |
| To use this EntityProcessor, users need to add it to the list of EntityProcessors configured in the indexing.properties file. It is activated by the default configuration of the tool. |
| |
| By default, the AlternateLabelProcessor expects [alternateNames.zip](http://download.geonames.org/export/dump/alternateNames.zip) to be present in the Resource Directory (./indexing/resources). |
| |
| ### Support for Hierarchy |
| |
| Geonames.org defines two different sources of hierarchies: (1) the administrative regions and (2) the [hierarchy.zip](http://download.geonames.org/export/dump/hierarchy.zip) file. For details, please see the [Geonames Dump Readme file](http://download.geonames.org/export/dump/readme.txt). |
| |
| As this information is not part of the geonames.org main table, it is not provided by the Geonames IndexingSource but added by the HierarchyProcessor. This processor consumes the following data: |
| |
| * [hierarchy.zip](http://download.geonames.org/export/dump/hierarchy.zip) |
| * [countryInfo.txt](http://download.geonames.org/export/dump/countryInfo.txt) |
| * [admin1CodesASCII.txt](http://download.geonames.org/export/dump/admin1CodesASCII.txt) |
| * [admin2Codes.txt](http://download.geonames.org/export/dump/admin2Codes.txt) |
| |
| By default, all those files are expected in the Resource Directory (./indexing/resources). File names and locations can be changed in the indexing.properties file. |
| |
| ## Indexing |
| |
| To start the indexing process, make sure that all required files are in the Resource Directory (./indexing/resources). After that, call the tool with |
| |
| java -Xmx4g -server -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar index |
| |
| The 4 GB of memory is required because the hierarchy and alternate labels are loaded into memory for the indexing process. If you do not use those EntityProcessors, the memory footprint should be less than 500 MB. |
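
Because the index run is long, a pre-flight check along the following lines can catch missing files early. This is only a sketch: `missing_resources` is a hypothetical helper, and the file list assumes the default configuration with both the AlternateLabelProcessor and the HierarchyProcessor active and with unchanged file names. The main dump file is not checked, since its name depends on the configured source.

```shell
# missing_resources (hypothetical helper): print each expected file that is
# absent from the given resource directory, one name per line.
missing_resources() {
    res_dir="$1"
    for f in alternateNames.zip hierarchy.zip countryInfo.txt \
             admin1CodesASCII.txt admin2Codes.txt; do
        [ -f "$res_dir/$f" ] || echo "$f"
    done
}

# Usage (not run here):
#   missing=$(missing_resources ./indexing/resources)
#   if [ -z "$missing" ]; then
#       java -Xmx4g -server -jar org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar index
#   else
#       echo "missing resources: $missing" >&2
#   fi
```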
| |
| ## Advanced Options |
| |
| * __LDPath:__ The GeonamesIndexingSource does not support LDPath. Because of that, the LdpathSourceProcessor cannot be used. Users who want to use LDPath programs with a path length > 1 can, however, use the LdPathPostProcessor. This uses the IndexingDestination (SolrYard) as both source and target in the post-processing phase of the Indexing Tool. The indexing.properties file contains some examples of this. Users might also want to look at the examples of the genericrdf indexing tool. |
| * __FieldMappings:__ The default configuration uses some simple mapping rules. The UTF-8 labels of geonames.org are copied to rdfs:label, and the geonames:parentFeature relation is used to store links to all parent features (transitive closure over the hierarchy). |