| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| |
| # Default Indexing Tool for RDF |
| |
| This tool provides a default configuration for creating a Solr index of RDF |
| files (e.g. a SKOS export of a thesaurus or a set of FOAF files). |
| |
| ## Building |
| |
| If the tool was not already built as part of the Entityhub build, call |
| |
| mvn install |
| |
| to build the jar with all the dependencies used later for indexing. |
| |
| If the build succeeds, go to the /target directory and copy the |
| |
| org.apache.stanbol.entityhub.indexing.genericrdf-*.jar |
| |
| to the directory from which you would like to run the indexing. |
| |
| ## Indexing |
| |
| ### (1) Initialize the configuration |
| |
| The default configuration is initialized by calling |
| |
| java -jar org.apache.stanbol.entityhub.indexing.genericrdf-*.jar init |
| |
| This will create a sub-folder "indexing" in the current directory. |
| Within this folder all the |
| |
| * configurations (indexing/config) |
| * source files (indexing/resources) |
| * created files (indexing/destination) |
| * distribution files (indexing/distribution) |
| |
| will be located. |
| |
| ### (2) Adapt the configuration |
| |
| The configuration is located within the |
| |
| indexing/config |
| |
| directory. |
| |
| The indexer supports two indexing modes: |
| |
| 1. Iterate over the data and look up the scores for entities (default). |
| For this mode an "entityDataIterable" and an "entityScoreProvider" MUST BE |
| configured. If no entity scores are available, a default entityScoreProvider |
| that provides no entity scores can be configured. This mode is typically used |
| to index all entities of a dataset. |
| 2. Iterate over the entity IDs and scores and look up the data. For this mode an |
| "entityIdIterator" and an "entityDataProvider" MUST BE configured. This mode is |
| typically used if only a small subset of a large dataset is indexed. This might |
| be the case if entity scores are available and users only want to index, for |
| example, the 10000 most important entities, or if a dataset contains entities of |
| many different types but only entities of a specific type should be included |
| (e.g. Species in DBpedia). |
| |
| |
| The configuration of the mentioned components is contained in the main indexing |
| configuration file explained below. |
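The two modes can be told apart by which pair of properties is set. The following is a minimal sketch, assuming that "indexing.properties" uses the usual Java properties layout (one `key=value` per line, `#` for comments). The function names `has_property` and `indexing_mode` and the printed labels are hypothetical; they are not part of the indexing tool.

```shell
#!/bin/sh
# Succeeds if property $2 has a non-empty, uncommented value in file $1.
# Assumes plain "key=value" (or "key: value") lines without continuations.
has_property() {
    grep -Eq "^[[:space:]]*$2[[:space:]]*[=:][[:space:]]*[^[:space:]]" "$1"
}

# Prints which indexing mode the configuration file $1 appears to select.
# If both pairs are configured, the data-iteration mode is reported first.
indexing_mode() {
    file="$1"
    if has_property "$file" entityDataIterable &&
       has_property "$file" entityScoreProvider; then
        echo "iterate-data"
    elif has_property "$file" entityIdIterator &&
         has_property "$file" entityDataProvider; then
        echo "iterate-ids"
    else
        echo "incomplete"
        return 1
    fi
}
```

Calling `indexing_mode indexing/config/indexing.properties` before starting a long indexing run is one way to catch a half-configured pair early. The check only looks at property names, not at whether the configured classes exist.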
| |
| #### Main indexing configuration (indexing.properties) |
| |
| This file contains the main configuration for the indexing process. |
| |
| * the "name" property MUST BE set to the name of the referenced site to be created |
| by the indexing process |
| * the "entityDataIterable" is used to configure the component iterating over the |
| RDF data to be indexed. The "source" parameter refers to the directory that is |
| searched for the RDF files to be indexed. The RDF files can be compressed with |
| 'gz', 'bz2' or 'zip'. Loading multiple RDF files contained in a single ZIP |
| archive is also supported. |
| * the "entityScoreProvider" is used to provide the ranking for entities. A |
| typical example is the number of incoming links. Such rankings are typically |
| used to weight recommendations and sort result lists (e.g. for a query for |
| "Paris" it is much more likely that a user refers to Paris in France than to one |
| of the two places named Paris in Texas). If no rankings are available you should |
| use the "org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider". |
| * the "scoreNormalizer" is only useful in case entity scores are available. |
| This component is used to normalize rankings and, optionally, to filter out |
| entities with low rankings. |
| * the "entityProcessor" is used to process (map, convert, filter) information |
| of entities before indexing. The mapping configuration is provided in a separate |
| file (default "mappings.txt"). |
| * the "entityPostProcessor" is used to process already indexed entities in a |
| second iteration. This has the advantage that processors used in the |
| post-processing can assume that all raw data are already present within the |
| IndexingDestination. For this step the IndexingDestination is used as both |
| source and destination. |
| See also [STANBOL-591](https://issues.apache.org/jira/browse/STANBOL-591). |
| * Indexes need to provide the configurations used to store entities. The |
| "fieldConfiguration" property is used to specify this. Typically it is the same |
| mapping file as used for the "entityProcessor"; however, this is not a requirement. |
| * the "indexingDestination" property is used to configure the target for the |
| indexing. Currently there is only a single implementation that stores the indexed |
| data within a SolrYard. The "boosts" parameter can be used to boost (see Solr |
| Documentation for details) specific fields (typically labels) for full text |
| searches. |
| * all properties starting with "org.apache.stanbol.entityhub.site." are used for |
| the configuration of the referenced site. |
| |
| See also the documentation within the "indexing.properties" file for details. |
| |
| #### Mapping configuration (mappings.txt) |
| |
| Mappings are used for three different purposes: |
| |
| 1. During the indexing process by the "entityProcessor" to process the |
| information of each entity |
| 2. At runtime by the local Cache to process single Entities that are updated in the cache. |
| 3. At runtime by the Entityhub when importing an Entity from a referenced Site. |
| |
| The configurations for (1) and (2) are typically identical. For (3) one might |
| want to use a different configuration. The default configuration uses |
| the same configuration (mappings.txt) for (1) and (2) and no specific |
| configuration for (3). |
| |
| By default, mappings.txt already includes mappings for popular ontologies |
| such as Dublin Core, SKOS and FOAF. Domain-specific mappings can be added to |
| this configuration. |
| |
| #### Score Normalizer configuration |
| |
| The default configuration also provides example configurations for the |
| different score normalisers. However, by default they are not used. |
| |
| * "minscore.properties": Example of how to configure a minimum score for |
| Entities to be indexed |
| * "scorerange.properties": Example of how to normalise the maximum/minimum score |
| of Entities to the configured range. |
| |
| NOTE: |
| |
| * To use score normalisation, scores need to be provided for Entities. This means |
| an "entityScoreProvider" or an "entityIdIterator" needs to be configured |
| (indexing.properties). |
| * Multiple score normalisers can be used. The call order is determined by the |
| configuration of the "scoreNormalizer" property (indexing.properties). |
| |
| ### (3) Provide the RDF files to be indexed |
| |
| All sources for the indexing process need to be located within the |
| |
| indexing/resources |
| |
| directory. |
| |
| By default the RDF files need to be located within |
| |
| indexing/resources/rdfdata |
| |
| however this can be changed via the "source" parameter of the "entityDataIterable" |
| or "entityDataProvider" property in the main indexing configuration (indexing.properties). |
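Copying the input files into place can be scripted. The following is a minimal sketch, assuming it is run from the directory that contains the "indexing" folder. The function name `stage_rdf_files` is hypothetical, and the target directory is passed explicitly so that a changed "source" parameter can be accommodated.

```shell
#!/bin/sh
# Copies the given files into the target directory ($1), creating it if
# needed. Arguments that are not regular files are skipped with a warning.
stage_rdf_files() {
    target="$1"
    shift
    mkdir -p "$target" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp "$f" "$target/" || return 1
        else
            echo "skipping (not a file): $f" >&2
        fi
    done
}
```

For the default layout the call would look like `stage_rdf_files indexing/resources/rdfdata myDump.nt.bz2`.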
| |
| |
| Supported RDF files are: |
| |
| * RDF/XML (by using one of "rdf", "owl", "xml" as extension): Note that this |
| encoding is not well suited for importing large RDF datasets. |
| * N-Triples (by using "nt" as extension): This is the preferred format for |
| importing (especially large) RDF datasets. |
| * Turtle (by using "ttl" as extension) |
| * N3 (by using "n3" as extension) |
| * N-Quads (by using "nq" as extension): Note that all named graphs will be |
| imported into the same index. |
| * TriG (by using "trig" as extension) |
| |
| Supported compression formats are: |
| |
| * "gz" and "bz2" files: Use double file extensions to indicate both |
| the compression and the RDF file format (e.g. myDump.nt.bz2) |
| * "zip": For ZIP archives all files within the archive are treated separately. |
| That means that even if a ZIP archive contains multiple RDF files, all of them |
| will be imported. |
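The extension rules above can be restated as a small lookup. The following is a sketch that mirrors the two lists, not the indexer's actual detection code. It assumes lower-case extensions, and the function name `rdf_format` and the output wording are hypothetical.

```shell
#!/bin/sh
# Prints the RDF format and compression implied by a file name, following
# the extension lists above. Returns 1 for names that match neither list.
rdf_format() {
    name=$(basename "$1")
    compression=none
    # Strip a trailing compression extension first (double extensions).
    case "$name" in
        *.gz)  compression=gz;  name=${name%.gz} ;;
        *.bz2) compression=bz2; name=${name%.bz2} ;;
        *.zip) echo "ZIP archive (entries are treated separately)"; return 0 ;;
    esac
    # Map the remaining extension to the RDF serialization.
    case "$name" in
        *.rdf|*.owl|*.xml) format="RDF/XML" ;;
        *.nt)   format="N-Triples" ;;
        *.ttl)  format="Turtle" ;;
        *.n3)   format="N3" ;;
        *.nq)   format="N-Quads" ;;
        *.trig) format="TriG" ;;
        *) echo "unsupported: $1" >&2; return 1 ;;
    esac
    echo "$format ($compression)"
}
```

Running it over the contents of the source directory before indexing is a quick way to spot files whose names do not follow the expected pattern, for example a compressed dump saved as "myDump.bz2" without the "nt" part.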
| |
| ### (4) Create the Index |
| |
| java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*.jar index |
| |
| Note that calling the utility with the option -h will print the help. |
| |
| |
| ## Use the created index with the Entityhub |
| |
| After the indexing completes, the distribution folder |
| |
| /indexing/dist |
| |
| will contain two files: |
| |
| 1. org.apache.stanbol.data.site.{name}-{version}.jar: This is a Bundle that can |
| be installed to any OSGi environment running the Apache Stanbol Entityhub. When |
| started it will create and configure |
| |
| * a "ReferencedSite" accessible at "http://{host}/{root}/entityhub/site/{name}", |
| * a "Cache" used to connect the ReferencedSite with your data and |
| * a "SolrYard" that manages the data indexed by this utility. |
| |
| When installing this Bundle the Site will not yet work, because this Bundle |
| does not contain the indexed data but only the configuration for the Solr index. |
| |
| 2. {name}.solrindex.zip: This is the ZIP archive with the indexed data. This |
| file will be requested by the Apache Stanbol Data File Provider after installing |
| the Bundle described above. To install the data you need to copy this file to the |
| "/sling/datafiles" folder within the working directory of your Stanbol Server. |
| |
| If you copy the ZIP archive before installing the bundle, the data will be |
| picked up during the installation of the bundle automatically. If you provide |
| the file afterwards you will also need to restart the SolrYard installed by the |
| Bundle. |
| |
| {name} denotes the value you configured for the "name" property within the |
| "indexing.properties" file. |
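The copy step for the ZIP archive can be sketched as follows. The "sling/datafiles" path is taken from the description above. The function name `install_solr_index`, its two arguments, and the choice to create the datafiles folder if it is missing are assumptions made for this example.

```shell
#!/bin/sh
# Copies every *.solrindex.zip from the distribution folder ($1) into the
# "sling/datafiles" folder below the Stanbol working directory ($2).
# Returns non-zero if no archive was found or a copy failed.
install_solr_index() {
    dist_dir="$1"
    datafiles="$2/sling/datafiles"
    found=0
    # Assumption: create the folder if the server has not created it yet.
    mkdir -p "$datafiles" || return 1
    for zip in "$dist_dir"/*.solrindex.zip; do
        [ -f "$zip" ] || continue   # the glob matched nothing
        cp "$zip" "$datafiles/" || return 1
        echo "copied $(basename "$zip") to $datafiles"
        found=1
    done
    [ "$found" -eq 1 ]
}
```

The Bundle itself is not touched by this function; installing the jar into the OSGi environment remains a separate step.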
| |
| ### A note about blank nodes |
| |
| If your input data sets contain large numbers of blank nodes, you may find that |
| you have problems running out of heap space during indexing. This is because Jena |
| (like many semantic stores) keeps a store of blank nodes in core memory while |
| importing. Although the Entityhub does not support the use of blank nodes, |
| such data sets can still be indexed: convert the blank nodes to |
| named nodes and then index. There is a convenient tool packaged with Stanbol for |
| this purpose, called "Urify" (org.apache.stanbol.entityhub.indexing.Urify). |
| It is available in the runnable JAR file built by this indexer. To use it, put that |
| JAR on your classpath, and you can execute Urify, giving it a list of files to process. |
| Use the "-h" or "--help" flag to see options for Urify: |
| |
| java -Xmx1024m -cp org.apache.stanbol.entityhub.indexing.genericrdf-*.jar \ |
| org.apache.stanbol.entityhub.indexing.Urify --help |
| |
| |