NOTE: The remaining part of this document provides detailed information about this demo and explains how to customize it further for specific needs. Users who only want to use this demo do not need to read this part.
The configuration used for indexing can be found at
./src/main/indexing/config
It consists of the following parts:
The default SolrCore configuration used by the Apache Entityhub is contained in the SolrYard module and can be found here. This configuration is used if no customized configuration is present in “{indexing-root}/indexing/config/{name}”, where {name} refers to the value of the “name” property in the “indexing.properties” file.
Users who want or need to customize the SolrCore configuration should start with the default configuration: extract this zip file to “{indexing-root}/indexing/config” and then rename the folder to the “name” configured in “indexing.properties”. After that you can start to customize the configuration of the SolrCore used for indexing.
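The manual steps above could also be scripted. The following is a minimal sketch in Python, assuming that the default configuration archive unpacks to exactly one top-level folder. The function name `install_core_config` and its arguments are hypothetical and not part of the indexing tool.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_core_config(default_zip, config_dir, name):
    """Extract the default SolrCore configuration and rename it to `name`.

    Assumption: the archive contains exactly one top-level folder.
    """
    config_dir = Path(config_dir)
    target = config_dir / name
    if target.exists():
        raise FileExistsError(f"{target} already exists; not overwriting it")
    config_dir.mkdir(parents=True, exist_ok=True)
    # Extract into a scratch folder first so a failure leaves no half-copied core.
    with tempfile.TemporaryDirectory(dir=config_dir) as tmp:
        with zipfile.ZipFile(default_zip) as archive:
            archive.extractall(tmp)
        extracted = [p for p in Path(tmp).iterdir() if p.is_dir()]
        if len(extracted) != 1:
            raise ValueError("expected exactly one top-level folder in the archive")
        # Rename the extracted folder to the "name" from indexing.properties.
        shutil.move(str(extracted[0]), str(target))
    return target
```

The returned folder is the one whose `conf/schema.xml` would then be edited.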
This demo uses this procedure to define two special Solr field types for indexing labels and IDs (see ./src/main/indexing/config/ehealth/conf/schema.xml).

    :::xml
    <!-- intended to be used for labels of drugs -->
    <fieldType name="label" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            generateWordParts="1" generateNumberParts="0"
            splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="0" preserveOriginal="0" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- Field Type used for searching Drugs based on their various IDs -->
    <fieldType name="code_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            generateWordParts="1" generateNumberParts="0"
            splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="0" preserveOriginal="0" />
      </analyzer>
    </fieldType>

For more information on the tokenizers and filters used by this configuration, please see the Solr “Analyzers, Tokenizers, and Token Filters” documentation.
Such field types are then applied to specific properties with the following configurations
The defined field names must include the prefixes used by the Apache Entityhub to represent RDF types. In this case ‘@’ refers to a plain literal without a defined language and ‘/’ is used as separator between the prefix, property and postfix.
Such mappings are configured in the “mappings.txt” file in the “{indexing-root}/indexing/config” directory.
NOTE that for this demo the “mappings.txt” file is located at “./src/main/indexing/config/mappings.txt” and is copied by the “./indexing.sh” script to the “./target/indexing/indexing/config” folder. Users who want to modify the mappings should edit the mappings.txt file under “./src”!
While this demo defines a lot of mappings, many of them could be omitted because they only validate data types. Some of those data type mappings are shown below.

    diseasome:geneId | d=xsd:anyURI
    drugbank:creationDate | d=xsd:dateTime
    drugbank:patientInformationInsert | d=xsd:anyURI

Data type mappings are only needed if the dataset does not correctly specify the XSD datatype for literal values. Typically this happens for numbers that are stored as plain literals.
More important are field mappings such as the following mappings for SKOS preferred labels.

    drugbank:genericName > skos:prefLabel
    diseasome:name > skos:prefLabel
    dailymed:fullName > skos:prefLabel

This specific set of mappings allows searching for entities of the three different datasets by using one and the same property. This is extremely useful for finding those entities in text passed to the Enhancer, because one needs to configure only a single KeywordExtractionEngine instance to cover them all.
A similar configuration is used for the various IDs specified for drugs. Those are all mapped to the “skos:notation” field. This makes it easy to identify a drug regardless of which ID is known by the user or mentioned in a text. Here are those mappings.

    drugbank:ahfsCode | d=xsd:string > skos:notation
    drugbank:atcCode | d=xsd:string > skos:notation
    drugbank:dpdDrugIdNumber | d=xsd:string > skos:notation
    drugbank:pdbHomologyId | d=xsd:string > skos:notation
    drugbank:inchiKey | d=xsd:string > skos:notation
    drugbank:primaryAccessionNo | d=xsd:string > skos:notation
    drugbank:secondaryAccessionNumber | d=xsd:string > skos:notation

Note also the wildcard mappings for the namespaces used:

    dailymed:*
    drugbank:*
    diseasome:*
    sider:*

These ensure that all properties of those namespaces get indexed. They also ensure that even if a mapping like

    drugbank:genericName > skos:prefLabel

is defined, the property

    drugbank:genericName

will still be present in the indexed dataset. Without those wildcard mappings one would need to explicitly define both

    drugbank:genericName > skos:prefLabel
    drugbank:genericName

to get the same result.
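To make the interplay between wildcard mappings and rename mappings concrete, here is a small Python model of the behaviour described above. It is a simplified sketch, not the Entityhub field mapper: the functions `parse_mapping` and `apply_mappings` are hypothetical, an entity is assumed to be a dictionary from property to a list of values, and data type filters such as `| d=xsd:string` are parsed but not enforced.

```python
def parse_mapping(line):
    """Split 'source | filter > target' into (source, filter, target)."""
    source, _, target = line.partition(">")
    source, _, value_filter = source.partition("|")
    return source.strip(), value_filter.strip() or None, target.strip() or None


def apply_mappings(entity, mapping_lines):
    """Return the indexed view of `entity` (dict: property -> list of values)."""
    indexed = {}
    for source, _value_filter, target in map(parse_mapping, mapping_lines):
        for prop, values in entity.items():
            if source.endswith("*"):
                # Wildcard: match every property of the namespace.
                matches = prop.startswith(source[:-1])
            else:
                matches = prop == source
            if not matches:
                continue
            # Without a '>' target the property keeps its own name.
            bucket = indexed.setdefault(target or prop, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return indexed


entity = {"drugbank:genericName": ["Aspirin"]}
with_wildcard = apply_mappings(
    entity, ["drugbank:*", "drugbank:genericName > skos:prefLabel"])
without_wildcard = apply_mappings(
    entity, ["drugbank:genericName > skos:prefLabel"])
```

In this model `with_wildcard` contains both `drugbank:genericName` and `skos:prefLabel`, while `without_wildcard` contains only `skos:prefLabel`.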
While the default mapping language supports a lot of use cases for mapping, converting and filtering properties, it is by far not as capable as LDPath. Because of that, the indexing tool also supports using LDPath to process entities by means of the “LdpathProcessor”.
A typical configuration of this processor (in the “indexing.properties” file) would look like

    org.apache.stanbol.entityhub.indexing.core.processor.LdpathProcessor,ldpath:ldpath-mapping.txt,append:true;

This configuration says that the LDPath program is read from a file named “ldpath-mapping.txt” within the same directory and that the results of the transformation are appended to the indexed entity. If append is deactivated, the data of the passed entity are replaced by the results of the LDPath program.
A typical usage example of the LdpathProcessor is a type-specific mapping such as

    skos:prefLabel = .[rdf:type is diseasome:genes]/rdfs:label;

This specifies that only for entities of the type “diseasome:genes” the rdfs:label is mapped to skos:prefLabel.
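The effect of the `append` parameter, together with the type-specific mapping above, can be illustrated with a small Python model. This is only a sketch of the described behaviour, assuming that an entity is a dictionary from property to a list of values. It does not use or reproduce the LDPath library, and `gene_label_program` and `ldpath_process` are hypothetical names.

```python
def gene_label_program(entity):
    """Mimics: skos:prefLabel = .[rdf:type is diseasome:genes]/rdfs:label;"""
    if "diseasome:genes" in entity.get("rdf:type", []):
        return {"skos:prefLabel": list(entity.get("rdfs:label", []))}
    return {}  # the type test failed, so no values are selected


def ldpath_process(entity, program, append=True):
    """Apply `program` and either append to or replace the entity data."""
    results = program(entity)
    if not append:
        # append:false -> only the program results survive.
        return results
    merged = {prop: list(values) for prop, values in entity.items()}
    for prop, values in results.items():
        bucket = merged.setdefault(prop, [])
        for value in values:
            if value not in bucket:
                bucket.append(value)
    return merged


gene = {"rdf:type": ["diseasome:genes"], "rdfs:label": ["BRCA1"]}
appended = ldpath_process(gene, gene_label_program, append=True)
replaced = ldpath_process(gene, gene_label_program, append=False)
```

Here `appended` keeps `rdf:type` and `rdfs:label` and gains `skos:prefLabel`, whereas `replaced` holds only `skos:prefLabel`.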
NOTEs:
This demo indexes all four datasets in a single step. However, this is not required. With a simple trick it is possible to index different datasets with different indexing configurations into the same target. This section describes how this can be achieved and why users might want to do it.
This demo uses Solr as the target of the indexing process. In theory other targets would be possible, but Solr is currently the only available IndexingDestination implementation. The SolrIndex used to store the data is located at “{indexing-root}/indexing/destination/indexes/default/{name}”. If this directory does not already exist, the indexing tool initializes it based on the SolrCore configuration in “{indexing-root}/indexing/config/{name}”, or on the default SolrCore configuration if none is present. However, if it already exists, this core is used and the data of the current indexing process are added to the existing SolrCore.
Because of that, it is possible to subsequently add information from different datasets to the same SolrIndex. However, users need to know that if the different datasets contain the same entity (a resource with the same URI), the information of the second dataset will replace that of the first. Nonetheless, in the given demo this would allow creating separate configurations (e.g. mappings) for all four datasets while still ensuring that the indexed data are contained in the same SolrIndex.
This might be useful in situations where the same property (e.g. rdfs:label) is used by the different datasets in different ways. In that case one could create a mapping for dataset1 that maps rdfs:label > skos:prefLabel, and for dataset2 a mapping that maps rdfs:label > skos:altLabel.
Workflows like this can easily be implemented with shell scripts or by setting soft links in the file system.
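The consequences of indexing two datasets into the same index one after the other can be sketched in Python. This is a deliberately simplified model, not the indexing tool: the index is a plain dictionary keyed by URI, the hypothetical `index_dataset` function only renames properties, and the URIs and labels are made-up example data.

```python
def index_dataset(index, dataset, mapping):
    """Add `dataset` (URI -> {property: [values]}) to `index`.

    `mapping` renames properties, e.g. {"rdfs:label": "skos:prefLabel"}.
    """
    for uri, entity in dataset.items():
        mapped = {}
        for prop, values in entity.items():
            mapped.setdefault(mapping.get(prop, prop), []).extend(values)
        # An entity with an already indexed URI replaces the earlier data.
        index[uri] = mapped
    return index


index = {}
dataset1 = {
    "urn:example:a": {"rdfs:label": ["Aspirin"]},
    "urn:example:shared": {"rdfs:label": ["first"]},
}
dataset2 = {
    "urn:example:b": {"rdfs:label": ["ASA"]},
    "urn:example:shared": {"rdfs:label": ["second"]},
}
# Run 1 and run 2 use different mappings but write to the same index.
index_dataset(index, dataset1, {"rdfs:label": "skos:prefLabel"})
index_dataset(index, dataset2, {"rdfs:label": "skos:altLabel"})
```

After both runs the index holds three entities, and `urn:example:shared` carries only the data written by the second run.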
Often users will only be interested in specific Entities of a dataset (e.g. only in Drugs but not in drug interactions, genes, side effects …). In such cases Entity Filters can be used to specify what entities should be indexed and what entities can be safely ignored.
This can be achieved by using the “FieldValueFilter”, which is actually a special implementation of an EntityProcessor. It is included by default in the “indexing.properties” configuration, but it is deactivated by the default configuration in “entityTypes.properties”. Detailed information on how to configure this filter correctly is provided within the “entityTypes.properties” file. As an example, the following configuration would index only drugs (of all datasets), diseases and organizations. All other entities, such as sider:side_effects and dailymed:ingredients, would be skipped.

    field=rdf:type
    values=drugbank:drugs; dailymed:drugs; sider:drugs; tcm:Medicine; diseasome:diseases; dailymed:organization

FieldValueFilter supports only a single field/values combination, and entities are selected if they match at least one of the defined values. Users who need to filter on several fields and/or several sets of values can configure multiple instances. This is achieved by adding the “FieldValueFilter” multiple times as entityProcessor in the “indexing.properties” file, each time with a different config parameter. Here is an example of such a configuration

    entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:filter1;org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:filter2;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

Make sure that “{indexing-root}/indexing/config” contains both a “filter1.properties” and a “filter2.properties” file with the corresponding filter rules. Only entities that pass both filters will be indexed.
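The selection logic described here (at least one matching value within a filter, and every filter must pass) can be summarized in a few lines of Python. This is an illustrative sketch only: `passes_filter` and `passes_all` are hypothetical helpers, and the second filter on `ex:status` is an invented example, not a property of the demo datasets.

```python
def passes_filter(entity, field, values):
    """True if the entity has at least one of `values` for `field`."""
    return any(value in values for value in entity.get(field, []))


def passes_all(entity, filters):
    """True only if the entity passes every (field, values) filter."""
    return all(passes_filter(entity, field, values) for field, values in filters)


filters = [
    # filter1: keep drugs and diseases
    ("rdf:type", {"drugbank:drugs", "dailymed:drugs", "diseasome:diseases"}),
    # filter2: hypothetical second criterion
    ("ex:status", {"approved"}),
]

drug = {"rdf:type": ["drugbank:drugs"], "ex:status": ["approved"]}
side_effect = {"rdf:type": ["sider:side_effects"], "ex:status": ["approved"]}
draft_drug = {"rdf:type": ["drugbank:drugs"], "ex:status": ["draft"]}

kept = [e for e in (drug, side_effect, draft_drug) if passes_all(e, filters)]
```

Only `drug` ends up in `kept`: `side_effect` fails the first filter and `draft_drug` fails the second.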
This section assumes that this demo is running on an Apache Stanbol server (version 0.9.0-incubating or later). Readers who do not run their own server or have not yet installed this demo are encouraged to do so. If you do not want to do that, you can also use the Stanbol test server hosted by the IKS project. However, all the links used by this demo point to “http://localhost:8080”, so you will need to adapt the commands accordingly.
Sider, DrugBank and Dailymed are interlinked with each other but define quite different sets of properties. The following example shows how to collect information about a drug by following the “owl:sameAs” relations defined between Dailymed, Sider and DrugBank.

    name = dailymed:name;
    activeIngredient = dailymed:activeIngredient/rdfs:label;
    indication = dailymed:indication;
    dosage = dailymed:dosage;
    adverseReaction = dailymed:adverseReaction;
    warning = dailymed:boxedWarning;
    contraindication = dailymed:contraindication;
    sideEffect = (owl:sameAs)+/sider:sideEffect/rdfs:label;
    genericName = (owl:sameAs)+/drugbank:genericName;
    inchiKey = (owl:sameAs)+/drugbank:inchiKey;
    indication = (owl:sameAs)+/drugbank:indication;
    foodInteraction = (owl:sameAs)+/drugbank:foodInteraction;
    toxicity = (owl:sameAs)+/drugbank:toxicity;
    pharmacology = (owl:sameAs)+/drugbank:pharmacology;

Here LDPath is used to collect the interesting information. “(owl:sameAs)+” is used to build the transitive closure over the “owl:sameAs” properties. This LDPath program expects the context to be an entity of the type “dailymed:drugs”.
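What “(owl:sameAs)+” computes can be shown with a short Python sketch of a transitive closure. This is an illustration of the idea only, not how LDPath is implemented: the `same_as` dictionary, the identifiers in it and the two helper functions are assumptions made for this example.

```python
from collections import deque


def same_as_closure(start, same_as):
    """Return all entities reachable from `start` via one or more owl:sameAs links."""
    seen = set()
    queue = deque(same_as.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(same_as.get(node, ()))
    return seen


def collect(start, same_as, properties, prop):
    """Mimics a path like (owl:sameAs)+/prop for the context entity `start`."""
    values = []
    for node in sorted(same_as_closure(start, same_as)):
        values.extend(properties.get(node, {}).get(prop, []))
    return values


# Made-up example graph: dailymed -> sider -> drugbank
same_as = {"dailymed:1": ["sider:1"], "sider:1": ["drugbank:1"]}
properties = {"drugbank:1": {"drugbank:genericName": ["Aspirin"]}}
generic_names = collect("dailymed:1", same_as, properties, "drugbank:genericName")
```

Because the closure follows links of links, the DrugBank entity is found even though the Dailymed entity links only to Sider directly.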
LDPath statements like that can be used with the