ctakes-dictionary-lookup/README - ctakes - Git at Google


 Contents
 - Introduction
 - Implementation overview
 - Sample dictionaries included
 - Creating your own dictionaries
 - Configuring the annotator to use your dictionary
 - Running the analysis engine (annotator)
 	- AggregateAE.xml
 	- DictionaryLookupannotator.xml
 	- DictionaryLookupannotatorCSV.xml
 	- DictionaryLookupannotatorDB.xml

 ############
 Introduction
 ############

 The dictionary lookup annotator finds the entries from one or more dictionaries that match the
 document text in some way.  Within this annotator, these matches are called lookup hits.

 The dictionary lookup annotator is very customizable.

 It can look for matches where the words in the dictionary entries appear in the same order as
 the words in the document text, or it can look for permutations of the words from the dictionary.

 It can look just for exact matches of the words, or it can also look for matches to the
 canonical forms of the words.

 Searches for a lookup hit are limited to within windows, where the window type is defined in
 the LookupDescriptorFile. A window can be the words that fall within the same Sentence, the same
 Chunk, the same LookupWindowAnnotation or any other annotation.  See the clinical documents
 pipeline project for an example of an analysis engine (LookupWindowAnnotator.xml) that
 that creates LookupWindowAnnotations.

 Note: Dictionary entries need to have been tokenized the way the pipeline tokenizes the document text.


 #######################
 Implementation overview
 #######################

 The behavior of the dictionary lookup annotator is controlled by the parameters and resources
 defined in the analysis engine descriptor, and by the contents of the resource called the
 LookupDescriptorFile.

 For example, if the analysis engine descriptor "DictionaryLookup.xml" contains
 a resource named LookupDescriptorFile with value "lookup/LookupDesc.xml", then the parameter
 settings and resources named within "DictionaryLookup.xml", together with the values within
 "lookup/LookupDesc.xml" will control the actions of the dictionary lookup annotator.

 The lookupInitializer and lookupConsumer classes are specified within the LookupDescriptorFile.

 The algorithm used for looking up the terms is defined by the lookupInitializer, which
 creates the lookup hits.

 The lookupConsumer adds annotations to the CAS for some or all of the lookup hits.

 An example adding only some of the lookup hits to the CAS is if you have a dictionary of RxNorm
 terms with their RxNorm codes, and a dictionary of terms from the OrangeBook, and want to create
 annotations for those terms that are in the OrangeBook that also have an RxNorm code.

 This can be done using class org.apache.ctakes.dictionary.lookup.ae.FirstTokenPermLookupInitializerImpl
 as the lookupInitializer, and using class OrangeBookFilterConsumerImpl as the lookupConsumer,
 provided you have the RxNorm dictionary, and you configure the LookupDescriptorFile resource
 to use your RxNorm dictionary.


 Note: The analysis engine descriptors for this annotator use elements of type
 configurableDataResourceSpecifier.
 These cannot be modified from the Parameters or Resources tabs of the Component Descriptor
 Editor (at least not in UIMA 2.2). To view these values or edit them, use the Sources tab or
 open the descriptor with a Text Editor.


 Note: Dictionary entries need to have been tokenized the way the pipeline tokenizes the document text.

 For example, the lookup algorithm will not find a lookup hit if a dictionary entry is
 "ear, skin" but the document text contains the same text ("ear, skin") and the pipeline
 has tokenized that text as the 3 tokens "ear" "," "skin".

 To find a lookup hit for the 3 tokens, the dictionary entry should be tokenized,
 with a space before the comma: "ear , skin"

 To determine the LookupDescriptorFile for an analysis engine, open the analysis engine descriptor
 (e.g. DictionaryLookupannotator.xml) and note the URL for the LookupDescriptorFile resource
 (e.g. lookup/LookupDesc.xml).

 A LookupDescriptorFile such as lookup/LookupDesc.xml, found in resources/,
 defines the dictionary(s) used, and the classes that interact with the
 dictionary(s). The implementation tag identifies the type of dictionary:
 lucene index (luceneImpl), database (jdbcImpl), or delimited flatfile (csvImpl).
 See class org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.java for implementation details.


 To better understand the dictionary lookup annotator code you could start by reading the javadocs
 for the classes DictionaryLookupAnnotator.java and FirstTokenPermutationImpl.java.

 #######################################
 Sample dictionaries included
 #######################################

 This project includes two sample dictionaries.

 A sample database (a Lucene index) containing a few drug names is included
 for running the examples.

 A sample database (using 2 Lucene indexes) containing a few anatomical sites, procedures,
 and disorders/diseases is included for running the examples.

 The programs used to create these Lucene indexes are
 scripts/java/edu/mayo/bmi/dictionarytools/CreateLuceneIndexForExampleDrugs.java
 scripts/java/edu/mayo/bmi/dictionarytools/CreateLuceneIndexForSnomedLikeSample.java

 Note: To view the contents of a Lucene index, you could use a tool such as Luke.

 #######################################
 Creating your own dictionaries
 #######################################

 Note: To view the contents of a Lucene index, you could use a tool such as Luke.

 %%%%%%%%%%%%%%%%%%%%%
 Drugs

 To create a more complete dictionary of drug concepts, you could download a copy of the UMLS
 Metathesaurus and build upon the program mentioned above to create a lucene index of the
 complete RxNorm or another source of drug concepts.

 Or you could use a different program in that same package that reads from a pipe-delimited file:
 scripts/java/edu/mayo/bmi/dictionarytools/CreateLuceneIndexFromDelimitedFile

 The pipe-delimited file should contain lines in the following format.
 CUI|drug name aka description|terminology aka source|codeInThatSource|PreferredIndicator|TUI
 Where PreferredIndicator = P if the name is the preferred name for the drug.

 For example, if you want include terms from semantic type "Biomedical or Dental Material" (TUI T122),
 one line in the file you create should be:
 C1154185|Topical Spray|RXNORM|346165|P|T122

 The CreateLuceneIndexFromDelimitedFile class could then be used to create a lucene index from the
 data in the file.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Anatomical sites, procedures, signs/symptoms, diseases/disorders

 To create a more complete dictionary of anatomical sites, procedures, signs/symptoms,
 and/or diseases/disorders, you could download a copy of the UMLS Metathesaurus and build
 upon (add code to) the CreateLuceneIndexForSnomedLikeSample class to create 2 lucene
 indexes - one Lucene index for the concepts and their CUIs, and one that maps the codes
 from the source(s) to the CUIs.

 Alternatively you could create and populate a database with the following two tables
   umls_ms_2005 (or whatever name you specify within LookupDesc_Db.xml)
       with columns "fword"  "cui"  "tui"  "text"
   umls_snomed_map
       with columns "cui"  "code"


 ################################################
 Configuring the annotator to use your dictionary
 ################################################

 These steps are not necessary in order to run the pipeline with the very small sample dictionaries
 that are included with this project.

 If you created your own dictionary(s) as outlined above, here is how you could configure this
 annotator to use your dictionary(s).

 %%%%%%%%%%%%%%%%%%%%%
 Drugs

 If you created a lucene index directory called drug_index, within descriptor DictionaryLookupAnnotator.xml
 you could update the value of the IndexDirectory for external resource RxnormIndex to reference the location
 of your drug_index directory.  Recall you need to use a text editor or you need to be on tab Source
 to edit this portion of that descriptor since it is within a configurableDataResourceSpecifier.

 Alternatively, you could simply replace the contents of directory
 dictionary lookup/resources/lookup/drug_index
 with the contents of the lucene index directory you created.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Anatomical sites, procedures, signs/symptoms, diseases/disorders

 If you created 2 lucene index directories using CreateLuceneIndexForSnomedLikeSample.java,
 you could simply replace the contents of the two directories
 dictionary lookup/resources/lookup/snomed-like_sample
 dictionary lookup/resources/lookup/snomed-like_codes_sample
 with the contents of the lucene index directories you created.

 Alternatively, if you created database tables umls_snomed_map and umls_ms_2005 as
 outlined above, you could do the following steps

 1) Replace the use of DictionaryLookupAnnotator.xml with DictionaryLookupAnnotatorDB.xml
    in your pipeline (in your aggregate flow.  e.g. in AggregatePlaintextProcessor.xml)
    The class UmlsToSnomedDbConsumerImpl.java that is used in this case is included with
    this distribution.

 2) Update some values within DictionaryLookupAnnotatorDB.xml for your environment
     2a) Username
     2b) Password
     2c) DriverClassName
     2d) URL


 #######################################
 Running the analysis engine (annotator)
 #######################################

 %%%%%%%%%%%%%%%%%%%%%%%%%%
 AggregateAE.xml

 The file desc/analysis_engine/AggregateAE.xml defines a "pipeline" for
 testing the installation of the PEAR file for this annotator.

 This descriptor is typically not used in a more complete pipeline - one of the
 DictionaryLookupannotator*.xml descriptors is normally included in a more complete pipeline.


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 DictionaryLookupAnnotator.xml

 Open this descriptor in the Component Descriptor editor, and view the Resources tab
 to see that there are three Lucene indexes used by this analysis engine.
 Unlike most resources, these cannot be edited from this tab.  Open the Source tab
 to view the configurationParameterSettings.  There are three IndexDirectory entries,
 whose values give the path to the directory for the lucene index.


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 DictionaryLookupannotatorCSV.xml

 This is an example of how to use a dictionary contained in a delimited file rather
 than a database or a lucene index.
 This is only recommended for small dictionaries.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 DictionaryLookupannotatorDB.xml

 This is a skeleton of how you could use a dictionary contained in a database that
 can be accessed via a jdbc driver instead of using a lucene index or flatfile.

	Contents
	- Introduction
	- Implementation overview
	- Sample dictionaries included
	- Creating your own dictionaries
	- Configuring the annotator to use your dictionary
	- Running the analysis engine (annotator)
	- AggregateAE.xml
	- DictionaryLookupannotator.xml
	- DictionaryLookupannotatorCSV.xml
	- DictionaryLookupannotatorDB.xml

	############
	Introduction
	############

	The dictionary lookup annotator finds the entries from one or more dictionaries that match the
	document text in some way. Within this annotator, these matches are called lookup hits.

	The dictionary lookup annotator is very customizable.

	It can look for matches where the words in the dictionary entries appear in the same order as
	the words in the document text, or it can look for permutations of the words from the dictionary.

	It can look just for exact matches of the words, or it can also look for matches to the
	canonical forms of the words.

	Searches for a lookup hit are limited to within windows, where the window type is defined in
	the LookupDescriptorFile. A window can be the words that fall within the same Sentence, the same
	Chunk, the same LookupWindowAnnotation or any other annotation. See the clinical documents
	pipeline project for an example of an analysis engine (LookupWindowAnnotator.xml) that
	that creates LookupWindowAnnotations.

	Note: Dictionary entries need to have been tokenized the way the pipeline tokenizes the document text.


	#######################
	Implementation overview
	#######################

	The behavior of the dictionary lookup annotator is controlled by the parameters and resources
	defined in the analysis engine descriptor, and by the contents of the resource called the
	LookupDescriptorFile.

	For example, if the analysis engine descriptor "DictionaryLookup.xml" contains
	a resource named LookupDescriptorFile with value "lookup/LookupDesc.xml", then the parameter
	settings and resources named within "DictionaryLookup.xml", together with the values within
	"lookup/LookupDesc.xml" will control the actions of the dictionary lookup annotator.

	The lookupInitializer and lookupConsumer classes are specified within the LookupDescriptorFile.

	The algorithm used for looking up the terms is defined by the lookupInitializer, which
	creates the lookup hits.

	The lookupConsumer adds annotations to the CAS for some or all of the lookup hits.

	An example adding only some of the lookup hits to the CAS is if you have a dictionary of RxNorm
	terms with their RxNorm codes, and a dictionary of terms from the OrangeBook, and want to create
	annotations for those terms that are in the OrangeBook that also have an RxNorm code.

	This can be done using class org.apache.ctakes.dictionary.lookup.ae.FirstTokenPermLookupInitializerImpl
	as the lookupInitializer, and using class OrangeBookFilterConsumerImpl as the lookupConsumer,
	provided you have the RxNorm dictionary, and you configure the LookupDescriptorFile resource
	to use your RxNorm dictionary.


	Note: The analysis engine descriptors for this annotator use elements of type
	configurableDataResourceSpecifier.
	These cannot be modified from the Parameters or Resources tabs of the Component Descriptor
	Editor (at least not in UIMA 2.2). To view these values or edit them, use the Sources tab or
	open the descriptor with a Text Editor.


	Note: Dictionary entries need to have been tokenized the way the pipeline tokenizes the document text.

	For example, the lookup algorithm will not find a lookup hit if a dictionary entry is
	"ear, skin" but the document text contains the same text ("ear, skin") and the pipeline
	has tokenized that text as the 3 tokens "ear" "," "skin".

	To find a lookup hit for the 3 tokens, the dictionary entry should be tokenized,
	with a space before the comma: "ear , skin"

	To determine the LookupDescriptorFile for an analysis engine, open the analysis engine descriptor
	(e.g. DictionaryLookupannotator.xml) and note the URL for the LookupDescriptorFile resource
	(e.g. lookup/LookupDesc.xml).

	A LookupDescriptorFile such as lookup/LookupDesc.xml, found in resources/,
	defines the dictionary(s) used, and the classes that interact with the
	dictionary(s). The implementation tag identifies the type of dictionary:
	lucene index (luceneImpl), database (jdbcImpl), or delimited flatfile (csvImpl).
	See class org.apache.ctakes.dictionary.lookup.ae.LookupParseUtilities.java for implementation details.


	To better understand the dictionary lookup annotator code you could start by reading the javadocs
	for the classes DictionaryLookupAnnotator.java and FirstTokenPermutationImpl.java.

	#######################################
	Sample dictionaries included
	#######################################

	This project includes two sample dictionaries.

	A sample database (a Lucene index) containing a few drug names is included
	for running the examples.

	A sample database (using 2 Lucene indexes) containing a few anatomical sites, procedures,
	and disorders/diseases is included for running the examples.

	The programs used to create these Lucene indexes are
	scripts/java/edu/mayo/bmi/dictionarytools/CreateLuceneIndexForExampleDrugs.java
	scripts/java/edu/mayo/bmi/dictionarytools/CreateLuceneIndexForSnomedLikeSample.java

	Note: To view the contents of a Lucene index, you could use a tool such as Luke.

	#######################################
	Creating your own dictionaries
	#######################################

	Note: To view the contents of a Lucene index, you could use a tool such as Luke.

	%%%%%%%%%%%%%%%%%%%%%
	Drugs

	To create a more complete dictionary of drug concepts, you could download a copy of the UMLS
	Metathesaurus and build upon the program mentioned above to create a lucene index of the
	complete RxNorm or another source of drug concepts.

	Or you could use a different program in that same package that reads from a pipe-delimited file:
	scripts/java/edu/mayo/bmi/dictionarytools/CreateLuceneIndexFromDelimitedFile

	The pipe-delimited file should contain lines in the following format.
	CUI\|drug name aka description\|terminology aka source\|codeInThatSource\|PreferredIndicator\|TUI
	Where PreferredIndicator = P if the name is the preferred name for the drug.

	For example, if you want include terms from semantic type "Biomedical or Dental Material" (TUI T122),
	one line in the file you create should be:
	C1154185\|Topical Spray\|RXNORM\|346165\|P\|T122

	The CreateLuceneIndexFromDelimitedFile class could then be used to create a lucene index from the
	data in the file.

	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	Anatomical sites, procedures, signs/symptoms, diseases/disorders

	To create a more complete dictionary of anatomical sites, procedures, signs/symptoms,
	and/or diseases/disorders, you could download a copy of the UMLS Metathesaurus and build
	upon (add code to) the CreateLuceneIndexForSnomedLikeSample class to create 2 lucene
	indexes - one Lucene index for the concepts and their CUIs, and one that maps the codes
	from the source(s) to the CUIs.

	Alternatively you could create and populate a database with the following two tables
	umls_ms_2005 (or whatever name you specify within LookupDesc_Db.xml)
	with columns "fword" "cui" "tui" "text"
	umls_snomed_map
	with columns "cui" "code"


	################################################
	Configuring the annotator to use your dictionary
	################################################

	These steps are not necessary in order to run the pipeline with the very small sample dictionaries
	that are included with this project.

	If you created your own dictionary(s) as outlined above, here is how you could configure this
	annotator to use your dictionary(s).

	%%%%%%%%%%%%%%%%%%%%%
	Drugs

	If you created a lucene index directory called drug_index, within descriptor DictionaryLookupAnnotator.xml
	you could update the value of the IndexDirectory for external resource RxnormIndex to reference the location
	of your drug_index directory. Recall you need to use a text editor or you need to be on tab Source
	to edit this portion of that descriptor since it is within a configurableDataResourceSpecifier.

	Alternatively, you could simply replace the contents of directory
	dictionary lookup/resources/lookup/drug_index
	with the contents of the lucene index directory you created.

	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	Anatomical sites, procedures, signs/symptoms, diseases/disorders

	If you created 2 lucene index directories using CreateLuceneIndexForSnomedLikeSample.java,
	you could simply replace the contents of the two directories
	dictionary lookup/resources/lookup/snomed-like_sample
	dictionary lookup/resources/lookup/snomed-like_codes_sample
	with the contents of the lucene index directories you created.

	Alternatively, if you created database tables umls_snomed_map and umls_ms_2005 as
	outlined above, you could do the following steps

	1) Replace the use of DictionaryLookupAnnotator.xml with DictionaryLookupAnnotatorDB.xml
	in your pipeline (in your aggregate flow. e.g. in AggregatePlaintextProcessor.xml)
	The class UmlsToSnomedDbConsumerImpl.java that is used in this case is included with
	this distribution.

	2) Update some values within DictionaryLookupAnnotatorDB.xml for your environment
	2a) Username
	2b) Password
	2c) DriverClassName
	2d) URL


	#######################################
	Running the analysis engine (annotator)
	#######################################

	%%%%%%%%%%%%%%%%%%%%%%%%%%
	AggregateAE.xml

	The file desc/analysis_engine/AggregateAE.xml defines a "pipeline" for
	testing the installation of the PEAR file for this annotator.

	This descriptor is typically not used in a more complete pipeline - one of the
	DictionaryLookupannotator*.xml descriptors is normally included in a more complete pipeline.


	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	DictionaryLookupAnnotator.xml

	Open this descriptor in the Component Descriptor editor, and view the Resources tab
	to see that there are three Lucene indexes used by this analysis engine.
	Unlike most resources, these cannot be edited from this tab. Open the Source tab
	to view the configurationParameterSettings. There are three IndexDirectory entries,
	whose values give the path to the directory for the lucene index.


	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	DictionaryLookupannotatorCSV.xml

	This is an example of how to use a dictionary contained in a delimited file rather
	than a database or a lucene index.
	This is only recommended for small dictionaries.

	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	DictionaryLookupannotatorDB.xml

	This is a skeleton of how you could use a dictionary contained in a database that
	can be accessed via a jdbc driver instead of using a lucene index or flatfile.