ctakes-pos-tagger/README

 Contents
 - Listing of README's in this project
 - Introduction
 - Building a model
 - Building a tag dictionary
 - Running the POSTagger analysis engine
 	- POSTagger.xml
 	- POSTaggerAggregate.xml
 	- POSTaggerCPE.xml
 - Evaluating a POS tagger


 ###################################
 Listing of README's in this project
 ###################################

  - data/pos/training/README - describes how to create training data for the
                               part-of-speech (POS) tagger.
  - resources/models/README - describes how the pos-tagging models were created
  - test/data/README - describes data used for the unit tests


 ############
 Introduction
 ############

 This project provides a UIMA wrapper around the popular OpenNLP part-of-speech
 tagger. The UIMA examples project provides a default wrapper from which we have
 borrowed liberally.  We have created our own wrapper so that it will work better
 with our type system and to add features and supporting components.
 Additionally, both the OpenNLP package and the UIMA examples OpenNLP wrappers
 lack documentation for how to do things like generate training data, build a
 part-of-speech tagging model, and build a tag dictionary.  The latter in
 particular can be very confusing if you are new to OpenNLP.  We have
 attempted to provide all of the necessary documentation here.

 A part-of-speech tagging model is included with this project.

 The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal) and anonymized
 clinical data per Safe Harbor HIPAA guidelines. Prior to model building, the clinical data was
 deidentified for patient names to preserve patient confidentiality. Any person name in the model
 will originate from non-patient data sources.


 ################
 Building a model
 ################

 If you wish to build your own part-of-speech model that works with the OpenNLP
 part-of-speech tagger you will need to follow these steps:
 1) obtain training data - see data/pos/training/README
 2) build a model using the training data - see resources/models/README


 #########################
 Building a tag dictionary
 #########################

 One thing that can be confusing about the OpenNLP part-of-speech tagger is that
 there are two data structures with similar sounding names - Dictionary and
 TagDictionary.  In short, the Dictionary construct is one that can and should
 be ignored while the TagDictionary is one that needs a bit of attention.

 A tag dictionary is used when tagging text, not during the training of a POS model.

 Unfortunately, OpenNLP does not provide a mechanism for creating a tag
 dictionary, so we have provided one.  It can be run with the following command:

 java org.apache.ctakes.postagger.TagDictionaryCreator <training-data>
                                                       <tag-dictionary>
                                                       <case-sensitive>

   where <training-data> is a file containing part-of-speech tagged training
                         data as described in data/pos/training/README
   where <tag-dictionary> is the file that will be created to hold the tag
                          dictionary
   where <case-sensitive> is either 'true' or 'false' depending on whether the
                          tag dictionary should be case sensitive or not.
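
 To make the idea of a tag dictionary concrete, here is a minimal sketch of
 the mapping it holds: each word is associated with the set of tags it was
 seen with in the training data.  This is an illustration only, not the code
 of TagDictionaryCreator.  It assumes training lines of word_TAG tokens like
 the examples later in this README (the authoritative format is described in
 data/pos/training/README), the function name is hypothetical, and the third
 training line is made up for the example.

```python
from collections import defaultdict


def build_tag_dictionary(lines, case_sensitive):
    """Map each word to the sorted list of tags it was seen with."""
    tags_by_word = defaultdict(set)
    for line in lines:
        for token in line.split():
            # Split on the last underscore so words containing '_' survive.
            word, _, tag = token.rpartition("_")
            if not word:
                continue  # skip malformed tokens that have no word part
            if not case_sensitive:
                # Assumption: a case insensitive dictionary stores lowercase keys.
                word = word.lower()
            tags_by_word[word].add(tag)
    return {word: sorted(tags) for word, tags in tags_by_word.items()}


training = [
    "The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._.",
    "This_DT complex_NN is_VBZ not_RB cyclosporin_JJ -sensitive_JJ ._.",
    "the_DT complex_JJ problem_NN ._.",  # made-up line for illustration
]
tag_dictionary = build_tag_dictionary(training, case_sensitive=False)
print(tag_dictionary["complex"])  # ['JJ', 'NN']
print(tag_dictionary["the"])      # ['DT']
```

 At tagging time such a dictionary restricts the tags that the tagger may
 assign to a known word, which is why it is not needed during training.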

 For relevant material about the difference between Dictionary and TagDictionary
 see the following:
 https://sourceforge.net/forum/forum.php?thread_id=1720863&forum_id=9943
 https://sourceforge.net/forum/forum.php?thread_id=1894043&forum_id=9943
 DefaultPOSContextGenerator.getContext(int, Object[], String[])

 OpenNLP provides a default tag dictionary, called tagdict, for its English
 part-of-speech model tag.bin.gz.  The tag dictionary can be downloaded from:
 http://opennlp.sourceforge.net/models/english/parser/tagdict
 You should use this tag dictionary only if you are using the model from:
 http://opennlp.sourceforge.net/models/english/parser/tag.bin.gz

 Be careful if you want to use the tag dictionary in a case insensitive way.
 When "CaseSensitive" is set to false, OpenNLP lowercases only the words that
 are compared against the dictionary; it does not lowercase the entries read
 in from the tag dictionary file.  As a result, any entry in the file that is
 not all lowercase will never match and is effectively ignored.  Therefore, if
 you want the tag dictionary to be used in a case insensitive way, be sure to
 build the tag dictionary using 'false' as the third argument (as described
 above).
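
 The pitfall described above can be sketched as follows.  This is a simplified
 model of the lookup behavior as this README describes it, not OpenNLP code;
 the function name and the dictionary entries are hypothetical.

```python
def lookup(tag_dictionary, word, case_sensitive):
    """Lowercase only the queried word, never the stored entries."""
    key = word if case_sensitive else word.lower()
    return tag_dictionary.get(key)


# A dictionary built case sensitively keeps the capitalized entry as-is.
mixed_case_entries = {"Tylenol": ["NNP"], "complex": ["JJ", "NN"]}
# A dictionary built with 'false' is assumed to hold only lowercase entries.
lowercase_entries = {"tylenol": ["NNP"], "complex": ["JJ", "NN"]}

print(lookup(mixed_case_entries, "Tylenol", case_sensitive=False))  # None
print(lookup(lowercase_entries, "Tylenol", case_sensitive=False))   # ['NNP']
```

 In the first call the query becomes "tylenol" while the stored entry is still
 "Tylenol", so the entry is never found.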


 #####################################
 Running the POSTagger analysis engine
 #####################################

 %%%%%%%%%%%%%
 POSTagger.xml

 The file desc/POSTagger.xml provides a descriptor for the POSTagger analysis
 engine which is the UIMA component we have written that wraps the OpenNLP
 part-of-speech tagger.  Open this file using the Component Descriptor Editor as
 described in the tutorial.  Click on the tab labeled "Overview" to observe that
 the class called by this descriptor is "org.apache.ctakes.postagger.POSTagger".
 Click on the tab labeled "Parameter Settings" to view the parameters required
 by the POSTagger component.  The descriptor file does not document the
 parameters because they are documented in the api javadocs for the class
 org.apache.ctakes.postagger.POSTagger.  Please consult that documentation for
 additional details.  The parameters are:
  - PosModelFile - the file that contains the part-of-speech tagging model
  - TagDictionary - the file that contains the tag dictionary (if available)
  - CaseSensitive - determines whether to use the TagDictionary in a case
 					sensitive way or not.

 %%%%%%%%%%%%%%%%%%%%%%
 POSTaggerAggregate.xml

 The descriptor desc/POSTaggerAggregate.xml defines a pipeline for part-of-speech
 tagging that creates all the necessary inputs (e.g. token and sentence
 annotations).

 Open it using the Component Descriptor Editor as described in the tutorial.
 Click on the tab labeled "Overview" to observe that the engine type is
 "Aggregate".
 Click on the tab labeled "Aggregate" to see the components that need to be run
 before the POSTagger can run.
 Click on the tab labeled "Parameter Settings" to see that the same three
 parameters described for POSTagger.xml need to be set here.  If you set these
 parameters to acceptable values, you can open and run
 desc/POSTaggerAggregate.xml using the CAS Visual Debugger.

 %%%%%%%%%%%%%%%%
 POSTaggerCPE.xml

 The file desc/POSTaggerCPE.xml provides an XML specification of a collection
 processing engine (CPE) which can be opened, edited, and run using the UIMA
 CPE GUI as described in the tutorial.  Open this file using the UIMA CPE GUI
 and set the parameters for the collection reader to point to a local collection
 of files that you want part-of-speech tagged.  Set the parameters for the
 POSTagger as appropriate for your environment and, finally, set the output
 directory of the XCAS Writer CAS Consumer.

 The results of running the pipeline are written to the output directory
 as XCAS files.

 These files can be viewed in the CAS Visual Debugger.

 #######################
 Evaluating a POS tagger
 #######################

 There are two ways a POS tagger should be evaluated:

 1) Use gold standard tokens

 Run the POS tagger using gold standard tokens and calculate the percentage of
 part-of-speech labels that have been correctly assigned.

 If this is the gold standard sentence:
 The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._.

 And if this is the output for that sentence:
 The_DT major_JJ inducible_NN protein_NN complex_NN that_WDT binds_VBD ._.

 The accuracy should be 6/8 = 75%.
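
 The calculation above can be sketched as follows.  The function name is
 hypothetical, and the sketch assumes that both sentences contain the same
 tokens in the same order, which holds when gold standard tokens are used.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    # The tag is whatever follows the last underscore of each token.
    gold_tags = [token.rpartition("_")[2] for token in gold.split()]
    predicted_tags = [token.rpartition("_")[2] for token in predicted.split()]
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sentences must have the same tokens")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


gold = "The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._."
output = "The_DT major_JJ inducible_NN protein_NN complex_NN that_WDT binds_VBD ._."
print(tagging_accuracy(gold, output))  # 0.75
```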

 2) Use tokenizer generated tokens

 Run the tokenizer and use its output as input to the POS tagger.  In this scenario, we calculate
 f-measure in the following way:

 true positive - a token that has the correct boundary and part-of-speech label
 false positive - a tagged token that does not have the correct boundary and/or
                  part-of-speech label
 false negative - a token in the gold standard data that was not correctly generated by
                  the tokenizer/POS tagger

 If this is the gold standard sentence:
 This_DT complex_NN is_VBZ not_RB cyclosporin_JJ -sensitive_JJ ._.

 And if this is the output for that sentence:
 This_DT complex_JJ is_VBZ not_RB cyclosporin-sensitive_JJ ._.

 true positives = 4
 false positives = 2
 false negatives = 3

 f-measure = (2 * recall * precision) / (precision + recall)
           = (2*TP) / (2*TP + FP + FN)

 f-measure = (2*4) / (2*4 + 2 + 3) = 8 / 13 = .615

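 In fact, if you do the evaluation this way for the "gold standard tokens" evaluation, then
 you will get the same answer as the accuracy calculation given above.  With gold standard
 tokens every boundary is correct, so each mislabeled token counts once as a false positive
 and once as a false negative.  For the first example TP = 6, FP = 2 and FN = 2, which gives
 (2*6) / (2*6 + 2 + 2) = 12 / 16 = .75.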
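
 The f-measure calculation can be sketched as follows.  This is one possible
 way to do the bookkeeping, not the evaluation code used by this project, and
 the function names are hypothetical.  Token boundaries are represented as
 character offsets counted over the non-whitespace characters of the sentence,
 so that "cyclosporin" + "-sensitive" and "cyclosporin-sensitive" cover the
 same characters even though they are tokenized differently.

```python
def tagged_spans(sentence):
    """Return the set of (start, end, tag) triples for a tagged sentence."""
    spans = set()
    offset = 0
    for token in sentence.split():
        word, _, tag = token.rpartition("_")
        # Offsets ignore whitespace so that different tokenizations align.
        spans.add((offset, offset + len(word), tag))
        offset += len(word)
    return spans


def f_measure(gold, predicted):
    """Return (TP, FP, FN, f-measure) for a gold and a predicted sentence."""
    gold_spans = tagged_spans(gold)
    predicted_spans = tagged_spans(predicted)
    tp = len(gold_spans & predicted_spans)  # right boundary and right tag
    fp = len(predicted_spans - gold_spans)  # produced but not in the gold data
    fn = len(gold_spans - predicted_spans)  # in the gold data but not produced
    denominator = 2 * tp + fp + fn
    score = (2 * tp) / denominator if denominator else 0.0
    return tp, fp, fn, score


gold = "This_DT complex_NN is_VBZ not_RB cyclosporin_JJ -sensitive_JJ ._."
output = "This_DT complex_JJ is_VBZ not_RB cyclosporin-sensitive_JJ ._."
tp, fp, fn, score = f_measure(gold, output)
print(tp, fp, fn, round(score, 3))  # 4 2 3 0.615
```

 Applying the same function to the gold standard tokens example from the
 first evaluation method gives TP = 6, FP = 2, FN = 2 and an f-measure of
 0.75, the same value as the accuracy.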