 Contents
 - Introduction
 - Analysis engines (annotators)
 	- CopyAnnotator.xml
 	- NullAnnotator.xml
 	- OverlapAnnotator.xml
 	- SentenceDetectorAnnotator.xml
 	- SimpleSegmentAnnotator.xml
 	- TokenizerAnnotator.xml


 ############
 Introduction
 ############

 This project contains several annotators, including:
 	- a sentence detector annotator (a wrapper around the OpenNLP sentence detector)
 	- a tokenizer
 	- an annotator that does not update the CAS in any way, which is useful when the UIMA CPE GUI requires
 	  you to specify an analysis engine but you don't want one to do any work.
 	- an annotator that creates a single Segment annotation encompassing the entire document text, for
 	  plaintext documents, which have no section (aka segment) tags.

 Of particular interest: end-of-line characters are treated as end-of-sentence markers.

 A sentence detector model is included with this project.

 The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal), and clinical data
 anonymized per HIPAA Safe Harbor guidelines. Before the model was built, patient names were removed
 from the clinical data to preserve patient confidentiality, so any person name in the model
 originates from a non-patient data source.


 #############################
 Analysis engines (annotators)
 #############################

 %%%%%%%%%%%%%%%%%
 CopyAnnotator.xml

 This is a utility annotator that copies data from an existing JCas object into a new JCas object.

 %%%%%%%%%%%%%%%%%
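 NullAnnotator.xml

 An annotator that does not update the CAS in any way. It is useful when the UIMA CPE GUI requires you to
 specify an analysis engine but you don't want one to do any work.
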
 %%%%%%%%%%%%%%%%%%%%
 OverlapAnnotator.xml

 When two annotations overlap, this annotator either modifies one of them (its begin and end offsets)
 or deletes one or both of them. The action taken depends on the configuration parameters.

 It can extend an annotation to encompass overlapping annotations.

 It can be configured to delete annotations of type A that are subsumed by other annotations of type A
 if you only want the longest annotations of the given type to be kept.

 See the javadoc for org.apache.ctakes.core.ae.OverlapAnnotator for more details.
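
 The "keep only the longest annotations" behavior described above can be pictured with a small
 standalone sketch. This is not the OverlapAnnotator implementation: the class SubsumedSpanFilter,
 the Span type, and the method keepLongest are hypothetical names used only for illustration, and
 plain character offsets stand in for UIMA annotations of a single type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real logic lives in org.apache.ctakes.core.ae.OverlapAnnotator.
final class SubsumedSpanFilter {

    /** A [begin, end) character span standing in for an annotation of type A. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** True if this span fully covers other and is not identical to it. */
        boolean subsumes(Span other) {
            boolean covers = begin <= other.begin && other.end <= end;
            boolean identical = begin == other.begin && end == other.end;
            return covers && !identical;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /** Returns the spans that are not subsumed by any other span in the list. */
    static List<Span> keepLongest(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : spans) {
            boolean subsumed = false;
            for (Span other : spans) {
                if (other.subsumes(candidate)) {
                    subsumed = true;
                    break;
                }
            }
            if (!subsumed) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 10));  // longest span: kept
        spans.add(new Span(2, 5));   // lies inside [0,10): dropped
        spans.add(new Span(8, 14));  // overlaps [0,10) but is not subsumed: kept
        System.out.println(keepLongest(spans)); // prints [[0,10), [8,14)]
    }
}
```

 In this sketch, spans that merely overlap are both kept, and so are exact duplicates. How the real
 annotator treats those cases depends on its configuration parameters.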


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 SentenceDetectorAnnotator.xml

 A wrapper around the OpenNLP sentence detector that creates Sentence annotations based on the
 location of end-of-line characters and on the output of the OpenNLP sentence detector.
 This annotator treats an end-of-line character as an end-of-sentence marker.
 It can optionally skip certain sections of the document.

 Parameters:
   SegmentsToSkip - (optional) the list of sections for which no Sentence annotations are created.

 Resources:
   MaxentModelFile - (required) the Maxent model used by the sentence detector.
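
 To illustrate the end-of-line rule only, here is a minimal standalone sketch that cuts a document
 into line spans. It is not the annotator's code: the class LineBoundarySketch and the method
 lineSpans are hypothetical, and the OpenNLP model step that would further split each line into
 sentences is left out on purpose.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "end-of-line ends a sentence" rule; no OpenNLP model is involved.
final class LineBoundarySketch {

    /** Returns {begin, end} offsets (end exclusive) of each non-empty line in text. */
    static List<int[]> lineSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            // Treat '\n' and '\r' as boundaries, so "\r\n" yields an empty span that is skipped.
            if (atEnd || text.charAt(i) == '\n' || text.charAt(i) == '\r') {
                if (i > start) {
                    spans.add(new int[] {start, i});
                }
                start = i + 1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String note = "Patient denies pain\nNo fever. No cough.\r\nFollow up in 2 weeks";
        for (int[] span : lineSpans(note)) {
            System.out.println(span[0] + "-" + span[1] + ": " + note.substring(span[0], span[1]));
        }
    }
}
```

 The first line of the sample has no closing period yet still ends at the line break. The second
 line holds two sentences, which this sketch leaves in one span; separating them is the job of the
 sentence detector model.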


 %%%%%%%%%%%%%%%%%%%%%%%%%%
 SimpleSegmentAnnotator.xml

 Creates a single Segment annotation, encompassing the entire document.
 Use it ahead of annotators that require a Segment annotation when the pipeline does not
 contain a different annotator that creates Segment annotations.
 This annotator is used for plaintext files, but not for CDA documents, because the
 CdaCasInitializer annotator already creates Segment annotations for those.

 Parameters:
   SegmentID - (optional) the identifier to use for the Segment annotation created.
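
 As a rough picture of what "a single Segment encompassing the entire document" means in terms of
 offsets, consider the standalone sketch below. The class WholeDocumentSegmentSketch, its Segment
 type, and the fallback identifier "WHOLE_DOCUMENT" are assumptions made for illustration; the
 identifier the real annotator uses when SegmentID is not set is not stated here.

```java
// Hypothetical sketch; it does not use the UIMA or cTAKES type system.
final class WholeDocumentSegmentSketch {

    // Placeholder identifier for this sketch only, not the annotator's actual default.
    static final String FALLBACK_ID = "WHOLE_DOCUMENT";

    /** A segment identified by id and covering [begin, end) of the document text. */
    static final class Segment {
        final String id;
        final int begin;
        final int end;

        Segment(String id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Builds one segment spanning the whole text, using segmentId when one is given. */
    static Segment wholeDocument(String text, String segmentId) {
        String id = (segmentId == null || segmentId.isEmpty()) ? FALLBACK_ID : segmentId;
        return new Segment(id, 0, text.length());
    }

    public static void main(String[] args) {
        String text = "Plain text note without section tags.";
        Segment segment = wholeDocument(text, null);
        System.out.println(segment.id + " " + segment.begin + "-" + segment.end);
    }
}
```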

 %%%%%%%%%%%%%%%%%%%%%%
 TokenizerAnnotator.xml

 Tokenizes text.
 See classes org.apache.ctakes.core.ae.TokenizerAnnotatorPTB and org.apache.ctakes.core.nlp.tokenizer.Tokenizer
 for implementation details.

 Parameters:
   SegmentsToSkip - (optional) the list of sections for which no token annotations are created.
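
 The effect of SegmentsToSkip can be sketched as a filter applied before tokenizing. Everything in
 the example is an assumption for illustration: the class SkipSegmentsSketch, the Segment type, and
 the segment identifiers "HISTORY" and "MEDS" are made up, and whitespace splitting merely stands in
 for the real tokenizer rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; whitespace splitting is a stand-in for the actual tokenizer.
final class SkipSegmentsSketch {

    /** A document section with an identifier and its text. */
    static final class Segment {
        final String id;
        final String text;

        Segment(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Tokenizes every segment whose id is not listed in segmentsToSkip. */
    static List<String> tokenize(List<Segment> segments, Set<String> segmentsToSkip) {
        List<String> tokens = new ArrayList<>();
        for (Segment segment : segments) {
            if (segmentsToSkip.contains(segment.id)) {
                continue; // no tokens are created for skipped sections
            }
            for (String token : segment.text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment("HISTORY", "No known allergies"),
                new Segment("MEDS", "aspirin 81 mg daily"));
        Set<String> skip = new HashSet<>(Arrays.asList("MEDS"));
        System.out.println(tokenize(segments, skip)); // prints [No, known, allergies]
    }
}
```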
