# This module has been deprecated. It is much faster to precompute lexical variants for your dictionary.
Contents
- Introduction
- Description of resources
- lvg.properties
- LVG database
- Running the LVG annotator
- LvgAnnotator.xml
- AggregateAE.xml
############
Introduction
############
This annotator wraps the National Library of Medicine (NLM) SPECIALIST lexical tools.
See the cTAKES Wiki for the latest information about this annotator:
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES
Documentation for the SPECIALIST lexical tools is at:
https://lsg3.nlm.nih.gov/Specialist/Home/index.html
Documentation for Lvg and Norm can be found at:
https://lsg3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/web/index.html
This annotator generates a canonical form for each word and also generates a list of lemma
entries with Penn Treebank tags. These tags could be useful for a part of speech (POS) tagger.
However, for the OpenNLP POS tagger, cTAKES uses a tag dictionary rather than lemma information.
See the documentation for the POS tagger annotator.
########################
Description of resources
########################
%%%%%%%%%%%%%%%%
lvg.properties
%%%%%%%%%%%%%%%%
The LVG configuration file lvg.properties defines the location
and attributes of the LVG database and the JDBC driver used to access it.
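
As a rough illustration of how such a properties file can be read, the
sketch below parses a small sample with java.util.Properties. The key
names and values (DB_TYPE, DB_DRIVER, DB_NAME) are assumptions made for
this example only; check the lvg.properties shipped with this project for
the actual keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class LvgPropertiesSketch {
    // Hypothetical file content. Real key names and values may differ.
    static final String SAMPLE =
        "DB_TYPE=HSQLDB\n" +
        "DB_DRIVER=org.hsqldb.jdbcDriver\n" +
        "DB_NAME=lvgExample\n";

    // Parse key=value text the same way a .properties file is parsed.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println("database type: " + props.getProperty("DB_TYPE"));
        System.out.println("JDBC driver:   " + props.getProperty("DB_DRIVER"));
        System.out.println("database name: " + props.getProperty("DB_NAME"));
    }
}
```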
%%%%%%%%%%%%%%%%
LVG database
%%%%%%%%%%%%%%%%
The database engine used is HSQLDB.
The full LVG database available from the NLM is hundreds of megabytes. To keep this
project small, the database tables included with it contain only a small
subset of the rows.
#########################
Running the LVG annotator
#########################
%%%%%%%%%%%%%%%%
LvgAnnotator.xml
%%%%%%%%%%%%%%%%
The parameters are:
UseSegments - controls whether only certain sections will be annotated by this annotator
SegmentsToSkip - list of sections not to be processed by this annotator
UseCmdCache - controls whether to look up information in a cache before using norm
CmdCacheFileLocation - location of norm cache file
CmdCacheFrequencyCutoff -
ExclusionSet - words for which canonicalForm is never set and Lemma entries are never posted
XeroxTreebankMap - mapping of part of speech tags, used to convert POS tags from the lexical tools to Penn Treebank tags
PostLemmas - controls whether any lemma entries are posted to the CAS
UseLemmaCache - controls whether to look up lemma information in a cache before using lvg
LemmaCacheFileLocation - the location of the cache file
LemmaCacheFileFrequencyCutoff -
Note: as distributed, PostLemmas is set to false. This is done to reduce the size of the CAS.
Set PostLemmas to true to have org.apache.ctakes.typesystem.type.Lemma annotations added to the CAS.
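
The interaction between UseCmdCache and ExclusionSet described above can be
pictured with the following sketch. It is not the annotator's actual code:
the class name, the constructor, and the stand-in normalizer function are
hypothetical, and the real annotator calls norm rather than a lambda.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CacheFirstNormSketch {
    private final boolean useCache;                    // plays the role of UseCmdCache
    private final Map<String, String> cache;           // word -> canonical form
    private final Set<String> exclusions;              // plays the role of ExclusionSet
    private final Function<String, String> normalizer; // stand-in for a call to norm

    public CacheFirstNormSketch(boolean useCache,
                                Map<String, String> cache,
                                Set<String> exclusions,
                                Function<String, String> normalizer) {
        this.useCache = useCache;
        this.cache = cache;
        this.exclusions = exclusions;
        this.normalizer = normalizer;
    }

    /** Returns the canonical form, or null when the word is excluded. */
    public String canonicalForm(String word) {
        if (exclusions.contains(word)) {
            return null; // excluded words never get a canonical form
        }
        if (useCache) {
            String cached = cache.get(word);
            if (cached != null) {
                return cached; // cache hit: the normalizer is not called
            }
        }
        return normalizer.apply(word); // cache miss or cache disabled
    }

    public static void main(String[] args) {
        CacheFirstNormSketch sketch = new CacheFirstNormSketch(
            true,
            Map.of("running", "run"),
            Set.of("and"),
            w -> "norm:" + w); // fake normalizer so the fallback path is visible
        System.out.println(sketch.canonicalForm("running"));
        System.out.println(sketch.canonicalForm("dogs"));
        System.out.println(sketch.canonicalForm("and"));
    }
}
```

Checking the exclusion set before the cache keeps excluded words out of
both lookup paths. That ordering is a design choice of this sketch, not
something this README specifies.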