blob: d44a2f742f2ea906dcdff438dc787ef19d15c5b6 [file] [log] [blame]
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
[[ugr.ref.jcas]]
= JCas Reference
The CAS is a system for sharing data among annotators, consisting of data structures (definable at run time), sets of indexes over these data, metadata describing these, subjects of analysis, and a high performance serialization/deserialization mechanism.
JCas provides Java approach to accessing CAS data, and is based on using generated, specific Java classes for each CAS type.
Annotators process one CAS per call to their process method.
During processing, annotators can retrieve feature structures from the passed in CAS, add new ones, modify existing ones, and use and update CAS indexes.
Of course, an annotator can also use plain Java Objects in addition; but the data in the CAS is what is shared among annotators within an application.
All the facilities present in the APIs for the CAS are available when using the JCas APIs; indeed, you can use the getCas() method to get the corresponding CAS object from a JCas (and vice-versa). The JCas APIs often have helper methods that make using this interface more convenient for Java developers.
The data in the CAS are typed objects having fields.
JCas uses a set of generated Java classes (each corresponding to a particular CAS type) with "`getter`" and "`setter`" methods for the features, plus a constructor so new instances can be made.
The Java classes stores the data in the class instance.
Users can modify the JCas generated Java classes by adding fields to them; this allows arbitrary non-CAS data to also be represented within the JCas objects, as well; however, the non-CAS data stored in the JCas object instances cannot be shared with annotators using the plain CAS, unless special provision is made - see the chapter in the v3 user's guide on storing arbitrary Java objects in the CAS.
The JCas class Java source files are generated from XML type system descriptions.
The JCasGen utility does the work of generating the corresponding Java Class Model for the CAS types.
There are a variety of ways JCasGen can be run; these are described later.
You include the generated classes with your UIMA component, and you can publish these classes for others who might want to use your type system.
JCas classes are not required for all UIMA types.
Those types which don't have corresponding JCas classes use the nearest JCas class corresponding to a type in their superchain.
The specification of the type system in XML can be written using a conventional text editor, an XML editor, or using the Eclipse plug-in that supports editing UIMA descriptors.
Changes to the type system are done by changing the XML and regenerating the corresponding Java Class Models.
Of course, once you've published your type system for others to use, you should be careful that any changes you make don't adversely impact the users.
Additional features can be added to existing types without breaking other code.
A separate Java class is generated for each type; this type implements the CAS FeatureStructure interface, as well as having the special getters and setters for the included features.
The generated Java classes have methods (getters and setters) for the fields as defined in the XML type specification.
Descriptor comments are reflected in the generated Java code as Java-doc style comments.
[[ugr.ref.jcas.name_spaces]]
== Name Spaces
Full Type names consist of a "`namespace`" prefix dotted with a simple name.
Namespaces are used like packages to avoid collisions between types that are defined by different people at different times.
The namespace is used as the Java package name for generated Java files.
Type names used in the CAS correspond to the generated Java classes directly.
If the CAS name is com.myCompany.myProject.ExampleClass, the generated Java class is in the package com.myCompany.myProject, and the class is ExampleClass.
An exception to this rule is the built-in types starting with ``uima.cas ``and ``uima.tcas``; these names are mapped to Java packages named `org.apache.uima.jcas.cas` and ``org.apache.uima.jcas.tcas``.
[[ugr.ref.jcas.use_of_description]]
== XML description element
// <titleabbrev>Use of XML Description</titleabbrev>
Each XML type specification can have <description ... > tags.
The description for a type will be copied into the generated Java code, as a Javadoc style comment for the class.
When writing these descriptions in the XML type specification file, you might want to use html tags, as allowed in Javadocs.
If you use the Component Description Editor, you can write the html tags normally, for instance, "`<h1>My Title</h1>`".
The Component Descriptor Editor will take care of coverting the actual descriptor source so that it has the leading "`<`" character written as "`&lt;`", to avoid confusing the XML type specification.
For example, <p> would be written in the source of the descriptor as &lt;p>. Any characters used in the Javadoc comment must of course be from the character set allowed by the XML type specification.
These specifications often start with the line <?xml version="`1.0`" encoding="`UTF-8`" ?>, which means you can use any of the UTF-8 characters.
[[ugr.ref.jcas.mapping_built_ins]]
== Mapping built-in CAS types to Java types
The built-in primitive CAS types map to Java types as follows:
[source]
----
uima.cas.Boolean boolean
uima.cas.Byte byte
uima.cas.Short short
uima.cas.Integer int
uima.cas.Long long
uima.cas.Float float
uima.cas.Double double
uima.cas.String String
----
[[ugr.ref.jcas.augmenting_generated_code]]
== Augmenting the generated Java Code
The Java Class Models generated for each type can be augmented by the user.
Typical augmentations include adding additional (non-CAS) fields and methods, and import statements that might be needed to support these.
Commonly added methods include additional constructors (having different parameter signatures), and implementations of toString().
To augment the code, just edit the generated Java source code for the class named the same as the CAS type.
Here's an example of an additional method you might add; the various getter methods are retrieving values from the instance:
[source]
----
public String toString() { // for debugging
return "XsgParse "
+ getslotName() + ": "
+ getheadWord().getCoveredText()
+ " seqNo: " + getseqNo()
+ ", cAddr: " + id
+ ", size left mods: " + getlMods().size()
+ ", size right mods: " + getrMods().size();
}
----
[[ugr.ref.jcas.keeping_augmentations_when_regenerating]]
=== Keeping hand-coded augmentations when regenerating
If the type system specification changes, you have to re-run the JCasGen generator.
This will produce updated Java for the Class Models that capture the changed specification.
If you have previously augmented the source for these Java Class Models, your changes must be merged with the newly (re)generated Java source code for the Class Models.
This can be done by hand, or you can run the version of JCasGen that is integrated with Eclipse, and use automatic merging that is done using Eclipse's EMF plug-in.
You can obtain Eclipse and the needed EMF plug-in from http://www.eclipse.org/.
If you run the generator version that works without using Eclipse, it will not merge Java source changes you may have previously made; if you want them retained, you'll have to do the merging by hand.
The Java source merging will keep additional constructors, additional fields, and any changes you may have made to the readObject method (see below). Merging will _not_ delete classes in the target corresponding to deleted CAS types, which no longer are in the source – you should delete these by hand.
[WARNING]
====
The merging supports Java 1.4 syntactic constructs only.
JCasGen generates Java 1.4 code, so as long as any code you change here also sticks to only Java 1.4 constructs, the merge will work.
If you use Java 5 or later specific syntax or constructs, the merge operation will likely fail to merge properly.
====
[[ugr.ref.jcas.additional_constructors]]
=== Additional Constructors
Any additional constructors that you add must include the JCas argument.
The first line of your constructor is required to be
[source]
----
this(jcas); // run the standard constructor
----
where jcas is the passed in JCas reference.
If the type you're defining extends ``uima.tcas.Annotation``, JCasGen will automatically add a constructor which takes 2 additional parameters – the begin and end Java int values, and set the `uima.tcas.Annotation```begin`` and `end` fields.
Here's an example: If you're defining a type MyType which has a feature parent, you might make an additional constructor which has an additional argument of parent:
[source]
----
MyType(JCas jcas, MyType parent) {
this(jcas); // run the standard constructor
setParent(parent); // set the parent field from the parameter
}
----
[[ugr.ref.jcas.using_readobject]]
==== Using readObject
Fields defined by augmenting the Java Class Model to include additional fields represent data that exist for this class in Java, in a local JVM (Java Virtual Machine), but do not exist in the CAS when it is passed to other environments (for example, passing to a remote annotator).
A problem can arise when new instances are created, perhaps by the underlying system when it iterates over an index, which is: how to insure that any additional non-CAS fields are properly initialized.
To allow for arbitrary initialization at instance creation time, an initialization method in the Java Class Model, called readObject is used.
The generated default for this method is to do nothing, but it is one of the methods that you can modify –to do whatever initialization might be needed.
It is called with 0 parameters, during the constructor for the object, after the basic object fields have been set up.
It can refer to fields in the CAS using the getters and setters, and other fields in the Java object instance being initialized.
A pre-existing CAS feature structure could exist if a CAS was being passed to this annotator; in this case the JCas system calls the readObject method when creating the corresponding Java instance for the first time for the CAS feature structure.
This can happen at two points: when a new object is being returned from an iterator over a CAS index, or a getter method is getting a field for the first time whose value is a feature structure.
[[ugr.ref.jcas.modifying_generated_items]]
=== Modifying generated items
The following modifications, if made in generated items, will be preserved when regenerating.
The public/private etc.
flags associated with methods (getters and setters). You can change the default ("`public`") if needed.
"`final`" or "`abstract`" can be added to the type itself, with the usual semantics.
[[ugr.ref.jcas.merging_types_from_other_specs]]
== Merging types
// <titleabbrev>Merging Types</titleabbrev>
Type definitions are merged by the framework from all the components being run together.
[[ugr.ref.jcas.merging_types.aggregates_and_cpes]]
=== Aggregate AEs and CPEs as sources of types
When running aggregate AEs (Analysis Engines), or a set of AEs in a collection processing engine, the UIMA framework will build a merged type system (Note: this "`merge`" is merging types, not to be confused with merging Java source code, discussed above). This merged type system has all the types of every component used in the application.
In addition, application code can use UIMA Framework APIs to read and merge type descriptions, manually.
In most cases, each type system can have its own Java Class Models generated individually, perhaps at an earlier time, and the resulting class files (or .jar files containing these class files) can be put in the class path to enable JCas.
However, it is possible that there may be multiple definitions of the same CAS type, each of which might have different features defined.
In this case, the UIMA framework will create a merged type by accumulating all the defined features for a particular type into that type's type definition.
However, the JCas classes for these types are not automatically merged, which can create some issues for JCas users, as discussed in the next section.
[[ugr.ref.jcas.merging_types.jcasgen_support]]
=== JCasGen support for type merging
When there are multiple definitions of the same CAS type with different features defined, then xref:tools.adoc#ugr.tools.jcasgen[JCasGen] can be re-run on the merged type system, to create one set of JCas Class definitions for the merged types, which can then be shared by all the components.
This is typically done by the person who is assembling the Aggregate Analysis Engine or Collection Processing Engine.
The resulting merged Java Class Model will then contain get and set methods for the complete set of features.
These Java classes must then be made available in the class path, __replacing__ the pre-merge versions of the classes.
If hand-modifications were done to the pre-merge versions of the classes, these must be applied to the merged versions, as described in section <<ugr.ref.jcas.keeping_augmentations_when_regenerating>>, above.
If just one of the pre-merge versions had hand-modifications, the source for this hand-modified version can be put into the file system where the generated output will go, and the -merge option for JCasGen will automatically merge the hand-modifications with the generated code.
If _both_ pre-merged versions had hand-modifications, then these modifications must be manually merged.
An alternative to this is packaging the components as individual PEAR files, each with their own version of the JCas generated Classes.
The Framework can run PEAR files using the pear file descriptor, and supply each component with its particular version of the JCas generated class.
[[ugr.ref.jcas.impact_of_type_merging_on_composability]]
=== Type Merging impacts on Composability
The recommended approach in UIMA is to build and maintain type systems as separate components, which are imported by Annotators.
Using this approach, Type Merging does not occur because the Type System and its JCas classes are centrally managed and shared by the annotators.
If you do choose to create a JCas Annotator that relies on Type Merging (meaning that your annotator redefines a Type that is already in use elsewhere, and adds its own features), this can negatively impact the reusability of your annotator, unless your component is used as a PEAR file.
If not using PEAR file packaging isolation capability, whenever anyone wants to combine your annotator with another annotator that uses a different version of the same Type, they will need to be aware of all of the issues described in the previous section.
They will need to have the know-how to re-run JCasGen and appropriately set up their classpath to include the merged Java classes and to not include the pre-merge classes.
(To enable this, you should package these classes separately from other .jar files for your annotator, so that they can be more easily excluded.) And, if you have done hand-modifications to your JCas classes, the person assembling your annotator will need to properly merge those changes.
These issues significantly complicate the task of combining annotators, and will cause your annotator not to be as easily reusable as other UIMA annotators.
[[ugr.ref.jcas.documentannotation_issues]]
=== Adding Features to DocumentAnnotation
There is one built-in type, ``uima.tcas.DocumentAnnotation``, to which applications can add additional features.
(All other built-in types are "feature-final" and you cannot add additional features to them.) Frequently, additional features are added to `uima.tcas.DocumentAnnotation` to provide a place to store document-level metadata.
For the same reasons mentioned in the previous section, adding features to DocumentAnnotation is not recommended if you are using JCas.
Instead, it is recommended that you define your own type for storing your document-level metadata.
You can create an instance of this type and add it to the indexes in the usual way.
You can then retrieve this instance using the iterator returned from the method``getAllIndexedFS(type)`` on an instance of a JFSIndexRepository object.
(As of UIMA v2.1, you do not have to declare a custom index in your descriptor to get this to work).
If you do choose to add features to DocumentAnnotation, there are additional issues to be aware of.
The UIMA SDK provides the JCas cover class for the built-in definition of DocumentAnnotation, in the separate jar file ``uima-document-annotation.jar``.
If you add additional features to DocumentAnnotation, you must remove this jar file from your classpath, because you will not want to use the default JCas cover class.
You will need to re-run JCasGen as described in <<ugr.ref.jcas.merging_types.jcasgen_support>>.
JCasGen will generate a new cover class for DocumentAnnotation, which you must place in your classpath in lieu of the version in ``uima-document-annotation.jar``.
Also, this is the reason why the method `JCas.getDocumentAnnotationFs()` returns type ``TOP``, rather than type ``DocumentAnnotation``.
Because the `DocumentAnnotation` class can be replaced by users, it is not part of `uima-core.jar` and so the core UIMA framework cannot have any references to it.
In your code, you may "`cast`" the result of `JCas.getDocumentAnnotationFs()` to type ``DocumentAnnotation``, which must be available on the classpath either via `uima-document-annotation.jar` or by including a custom version that you have generated using JCasGen.
[[ugr.ref.jcas.using_within_an_annotator]]
== Using JCas within an Annotator
To use JCas within an annotator, you must include the generated Java classes output from JCasGen in the class path.
An annotator written using JCas is built by defining a class for the annotator that extends JCasAnnotator_ImplBase.
The process method for this annotator is written
[source]
----
public void process(JCas jcas)
throws AnalysisEngineProcessException {
... // body of annotator goes here
}
----
The process method is passed the JCas instance to use as a parameter.
The JCas reference is used throughout the annotator to refer to the particular JCas instance being worked on.
In pooled or multi-threaded implementations, there will be a separate JCas for each thread being (simultaneously) worked on.
You can do several kinds of operations using the JCas APIs: create new feature structures (instances of CAS types) (using the new operator), access existing feature structures passed to your annotator in the JCas (for example, by using the next method of an iterator over the feature structures), get and set the fields of a particular instance of a feature structure, and add and remove feature structure instances from the CAS indexes.
To support iteration, there are also functions to get and use indexes and iterators over the instances in a JCas.
[[ugr.ref.jcas.new_instances]]
=== Creating new instances using the Java "`new`" operator
// <titleabbrev>Creating new instances</titleabbrev>
The new operator creates new instances of JCas types.
It takes at least one parameter, the JCas instance in which the type is to be created.
For example, if there was a type Meeting defined, you can create a new instance of it using:
[source]
----
Meeting m = new Meeting(jcas);
----
Other variations of constructors can be added in custom code; the single parameter version is the one automatically generated by JCasGen.
For types that are subtypes of Annotation, JCasGen also generates an additional constructor with additional "`begin`" and "`end`" arguments.
[[ugr.ref.jcas.getters_and_setters]]
=== Getters and Setters
If the CAS type Meeting had fields location and time, you could get or set these by using getter or setter methods.
These methods have names formed by splicing together the word "`get`" or "`set`" followed by the field name, with the first letter of the field name capitalized.
For instance
[source]
----
getLocation()
----
The getter forms take no parameters and return the value of the field; the setter forms take one parameter, the value to set into the field, and return void.
There are built-in CAS types for arrays of integers, strings, floats, and feature structures.
For fields whose values are these types of arrays, there is an alternate form of getters and setters that take an additional parameter, written as the first parameter, which is the index in the array of an item to get or set.
[[ugr.ref.jcas.obtaining_refs_to_indexes]]
=== Obtaining references to Indexes
The only way to access instances (not otherwise referenced from other instances) passed in to your annotator in its JCas is to use an iterator over some index.
Indexes in the CAS are specified in the annotator descriptor.
Indexes have a name; text annotators have a built-in, standard index over all annotations.
To get an index, first get the JFSIndexRepository from the JCas using the method jcas.getJFSIndexRepository(). Here are the calls to get indexes:
[source]
----
JFSIndexRepository ir = jcas.getJFSIndexRepository();
ir.getIndex(name-of-index) // get the index by its name, a string
ir.getIndex(name-of-index, Foo.type) // filtered by specific type
ir.getAnnotationIndex() // get AnnotationIndex
jcas.getAnnotationIndex() // get directly from jcas
ir.getAnnotationIndex(Foo.type) // filtered by specific type
----
For convenience, the getAnnotationIndex method is available directly on the JCas object instance; the implementation merely forwards to the associated index repository.
Filtering types have to be a subtype of the type specified for this index in its index specification.
They can be written as either Foo.type or if you have an instance of Foo, you can write
[source]
----
fooInstance.getClass()
----
Foo is (of course) an example of the name of the type.
[[ugr.ref.jcas.adding_removing_instances_to_indexes]]
=== Adding (and removing) instances to (from) indexes
// <titleabbrev>Updating Indexes</titleabbrev>
CAS indexes are maintained automatically by the CAS.
But you must add any instances of feature structures you want the index to find, to the indexes by using the call:
[source]
----
myInstance.addToIndexes();
----
Do this after setting all features in the instance __which could be used in indexing__, for example, in determining the sorting order.
See <<ugr.ref.cas.updating_indexed_feature_structures>> for details on updating indexed feature structures.
When writing a Multi-View component, you may need to index instances in multiple CAS views.
The methods above use the indexes associated with the current JCas object.
There is a variation of the `addToIndexes / removeFromIndexes` methods which takes one argument: a reference to a JCas object holding the view in which you want to index this instance.
[source]
----
myInstance.addToIndexes(anotherJCas)
myInstance.removeFromIndexes(anotherJCas)
----
You can also explicitly add instances to other views using the addFsToIndexes method on other JCas (or CAS) objects.
For instance, if you had 2 other CAS views (myView1 and myView2), in which you wanted to index myInstance, you could write:
[source]
----
myInstance.addToIndexes(); //addToIndexes used with the new operator
myView1.addFsToIndexes(myInstance); // index myInstance in myView1
myView2.addFsToIndexes(myInstance); // index myInstance in myView2
----
The rules for determining which index to use with a particular JCas object are designed to behave the way most would think they should; if you need specific behavior, you can always explicitly designate which view the index adding and removing operations should work on.
The rules are: If the instance is a subtype of AnnotationBase, then the view is the view associated with the annotation as specified in the feature holding the view reference in AnnotationBase.
Otherwise, if the instance was created using the "new" operator, then the view is the view passed to the instance's constructor.
Otherwise, if the instance was created by getting a feature value from some other instance, whose range type is a feature structure, then the view is the same as the referring instance.
Otherwise, if the instance was created by any of the Feature Structure Iterator operations over some index, then it is the view associated with the index.
As of release 2.4.1, there are two efficient bulk-remove methods to remove all instances of a given type, or all instances of a given type and its subtypes.
These are invoked on an instance of an IndexRepository, for a particular view.
For example, to remove all instances of Token from a particular JCas instance:
[source]
----
jcas.removeAllIncludingSubtypes(Token.type) or
jcas.removeAllIncludingSubtypes(aTokenInstance.getTypeIndexID()) or
jcas.getFsIndexRepository().
removeAllIncludingSubtypes(jcas.getCasType(Token.type))
----
[[ugr.ref.jcas.using_iterators]]
=== Using Iterators
This chapter describes obtaining and using iterators.
However, it is recommended that instead you use the select framework, described in a chapter in the version 3 user's guide.
Once you have an index obtained from the JCas, you can get an iterator from the index; here is an example:
[source]
----
FSIndexRepository ir = jcas.getFSIndexRepository();
FSIndex myIndex = ir.getIndex("myIndexName");
FSIterator myIterator = myIndex.iterator();
JFSIndexRepository ir = jcas.getJFSIndexRepository();
FSIndex myIndex = ir.getIndex("myIndexName", Foo.type); // filtered
FSIterator myIterator = myIndex.iterator();
----
xref:ref.adoc#ugr.ref.cas.indexes_and_iterators[Iterators] work like normal Java iterators, but are augmented to support additional capabilities.
[[ugr.ref.jcas.class_loaders]]
=== Class Loaders in UIMA
The basic concept of a UIMA application includes assembling engines into a flow.
The application made up of these Engines are run within the UIMA Framework, either by the Collection Processing Manager, or by using more basic UIMA Framework APIs.
The UIMA Framework exists within a JVM (Java Virtual Machine). A JVM has the capability to load multiple applications, in a way where each one is isolated from the others, by using a separate class loader for each application.
For instance, one set of UIMA Framework Classes could be shared by multiple sets of application - specific classes, even if these application-specific classes had the same names but were different versions.
[[ugr.ref.jcas.class_loaders.optional]]
==== Use of Class Loaders is optional
The UIMA framework will use a specific ClassLoader, based on how ResourceManager instances are used.
Specific ClassLoaders are only created if you specify an ExtensionClassPath as part of the ResourceManager.
If you do not need to support multiple applications within one UIMA framework within a JVM, don't specify an ExtensionClassPath; in this case, the classloader used will be the one used to load the UIMA framework - usually the overall application class loader.
Of course, you should not run multiple UIMA applications together, in this way, if they have different class definitions for the same class name.
This includes the JCas "`cover`" classes.
This case might arise, for instance, if both applications extended `uima.tcas.DocumentAnnotation` in differing, incompatible ways.
Each application would need its own definition of this class, but only one could be loaded (unless you specify ExtensionClassPath in the ResourceManager which will cause the UIMA application to load its private versions of its classes, from its classpath).
[[ugr.ref.jcas.accessing_jcas_objects_outside_uima_components]]
=== Issues accessing JCas objects outside of UIMA Engine Components
If you are using the ExtensionClassPaths, the JCas cover classes are loaded under a class loader created by the ResourceManager part of the UIMA Framework.
If you reference the same JCas classes outside of any UIMA component, for instance, in top level application code, the JCas classes used by that top level application code also must be in the class path for the application code.
Alternatively, you could do all the JCas processing inside a UIMA component (and do no processing using JCas outside of the UIMA pipeline).
[[ugr.ref.jcas.setting_up_classpath]]
== Setting up Classpath for JCas
The JCas Java classes generated by JCasGen are typically compiled and put into a JAR file, which, in turn, is put into the application's class path.
This JAR file must be generated from the application's merged type system.
This is most conveniently done by opening the top level descriptor used by the application in the Component Descriptor Editor tool, and pressing the Run-JCasGen button on the Type System Definition page.
[[ugr.ref.jcas.pear_support]]
== PEAR isolation
As of version 2.2, the framework supports component descriptors which are PEAR descriptors.
These descriptors define components plus include information on the class path needed to run them.
The framework uses the class path information to set up a localized class path, just for code running within the PEAR context.
This allows PEAR files requiring different versions of common code to work well together, even if the class names in the different versions have the same names.
The mechanism used to switch the class loaders when entering a PEAR-packaged annotator in a flow depends on the framework knowing if JCas is being used within that annotator code.
The framework will know this if the particular view being passed has had a previous call to getJCas(), or if the particular annotator is marked as a JCas-using one (by having it extend the class `JCasAnnotator_ImplBase).`