<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/tools.doc_analyzer/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tools.doc_analyzer">
<title>Document Analyzer User's Guide</title>
<para>The <emphasis>Document Analyzer</emphasis> is a tool provided by the
UIMA SDK for testing annotators and Analysis Engines (AEs). It reads text files from your disk, processes them using an AE, and
allows you to view the results. The
Document Analyzer is designed to work with text files and cannot be used with
Analysis Engines that process other types of data.</para>
<para>For an introduction to developing annotators and Analysis
Engines, read
<olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/>.
This chapter is a user's guide for the Document Analyzer tool; it
does not describe the process of developing annotators and Analysis Engines.</para>
<section id="ugr.tools.doc_analyzer.starting">
<title>Starting the Document Analyzer</title>
<para>To run the Document Analyzer, execute the <literal>documentAnalyzer</literal> script in the <literal>bin</literal> directory of your UIMA SDK installation, or, if you
are using the example Eclipse project, execute the <quote>UIMA Document Analyzer</quote>
run configuration supplied with that project.</para>
<para>Note that if you're planning to run an Analysis Engine
other than one of the examples included in the UIMA SDK, you'll first need to
update your CLASSPATH environment variable to include the classes needed by
that Analysis Engine.</para>
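<para>One way to confirm that the CLASSPATH change took effect is to try loading
your annotator class from a small Java program started with the same CLASSPATH.
The following is only a sketch: the class name <literal>ClasspathCheck</literal> and the
default class name <literal>com.example.MyAnnotator</literal> are hypothetical placeholders,
not part of the UIMA SDK.</para>
<programlisting><![CDATA[
```java
class ClasspathCheck {

    /** Returns true if the named class can be found on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class, but do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but one of its dependencies is missing
            return false;
        }
    }

    public static void main(String[] args) {
        // com.example.MyAnnotator is a hypothetical name; pass your own class name instead
        String className = args.length > 0 ? args[0] : "com.example.MyAnnotator";
        System.out.println(className + " loadable: " + isLoadable(className));
        System.out.println("Effective classpath: " + System.getProperty("java.class.path"));
    }
}
```
]]></programlisting>
<para>If the class is reported as not loadable, the directory or JAR file that contains
it is probably missing from CLASSPATH.</para>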
<para>When you first run the Document Analyzer, you should see a
screen that looks like this:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.8in" format="JPG" fileref="&imgroot;image002.jpg"/>
</imageobject>
<textobject><phrase>Document Analyzer GUI</phrase>
</textobject>
</mediaobject>
</screenshot></para>
</section>
<section id="ugr.tools.doc_analyzer.running_an_ae">
<title>Running an AE</title>
<para>To run an AE, you must first configure the six fields on
the main screen of the Document Analyzer.</para>
<para><emphasis role="bold">Input Directory:</emphasis>
Browse to or type the path of a directory containing text files that you
want to analyze. Some sample documents
are provided in the UIMA SDK under the <literal>examples/data</literal>
directory.</para>
<para><emphasis role="bold">Output Directory:</emphasis> Browse to or type the path of a directory where you want
output to be written. (As we'll see later, you won't normally need to look directly at these files, but the
Document Analyzer needs to know where to write them.) The files written to this directory are an XML
representation of the analyzed documents. If this directory doesn't exist, it will be created. If the
directory exists, any files in it will be deleted (but the tool will ask you to confirm this before doing so). If you
leave this field blank, your AE will be run but no output will be generated.</para>
<para><emphasis role="bold">Location of AE XML Descriptor:</emphasis>
Browse to or type the path of the descriptor
for the AE that you want to run. Some
example descriptors are provided in the UIMA SDK under the <literal>examples/descriptors/analysis_engine</literal> and <literal>examples/descriptors/tutorial</literal> directories.</para>
<para><emphasis role="bold">XML Tag containing Text:</emphasis>
This field is optional. If you enter a value here, it specifies the
name of an XML tag, expected to be found within the input documents, that
contains the text to be analyzed. For
example, the value <literal>TEXT</literal> would cause the AE to
analyze only the portion of the document enclosed within <literal>&lt;TEXT&gt;...&lt;/TEXT&gt;</literal>
tags. Also, any XML tags occurring within that text will be removed prior to analysis.</para>
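<para>To make the effect of this field concrete, the following sketch mimics the
behavior described above using regular expressions. It is an illustration only, not the
Document Analyzer's actual implementation: the class name <literal>TagTextSketch</literal>
is hypothetical, and concatenating the text of multiple matching elements is an assumption
made for the example.</para>
<programlisting><![CDATA[
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextSketch {

    /**
     * Returns the text enclosed in <tag>...</tag>, with any nested tags removed.
     * Assumption for this sketch: if the tag occurs more than once, the pieces
     * are concatenated in document order.
     */
    static String extract(String document, String tag) {
        String quoted = Pattern.quote(tag);
        // DOTALL lets '.' match line breaks, so the enclosed text may span lines
        Pattern element = Pattern.compile("<" + quoted + ">(.*?)</" + quoted + ">", Pattern.DOTALL);
        Matcher m = element.matcher(document);
        StringBuilder text = new StringBuilder();
        while (m.find()) {
            // Strip tags that occur inside the selected text
            text.append(m.group(1).replaceAll("<[^>]*>", ""));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String doc = "<DOC><TITLE>Memo</TITLE><TEXT>Hello <b>world</b>.</TEXT></DOC>";
        System.out.println(extract(doc, "TEXT"));
    }
}
```
]]></programlisting>
<para>With the sample document in <literal>main</literal>, only the text inside the
<literal>TEXT</literal> element is kept and the nested <literal>b</literal> tags are dropped;
the title is ignored.</para>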
<para><emphasis role="bold">Language:</emphasis>
Specify
the language in which the documents are written. Some Analysis Engines, but not all, require
that this be set correctly in order to do their analysis. You can select a value from the drop-down
list or type your own. The value entered
here must be an ISO 639 language identifier; a list can be found at
<ulink url="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt"/>.
</para>
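<para>If you are unsure whether a two-letter code is a valid ISO 639 identifier, you can
check it against the list built into the JDK. This is a minimal sketch; the class name
<literal>LanguageCodeCheck</literal> is hypothetical, and the check covers only the two-letter
codes returned by <literal>java.util.Locale.getISOLanguages()</literal>. Whether the Document
Analyzer performs any such validation itself is not covered here.</para>
<programlisting><![CDATA[
```java
import java.util.Arrays;
import java.util.Locale;

class LanguageCodeCheck {

    /** Returns true if the code is one of the two-letter ISO 639 codes known to the JDK. */
    static boolean isIso639Code(String code) {
        if (code == null) {
            return false;
        }
        // getISOLanguages() returns lowercase two-letter codes such as "en" or "de"
        return Arrays.asList(Locale.getISOLanguages()).contains(code.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String code = args.length > 0 ? args[0] : "en";
        System.out.println(code + " is an ISO 639 code: " + isIso639Code(code));
    }
}
```
]]></programlisting>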
<para><emphasis role="bold">Character Encoding:</emphasis>
The character encoding of the input files. The default, UTF-8, also works for ASCII
text files. If your files use a different
encoding, enter its name here. For more
information on character sets and their names, see the Javadocs for
<literal>java.nio.charset.Charset</literal>.</para>
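<para>Before entering an encoding name, you can check that your Java runtime recognizes it.
The sketch below uses <literal>java.nio.charset.Charset</literal> directly; the class name
<literal>EncodingCheck</literal> is hypothetical, and it is an assumption that a name accepted
here will also be accepted by the Document Analyzer.</para>
<programlisting><![CDATA[
```java
import java.nio.charset.Charset;

class EncodingCheck {

    /** Returns true if this Java runtime supports the named character encoding. */
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal names, e.g. names containing spaces
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "UTF-8";
        System.out.println(name + " supported: " + isUsable(name));
        if (isUsable(name)) {
            // Aliases such as "latin1" resolve to a canonical charset name
            System.out.println("Canonical name: " + Charset.forName(name).name());
        }
    }
}
```
]]></programlisting>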
<para>Once you've filled in the appropriate values, press the
<quote>Run</quote> button.</para>
<para>If an error occurs, a dialog will appear with the error
message. (A stack trace will also be
printed to the console, which may help you if the error was generated by your
own annotator code.) Otherwise, an
<quote>Analysis Results</quote> window will appear.</para>
</section>
<section id="ugr.tools.doc_analyzer.viewing_results">
<title>Viewing the Analysis Results</title>
<para>After a successful analysis, the <quote>Analysis
Results</quote> window will appear.
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="4.2in" format="JPG" fileref="&imgroot;image004.jpg"/>
</imageobject>
<textobject><phrase>Analysis Results Window</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>The <quote>Results Display Format</quote> options at the
bottom of this window show the different ways you can view your analysis: the
Java Viewer, Java Viewer (JV) with User Colors, HTML, and XML.
The default, Java Viewer, is recommended.</para>
<para>Once you have selected your desired Results Display
Format, you can double-click on one of the files in the list to view the
analysis done on that file.</para>
<para>For the Java Viewer, the results display looks like this
(for the AE descriptor <literal>examples/descriptors/tutorial/ex4/MeetingDetectorAE.xml</literal>):
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.8in" format="JPG" fileref="&imgroot;image006.jpg"/>
</imageobject>
<textobject><phrase>Analysis Results Window showing results from tutorial example 4</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>You can click on one of the highlighted
annotations to see a list of all its features in the frame on the right.</para>
<para>If there are multiple annotation types in the view, you
can control which ones are selected by using the checkboxes in the legend, the
Select All button, or the Deselect All button.</para>
<para>If you are viewing a CAS that contains multiple subjects
of analysis (Sofas), then a selector will appear at the bottom right of the Annotation
Viewer window. This will allow you to
choose the Sofa that you wish to view. Note that only text Sofas containing a non-null document are available
for viewing.</para>
</section>
<section id="ugr.tools.doc_analyzer.configuring">
<title>Configuring the Annotation Viewer</title>
<para>The <quote>JV User Colors</quote> and HTML viewers allow
you to specify exactly which colors are used to display each of your annotation
types. For the Java Viewer, you can also
specify which types should be initially selected, and you can hide types
entirely.</para>
<para>To configure the viewer, click the <quote>Edit Style
Map</quote> button on the <quote>Analysis Results</quote> dialog.
You should see a dialog that looks like this:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.8in" format="JPG" fileref="&imgroot;image008.jpg"/>
</imageobject>
<textobject><phrase>Configuring the Analysis Results Viewer</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>To change the color assigned to a type, click on
the colored cell in the <quote>Background</quote> column for the type you wish to
edit. This will display a dialog that
allows you to choose the color. For the
HTML viewer only, you can also change the foreground color.</para>
<para>If you would like the type to be initially checked
(selected) in the legend when the viewer is first launched, check the box in
the <quote>Checked</quote> column. If you
would like the type to never be shown in the viewer, check the box in the
<quote>Hidden</quote> column. These
settings only affect the Java Viewer, not the HTML view.</para>
<para>When you are done editing, click the <quote>Save</quote>
button. This will save your choices to a
file in the same directory as your AE descriptor. From now on, when you view analysis results
produced by this AE using the <quote>JV User Colors</quote> or <quote>HTML</quote>
options, the viewer will be configured as you have specified.</para>
</section>
<section id="ugr.tools.doc_analyzer.interactive_mode">
<title>Interactive Mode</title>
<para>Interactive Mode allows you to analyze text that you type
or cut-and-paste into the tool, rather than requiring that the documents be
stored as files.</para>
<para>In the main Document Analyzer window, you can invoke
Interactive Mode by clicking the <quote>Interactive</quote> button instead of the
<quote>Run</quote> button. This will
display a dialog that looks like this:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.5in" format="JPG" fileref="&imgroot;image010.jpg"/>
</imageobject>
<textobject><phrase>Invoking Interactive Mode</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>You can type or cut-and-paste your text into this window,
then choose your Results Display Format and click the <quote>Analyze</quote>
button. Your AE will be run on the text
that you supplied and the results will be displayed as usual.</para>
</section>
<section id="ugr.tools.doc_analyzer.view_mode">
<title>View Mode</title>
<para>If you have previously run an AE and saved its analysis
results, you can use the Document Analyzer's View mode to view those results
without re-running your analysis. To do
this, on the main Document Analyzer window, enter the location of your
analyzed documents in the <quote>Output Directory</quote> field and click the
<quote>View</quote> button. You can then
view your analysis results as described in Section
<xref linkend="ugr.tools.doc_analyzer.viewing_results"/>.</para>
</section>
</chapter>