<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
<!ENTITY imgroot "../images/tools/tools.doc_analyzer/" >
<!ENTITY % uimaents SYSTEM "../entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tools.doc_analyzer">
<title>Document Analyzer User&apos;s Guide</title>
<para>The <emphasis>Document Analyzer</emphasis> is a tool provided by the
UIMA SDK for testing annotators and Analysis Engines (AEs). It reads text files from your disk, processes them using an AE, and
allows you to view the results. The
Document Analyzer is designed to work with text files and cannot be used with
Analysis Engines that process other types of data.</para>
<para>For an introduction to developing annotators and Analysis
Engines, read
<olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/>.
This chapter describes how to use the Document Analyzer tool; it
does not cover the process of developing annotators and Analysis Engines.</para>
<section id="ugr.tools.doc_analyzer.starting">
<title>Starting the Document Analyzer</title>
<para>To run the Document Analyzer, execute the <literal>documentAnalyzer</literal> script in the <literal>bin</literal> directory of your UIMA SDK installation or, if you
are using the example Eclipse project, execute the <quote>UIMA Document Analyzer</quote>
run configuration supplied with that project.</para>
<para>Note that if you&apos;re planning to run an Analysis Engine
other than one of the examples included in the UIMA SDK, you&apos;ll first need to
update your CLASSPATH environment variable to include the classes needed by
that Analysis Engine.</para>
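<para>Before launching the tool, it can help to confirm that your annotator classes are
actually visible on the classpath. The following is a minimal sketch of such a check and is not part
of the UIMA SDK: the class <literal>ClasspathCheck</literal> and its method are hypothetical names, and
the sketch assumes you run it with the same CLASSPATH setting that the Document Analyzer will use.</para>
<programlisting language="java"><![CDATA[
```java
// Hypothetical helper for illustration; not part of the UIMA SDK.
final class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass fully qualified class names, e.g. your annotator implementation class.
        for (String name : args) {
            System.out.println(name + (isOnClasspath(name) ? ": found" : ": NOT found"));
        }
    }
}
```
]]></programlisting>
<para>Run it with the fully qualified name of your annotator class as the argument; a
<quote>NOT found</quote> line suggests that the CLASSPATH still needs to be updated.</para>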
<para>When you first run the Document Analyzer, you should see a
screen that looks like this:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.8in" format="JPG" fileref="&imgroot;image002.jpg"/>
</imageobject>
<textobject><phrase>Document Analyzer GUI</phrase>
</textobject>
</mediaobject>
</screenshot></para>
</section>
<section id="ugr.tools.doc_analyzer.running_an_ae">
<title>Running an AE</title>
<para>To run an AE, you must first configure the six fields on
the main screen of the Document Analyzer.</para>
<para><emphasis role="bold">Input Directory:</emphasis>
Browse to or type the path of a directory containing text files that you
want to analyze. Some sample documents
are provided in the UIMA SDK under the <literal>examples/data</literal>
directory.</para>
<para><emphasis role="bold">Output Directory:</emphasis> Browse to or type the path of a directory where you want
output to be written. (As we&apos;ll see later, you won&apos;t normally need to look directly at these files, but the
Document Analyzer needs to know where to write them.) The files written to this directory will be an XML
representation of the analyzed documents. If this directory doesn&apos;t exist, it will be created. If the
directory exists, any files in it will be deleted (but the tool will ask you to confirm this before doing so). If you
leave this field blank, your AE will be run but no output will be generated.</para>
<para><emphasis role="bold">Location of AE XML Descriptor:</emphasis>
Browse to or type the path of the descriptor
for the AE that you want to run. There
are some example descriptors provided in the UIMA SDK under the <literal>examples/descriptors/analysis_engine</literal> and <literal>examples/descriptors/tutorial</literal> directories.</para>
<para><emphasis role="bold">XML Tag containing Text:</emphasis>
This is an optional feature. If you enter a value here, it specifies the
name of an XML tag, expected to be found within the input documents, that
contains the text to be analyzed. For
example, the value <literal>TEXT</literal> would cause the AE to only
analyze the portion of the document enclosed within &lt;TEXT&gt;...&lt;/TEXT&gt;
tags. Any XML tags occurring within that text are also removed prior to analysis.</para>
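<para>The effect of this field can be illustrated with a small sketch. This is not the Document
Analyzer&apos;s own code; the class <literal>TagTextSketch</literal> is a hypothetical name, and the sketch
assumes that only the first matching element is used and that a missing tag yields an empty string,
neither of which is stated in this guide.</para>
<programlisting language="java"><![CDATA[
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; this is NOT the Document Analyzer's own code.
final class TagTextSketch {

    // Returns the content of the first <tagName>...</tagName> element with any
    // nested tags stripped. Returning "" when the tag is absent is an assumption
    // made for this sketch.
    static String extract(String document, String tagName) {
        String quoted = Pattern.quote(tagName);
        Pattern element = Pattern.compile(
            "<" + quoted + "(?:\\s[^>]*)?>(.*?)</" + quoted + "\\s*>",
            Pattern.DOTALL);
        Matcher m = element.matcher(document);
        if (!m.find()) {
            return "";
        }
        // Drop any markup nested inside the selected element.
        return m.group(1).replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String doc = "<DOC><TITLE>Agenda</TITLE>"
            + "<TEXT>Meet <b>Bob</b> at 9.</TEXT></DOC>";
        System.out.println(extract(doc, "TEXT"));
    }
}
```
]]></programlisting>
<para>With the sample document in <literal>main</literal>, only the content of the
&lt;TEXT&gt; element would be passed on for analysis, and the nested markup around
<quote>Bob</quote> would be dropped.</para>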
<para><emphasis role="bold">Language:</emphasis>
Specify
the language in which the documents are written. Some Analysis Engines, but not all, require
that this be set correctly in order to do their analysis. You can select a value from the drop-down
list or type your own. The value entered
here must be an ISO language identifier, the list of which can be found here:
<ulink url="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt"/>.
</para>
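<para>If you type your own value, you may want to check it first. The sketch below uses the
standard <literal>java.util.Locale.getISOLanguages()</literal> method, which lists the two-letter
ISO 639 codes known to the Java runtime. The class <literal>LanguageCodeCheck</literal> is a hypothetical
name, and the sketch only covers plain two-letter codes; whether the tool accepts other forms is not
covered here.</para>
<programlisting language="java"><![CDATA[
```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the UIMA SDK.
final class LanguageCodeCheck {

    // Returns true if the value is one of the two-letter ISO 639 language
    // codes known to this Java runtime (for example "en" or "de").
    static boolean isIso639Code(String code) {
        return Arrays.asList(Locale.getISOLanguages())
            .contains(code.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String code : args) {
            System.out.println(code + (isIso639Code(code) ? ": recognized" : ": not recognized"));
        }
    }
}
```
]]></programlisting>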
<para><emphasis role="bold">Character Encoding:</emphasis>
The character encoding of the input files. The default, UTF-8, also works fine for ASCII
text files. If you have a different
encoding, enter it here. For more
information on character sets and their names, see the Javadocs for
<literal>java.nio.charset.Charset</literal>.</para>
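<para>As a quick check on an encoding name, the sketch below asks the Java runtime whether it
supports the name, and shows why the UTF-8 default is safe for ASCII files: ASCII bytes decode to the
same text under UTF-8. The class <literal>EncodingCheck</literal> and its methods are hypothetical names
used only for this illustration.</para>
<programlisting language="java"><![CDATA[
```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the UIMA SDK.
final class EncodingCheck {

    // Returns true if this Java runtime supports the named character encoding.
    static boolean isUsableEncoding(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }

    // Encodes the text as ASCII bytes and decodes those bytes as UTF-8;
    // for pure ASCII text the round trip returns the original string.
    static boolean asciiDecodesSameAsUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.UTF_8).equals(text);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 supported: " + isUsableEncoding("UTF-8"));
        System.out.println("ASCII round trip: " + asciiDecodesSameAsUtf8("Hello, UIMA"));
    }
}
```
]]></programlisting>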
<para>Once you&apos;ve filled in the appropriate values, press the
<quote>Run</quote> button.</para>
<para>If an error occurs, a dialog will appear with the error
message. (A stack trace will also be
printed to the console, which may help you if the error was generated by your
own annotator code.) Otherwise, an
<quote>Analysis Results</quote> window will appear.</para>
</section>
<section id="ugr.tools.doc_analyzer.viewing_results">
<title>Viewing the Analysis Results</title>
<para>After a successful analysis, the <quote>Analysis
Results</quote> window will appear.
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="4.2in" format="JPG" fileref="&imgroot;image004.jpg"/>
</imageobject>
<textobject><phrase>Analysis Results Window</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>The <quote>Results Display Format</quote> options at the
bottom of this window show the different ways you can view your analysis &ndash; the
Java Viewer, Java Viewer (JV) with User Colors, HTML, and XML.
The default, Java Viewer, is recommended.</para>
<para>Once you have selected your desired Results Display
Format, you can double-click on one of the files in the list to view the
analysis done on that file.</para>
<para>For the Java viewer, the results display looks like this
(for the AE descriptor <literal>examples/descriptors/tutorial/ex4/MeetingDetectorAE.xml</literal>):
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.8in" format="JPG" fileref="&imgroot;image006.jpg"/>
</imageobject>
<textobject><phrase>Analysis Results Window showing results from tutorial example 4</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>Click one of the highlighted
annotations to see a list of all its features in the frame on the right.</para>
<para>If there are multiple annotation types in the view, you
can control which ones are selected by using the checkboxes in the legend, the
Select All button, or the Deselect All button.</para>
<para>If you are viewing a CAS that contains multiple subjects
of analysis (Sofas), then a selector will appear at the bottom right of the Annotation
Viewer window. This will allow you to
choose the Sofa that you wish to view. Note that only text Sofas containing a non-null document are available
for viewing.</para>
</section>
<section id="ugr.tools.doc_analyzer.configuring">
<title>Configuring the Annotation Viewer</title>
<para>The <quote>JV User Colors</quote> and HTML viewers allow
you to specify exactly which colors are used to display each of your annotation
types. For the Java Viewer, you can also
specify which types should be initially selected, and you can hide types
entirely.</para>
<para>To configure the viewer, click the <quote>Edit Style
Map</quote> button on the <quote>Analysis Results</quote> dialog.
You should see a dialog that looks like this:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.8in" format="JPG" fileref="&imgroot;image008.jpg"/>
</imageobject>
<textobject><phrase>Configuring the Analysis Results Viewer</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>To change the color assigned to a type, simply click on
the colored cell in the <quote>Background</quote> column for the type you wish to
edit. This will display a dialog that
allows you to choose the color. For the
HTML viewer only, you can also change the foreground color.</para>
<para>If you would like the type to be initially checked
(selected) in the legend when the viewer is first launched, check the box in
the <quote>Checked</quote> column. If you
would like the type to never be shown in the viewer, check the box in the
<quote>Hidden</quote> column. These
settings only affect the Java Viewer, not the HTML view.</para>
<para>When you are done editing, click the <quote>Save</quote>
button. This will save your choices to a
file in the same directory as your AE descriptor. From now on, when you view analysis results
produced by this AE using the <quote>JV User Colors</quote> or <quote>HTML</quote>
options, the viewer will be configured as you have specified.</para>
</section>
<section id="ugr.tools.doc_analyzer.interactive_mode">
<title>Interactive Mode</title>
<para>Interactive Mode allows you to analyze text that you type
or cut-and-paste into the tool, rather than requiring that the documents be
stored as files.</para>
<para>In the main Document Analyzer window, you can invoke
Interactive Mode by clicking the <quote>Interactive</quote> button instead of the
<quote>Run</quote> button. This will
display a dialog that looks like this:
<screenshot>
<mediaobject>
<imageobject>
<imagedata width="5.5in" format="JPG" fileref="&imgroot;image010.jpg"/>
</imageobject>
<textobject><phrase>Invoking Interactive Mode</phrase></textobject>
</mediaobject>
</screenshot></para>
<para>You can type or cut-and-paste your text into this window,
then choose your Results Display Format and click the <quote>Analyze</quote>
button. Your AE will be run on the text
that you supplied and the results will be displayed as usual.</para>
</section>
<section id="ugr.tools.doc_analyzer.view_mode">
<title>View Mode</title>
<para>If you have previously run an AE and saved its analysis
results, you can use the Document Analyzer&apos;s View mode to view those results,
without re-running your analysis. To do
this, on the main Document Analyzer window simply select the location of your
analyzed documents in the <quote>Output Directory</quote> field and click the
<quote>View</quote> button. You can then
view your analysis results as described in Section
<xref linkend="ugr.tools.doc_analyzer.viewing_results"/>.</para>
</section>
</chapter>