 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
 <!ENTITY imgroot "../images/tools/tools.doc_analyzer/" >
 <!ENTITY % uimaents SYSTEM "../entities.ent" >
 %uimaents;
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <chapter id="ugr.tools.doc_analyzer">
   <title>Document Analyzer User&apos;s Guide</title>


 <para>The <emphasis>Document Analyzer</emphasis> is a tool provided by the
 UIMA SDK for testing annotators and Analysis Engines (AEs). It reads text files from your disk, processes them using an AE, and
 lets you view the results.  The
 Document Analyzer is designed to work with text files and cannot be used with
 Analysis Engines that process other types of data.</para>

 <para>For an introduction to developing annotators and Analysis
 Engines, read
  <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae"/>.
   This chapter explains how to use the Document Analyzer tool; it
 does not describe the process of developing annotators and Analysis Engines.</para>

 <section id="ugr.tools.doc_analyzer.starting">
   <title>Starting the Document Analyzer</title>

 <para>To run the Document Analyzer, execute the <literal>documentAnalyzer</literal> script that is in the <literal>bin</literal> directory of your UIMA SDK installation, or, if you
 are using the example Eclipse project, execute the <quote>UIMA Document Analyzer</quote>
 run configuration supplied with that project.</para>

 <para>Note that if you&apos;re planning to run an Analysis Engine
 other than one of the examples included in the UIMA SDK, you&apos;ll first need to
 update your CLASSPATH environment variable to include the classes needed by
 that Analysis Engine.</para>

 <para>When you first run the Document Analyzer, you should see a
 screen that looks like this:

   <screenshot>
     <mediaobject>
       <imageobject>
         <imagedata width="5.8in" format="JPG" fileref="&imgroot;image002.jpg"/>
       </imageobject>
       <textobject><phrase>Document Analyzer GUI</phrase>
       </textobject>
     </mediaobject>
   </screenshot></para>


   </section>

   <section id="ugr.tools.doc_analyzer.running_an_ae">
     <title>Running an AE</title>


 <para>To run an AE, you must first configure the six fields on
 the main screen of the Document Analyzer.</para>

 <para><emphasis role="bold">Input Directory:</emphasis>
   Browse to or type the path of a directory containing text files that you
 want to analyze.  Some sample documents
 are provided in the UIMA SDK under the <literal>examples/data</literal>
 directory.</para>

 <para><emphasis role="bold">Output Directory:</emphasis> Browse to or type the path of a directory where you want
   output to be written. (As we&apos;ll see later, you won&apos;t normally need to look directly at these files, but the
   Document Analyzer needs to know where to write them.) The files written to this directory will be an XML
   representation of the analyzed documents. If this directory doesn&apos;t exist, it will be created. If the
   directory exists, any files in it will be deleted (but the tool will ask you to confirm this before doing so). If you
   leave this field blank, your AE will be run but no output will be generated.</para>

 <para><emphasis role="bold">Location of AE XML Descriptor:</emphasis>
   Browse to or type the path of the descriptor
 for the AE that you want to run.  There
 are some example descriptors provided in the UIMA SDK under the <literal>examples/descriptors/analysis_engine</literal> and <literal>examples/descriptors/tutorial</literal> directories.</para>

 <para><emphasis role="bold">XML Tag containing Text:</emphasis>
   This is an optional feature.  If you enter a value here, it specifies the
 name of an XML tag, expected to be found within the input documents, that
 contains the text to be analyzed.  For
 example, the value <literal>TEXT</literal> would cause the AE to only
 analyze the portion of the document enclosed within &lt;TEXT&gt;...&lt;/TEXT&gt;
 tags.  Also, any XML tags occurring within that text will be removed prior to analysis.</para>
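
 <para>As a minimal sketch of how this field applies, consider an input file like the one below.
 The <literal>DOC</literal> and <literal>DOCNO</literal> element names and the sample sentence are hypothetical, chosen
 only for illustration; only the <literal>TEXT</literal> tag name comes from the example above.</para>

 <programlisting><![CDATA[<!-- Hypothetical input file; DOC and DOCNO are illustrative names only -->
<DOC>
  <!-- Outside TEXT: skipped when "XML Tag containing Text" is set to TEXT -->
  <DOCNO>sample-001</DOCNO>
  <!-- Inside TEXT: analyzed, after inline tags such as b are removed -->
  <TEXT>
    The <b>quarterly review</b> meeting is on Tuesday at 10:00 AM.
  </TEXT>
</DOC>]]></programlisting>

 <para>With <literal>TEXT</literal> entered in the field, the AE would be expected to receive only the sentence inside the
 <literal>&lt;TEXT&gt;</literal> element, with the <literal>&lt;b&gt;</literal> markup stripped.</para>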

 <para><emphasis role="bold">Language:</emphasis>
   Specify
 the language in which the documents are written.  Some Analysis Engines, but not all, require
 that this be set correctly in order to do their analysis.  You can select a value from the drop-down
 list or type your own.  The value entered
 here must be an ISO 639 language identifier (for example, <literal>en</literal>); the list of identifiers can be found here:
   <ulink url="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt"/>.
 </para>

 <para><emphasis role="bold">Character Encoding:</emphasis>
   The character encoding of the input files.  The default, UTF-8, also works fine for ASCII
 text files.  If you have a different
 encoding, enter it here.  For more
 information on character sets and their names, see the Javadocs for
   <literal>java.nio.charset.Charset</literal>.</para>

 <para>Once you&apos;ve filled in the appropriate values, press the
 <quote>Run</quote> button.</para>

 <para>If an error occurs, a dialog will appear with the error
 message.  (A stack trace will also be
 printed to the console, which may help you if the error was generated by your
 own annotator code.)  Otherwise, an
 <quote>Analysis Results</quote> window will appear.</para>


 </section>

   <section id="ugr.tools.doc_analyzer.viewing_results">
     <title>Viewing the Analysis Results</title>

 <para>After a successful analysis, the <quote>Analysis
 Results</quote> window will appear.

   <screenshot>
     <mediaobject>
       <imageobject>
         <imagedata width="4.2in" format="JPG" fileref="&imgroot;image004.jpg"/>
       </imageobject>
       <textobject><phrase>Analysis Results Window</phrase></textobject>
     </mediaobject>
   </screenshot></para>


 <para>The <quote>Results Display Format</quote> options at the
 bottom of this window show the different ways you can view your analysis &ndash; the
 Java Viewer, Java Viewer (JV) with User Colors, HTML, and XML.
   The default, Java Viewer, is recommended.</para>

 <para>Once you have selected your desired Results Display
 Format, you can double-click on one of the files in the list to view the
 analysis done on that file.</para>

 <para>For the Java Viewer, the results display looks like this
 (for the AE descriptor <literal>examples/descriptors/tutorial/ex4/MeetingDetectorAE.xml</literal>):

   <screenshot>
     <mediaobject>
       <imageobject>
         <imagedata width="5.8in" format="JPG" fileref="&imgroot;image006.jpg"/>
       </imageobject>
       <textobject><phrase>Analysis Results Window showing results from tutorial example 4</phrase></textobject>
     </mediaobject>
   </screenshot></para>


 <para>Click one of the highlighted
 annotations to see a list of all its features in the frame on the right.</para>

 <para>If there are multiple annotation types in the view, you
 can control which ones are selected by using the checkboxes in the legend, the
 Select All button, or the Deselect All button.</para>

 <para>If you are viewing a CAS that contains multiple subjects
 of analysis (Sofas), a selector will appear at the bottom right of the Annotation
 Viewer window.  Use it to
 choose the Sofa that you wish to view.  Note that only text Sofas containing a non-null document are available
 for viewing.</para>

 </section>

   <section id="ugr.tools.doc_analyzer.configuring">
     <title>Configuring the Annotation Viewer</title>

 <para>The <quote>JV User Colors</quote> and the HTML viewer allow
 you to specify exactly which colors are used to display each of your annotation
 types.  For the Java Viewer, you can also
 specify which types should be initially selected, and you can hide types
 entirely.</para>

 <para>To configure the viewer, click the <quote>Edit Style
 Map</quote> button on the <quote>Analysis Results</quote> dialog.
   You should see a dialog that looks like this:


   <screenshot>
     <mediaobject>
       <imageobject>
         <imagedata width="5.8in" format="JPG" fileref="&imgroot;image008.jpg"/>
       </imageobject>
       <textobject><phrase>Configuring the Analysis Results Viewer</phrase></textobject>
     </mediaobject>
   </screenshot></para>

 <para>To change the color assigned to a type, simply click on
 the colored cell in the <quote>Background</quote> column for the type you wish to
 edit.  This will display a dialog that
 allows you to choose the color.  For the
 HTML viewer only, you can also change the foreground color.</para>

 <para>If you would like the type to be initially checked
 (selected) in the legend when the viewer is first launched, check the box in
 the <quote>Checked</quote> column.  If you
 would like the type to never be shown in the viewer, check the box in the
 <quote>Hidden</quote> column.  These
 settings only affect the Java Viewer, not the HTML viewer.</para>

 <para>When you are done editing, click the <quote>Save</quote>
 button.  This will save your choices to a
 file in the same directory as your AE descriptor.  From now on, when you view analysis results
 produced by this AE using the <quote>JV User Colors</quote> or <quote>HTML</quote>
 options, the viewer will be configured as you have specified.</para>

 </section>

 <section id="ugr.tools.doc_analyzer.interactive_mode">
   <title>Interactive Mode</title>


 <para>Interactive Mode allows you to analyze text that you type
 or cut-and-paste into the tool, rather than requiring that the documents be
 stored as files.</para>

 <para>In the main Document Analyzer window, you can invoke
 Interactive Mode by clicking the <quote>Interactive</quote> button instead of the
 <quote>Run</quote> button.  This will
 display a dialog that looks like this:


   <screenshot>
     <mediaobject>
       <imageobject>
         <imagedata width="5.5in" format="JPG" fileref="&imgroot;image010.jpg"/>
       </imageobject>
       <textobject><phrase>Invoking Interactive Mode</phrase></textobject>
     </mediaobject>
   </screenshot></para>

 <para>You can type or cut-and-paste your text into this window,
 then choose your Results Display Format and click the <quote>Analyze</quote>
 button.  Your AE will be run on the text
 that you supplied and the results will be displayed as usual.</para>


 </section>

   <section id="ugr.tools.doc_analyzer.view_mode">
     <title>View Mode</title>

 <para>If you have previously run an AE and saved its analysis
 results, you can use the Document Analyzer&apos;s View mode to view those results
 without re-running your analysis.  To do
 this, on the main Document Analyzer window, enter the location of your
 analyzed documents in the <quote>Output Directory</quote> field and click the
 <quote>View</quote> button.  You can then
 view your analysis results as described in Section
  <xref linkend="ugr.tools.doc_analyzer.viewing_results"/>.</para>

 </section>
   </chapter>