blob: 65195d4a36d93e4fcb199d3c040567c0775224a0 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE.txt file distributed with this work for
additional information regarding copyright ownership. The ASF licenses this
file to you under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.
<title>Setting Up the CAS-Curator</title>
<author email="">David Woollard</author>
<section name="Introduction">
<p>This document serves as a basic user's guide for the CAS-Curator
project. The goal of the document is to allow users to check out,
build, and install the base version of the CAS-Curator, as well
as perform basic configuration tasks. For advanced topics, such
as customizing the look and feel of the CAS-Curator for your
project, please see our <a href="../user/advanced.html">Advanced
<p>The remainder of this guide is separated into the following
<li><a href="#section1">Download and Build</a></li>
<li><a href="#section2">Tomcat Deployment</a></li>
<li><a href="#section3">Staging Area Setup</a></li>
<li><a href="#section4">Extractor Setup</a></li>
<li><a href="#section5">File Manager Configuration</a></li>
<a name="section1"/>
<section name="Download And Build">
<p>The most recent CAS-Curator project can be downloaded from
the OODT <a href="">website</a> or it can
be checked out from the OODT repository using Subversion. The
We recommend checking
out the latest released version (v1.0.0 at the time of writing).
<p>Maven is the build management system used for OODT projects. We
currently support Maven 2.0 and later. For more information on
Maven, see our <a href="../development/maven.html">Maven Guide.</a>
<p>Assuming a *nix-like environment, with both Maven and Subversion
clients installed and on your path, an example of the checkout and
build process is presented below:</p>
> mkdir /usr/local/src
> cd /usr/local/src
> svn checkout http://oodt/repo/cas-curator/tags/1_0_0_release \
<p>After the Subversion command completes, you will have the source
for the CAS-Curator project in the <code>/usr/local/src/cas-curator-v1.0.0</code>
<p>In order to build the WAR (Web ARchive) file from this source,
issue the following commands:</p>
> cd /usr/local/src/cas-curator-v1.0.0
> mvn package
<p>Once the Maven command completes successfully, you should have a
<code>target</code> directory under <code>cas-curator-v1.0.0/</code>. The
WAR file, called <code>cas-curator-1.0.0.war</code>, can be found under
<p>In the next section, we will discuss deploying this WAR file to
a Tomcat instance.</p>
<a name="section2"/>
<section name="Tomcat Deployment">
<p>Once you have built a war file, it is necessary to deploy the web
application using a servlet container such as
<a href="">Tomcat</a> or
<a href="">Jetty</a>. For the purposes of
this guide, we will assume that you are using Tomcat. Tomcat can be
installed in a user account or at the system level. The base configuration
launches a web server on port 8080. You can learn more about Tomcat and
download the latest release from their
<a href="">website</a>. NOTE: There are two
concurrent versions of Tomcat: 5.5.X and 6.0.X. CAS-Curator is compatible
with both versions.</p>
<p>We will assume that you have downloaded Tomcat to an appropriate
directory, are using the default configuration, and have taken the
appropriate steps to allow access to port 8080. See your System
Administrator is you have any questions about firewall security and policy
regarding port access. We will further assume that you have set an
environment variable, <code>$TOMCAT_HOME</code>, to the base directory
of your Tomcat installation.</p>
<p>There are a number of ways to deploy a WAR file to Tomcat, though we
recommend using a context file. A context file is a XML file that provides
Tomcat with "context" for using a particular web application. In order to
create a context file for the CAS-Curator, open your favorite text editor
and copy and paste the following:</p>
<source><![CDATA[<Context path="/my-curator"
<Parameter name=""
<Parameter name="org.apache.oodt.cas.curator.projectName"
value="My Project"/>
<p>Save the context file to
<code>$TOMCAT_HOME/conf/Catalina/localhost/my-curator.xml</code>. Now you
can point a web browser to <a href="http://localhost:8080/my-curator/">
http://localhost:8080/my-curator</a> and you should see a log-in screen
for CAS-Curator. <em>Note</em>: Tomcat will only use the path attribute
if the context is defined in server.xml. Tomcat uses the xml file name
instead. See the
<a href="" class="externalLink">
Tomcat documentation</a> for further information</p>
<img src="../images/basic_login.jpg"/>
<p>The <code></code> parameter
that we set in the context file configures the CAS-Curator for a "dummy"
log-in to its Single Sign On service. Because of this, we are able to
log into the web application with a blank user name and a blank password.
For help in implementing security with CAS-Curator, see our
<a href="../user/advanced.html">Advanced Guide.</a></p>
<img src="../images/basic_page.jpg"/>
<p>In the next sections, we will talk about setting up staging areas,
metadata extractors, and launching a CAS-Filemgr instance into which
CAS-Curator will ingest data products.</p>
<a name="section3"/>
<section name="Staging Area Setup">
<p>Staging areas are directories on your local machine that hold data
products to be curated. The staging area can have arbitrary structure.
The only requirement that CAS-Curator has with regard to this structure
is that the directory structure be mirrored in a metadata generation
area. This generation area is used by CAS-Curator to create metadata
files to associate with data products.</p>
<p>For example, if there is a product, say an MP3 file of Bach's <i>Der
Geist hilft unsrer Schwachheit auf</i>, in the staging area at:</p>
<p>Then the CAS-Curator will generate all associated metadata products
in <code>[metadata_gen_base]/audio/classical/bach/</code>.</p>
<p>In order to set up the staging area and the metadata generation area,
we first create base directories for each, shown below:</p>
> mkdir /usr/local/staging
> mkdir /usr/local/staging/products
> mkdir /usr/local/staging/metadata
<p>Next, we will set the following parameters in the CAS-Curator context file:</p>
<source><![CDATA[<Parameter name="org.apache.oodt.cas.curator.stagingAreaPath"
<Parameter name="org.apache.oodt.cas.curator.metAreaPath"
<Parameter name="org.apache.oodt.cas.curator.metExtension"
<p>The <code>org.apache.oodt.cas.curator.stagingAreaPath</code> parameter should
be set to the product staging area and the
<code>org.apache.oodt.cas.curator.metAreaPath</code> should be set to the metedata
generation area. Additionally, we specified the parameter
<code>org.apache.oodt.cas.curator.metExtension</code> to be <code>.met</code>.
This parameter specifies the extension for all of the metadata files produced in
the metadata generation area.</p>
<p>For illustrative purposes, we will load an mp3 file into the staging area:</p>
> mkdir /usr/local/staging/products/mp3
> cd /usr/local/staging/products/mp3
> curl -LO
<p>We should note that this music file was produced by the
<a href="">Fulda Symphonic
Orchestra</a> and is freely distributed under the
<a href="">EFF Open Audio License</a>, version 1.0. We
have edited the ID3 tag of this file (in order to make the later metadata extraction
example more interesting), but original authorship is retained. Now back to the
<p>Remember that we need to mirror the product staging area and the metadata
generation area, so will also need to create the matching directory structure
> mkdir /usr/local/staging/metadata/mp3
<p>Once you restart Tomcat, the changes you have made to the context file will be
used. The staging area will now be set to <code>/usr/local/staging/products</code>.
See the screenshot below:</p>
<img src="../images/basic_staging.jpg"/>
<p>Double-clicking on "mp3", we can see that the staging area path in the top left
is now <code>/mp3</code> and <code>Bach-SuiteNo2.mp3</code> can be seen the main
left staging pane. For the time-being, there is no metadata detected (as reported
in the main right staging pane), but in the next section, we will be setting up a
basic, command-line metadata extractor in order to show how extractors are
integrated into CAS-Curator.</p>
<a name="section4"/>
<section name="Extractor Setup">
<p>The CAS-Curator uses ancillary programs called metadata extractors to produce
the metadata that it associates with products. More information about metadata
extractors can be found in the
<a href="../../metadata/user/extractorBasics.html">
Extractor Basics</a> User's Guide.</p>
<p>Like the staging area, we first need to set up an area in the file system for
metadata extractors. We will call this directory <code>extractors</code>:</p>
> mkdir /usr/local/extractors
<p>In order to register the metadata extractor path with the CAS-Curator, we will
need to add another parameter to the web application's context file. Add the
following parameter:</p>
<source><![CDATA[<Parameter name="org.apache.oodt.cas.curator.metExtractorConf.uploadPath"
value="/usr/local/extractors" />
<p>We are going to make a metadata extractor that will extractor ID3 tag metadata,
such as author, title, resource type, etc from mp3s. As a first step, we will create
a directory for the new extractor. The name of this directory is important, because
CAS-Curator will use the directory name to register the extractor. We will name this
directory <code>mp3extractor</code></p>
> mkdir /usr/local/extractors/mp3extractor
<p>While we could write a custom extractor in Java for the Cas-Curator, there are
multiple existing software packages that read mp3 ID3 tags. For these situations,
where an external, command-line extractor exists, we have developed the
<code>ExternMetExtractor</code> class in the CAS-Metadata project.</p>
<p>For this example, we are going to leaverage an existing, open source mime-type
detector with text and metadata parsing capabilities called
<a href="">Apache Tika</a>. Tika parses a number of
different common data formats, including a number of audio formats like mp3.
I'll leave it to the reader of this guide to download and install Tika. We
will assume that the latest release of the tika-app jar is in the
<code>mp3extractor</code> directory.</p>
<p>We have a little work to do to convert the output of Tika into a metadata file
compatible with CAS-Curator. By default, Tika produces metadata in a "key: value"
format as shown in the command-line session below:</p>
> java -jar tika-app-0.5-SNAPSHOT.jar -m \
Author: Johann Sebastian Bach
Content-Type: audio/mpeg
resourceName: Bach-SuiteNo2.mp3
title: Bach Cello Suite No 2
<p>With a little AWK magic, we can convert this output to the Cas-Metadata xml
<!-- FIXME: change namespace URI? -->
> java -jar tika-app-0.5-SNAPSHOT.jar -m \
/usr/local/staging/products/mp3/Bach-SuiteNo2.mp3 | awk -F:\
{print "<cas:metadata xmlns:cas=\"\">"}\
{print "<keyval><key>"$1"</key><val>"substr($2,2)"</val></keyval>"}\
END {print "</cas:metadata>"}'
<cas:metadata xmlns:cas="">
<keyval><key>Author</key><val>Johann Sebastian Bach</val></keyval>
<keyval><key>title</key><val>Bach Cello Suite No 2</val></keyval>
<p>Cool as a one line format translater is, we are actually going to have to
do a little more work to create an extractor capable of producing metadata
for CAS-Curator. A requirement for metadata extractors that are to be integrated
with CAS-Curator is that they product three pieces of metadata:</p>
<p>We should note that this is NOT a general requirement of all metadata
extractors, but a ramification of the current implementation of CAS-Curator.
In order to product this extra metadata, we will develop a small Python
import os
import sys
fullPath = sys.argv[1]
pathElements = fullPath.split("/");
fileName = pathElements[len(pathElements)-1]
fileLocation = fullPath[:(len(fullPath)-len(fileName))]
productType = "MP3"
cmd = "java -jar /Users/woollard/Desktop/extractors/mp3extractor/"
cmd += "tika-app-0.5-SNAPSHOT.jar -m "+fullPath+" | awk -F:"
cmd += " 'BEGIN {print \"<cas:metadata xmlns:cas="
cmd += "\\\"\\\">\"}"
cmd += " {print \"<keyval><key>\"$1\"</key><val>\"substr($2,2)\""
cmd += "</val></keyval>\"}' > "+fileName+".met"
f = open(fileName+".met", 'a')
<p>We'll assume that you have Python installed at <code>/usr/bin/python</code>
and you have named this script <code></code> and placed
it in <code>/usr/local/extractors/mp3extractor</code>. We'll need
to make sure it is executable from the command-line:</p>
> cd /usr/local/extractors/mp3extractor
> chmod +x
> ./ \
<cas:metadata xmlns:cas="">
<keyval><key>Author</key><val>Johann Sebastian Bach</val></keyval>
<keyval><key>title</key><val>Bach Cello Suite No 2</val></keyval>
<p>Now that we have a metadata extractor that meets our requirements (it's
callable from the command-line, it produces CAS-Metadata compatible XML, and
it extracts <i>ProductType</i>, <i>Filename</i>, and <i>FileLocation</i>),
the next step is to create an <code>ExternMetExtractor</code> configuration
file. This file will configure CAS-Metadata's <code>ExternMetExtractor</code>
to call the <code></code> script correctly.</p>
<p>There is more information about <code>ExternMetExtractor</code>
configuration available in CAS-Metadata's
<a href="">
Extractor Basics</a> User's Guide. For the purposes of this guide, we will
assume that the reader is familiar with configuration of this extractor, so we
will just present the configuration below (we assume that you name this file
<?xml version="1.0" encoding="UTF-8"?>
<cas:externextractor xmlns:cas="">
<exec workingDir="">
<arg isDataFile="true"/>
<p>The last step in configuring our mp3 metadata extractor is to provide a
properties file for CAS-Curator so that it knows how to call the
<code>ExternMetExtractor</code>. Each extractor used by CAS-Curator needs
a <code></code> file. This file sets two properties:</p>
<p>Create a <code></code> file (this name is important for
CAS-Curator to pick up the cofiguration) in the
<code>/usr/local/extractors/mp3extractor</code> directory. This file should
consist of the following parameters:</p>
<p>To recap, we first created a Python script that calls
<a href="">Apache Tika</a> to extract metadata
from mp3 files. Then we created a configuration file that configures
CAS-Metadata's <code>ExternMetExtractor</code> to call this python script.
Finally, we created a properties file for the CAS-Curator to call the
<code>ExternMetExtractor</code>. To confirm the configuration of this
extractor, we can long list the extractor directory:</p>
> cd /usr/local/extractors/mp3extractor
> ls -l
total 51448
-rw-r--r-- 1 - - 167 Nov 27 13:50
-rw-r--r-- 1 - - 328 Nov 27 13:49 mp3PythonExtractor.config
-rwxr-xr-x 1 - - 702 Nov 27 13:49
-rw-r--r-- 1 - - 26325155 Nov 27 13:46 tika-app-0.5-SNAPSHOT.jar
<p>Once you restart Tomcat, the change you have made to the context file will be
used. The extractor area will now be set to <code>/usr/local/extractors</code>.
See the screenshot below:</p>
<img src="../images/basic_extractor.jpg"/>
<p>In the above screenshot, we see that, upon clicking on the mp3 file,
metadata produced by the <code>mp3extractor</code> is shown in the main right
staging pane. Now staging and extraction are set up. In the next section, we
will set up a CAS-Filemgr instance and show how CAS-Curator can be used to
ingest products.</p>
<a name="section5"/>
<section name="File Manager Configuration">
<p>The final step in our basic configuration of CAS-Curator is to configure a
CAS-Filemgr instance into which we will ingest our mp3s. There is a lot of
information on configuring the CAS-Filemgr in its
<a href="../../filemgr/user/">User's Guide</a>. We will
assume familiarity with the CAS-Filemgr for the remainder of this guide.</p>
<p>In this guide, we will focus on the basic configuration necessary to tailor
a vanilla build of the CAS-Filemgr for use with our CAS-Curator. We will assume
that you have built the latest release of the CAS-Filemgr (v1.8.0 at the time of
this writing) and installed it at:</p>
<p>The first step in configuring the CAS-Filemgr is to edit the
<code></code> file in the <code>etc</code> directory. This
file controls the basic configuration of the CAS-Filemgr, including its
various extension points. For this example, we are going to run the CAS-Filemgr
in a very basic configuration, with both its repository and validation layer
controlled by XML configuration, a local data transfer factory, and a
<a href="">Lucene</a>-based metadata
<p>In order to create this configuration, we will change the following
parameters in the <code></code> file:</p>
<li>Set <code>org.apache.oodt.cas.filemgr.catalog.lucene.idxPath</code>
to <code>/usr/local/src/cas-filemgr-1.8.0/catalog</code>. This parameter
tells CAS-Filemgr where to create the Lucene index. The first time you start
the CAS-Filemgr, make sure that this file does NOT exist. The CAS-Filemgr
will take care of creating it and populating it with the appropriate files.
<li>Set <code>org.apache.oodt.cas.filemgr.repositorymgr.dirs</code> to
<code>file:///usr/local/src/cas-filemgr-1.8.0/policy/mp3</code>. The value needs
to be a URL and we are pointing to a policy folder we will create.</li>
<li>Set <code>org.apache.oodt.cas.filemgr.validation.dirs</code> to
<code>file:///usr/local/src/cas-filemgr-1.8.0/policy/mp3</code>. Like the last
parameter we configured, this parameter should be a URL and point to the
same policy folder.</li>
<p>With these changes, you are ready to run the basic configuration of the
CAS-Filemgr. In order to make this install of CAS-Filemgr work with our
CAS-Curator, however, we will also need to augment the basic policy for both
the repository manager and validation layer.</p>
<p>First, we will create a policy directory for our mp3 curator. We can do this
by moving the current policy files from the base <code>policy</code> directory to
a <code>mp3</code> directory:</p>
> cd /usr/local/src/cas-filemgr-1.8.0/policy
> mkdir mp3
> mv *.xml mp3/
<p>Next, we will add a product type to our instance of the CAS-Filemgr. In order
to do this, we will edit the <code>product-types.xml</code> file in the
<code>policy/mp3</code> directory. We will add the following as a child of the
<code>&lt;cas:producttypes&gt;</code> node (we purposefully elide any
commentary on the details of this configuration and leave it to the
<type id="urn:example:MP3" name="MP3">
<repository path="file:///usr/local/archive"/>
<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>
<description>A product type for mp3 audio files.</description>
<property name="nsAware" value="true" />
<property name="elementNs" value="CAS" />
<property name="elements"
value="ProductReceivedTime,ProductName,ProductId" />
<p>Next, we will create a number of elements in the <code>elements.xml</code>
file. There will be an element node for each of the metadata elements we
want to associate with MP3 products. We can do this be adding the following
as children nodes of <code>&lt;cas:elements&gt;</code> tag:</p>
<element id="urn:example:FileLocation" name="FileLocation">
<element id="urn:example:ProductType" name="ProductType">
<element id="urn:example:Author" name="Author">
<element id="urn:example:Filename" name="Filename">
<element id="urn:example:resourceName" name="resourceName">
<element id="urn:example:title" name="title">
<element id="urn:example:Content-Type" name="tContent-Type">
<p>After we have configured the new metadata elements, we will need to map
these elements to our MP3 product. We do this by editing the
<code>product-type-element-map.xml</code> file in the <code>policy/mp3</code>
directory to add the following as a child node to
<type id="urn:example:MP3">
<element id="urn:example:FileLocation"/>
<element id="urn:example:ProductType"/>
<element id="urn:example:Author"/>
<element id="urn:example:Filename"/>
<element id="urn:example:resourceName"/>
<element id="urn:example:title"/>
<element id="urn:example:Content-Type"/>
<p>A final configuration step will be to create the archive area for the
CAS-Filemgr (You'll remember that we set the repository path for MP3 products
in the <code>product-types.xml</code> file). In order to do this, we will just
make the directory:</p>
> mkdir /usr/local/archive
<p>We will now start the CAS-Filemgr instance. This instance will run on
port 9000 by default. In order to start the Filemgr, we will issue the
following commands:</p>
> cd /usr/local/src/cas-filemgr-1.8.0/bin
> ./filemgr start
<p>Now that we have started the CAS-Filemgr, we will need to configure the
CAS-Curator to use this Filemgr instance. In order to do this, we will add
the following parameters to the CAS-Curator context file:</p>
<Parameter name=""
<Parameter name="org.apache.oodt.cas.curator.dataDefinition.uploadPath"
value="/usr/local/src/cas-filemgr-1.8.0/policy" />
<Parameter name="org.apache.oodt.cas.curator.fmProps"
<p>Once we restart Tomcat, the CAS-Curator will now recognize the policy
and properties of the configured CAS-Filemgr instance and use this
instance during the ingest process.</p>
<img src="../images/basic_filemgr.jpg"/>
<p>From the above image, you can see that the CAS-Filemgr configuration
has been picked up by CAS-Curator. If you double-click on MP3 in the left
filemgr main pane, you will see the product types that are contained in
the mp3 policy: <code>GenericFile</code> which was part of the default
configuration, and <code>MP3</code> which we added. Clicking on MP3,
we bring up the ingest interface in the right filemgr main pane.</p>
<img src="../images/basic_ingest.jpg"/>
<p>Once we drag the Bach-SuiteNo2.mp3 from the staging pane to the green
box in the right filemgr main pane, we can then select a metadata extractor
from the pulldown menu and click on the "Save as Ingestion Task." This will
add the Ingest task to the bottom pane as illustrated in the above
screenshot. In order to test file ingestion, we will click on the "Start"
<p>As a final step, we will confirm that the mp3 file was archived. We
can do this by listing the archive:</p>
> ls -lR /usr/local/archive
total 0
drwxr-xr-x 3 - - 102 Nov 27 23:53 Bach-SuiteNo2.mp3
total 9344
-rw-r--r-- 1 - - 4781079 Nov 25 20:14 Bach-SuiteNo2.mp3
<p>Worth noting is the fact that our configuration of the CAS-Filemgr
included a selection of the <code>BasicVersioner</code> as the MP3
product type versioner. This means that mp3s are placed at
[archive_base]/[filename]/[filename] during ingest.</p>
<p>We have now completed a base configuration of the CAS-Curator. In
the <a href="../user/advanced.html">Advanced Guide</a>, we will cover
topics like changing the look and feel of the Curator, and security