| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more contributor |
| license agreements. See the NOTICE.txt file distributed with this work for |
| additional information regarding copyright ownership. The ASF licenses this |
| file to you under the Apache License, Version 2.0 (the "License"); you may not |
| use this file except in compliance with the License. You may obtain a copy of |
| the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, WITHOUT |
| WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the |
| License for the specific language governing permissions and limitations under |
| the License. |
| --> |
| <document> |
| <properties> |
| <title>CAS-Metadata Basic User Guide</title> |
| <author email="woollard@jpl.nasa.gov">David Woollard</author> |
| </properties> |
| |
| <body> |
| <section name="Introduction"> |
| |
| <p>The purpose of this guide is to instruct the user in the basic concepts |
| behind the CAS-Metadata project, including the basics of metadata, how to write |
| metadata extractors, and explanations of existing metadata extractors. In |
| addition to this guide, some of these concepts are also covered in our |
| CAS-File Manager <a href="../../filemgr/user/">User Guide</a> |
| and in our CAS-Curator <a href="../../curator/user/basic.html"> |
| Basic User Guide</a>. For advanced topics, including extraction techniques that allow |
| for type-based matching, see our <a href="../user/advanced.html">Advanced Guide</a>. |
| In the rest of this guide, we will cover the following topics:</p> |
| |
| <ul> |
| <li><a href="#basics">Metadata Basics</a></li> |
| <li><a href="#extractors">Metadata Extractors</a></li> |
| <li><a href="#existing">Existing Implementations</a></li> |
| <li><a href="#filemgr">Relationship to CAS-Filemgr</a></li> |
| </ul> |
| </section> |
| |
| <a name="basics"/> |
| <section name="Metadata Basics"> |
| |
| <p>Metadata is <i>data about data</i>. Simply put, metadata is information about data |
| that aids the user in finding what they are looking for and clarifying what they are |
| looking at. There are many examples of metadata standards, including |
| <a href="http://dublincore.org/">Dublin Core</a> and ISO |
| <a href="http://metadata-stds.org/">Standards.</a></p> |
| |
| <p>Examples of metadata include filename, URL, data producer, start and stop datetime |
| for temporal files, bounding polygons for geo-referenced data, etc. In the CAS-Metadata |
| project, and in all of OODT, Metadata Objects are considered a container for |
| product-related metadata. Interfaces for the <code>Metadata</code> Object are shown |
| below:</p> |
| |
| <source><![CDATA[ |
| public class Metadata { |
| |
| public Metadata() {...} |
| |
| public Metadata(InputStream is) throws Exception {...} |
| |
| public void addMetadata(Hashtable metadata) {...} |
| |
| public void addMetadata(Hashtable metadata, boolean replace) {...} |
| |
| public void replaceMetadata(Hashtable metadata) {...} |
| |
| public void addMetadata(String key, String value) {...} |
| |
| public void addMetadata(String key, List values) {...} |
| |
| public void replaceMetadata(String key, List values) {...} |
| |
| public void replaceMetadata(String key, String value) {...} |
| |
| public Object removeMetadata(String key) {...} |
| |
| public List getAllMetadata(String key) {...} |
| |
| public String getMetadata(String key) {...} |
| |
| public Hashtable getHashtable() {...} |
| |
| public boolean containsKey(String key) {...} |
| |
| public boolean isMultiValued(String key) {...} |
| |
| public boolean equals(Object obj) {...} |
| |
| public Document toXML() throws Exception {...} |
| |
| private void parse(Document document) throws Exception {...} |
| } |
| ]]></source> |
| |
| <p>The CAS-Metadata <code>Metadata</code> object is a key-multivalue container. |
| Users can add metadata elements to the <code>Metadata</code> object via an |
| <code>InputStream</code>, a <code>Hashtable</code> object, a key with a list of values, or a simple |
| key and single value interface. All keys and values are represented as Strings. |
| See our <a href="../user/advanced.html">Advanced Guide</a> for more information |
| about the ramifications of this design decision for type-based metadata search |
| and comparison.</p> |
| |
| <p>In addition to the accessor and modifier methods that work with simple Strings, |
| the <code>Metadata</code> Object can work with XML Documents. An example of |
| metadata in the XML format is given below:</p> |
| <!-- FIXME: change namespace URI? --> |
| <source><![CDATA[<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas"> |
| <keyval> |
| <key>ProductType</key> |
| <val>MP3</val> |
| </keyval> |
| <keyval> |
| <key>Filename</key> |
| <val>blue_suede_shoes.mp3</val> |
| </keyval> |
| <keyval> |
| <key>Artist</key> |
| <val>The Beatles</val> |
| </keyval> |
| <keyval> |
| <key>Album</key> |
| <val>Revolver</val> |
| </keyval> |
| <keyval> |
| <key>SongName</key> |
| <val>Blue Suede Shoes</val> |
| </keyval> |
| </cas:metadata>]]></source> |
| |
| <p>The above metadata example has been extracted from an MP3 file. There are a |
| number of metadata elements, including the Artist, Album, and SongName, as well |
| as the product type (in this case, 'MP3'), and the name of the mp3 file. In |
| addition, metadata elements can be multivalued, in which case the |
| <code>&lt;keyval&gt;</code> element has multiple <code>&lt;val&gt;</code> |
| child elements.</p> |
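| |
| <p>To make the key-multivalue structure of this XML format concrete, the following is a minimal |
| sketch that reads <code>keyval</code> entries into a map of keys to value lists using only the JDK's |
| DOM parser. The class <code>MetXmlSketch</code>, its <code>parse</code> method, and the |
| <code>Genre</code> key are hypothetical names chosen for illustration. They are not part of the |
| CAS-Metadata API, and the <code>Metadata</code> class may parse documents differently.</p> |
| <source><![CDATA[ |
```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, NOT part of CAS-Metadata: reads <keyval> entries
// into a key-to-values map using only the JDK's DOM parser.
final class MetXmlSketch {

    static Map<String, List<String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> met = new LinkedHashMap<>();
            NodeList keyvals = doc.getElementsByTagName("keyval");
            for (int i = 0; i < keyvals.getLength(); i++) {
                Element keyval = (Element) keyvals.item(i);
                String key = keyval.getElementsByTagName("key").item(0).getTextContent().trim();
                // A key may carry several <val> children; collect them all in order.
                List<String> values = met.computeIfAbsent(key, k -> new ArrayList<>());
                NodeList vals = keyval.getElementsByTagName("val");
                for (int j = 0; j < vals.getLength(); j++) {
                    values.add(vals.item(j).getTextContent().trim());
                }
            }
            return met;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse metadata XML", e);
        }
    }

    public static void main(String[] args) {
        // "Genre" is an assumed multivalued key used only for this demonstration.
        String xml = "<cas:metadata xmlns:cas=\"http://oodt.jpl.nasa.gov/1.0/cas\">"
                + "<keyval><key>ProductType</key><val>MP3</val></keyval>"
                + "<keyval><key>Genre</key><val>Rock</val><val>Pop</val></keyval>"
                + "</cas:metadata>";
        System.out.println(parse(xml));
    }
}
```
| ]]></source> |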
| </section> |
| |
| |
| <a name="extractors"/> |
| <section name="Metadata Extractors"> |
| <p>The role of a metadata extractor is to extract metadata from one or more product |
| <i>types</i>. In order to extract metadata, the extractor must understand the product |
| type format, parse the product, and return metadata to be associated with the |
| product. CAS-Curator, for example, uses metadata extractors to generate metadata for |
| products in its staging area, both as a preview to the curator, and also during the |
| course of data ingestion.</p> |
| |
| |
| <subsection name="Java API"> |
| |
| <p>The CAS-Metadata project contains an interface class, |
| <code>org.apache.oodt.cas.metadata.MetExtractor</code>. This API consists of |
| two primary methods (with multiple method signatures each). This API can be seen |
| below:</p> |
| |
| <source><![CDATA[public interface MetExtractor { |
| |
| public Metadata extractMetadata(File f) |
| throws MetExtractionException; |
| |
| public Metadata extractMetadata(String filePath) |
| throws MetExtractionException; |
| |
| public Metadata extractMetadata(URL fileUrl) |
| throws MetExtractionException; |
| |
| public Metadata extractMetadata(File f, File |
| configFile) throws MetExtractionException; |
| |
| public Metadata extractMetadata(File f, String |
| configFilePath) throws MetExtractionException; |
| |
| public Metadata extractMetadata(File f, |
| MetExtractorConfig config) |
| throws MetExtractionException; |
| |
| public Metadata extractMetadata(URL fileUrl, |
| MetExtractorConfig config) |
| throws MetExtractionException; |
| |
| public void setConfigFile(File f) |
| throws MetExtractionException; |
| |
| public void setConfigFile(String filePath) |
| throws MetExtractionException; |
| |
| public void setConfigFile(MetExtractorConfig config); |
| }]]></source> |
| |
| <p>In order to implement a new extractor, a developer may implement the |
| <code>MetExtractor</code> interface, or develop a metadata extractor |
| that adheres to this interface in the development language of choice.</p> |
| |
| </subsection> |
| </section> |
| |
| <a name="existing"/> |
| <section name="Existing Implementations"> |
| |
| <p>The CAS-Metadata project contains a number of existing metadata |
| extractor implementations that the developer can directly leverage.</p> |
| |
| <subsection name="External Metadata Extractor"> |
| |
| <p>There are many situations in which developers are interested in using |
| a metadata extractor that is not written in Java. Perhaps there is an |
| existing extractor written in a different programming language to whose source |
| you do not have access, or perhaps there are functional or |
| non-functional requirements that make a different language more |
| appropriate.</p> |
| |
| <p>We have developed the <code>ExternMetExtractor</code> as part of the |
| CAS Metadata project to address this issue. The <code>ExternMetExtractor</code> |
| uses a configuration file to specify the extractor working directory, the path |
| to the executable, and any command-line arguments. An example of this configuration |
| file is given below:</p> |
| |
| <source><![CDATA[<?xml version="1.0" encoding="UTF-8"?> |
| <cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas"> |
| <exec workingDir=""> |
| <extractorBinPath envReplace="true">[PWD]/extractor</extractorBinPath> |
| <args> |
| <arg isDataFile="true"/> |
| <arg isPath="true">/usr/local/etc/testExtractor.config</arg> |
| </args> |
| </exec> |
| </cas:externextractor>]]></source> |
| |
| <p>There are a number of important elements in the external metadata extractor |
| configuration file, including the working directory (the <code>workingDir</code> |
| attribute on the <code>exec</code> tag), the path to the executable extractor |
| (the value of the <code>extractorBinPath</code> tag), and any arguments required |
| by the extractor (the values of the <code>arg</code> tags).</p> |
| |
| <p>The working directory is the directory in which the metadata file is to be |
| generated. When the <code>workingDir</code> attribute is left empty, as in the example above, |
| the working directory is assumed to be the directory in which the extractor is run.</p> |
| |
| <p>Command-line arguments are delivered to the external extractor in the order |
| they are listed in the configuration file. In other words,</p> |
| |
| <source><![CDATA[ |
| <args> |
| <arg>arg1</arg> |
| <arg>arg2</arg> |
| <arg>arg3</arg> |
| </args> |
| ]]></source> |
| |
| <p>would be passed to the extractor as <code>arg1 arg2 arg3</code>.</p> |
| |
| <p>Additionally, there are a number of specializations of the |
| <code>arg</code> tag that can be set with tag attributes. Specifically:</p> |
| |
| <ul> |
| <li><code>isDataFile="true"</code> - This attribute passes the full path to |
| the product from which metadata is to be extracted as the argument.</li> |
| <li><code>isPath="true"</code> - This attribute passes the argument encoded |
| as a properly formed path (no char-set replacement, etc).</li> |
| <li><code>envReplace="true"</code> - This attribute replaces any part of |
| the value of the argument that is inside brackets (<code>[</code> and |
| <code>]</code>) with the environment variable matching the text inside the |
| brackets, if such an environment variable exists.</li> |
| </ul> |
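| |
| <p>The bracket substitution performed by <code>envReplace="true"</code> can be illustrated with a |
| short, JDK-only sketch. The class <code>EnvReplaceSketch</code> and its <code>envReplace</code> |
| method are hypothetical and are not the <code>ExternMetExtractor</code> implementation. The sketch |
| also assumes that bracketed text with no matching environment variable is left unchanged.</p> |
| <source><![CDATA[ |
```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of [VAR] substitution; not the OODT implementation.
final class EnvReplaceSketch {

    // Matches text enclosed in square brackets, capturing the name inside.
    private static final Pattern BRACKETED = Pattern.compile("\\[([^\\]]+)\\]");

    static String envReplace(String value, Map<String, String> env) {
        Matcher m = BRACKETED.matcher(value);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(value, last, m.start());
            String replacement = env.get(m.group(1));
            // Assumption: keep the bracketed text as-is when no such variable exists.
            out.append(replacement != null ? replacement : m.group());
            last = m.end();
        }
        out.append(value, last, value.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Resolves [PWD] against the real process environment, if PWD is set.
        System.out.println(envReplace("[PWD]/extractor", System.getenv()));
    }
}
```
| ]]></source> |
| <p>Passing the environment in as a map keeps the sketch easy to exercise with fixed values.</p> |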
| |
| <p>For an example of the use of this type of metadata extractor, see our |
| CAS-Curator <a href="../../curator/user/basic.html"> |
| Basic User Guide</a>.</p> |
| |
| </subsection> |
| <subsection name="The Filename Token Metadata Extractor"> |
| |
| <p>In many cases, products that are to be ingested are named with metadata |
| that should be extracted from the product name and cataloged upon ingest. For |
| this type of situation, we have developed the |
| <code>FilenameTokenMetExtractor</code>. This extractor uses a configuration |
| file that specifies, for each metadata element, the index of the start |
| position in the name for this metadata and its character length.</p> |
| |
| <p>Below is an example configuration file used by the |
| <code>FilenameTokenMetExtractor</code>. It assumes a product name formatted |
| as follows:</p> |
| |
| <p><code>MissionName_Date_StartOrbitNumber_StopOrbitNumber.txt</code></p> |
| |
| <source><![CDATA[ |
| <input> |
| <group name="SubstringOffsetGroup"> |
| <vector name="MissionName"> |
| <element>1</element> |
| <element>11</element> |
| </vector> |
| <vector name="Date"> |
| <element>13</element> |
| <element>4</element> |
| </vector> |
| <vector name="StartOrbitNumber"> |
| <element>18</element> |
| <element>16</element> |
| </vector> |
| <vector name="StopOrbitNumber"> |
| <element>35</element> |
| <element>15</element> |
| </vector> |
| </group> |
| |
| <group name="CommonMetadata"> |
| <scalar name="DataVersion">1.0</scalar> |
| <scalar name="CollectionName">Test</scalar> |
| <scalar name="DataProvider">OODT</scalar> |
| </group> |
| </input> |
| ]]></source> |
| |
| <p>In this configuration, the <code>FilenameTokenMetExtractor</code> will |
| produce four metadata elements from the product name: <i>MissionName</i>, |
| <i>Date</i>, <i>StartOrbitNumber</i>, and <i>StopOrbitNumber</i>. The first |
| element of each of these vectors is the start index (this assumes 1-indexed |
| strings). The second element is the substring length.</p> |
| |
| <p>This configuration also specifies that the metadata for every product |
| contains three static, common metadata elements: |
| <i>DataVersion</i>, <i>CollectionName</i>, and <i>DataProvider</i>.</p> |
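| |
| <p>The fixed-offset extraction described above amounts to taking 1-indexed substrings of the |
| product name. The following is a minimal sketch of that arithmetic. The class |
| <code>FilenameTokenSketch</code> is hypothetical and does not read the XML configuration file. The |
| offsets are passed in directly, using the start and length values from the example configuration, |
| which line up with the literal template name shown earlier.</p> |
| <source><![CDATA[ |
```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of fixed-offset filename parsing; not the OODT class.
final class FilenameTokenSketch {

    // offsets maps an element name to {1-based start index, substring length}.
    static Map<String, String> extract(String productName, Map<String, int[]> offsets) {
        Map<String, String> met = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> entry : offsets.entrySet()) {
            int start = entry.getValue()[0] - 1;     // convert to Java's 0-based index
            int end = start + entry.getValue()[1];   // exclusive end index
            met.put(entry.getKey(), productName.substring(start, end));
        }
        return met;
    }

    public static void main(String[] args) {
        Map<String, int[]> offsets = new LinkedHashMap<>();
        offsets.put("MissionName", new int[] {1, 11});
        offsets.put("Date", new int[] {13, 4});
        offsets.put("StartOrbitNumber", new int[] {18, 16});
        offsets.put("StopOrbitNumber", new int[] {35, 15});
        System.out.println(
                extract("MissionName_Date_StartOrbitNumber_StopOrbitNumber.txt", offsets));
    }
}
```
| ]]></source> |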
| |
| </subsection> |
| |
| <subsection name="Product Type Pattern Metadata Extractor"> |
| <p>The <code>ProdTypePatternMetExtractor</code> can also be used to extract |
| metadata from the filename. Unlike the <code>FilenameTokenMetExtractor</code>, |
| metadata elements do not have to be fixed-offset and fixed-length. Instead, |
| metadata elements are represented by regular expressions. These elements are |
| used in filename templates that, when matched, dynamically determine the |
| ProductType of the file.</p> |
| |
| <p>Below is an example of a <code>product-type-patterns.xml</code> configuration |
| file used by the <code>ProdTypePatternMetExtractor:</code></p> |
| |
| <source><![CDATA[ |
| <config> |
| <!-- <element> MUST be defined before <product-type> so their patterns can be resolved --> |
| <!-- name MUST be an element defined in elements.xml (also only upper and lower case alpha chars) --> |
| <!-- regexp MUST be valid input to java.util.regex.Pattern.compile() --> |
| <element name="ISBN" regexp="[0-9]{10}"/> |
| <element name="Page" regexp="[0-9]*"/> |
| |
| <!-- name MUST be a ProductType name defined in product-types.xml --> |
| <!-- metadata elements inside brackets MUST be mapped to the ProductType, |
| as defined in product-type-element-map.xml --> |
| <product-type name="Book" template="book-[ISBN].txt"/> |
| <product-type name="BookPage" template="page-[ISBN]-[Page].txt"/> |
| </config> |
| ]]></source> |
| |
| <p>This file defines a regular expression for the "ISBN" metadata element, in this case, a 10-digit number. Also, the |
| "Page" metadata element is defined as a sequence of 0 or more digits.</p> |
| |
| <p>Next, the file defines a filename pattern for the "Book" product type. The pattern is compiled into a regular |
| expression, substituting the previously defined regexes as capture groups. For example, "book-[ISBN].txt" compiles to |
| "book-([0-9]{10}).txt", and the ISBN met element is assigned to capture group 1. When the filename matches this |
| pattern, 2 metadata assignments occur: (1) the ISBN met element is set to the matched regex group, and (2) the |
| ProductType met element is set to "Book".</p> |
| |
| <p>Similarly, the second pattern sets ISBN, Page, and ProductType for files matching "page-([0-9]{10})-([0-9]*).txt".</p> |
| |
| <p>This achieves several things:</p> |
| <ol> |
| <li>assigning met elements based on regular expressions</li> |
| <li>assigning product type based on easy-to-understand pattern with met elements clearly indicated</li> |
| <li>reuse of met element regular expressions</li> |
| </ol> |
| <p>See the <a href="../apidocs/org/apache/oodt/cas/metadata/extractors/ProdTypePatternMetExtractor.html">JavaDoc</a> |
| for more detailed information about using the <code>ProdTypePatternMetExtractor</code>.</p> |
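| |
| <p>The compilation of a filename template into a regular expression with capture groups, as described |
| in this subsection, can be sketched with <code>java.util.regex</code> alone. The class |
| <code>TemplatePatternSketch</code> and its <code>match</code> method are hypothetical and are not the |
| <code>ProdTypePatternMetExtractor</code> implementation. The sketch assumes that element regular |
| expressions contain no capture groups of their own and that the whole filename must match.</p> |
| <source><![CDATA[ |
```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of template-to-regex compilation; not the OODT class.
final class TemplatePatternSketch {

    // Element references in a template look like [Name] (alphabetic characters only).
    private static final Pattern TOKEN = Pattern.compile("\\[([A-Za-z]+)\\]");

    // Returns the extracted metadata, or null when the filename does not match.
    static Map<String, String> match(String productType, String template,
                                     Map<String, String> elementRegexps, String filename) {
        List<String> groupNames = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher token = TOKEN.matcher(template);
        int last = 0;
        while (token.find()) {
            String name = token.group(1);
            String elementRegexp = elementRegexps.get(name);
            if (elementRegexp == null) {
                throw new IllegalArgumentException("Undefined element: " + name);
            }
            // Copy the literal text, then substitute the element as a capture group.
            regex.append(template, last, token.start());
            regex.append('(').append(elementRegexp).append(')');
            groupNames.add(name);
            last = token.end();
        }
        regex.append(template, last, template.length());

        Matcher m = Pattern.compile(regex.toString()).matcher(filename);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> met = new LinkedHashMap<>();
        for (int i = 0; i < groupNames.size(); i++) {
            met.put(groupNames.get(i), m.group(i + 1)); // capture groups are 1-based
        }
        met.put("ProductType", productType);
        return met;
    }

    public static void main(String[] args) {
        Map<String, String> elements = new LinkedHashMap<>();
        elements.put("ISBN", "[0-9]{10}");
        elements.put("Page", "[0-9]*");
        // The filename below is an assumed example.
        System.out.println(match("BookPage", "page-[ISBN]-[Page].txt",
                elements, "page-0123456789-42.txt"));
    }
}
```
| ]]></source> |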
| </subsection> |
| |
| <subsection name="Metadata Reader Extractor"> |
| |
| <p>The <code>MetReaderExtractor</code>, part of the OODT CAS-Metadata project, |
| assumes that a metadata file with the naming convention "&lt;Product Name&gt;.met" |
| is present in the same directory as the product. This extractor further |
| assumes that the metadata is in the XML format shown in the <a href="#basics">Metadata Basics</a> section.</p> |
| |
| </subsection> |
| <subsection name="Copy And Rewrite Extractor"> |
| |
| <p>The <code>CopyAndRewriteExtractor</code> is a metadata extractor that, |
| like the <code>MetReaderExtractor</code>, assumes that a metadata file exists |
| for the product from which metadata is to be extracted. This extractor reads |
| in the original metadata file and replaces particular metadata values in that |
| metadata file.</p> |
| |
| <p>The <code>CopyAndRewriteExtractor</code> takes a configuration file in |
| Java properties format with the following properties defined:</p> |
| |
| <ul> |
| <li>numRewriteFields - The number of fields to rewrite within the original |
| metadata file.</li> |
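| <li>rewriteFieldN - The name of the Nth field to rewrite in the original |
| metadata file (<code>rewriteField1</code>, <code>rewriteField2</code>, and so on).</li> |
| <li>orig.met.file.path - The original path to the metadata file from which |
| to draw the original metadata fields.</li> |
| <li>fieldN.pattern - The string specification that details which fields to |
| replace and to use in building the new field value. In the example below, <code>fieldN</code> |
| is the field name given by <code>rewriteFieldN</code>, as in <code>ProductType.pattern</code>.</li> |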
| </ul> |
| |
| <p> An example of the configuration file is given below:</p> |
| |
| <source><![CDATA[ |
| numRewriteFields=2 |
| rewriteField1=ProductType |
| rewriteField2=FileLocation |
| orig.met.file.path=./src/resources/examples/samplemet.xml |
| ProductType.pattern=NewProductType[ProductType] |
| FileLocation.pattern=/new/loc/[FileLocation] |
| ]]></source> |
| |
| <p>In this example configuration, two metadata elements will be rewritten, |
| <i>ProductType</i> and <i>FileLocation</i>. The original metadata file is |
| located at <code>./src/resources/examples/samplemet.xml</code>. The |
| ProductType will be rewritten as <code>NewProductType&lt;original ProductType |
| value&gt;</code>. Following the same pattern syntax, the FileLocation will be rewritten as |
| <code>/new/loc/&lt;original FileLocation value&gt;</code>.</p> |
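| |
| <p>As a rough illustration of how such a properties file could drive the rewrite, the sketch below |
| loads the configuration with <code>java.util.Properties</code> and substitutes bracketed field names |
| with their original values. The class <code>CopyAndRewriteSketch</code> is hypothetical and is not |
| the <code>CopyAndRewriteExtractor</code> implementation. It takes the original metadata as a map |
| rather than reading <code>orig.met.file.path</code>, and the original values <code>MP3</code> and |
| <code>archive</code> are assumed for the demonstration.</p> |
| <source><![CDATA[ |
```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration of pattern-based field rewriting; not the OODT class.
final class CopyAndRewriteSketch {

    static Map<String, String> rewrite(String configText, Map<String, String> original) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(configText));
        } catch (IOException e) {
            throw new IllegalArgumentException("Unreadable configuration", e);
        }
        // Start from a copy so fields that are not rewritten keep their values.
        Map<String, String> rewritten = new LinkedHashMap<>(original);
        int count = Integer.parseInt(config.getProperty("numRewriteFields", "0").trim());
        for (int i = 1; i <= count; i++) {
            String field = config.getProperty("rewriteField" + i);
            if (field == null) {
                continue; // assumption: silently skip missing entries
            }
            String pattern = config.getProperty(field + ".pattern");
            if (pattern == null) {
                continue;
            }
            // Replace each [FieldName] in the pattern with that field's original value.
            String value = pattern;
            for (Map.Entry<String, String> entry : original.entrySet()) {
                value = value.replace("[" + entry.getKey() + "]", entry.getValue());
            }
            rewritten.put(field, value);
        }
        return rewritten;
    }

    public static void main(String[] args) {
        String config = "numRewriteFields=2\n"
                + "rewriteField1=ProductType\n"
                + "rewriteField2=FileLocation\n"
                + "ProductType.pattern=NewProductType[ProductType]\n"
                + "FileLocation.pattern=/new/loc/[FileLocation]\n";
        Map<String, String> original = new LinkedHashMap<>();
        original.put("ProductType", "MP3");       // assumed original value
        original.put("FileLocation", "archive");  // assumed original value
        System.out.println(rewrite(config, original));
    }
}
```
| ]]></source> |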
| |
| </subsection> |
| </section> |
| |
| <a name="filemgr"/> |
| <section name="Relationship to CAS-Filemgr"> |
| |
| <p>The most common use-case of CAS-Metadata is to capture the output of a metadata |
| extractor for use in the CAS-Filemgr's ingestion process.</p> |
| |
| <p><img src="../images/metadata.jpg"/></p> |
| |
| <p>In the above diagram, a metadata object is produced by an extractor. The |
| product and its associated metadata are both ingested into the CAS-Filemgr. |
| The metadata will go into the Filemgr's metadata catalog and the product will |
| go to the archive.</p> |
| |
| <p>Because metadata extractors and the CAS-Filemgr are not tightly coupled, |
| there are a number of implicit design assumptions that affect how you design |
| metadata extractors in this use-case. For example, CAS-Filemgr differentiates |
| between products based on product <i>type</i>. File type and product type are |
| not necessarily identical, so you should write extractors to produce |
| metadata specific to product types (see the <a href="../user/advanced.html"> |
| Advanced Guide</a> for information on mime-type detection).</p> |
| |
| <p>Along the same lines, remember that there is no mechanism to enforce that the |
| metadata extracted for a particular product type is ingested into the |
| Filemgr's catalog. The command-line ingest client for the CAS-Filemgr is given |
| below (note that the command-line interface and the API are equivalent):</p> |
| |
| <source><![CDATA[ |
| filemgr-client --url <url to xml rpc service> --operation \ |
| --ingestProduct --productName <name> --productStructure <Hierarchical|Flat> \ |
| --productTypeName <name of product type> --metadataFile <file> \ |
| [--clientTransfer --dataTransfer <java class name of data transfer factory>] \ |
| --refs <ref1>...<refn> |
| ]]></source> |
| |
| <p>In the above interface, the important feature to note is that the user |
| supplies not only the product, but also the metadata file (or Metadata |
| Object in the case of the API), the Product Name, Structure and Type |
| <i>on ingest</i>. Since each of these pieces of information is independent, |
| it is the user's responsibility to maintain consistency of the product type |
| metadata between the extraction process and the ingest process.</p> |
| |
| </section> |
| |
| <section name="Conclusion"> |
| <p>This is intended to be a basic guide for users of the CAS-Metadata project. We |
| have purposely omitted the discussion of metadata standards, though we strongly |
| encourage you to investigate the role of standards and ontology in your |
| particular application. In our <a href="../user/advanced.html">Advanced Guide</a>, |
| we cover more topics regarding the nuances of metadata extraction, including |
| the impact of String-based representation on metadata element comparisons.</p> |
| </section> |
| |
| </body> |
| </document> |