<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE.txt file distributed with this work for
additional information regarding copyright ownership. The ASF licenses this
file to you under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.
-->
<document>
<properties>
<title>CAS-Metadata Advanced User Guide</title>
<author email="woollard@jpl.nasa.gov">David Woollard</author>
</properties>
<body>
<section name="Introduction">
<p>The purpose of this guide is to introduce some advanced concepts
within the CAS-Metadata project, including the ramifications of metadata
extraction for repository search, type checking, etc., and the use
of MIME-type detection. For basic topics, including the basics of metadata, how
to write metadata extractors, and explanations of existing metadata extractors,
see our <a href="../user/basic.html">Basic Guide</a>. In the rest of this guide,
we will cover the following topics:</p>
<ul>
<li><a href="#search">Planning Metadata for Search</a></li>
<li><a href="#provenance">Planning Metadata for Provenance</a></li>
<li><a href="#mime">MIME-type Detection</a></li>
</ul>
</section>
<a name="search"/>
<section name="Planning Metadata for Search">
<p>As discussed in the <a href="../user/basic.html">Basic Guide</a>, one of the
primary uses of metadata is search. When you consider search, remember that
neither the CAS-Metadata container nor the CAS-Filemgr knows
how people will search your data; you do.</p>
<p>We recommend developing a data dictionary as a "best practice." The
<a href="http://portal.acm.org/citation.cfm?id=541721">IBM Dictionary of
Computing</a> describes a data dictionary as a "centralized repository of
information about data such as meaning, relationships to other data, origin,
usage, and format." This concept is closely related to that of an ontology.</p>
<p>Understanding the attributes of, and relationships between, products in your
system will help you not only to develop appropriate product types but also to
identify the metadata you will need to extract from products to establish these
attributes and relationships.</p>
<p>In the next subsections, we will discuss specific aspects of metadata
extraction that impact downstream search.</p>
<subsection name="Missing Elements">
<p>Because metadata extraction is a separate activity from ingestion (as
discussed in the <a href="../user/basic.html">Basic Guide</a>), it is possible
that there is a mismatch between the metadata elements extracted by an
extractor and the metadata elements that the CAS-Filemgr associates with
a product type. CAS-Filemgr, therefore, only ingests the <i>intersection</i>
of the metadata extracted from the product and the metadata associated with
the product type in CAS-Filemgr configuration. This means that missing elements
are possible: an element that is configured for the product type but not
produced by the extractor is not ingested.</p>
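<p>The intersection behavior described above can be sketched in plain Java. This
is an illustration only, not CAS-Filemgr code: the class
<code>ElementIntersection</code>, the method <code>ingestible</code>, and the
element names are hypothetical, and metadata is modeled as a simple string map
rather than the CAS-Metadata container.</p>
<source><![CDATA[
```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ElementIntersection {
    /**
     * Keeps only the extracted elements whose names also appear in the
     * (hypothetical) set of elements configured for the product type.
     */
    public static Map<String, String> ingestible(Map<String, String> extracted,
                                                 Set<String> configured) {
        // Copy first so the caller's map is left untouched.
        Map<String, String> kept = new LinkedHashMap<>(extracted);
        kept.keySet().retainAll(configured);
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("Filename", "granule.dat");
        extracted.put("Instrument", "XYZ");

        Set<String> configured =
            new HashSet<>(Arrays.asList("Filename", "StartDateTime"));

        // "Instrument" is extracted but not configured, so it is dropped.
        // "StartDateTime" is configured but not extracted, so it is missing.
        System.out.println(ingestible(extracted, configured));
    }
}
```
]]></source>
<p>Comparing the configured element names against the keys of the returned map
is one way to spot missing elements before they surprise a downstream
search.</p>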
</subsection>
<subsection name="String-based Comparison">
<p>Metadata values are stored as strings in CAS-Metadata. While there are a
number of good reasons for this, it is a design decision with
important ramifications for search. Specifically, all metadata elements should
be comparable. Strings are, of course, comparable, but without some forethought, a
string-based comparison can behave differently from a type-based comparison.</p>
<p>This is where standards come into play. For various types, there are standard
string-based representations that ensure that comparisons behave identically to
type-based comparisons. There is currently no plan to enforce these
standards (and, depending on the particular type, the application domain, etc.,
there might be multiple competing standards, e.g., TAI- vs. UTC-formatted time
strings).</p>
<p>Some example string representation standards by type:</p>
<p><strong>Date-Time</strong></p>
<p>With time, consistency is key. There are multiple formats, such as
<a href="http://tycho.usno.navy.mil/mjd.html">Julian Day Numbers</a> or
calendar date-time strings. Additionally, there are different time standards, such as
<a href="http://www.w3.org/TR/NOTE-datetime">UTC</a>, local time, and
<a href="http://www.bipm.org/en/scientific/tai/">TAI</a>. One also needs to be
aware of leap second observance and local time conventions such as daylight saving
time, depending on the representation and standard selected.</p>
<p>Additionally, it is important to remember that inter-product consistency can be
just as important as intra-product consistency because there are many downstream
use cases of the search features of CAS-Filemgr and other CAS components that
involve cross-product comparisons.</p>
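<p>As one possible convention, an extractor could emit every timestamp as a
fixed-width UTC string. The sketch below assumes Java 8 or later and four-digit
years. The class <code>SortableTimestamps</code> and the method
<code>toSortableString</code> are hypothetical names, not part of CAS-Metadata,
and the pattern shown is only one of several reasonable choices.</p>
<source><![CDATA[
```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class SortableTimestamps {
    // Fixed-width UTC pattern. For four-digit years every field is
    // zero-padded, so the strings sort in the same order as the instants.
    private static final DateTimeFormatter UTC_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
                         .withZone(ZoneOffset.UTC);

    /** Formats an instant as a UTC string suitable for string comparison. */
    public static String toSortableString(Instant instant) {
        return UTC_FORMAT.format(instant);
    }

    public static void main(String[] args) {
        Instant earlier = Instant.parse("2009-03-05T08:15:00Z");
        Instant later = Instant.parse("2010-11-20T17:45:30Z");

        String a = toSortableString(earlier);
        String b = toSortableString(later);

        // The string comparison agrees with the chronological order.
        System.out.println(a + " sorts before " + b + ": " + (a.compareTo(b) < 0));
    }
}
```
]]></source>
<p>Using the same convention in every extractor keeps values comparable across
products as well as within a single product.</p>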
<p><strong>Integers</strong></p>
<p>Integers can be easily represented as comparable strings, but you must remember
to pad them to a common width. The string "1" is greater than "01234", but
"00001" is less than "01234".</p>
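<p>A minimal sketch of zero-padding in Java follows. The class
<code>PaddedIntegers</code> and the helper <code>pad</code> are hypothetical,
the width of 5 is an arbitrary choice for the example, and negative values are
rejected because they would need a separate scheme.</p>
<source><![CDATA[
```java
import java.util.Locale;

public class PaddedIntegers {
    /** Zero-pads a non-negative value to at least the given width. */
    public static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled here");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Unpadded: string order disagrees with numeric order.
        System.out.println("1".compareTo("01234") > 0);

        // Padded to a common width: the two orders agree.
        System.out.println(pad(1, 5).compareTo(pad(1234, 5)) < 0);
    }
}
```
]]></source>
<p>The width must be at least as large as the longest value you expect, since a
value with more digits than the width is not truncated and would again compare
incorrectly.</p>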
<p>As with Date-Time, appropriate numerical representation is the responsibility
of the metadata extractor, though we have built some additional support for
representational transformations during the ingest process of the CAS-Filemgr.</p>
<p>More on Padding..</p>
<p><strong>Floating Point Numbers</strong></p>
<p>The most prevalent floating-point representation is
<a href="http://ieeexplore.ieee.org/servlet/opac?punumber=4610933">IEEE 754-2008</a>.
This is a convenient representation because, among other reasons, it is the
representation that Java uses, so if you develop an extractor in Java, you can
use the <code>floatToIntBits(float value)</code> method of the <code>Float</code>
class. Remember that, as with integers, you will need padding.</p>
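<p>The following sketch shows one way to turn the result of
<code>floatToIntBits</code> into a padded, comparable string. Note that it goes
beyond the advice above: the raw bit patterns of negative values sort in reverse
order, so the sketch adjusts the bits before formatting. That adjustment, the
hexadecimal output, and the names <code>SortableFloats</code> and
<code>encode</code> are assumptions made for illustration, not an existing
CAS-Metadata facility.</p>
<source><![CDATA[
```java
import java.util.Locale;

public class SortableFloats {
    /**
     * Encodes a float as an 8-character lowercase hex string whose string
     * order matches the numeric order of the inputs. NaN is rejected.
     */
    public static String encode(float value) {
        if (Float.isNaN(value)) {
            throw new IllegalArgumentException("NaN cannot be ordered");
        }
        int bits = Float.floatToIntBits(value);
        // Sign bit set (negative): flip every bit, so that values further
        // below zero produce smaller strings.
        // Sign bit clear: set it, so these values sort after all negatives.
        bits = (bits < 0) ? ~bits : (bits | 0x80000000);
        // %08x prints the 32 bits as unsigned hex, zero-padded to 8 digits.
        return String.format(Locale.ROOT, "%08x", bits);
    }

    public static void main(String[] args) {
        float[] samples = {-2.0f, -1.0f, 0.5f, 123.25f};
        for (float sample : samples) {
            System.out.println(sample + " -> " + encode(sample));
        }
    }
}
```
]]></source>
<p>If all values are known to be non-negative, the adjustment is unnecessary and
padding the output of <code>floatToIntBits</code> to a fixed width is
sufficient.</p>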
<p><strong>Blobs, URLs, and Character Sets</strong></p>
<p>Remember that text encoding is important. Depending on the catalog extension
point used by the CAS-Filemgr, metadata values might not be formatted correctly
for storage. For example, if a metadata object containing a blob of text is ingested
into a CAS-Filemgr instance that is configured to run with a DataStore extension
backed by an Oracle DBMS, then UTF-8 encoding is important.</p>
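<p>In Java, one way to keep encoding under control is to name the character set
explicitly whenever text is converted to or from bytes, rather than relying on
the platform default. The sketch below is illustrative only. The class
<code>Utf8Values</code> and its methods are hypothetical, and it does not model
any particular catalog or database.</p>
<source><![CDATA[
```java
import java.nio.charset.StandardCharsets;

public class Utf8Values {
    /** Encodes a metadata value as UTF-8 bytes. */
    public static byte[] toUtf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes UTF-8 bytes back into a string. */
    public static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The escape \u00f1 is a single non-ASCII character (n with tilde).
        String value = "Se\u00f1or";
        byte[] encoded = toUtf8(value);

        // The character count and the byte count differ for non-ASCII text.
        System.out.println(value.length() + " chars, " + encoded.length + " bytes");

        // Decoding with the same charset restores the original value.
        System.out.println(fromUtf8(encoded).equals(value));
    }
}
```
]]></source>
<p>Because non-ASCII characters occupy more than one byte in UTF-8, a value that
fits a limit expressed in characters may not fit one expressed in bytes.</p>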
</subsection>
</section>
<a name="provenance"/>
<section name="Planning Metadata for Provenance">
<p>Coming Soon...</p>
</section>
<a name="mime"/>
<section name="MIME-type Detection">
<p>Coming Soon...</p>
</section>
<section name="Conclusion">
<p>This is intended to be a living document discussing advanced topics within the
CAS-Metadata project, though it is not comprehensive. In our
<a href="../user/basic.html">Basic Guide</a>, we cover the
basics of CAS-Metadata.</p>
</section>
</body>
</document>