0.10/metadata/src/site/xdoc/user/advanced.xml - oodt - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one or more contributor
 license agreements.  See the NOTICE.txt file distributed with this work for
 additional information regarding copyright ownership.  The ASF licenses this
 file to you under the Apache License, Version 2.0 (the "License"); you may not
 use this file except in compliance with the License.  You may obtain a copy of
 the License at

      http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the
 License for the specific language governing permissions and limitations under
 the License.
 -->
 <document>
   <properties>
     <title>CAS-Metadata Advanced User Guide</title>
     <author email="woollard@jpl.nasa.gov">David Woollard</author>
   </properties>

   <body>
     <section name="Introduction">

     <p>The purpose of this guide is to instruct the user in some advanced concepts
     within the CAS-Metadata project, including the ramifications of metadata
     extraction with regard to repository search, type, checking, etc, and the use
     of mime-type detection. For basic topics, including the basics of metadata, how
     to write metadata extractors, and explanations of existing metadata extractors,
     see our <a href="../user/basic.html">Basic Guide.</a> In the rest of this guide,
     we will cover the following topics:</p>

     <ul>
       <li><a href="#search">Planning Metadata for Search</a></li>
       <li><a href="#provenance">Planning Metadata for Provenance</a></li>
       <li><a href="#mime">MIME-type Detection</a></li>
     </ul>

     </section>

     <a name="search"/>
     <section name="Planning Metadata for Search">

     <p>As discussed in the <a href="../user/basic.html">Basic Guide,</a> one of the
     primary uses of metadata is search. When you consider search, remember that
     neither the CAS-Metadata container nor the CAS-Filemgr are cognizant over
     how people will search your data - you are.</p>

     <p>We recommend developing a data dictionary as a "best practice." The
     <a href="http://portal.acm.org/citation.cfm?id=541721">IBM Dictionary of
     Computing</a> describes a data dictionary as a "centralized repository of
     information about data such as meaning, relationships to other data, origin,
     usage, and format." This is a highly-related concept to ontology.</p>

     <p>The attributes of and relationships between products in your system will
     not only help you to develop appropriate product types but also the metadata
     you will need to extract from products to establish both these attributes and
     relationships.</p>

     <p>In the next subsections, we will discuss specific aspects of metadata
     extraction that impact downstream search.</p>

     <subsection name="Missing Elements">

     <p>Because metadata extraction is a separate activity from extraction (as
     discussed in the <a href="../user/basic.html">Basic Guide</a>), it is possible
     that there is a miss-match between the metadata elements extracted by an
     extractor and the metadata elements that the CAS-Filemgr associated with
     a product type. CAS-Filemgr, therefore, only ingests the <i>intersection</i>
     of the metadata extracted from the product and the metadata associated with
     the product in CAS-Filemgr configuration. This means that missing elements
     are possible.</p>

     </subsection>

     <subsection name="String-based Comparison">
     <p>Metadata values are stored as strings in CAS-Metadata. While there are a
     number of good reasons for this, it is a design point that has a number of
     important ramifications for search. Specifically, all metadata elements should
     be comparable. Of course strings are comparable, but without some forethought, a
     string-based comparison can act differently than would a type-based comparison.</p>

     <p>This is where standards come into play. For various types, there are standard
     string-based representations that ensure comparisons that behave identically to how
     a type-based comparison would work. There is currently no plan to enforce these
     standards (and, depending on the particular type, the application domain, etc.,
     there might be multiple competing standards - e.g., TAI vs. UTC formatted Time
     strings).</p>

     <p>Some example string representation standards by type:</p>

     <p><strong>Date-Time</strong></p>

     <p>With time, consistency is key. There are multiple formats, such as
     <a href="http://tycho.usno.navy.mil/mjd.html">Julian Day Numbers</a>, or UTC.
     Additionally, there are different time standards such as
     <a href="http://www.w3.org/TR/NOTE-datetime">UTC</a>, local time, and
     <a href="http://www.bipm.org/en/scientific/tai/">TAI</a>. One also needs to be
     aware of leap second observance and local time conventions such as daylight savings
     time, depending on representation and standard selection.</p>

     <p>Additionally, it is important to remember that inter-product consistency can be
     just as important as intra-product consistency because there are many downstream
     use cases of the search features of CAS-Filemgr and other CAS components that
     involve cross-product comparisons.</p>

     <p><strong>Integers</strong></p>

     <p>Integers can be easily represented as comparable strings, but you must remember
     to pad correctly. The string "1" is greater than "01234", but "0001" is less than
     "01234."</p>

     <p>Like Date-Time, appropriate numerical representation is the responsibility
     of the metadata extractor, though we have built some additionally support for
     representational transformations during the ingest process of the CAS-Filemgr.</p>

     <p>More on Padding..</p>

     <p><strong>Floating Point Numbers</strong></p>

     <p>The most prevalent of Floating point representations is
     <a href="http://ieeexplore.ieee.org/servlet/opac?punumber=4610933">IEEE 754-2008.</a>
     This is a convenient representation because, amongst other reasons, it is the
     representation that Java uses so if you use Java to develop an extractor, you can
     use the <code>floatToIntBits(float value)</code> method of the <code>Float</code>
     class. Remember that, like Integers, you will need padding.</p>

     <p><strong>Blobs, URLs, and Character Sets</strong></p>

     <p>Remember that text encoding is important. Depending on the catalog extension
     point used by the CAS-Filemgr, metadata values might not be formatting correctly
     for storage. For example, if a metadata object containing a blob of text is ingested
     into a CAS-Filemgr instance that is configured to run with a DataStore extension
     backed by an Oracle DBMS, then UTF8 encoding is important.</p>


     </subsection>
     </section>

     <a name="provenance"/>
     <section name="Planning Metadata for Provenance">
     <p>Coming Soon...</p>
     </section>

     <a name="mime"/>
     <section name="MIME-type Detection">

     <p>Coming Soon...</p>

     </section>
     <section name="Conclusion">
       <p>This is intended to a living document discussing advanced topics within the
       CAS-Metadata project, though it is not comprehensive. In our
       <a href="../user/basic.html">Basic Guide,</a> we cover more topics regarding
       basic topics within CAS-Metadata.</p>
     </section>

   </body>
 </document>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one or more contributor
	license agreements. See the NOTICE.txt file distributed with this work for
	additional information regarding copyright ownership. The ASF licenses this
	file to you under the Apache License, Version 2.0 (the "License"); you may not
	use this file except in compliance with the License. You may obtain a copy of
	the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
	WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
	License for the specific language governing permissions and limitations under
	the License.
	-->
	<document>
	<properties>
	<title>CAS-Metadata Advanced User Guide</title>
	<author email="woollard@jpl.nasa.gov">David Woollard</author>
	</properties>

	<body>
	<section name="Introduction">

	<p>The purpose of this guide is to instruct the user in some advanced concepts
	within the CAS-Metadata project, including the ramifications of metadata
	extraction with regard to repository search, type, checking, etc, and the use
	of mime-type detection. For basic topics, including the basics of metadata, how
	to write metadata extractors, and explanations of existing metadata extractors,
	see our <a href="../user/basic.html">Basic Guide.</a> In the rest of this guide,
	we will cover the following topics:</p>

	<ul>
	<li><a href="#search">Planning Metadata for Search</a></li>
	<li><a href="#provenance">Planning Metadata for Provenance</a></li>
	<li><a href="#mime">MIME-type Detection</a></li>
	</ul>

	</section>

	<a name="search"/>
	<section name="Planning Metadata for Search">

	<p>As discussed in the <a href="../user/basic.html">Basic Guide,</a> one of the
	primary uses of metadata is search. When you consider search, remember that
	neither the CAS-Metadata container nor the CAS-Filemgr are cognizant over
	how people will search your data - you are.</p>

	<p>We recommend developing a data dictionary as a "best practice." The
	<a href="http://portal.acm.org/citation.cfm?id=541721">IBM Dictionary of
	Computing</a> describes a data dictionary as a "centralized repository of
	information about data such as meaning, relationships to other data, origin,
	usage, and format." This is a highly-related concept to ontology.</p>

	<p>The attributes of and relationships between products in your system will
	not only help you to develop appropriate product types but also the metadata
	you will need to extract from products to establish both these attributes and
	relationships.</p>

	<p>In the next subsections, we will discuss specific aspects of metadata
	extraction that impact downstream search.</p>

	<subsection name="Missing Elements">

	<p>Because metadata extraction is a separate activity from extraction (as
	discussed in the <a href="../user/basic.html">Basic Guide</a>), it is possible
	that there is a miss-match between the metadata elements extracted by an
	extractor and the metadata elements that the CAS-Filemgr associated with
	a product type. CAS-Filemgr, therefore, only ingests the <i>intersection</i>
	of the metadata extracted from the product and the metadata associated with
	the product in CAS-Filemgr configuration. This means that missing elements
	are possible.</p>

	</subsection>

	<subsection name="String-based Comparison">
	<p>Metadata values are stored as strings in CAS-Metadata. While there are a
	number of good reasons for this, it is a design point that has a number of
	important ramifications for search. Specifically, all metadata elements should
	be comparable. Of course strings are comparable, but without some forethought, a
	string-based comparison can act differently than would a type-based comparison.</p>

	<p>This is where standards come into play. For various types, there are standard
	string-based representations that ensure comparisons that behave identically to how
	a type-based comparison would work. There is currently no plan to enforce these
	standards (and, depending on the particular type, the application domain, etc.,
	there might be multiple competing standards - e.g., TAI vs. UTC formatted Time
	strings).</p>

	<p>Some example string representation standards by type:</p>

	<p><strong>Date-Time</strong></p>

	<p>With time, consistency is key. There are multiple formats, such as
	<a href="http://tycho.usno.navy.mil/mjd.html">Julian Day Numbers</a>, or UTC.
	Additionally, there are different time standards such as
	<a href="http://www.w3.org/TR/NOTE-datetime">UTC</a>, local time, and
	<a href="http://www.bipm.org/en/scientific/tai/">TAI</a>. One also needs to be
	aware of leap second observance and local time conventions such as daylight savings
	time, depending on representation and standard selection.</p>

	<p>Additionally, it is important to remember that inter-product consistency can be
	just as important as intra-product consistency because there are many downstream
	use cases of the search features of CAS-Filemgr and other CAS components that
	involve cross-product comparisons.</p>

	<p><strong>Integers</strong></p>

	<p>Integers can be easily represented as comparable strings, but you must remember
	to pad correctly. The string "1" is greater than "01234", but "0001" is less than
	"01234."</p>

	<p>Like Date-Time, appropriate numerical representation is the responsibility
	of the metadata extractor, though we have built some additionally support for
	representational transformations during the ingest process of the CAS-Filemgr.</p>

	<p>More on Padding..</p>

	<p><strong>Floating Point Numbers</strong></p>

	<p>The most prevalent of Floating point representations is
	<a href="http://ieeexplore.ieee.org/servlet/opac?punumber=4610933">IEEE 754-2008.</a>
	This is a convenient representation because, amongst other reasons, it is the
	representation that Java uses so if you use Java to develop an extractor, you can
	use the <code>floatToIntBits(float value)</code> method of the <code>Float</code>
	class. Remember that, like Integers, you will need padding.</p>

	<p><strong>Blobs, URLs, and Character Sets</strong></p>

	<p>Remember that text encoding is important. Depending on the catalog extension
	point used by the CAS-Filemgr, metadata values might not be formatting correctly
	for storage. For example, if a metadata object containing a blob of text is ingested
	into a CAS-Filemgr instance that is configured to run with a DataStore extension
	backed by an Oracle DBMS, then UTF8 encoding is important.</p>


	</subsection>
	</section>

	<a name="provenance"/>
	<section name="Planning Metadata for Provenance">
	<p>Coming Soon...</p>
	</section>

	<a name="mime"/>
	<section name="MIME-type Detection">

	<p>Coming Soon...</p>

	</section>
	<section name="Conclusion">
	<p>This is intended to a living document discussing advanced topics within the
	CAS-Metadata project, though it is not comprehensive. In our
	<a href="../user/basic.html">Basic Guide,</a> we cover more topics regarding
	basic topics within CAS-Metadata.</p>
	</section>

	</body>
	</document>