| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more contributor |
| license agreements. See the NOTICE.txt file distributed with this work for |
| additional information regarding copyright ownership. The ASF licenses this |
| file to you under the Apache License, Version 2.0 (the "License"); you may not |
| use this file except in compliance with the License. You may obtain a copy of |
| the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, WITHOUT |
| WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the |
| License for the specific language governing permissions and limitations under |
| the License. |
| --> |
| <document> |
| <properties> |
| <title>CAS Crawler User Guide</title> |
| <author email="Chris.Mattmann@jpl.nasa.gov">Chris Mattmann</author> |
| </properties> |
| |
| <body> |
| <section name="User Guide"> |
| <p> |
| This is the user guide for the OODT Catalog and Archive Service (CAS) Crawler Framework, |
| or CAS Crawler for short. This guide explains the CAS Crawler architecture |
| including its extension points. The guide also discusses available services provided |
| by the CAS Crawler, how to utilize them, and the different APIs that exist. The guide |
| concludes with a description of CAS Crawler use cases. |
| </p> |
| </section> |
| <section name="Architecture"> |
      <p>The CAS Crawler Framework represents an effort to standardize the common ingestion activities
      that occur in data acquisition and archival. These activities regularly involve identification
      of files and directories to crawl (based on, e.g., mime type, regular expressions, or direct user
      input), satisfaction of ingestion pre-conditions (e.g., the current crawled file has not been
      previously ingested), followed by metadata extraction. After metadata extraction, crawled data follows
      a standard three-state lifecycle:
      </p>
| <ol> |
| <li>preIngestion - where e.g., a file may be unzipped or pre-processed prior to ingestion;</li> |
       <li>postIngestion success, indicating that a successful ingestion into the file management component has occurred
       and that, e.g., the origin data file should be deleted from the ingest area; and</li>
       <li>postIngestion failure, indicating that ingestion was not successful and that some corrective action,
       e.g., moving the failed file to a failure area for later examination, should occur.</li>
| </ol> |
| |
| <p><img src='../images/crawler_phases.png' alt='Crawler Phase Model'/></p> |
| <p><img src='../images/crawler-state-diagram.png' alt='Detailed Crawler Phase Model'/></p> |
| |
| <p>The critical objects managed by the CAS Crawler framework include:</p> |
| |
| <ul> |
        <li>Crawler Config - Configures a Product Crawler, providing specialized options, such as specifying
        the File Manager URL to talk to, specifying the Met Extractor to call (in the case of the Met Extractor Product
        Crawler), or specifying some property for a Crawler Action.</li>
| <li>Product Crawler - Crawlers crawl a starting product path, looking for Products, which are collections of |
| one or more files, and an associated Metadata object. Metadata can be: |
| <ul> |
| <li>Provided ahead of time - in this case, the Product Crawler reads the Metadata file, looking for <code>Filename</code>, |
| <code>FileLocation</code>, and <code>ProductType</code> met keys. The Crawler Framework provides a <code>StdProductCrawler</code> |
| for this scenario.</li> |
           <li>Generated on the fly, using a CAS Met Extractor - in this case, the Product Crawler calls a CAS Met Extractor, configures
           it with an associated Met Extractor Config File, and uses this mechanism to obtain the Metadata for a Product. After the Metadata is
           obtained, the Product Crawler engages in the same procedure as in the StdProductCrawler case. The Crawler Framework provides a
           <code>MetExtractorProductCrawler</code> for this scenario.</li>
           <li>Generated on the fly, using a CAS Met Extractor selected by a particular Product mime type or regular expression identifier - in this
           case, Met Extractors are called on the fly to generate a Product's metadata. The Met Extractors are chosen based on a Product's mime type association,
           or by a regular expression matching the Product file(s). The Crawler Framework provides an <code>AutoDetectProductCrawler</code> for this
           scenario.</li>
| </ul> |
| </li> |
        <li>Action - Crawler Actions are tasks that the Crawler can perform that are associated with one of the three above phases. Each Action can be associated
        with zero or more Crawler Phases. Examples of Crawler Actions are provided in the Appendix, and a minimal sketch of a custom Action follows this list.</li>
| <li>Crawler Phase - Crawler Phases are one of the three aforementioned phases of the Crawler lifecycle: preIngest, postIngestSuccess, or postIngestFailure. Each |
| Phase can have zero or more associated Crawler Actions.</li> |
| </ul> |
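
      <p>To make the Action extension concrete, below is a minimal sketch of a user-defined Crawler Action.
      It assumes that <code>CrawlerAction</code> is a base class exposing a single
      <code>performAction(File, Metadata)</code> callback and that the exception class lives in the package
      shown; treat these names as assumptions and verify them against the Javadoc shipped in the
      <code>doc</code> directory of your release.</p>

      <source>
// A minimal sketch of a custom Crawler Action. The base class and
// exception names below are assumptions based on the cas-crawler
// package layout; verify them against the Javadoc for your release.
import java.io.File;

import org.apache.oodt.cas.crawl.action.CrawlerAction;
import org.apache.oodt.cas.crawl.structs.exceptions.CrawlerActionException;
import org.apache.oodt.cas.metadata.Metadata;

public class LogProductAction extends CrawlerAction {

  @Override
  public boolean performAction(File product, Metadata productMetadata)
      throws CrawlerActionException {
    // Log the crawled file and its ProductType met key; returning true
    // tells the Crawler that this action (and hence its phase) succeeded.
    System.out.println("Crawled " + product.getAbsolutePath()
        + " (ProductType: " + productMetadata.getMetadata("ProductType") + ")");
    return true;
  }
}
      </source>

      <p>Once compiled onto the Crawler's classpath, such an Action would be declared as a bean in the
      policy directory's Spring-XML files and attached, by its id, to one or more Crawler Phases.</p>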
| |
| <p>Each Product Crawler is configured by exactly one Crawler Config. Each Product Crawler has 1...* Crawler Actions |
| which are each associated with 1...* Crawler Phases. Each Product Crawler crawls 1...* Products, which are each associated |
| with exactly 1 Metadata object. These relationships are shown in the below figure.</p> |
| |
| <p><img src="../images/crawler-object-model.png" alt="Crawler Object Model"/></p> |
| |
| <subsection name="Extension Points"> |
| <p> |
| There are several extension points for the Crawler Framework. An extension point is an interface |
| within the crawler framework that can have many implementations. This is particularly useful when |
| it comes to software component configuration because it allows different implementations of an |
       existing interface to be selected at deployment time. So, the crawler framework may
       leverage a preIngest action that checks against the File Manager Catalog to determine if a Product has
       already been ingested, or the crawler may choose a preIngest action that performs an MD5 checksum
       on the Product before ingesting. The selection of the actual component implementations is handled
       entirely by the extension point mechanism. Using extension points, it is fairly simple to support many
       different types of what are typically referred to as "plug-in architectures". Each of the core
       extension points for the Crawler Framework is described below:</p>
| |
| <table> |
| <tr> |
| <td>Crawler Daemon</td> |
| <td>The Crawler Daemon extension point is responsible for periodically (as defined by the User) |
| executing a particular ProductCrawler, as well as providing statistics on crawls (number of crawls, |
| average crawl time, etc.). |
| </td> |
| </tr> |
| <tr> |
| <td>Product Crawler</td> |
         <td>The Product Crawler extension point is responsible for defining the manner in which Crawler
         Preconditions are checked before ingestion, for managing the three-phase ingestion process, and for dictating
         the method by which Metadata is obtained for a particular product. In addition, the extension point
         also specifies how Product files are crawled on disk.
| </td> |
| </tr> |
| <tr> |
| <td>Mime Type Extractor Repository</td> |
| <td>The Mime Type Extractor Repository is an extension point that specifies how |
| Met Extractors are associated with mime types related to a Product file being ingested. |
| </td> |
| </tr> |
| <tr> |
| <td>Crawler Action Repository</td> |
| <td>The Crawler Action Repository is a home for all Crawler Actions, mapping each Action |
| to a lifecycle phase (preIngest, postIngestSuccess, or postIngestFailure). |
| </td> |
| </tr> |
| <tr> |
| <td>Precondition Comparator</td> |
         <td>The Precondition Comparator extension point allows for the specification of preconditions
         that must be satisfied before metadata is extracted for a particular crawled Product, and, ultimately,
         before that Product is ingested.
| </td> |
| </tr> |
| <tr> |
| <td>Met Extractor, Met Extractor Config, Met Extractor Config Reader</td> |
         <td>The Crawler Framework leverages CAS Metadata's Met Extractor, Met Extractor Config, and
         Met Extractor Config Reader extension points. Each Met Extractor is provided a Met Extractor Config
| file, which is read by an associated Met Extractor Config Reader, allowing a User to entirely customize |
| how Metadata is extracted for a particular Product file. |
| </td> |
| </tr> |
| |
| </table> |
| |
| <p>The relationships between the extension points for the Crawler Framework are shown in the below |
| figure. |
| </p> |
| |
| <p><img width='600' src="../images/crawler-extension-points.png" alt="Crawler Extension Points"/></p> |
| |
| </subsection> |
| <subsection name="Key Capabilities"> |
| <p>The Crawler Framework is responsible for providing the necessary key capabilities for manually |
| and automatically ingesting files and metadata. Each high level capability provided by the |
| Crawler Framework is detailed below:</p> |
| |
| <ol> |
| <li>Automatic and manual ingesting of Products - The Crawler Framework can easily be configured |
| as a Daemon process that periodically crawls a starting directory for Products, or as a |
| standalone program, performing ingestion and then terminating. |
| </li> |
        <li>Command-line interface, as well as APIs - The Crawler Framework can be called from a simple
        command-line interface, allowing GNU-style options to configure a crawler and start ingesting. In
        addition, the Crawler also provides Java-based APIs that allow a user to integrate crawling into
        her Java application with ease (a minimal sketch follows this list).
| </li> |
        <li>Configurable Actions, and Action Extensibility Mechanism - The Crawler Framework ships with a set
        of predefined actions out of the box (e.g., delete the data file after postIngestSuccess; delete the met file after
        the same phase; check for product uniqueness at preIngest) that provide an excellent starting point for
        leveraging the Crawler Framework in a user's application. If more specific functionality is required, a
        user has the ability to write her own Crawler Action by implementing the Crawler Action interface, as
        sketched in the Architecture section above.</li>
| <li>Configurable and extensible Preconditions for Met Extraction - Much like Crawler Actions, Preconditions |
        allow a user to customize the Crawler Framework, specifying if, how, and when Met Extractors should be called
| by a Crawler to extract metadata from an underlying Product. |
| </li> |
| <li>Support for a wide variety of Product file identification mechanisms - The Crawler Framework provides three |
| different crawlers out of the box that enforce a particular Product file identification strategy, e.g., StdProductCrawler |
| looks for Metadata files, MetExtractorProductCrawler generates Metadata files on the fly, and AutoDetectProductCrawler |
        uses mime type information and file regular expressions to identify Product files to crawl. If more customized functionality
| is required, the user can extend one of these existing ProductCrawlers, or write an entirely new crawler.</li> |
        <li>Ability to plug in Met Extraction rules - A user can leverage a custom-developed Met Extractor to generate Product metadata
        used by the crawler during the ingestion process. Met Extractors are easily plugged in using Java's class loading mechanism
        and the capabilities of the <a href="../../metadata/">CAS Metadata Framework</a>.</li>
        <li>Communication over lightweight, standard protocols – The Crawler Framework uses XML-RPC as its main
        external interface when the Crawler runs in Daemon mode. XML-RPC, the little brother of SOAP, is
        fast, extensible, and uses the underlying HTTP protocol for data transfer.</li>
| </ol> |
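
      <p>To give a flavor of the Java-based API mentioned above, the sketch below drives a
      <code>StdProductCrawler</code> programmatically. The setter names here are assumptions that mirror the
      <code>crawler_launcher</code> command-line options shown later in this guide; check the Javadoc shipped
      in the <code>doc</code> directory for the exact method signatures.</p>

      <source>
// A minimal programmatic ingestion sketch. The setter names below are
// assumptions mirroring the crawler_launcher options (--filemgrUrl,
// --metFileExtension, --clientTransferer); verify them against the Javadoc.
import java.io.File;

import org.apache.oodt.cas.crawl.StdProductCrawler;

public class CrawlExample {
  public static void main(String[] args) throws Exception {
    StdProductCrawler crawler = new StdProductCrawler();
    crawler.setFilemgrUrl("http://localhost:9000");
    crawler.setMetFileExtension("met");
    crawler.setClientTransferer(
        "org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory");
    // Walk the product path, reading [file].met for each Product and
    // ingesting it into the File Manager running at the URL above.
    crawler.crawl(new File("/data/test"));
  }
}
      </source>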
| |
| <p>This capability set is not exhaustive, and is meant to give the user a "feel" for what |
| general features are provided by the Crawler Framework. Most likely the user will find that the |
| Crawler Framework provides many other capabilities besides those described here.</p> |
| |
| </subsection> |
| <subsection name="Current Extension Point Implementations"> |
| |
     <p>There is at least one implementation of each of the aforementioned extension points for the
     Crawler Framework. Each existing extension point implementation is detailed below:</p>
| |
| <ul> |
| <li><b>Crawler Actions</b><br/> |
| <ol> |
| <li>Delete File – deletes an origin data file from its location in the staging area. Tied to |
| the postIngestSuccess stage.</li> |
          <li>FilemgrUniquenessChecker - Checks with the File Manager after Metadata extraction has occurred to
          determine if the product has already been ingested. Tied to the preIngest stage.</li>
          <li>MimeTypeCrawlerAction - Calls a specified Crawler Action if the Product matches a specified list of
          mime types. Tied to the preIngest stage.</li>
| <li>WorkflowMgrStatusUpdate - Notifies the Workflow Manager that a file has been ingested into the File |
| Manager successfully. Tied to the postIngestSuccess stage.</li> |
| </ol> |
| </li> |
| <li><b>Crawler Precondition Comparators</b><br/> |
| <ol> |
          <li>FilemgrUniquenessCheckComparator - Determines, before Metadata extraction has occurred, whether or not
          a Product has been ingested into the File Manager.
          </li>
| </ol> |
| </li> |
| <li><b>Crawler Daemons (Controller and Daemon)</b><br/> |
| <ol> |
          <li>XML-RPC – The default implementations of the Crawler Daemon Controller and the Crawler Daemon leverage
          XML-RPC, the little brother of SOAP, as their mode of communication.
          </li>
| </ol> |
| </li> |
| </ul> |
| |
| </subsection> |
| |
| </section> |
| <section name="Configuration and Installation"> |
| <p> |
      To install the Crawler Framework, you need to download a release
      of the software, available from its home web site. For bleeding-edge features, you can
      also check out the cas-crawler trunk project from the Apache OODT Subversion repository. You can browse
      the repository using ViewVC, located at:</p>
| |
     <source>http://svn.apache.org/viewvc/oodt/</source>
| |
     <p>The actual web URL for the repository is located at:</p>
| |
     <source>https://svn.apache.org/repos/asf/oodt</source>
| |
| <p>To check out the Crawler Framework, use your favorite Subversion client.</p> |
| |
| <subsection name="Project Organization"> |
| <p> |
      The cas-crawler project follows the traditional Subversion-style <code>trunk</code>, <code>tags</code>,
      and <code>branches</code> format. Trunk corresponds to the latest and greatest development on
      cas-crawler. Tags are official release versions of the project. Branches correspond to deviations
      from the trunk large enough to warrant a separate development tree. </p>
| |
      <p>For the purposes of this User Guide, we'll assume you have already downloaded a built release
      of the Crawler Framework from its web site. If you were building cas-crawler from the trunk, a tagged release,
      or a branch, the process would be quite similar. To build cas-crawler, you would need the Apache Maven
      software. Maven is an XML-based project management system similar to Apache Ant, but with many extra
      bells and whistles. Maven makes cross-platform project development a snap. You can download Maven from:

      <a href="http://maven.apache.org">http://maven.apache.org/</a>

      All cas-crawler releases post 1.x.0 are now <b>Maven 2 compatible</b>. This is <b>very</b> important:
      it means that if you have any cas-crawler release > 1.x.0, you will need Maven 2 to compile the software,
      and Maven 1 will no longer work.</p>
| |
      <p>Follow the procedure in the section below to build a fresh copy of the Crawler Framework. The
      procedure specifically targets building the software with Maven 2:
      </p>
| |
| </subsection> |
| |
| <subsection name="Building the Crawler Framework"> |
| <p> |
| <ol> |
| <li>cd to cas-crawler, and then type: |
| <source># mvn package</source> |
| This will perform several tasks, including compiling the source code, downloading |
| required jar files, running unit tests, and so on. When the command completes, cd |
| to the <code>target</code> directory within cas-crawler. This will contain the build of the |
| Crawler Framework component, of the following form: |
| |
| <source> |
| cas-crawler-${version}-dist.tar.gz |
| </source> |
| |
       This is a distribution tarball that you would copy to a deployment directory, such as
       <code>/usr/local/</code>, and then unpack using <code># tar xvzf </code>. The resultant directory
       layout from the unpacked tarball is as follows:
| |
| <source> |
| bin/ etc/ logs/ doc/ lib/ policy/ LICENSE.txt CHANGES.txt |
| </source> |
| <ul> |
         <li>bin - contains the "crawler_launcher" server script and the "crawlctl" client script.</li>
| <li>etc - contains the logging.properties file for the Crawler Framework.</li> |
| <li>logs - the default directory for log files to be written to.</li> |
         <li>doc - contains Javadoc documentation and user guides for using the particular CAS component.</li>
| <li>lib - the required Java jar files to run the Crawler Framework.</li> |
| <li>policy – the default Spring-XML based configuration for Crawler Actions, Preconditions, Command Line options, and Crawlers.</li> |
| <li>CHANGES.txt - contains the CHANGES present in this released version of the CAS component.</li> |
| <li>LICENSE.txt - the LICENSE for the CAS Crawler Framework project.</li> |
| </ul> |
| </li> |
| </ol> |
| </p> |
| |
| </subsection> |
| <subsection name="Deploying the Crawler Framework"> |
     <p>To deploy the Crawler Framework, you'll need to create an installation directory. Typically this
     would be somewhere in /usr/local (on *nix-style systems), or C:\Program Files\ (on Windows-style
     systems). We'll assume that you're installing on a *nix-style system, though the Windows
     instructions are quite similar.</p>
| |
| <p>Follow the process below to deploy the Crawler Framework:</p> |
| |
| <ol> |
| <li>Copy the binary distribution to the deployment directory |
| <source># cp -R cas-crawler/trunk/target/cas-crawler-${version}-dist.tar.gz /usr/local/</source> |
| </li> |
| <li>Untar the distribution |
| <source># cd /usr/local ; tar xvzf cas-crawler-${version}-dist.tar.gz</source> |
| </li> |
| <li>Set up a symlink |
| <source># ln -s /usr/local/cas-crawler-${version} /usr/local/crawler</source> |
| </li> |
| <li>Set your <code>JAVA_HOME</code> environment variable to point to the location of your |
| installed JRE runtime. |
| </li> |
| </ol> |
| |
| <p>Other configuration options are possible: check the <a href="../apidocs">API documentation</a>, |
| as well as the comments within the XML files in the policy directory to find out the rest of the configurable |
     as well as the comments within the XML files in the policy directory to find out the rest of the configurable
     properties for the extension points you choose. A full listing of all the extension point factory
     class names is provided in the Appendix. After step 4, you are officially done configuring the Crawler
| Framework for deployment.</p> |
| |
| </subsection> |
| <subsection name="Running the Crawler Framework"> |
| <p>To run the crawler_launcher, cd to <code>/usr/local/crawler/bin</code> and type:</p> |
| |
| <source># ./crawler_launcher -h</source> |
| |
| <p>This will print the help options for the Crawler launcher script. Your Crawler Framework |
| is now ready to run! You can test out the Crawler Framework by running a simple ingest |
     command shown below. Be sure that a separate File Manager instance is up and running on localhost
     on its default port of 9000 before starting this test command (see the
     <a href="../../filemgr/user/">File Manager User Guide</a> for more information).

     First, create a simple text file called "blah.txt" and place it inside a test directory, e.g., <code>/data/test</code>. Then, create a default
| metadata file for the product, using the schema or DTD |
| provided in the cas-metadata project. An example XML file might be:</p> |
| |
| <source> |
| <cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas"> |
| <keyval> |
| <key>Filename</key> |
| <val>blah.txt</val> |
| </keyval> |
| <keyval> |
| <key>FileLocation</key> |
| <val>/data/test</val> |
| </keyval> |
| <keyval> |
| <key>ProductType</key> |
| <val>GenericFile</val> |
| </keyval> |
| </cas:metadata> |
| </source> |
| |
| <p>Call this metadata file <code>blah.txt.met</code>, and place it also in <code>/data/test</code>. |
| Then, run the below command, assuming that you started a File Manager on localhost on the default port of <code>9000</code>:</p> |
| |
| <source># ./crawler_launcher \ |
| --crawlerId StdProductCrawler \ |
| --productPath /data/test \ |
| --filemgrUrl http://localhost:9000/ \ |
| --failureDir /tmp \ |
| --actionIds DeleteDataFile MoveDataFileToFailureDir Unique \ |
| --metFileExtension met \ |
| --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory |
| </source> |
| |
| <p>You should see a response message at the end similar to:</p> |
| |
| <source> |
| log4j:WARN No appenders could be found for logger (org.springframework.context.support.FileSystemXmlApplicationContext). |
| log4j:WARN Please initialize the log4j system properly. |
| http://localhost:9000/ |
| StdProductCrawler |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.ProductCrawler crawl |
| INFO: Crawling /data/test |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.ProductCrawler handleFile |
| INFO: Handling file /data/test/blah.txt |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.metadata.extractors.MetReaderExtractor extrMetadata |
| INFO: Reading metadata from /data/test/blah.txt.met |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.ProductCrawler ingest |
| INFO: ProductCrawler: Ready to ingest product: [/data/test/blah.txt]: ProductType: [GenericFile] |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.ingest.StdIngester setFileManager |
| INFO: StdIngester: connected to file manager: [http://localhost:9000/] |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferer setFileManagerUrl |
| INFO: Local Data Transfer to: [http://localhost:9000/] enabled |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.ingest.StdIngester ingest |
| INFO: StdIngester: ingesting product: ProductName: [blah.txt]: ProductType: [GenericFile]: FileLocation: [/data/test/] |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.versioning.VersioningUtils createBasicDataStoreRefsFlat |
| FINE: VersioningUtils: Generated data store ref: file:/Users/mattmann/files/blah.txt/blah.txt from origRef: file:/data/test/blah.txt |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.system.XmlRpcFileManager runExtractors |
| INFO: Running Met Extractor: [org.apache.oodt.cas.filemgr.metadata.extractors.CoreMetExtractor] for product type: [GenericFile] |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.system.XmlRpcFileManager runExtractors |
| INFO: Running Met Extractor: [org.apache.oodt.cas.filemgr.metadata.extractors.examples.MimeTypeExtractor] for product type: [GenericFile] |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.catalog.LuceneCatalog toDoc |
| WARNING: No Metadata specified for product [blah.txt] for required field [DataVersion]: Attempting to continue processing metadata |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient ingestProduct |
| FINEST: File Manager Client: clientTransfer enabled: transfering product [blah.txt] |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferer moveFile |
| INFO: LocalDataTransfer: Moving File: file:/data/test/blah.txt to file:/Users/mattmann/files/blah.txt/blah.txt |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.filemgr.catalog.LuceneCatalog toDoc |
| WARNING: No Metadata specified for product [blah.txt] for required field [DataVersion]: Attempting to continue processing metadata |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.ProductCrawler ingest |
| INFO: Successfully ingested product: [/data/test/blah.txt]: product id: 72db4bba-658a-11de-bedb-77f2d752c436 |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.ProductCrawler handleFile |
| INFO: Successful ingest of product: [/data/test/blah.txt] |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.ProductCrawler performProductCrawlerActions |
| INFO: Performing action (id = DeleteDataFile : description = Deletes the current data file) |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.action.DeleteFile performAction |
| INFO: Deleting file /data/test/blah.txt |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.ProductCrawler handleFile |
| INFO: Handling file /data/test/blah.txt.met |
| Jun 30, 2009 8:26:57 AM org.apache.oodt.cas.crawl.ProductCrawler handleFile |
| WARNING: Failed to pass preconditions for ingest of product: [/data/test/blah.txt.met] |
| </source> |
| |
| <p>which means that everything installed okay!</p> |
| |
| </subsection> |
| |
| </section> |
| <section name="Use Cases"> |
| <p> |
      The Crawler Framework was built to support several of the capabilities outlined in the Key
      Capabilities section above. In particular, there were several use cases that we wanted to support, some
      of which are described below.
| </p> |
| |
| <p><img width='600' src="../images/crawler-use-case-1.png" alt="Crawler Framework Ingest Use Case"/></p> |
| |
     <p>The numbers in the above figure correspond to the sequence of steps, and the
     interactions between the different Crawler Framework extension points, required to
     perform the file ingestion activity. In Step 1, the Crawler Daemon determines whether the Product
     file in the staging area matches a specified File Filter regular expression. In Step 2, if the
     Product file does match the desired filter, the Crawler Daemon checks with the File Manager
     (using the Ingester) to determine whether or not the Product file has been ingested. In Step 3,
     assuming that the Product file has <b>not</b> been ingested, the Crawler Daemon ingests
     the Product file into the File Manager using the Ingester interface. During the ingestion,
     Metadata is first extracted (in sub-step 1) and then the File Manager client interface is
     used (in sub-step 2) to ingest into the File Manager. During ingestion, the File Manager is
     responsible for moving the Product file(s) into controlled-access storage and cataloging the
     provided Product metadata.</p>
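
     <p>The sketch below renders the check-then-ingest portion of this use case in Java. It assumes that the
     Ingester interface exposes <code>hasProduct</code> and <code>ingest</code> methods shaped as shown, and
     that <code>StdIngester</code> is constructed with the name of a data transfer factory class; treat both
     as assumptions to be checked against the File Manager Javadoc.</p>

     <source>
// A hedged sketch of Steps 2 and 3 of the ingest use case above. The
// StdIngester constructor argument and the method shapes are assumptions;
// verify them against the cas-filemgr Javadoc for your release.
import java.io.File;
import java.net.URL;

import org.apache.oodt.cas.filemgr.ingest.StdIngester;
import org.apache.oodt.cas.metadata.Metadata;

public class IngestUseCase {
  public static void main(String[] args) throws Exception {
    URL fmUrl = new URL("http://localhost:9000");
    File product = new File("/data/test/blah.txt");

    StdIngester ingester = new StdIngester(
        "org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory");

    // Step 2: skip Products the File Manager already knows about.
    if (!ingester.hasProduct(fmUrl, product)) {
      Metadata met = new Metadata();
      met.addMetadata("ProductType", "GenericFile");
      // Step 3: move the file into controlled-access storage and
      // catalog its metadata via the File Manager.
      ingester.ingest(fmUrl, product, met);
    }
  }
}
     </source>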
| |
| </section> |
| <section name="Appendix"> |
| <p> |
| Full list of Crawler Framework extension point classes and their associated property names from the |
| Spring Bean XML files: |
| </p> |
| |
| <table> |
| <tr> |
| <td>actions</td> |
| <td>org.apache.oodt.cas.crawl.action.DeleteFile<br/> |
| org.apache.oodt.cas.crawl.action.FilemgrUniquenessChecker<br/> |
| org.apache.oodt.cas.crawl.action.MimeTypeCrawlerAction<br/> |
| org.apache.oodt.cas.crawl.action.MoveFile<br/> |
| org.apache.oodt.cas.crawl.action.WorkflowMgrStatusUpdate |
| </td> |
| </tr> |
| <tr> |
| <td>preconditions</td> |
| <td>org.apache.oodt.cas.crawl.comparator.FilemgrUniquenessCheckComparator |
| </td> |
| </tr> |
| </table> |
| |
| </section> |
| </body> |
| |
| </document> |