| ------ |
| Apache Any23 - Developers Guide |
| ------ |
| The Apache Software Foundation |
| ------ |
| 2011-2012 |
| |
| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| |
| Architectural Overview |
| |
| [./images/any23-overall.png] |
| |
| The informal architectural diagram above shows the <<Any23>> logical modules, the main data flow and the code packages implementing such modules. |
| |
| The first module, <<Data Fetching>>, is responsible for retrieving raw data from the Web, its implementation package is <<org.apache.any23.source>>. |
| The data collected by <Data Fetching> is analyzed by the <<MIMEtype Detection>> module, implemented in package <<org.apache.any23>>. |
| Such module will determine the data encoding and the content <MIME> type. |
| The identification of the MIME type is used to select a list of activable <Extractors> for the subsequent metadata extraction. |
| |
| The next phase is performed by the <<Content Validation and Patching>> module (<<org.apache.any23.validator>>), and it is required because the most part of data |
| exposed on the Web is affected by minor issues which compromise the correct working of some <Extractors>. To overcome such problems <<Any23>> |
| introduced a mechanism to detect issues and in most cases to fix them. The detection and fixing is performed using an extensible collection of <<Rules>>. |
| Currently the Validation and Patching is applied only on <DOM> based documents (<HTML>). |
| |
| The <<Metadata Extraction>> module, implemented within the <<org.apache.any23.extraction>> package, applies all the <Extractors> activated by the analysis phase |
| and generates an RDF statements stream together with an issue report. |
| The statements produced by the <Extractor>s can be filtered to remove spurious, repeated or unwanted triples using the <<Metadata Filtering>> module (<<org.apache.any23.filter>>). |
| |
| The last metadata extraction phase consists in the conversion of the filtered statements in an RDF representation format. |
| This can be done by using one of the available RDF writers provided by the <<Serialization>> module (<<org.apache.any23.writer>>). |
| |
| The other modules represented at the bottom of the diagram add auxiliary functionalities over the core application. |
| The <<Plugin Management>> module (<<org.apache.ay23.plugin>>) is responsible for the extension of the platform through the runtime detection |
| and registration of additional components included within the classpath. |
| The Plugin Manager is currently able to detect and register new Extractors, Writers and CLI tools. |
| It is foreseen the plugin support implementation for all the modules marked as (P). |
| |
| The <<CLI Tool>> module (org.apache.any23.cli) allows to run all the available CLI tools through a unified interface. |
| |
| The <<Service>> module (org.apache.any23.service) implements a REST service to use Any23 as a Web service implementing a <REST> interface. |
| |
| |
| Developers Guide |
| |
| This section introduces some <<Apache Any23>> programming fundamentals. |
| |
| * {{{./dev-data-extraction.html}Data Extraction}} |
| |
| Explains how to extract RDF data from HTTP resources with <<Apache Any23>>. |
| |
| * {{{./dev-data-conversion.html}Data Conversion}} |
| |
| Shows how to perform RDF data conversion with <<Apache Any23>>. |
| |
| * {{{./dev-validation-fix.html}Validation and Fixing}} |
| |
| Demonstrates how to define validation and correction rules for HTML content with <<Apache Any23>>. |
| |
| * {{{./dev-xpath-extractor.html}XPath Extractor}} |
| |
| Explains how to write custom scraping rules for extracting RDF data from any HTML content with <<Apache Any23>>. |
| |
| * {{{./dev-microformat-extractors.html}Microformat Extractors}} |
| |
| Explains how to write new Microformat extractors with <<Apache Any23>> and also report interesting notes on |
| microformat nesting representation. |
| |
| * {{{./dev-microdata-extractor.html}Microdata Extractor}} |
| |
| Explains how it works the Microdata Extractor embedded in <<Apache Any23>>. |
| |
| * {{{./dev-csv-extractor.html}CSV Extractor}} |
| |
| Explains how it works the CSV Extractor embedded in <<Apache Any23>>. |