| ------ |
| Apache Any23 - Microformat Extractors |
| ------ |
| The Apache Software Foundation |
| ------ |
| 2011-2012 |
| |
| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| |
| Microformat Extractors |
| |
| This section describes some extraction corner cases and their RDF representations. |
| Its main aim is to show how <<Apache Any23>> processes |
| some specific cases and which RDF triples are extracted from them. |
| |
| {microformat-nesting} |
| * Nesting different Microformats |
| |
| [TODO: add picture about microformat nesting structure.] |
| |
| This section describes how <<Apache Any23>> represents, in RDF, the content of an HTML fragment containing different nested |
| Microformats. |
| <<Apache Any23>> performs the extraction by running a separate extractor for every supported Microformat on the input HTML page. |
| There are two ways to write extractors that produce a set of RDF triples coherently |
| representing this nesting. |
| |
| More specifically: |
| |
| * Explicitly embedding the logic within the |
| {{{./apidocs/org/apache/any23/extractor/html/package-summary.html}Microformat Extractors}}. |
| |
| * Using the default <<Apache Any23>> nesting feature. |
| |
| In the first case, the logic for representing the nested values is embedded directly in the upper-level extractor. |
| For example, the following HTML fragment shows an hCard that contains an hAddress Microformat. |
| |
| +---------------------------------------------------------------------------------------------- |
| <span class="vcard"> |
| <span class="fn">L'Amourita Pizza</span> |
| Located at |
| <span class="adr"> |
| <span class="street-address">123 Main St</span>, |
| <span class="locality">Albequerque</span>, |
| <span class="region">NM</span>. |
| </span> |
| <a href="http://pizza.example.com" class="url">http://pizza.example.com</a> |
| </span> |
| +---------------------------------------------------------------------------------------------- |
| |
| Since, as shown below, the {{{./apidocs/org/apache/any23/extractor/html/HCardExtractor.html}HCardExtractor}} |
| contains the code to handle nested hAddress, |
| |
| +------------------------------ |
| |
| foundSomething |= addSubMicroformat("adr", card, VCARD.adr); |
| |
| ... |
| |
| private boolean addSubMicroformat(String className, Resource resource, IRI property) { |
| List<Node> nodes = fragment.findAllByClassName(className); |
| if (nodes.isEmpty()) return false; |
| for (Node node : nodes) { |
| addBNodeProperty( |
| getDescription().getExtractorName(), |
| node, |
| resource, property, getBlankNodeFor(node) |
| ); |
| } |
| return true; |
| } |
| |
| +------------------------------ |
| |
| it explicitly produces the triples that state the native nesting relationship: |
| |
| +---------------------------------------------------------------------------------------------------- |
| <rdf:Description rdf:nodeID="nodee2296b803cbf5c7953614ce9998c4083"> |
| <vcard:url rdf:resource="http://pizza.example.com"/> |
| <vcard:adr rdf:nodeID="nodea8badeafb65268ab3269455dd5377a5e"/> |
| <rdf:type rdf:resource="http://www.w3.org/2006/vcard/ns#VCard"/> |
| </rdf:Description> |
| |
| <rdf:Description rdf:nodeID="nodea8badeafb65268ab3269455dd5377a5e"> |
| <rdf:type rdf:resource="http://www.w3.org/2006/vcard/ns#Address"/> |
| <vcard:street-address>123 Main St</vcard:street-address> |
| <vcard:locality>Albequerque</vcard:locality> |
| <vcard:region>NM</vcard:region> |
| </rdf:Description> |
| +----------------------------------------------------------------------------------------------------- |
| |
| It is highly recommended to decorate extractors that natively handle the nesting relationship with the |
| {{{./apidocs/org/apache/any23/extractor/html/annotations/Includes.html}@Includes}} annotation. This annotation, |
| if present, prevents the production of <nesting_original> and <nesting_structured> RDF statements. |
| |
| The following example shows how the {{{./apidocs/org/apache/any23/extractor/html/annotations/Includes.html}@Includes}} annotation |
| can be used to declare that {{{./apidocs/org/apache/any23/extractor/html/HCardExtractor.html}HCardExtractor}} natively |
| embeds the {{{./apidocs/org/apache/any23/extractor/html/AdrExtractor.html}AdrExtractor}}. |
| |
| +---------------------------------------------------------------------------------------------- |
| @Includes( extractors = AdrExtractor.class ) |
| public class HCardExtractor extends EntityBasedMicroformatExtractor { |
| |
| // code omitted for brevity |
| |
| } |
| +---------------------------------------------------------------------------------------------- |
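| |
| As an illustration of how such a declaration could be consulted at runtime, the following is a minimal sketch using plain Java reflection. |
| It is not the <<Apache Any23>> implementation: the nested <Includes> annotation is a local stand-in that models only the |
| <extractors> attribute shown above, and the extractor classes and the <handlesNatively> helper are hypothetical placeholders. |
| |

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

final class IncludesSketch {

    // Local stand-in for the real @Includes annotation (assumption):
    // only the "extractors" attribute is modelled here.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Includes {
        Class<?>[] extractors();
    }

    // Hypothetical placeholder classes, not the Any23 extractors.
    static final class AdrExtractor {}

    @Includes(extractors = AdrExtractor.class)
    static final class HCardExtractor {}

    static final class VEventExtractor {}

    // Returns true when "outer" declares, through the annotation,
    // that it handles "inner" natively.
    static boolean handlesNatively(Class<?> outer, Class<?> inner) {
        Includes includes = outer.getAnnotation(Includes.class);
        return includes != null
                && Arrays.asList(includes.extractors()).contains(inner);
    }

    public static void main(String[] args) {
        // Declared nesting: generic nesting statements could be skipped.
        System.out.println(handlesNatively(HCardExtractor.class, AdrExtractor.class));
        // No declaration: the generic nesting feature would apply.
        System.out.println(handlesNatively(VEventExtractor.class, HCardExtractor.class));
    }
}
```

| A check of this kind lets the generic nesting step skip pairs of extractors whose relationship is already expressed natively. |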
| |
| |
| The second approach leaves to <<Apache Any23>> the responsibility of identifying nested Microformats and producing |
| a set of descriptive RDF triples. |
| For example, the following HTML fragment, provided as a reference example on |
| the {{{http://www.google.com/support/webmasters/bin/answer.py?answer=146862}Google Webmaster Tools blog}}, |
| shows a vEvent Microformat with a nested hCard. |
| |
| +---------------------------------------------------------------------------------------------- |
| <p class="schedule vevent"> |
| <span class="summary"> |
| <span style="font-weight:bold; color: #3E4876;"> |
| This event is organized by |
| <span class="vcard"> |
| <a class="url fn" href="http://tantek.com/">Tantek Celik</a> |
| <span class="org">Technorati</span> |
| </span> |
| </span> |
| <a href="/cs/web2005/view/e_spkr/1852">Tantek Celik</a> |
| </span> |
| </p> |
| +---------------------------------------------------------------------------------------------- |
| |
| Because the extractors provided by <<Apache Any23>> do not explicitly foresee the nesting of these two |
| Microformats, <<Apache Any23>> automatically identifies the nesting relationship and represents it with the following triples: |
| |
| +--------------------------------------------------------- |
| <rdf:Description rdf:nodeID="node755b2b367973b6854ec68c77bec9b3"> |
| <nesting_original xmlns="http://vocab.sindice.net/" rdf:resource="http://www.w3.org/2002/12/cal/icaltzd#summary"/> |
| <nesting_structured xmlns="http://vocab.sindice.net/" rdf:nodeID="node985d8f2b9afb02eeddf2e72b5eeb74"/> |
| </rdf:Description> |
| |
| <rdf:Description rdf:nodeID="node150ldsavbx29"> |
| <nesting xmlns="http://vocab.sindice.net/" rdf:nodeID="node755b2b367973b6854ec68c77bec9b3"/> |
| </rdf:Description> |
| +--------------------------------------------------------- |
| |
| Informally, this means that the vEvent Microformat has a nested hCard through the property |
| http://www.w3.org/2002/12/cal/icaltzd#summary, with a blank node provided for each of the two Microformats. |
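| |
| The shape of these statements can be sketched in plain Java as follows. This is only an illustration of the three-statement pattern |
| visible in the RDF above, not the <<Apache Any23>> code: the <Triple> class, the <describeNesting> helper and the blank node labels |
| are hypothetical, and only the http://vocab.sindice.net/ property names are taken from the output shown. |
| |

```java
import java.util.List;

final class NestingTriplesSketch {

    // Namespace of the nesting properties in the RDF output shown above.
    static final String SINDICE = "http://vocab.sindice.net/";

    // Minimal stand-in for an RDF statement (assumption, not a library type).
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    // Builds the three statements of the pattern:
    //   container   --nesting-->            nestingNode
    //   nestingNode --nesting_original-->   property carrying the nested content
    //   nestingNode --nesting_structured--> nested Microformat node
    static List<Triple> describeNesting(String container, String nestingNode,
                                        String viaProperty, String nested) {
        return List.of(
                new Triple(container, SINDICE + "nesting", nestingNode),
                new Triple(nestingNode, SINDICE + "nesting_original", viaProperty),
                new Triple(nestingNode, SINDICE + "nesting_structured", nested));
    }

    public static void main(String[] args) {
        // Hypothetical blank node labels standing in for the generated IDs.
        describeNesting("_:vevent", "_:nestingInfo",
                "http://www.w3.org/2002/12/cal/icaltzd#summary", "_:hcard")
                .forEach(System.out::println);
    }
}
```

| Keeping the nesting description on an intermediate node leaves the vEvent and hCard descriptions themselves unchanged. |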
| |