Notes on the extraction process
This page describes some extraction corner cases and their corresponding RDF representations. Its main aim is to describe how
some specific cases are processed by <<Any23>>, showing the correspondence between the input markup and the extracted RDF triples.
* Nesting different Microformats
This section describes how <<Any23>> represents, in RDF, the content of an HTML fragment containing different nested
Microformats.
<<Any23>> performs the extraction by running a separate extractor for every supported Microformat on an input HTML page.
There are two ways to write extractors that produce a set of RDF triples coherently
representing this nesting:
* Explicitly embedding the logic within the {{{xref/org/deri/any23/extractor/html/package-summary.html}Microformats Extractors}}
* Using the default <<Any23>> nesting feature.
In the first case, the logic for representing the nested values is embedded directly in the upper-level Extractor.
For example, the following HTML fragment shows an hCard that contains an hAddress Microformat.
+----------------------------------------------------------------------------------------------
<span class="vcard">
  <span class="fn">L'Amourita Pizza</span>
  Located at
  <span class="adr">
    <span class="street-address">123 Main St</span>,
    <span class="locality">Albequerque</span>,
    <span class="region">NM</span>.
  </span>
  <a href="http://pizza.example.com" class="url">http://pizza.example.com</a>
</span>
+----------------------------------------------------------------------------------------------
Since, as shown below, the {{{xref/org/deri/any23/extractor/html/HCardExtractor.html}HCardExtractor}}
contains the code to handle a nested hAddress,
+------------------------------
foundSomething |= addSubMicroformat("adr", card, VCARD.adr);
...
private boolean addSubMicroformat(String className, Resource resource, URI property) {
    // Collect every node in the current fragment that carries the nested Microformat class.
    List<Node> nodes = fragment.findAllByClassName(className);
    if (nodes.isEmpty()) return false;
    for (Node node : nodes) {
        // Link the parent resource to the blank node of the nested Microformat.
        addBNodeProperty(
            getDescription().getExtractorName(),
            node,
            resource, property, getBlankNodeFor(node)
        );
    }
    return true;
}
+------------------------------
it explicitly produces the triples that state the native nesting relationship:
+----------------------------------------------------------------------------------------------------
<rdf:Description rdf:nodeID="nodee2296b803cbf5c7953614ce9998c4083">
  <vcard:url rdf:resource="http://pizza.example.com"/>
  <vcard:adr rdf:nodeID="nodea8badeafb65268ab3269455dd5377a5e"/>
  <rdf:type rdf:resource="http://www.w3.org/2006/vcard/ns#VCard"/>
</rdf:Description>
<rdf:Description rdf:nodeID="nodea8badeafb65268ab3269455dd5377a5e">
  <rdf:type rdf:resource="http://www.w3.org/2006/vcard/ns#Address"/>
  <vcard:street-address>123 Main St</vcard:street-address>
  <vcard:locality>Albequerque</vcard:locality>
  <vcard:region>NM</vcard:region>
</rdf:Description>
+-----------------------------------------------------------------------------------------------------
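The explicit approach boils down to emitting one linking triple from the parent Microformat to each nested node. The following is a minimal, self-contained sketch of that idea only. It does not use the <<Any23>> API: the class <<<ExplicitNestingSketch>>>, the method <<<linkNested>>> and the node labels are hypothetical, and the property URI assumes that the <<<vcard>>> prefix maps to the namespace visible in the <<<rdf:type>>> values above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of the Any23 code base.
final class ExplicitNestingSketch {

    // Returns one N-Triples-like statement per nested node, linking the
    // parent blank node to the nested blank node through the given property.
    static List<String> linkNested(String parentNode, String property, List<String> nestedNodes) {
        List<String> triples = new ArrayList<>();
        for (String nested : nestedNodes) {
            triples.add("_:" + parentNode + " <" + property + "> _:" + nested + " .");
        }
        return triples;
    }

    public static void main(String[] args) {
        // "card" and "adr1" are made-up blank node labels.
        List<String> triples = linkNested(
                "card",
                "http://www.w3.org/2006/vcard/ns#adr", // assumed expansion of vcard:adr
                Arrays.asList("adr1"));
        for (String triple : triples) {
            System.out.println(triple);
        }
    }
}
```

An empty list of nested nodes yields no triples, which mirrors the early <<<return false>>> in the extractor code.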
The second approach leaves to <<Any23>> the responsibility of identifying nested Microformats and producing
a set of descriptive RDF triples.
For example, the following HTML fragment, provided as a reference example on
the {{{http://www.google.com/support/webmasters/bin/answer.py?answer=146862}Google Webmaster tools blog}},
shows a vEvent Microformat with a nested hCard.
+----------------------------------------------------------------------------------------------
<p class="schedule vevent">
  <span class="summary">
    <span style="font-weight:bold; color: #3E4876;">
      This event is organized by
      <span class="vcard">
        <a class="url fn" href="http://tantek.com/">Tantek Celik</a>
        <span class="org">Technorati</span>
      </span>
    </span>
    <a href="/cs/web2005/view/e_spkr/1852">Tantek Celik</a>
  </span>
</p>
+----------------------------------------------------------------------------------------------
Because the extractors provided by <<Any23>> do not explicitly handle the nesting of these two
Microformats, <<Any23>> automatically identifies the nesting relationship and represents it with the following triples:
+---------------------------------------------------------
<rdf:Description rdf:nodeID="node755b2b367973b6854ec68c77bec9b3">
  <nesting_original xmlns="http://vocab.sindice.net/" rdf:resource="http://www.w3.org/2002/12/cal/icaltzd#summary"/>
  <nesting_structured xmlns="http://vocab.sindice.net/" rdf:nodeID="node985d8f2b9afb02eeddf2e72b5eeb74"/>
</rdf:Description>
<rdf:Description rdf:nodeID="node150ldsavbx29">
  <nesting xmlns="http://vocab.sindice.net/" rdf:nodeID="node755b2b367973b6854ec68c77bec9b3"/>
</rdf:Description>
+---------------------------------------------------------
Informally, this means that the vEvent Microformat contains a nested hCard, reached through the property
http://www.w3.org/2002/12/cal/icaltzd#summary, and that both Microformats are identified by blank nodes.
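The shape of the automatic nesting description can be reproduced with three statements: one <<<nesting>>> link from the parent to an intermediate blank node, plus <<<nesting_original>>> and <<<nesting_structured>>> on that intermediate node. The sketch below only mirrors the RDF/XML shown above in an N-Triples-like form. It is not the <<Any23>> implementation: the class <<<NestingTriplesSketch>>>, the method <<<describeNesting>>> and the blank node labels are hypothetical, while the property names and namespace are taken from the example output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the Any23 code base.
final class NestingTriplesSketch {

    // Namespace of the nesting properties, as it appears in the example output.
    static final String NS = "http://vocab.sindice.net/";

    // Builds the three statements that describe one nesting relationship.
    static List<String> describeNesting(String parent, String nestingNode,
                                        String nested, String originalProperty) {
        List<String> triples = new ArrayList<>();
        // The parent Microformat points to an intermediate node describing the nesting.
        triples.add(bnode(parent) + " <" + NS + "nesting> " + bnode(nestingNode) + " .");
        // The intermediate node records the property under which the nesting occurs...
        triples.add(bnode(nestingNode) + " <" + NS + "nesting_original> <" + originalProperty + "> .");
        // ...and the blank node of the nested Microformat.
        triples.add(bnode(nestingNode) + " <" + NS + "nesting_structured> " + bnode(nested) + " .");
        return triples;
    }

    private static String bnode(String id) {
        return "_:" + id;
    }

    public static void main(String[] args) {
        // "vevent", "n" and "hcard" are made-up blank node labels.
        for (String triple : describeNesting(
                "vevent", "n", "hcard",
                "http://www.w3.org/2002/12/cal/icaltzd#summary")) {
            System.out.println(triple);
        }
    }
}
```

Keeping the intermediate node separate from the nested Microformat lets the description carry both the original property and the structured value without altering the triples produced by either extractor.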