------
Apache Any23 - Validation and Fixing
------
The Apache Software Foundation
------
2011-2012
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
Validation and Fixing
Introduction
<<Apache Any23>> is able to detect <<ill-formed HTML DOM content>> and apply fixes to it.
This section shows how to write a validation Rule and a Fix for RDFa.
It is widely recognized that RDFa is subject to a plethora of {{{http://rdfa.info/wiki/Common-publishing-mistakes}common mistakes}}.
These errors may lead to failures when extracting RDF from HTML pages, but since they are typically syntax errors
they can be easily detected and fixed with some heuristics.
This page describes the <<Apache Any23>> rule-based approach, which allows it to detect and fix such errors and to correctly extract
RDF from ill-formed RDFa in XHTML pages.
More specifically, <<Apache Any23>> allows you to write a {{{./apidocs/org/apache/any23/validator/Rule.html}Rule}}
that detects an error, a {{{./apidocs/org/apache/any23/validator/Fix.html}Fix}} containing the logic to repair it, and a
{{{./apidocs/org/apache/any23/validator/Validator.html}Validator}} which acts as a registry of rules and fixes. The Validator
runs all the registered rules and, when one of them matches, calls the associated Fix.
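The division of roles described above can be illustrated with a minimal, self-contained sketch. It does not use the Any23 API: <<<SketchRule>>>, <<<SketchFix>>>, <<<SketchValidator>>> and <<<ValidatorSketchDemo>>> are hypothetical names, the document is reduced to a map of attributes on the root element, and rules are registered as instances rather than as classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for the Rule, Fix and Validator roles.
// The "document" is only a map of attributes on the root element.
interface SketchRule {
    String name();

    // Returns true when the rule detects a problem in the document.
    boolean applyOn(Map<String, String> rootAttributes);
}

interface SketchFix {
    // Repairs the problem detected by the associated rule.
    void execute(Map<String, String> rootAttributes);
}

class SketchValidator {
    // Each registered rule is mapped to the fix that repairs it.
    private final Map<SketchRule, SketchFix> registry = new LinkedHashMap<>();

    void addRule(SketchRule rule, SketchFix fix) {
        registry.put(rule, fix);
    }

    // Runs every rule; when a rule fires and applyFix is set, runs its fix.
    // Returns the names of the rules that fired.
    List<String> validate(Map<String, String> rootAttributes, boolean applyFix) {
        List<String> fired = new ArrayList<>();
        for (Map.Entry<SketchRule, SketchFix> entry : registry.entrySet()) {
            if (entry.getKey().applyOn(rootAttributes)) {
                fired.add(entry.getKey().name());
                if (applyFix) {
                    entry.getValue().execute(rootAttributes);
                }
            }
        }
        return fired;
    }
}

class ValidatorSketchDemo {
    static SketchValidator newValidator() {
        SketchValidator validator = new SketchValidator();
        validator.addRule(
            new SketchRule() {
                public String name() {
                    return "missing-og-namespace";
                }

                public boolean applyOn(Map<String, String> attrs) {
                    return !attrs.containsKey("xmlns:og");
                }
            },
            attrs -> attrs.put("xmlns:og", "http://opengraphprotocol.org/schema/"));
        return validator;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        SketchValidator validator = newValidator();
        System.out.println(validator.validate(attrs, true));  // the rule fires and the fix runs
        System.out.println(validator.validate(attrs, false)); // nothing left to report
    }
}
```

Keeping detection and repair in separate objects means a rule can also be run in report-only mode, which is what the <<<applyFix>>> flag of this sketch stands for.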
The following code snippets show how to programmatically detect and fix a very common data error with <<Apache Any23>>.
Fix Missing Prefix Mappings Declaration
Sometimes, web authors forget to declare prefix mappings. For example, you can't just use something like dcterms:title
without first declaring the dcterms prefix mapping. If a prefix mapping isn't declared, the RDFa parser won't understand
the prefix when it is used in your document. This may prevent <<Apache Any23>> from extracting the embedded RDF triples.
This:
+------------------------------------------------------------------------------------------
<div>
The title of this document is <span property="dcterms:title">Why RDFa is Awesome</span>.
</div>
+------------------------------------------------------------------------------------------
Should be:
+------------------------------------------------------------------------------------------
<div xmlns:dcterms="http://purl.org/dc/terms/">
The title of this document is <span property="dcterms:title">Why RDFa is Awesome</span>.
</div>
+------------------------------------------------------------------------------------------
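The repair shown above can be sketched with the DOM API that ships with the JDK, independently of Any23. This is only an illustration under stated assumptions: the class <<<PrefixMappingSketch>>> and its table of known prefixes are hypothetical, the input must be well-formed XML, and only the first prefix of each <<<property>>> attribute is inspected.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrefixMappingSketch {
    // Hypothetical lookup table of well-known prefixes; illustrative only.
    static final Map<String, String> KNOWN_PREFIXES = Map.of(
        "dcterms", "http://purl.org/dc/terms/",
        "og", "http://opengraphprotocol.org/schema/");

    // Parses a well-formed XHTML fragment. The default factory is not
    // namespace-aware, so xmlns:* declarations appear as plain attributes.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed: " + e.getMessage(), e);
        }
    }

    // True when xmlns:<prefix> is declared on the element or on an ancestor.
    static boolean isDeclared(Element element, String prefix) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            if (((Element) n).hasAttribute("xmlns:" + prefix)) {
                return true;
            }
        }
        return false;
    }

    // Declares on the root element every known prefix that is used in a
    // property attribute without being declared; returns the number added.
    static int fixMissingPrefixes(Document doc) {
        int added = 0;
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            String property = element.getAttribute("property"); // "" when absent
            int colon = property.indexOf(':');
            if (colon <= 0) {
                continue; // no prefix in use
            }
            String prefix = property.substring(0, colon);
            String namespace = KNOWN_PREFIXES.get(prefix);
            if (namespace != null && !isDeclared(element, prefix)) {
                doc.getDocumentElement().setAttribute("xmlns:" + prefix, namespace);
                added++;
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Document doc = parse("<div>The title of this document is "
                + "<span property=\"dcterms:title\">Why RDFa is Awesome</span>.</div>");
        System.out.println(fixMissingPrefixes(doc)); // number of declarations added
        System.out.println(doc.getDocumentElement().getAttribute("xmlns:dcterms"));
    }
}
```

A prefix that is not in the lookup table is left untouched, since guessing a namespace URI would produce wrong triples rather than missing ones.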
With the <<Apache Any23>> {{{./apidocs/org/apache/any23/validator/package-summary.html}Validator}} classes it is possible to solve this
kind of problem simply by implementing the {{{./apidocs/org/apache/any23/validator/Rule.html}Rule}} interface, as shown below for the OpenGraph <<og>> prefix:
+------------------------------------------------------------------------------------------
public class MissingOpenGraphNamespaceRule implements Rule {

    public String getHRName() {
        return "missing-opengraph-namespace-rule";
    }

    public boolean applyOn(DOMDocument document, RuleContext context, ValidationReport validationReport) {
        // Precondition: at least one META tag uses a property with the og: prefix.
        List<Node> metas = document.getNodes("/HTML/HEAD/META");
        boolean foundPrecondition = false;
        for (Node meta : metas) {
            Node propertyNode = meta.getAttributes().getNamedItem("property");
            if (propertyNode != null && propertyNode.getTextContent().startsWith("og:")) {
                foundPrecondition = true;
                break;
            }
        }
        if (foundPrecondition) {
            // The prefix is in use: check that the HTML element declares it.
            Node htmlNode = document.getNode("/HTML");
            if (htmlNode.getAttributes().getNamedItem("xmlns:og") == null) {
                validationReport.reportIssue(
                    ValidationReport.IssueLevel.error,
                    "Missing OpenGraph namespace declaration.",
                    htmlNode
                );
                return true;
            }
        }
        return false;
    }

}
+------------------------------------------------------------------------------------------
The {{{./apidocs/org/apache/any23/validator/rule/MissingOpenGraphNamespaceRule.html}MissingOpenGraphNamespaceRule}} inspects the DOM
structure of the HTML page and, if it finds META tags with an RDFa property from the OpenGraph Protocol vocabulary,
looks for the declaration of the <<og>> namespace on the HTML element. If there is no declaration, it reports the issue and returns <<true>>,
meaning that an error has been detected within the document.
Writing a Fix for the Rule shown above is quite simple:
+------------------------------------------------------------------------------------------
public class OpenGraphNamespaceFix implements Fix {

    public static final String OPENGRAPH_PROTOCOL_NS = "http://opengraphprotocol.org/schema/";

    public String getHRName() {
        return "opengraph-namespace-fix";
    }

    public void execute(Rule rule, RuleContext context, DOMDocument document) {
        document.addAttribute("/HTML", "xmlns:og", OPENGRAPH_PROTOCOL_NS);
    }

}
+------------------------------------------------------------------------------------------
At this point it is enough to register the Rule and its associated Fix with the Validator:
+------------------------------------------------------------------------------------------
validator.addRule(MissingOpenGraphNamespaceRule.class, OpenGraphNamespaceFix.class);
+------------------------------------------------------------------------------------------
When the Rule precondition is matched, the Fix is triggered and modifies the DOM structure.