| ------ |
| Apache Any23 - Validation and Fixing |
| ------ |
| The Apache Software Foundation |
| ------ |
| 2011-2012 |
| |
| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| |
| Validation and Fixing |
| |
| Introduction |
| |
| <<Apache Any23>> Is able to detect <<ill-formed HTML DOM content>> and apply fixes over it. |
| |
| This section will show how to write RDFa validation Rule and Fix for RDFa. |
| |
| It's widely recognized that RDFa is subjected to a plethora of different and {{{http://rdfa.info/wiki/Common-publishing-mistakes}common mistakes}}. |
| These errors may lead to a failures during RDF extraction process from HTML pages but since they are, typically, syntax errors |
| they could be easily detected and fixed with some heuristics. |
| |
| This pages describes the <<Apache Any23>> rule-based approach, that allows it to detect, fix and correctly extract |
| RDF from those ill-formed RDFa in XHTML pages. |
| |
| More specifically, <<Apache Any23>> allows you to write a {{{./apidocs/org/apache/any23/validator/Rule.html}Rule}} |
| able to detect the errors, a {{{./apidocs/org/apache/any23/validator/Fix.html}Fix}} containing the logic to fix the problem and a |
| {{{./apidocs/org/apache/any23/validator/Validator.html}Validator}} which acts as a register of rules and fixes. The Validator |
| calls all the registered rules and when one of them is applied it calls the associated Fix. |
| |
| The following code snipped shows how to programmatically detect and fix a very common data error with <<Apache Any23>>. |
| |
| Fix Missing Prefix Mappings Declaration |
| |
| Sometimes, web authors forget to declare prefix mappings. For example, you can't just use something like dcterms:title |
| without first declaring the dcterms prefix mapping. If a prefix mapping isn't declared, the RDFa parser won't understand |
| the prefix when it is used in your document. This may lead <<Apache Any23>> to don't extract such embedded RDF triples. |
| |
| This: |
| |
| +------------------------------------------------------------------------------------------ |
| <div> |
| The title of this document is <span property="dcterms:title">Why RDFa is Awesome</span>. |
| </div> |
| +------------------------------------------------------------------------------------------ |
| |
| Should be: |
| |
| +------------------------------------------------------------------------------------------ |
| <div xmlns:dcterms="http://purl.org/dc/terms/"> |
| The title of this document is <span property="dcterms:title">Why RDFa is Awesome</span>. |
| </div> |
| +------------------------------------------------------------------------------------------ |
| |
| With the <<Apache Any23>> {{{./apidocs/org/apache/any23/validator/package-summary.html}Validator}} classes it's possible to solve this |
| problem simply implementing the {{{./apidocs/org/apache/any23/validator/Rule.html}Rule}} interface as described below: |
| |
| +------------------------------------------------------------------------------------------ |
| public class MissingOpenGraphNamespaceRule implements Rule { |
| |
| public String getHRName() { |
| return "missing-opengraph-namespace-rule"; |
| } |
| |
| public boolean applyOn(DOMDocument document, RuleContext context, ValidationReport validationReport) { |
| List<Node> metas = document.getNodes("/HTML/HEAD/META"); |
| boolean foundPrecondition = false; |
| for (Node meta : metas) { |
| Node propertyNode = meta.getAttributes().getNamedItem("property"); |
| if( propertyNode != null && propertyNode.getTextContent().indexOf("og:") == 0) { |
| foundPrecondition = true; |
| break; |
| } |
| } |
| if (foundPrecondition) { |
| Node htmlNode = document.getNode("/HTML"); |
| if (htmlNode.getAttributes().getNamedItem("xmlns:og") == null) { |
| validationReport.reportIssue( |
| ValidationReport.IssueLevel.error, |
| "Missing OpenGraph namespace declaration.", |
| htmlNode |
| ); |
| return true; |
| } |
| } |
| return false; |
| } |
| } |
| +------------------------------------------------------------------------------------------ |
| |
| The {{{./apidocs/org/apache/any23/validator/rule/MissingOpenGraphNamespaceRule.html}MissingOpenGraphNamespaceRule}} inspects the DOM |
| structure of the HTML page and if it finds some META tags with some RDFa property (of the OpenGraph Protocol vocabulary, in this case) |
| it looks for the declaration of that name space. If there is no declaration it return <<true>>, that means that an error has been detected |
| within the document. |
| |
| Writing a fix for the Rule depicted above it's quite simple: |
| |
| +------------------------------------------------------------------------------------------ |
| public class OpenGraphNamespaceFix implements Fix { |
| |
| public static final String OPENGRAPH_PROTOCOL_NS = "http://opengraphprotocol.org/schema/"; |
| |
| public String getHRName() { |
| return "opengraph-namespace-fix"; |
| } |
| |
| public void execute(Rule rule, RuleContext context, DOMDocument document) { |
| document.addAttribute("/HTML", "xmlns:og", OPENGRAPH_PROTOCOL_NS); |
| } |
| |
| } |
| +------------------------------------------------------------------------------------------ |
| |
| At this point it's enough to register the Rule and the relative Fix to the Validator: |
| |
| +------------------------------------------------------------------------------------------ |
| validator.addRule(MissingOpenGraphNamespaceRule.class, OpenGraphNamespaceFix.class); |
| +------------------------------------------------------------------------------------------ |
| |
| When the Rule precondition is matched, then the Fix is triggered modifying the DOM structure. |