| ------ |
| Apache Any23 - Data Extraction |
| ------ |
| The Apache Software Foundation |
| ------ |
| 2011-2012 |
| |
| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| |
| Data Extraction |
| |
| +---------------------------------------------------------------------------------------------- |
| /*1*/ Any23 runner = new Any23(); |
| /*2*/ runner.setHTTPUserAgent("test-user-agent"); |
| /*3*/ HTTPClient httpClient = runner.getHTTPClient(); |
| /*4*/ DocumentSource source = new HTTPDocumentSource( |
| httpClient, |
| "http://www.rentalinrome.com/semanticloft/semanticloft.htm" |
| ); |
| /*5*/ ByteArrayOutputStream out = new ByteArrayOutputStream(); |
| /*6*/ TripleHandler handler = new NTriplesWriter(out); |
| try { |
| /*7*/ runner.extract(source, handler); |
| } finally { |
| /*8*/ handler.close(); |
| } |
| /*9*/ String n3 = out.toString("UTF-8"); |
| +---------------------------------------------------------------------------------------------- |
| |
| This example demonstrates data extraction, which is the main purpose of the <<Apache Any23>> library. |
| At <<line 1>> we create the <<Apache Any23>> facade instance. As described before, the constructor can also be used |
| to enforce the use of specific extractors. |
| |
| <<Line 2>> sets the <HTTP User Agent>, used to identify the client during <HTTP> data collection. |
| At <<line 3>> we obtain from the runner an instance of {{{./apidocs/org/apache/any23/http/HTTPClient.html}HTTPClient}}, |
| used by {{{./apidocs/org/apache/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}} for <HTTP> content fetching. |
| |
| <<Line 4>> creates an {{{./apidocs/org/apache/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}}, |
| specifying the {{{./apidocs/org/apache/any23/http/HTTPClient.html}HTTPClient}} and the URL addressing the content |
| to be processed. |
| |
| At <<line 5>> we define an in-memory byte buffer (a <<ByteArrayOutputStream>>) that stores the data produced by the |
| {{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} defined at <<line 6>>. |
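| |
| The role of the buffer can be illustrated in isolation. The sketch below uses only the Java standard library and |
| no <<Apache Any23>> classes; the names <<BufferRoundTrip>>, <<writeLine>> and <<decode>> are hypothetical, and the |
| triple written is a made-up stand-in for what a writer might emit. |
| |

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Apache Any23.
class BufferRoundTrip {

    // Appends one line of text to the in-memory buffer as UTF-8 bytes.
    static void writeLine(ByteArrayOutputStream out, String line) {
        byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    // Decodes everything written so far into a String using UTF-8.
    static String decode(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Illustrative triple only; it is not real extraction output.
        writeLine(out, "<http://example.org/s> <http://example.org/p> \"caff\u00e8\" .");
        System.out.print(decode(out));
    }
}
```

| |
| Decoding with an explicit charset, rather than the platform default, keeps non-ASCII literals intact. |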
| |
| The <<extract>> call at <<line 7>> runs the metadata extraction. |
| The produced metadata is written to the passed |
| {{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} instance. |
| |
| The {{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} needs to be explicitly closed; |
| this is done safely in a <<finally>> block at <<line 8>>. |
| |
| At <<line 9>> the buffered bytes are decoded as <UTF-8> into a string. The expected output is: |
| |
| +---------------------------------------------------------------------------------------------- |
| <http://www.rentalinrome.com/semanticloft/semanticloft.htm> <http://purl.org/dc/terms/title> |
| "Semantic Loft (beta) - Trastevere apartments | Rental in Rome - rentalinrome.com" . |
| |
| <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft> |
| <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> |
| <http://purl.org/goodrelations/v1#Offering> . |
| |
| <http://www.rentalinrome.com> |
| <http://purl.org/goodrelations/v1#offers> |
| <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft> . |
| |
| <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft> |
| <http://www.w3.org/2000/01/rdf-schema#seeAlso> |
| <http://rentalinrome.com/semanticloft/semanticloft.htm> . |
| |
| <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft> |
| <http://purl.org/goodrelations/v1#hasBusinessFunction> |
| <http://purl.org/goodrelations/v1#ProvideService> . |
| |
| <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft> |
| <http://www.w3.org/2006/vcard/ns#adr> |
| _:node14r93a8dex1 . |
| |
| [The complete output is omitted for brevity.] |
| +---------------------------------------------------------------------------------------------- |
| |
| Filter Out Accidental Triples |
| |
| To remove accidental triples, <<Apache Any23>> provides a set of filters located |
| within the <<org.apache.any23.filter>> package. |
| |
| The filter {{{./apidocs/org/apache/any23/filter/IgnoreTitlesOfEmptyDocuments.html}IgnoreTitlesOfEmptyDocuments}} |
| removes triples generated by the {{{./apidocs/org/apache/any23/extractor/html/TitleExtractor.html}TitleExtractor}} |
| when the document is empty. |
| |
| The filter {{{./apidocs/org/apache/any23/filter/IgnoreAccidentalRDFa.html}IgnoreAccidentalRDFa}} removes accidental |
| <<CSS>>-related triples. |
| |
| +------------------------------------ |
| RDFWriter rdfWriter = ... |
| TripleHandler rdfWriterHandler = new RDFWriterTripleHandler(rdfWriter); |
| TripleHandler tripleHandler = new ReportingTripleHandler( |
| new IgnoreAccidentalRDFa( |
| new IgnoreTitlesOfEmptyDocuments(rdfWriterHandler), |
| true // if true the CSS triples will be removed in any case. |
| ) |
| ); |
| DocumentSource documentSource = ... |
| // Pass the outermost handler so that every triple goes through the filters. |
| any23.extract(documentSource, tripleHandler); |
| +------------------------------------ |
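| |
| The nesting used in this snippet is a chain of wrappers: each filter decides whether to forward a triple to the |
| handler it wraps, so only triples sent to the outermost handler are filtered. The following is a minimal sketch of |
| that idea, not the <<Apache Any23>> implementation. <<SimpleHandler>>, <<CollectingHandler>>, <<PrefixFilter>> and |
| the <<css:>> prefix are all hypothetical, simplified stand-ins chosen for illustration. |
| |

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; none of these types belong to Apache Any23.
class FilterChainSketch {

    // Builds a two-stage chain, feeds it two triples, returns what reached the sink.
    static List<String> run() {
        CollectingHandler sink = new CollectingHandler();
        SimpleHandler chain = new PrefixFilter(sink, "css:");
        chain.receive("doc", "css:stylesheet", "style.css"); // dropped by the filter
        chain.receive("doc", "dc:title", "Home");            // forwarded to the sink
        return sink.triples;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

// Simplified stand-in for a triple handler (assumption, not the real interface).
interface SimpleHandler {
    void receive(String subject, String predicate, String object);
}

// Terminal handler: records every triple that reaches the end of the chain.
class CollectingHandler implements SimpleHandler {
    final List<String> triples = new ArrayList<>();

    @Override
    public void receive(String subject, String predicate, String object) {
        triples.add(subject + " " + predicate + " " + object);
    }
}

// Filter: forwards a triple to the wrapped handler unless its predicate
// starts with the blocked prefix.
class PrefixFilter implements SimpleHandler {
    private final SimpleHandler wrapped;
    private final String blockedPrefix;

    PrefixFilter(SimpleHandler wrapped, String blockedPrefix) {
        this.wrapped = wrapped;
        this.blockedPrefix = blockedPrefix;
    }

    @Override
    public void receive(String subject, String predicate, String object) {
        if (!predicate.startsWith(blockedPrefix)) {
            wrapped.receive(subject, predicate, object);
        }
    }
}
```

| |
| Calling <<sink.receive>> directly in this sketch would bypass the filter entirely, which is why the outermost |
| handler is the one that should receive the triples. |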