------
Apache Any23 - Data Extraction
------
The Apache Software Foundation
------
2011-2012
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
Data Extraction
+----------------------------------------------------------------------------------------------
/*1*/ Any23 runner = new Any23();
/*2*/ runner.setHTTPUserAgent("test-user-agent");
/*3*/ HTTPClient httpClient = runner.getHTTPClient();
/*4*/ DocumentSource source = new HTTPDocumentSource(
          httpClient,
          "http://www.rentalinrome.com/semanticloft/semanticloft.htm"
      );
/*5*/ ByteArrayOutputStream out = new ByteArrayOutputStream();
/*6*/ TripleHandler handler = new NTriplesWriter(out);
      try {
/*7*/     runner.extract(source, handler);
      } finally {
/*8*/     handler.close();
      }
/*9*/ String n3 = out.toString("UTF-8");
+----------------------------------------------------------------------------------------------
This example demonstrates data extraction, which is the main purpose of the <<Apache Any23>> library.
At <<line 1>> we define the <<Apache Any23>> facade instance. As described before, the constructor makes it possible
to enforce the use of specific extractors.
<<Line 2>> defines the <HTTP User Agent>, used to identify the client during <HTTP> data collection.
At <<line 3>> we use the runner to create an instance of {{{./apidocs/org/apache/any23/http/HTTPClient.html}HTTPClient}},
used by {{{./apidocs/org/apache/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}} for <HTTP> content fetching.
<<Line 4>> instantiates an {{{./apidocs/org/apache/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}},
specifying the {{{./apidocs/org/apache/any23/http/HTTPClient.html}HTTPClient}} and the URL addressing the content
to be processed.
At <<line 5>> we define an in-memory output stream used to store the data produced by the
{{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} defined at <<line 6>>.
The extraction method at <<line 7>> runs the metadata extraction.
The produced metadata is written to the passed
{{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} instance.
The {{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} must be explicitly closed;
this is done safely in a <<finally>> block at <<line 8>>.
At <<line 9>> the captured bytes are decoded as <UTF-8>. The expected output is:
+----------------------------------------------------------------------------------------------
<http://www.rentalinrome.com/semanticloft/semanticloft.htm> <http://purl.org/dc/terms/title>
"Semantic Loft (beta) - Trastevere apartments | Rental in Rome - rentalinrome.com" .
<http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://purl.org/goodrelations/v1#Offering> .
<http://www.rentalinrome.com>
<http://purl.org/goodrelations/v1#offers>
<http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft> .
<http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft>
<http://www.w3.org/2000/01/rdf-schema#seeAlso>
<http://rentalinrome.com/semanticloft/semanticloft.htm> .
<http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft>
<http://purl.org/goodrelations/v1#hasBusinessFunction>
<http://purl.org/goodrelations/v1#ProvideService> .
<http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft>
<http://www.w3.org/2006/vcard/ns#adr>
_:node14r93a8dex1 .
[The complete output is omitted for brevity.]
+----------------------------------------------------------------------------------------------
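The capture pattern used at <<lines 5 to 9>> (write into an in-memory stream, close the writer in a <<finally>> block,
then decode the bytes as <UTF-8>) can be tried in isolation. The following is a minimal sketch under stated assumptions:
the <<LineWriter>> class and the <<CaptureSketch>> class are hypothetical stand-ins written for this illustration and
are not part of the <<Apache Any23>> API. The stand-in buffers its output, which shows why closing the writer before
reading the stream matters.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class CaptureSketch {

    /** Hypothetical stand-in for a triple writer; NOT part of the Any23 API. */
    static final class LineWriter {
        private final Writer target;

        LineWriter(OutputStream out) {
            // Buffered: text may not reach 'out' until close() flushes it.
            this.target = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        }

        /** Writes one N-Triples-like line (no literal escaping; illustration only). */
        void write(String subject, String predicate, String literal) {
            try {
                target.write("<" + subject + "> <" + predicate + "> \"" + literal + "\" .\n");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        void close() {
            try {
                target.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    static String capture() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        LineWriter writer = new LineWriter(out);
        try {
            writer.write("http://example.org/doc", "http://purl.org/dc/terms/title", "Caff\u00e8");
        } finally {
            writer.close(); // flushes the buffered text into 'out'
        }
        // Same effect as out.toString("UTF-8"), without the checked exception.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(capture());
    }
}
```

Decoding with the same charset the writer used keeps non-ASCII literals intact; reading the stream before the writer is
closed could return incomplete data.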
Filter Out Accidental Triples
To remove accidental triples, <<Apache Any23>> provides a set of filters located
within the <<org.apache.any23.filter>> package.
The filter {{{./apidocs/org/apache/any23/filter/IgnoreTitlesOfEmptyDocuments.html}IgnoreTitlesOfEmptyDocuments}}
removes triples generated by the {{{./apidocs/org/apache/any23/extractor/html/TitleExtractor.html}TitleExtractor}}
if the document is empty.
The filter {{{./apidocs/org/apache/any23/filter/IgnoreAccidentalRDFa.html}IgnoreAccidentalRDFa}} removes accidental
<<CSS>>-related triples.
+------------------------------------
RDFWriter rdfWriter = ...
TripleHandler rdfWriterHandler = new RDFWriterTripleHandler(rdfWriter);
// Wrap the writer in the filter chain; each filter forwards the surviving triples inward.
TripleHandler tripleHandler = new ReportingTripleHandler(
    new IgnoreAccidentalRDFa(
        new IgnoreTitlesOfEmptyDocuments(rdfWriterHandler),
        true // if true the CSS triples will be removed in any case.
    )
);
DocumentSource documentSource = ...
// Pass the outermost handler so that the filters are actually applied.
any23.extract(documentSource, tripleHandler);
+------------------------------------
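Filters of this kind are composed by wrapping one handler inside another, so a filter only takes effect when the
outermost handler of the chain is the one that receives the triples. The following self-contained sketch illustrates
that wrapping idea only. The <<Handler>> interface, the <<Collector>> and <<DropPredicate>> classes, and the sample
data are hypothetical names invented for this illustration; they are much simpler than the real
{{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} and do not reproduce the logic of the
<<Apache Any23>> filters.

```java
import java.util.ArrayList;
import java.util.List;

class FilterChainSketch {

    /** Hypothetical, minimal handler interface; the real TripleHandler has more methods. */
    interface Handler {
        void receive(String subject, String predicate, String object);
    }

    /** Terminal handler: records every triple that reaches it. */
    static final class Collector implements Handler {
        final List<String> triples = new ArrayList<>();

        @Override
        public void receive(String s, String p, String o) {
            triples.add(s + " " + p + " " + o);
        }
    }

    /** Filter: forwards a triple to the wrapped handler unless its predicate is blocked. */
    static final class DropPredicate implements Handler {
        private final Handler wrapped;
        private final String blocked;

        DropPredicate(Handler wrapped, String blocked) {
            this.wrapped = wrapped;
            this.blocked = blocked;
        }

        @Override
        public void receive(String s, String p, String o) {
            if (!blocked.equals(p)) {
                wrapped.receive(s, p, o);
            }
        }
    }

    /** Feeds two sample triples to the given handler (stands in for an extraction run). */
    static void emitSamples(Handler handler) {
        handler.receive("doc", "title", "Home");
        handler.receive("doc", "stylesheet", "main.css");
    }

    /** Triples enter through the outermost handler, so the filter is applied. */
    static List<String> viaChain() {
        Collector sink = new Collector();
        Handler chain = new DropPredicate(sink, "stylesheet");
        emitSamples(chain);
        return sink.triples;
    }

    /** Triples go straight to the inner handler, so the filter never sees them. */
    static List<String> bypassingChain() {
        Collector sink = new Collector();
        Handler chain = new DropPredicate(sink, "stylesheet"); // built but bypassed
        emitSamples(sink);
        return sink.triples;
    }

    public static void main(String[] args) {
        System.out.println(viaChain());       // prints [doc title Home]
        System.out.println(bypassingChain()); // prints [doc title Home, doc stylesheet main.css]
    }
}
```

The second method shows the typical mistake: the chain is constructed, but the inner handler is handed to the
producer, so nothing is filtered.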