                                     ------
                                     Apache Any23 - Data Extraction
                                     ------
                               The Apache Software Foundation
                                     ------
                                      2011-2012

 ~~  Licensed to the Apache Software Foundation (ASF) under one or more
 ~~  contributor license agreements.  See the NOTICE file distributed with
 ~~  this work for additional information regarding copyright ownership.
 ~~  The ASF licenses this file to You under the Apache License, Version 2.0
 ~~  (the "License"); you may not use this file except in compliance with
 ~~  the License.  You may obtain a copy of the License at
 ~~
 ~~     http://www.apache.org/licenses/LICENSE-2.0
 ~~
 ~~  Unless required by applicable law or agreed to in writing, software
 ~~  distributed under the License is distributed on an "AS IS" BASIS,
 ~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 ~~  See the License for the specific language governing permissions and
 ~~  limitations under the License.

 Data Extraction

 +----------------------------------------------------------------------------------------------
 import java.io.ByteArrayOutputStream;
 import org.apache.any23.Any23;
 import org.apache.any23.http.HTTPClient;
 import org.apache.any23.source.DocumentSource;
 import org.apache.any23.source.HTTPDocumentSource;
 import org.apache.any23.writer.NTriplesWriter;
 import org.apache.any23.writer.TripleHandler;

 /*1*/ Any23 runner = new Any23();
 /*2*/ runner.setHTTPUserAgent("test-user-agent");
 /*3*/ HTTPClient httpClient = runner.getHTTPClient();
 /*4*/ DocumentSource source = new HTTPDocumentSource(
          httpClient,
          "http://www.rentalinrome.com/semanticloft/semanticloft.htm"
       );
 /*5*/ ByteArrayOutputStream out = new ByteArrayOutputStream();
 /*6*/ TripleHandler handler = new NTriplesWriter(out);
       try {
 /*7*/     runner.extract(source, handler);
       } finally {
 /*8*/     handler.close();
       }
 /*9*/ String nTriples = out.toString("UTF-8");
 +----------------------------------------------------------------------------------------------

    This example demonstrates data extraction, the main purpose of the <<Apache Any23>> library.
    At <<line 1>> we create the <<Apache Any23>> facade instance. As described before, the constructor can restrict
    extraction to specific extractors.

    <<Line 2>> sets the <HTTP User Agent>, which identifies the client during <HTTP> data collection.
    At <<line 3>> we obtain from the runner an instance of {{{./apidocs/org/apache/any23/http/HTTPClient.html}HTTPClient}},
    which {{{./apidocs/org/apache/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}} uses to fetch <HTTP> content.

    <<Line 4>> creates an {{{./apidocs/org/apache/any23/source/HTTPDocumentSource.html}HTTPDocumentSource}},
    specifying the {{{./apidocs/org/apache/any23/http/HTTPClient.html}HTTPClient}} and the URL of the content
    to be processed.

    At <<line 5>> we define an in-memory output stream that stores the data produced by the
    {{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} defined at <<line 6>>.

    The <<extract>> call at <<line 7>> runs the metadata extraction.
    The produced metadata is written to the
    {{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} instance passed to it.

    The {{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} needs to be explicitly closed;
    this is done safely in a <<finally>> block at <<line 8>>.
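
    The reason for closing in a <<finally>> block can be sketched with plain Java. The <<SketchHandler>> interface,
    the <<RecordingHandler>> class and the <<extract>> method below are hypothetical stand-ins introduced only for this
    illustration; they are not the <<Apache Any23>> API. The sketch shows that the handler ends up closed whether or
    not the extraction step fails.

```java
import java.util.ArrayList;
import java.util.List;

class CloseInFinallySketch {

    /** Hypothetical stand-in for a triple handler; NOT the Any23 interface. */
    interface SketchHandler {
        void receive(String statement);
        void close();
    }

    /** Records what it receives and remembers whether close() was called. */
    static class RecordingHandler implements SketchHandler {
        final List<String> received = new ArrayList<>();
        boolean closed = false;

        @Override
        public void receive(String statement) {
            received.add(statement);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Hypothetical extraction step that may fail part-way through. */
    static void extract(SketchHandler handler, boolean failMidway) {
        handler.receive("first statement");
        if (failMidway) {
            throw new IllegalStateException("simulated extraction failure");
        }
        handler.receive("second statement");
    }

    /** Runs the extraction and reports whether the handler was closed afterwards. */
    static boolean runAndReportClosed(boolean failMidway) {
        RecordingHandler handler = new RecordingHandler();
        try {
            try {
                extract(handler, failMidway);
            } finally {
                handler.close(); // runs on success and on failure
            }
        } catch (IllegalStateException e) {
            // Swallowed only so that the sketch can inspect the handler state.
        }
        return handler.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed after success: " + runAndReportClosed(false));
        System.out.println("closed after failure: " + runAndReportClosed(true));
    }
}
```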

    At <<line 9>> the collected bytes are decoded as <UTF-8>. The expected output is:

 +----------------------------------------------------------------------------------------------
 <http://www.rentalinrome.com/semanticloft/semanticloft.htm> <http://purl.org/dc/terms/title>
 "Semantic Loft (beta) - Trastevere apartments | Rental in Rome - rentalinrome.com" .

 <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft>
 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
 <http://purl.org/goodrelations/v1#Offering> .

 <http://www.rentalinrome.com>
 <http://purl.org/goodrelations/v1#offers>
 <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft> .

 <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft>
 <http://www.w3.org/2000/01/rdf-schema#seeAlso>
 <http://rentalinrome.com/semanticloft/semanticloft.htm> .

 <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft>
 <http://purl.org/goodrelations/v1#hasBusinessFunction>
 <http://purl.org/goodrelations/v1#ProvideService> .

 <http://www.rentalinrome.com/semanticloft/semanticloft.htm#semanticloft>
 <http://www.w3.org/2006/vcard/ns#adr>
 _:node14r93a8dex1 .

 [The complete output is omitted for brevity.]
 +----------------------------------------------------------------------------------------------
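
    The buffer-and-decode steps at <<line 5>> and <<line 9>> can be tried in isolation using only the JDK. The
    following is a minimal sketch, not part of <<Apache Any23>>: the class and method names are made up for the
    example, and the sample text is arbitrary. It writes <UTF-8> bytes into a <<ByteArrayOutputStream>> and decodes
    them back, first with the matching charset and then with a mismatching one.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BufferDecodeSketch {

    /**
     * Writes the text into an in-memory stream as UTF-8 bytes, then decodes
     * the collected bytes with the given charset.
     */
    static String roundTrip(String text, Charset decodeWith) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        out.write(encoded, 0, encoded.length);
        return new String(out.toByteArray(), decodeWith);
    }

    public static void main(String[] args) {
        String sample = "caff\u00e8"; // five characters, six bytes in UTF-8

        // Decoding with the charset used for writing restores the text.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));

        // Decoding with another charset turns the two-byte character into two characters.
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1).length());
    }
}
```

    Naming the charset explicitly, as the example at <<line 9>> does, avoids depending on the platform default.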

 Filter Out Accidental Triples

    To remove accidental triples, <<Apache Any23>> provides a set of filters located
    in the <<org.apache.any23.filter>> package.

    The filter {{{./apidocs/org/apache/any23/filter/IgnoreTitlesOfEmptyDocuments.html}IgnoreTitlesOfEmptyDocuments}}
    removes triples generated by the {{{./apidocs/org/apache/any23/extractor/html/TitleExtractor.html}TitleExtractor}}
    when the document is empty.

    The filter {{{./apidocs/org/apache/any23/filter/IgnoreAccidentalRDFa.html}IgnoreAccidentalRDFa}} removes accidental
    <<CSS>>-related triples.

 +------------------------------------
 RDFWriter rdfWriter = ...
 TripleHandler rdfWriterHandler = new RDFWriterTripleHandler(rdfWriter);
 // Wrap the writer handler in the filters; triples reach the writer only after passing them.
 TripleHandler tripleHandler = new ReportingTripleHandler(
         new IgnoreAccidentalRDFa(
                 new IgnoreTitlesOfEmptyDocuments(rdfWriterHandler),
                 true // if true the CSS triples will be removed in any case.
         )
 );
 DocumentSource documentSource = ...
 // Pass the outermost handler, not rdfWriterHandler, so that the filters are applied.
 any23.extract(documentSource, tripleHandler);
 +------------------------------------
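
    Each filter wraps another handler and forwards only the triples it accepts, so the filters take effect only when
    the outermost handler is the one handed to the extraction. The following sketch illustrates this with a
    deliberately simplified model: <<SketchHandler>>, <<CollectingHandler>>, <<DropPredicateFilter>> and <<emit>> are
    hypothetical names invented for the example and do not reproduce the real
    {{{./apidocs/org/apache/any23/writer/TripleHandler.html}TripleHandler}} interface.

```java
import java.util.ArrayList;
import java.util.List;

class FilterChainSketch {

    /** Hypothetical, simplified handler; the real TripleHandler interface is richer. */
    interface SketchHandler {
        void receive(String subject, String predicate, String object);
    }

    /** Innermost handler: collects every statement that reaches it. */
    static class CollectingHandler implements SketchHandler {
        final List<String> statements = new ArrayList<>();

        @Override
        public void receive(String subject, String predicate, String object) {
            statements.add(subject + " " + predicate + " " + object);
        }
    }

    /** Filter that drops statements with one predicate and forwards the rest. */
    static class DropPredicateFilter implements SketchHandler {
        private final SketchHandler wrapped;
        private final String droppedPredicate;

        DropPredicateFilter(SketchHandler wrapped, String droppedPredicate) {
            this.wrapped = wrapped;
            this.droppedPredicate = droppedPredicate;
        }

        @Override
        public void receive(String subject, String predicate, String object) {
            if (!droppedPredicate.equals(predicate)) {
                wrapped.receive(subject, predicate, object);
            }
        }
    }

    /** Hypothetical extraction step: emits two statements to the given handler. */
    static void emit(SketchHandler handler) {
        handler.receive("doc", "title", "Untitled");
        handler.receive("doc", "type", "Offering");
    }

    /** Returns how many statements reach the writer for the chosen entry point. */
    static int countWritten(boolean passOutermost) {
        CollectingHandler writer = new CollectingHandler();
        SketchHandler filtered = new DropPredicateFilter(writer, "title");
        emit(passOutermost ? filtered : writer);
        return writer.statements.size();
    }

    public static void main(String[] args) {
        System.out.println("through the filter: " + countWritten(true));
        System.out.println("bypassing the filter: " + countWritten(false));
    }
}
```

    Handing the inner handler to the extraction step bypasses the filter, and the unwanted statement is written.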