                                     ------
                                     Apache Any23 - Plugins - HTML Scraper
                                     ------
                               The Apache Software Foundation
                                     ------
                                      2011-2012

 ~~  Licensed to the Apache Software Foundation (ASF) under one or more
 ~~  contributor license agreements.  See the NOTICE file distributed with
 ~~  this work for additional information regarding copyright ownership.
 ~~  The ASF licenses this file to You under the Apache License, Version 2.0
 ~~  (the "License"); you may not use this file except in compliance with
 ~~  the License.  You may obtain a copy of the License at
 ~~
 ~~     http://www.apache.org/licenses/LICENSE-2.0
 ~~
 ~~  Unless required by applicable law or agreed to in writing, software
 ~~  distributed under the License is distributed on an "AS IS" BASIS,
 ~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 ~~  See the License for the specific language governing permissions and
 ~~  limitations under the License.

 HTML Scraper Plugin

   The HTML Scraper Plugin scrapes an HTML page and extracts only its human-readable text.
   For each page, the plugin generates a set of triples such as:

 +-----------------
 <http://source-page-url> <http://vocab.sindice.net/pagecontent/de>  "<DE  Extractor Result>" .
 <http://source-page-url> <http://vocab.sindice.net/pagecontent/ae>  "<AE  Extractor Result>" .
 <http://source-page-url> <http://vocab.sindice.net/pagecontent/lce> "<LCE Extractor Result>" .
 <http://source-page-url> <http://vocab.sindice.net/pagecontent/ce>  "<CE  Extractor Result>" .
 +-----------------

   The plugin engine is based on the extractors of the {{{http://code.google.com/p/boilerpipe/} Boilerpipe}} library.
   <<DE>>, <<AE>>, <<LCE>> and <<CE>> denote extractors defined by that library
   (DefaultExtractor, ArticleExtractor, LargestContentExtractor and CanolaExtractor, respectively).
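
   The shape of the generated triples can be illustrated with a minimal sketch. It is not the plugin's
   implementation: the function names, the <<<results>>> dictionary and the choice of N-Triples output
   are assumptions made for illustration only. Only the predicate namespace and the four extractor keys
   come from the triples listed above.

```python
# Illustrative sketch only: these names are hypothetical and are not part of
# the Any23 or Boilerpipe APIs.

# Predicate namespace and extractor keys, taken from the triples shown above.
PAGECONTENT_NS = "http://vocab.sindice.net/pagecontent/"
EXTRACTOR_KEYS = ("de", "ae", "lce", "ce")


def escape_literal(text):
    """Escape a string so it can be written as an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def page_content_triples(page_url, results):
    """Return one triple line per extractor key present in `results`.

    `results` is assumed to map an extractor key ("de", "ae", "lce", "ce")
    to the text that extractor produced for the page.
    """
    lines = []
    for key in EXTRACTOR_KEYS:
        if key in results:
            lines.append(
                '<%s> <%s%s> "%s" .'
                % (page_url, PAGECONTENT_NS, key, escape_literal(results[key]))
            )
    return lines


if __name__ == "__main__":
    # Placeholder extractor output; real text would come from Boilerpipe.
    sample = {"de": "Main text", "ae": 'Article with "quotes"'}
    for line in page_content_triples("http://example.org/page", sample):
        print(line)
```

   Keys are emitted in a fixed order and skipped when absent, so a page for which an extractor
   yields no text simply produces fewer triples in this sketch.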