------
Apache Any23 - Plugins - HTML Scraper
------
The Apache Software Foundation
------
2011-2012
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
HTML Scraper Plugin
The HTML Scraper Plugin scrapes an HTML page and extracts only its human-readable text.
For each page, the plugin generates a set of triples such as:
+-----------------
<http://source-page-url> <http://vocab.sindice.net/pagecontent/de> "<DE Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/ae> "<AE Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/lce> "<LCE Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/ce> "<CE Extractor Result>" .
+-----------------
The plugin engine is based on the extractors provided by the {{{http://code.google.com/p/boilerpipe/} Boilerpipe}} library.
The extractors referred to as <<DE>>, <<AE>>, <<LCE>> and <<CE>> are the ones defined within that library.
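The shape of the generated triples can be illustrated with a minimal, self-contained sketch. It is not the plugin's actual code: the class <<<PageContentTriples>>>, its methods and the sample page URL are hypothetical names introduced here, and the extractor results are passed in as plain strings rather than produced by Boilerpipe. The sketch only shows how one triple per extractor could be serialized against the <<<http://vocab.sindice.net/pagecontent/>>> vocabulary shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Any23): serializes extractor results
// as N-Triples lines that follow the pattern shown in this page.
final class PageContentTriples {

    // Vocabulary prefix taken from the example triples above.
    static final String VOCAB = "http://vocab.sindice.net/pagecontent/";

    // Escapes the characters that may not appear raw in an N-Triples literal.
    static String escapeLiteral(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Emits one triple per entry; keys are extractor names such as "de" or "ae".
    static String toNTriples(String pageUrl, Map<String, String> results) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : results.entrySet()) {
            out.append('<').append(pageUrl).append("> <")
               .append(VOCAB).append(e.getKey()).append("> \"")
               .append(escapeLiteral(e.getValue())).append("\" .\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Placeholder values standing in for real Boilerpipe output (assumption).
        Map<String, String> results = new LinkedHashMap<>();
        results.put("de", "Text found by the DE extractor");
        results.put("ae", "Text found by the AE extractor");
        System.out.print(toNTriples("http://example.org/page", results));
    }
}
```

A <<<LinkedHashMap>>> is used so that the triples come out in insertion order, which keeps the output stable when comparing runs.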