
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Htmlextractor: Metadata extraction from HTML documents

The Htmlextractor Enhancement Engine extracts metadata embedded in HTML documents, such as Microformats and RDFa. By providing additional extractors, it can be configured for any kind of content extraction from HTML pages.

Technical description

Supported metadata types

The built-in extractors are defined in the default resource htmlextractors.xml. The following metadata types are supported:

  • RDFa
  • geo
  • hAtom
  • hCal
  • hCard
  • hReview
  • rel-license
  • rel-tag
  • xFolk
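
To give an idea of what two of the simpler formats in this list look like in a page, the following is a minimal sketch that collects rel-license URLs and rel-tag names using only the Python standard library. It is not the engine's implementation and does not use its extractor configuration; the class name `RelLinkExtractor` and the function `extract_rel_metadata` are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class RelLinkExtractor(HTMLParser):
    """Illustrative only: collects rel-license URLs and rel-tag names."""

    def __init__(self):
        super().__init__()
        self.licenses = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # The rel attribute may hold several space-separated values.
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels:
            self.licenses.append(href)
        if "tag" in rels:
            # In rel-tag, the tag name is the last path segment of the href.
            path = urlparse(href).path.rstrip("/")
            self.tags.append(unquote(path.rsplit("/", 1)[-1]))


def extract_rel_metadata(html):
    """Return the license URLs and tag names found in an HTML string."""
    parser = RelLinkExtractor()
    parser.feed(html)
    parser.close()
    return {"licenses": parser.licenses, "tags": parser.tags}
```

Calling `extract_rel_metadata` on a page that contains `<a rel="tag" href="http://example.org/tags/semantic%20web">` yields the tag name `semantic web`; links without a `rel` value of `license` or `tag` are ignored.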

Vocabularies

HTML Microformat Extractors

The following table describes which vocabularies are used for representing microformat data in Metaxa:

To prevent unconnected graphs in the extracted metadata, the extracted subgraphs are connected to the content item by the property:

Configuration options

By default, the Htmlextractor engine uses the extractors specified in the resource “htmlregistry.xml”. Alternative configurations and extractors can be attached to the Htmlextractor as fragment bundles that declare the following host bundle:

Fragment-Host: org.apache.stanbol.enhancer.engines.htmlextractor

The alternative configuration files can then be set as values of the property

Usage

Assuming that the Stanbol endpoint with the full launcher is running at

http://localhost:8080

and that the engine is activated, a file can be submitted as a content item from the command line. The MIME type sent with the request must match the document type.
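
The submission described above can be sketched as a small Python script. This is an illustration under stated assumptions, not the project's documented command: it assumes the enhancer accepts POST requests at the path `/enhancer` under the base URL given above and can return `text/turtle`, which may differ between Stanbol versions. The names `ENHANCER_URL`, `build_enhancer_request`, and `submit` are hypothetical.

```python
import mimetypes
import urllib.request

# Assumption: the enhancer is exposed at /enhancer under the base URL from
# the text; adjust the path if your launcher uses a different one.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(path, content, url=ENHANCER_URL, accept="text/turtle"):
    """Build a POST request whose Content-Type matches the document type."""
    # Derive the MIME type from the file name so it matches the document.
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type of {path!r}")
    return urllib.request.Request(
        url,
        data=content,
        method="POST",
        headers={"Content-Type": mime_type, "Accept": accept},
    )


def submit(path):
    """Send the file at `path` to the endpoint and return the response text."""
    with open(path, "rb") as f:
        request = build_enhancer_request(path, f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a running instance, `print(submit("testpage.html"))` would print whatever the endpoint returns for that page; `testpage.html` is a placeholder file name.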

Alternatively, the Stanbol web interface can be used for submitting documents and viewing the metadata at

http://localhost:8080/contenthub