blob: c3397886a86e56f287ca343ca5a45db25cbd5309 [file] [log] [blame]
------
Apache Any23 - Getting started
------
The Apache Software Foundation
------
2011-2012
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
Getting started with <<Apache Any23>>
<<Apache Any23>> can be used:
* via CLI (command line interface) from your preferred shell environment;
* as a RESTful Webservice;
* as a library.
* <<Apache Any23>> Modules
<<Apache Any23>> is composed of the following modules:
* <<<api/>>> The base API definitions e.g. The Any23 API.
* <<<core/>>> The core library containing all extractor functionality.
* <<<cli/>>> A command line interface enabling easy invocation of Any23 tools.
* <<<csvutils/>>> Utility code for CSV extractions.
* <<<encoding/>>> Characterset detection and encoding.
* <<<mime/>>> Media-type detection.
* <<<service/>>> The REST service.
* <<<plugins/>>> The core additional plugins.
* <<<openie/>>> Additional extractor logic for the {{{https://github.com/allenai/openie-standalone}Open Information Extraction (Open IE) system}}.
* Use the <<Apache Any23>> CLI
The command-line tools support is provided by the <<cli>> module.
Once <<Apache Any23>> has been correctly {{{./install.html}installed}}, if you want to use it as a command line tool,
use the shell script within the <<<cli/target/appassembler/bin/>>> directory.
These are provided both for Unix (Linux/OSX) and Windows.
The <<<any23>>> script provides analysis, documentation, testing and debugging utilities.
Simply running <./any23> without options will show the <usage> options.
+-------------------------------------------
$ cli/target/appassembler/bin/any23
A command must be specified.
Usage: any23 [options] [command] [command options]
Options:
-h, --help
Display help information.
Default: false
--plugins-dir
The Any23 plugins directory.
Default: /Users/lmcgibbn/.any23/plugins
-X, --verbose
Produce execution verbose output.
Default: false
-v, --version
Display version information.
Default: false
Commands:
extractor Utility for obtaining documentation about metadata extractors.
Usage: extractor [options] Extractor name
Options:
-a, --all
shows a report about all available extractors
Default: false
-i, --input
shows example input for the given extractor
Default: false
-l, --list
shows the names of all available extractors
Default: false
-o, --outut
shows example output for the given extractor
Default: false
microdata Commandline Tool for extracting Microdata from file/HTTP source.
Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/localFile.html}
mimes MIME Type Detector Tool.
Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content}
verify Utility for plugin management verification.
Usage: verify [options] plugins-dir
rover Any23 Command Line Tool.
Usage: rover [options] input IRIs {<url>|<file>}+
Options:
-d, --defaultns
Override the default namespace used to produce statements.
-e, --extractors
a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle
Default: []
-f, --format
the output format
Default: json
-l, --log
Produce log within a file.
-n, --nesting
Disable production of nesting triples.
Default: false
-t, --notrivial
Filter trivial statements (e.g. CSS related ones).
Default: false
-o, --output
Specify Output file (defaults to standard output)
Default: java.io.PrintStream@5204062d
-p, --pedantic
Validate and fixes HTML content detecting commons issues.
Default: false
-s, --stats
Print out extraction statistics.
Default: false
vocab Prints out the RDF Schema of the vocabularies used by Any23.
Usage: vocab [options]
Options:
-f, --format
Vocabulary output format
Default: N-Quads (mimeTypes=application/n-quads, text/x-nquads, text/nquads; ext=nq)
+-------------------------------------------
The <<<any23>>> script detects a list of available utilities within the <<core>> and <<plugins>>
classpath and allows to activate them.
The <any23-core> CLI tools are:
* <<<extractor>>>: a utility for obtaining useful information about extractors.
* <<<microdata>>>: commandline parser to extract specific Microdata content from a web page
(local or remote) and produce a JSON output compliant with the Microdata
specification ({{{http://www.w3.org/TR/microdata/}http://www.w3.org/TR/microdata/}}).
* <<<mimes>>>: detects the MIME Type for any HTTP / file / direct input resource.
* <<<verify>>>: a utility for verifying <Apache Any23> plugins.
* <<<rover>>>: the RDF extraction tool.
* <<<vocab>>>: allows to dump all the <<RDFSchema>> vocabularies declared within Apache Any23.
** The Rover tool
Rover is the main extraction tool. It allows to extract metadata from local and remote (HTTP)
resources, specify a custom list of extractors, specify the desired output format and other flags
to suppress noise and generate advanced reports.
Extract metadata from an <<HTML>> page:
+-----------------------------------------
cli$ any23 rover http://yourdomain/yourfile
+-----------------------------------------
Extract metadata from a <<local>> resource:
+--------------------------------------
cli$ any23 rover myfoaf.rdf
+--------------------------------------
Specify the output format, use the option <<"-f">> or <<"--format">>:
(Default output format is <<TURTLE>>).
+--------------------------------------
cli$ any23 rover -f quad myfoaf.rdf
+--------------------------------------
Filtering trivial statements
By default, <<Apache Any23>> will extract <HTML/head> meta information, such as links to <CSS stylesheets> or meta
information like the author or the software used to create the <html>. Hence, if the user is only interested
in the structured content from the <HTML/body> tag we offer a filter functionality, activated by the <<"-t">>
command line argument.
+-------------------------
core$ any23 rover -t -f quad myfoaf.rdf
+-------------------------
** The ExtractorDocumentation tool
The ExtractorDocumentation returns human readable information
about the registered extractors.
List all the available extractors:
+--------------------------------------
cli$ any23 extractor --list
csv [org.apache.any23.extractor.csv.CSVExtractorFactory]
html-embedded-jsonld [org.apache.any23.extractor.html.EmbeddedJSONLDExtractorFactory]
html-head-icbm [org.apache.any23.extractor.html.ICBMExtractorFactory]
html-head-links [org.apache.any23.extractor.html.HeadLinkExtractorFactory]
html-head-meta [org.apache.any23.extractor.html.HTMLMetaExtractorFactory]
html-head-title [org.apache.any23.extractor.html.TitleExtractorFactory]
html-mf-adr [org.apache.any23.extractor.html.AdrExtractorFactory]
html-mf-geo [org.apache.any23.extractor.html.GeoExtractorFactory]
html-mf-hcalendar [org.apache.any23.extractor.html.HCalendarExtractorFactory]
html-mf-hcard [org.apache.any23.extractor.html.HCardExtractorFactory]
html-mf-hlisting [org.apache.any23.extractor.html.HListingExtractorFactory]
html-mf-hrecipe [org.apache.any23.extractor.html.HRecipeExtractorFactory]
html-mf-hresume [org.apache.any23.extractor.html.HResumeExtractorFactory]
html-mf-hreview [org.apache.any23.extractor.html.HReviewExtractorFactory]
html-mf-hreview-aggregate [org.apache.any23.extractor.html.HReviewAggregateExtractorFactory]
html-mf-license [org.apache.any23.extractor.html.LicenseExtractorFactory]
html-mf-species [org.apache.any23.extractor.html.SpeciesExtractorFactory]
html-mf-xfn [org.apache.any23.extractor.html.XFNExtractorFactory]
html-microdata [org.apache.any23.extractor.microdata.MicrodataExtractorFactory]
html-rdfa11 [org.apache.any23.extractor.rdfa.RDFa11ExtractorFactory]
html-xpath [org.apache.any23.extractor.xpath.XPathExtractorFactory]
rdf-jsonld [org.apache.any23.extractor.rdf.JSONLDExtractorFactory]
rdf-nq [org.apache.any23.extractor.rdf.NQuadsExtractorFactory]
rdf-nt [org.apache.any23.extractor.rdf.NTriplesExtractorFactory]
rdf-trix [org.apache.any23.extractor.rdf.TriXExtractorFactory]
rdf-turtle [org.apache.any23.extractor.rdf.TurtleExtractorFactory]
rdf-xml [org.apache.any23.extractor.rdf.RDFXMLExtractorFactory]
yaml [org.apache.any23.extractor.yaml.YAMLExtractorFactory]
+--------------------------------------
** The MicrodataParser tool
The <MicrodataParser> tool allows to apply the only MicrodataExtractor
on a specific input source and returns the extracted data in the JSON format
declared in the Microdata specification section {{{http://www.w3.org/TR/microdata/#json}JSON}}.
+--------------------------------------
cli$ any23 microdata http://path/to/resource.html
+--------------------------------------
** The VocabPrinter tool
The VocabPrinter Tool prints out the RDFSchema declared by all the <<Apache Any23>>
declared vocabularies.
Just launch the command below to see all the managed vocabularies.
+--------------------------------------
cli$ any23 vocab
+--------------------------------------
<NOTE>: <<This tool is still in beta version.>>
** The MimeDetector tool
The MimeDetector Tool extracts the <<MIME Type>> for a given source (http:// file:// inline://).
Examples:
+--------------------------------------
cli$ any23 mimes http://www.michelemostarda.com/foaf.rdf
application/rdf+xml
+--------------------------------------
+--------------------------------------
cli$ any23 mimes file://../src/test/resources/application/trix/test1.trx
application/trix
+--------------------------------------
+--------------------------------------
cli$ any23 mimes 'inline://<http://s> <http://p> <http://o> .'
text/n3
+--------------------------------------
** The PluginVerifier tool
The PluginVerifier tool allows checking installed plugin in the specified input directory
Just launch the command below to sanity-check the input plugins directory
+--------------------------------------
cli$ any23 verify [/path/to/plugins/dir]
+--------------------------------------
* <<Apache Any23>> CLI <Plugins>
The <<Apache Any23>> ToolRunner CLI (<bin/any23>) supports the auto detection of Tool plugins within the classpath.
For further details see {{{./any23-plugins.html}Plugins}} section.
The default <<any23>> CLI plugins are enlisted below.
** Crawler Plugin
{crawler-tool}
The <Crawler Plugin> provides basic site crawling and metadata extraction capabilities.
+----------------------------------------------------------------------------
cli$ any23 -h
[...]
crawler Any23 Crawler Command Line Tool.
Usage: crawler [options] input IRIs {<url>|<file>}+
Options:
-d, --defaultns Override the default namespace used to
produce statements.
-e, --extractors a comma-separated list of extractors, e.g.
rdf-xml,rdf-turtle
Default: []
-f, --format the output format
Default: turtle
-l, --log Produce log within a file.
-md, --maxdepth Max allowed crawler depth.
Default: 2147483647
-mp, --maxpages Max number of pages before interrupting
crawl.
Default: 2147483647
-n, --nesting Disable production of nesting triples.
Default: false
-t, --notrivial Filter trivial statements (e.g. CSS related
ones).
Default: false
-nc, --numcrawlers Sets the number of crawlers.
Default: 10
-o, --output Specify Output file (defaults to standard
output)
Default: java.io.PrintStream@2911a3a4
-pf, --pagefilter Regex used to filter out page URLs during
crawling.
Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$
-p, --pedantic Validate and fixes HTML content detecting
commons issues.
Default: false
-pd, --politenessdelay Politeness delay in milliseconds.
Default: 2147483647
-s, --stats Print out extraction statistics.
Default: false
-sf, --storagefolder Folder used to store crawler temporary data.
Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce
+----------------------------------------------------------------------------
A usage example:
+----------------------------------------------------------------------------
cli$ any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
+----------------------------------------------------------------------------
* Use <<Apache Any23>> as a RESTful Web Service
<<Apache Any23>> provides a Web Service that can be used to extract <RDF> from Web documents.
<<Apache Any23>> services can be accessed through a {{{./service.html}RESTful API}}.
Running the server
The server command line tool is defined within the <<service>> module.
Run the <<<any23server>>> script
+--------------------------
service$ ./bin/any23server
+--------------------------
from the command line in order to start up the server, then go to {{{http://localhost:8080/}}}
to access the web interface. A live demo version of such service is running at {{{http://any23.org/}}}.
You can also start the server from Java by running the
{{{./apidocs/org/apache/any23/servlet/Servlet.html}Apache Any23 Servlet}} class. Maven can be used to create a WAR
file for deployment into an existing servlet container such as {{{http://tomcat.apache.org/}Apache Tomcat}}.
* Use <<Apache Any23>> as a Library
See our {{{./developers.html}Developers guide}} for more details.