                                     ------
                                     Apache Any23 - Getting started
                                     ------
                               The Apache Software Foundation
                                     ------
                                      2011-2012

 ~~  Licensed to the Apache Software Foundation (ASF) under one or more
 ~~  contributor license agreements.  See the NOTICE file distributed with
 ~~  this work for additional information regarding copyright ownership.
 ~~  The ASF licenses this file to You under the Apache License, Version 2.0
 ~~  (the "License"); you may not use this file except in compliance with
 ~~  the License.  You may obtain a copy of the License at
 ~~
 ~~     http://www.apache.org/licenses/LICENSE-2.0
 ~~
 ~~  Unless required by applicable law or agreed to in writing, software
 ~~  distributed under the License is distributed on an "AS IS" BASIS,
 ~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 ~~  See the License for the specific language governing permissions and
 ~~  limitations under the License.

 Getting started with <<Apache Any23>>

     <<Apache Any23>> can be used:

       * via CLI (command line interface) from your preferred shell environment;

       * as a RESTful Webservice;

       * as a library.

 * <<Apache Any23>> Modules

     <<Apache Any23>> is composed of the following modules:

       * <<<api/>>>      The base API definitions, e.g. the Any23 API.

       * <<<core/>>>      The core library containing all extractor functionality.

       * <<<cli/>>>       A command line interface enabling easy invocation of Any23 tools.

       * <<<csvutils/>>>       Utility code for CSV extractions.

       * <<<encoding/>>>       Character set detection and encoding.

       * <<<mime/>>>       Media-type detection.

       * <<<service/>>>   The REST service.

       * <<<plugins/>>>   The core additional plugins.

       * <<<openie/>>>   Additional extractor logic for the {{{https://github.com/allenai/openie-standalone}Open Information Extraction (Open IE) system}}.

 * Use the <<Apache Any23>> CLI

    Command-line tool support is provided by the <<cli>> module.

    Once <<Apache Any23>> has been correctly {{{./install.html}installed}}, you can use it as a command line tool
    through the shell scripts in the <<<cli/target/appassembler/bin/>>> directory.
    Scripts are provided for both Unix (Linux/OSX) and Windows.

    The <<<any23>>> script provides analysis, documentation, testing and debugging utilities.

    Running <./any23> without options prints the <usage> message.

 +-------------------------------------------
 $ cli/target/appassembler/bin/any23

 A command must be specified.
 Usage: any23 [options] [command] [command options]
   Options:
     -h, --help
        Display help information.
        Default: false
         --plugins-dir
        The Any23 plugins directory.
        Default: /Users/lmcgibbn/.any23/plugins
     -X, --verbose
        Produce execution verbose output.
        Default: false
     -v, --version
        Display version information.
        Default: false
   Commands:
     extractor      Utility for obtaining documentation about metadata extractors.
       Usage: extractor [options] Extractor name
         Options:
           -a, --all
              shows a report about all available extractors
              Default: false
           -i, --input
              shows example input for the given extractor
              Default: false
           -l, --list
              shows the names of all available extractors
              Default: false
           -o, --outut
              shows example output for the given extractor
              Default: false

     microdata      Commandline Tool for extracting Microdata from file/HTTP source.
       Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/localFile.html}

     mimes      MIME Type Detector Tool.
       Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content}

     verify      Utility for plugin management verification.
       Usage: verify [options] plugins-dir

     rover      Any23 Command Line Tool.
       Usage: rover [options] input IRIs {<url>|<file>}+
         Options:
           -d, --defaultns
              Override the default namespace used to produce statements.
           -e, --extractors
              a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle
              Default: []
           -f, --format
              the output format
              Default: json
           -l, --log
              Produce log within a file.
           -n, --nesting
              Disable production of nesting triples.
              Default: false
           -t, --notrivial
              Filter trivial statements (e.g. CSS related ones).
              Default: false
           -o, --output
              Specify Output file (defaults to standard output)
              Default: java.io.PrintStream@5204062d
           -p, --pedantic
              Validate and fixes HTML content detecting commons issues.
              Default: false
           -s, --stats
              Print out extraction statistics.
              Default: false

     vocab      Prints out the RDF Schema of the vocabularies used by Any23.
       Usage: vocab [options]
         Options:
           -f, --format
              Vocabulary output format
              Default: N-Quads (mimeTypes=application/n-quads, text/x-nquads, text/nquads; ext=nq)
 +-------------------------------------------

    The <<<any23>>> script detects the utilities available on the <<core>> and <<plugins>>
    classpath and lets you run them.

    The <any23-core> CLI tools are:

        * <<<extractor>>>: a utility for obtaining useful information about extractors.

        * <<<microdata>>>: a command-line parser that extracts Microdata content from a web page
          (local or remote) and produces JSON output compliant with the Microdata
          specification ({{{http://www.w3.org/TR/microdata/}http://www.w3.org/TR/microdata/}}).

        * <<<mimes>>>: detects the MIME Type for any HTTP / file / direct input resource.

        * <<<verify>>>: a utility for verifying <Apache Any23> plugins.

        * <<<rover>>>: the RDF extraction tool.

        * <<<vocab>>>: dumps all the <<RDFSchema>> vocabularies declared within Apache Any23.

 ** The Rover tool

    Rover is the main extraction tool. It extracts metadata from local and remote (HTTP)
    resources and lets you specify a custom list of extractors, the desired output format, and other flags
    that suppress noise and generate advanced reports.

   Extract metadata from an <<HTML>> page:

 +-----------------------------------------
 cli$ any23 rover http://yourdomain/yourfile
 +-----------------------------------------

   Extract metadata from a <<local>> resource:

 +--------------------------------------
 cli$ any23 rover myfoaf.rdf
 +--------------------------------------

   To specify the output format, use the <<"-f">> or <<"--format">> option.
   The default is the one reported in the <<<rover>>> usage message above.

 +--------------------------------------
 cli$ any23 rover -f quad myfoaf.rdf
 +--------------------------------------

   Filtering trivial statements

     By default, <<Apache Any23>> extracts <HTML/head> meta information, such as links to <CSS stylesheets> or
     details like the author or the software used to create the <HTML> page. If you are only interested
     in the structured content of the <HTML/body> tag, enable the filter with the <<"-t">>
     command line argument.

 +-------------------------
 cli$ any23 rover -t -f quad myfoaf.rdf
 +-------------------------
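
    The options above can be combined in a single invocation. The following is a minimal dry-run sketch,
    not part of the Any23 distribution: it only prints a <<<rover>>> command line built from flags listed
    in the usage message (<<"-t">>, <<"-s">>, <<"-f">>, <<"-o">>). The <<<ANY23>>> variable, the
    <<<rover_cmd>>> helper and the output file naming scheme are assumptions made for illustration.

```shell
#!/bin/sh
# Assumed location of the launcher script; override ANY23 to match your setup.
ANY23="${ANY23:-cli/target/appassembler/bin/any23}"

# Print (do not run) a rover command line for one input and one output format.
# Only flags that appear in the usage message are used: -t, -s, -f, -o.
rover_cmd() {
    input="$1"
    format="$2"
    # Hypothetical naming scheme: strip the extension, append the format name.
    output="${input%.*}.${format}.out"
    printf '%s rover -t -s -f %s -o %s %s\n' "$ANY23" "$format" "$output" "$input"
}

# Dry run: show the command for the example file used in this section.
rover_cmd myfoaf.rdf quad
```

    Once the printed line looks right, it can be pasted into the shell to run the extraction.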

 ** The ExtractorDocumentation tool

    The ExtractorDocumentation tool returns human-readable information
    about the registered extractors.

    List all the available extractors:

 +--------------------------------------
 cli$ any23 extractor --list
                       csv [org.apache.any23.extractor.csv.CSVExtractorFactory]
      html-embedded-jsonld [org.apache.any23.extractor.html.EmbeddedJSONLDExtractorFactory]
            html-head-icbm [org.apache.any23.extractor.html.ICBMExtractorFactory]
           html-head-links [org.apache.any23.extractor.html.HeadLinkExtractorFactory]
            html-head-meta [org.apache.any23.extractor.html.HTMLMetaExtractorFactory]
           html-head-title [org.apache.any23.extractor.html.TitleExtractorFactory]
               html-mf-adr [org.apache.any23.extractor.html.AdrExtractorFactory]
               html-mf-geo [org.apache.any23.extractor.html.GeoExtractorFactory]
         html-mf-hcalendar [org.apache.any23.extractor.html.HCalendarExtractorFactory]
             html-mf-hcard [org.apache.any23.extractor.html.HCardExtractorFactory]
          html-mf-hlisting [org.apache.any23.extractor.html.HListingExtractorFactory]
           html-mf-hrecipe [org.apache.any23.extractor.html.HRecipeExtractorFactory]
           html-mf-hresume [org.apache.any23.extractor.html.HResumeExtractorFactory]
           html-mf-hreview [org.apache.any23.extractor.html.HReviewExtractorFactory]
 html-mf-hreview-aggregate [org.apache.any23.extractor.html.HReviewAggregateExtractorFactory]
           html-mf-license [org.apache.any23.extractor.html.LicenseExtractorFactory]
           html-mf-species [org.apache.any23.extractor.html.SpeciesExtractorFactory]
               html-mf-xfn [org.apache.any23.extractor.html.XFNExtractorFactory]
            html-microdata [org.apache.any23.extractor.microdata.MicrodataExtractorFactory]
               html-rdfa11 [org.apache.any23.extractor.rdfa.RDFa11ExtractorFactory]
                html-xpath [org.apache.any23.extractor.xpath.XPathExtractorFactory]
                rdf-jsonld [org.apache.any23.extractor.rdf.JSONLDExtractorFactory]
                    rdf-nq [org.apache.any23.extractor.rdf.NQuadsExtractorFactory]
                    rdf-nt [org.apache.any23.extractor.rdf.NTriplesExtractorFactory]
                  rdf-trix [org.apache.any23.extractor.rdf.TriXExtractorFactory]
                rdf-turtle [org.apache.any23.extractor.rdf.TurtleExtractorFactory]
                   rdf-xml [org.apache.any23.extractor.rdf.RDFXMLExtractorFactory]
                      yaml [org.apache.any23.extractor.yaml.YAMLExtractorFactory]
 +--------------------------------------
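
    The names in the first column are the ones accepted by the <<"-e">> option of <<<rover>>>, which expects a
    comma-separated list. As a rough sketch, assuming the listing is written to standard output in the layout
    shown above, a small filter can select one family of extractors and join the names. The
    <<<extractors_with_prefix>>> helper is hypothetical and not part of Any23.

```shell
#!/bin/sh
# Read `any23 extractor --list` style lines on stdin, keep the names (first
# column) that start with the given literal prefix, and join them with commas.
extractors_with_prefix() {
    awk -v p="$1" 'index($1, p) == 1 { out = out sep $1; sep = "," } END { print out }'
}

# Sample lines copied from the listing above.
sample='                    rdf-nq [org.apache.any23.extractor.rdf.NQuadsExtractorFactory]
                    rdf-nt [org.apache.any23.extractor.rdf.NTriplesExtractorFactory]
                      yaml [org.apache.any23.extractor.yaml.YAMLExtractorFactory]'

printf '%s\n' "$sample" | extractors_with_prefix rdf-
```

    With a working installation, the same helper could feed a live listing into <<<rover>>>, for example
    <<<any23 rover -e "$(any23 extractor --list | extractors_with_prefix html-mf-)" page.html>>>.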

 ** The MicrodataParser tool

    The <MicrodataParser> tool applies only the MicrodataExtractor
    to a specific input source and returns the extracted data in the JSON format
    defined in the {{{http://www.w3.org/TR/microdata/#json}JSON}} section of the Microdata specification.

 +--------------------------------------
 cli$ any23 microdata http://path/to/resource.html
 +--------------------------------------


 ** The VocabPrinter tool

    The VocabPrinter tool prints out the RDF Schema of all the vocabularies
    declared by <<Apache Any23>>.

   Launch the command below to see all the managed vocabularies.

 +--------------------------------------
 cli$ any23 vocab
 +--------------------------------------

    <NOTE>: <<This tool is still in beta.>>

 ** The MimeDetector tool

    The MimeDetector tool detects the <<MIME type>> of a given source (http://, file:// or inline://).

    Examples:

 +--------------------------------------
 cli$ any23 mimes http://www.michelemostarda.com/foaf.rdf
 application/rdf+xml
 +--------------------------------------

 +--------------------------------------
 cli$ any23 mimes file://../src/test/resources/application/trix/test1.trx
 application/trix
 +--------------------------------------

 +--------------------------------------
 cli$ any23 mimes 'inline://<http://s> <http://p> <http://o> .'
 text/n3
 +--------------------------------------
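
    The usage message documents local inputs in the absolute form <<<file:///path/to/local.file>>>. The sketch
    below shows one way to build such a URI from a plain path before passing it to <<<mimes>>>. The
    <<<to_file_uri>>> helper is an assumption for illustration and does no percent-encoding, so it is only
    suitable for paths without spaces or special characters.

```shell
#!/bin/sh
# Turn a local path into an absolute file:// URI.
# Absolute paths are used as-is; relative paths are resolved against $PWD.
to_file_uri() {
    case "$1" in
        /*) printf 'file://%s\n' "$1" ;;
        *)  printf 'file://%s/%s\n' "$PWD" "$1" ;;
    esac
}

# Prints: file:///tmp/test1.trx
to_file_uri /tmp/test1.trx
```

    The result can then be passed on, for example <<<any23 mimes "$(to_file_uri test1.trx)">>>.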

 ** The PluginVerifier tool

   The PluginVerifier tool checks the plugins installed in the specified input directory.

   Launch the command below to sanity-check the plugins directory:

 +--------------------------------------
 cli$ any23 verify [/path/to/plugins/dir]
 +--------------------------------------

 * <<Apache Any23>> CLI <Plugins>

    The <<Apache Any23>> ToolRunner CLI (<bin/any23>) automatically detects Tool plugins on the classpath.
    For further details see the {{{./any23-plugins.html}Plugins}} section.

    The default <<any23>> CLI plugins are listed below.

 ** Crawler Plugin

    {crawler-tool}
    The <Crawler Plugin> provides basic site crawling and metadata extraction capabilities.

 +----------------------------------------------------------------------------
 cli$ any23 -h
 [...]
     crawler      Any23 Crawler Command Line Tool.
       Usage: crawler [options] input IRIs {<url>|<file>}+
   Options:
           -d, --defaultns          Override the default namespace used to
                                    produce statements.
           -e, --extractors         a comma-separated list of extractors, e.g.
                                    rdf-xml,rdf-turtle
                                    Default: []
           -f, --format             the output format
                                    Default: turtle
           -l, --log                Produce log within a file.
           -md, --maxdepth          Max allowed crawler depth.
                                    Default: 2147483647
           -mp, --maxpages          Max number of pages before interrupting
                                    crawl.
                                    Default: 2147483647
           -n, --nesting            Disable production of nesting triples.
                                    Default: false
           -t, --notrivial          Filter trivial statements (e.g. CSS related
                                    ones).
                                    Default: false
           -nc, --numcrawlers       Sets the number of crawlers.
                                    Default: 10
           -o, --output             Specify Output file (defaults to standard
                                    output)
                                    Default: java.io.PrintStream@2911a3a4
           -pf, --pagefilter        Regex used to filter out page URLs during
                                    crawling.
                                    Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$
           -p, --pedantic           Validate and fixes HTML content detecting
                                    commons issues.
                                    Default: false
           -pd, --politenessdelay   Politeness delay in milliseconds.
                                    Default: 2147483647
           -s, --stats              Print out extraction statistics.
                                    Default: false
           -sf, --storagefolder     Folder used to store crawler temporary data.
                                    Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce
 +----------------------------------------------------------------------------

     A usage example:

 +----------------------------------------------------------------------------
 cli$ any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
 +----------------------------------------------------------------------------
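
     The default <<"--pagefilter">> pattern in the usage message excludes URLs that end in common static
     resource extensions. The sketch below approximates that behaviour with <<<grep -E>>> so the effect can be
     checked on a few URLs before crawling. It is an approximation only: the crawler evaluates the pattern
     with Java regular expressions, and the <<<is_filtered>>> helper and example URLs are assumptions
     made for illustration.

```shell
#!/bin/sh
# Default --pagefilter pattern, copied from the crawler usage message.
PAGEFILTER='.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$'

# Succeeds when the URL matches the pattern, i.e. would be filtered out.
is_filtered() {
    printf '%s\n' "$1" | grep -Eq "$PAGEFILTER"
}

# Report what the pattern does with a few hypothetical URLs.
for url in http://example.org/index.html \
           http://example.org/style.css \
           http://example.org/photo.jpeg; do
    if is_filtered "$url"; then
        echo "skip  $url"
    else
        echo "crawl $url"
    fi
done
```

     A custom pattern can be tried the same way by changing <<<PAGEFILTER>>> before passing it to the
     crawler with <<"-pf">>.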

 * Use <<Apache Any23>> as a RESTful Web Service

    <<Apache Any23>> provides a Web Service that can be used to extract <RDF> from Web documents.
    <<Apache Any23>> services can be accessed through a {{{./service.html}RESTful API}}.

    Running the server

     The server command line tool is defined within the <<service>> module.
     To start the server, run the <<<any23server>>> script from the command line:

 +--------------------------
 service$ ./bin/any23server
 +--------------------------

     Then go to {{{http://localhost:8080/}}} to access the web interface.
     A live demo of the service is running at {{{http://any23.org/}}}.
     You can also start the server from Java by running the
     {{{./apidocs/org/apache/any23/servlet/Servlet.html}Apache Any23 Servlet}} class. Maven can be used to create a WAR
     file for deployment into an existing servlet container such as {{{http://tomcat.apache.org/}Apache Tomcat}}.

 * Use <<Apache Any23>> as a Library

    See our {{{./developers.html}Developers guide}} for more details.