| ------ |
| Apache Any23 - Getting started |
| ------ |
| The Apache Software Foundation |
| ------ |
| 2011-2012 |
| |
| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| |
| Getting started with <<Apache Any23>> |
| |
| <<Apache Any23>> can be used: |
| |
| * via CLI (command line interface) from your preferred shell environment; |
| |
| * as a RESTful Webservice; |
| |
| * as a library. |
| |
| * <<Apache Any23>> Modules |
| |
| <<Apache Any23>> is composed of the following modules: |
| |
| * <<<api/>>> The base API definitions, e.g. the Any23 API. |
| |
| * <<<core/>>> The core library containing all extractor functionality. |
| |
| * <<<cli/>>> A command line interface enabling easy invocation of Any23 tools. |
| |
| * <<<csvutils/>>> Utility code for CSV extractions. |
| |
| * <<<encoding/>>> Character set detection and encoding. |
| |
| * <<<mime/>>> Media-type detection. |
| |
| * <<<service/>>> The REST service. |
| |
| * <<<plugins/>>> The core additional plugins. |
| |
| * <<<openie/>>> Additional extractor logic for the {{{https://github.com/allenai/openie-standalone}Open Information Extraction (Open IE) system}}. |
| |
| * Use the <<Apache Any23>> CLI |
| |
| The command-line tools support is provided by the <<cli>> module. |
| |
| Once <<Apache Any23>> has been correctly {{{./install.html}installed}}, you can use it as a command line tool |
| through the shell scripts within the <<<cli/target/appassembler/bin/>>> directory. |
| Scripts are provided for both Unix (Linux/OSX) and Windows. |
| |
| The <<<any23>>> script provides analysis, documentation, testing and debugging utilities. |
| |
| Running <./any23> without options prints the <usage> information. |
| |
| +------------------------------------------- |
| $ cli/target/appassembler/bin/any23 |
| |
| A command must be specified. |
| Usage: any23 [options] [command] [command options] |
| Options: |
| -h, --help |
| Display help information. |
| Default: false |
| --plugins-dir |
| The Any23 plugins directory. |
| Default: /Users/lmcgibbn/.any23/plugins |
| -X, --verbose |
| Produce execution verbose output. |
| Default: false |
| -v, --version |
| Display version information. |
| Default: false |
| Commands: |
| extractor Utility for obtaining documentation about metadata extractors. |
| Usage: extractor [options] Extractor name |
| Options: |
| -a, --all |
| shows a report about all available extractors |
| Default: false |
| -i, --input |
| shows example input for the given extractor |
| Default: false |
| -l, --list |
| shows the names of all available extractors |
| Default: false |
| -o, --outut |
| shows example output for the given extractor |
| Default: false |
| |
| microdata Commandline Tool for extracting Microdata from file/HTTP source. |
| Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/localFile.html} |
| |
| mimes MIME Type Detector Tool. |
| Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content} |
| |
| verify Utility for plugin management verification. |
| Usage: verify [options] plugins-dir |
| |
| rover Any23 Command Line Tool. |
| Usage: rover [options] input IRIs {<url>|<file>}+ |
| Options: |
| -d, --defaultns |
| Override the default namespace used to produce statements. |
| -e, --extractors |
| a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle |
| Default: [] |
| -f, --format |
| the output format |
| Default: json |
| -l, --log |
| Produce log within a file. |
| -n, --nesting |
| Disable production of nesting triples. |
| Default: false |
| -t, --notrivial |
| Filter trivial statements (e.g. CSS related ones). |
| Default: false |
| -o, --output |
| Specify Output file (defaults to standard output) |
| Default: java.io.PrintStream@5204062d |
| -p, --pedantic |
| Validate and fixes HTML content detecting commons issues. |
| Default: false |
| -s, --stats |
| Print out extraction statistics. |
| Default: false |
| |
| vocab Prints out the RDF Schema of the vocabularies used by Any23. |
| Usage: vocab [options] |
| Options: |
| -f, --format |
| Vocabulary output format |
| Default: N-Quads (mimeTypes=application/n-quads, text/x-nquads, text/nquads; ext=nq) |
| +------------------------------------------- |
| |
| The <<<any23>>> script detects the utilities available on the <<core>> and <<plugins>> |
| classpath and lets you invoke them. |
| |
| The <any23-core> CLI tools are: |
| |
| * <<<extractor>>>: a utility for obtaining useful information about extractors. |
| |
| * <<<microdata>>>: a command line parser that extracts Microdata content from a web page |
| (local or remote) and produces JSON output compliant with the Microdata |
| specification ({{{http://www.w3.org/TR/microdata/}http://www.w3.org/TR/microdata/}}). |
| |
| * <<<mimes>>>: detects the MIME Type for any HTTP / file / direct input resource. |
| |
| * <<<verify>>>: a utility for verifying <Apache Any23> plugins. |
| |
| * <<<rover>>>: the RDF extraction tool. |
| |
| * <<<vocab>>>: dumps all the <<RDF Schema>> vocabularies declared within Apache Any23. |
| |
| ** The Rover tool |
| |
| Rover is the main extraction tool. It extracts metadata from local and remote (HTTP) |
| resources and lets you specify a custom list of extractors, the desired output format, and other flags |
| that suppress noise and generate advanced reports. |
| |
| Extract metadata from an <<HTML>> page: |
| |
| +----------------------------------------- |
| cli$ any23 rover http://yourdomain/yourfile |
| +----------------------------------------- |
| |
| Extract metadata from a <<local>> resource: |
| |
| +-------------------------------------- |
| cli$ any23 rover myfoaf.rdf |
| +-------------------------------------- |
| |
| To specify the output format, use the option <<"-f">> or <<"--format">> |
| (the default output format is <<TURTLE>>): |
| |
| +-------------------------------------- |
| cli$ any23 rover -f quad myfoaf.rdf |
| +-------------------------------------- |
| |
| Filtering trivial statements |
| |
| By default, <<Apache Any23>> extracts <HTML/head> meta information, such as links to <CSS stylesheets>, |
| the author, or the software used to create the <HTML>. If you are only interested |
| in the structured content of the <HTML/body> tag, use the filter activated by the <<"-t">> |
| command line argument. |
| |
| +------------------------- |
| cli$ any23 rover -t -f quad myfoaf.rdf |
| +------------------------- |
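| |
| The rover options shown above can be combined in a small wrapper. The following is only a sketch: it prints |
| (rather than runs) one rover command per input file. It assumes that the launcher lives at |
| <<<cli/target/appassembler/bin/any23>>>, that <<<quad>>> is accepted as a format name as in the example above, |
| and that a <<<.nq>>> suffix is wanted for the output file. The <<<rover_command>>> helper is hypothetical. |
| |
```shell
#!/bin/sh
# Launcher path is an assumption; override with ANY23=/path/to/any23.
ANY23="${ANY23:-cli/target/appassembler/bin/any23}"

# Hypothetical helper: print the rover command for one input file.
# -t filters trivial statements, -f selects the format, -o names the output file.
rover_command() {
    input="$1"
    output="${input%.*}.nq"   # strip the last extension, then append .nq
    printf '%s rover -t -f quad -o %s %s\n' "$ANY23" "$output" "$input"
}

# Print one command per input; pipe the result to sh to actually run them.
for f in myfoaf.rdf; do
    rover_command "$f"
done
```
| |
| Printing the commands first makes it easy to review them before any file is written. |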
| |
| ** The ExtractorDocumentation tool |
| |
| The ExtractorDocumentation tool returns human-readable information |
| about the registered extractors. |
| |
| List all the available extractors: |
| |
| +-------------------------------------- |
| cli$ any23 extractor --list |
| csv [org.apache.any23.extractor.csv.CSVExtractorFactory] |
| html-embedded-jsonld [org.apache.any23.extractor.html.EmbeddedJSONLDExtractorFactory] |
| html-head-icbm [org.apache.any23.extractor.html.ICBMExtractorFactory] |
| html-head-links [org.apache.any23.extractor.html.HeadLinkExtractorFactory] |
| html-head-meta [org.apache.any23.extractor.html.HTMLMetaExtractorFactory] |
| html-head-title [org.apache.any23.extractor.html.TitleExtractorFactory] |
| html-mf-adr [org.apache.any23.extractor.html.AdrExtractorFactory] |
| html-mf-geo [org.apache.any23.extractor.html.GeoExtractorFactory] |
| html-mf-hcalendar [org.apache.any23.extractor.html.HCalendarExtractorFactory] |
| html-mf-hcard [org.apache.any23.extractor.html.HCardExtractorFactory] |
| html-mf-hlisting [org.apache.any23.extractor.html.HListingExtractorFactory] |
| html-mf-hrecipe [org.apache.any23.extractor.html.HRecipeExtractorFactory] |
| html-mf-hresume [org.apache.any23.extractor.html.HResumeExtractorFactory] |
| html-mf-hreview [org.apache.any23.extractor.html.HReviewExtractorFactory] |
| html-mf-hreview-aggregate [org.apache.any23.extractor.html.HReviewAggregateExtractorFactory] |
| html-mf-license [org.apache.any23.extractor.html.LicenseExtractorFactory] |
| html-mf-species [org.apache.any23.extractor.html.SpeciesExtractorFactory] |
| html-mf-xfn [org.apache.any23.extractor.html.XFNExtractorFactory] |
| html-microdata [org.apache.any23.extractor.microdata.MicrodataExtractorFactory] |
| html-rdfa11 [org.apache.any23.extractor.rdfa.RDFa11ExtractorFactory] |
| html-xpath [org.apache.any23.extractor.xpath.XPathExtractorFactory] |
| rdf-jsonld [org.apache.any23.extractor.rdf.JSONLDExtractorFactory] |
| rdf-nq [org.apache.any23.extractor.rdf.NQuadsExtractorFactory] |
| rdf-nt [org.apache.any23.extractor.rdf.NTriplesExtractorFactory] |
| rdf-trix [org.apache.any23.extractor.rdf.TriXExtractorFactory] |
| rdf-turtle [org.apache.any23.extractor.rdf.TurtleExtractorFactory] |
| rdf-xml [org.apache.any23.extractor.rdf.RDFXMLExtractorFactory] |
| yaml [org.apache.any23.extractor.yaml.YAMLExtractorFactory] |
| +-------------------------------------- |
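| |
| The names in the first column of this listing have the same form as the values shown for rover's |
| <<<-e>>> option (e.g. <<<rdf-xml,rdf-turtle>>>). As a sketch, the listing can be turned into such a |
| comma-separated value with standard tools. This assumes the <<<name [factory]>>> line layout shown above; |
| the <<<extractors_with_prefix>>> helper is hypothetical, and the sample lines are copied from the listing. |
| |
```shell
#!/bin/sh
# Hypothetical helper: read "name [factory]" lines on stdin and print the
# names that start with the given prefix as one comma-separated line.
extractors_with_prefix() {
    awk -v p="$1" 'index($1, p) == 1 { print $1 }' | paste -s -d , -
}

# Sample lines copied from the listing above; in practice, pipe in the
# output of "any23 extractor --list" instead.
extractors_with_prefix rdf- <<'EOF'
html-rdfa11 [org.apache.any23.extractor.rdfa.RDFa11ExtractorFactory]
rdf-nq [org.apache.any23.extractor.rdf.NQuadsExtractorFactory]
rdf-nt [org.apache.any23.extractor.rdf.NTriplesExtractorFactory]
yaml [org.apache.any23.extractor.yaml.YAMLExtractorFactory]
EOF
```
| |
| The resulting line can then be passed to <<<any23 rover -e>>>. |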
| |
| ** The MicrodataParser tool |
| |
| The <MicrodataParser> tool applies only the MicrodataExtractor |
| to a specific input source and returns the extracted data in the JSON format |
| defined in the {{{http://www.w3.org/TR/microdata/#json}JSON}} section of the Microdata specification. |
| |
| +-------------------------------------- |
| cli$ any23 microdata http://path/to/resource.html |
| +-------------------------------------- |
| |
| |
| ** The VocabPrinter tool |
| |
| The VocabPrinter tool prints out the RDF Schema of all the vocabularies |
| declared by <<Apache Any23>>. |
| |
| Just launch the command below to see all the managed vocabularies. |
| |
| +-------------------------------------- |
| cli$ any23 vocab |
| +-------------------------------------- |
| |
| <NOTE>: <<This tool is still in beta version.>> |
| |
| ** The MimeDetector tool |
| |
| The MimeDetector tool detects the <<MIME Type>> of a given source (<<<http://>>>, <<<file://>>> or <<<inline://>>>). |
| |
| Examples: |
| |
| +-------------------------------------- |
| cli$ any23 mimes http://www.michelemostarda.com/foaf.rdf |
| application/rdf+xml |
| +-------------------------------------- |
| |
| +-------------------------------------- |
| cli$ any23 mimes file://../src/test/resources/application/trix/test1.trx |
| application/trix |
| +-------------------------------------- |
| |
| +-------------------------------------- |
| cli$ any23 mimes 'inline://<http://s> <http://p> <http://o> .' |
| text/n3 |
| +-------------------------------------- |
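| |
| The detected type can drive later steps in a script. The sketch below maps a MIME type to an extractor name |
| taken from the <<<extractor --list>>> output. The mapping itself is illustrative and not taken from the |
| <<Apache Any23>> sources, the <<<extractor_for_mime>>> helper is hypothetical, and the sketch assumes that |
| <<<any23 mimes>>> prints only the MIME type, as in the examples above. |
| |
```shell
#!/bin/sh
# Hypothetical mapping from a detected MIME type to an extractor name.
# The pairs are illustrative; the extractor names come from "extractor --list".
extractor_for_mime() {
    case "$1" in
        application/rdf+xml) echo rdf-xml ;;
        application/trix)    echo rdf-trix ;;
        *)                   echo "" ;;   # unknown type: no explicit extractor
    esac
}

# In practice: mime="$(any23 mimes "$source")". Fixed here for illustration.
mime="application/rdf+xml"
extractor="$(extractor_for_mime "$mime")"
if [ -n "$extractor" ]; then
    echo "rover -e $extractor"
else
    echo "rover"
fi
```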
| |
| ** The PluginVerifier tool |
| |
| The PluginVerifier tool checks the plugins installed in the specified input directory. |
| |
| Launch the command below to sanity-check the input plugins directory: |
| |
| +-------------------------------------- |
| cli$ any23 verify [/path/to/plugins/dir] |
| +-------------------------------------- |
| |
| * <<Apache Any23>> CLI <Plugins> |
| |
| The <<Apache Any23>> ToolRunner CLI (<bin/any23>) automatically detects Tool plugins on the classpath. |
| For further details see the {{{./any23-plugins.html}Plugins}} section. |
| |
| The default <<any23>> CLI plugins are listed below. |
| |
| ** Crawler Plugin |
| |
| {crawler-tool} |
| The <Crawler Plugin> provides basic site crawling and metadata extraction capabilities. |
| |
| +---------------------------------------------------------------------------- |
| cli$ any23 -h |
| [...] |
| crawler Any23 Crawler Command Line Tool. |
| Usage: crawler [options] input IRIs {<url>|<file>}+ |
| Options: |
| -d, --defaultns Override the default namespace used to |
| produce statements. |
| -e, --extractors a comma-separated list of extractors, e.g. |
| rdf-xml,rdf-turtle |
| Default: [] |
| -f, --format the output format |
| Default: turtle |
| -l, --log Produce log within a file. |
| -md, --maxdepth Max allowed crawler depth. |
| Default: 2147483647 |
| -mp, --maxpages Max number of pages before interrupting |
| crawl. |
| Default: 2147483647 |
| -n, --nesting Disable production of nesting triples. |
| Default: false |
| -t, --notrivial Filter trivial statements (e.g. CSS related |
| ones). |
| Default: false |
| -nc, --numcrawlers Sets the number of crawlers. |
| Default: 10 |
| -o, --output Specify Output file (defaults to standard |
| output) |
| Default: java.io.PrintStream@2911a3a4 |
| -pf, --pagefilter Regex used to filter out page URLs during |
| crawling. |
| Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$ |
| -p, --pedantic Validate and fixes HTML content detecting |
| commons issues. |
| Default: false |
| -pd, --politenessdelay Politeness delay in milliseconds. |
| Default: 2147483647 |
| -s, --stats Print out extraction statistics. |
| Default: false |
| -sf, --storagefolder Folder used to store crawler temporary data. |
| Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce |
| +---------------------------------------------------------------------------- |
| |
| A usage example: |
| |
| +---------------------------------------------------------------------------- |
| cli$ any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log |
| +---------------------------------------------------------------------------- |
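| |
| Before starting a crawl, it can be useful to check which URLs the default <<<--pagefilter>>> pattern from |
| the usage output would exclude. The sketch below tests a few example URLs with <<<grep -E>>>. How the crawler |
| itself applies the pattern is not described here, so treat the result as an approximation; the |
| <<<is_filtered_out>>> helper and the example URLs are hypothetical. |
| |
```shell
#!/bin/sh
# Default --pagefilter value copied from the usage output above.
PAGE_FILTER='.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$'

# Hypothetical helper: succeed when the URL matches the filter pattern.
is_filtered_out() {
    printf '%s\n' "$1" | grep -Eq "$PAGE_FILTER"
}

# Report, for each example URL, whether the pattern would exclude it.
for url in http://example.org/index.html http://example.org/style.css http://example.org/photo.jpeg; do
    if is_filtered_out "$url"; then
        echo "skip  $url"
    else
        echo "crawl $url"
    fi
done
```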
| |
| * Use <<Apache Any23>> as a RESTful Web Service |
| |
| <<Apache Any23>> provides a Web Service that can be used to extract <RDF> from Web documents. |
| <<Apache Any23>> services can be accessed through a {{{./service.html}RESTful API}}. |
| |
| Running the server |
| |
| The server command line tool is defined within the <<service>> module. |
| Run the <<<any23server>>> script |
| |
| +-------------------------- |
| service$ ./bin/any23server |
| +-------------------------- |
| |
| from the command line to start up the server, then go to {{{http://localhost:8080/}}} |
| to access the web interface. A live demo of this service runs at {{{http://any23.org/}}}. |
| You can also start the server from Java by running the |
| {{{./apidocs/org/apache/any23/servlet/Servlet.html}Apache Any23 Servlet}} class. Maven can be used to create a WAR |
| file for deployment into an existing servlet container such as {{{http://tomcat.apache.org/}Apache Tomcat}}. |
| |
| * Use <<Apache Any23>> as a Library |
| |
| See our {{{./developers.html}Developers guide}} for more details. |