| ------ |
| Apache Any23 - Plugins - Basic Crawler |
| ------ |
| The Apache Software Foundation |
| ------ |
| 2011-2012 |
| |
| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| |
| Basic Crawler Plugin |
| |
| The <Basic Crawler Plugin> implements a <CLI> {{{./apidocs/org/apache/any23/cli/Tool.html}Tool}} extending |
| {{{./apidocs/org/apache/any23/cli/Rover.html}Rover}} to add <site crawling> capabilities. |
| |
| The tool can be used to extract semantic content from small and medium-sized sites. |
| |
| To use it, make sure the basic-crawler plugin is configured so that the <any23tools> script can find it |
| (follow the instructions in the {{{./any23-plugins.html}Plugins}} section): |
| |
| +-------------------------------------------------------------- |
| core/bin/$ ./any23tools Crawler |
| usage: [{<url>|<file>}]+ [-d <arg>] [-e <arg>] [-f <arg>] [-h] [-l <arg>] |
| [-maxdepth <arg>] [-maxpages <arg>] [-n] [-numcrawlers <arg>] [-o |
| <arg>] [-p] [-pagefilter <arg>] [-politenessdelay <arg>] [-s] |
| [-storagefolder <arg>] [-t] [-v] |
| -d,--defaultns <arg> Override the default namespace used to produce |
| statements. |
| -e <arg> Specify a comma-separated list of extractors, |
| e.g. rdf-xml,rdf-turtle. |
| -f,--Output format <arg> [turtle (default), rdfxml, ntriples, nquads, |
| trix, json, uri] |
| -h,--help Print this help. |
| -l,--log <arg> Produce log within a file. |
| -maxdepth <arg> Max allowed crawler depth. Default: no limit. |
| -maxpages <arg> Max number of pages before interrupting crawl. |
| Default: no limit. |
| -n,--nesting Disable production of nesting triples. |
| -numcrawlers <arg> Sets the number of crawlers. Default: 10 |
| -o,--output <arg> Specify Output file (defaults to standard |
| output). |
| -p,--pedantic Validate and fixes HTML content detecting |
| commons issues. |
| -pagefilter <arg> Regex used to filter out page URLs during |
| crawling. Default: |
| '.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2| |
| mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|sm |
| il|pdf|swf|zip|rar|gz|xml|txt))$' |
| -politenessdelay <arg> Politeness delay in milliseconds. Default: no |
| limit. |
| -s,--stats Print out extraction statistics. |
| -storagefolder <arg> Folder used to store crawler temporary data. |
| Default: |
| [/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000g |
| q/T/] |
| -t,--notrivial Filter trivial statements (e.g. CSS related |
| ones). |
| -v,--verbose Show debug and progress information. |
| +-------------------------------------------------------------- |
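| |
| The default -pagefilter value shown in the help output is a regular expression over page URLs. |
| The following Java sketch illustrates which URLs such an expression would exclude. It is not the |
| plugin's own code: the class name PageFilterSketch and the method isFilteredOut are hypothetical, |
| and both the lowercasing of the URL and the whole-string match are assumptions made for illustration. |
| The pattern is the default from the help text, joined back onto one line. |

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23: shows what the default
// -pagefilter expression from the help output would match.
public class PageFilterSketch {

    // Default -pagefilter value, with the terminal line wraps removed.
    static final Pattern DEFAULT_FILTER = Pattern.compile(
        ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg"
        + "|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$");

    // Assumption: a URL is skipped when the whole lowercased URL matches the filter.
    static boolean isFilteredOut(String url) {
        return DEFAULT_FILTER.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.org/index.html",
            "http://example.org/style.css",
            "http://example.org/photo.JPEG",
            "http://example.org/feed.xml?page=1"
        };
        for (String url : urls) {
            System.out.println(url + " filtered out: " + isFilteredOut(url));
        }
    }
}
```

| Because the expression is anchored at the end of the URL, a query string after the file extension |
| prevents a match in this sketch. A custom -pagefilter value may need to account for that. |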