------
Apache Any23 - Plugins - Basic Crawler
------
The Apache Software Foundation
------
2011-2012
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
Basic Crawler Plugin
The <Basic Crawler Plugin> implements a <CLI> {{{./apidocs/org/apache/any23/cli/Tool.html}Tool}} extending
{{{./apidocs/org/apache/any23/cli/Rover.html}Rover}} to add <site crawling> capabilities.
The tool can be used to extract semantic content from small and medium-sized sites.
To use it, make sure the <basic-crawler> plugin has been correctly configured so that it can be
found by the <any23tools> script (follow the instructions in the {{{./any23-plugins.html}Plugins}} section):
+--------------------------------------------------------------
core/bin/$ ./any23tools Crawler
usage: [{<url>|<file>}]+ [-d <arg>] [-e <arg>] [-f <arg>] [-h] [-l <arg>]
[-maxdepth <arg>] [-maxpages <arg>] [-n] [-numcrawlers <arg>] [-o
<arg>] [-p] [-pagefilter <arg>] [-politenessdelay <arg>] [-s]
[-storagefolder <arg>] [-t] [-v]
-d,--defaultns <arg> Override the default namespace used to produce
statements.
-e <arg> Specify a comma-separated list of extractors,
e.g. rdf-xml,rdf-turtle.
-f,--Output format <arg> [turtle (default), rdfxml, ntriples, nquads,
trix, json, uri]
-h,--help Print this help.
-l,--log <arg> Produce log within a file.
-maxdepth <arg> Max allowed crawler depth. Default: no limit.
-maxpages <arg> Max number of pages before interrupting crawl.
Default: no limit.
-n,--nesting Disable production of nesting triples.
-numcrawlers <arg> Sets the number of crawlers. Default: 10
-o,--output <arg> Specify Output file (defaults to standard
output).
-p,--pedantic Validate and fixes HTML content detecting
commons issues.
-pagefilter <arg> Regex used to filter out page URLs during
crawling. Default:
'.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|
mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|sm
il|pdf|swf|zip|rar|gz|xml|txt))$'
-politenessdelay <arg> Politeness delay in milliseconds. Default: no
limit.
-s,--stats Print out extraction statistics.
-storagefolder <arg> Folder used to store crawler temporary data.
Default:
[/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000g
q/T/]
-t,--notrivial Filter trivial statements (e.g. CSS related
ones).
-v,--verbose Show debug and progress information.
+--------------------------------------------------------------
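As an example, the following invocation (the target URL is hypothetical) crawls a site up to
depth 3 and at most 100 pages, writing the extracted triples in <N-Triples> format to a file;
all flags used here are taken from the usage listing above:

+--------------------------------------------------------------
core/bin/$ ./any23tools Crawler http://www.example.com/ -f ntriples -o /tmp/crawl.nt -maxdepth 3 -maxpages 100 -politenessdelay 500
+--------------------------------------------------------------

The <politenessdelay> of 500 milliseconds throttles requests so the crawl does not overload the target host.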