------
Apache Any23 - Plugins - Basic Crawler
------
The Apache Software Foundation
------
2011-2012
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
Basic Crawler Plugin
The <Basic Crawler Plugin> implements a <CLI> {{{./apidocs/org/apache/any23/cli/Tool.html}Tool}} extending
{{{./apidocs/org/apache/any23/cli/Rover.html}Rover}} to add <site crawling> capabilities.
The tool can be used to extract semantic content from small and medium-sized sites.
To use it, make sure the <basic-crawler> plugin has been correctly configured so that it can be
found by the <any23tools> script (follow the instructions in the {{{./any23-plugins.html}Plugins}} section):
+--------------------------------------------------------------
core/bin/$ ./any23tools Crawler
usage: [{<url>|<file>}]+ [-d <arg>] [-e <arg>] [-f <arg>] [-h] [-l <arg>]
[-maxdepth <arg>] [-maxpages <arg>] [-n] [-numcrawlers <arg>] [-o
<arg>] [-p] [-pagefilter <arg>] [-politenessdelay <arg>] [-s]
[-storagefolder <arg>] [-t] [-v]
-d,--defaultns <arg> Override the default namespace used to produce
statements.
-e <arg> Specify a comma-separated list of extractors,
e.g. rdf-xml,rdf-turtle.
-f,--Output format <arg> [turtle (default), rdfxml, ntriples, nquads,
trix, json, uri]
-h,--help Print this help.
-l,--log <arg> Produce log within a file.
-maxdepth <arg> Max allowed crawler depth. Default: no limit.
-maxpages <arg> Max number of pages before interrupting crawl.
Default: no limit.
-n,--nesting Disable production of nesting triples.
-numcrawlers <arg> Sets the number of crawlers. Default: 10
-o,--output <arg> Specify Output file (defaults to standard
output).
-p,--pedantic Validate and fixes HTML content detecting
commons issues.
-pagefilter <arg> Regex used to filter out page URLs during
crawling. Default:
'.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|
mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|sm
il|pdf|swf|zip|rar|gz|xml|txt))$'
-politenessdelay <arg> Politeness delay in milliseconds. Default: no
limit.
-s,--stats Print out extraction statistics.
-storagefolder <arg> Folder used to store crawler temporary data.
Default:
[/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000g
q/T/]
-t,--notrivial Filter trivial statements (e.g. CSS related
ones).
-v,--verbose Show debug and progress information.
+--------------------------------------------------------------
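As an example, the following invocation (the target URL is hypothetical) crawls a site up to
depth 3 and at most 100 pages, writing the extracted triples in <N-Triples> format to a file;
all flags used here are taken from the usage listing above:

+--------------------------------------------------------------
core/bin/$ ./any23tools Crawler http://www.example.com/ -f ntriples -o /tmp/crawl.nt -maxdepth 3 -maxpages 100 -politenessdelay 500
+--------------------------------------------------------------

The <politenessdelay> of 500 milliseconds throttles requests so the crawl does not overload the target host.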