<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.7">
<meta name="Forrest-skin-name" content="pelt">
<title>Nutch robot</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://lucene.apache.org/">Lucene</a> &gt; <a href="http://lucene.apache.org/nutch/">Nutch</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<div class="header">
<div class="grouplogo">
<a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="http://lucene.apache.org/java/docs/images/lucene_green_150.gif" title="Apache Lucene"></a>
</div>
<div class="projectlogo">
<a href="http://lucene.apache.org/nutch/"><img class="logoImage" alt="Nutch" src="images/nutch-logo.gif" title="Open Source Web Search Software"></a>
</div>
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="lucene.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
<input attr="value" name="Search" value="Search" type="submit">
</form>
</div>
<ul id="tabs">
<li class="current">
<a class="base-selected" href="index.html">Main</a>
</li>
<li>
<a class="base-not-selected" href="http://wiki.apache.org/nutch/">Wiki</a>
</li>
</ul>
</div>
</div>
<div id="main">
<div id="publishedStrip">
<div id="level2tabs"></div>
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="breadtrail">
&nbsp;
</div>
<div id="menu">
<div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Project</div>
<div id="menu_1.1" class="menuitemgroup">
<div class="menuitem">
<a href="index.html">News</a>
</div>
<div class="menuitem">
<a href="about.html">About</a>
</div>
<div class="menuitem">
<a href="credits.html">Credits</a>
</div>
<div class="menuitem">
<a href="http://www.cafepress.com/nutch/">Buy Stuff</a>
</div>
</div>
<div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
<div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;">
<div class="menuitem">
<a href="http://wiki.apache.org/nutch/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/nutch/">Wiki</a>
</div>
<div class="menuitem">
<a href="tutorial.html">Tutorial ver. 0.7.2</a>
</div>
<div class="menuitem">
<a href="tutorial8.html">Tutorial ver. 0.8</a>
</div>
<div class="menupage">
<div class="menupagetitle">Robot </div>
</div>
<div class="menuitem">
<a href="i18n.html">i18n</a>
</div>
<div class="menuitem">
<a href="apidocs/index.html">API Docs ver. 0.7.2</a>
</div>
<div class="menuitem">
<a href="nutch-nightly/docs/api/index.html">API Docs ver. 0.8</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div>
<div id="menu_1.3" class="menuitemgroup">
<div class="menuitem">
<a href="release/">Download</a>
</div>
<div class="menuitem">
<a href="nightly.html">Nightly builds</a>
</div>
<div class="menuitem">
<a href="mailing_lists.html">Mailing Lists</a>
</div>
<div class="menuitem">
<a href="issue_tracking.html">Issue Tracking</a>
</div>
<div class="menuitem">
<a href="version_control.html">Version Control</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div>
<div id="menu_1.4" class="menuitemgroup">
<div class="menuitem">
<a href="http://lucene.apache.org/java/">Lucene Java</a>
</div>
<div class="menuitem">
<a href="http://lucene.apache.org/hadoop/">Hadoop</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<div id="credit2"></div>
</div>
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="bot.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
PDF</a>
</div>
<h1>Nutch robot</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#Sysadmins%2Frobots.txt">Sysadmins/robots.txt</a>
</li>
<li>
<a href="#Webmasters%2FRobots+META">Webmasters/Robots META</a>
</li>
<li>
<a href="#Contact+us">Contact us</a>
</li>
</ul>
</div>
<p>If you're reading this, chances are you've seen a Nutch-based
robot in your server logs after it visited your site. Our
software obeys robots.txt files and robots META tags in HTML. These
are the standard mechanisms by which webmasters tell web robots which
portions of a site they are welcome to access.</p>
<a name="N1000F"></a><a name="Sysadmins%2Frobots.txt"></a>
<h2 class="h3">Sysadmins/robots.txt</h2>
<div class="section">
<p>We're a software project, not a service, so please understand that
a misbehaving crawler that identifies itself with our agent string is
not run by us. Our software may be run by anyone. However, we'd still
like to hear about any bad behavior. If possible, please include the
name of the domain and some representative log entries. We can be reached at
<a href="mailto:nutch-agent@lucene.apache.org">nutch-agent@lucene.apache.org</a>.</p>
<p>Our software obeys the robots.txt exclusion standard, described at
<a href="http://www.robotstxt.org/wc/exclusion.html#robotstxt">http://www.robotstxt.org/wc/exclusion.html#robotstxt</a>. Different
installations of the Nutch software may specify different agent names,
but all should respond to the agent name "Nutch". Thus, to ban all
Nutch-based crawlers from your site, place the following in your
robots.txt file:</p>
<blockquote>
<pre>User-agent: Nutch<br>Disallow: /</pre>
</blockquote>
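<p>The same record can exclude only part of a site. As a minimal sketch,
assuming a hypothetical <code>/private/</code> directory (the path is an
illustration, not something Nutch requires), the following would keep
Nutch-based crawlers out of that directory while leaving the rest of the
site open to them:</p>
<blockquote>
<pre>User-agent: Nutch<br>Disallow: /private/</pre>
</blockquote>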
</div>
<a name="N1002B"></a><a name="Webmasters%2FRobots+META"></a>
<h2 class="h3">Webmasters/Robots META</h2>
<div class="section">
<p>If you do not have permission to edit the
/robots.txt file on your server, you can still tell robots not to
index your pages or follow your links. The standard mechanism for
this is the robots META tag, as described at
<a href="http://www.robotstxt.org/wc/meta-user.html">http://www.robotstxt.org/wc/meta-user.html</a>.</p>
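<p>As a minimal illustration, assuming the conventions described at the
page linked above, a page that should be neither indexed nor have its
links followed could carry a tag like this inside its
<code>&lt;head&gt;</code> element:</p>
<blockquote>
<pre>&lt;meta name="robots" content="noindex, nofollow"&gt;</pre>
</blockquote>
<p>Using only <code>noindex</code> or only <code>nofollow</code> as the
content value requests just one of the two restrictions.</p>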
</div>
<a name="N10038"></a><a name="Contact+us"></a>
<h2 class="h3">Contact us</h2>
<div class="section">
<p>If you have problems with, or questions about, the Nutch crawler, please
send an email to the <a href="mailto:nutch-agent@lucene.apache.org">Nutch agent
mailing list</a>.</p>
</div>
</div>
<div class="clearboth">&nbsp;</div>
</div>
<div id="footer">
<div class="lastmodified">
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="copyright">
Copyright &copy;
2005 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
</div>
</div>
</body>
</html>