| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta content="Apache Forrest" name="Generator"> |
| <meta name="Forrest-version" content="0.7"> |
| <meta name="Forrest-skin-name" content="pelt"> |
| <title>Nutch robot</title> |
| <link type="text/css" href="skin/basic.css" rel="stylesheet"> |
| <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet"> |
| <link media="print" type="text/css" href="skin/print.css" rel="stylesheet"> |
| <link type="text/css" href="skin/profile.css" rel="stylesheet"> |
| <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script> |
| <link rel="shortcut icon" href="images/favicon.ico"> |
| </head> |
| <body onload="init()"> |
| <script type="text/javascript">ndeSetTextSize();</script> |
| <div id="top"> |
| <div class="breadtrail"> |
| <a href="http://www.apache.org/">Apache</a> > <a href="http://lucene.apache.org/">Lucene</a> > <a href="http://lucene.apache.org/nutch/">Nutch</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script> |
| </div> |
| <div class="header"> |
| <div class="grouplogo"> |
| <a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="images/lucene_green_150.gif" title="Apache Lucene"></a> |
| </div> |
| <div class="projectlogo"> |
| <a href="http://lucene.apache.org/nutch/"><img class="logoImage" alt="Nutch" src="images/nutch-logo.gif" title="Open Source Web Search Software"></a> |
| </div> |
| <div class="searchbox"> |
| <form action="http://www.google.com/search" method="get" class="roundtopsmall"> |
| <input value="lucene.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google"> |
| <input attr="value" name="Search" value="Search" type="submit"> |
| </form> |
| </div> |
| <ul id="tabs"> |
| <li class="current"> |
| <a class="base-selected" href="index.html">Main</a> |
| </li> |
| <li> |
| <a class="base-not-selected" href="http://wiki.apache.org/nutch/">Wiki</a> |
| </li> |
| <li> |
| <a class="base-not-selected" href="http://issues.apache.org/jira/browse/Nutch">Jira</a> |
| </li> |
| </ul> |
| </div> |
| </div> |
| <div id="main"> |
| <div id="publishedStrip"> |
| <div id="level2tabs"></div> |
| <script type="text/javascript"><!-- |
| document.write("<text>Last Published:</text> " + document.lastModified); |
| // --></script> |
| </div> |
| <div class="breadtrail"> |
| |
| |
| </div> |
| <div id="menu"> |
| <div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Project</div> |
| <div id="menu_1.1" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="index.html">News</a> |
| </div> |
| <div class="menuitem"> |
| <a href="about.html">About</a> |
| </div> |
| <div class="menuitem"> |
| <a href="credits.html">Credits</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://www.cafepress.com/nutch/">Buy Stuff</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div> |
| <div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;"> |
| <div class="menuitem"> |
| <a href="http://wiki.apache.org/nutch/FAQ">FAQ</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://wiki.apache.org/nutch/">Wiki</a> |
| </div> |
| <div class="menuitem"> |
| <a href="tutorial.html">Tutorial (0.7.2)</a> |
| </div> |
| <div class="menuitem"> |
| <a href="tutorial8.html">Tutorial (0.8.x)</a> |
| </div> |
| <div class="menupage"> |
| <div class="menupagetitle">Robot </div> |
| </div> |
| <div class="menuitem"> |
| <a href="i18n.html">i18n</a> |
| </div> |
| <div class="menuitem"> |
| <a href="apidocs/index.html">API Docs (0.7.2)</a> |
| </div> |
| <div class="menuitem"> |
| <a href="apidocs-0.8.x/index.html">API Docs (0.8.x)</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://lucene.apache.org/nutch/nutch-nightly/docs/api/index.html">API Docs (nightly)</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div> |
| <div id="menu_1.3" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="release/">Download</a> |
| </div> |
| <div class="menuitem"> |
| <a href="nightly.html">Nightly builds</a> |
| </div> |
| <div class="menuitem"> |
| <a href="mailing_lists.html">Mailing Lists</a> |
| </div> |
| <div class="menuitem"> |
| <a href="issue_tracking.html">Issue Tracking</a> |
| </div> |
| <div class="menuitem"> |
| <a href="version_control.html">Version Control</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div> |
| <div id="menu_1.4" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="http://lucene.apache.org/java/">Lucene Java</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://lucene.apache.org/hadoop/">Hadoop</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://incubator.apache.org/solr/">Solr</a> |
| </div> |
| </div> |
| <div id="credit"></div> |
| <div id="roundbottom"> |
| <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div> |
| <div id="credit2"></div> |
| </div> |
| <div id="content"> |
| <div title="Portable Document Format" class="pdflink"> |
| <a class="dida" href="bot.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br> |
| PDF</a> |
| </div> |
| <h1>Nutch robot</h1> |
| <div id="minitoc-area"> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Sysadmins%2Frobots.txt">Sysadmins/robots.txt</a> |
| </li> |
| <li> |
| <a href="#Webmasters%2FRobots+META">Webmasters/Robots META</a> |
| </li> |
| <li> |
| <a href="#Contact+us">Contact us</a> |
| </li> |
| </ul> |
| </div> |
| |
| |
| <p> If you're reading this, chances are you were looking through your |
| server logs and noticed a Nutch-based robot visiting your site. Our |
| software obeys robots.txt files and robots META tags in HTML. These |
| are the standard mechanisms webmasters use to tell web robots which |
| portions of a site they are welcome to access. </p> |
| |
| |
| <a name="N1000F"></a><a name="Sysadmins%2Frobots.txt"></a> |
| <h2 class="h3">Sysadmins/robots.txt</h2> |
| <div class="section"> |
| <p>We're a software project, not a service, so please understand that |
| a misbehaving crawler that identifies itself with our agent string is |
| not run by us. Anyone may run our software. However, we'd still like to |
| hear about any bad behavior. If possible, please include the name of |
| the domain and some representative log entries. We can be reached at |
| <a href="mailto:nutch-agent@lucene.apache.org">nutch-agent@lucene.apache.org</a>.</p> |
| <p> Our software obeys the robots.txt exclusion standard, described at |
| <a href="http://www.robotstxt.org/wc/exclusion.html#robotstxt">http://www.robotstxt.org/wc/exclusion.html#robotstxt</a>. |
| Different installations of the Nutch software may specify different |
| agent names, but all should respond to the agent name "Nutch". Thus, |
| to ban all Nutch-based crawlers from your site, place the following in |
| your robots.txt file:</p> |
| <blockquote> |
| |
| <pre>User-agent: Nutch<br>Disallow: /</pre> |
| |
| </blockquote> |
| </div> |
| |
| |
| |
| <a name="N1002B"></a><a name="Webmasters%2FRobots+META"></a> |
| <h2 class="h3">Webmasters/Robots META</h2> |
| <div class="section"> |
| <p>If you do not have permission to edit the |
| /robots.txt file on your server, you can still tell robots not to |
| index your pages or follow your links. The standard mechanism for |
| this is the robots META tag, as described at |
| <a href="http://www.robotstxt.org/wc/meta-user.html">http://www.robotstxt.org/wc/meta-user.html</a>. </p> |
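| <p>As a minimal sketch of that mechanism, and not a snippet taken from the Nutch documentation, a page that should be neither indexed nor have its links followed could carry the following tag inside its HEAD element. The values "noindex" and "nofollow" are the ones defined by the robots META convention:</p> |
| <blockquote> |
| |
| <pre>&lt;meta name="robots" content="noindex, nofollow"&gt;</pre> |
| |
| </blockquote> |
| <p>Under the same convention, a content value of "noindex, follow" would keep the page out of the index while still allowing robots to follow its links.</p> |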
| </div> |
| |
| |
| |
| <a name="N10038"></a><a name="Contact+us"></a> |
| <h2 class="h3">Contact us</h2> |
| <div class="section"> |
| <p>If you have problems with or questions about the Nutch crawler on |
| your site, please send an email to the |
| <a href="mailto:nutch-agent@lucene.apache.org">Nutch agent mailing list</a>.</p> |
| </div> |
| |
| |
| |
| </div> |
| <div class="clearboth"> </div> |
| </div> |
| <div id="footer"> |
| <div class="lastmodified"> |
| <script type="text/javascript"><!-- |
| document.write("<text>Last Published:</text> " + document.lastModified); |
| // --></script> |
| </div> |
| <div class="copyright"> |
| Copyright © |
| 2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a> |
| </div> |
| </div> |
| </body> |
| </html> |