<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.7">
<meta name="Forrest-skin-name" content="pelt">
<title>Nutch version 0.8 tutorial</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://lucene.apache.org/">Lucene</a> &gt; <a href="http://lucene.apache.org/nutch/">Nutch</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<div class="header">
<div class="grouplogo">
<a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="http://lucene.apache.org/java/docs/images/lucene_green_150.gif" title="Apache Lucene"></a>
</div>
<div class="projectlogo">
<a href="http://lucene.apache.org/nutch/"><img class="logoImage" alt="Nutch" src="images/nutch-logo.gif" title="Open Source Web Search Software"></a>
</div>
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="lucene.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
<input attr="value" name="Search" value="Search" type="submit">
</form>
</div>
<ul id="tabs">
<li class="current">
<a class="base-selected" href="index.html">Main</a>
</li>
<li>
<a class="base-not-selected" href="http://wiki.apache.org/nutch/">Wiki</a>
</li>
</ul>
</div>
</div>
<div id="main">
<div id="publishedStrip">
<div id="level2tabs"></div>
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="breadtrail">
&nbsp;
</div>
<div id="menu">
<div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Project</div>
<div id="menu_1.1" class="menuitemgroup">
<div class="menuitem">
<a href="index.html">News</a>
</div>
<div class="menuitem">
<a href="about.html">About</a>
</div>
<div class="menuitem">
<a href="credits.html">Credits</a>
</div>
<div class="menuitem">
<a href="http://www.cafepress.com/nutch/">Buy Stuff</a>
</div>
</div>
<div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
<div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;">
<div class="menuitem">
<a href="http://wiki.apache.org/nutch/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/nutch/">Wiki</a>
</div>
<div class="menuitem">
<a href="tutorial.html">Tutorial ver. 0.7.2</a>
</div>
<div class="menupage">
<div class="menupagetitle">Tutorial ver. 0.8</div>
</div>
<div class="menuitem">
<a href="bot.html">Robot </a>
</div>
<div class="menuitem">
<a href="i18n.html">i18n</a>
</div>
<div class="menuitem">
<a href="apidocs/index.html">API Docs ver. 0.7.2</a>
</div>
<div class="menuitem">
<a href="http://lucene.apache.org/nutch-nightly/docs/api/index.html">API Docs ver. 0.8</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div>
<div id="menu_1.3" class="menuitemgroup">
<div class="menuitem">
<a href="release/">Download</a>
</div>
<div class="menuitem">
<a href="nightly.html">Nightly builds</a>
</div>
<div class="menuitem">
<a href="mailing_lists.html">Mailing Lists</a>
</div>
<div class="menuitem">
<a href="issue_tracking.html">Issue Tracking</a>
</div>
<div class="menuitem">
<a href="version_control.html">Version Control</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div>
<div id="menu_1.4" class="menuitemgroup">
<div class="menuitem">
<a href="http://lucene.apache.org/java/">Lucene Java</a>
</div>
<div class="menuitem">
<a href="http://lucene.apache.org/hadoop/">Hadoop</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<div id="credit2"></div>
</div>
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="tutorial8.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
PDF</a>
</div>
<h1>Nutch version 0.8 tutorial</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#Requirements">Requirements</a>
</li>
<li>
<a href="#Getting+Started">Getting Started</a>
</li>
<li>
<a href="#Intranet+Crawling">Intranet Crawling</a>
<ul class="minitoc">
<li>
<a href="#Intranet%3A+Configuration">Intranet: Configuration</a>
</li>
<li>
<a href="#Intranet%3A+Running+the+Crawl">Intranet: Running the Crawl</a>
</li>
</ul>
</li>
<li>
<a href="#Whole-web+Crawling">Whole-web Crawling</a>
<ul class="minitoc">
<li>
<a href="#Whole-web%3A+Concepts">Whole-web: Concepts</a>
</li>
<li>
<a href="#Whole-web%3A+Boostrapping+the+Web+Database">Whole-web: Boostrapping the Web Database</a>
</li>
<li>
<a href="#Whole-web%3A+Fetching">Whole-web: Fetching</a>
</li>
<li>
<a href="#Whole-web%3A+Indexing">Whole-web: Indexing</a>
</li>
<li>
<a href="#Searching">Searching</a>
</li>
</ul>
</li>
</ul>
</div>
<a name="N1000C"></a><a name="Requirements"></a>
<h2 class="h3">Requirements</h2>
<div class="section">
<ol>
<li>Java 1.4.x, either from <a href="http://java.sun.com/j2se/downloads.html">Sun</a> or <a href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a>;
Java on Linux is the preferred platform. Set <span class="codefrag">NUTCH_JAVA_HOME</span> to the root
of your JVM installation (see the sketch after this list).
</li>
<li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a>
4.x.</li>
<li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for
shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)</li>
<li>Up to a gigabyte of free disk space, a high-speed connection, and
an hour or so.
</li>
</ol>
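<p>For example, with a Sun JVM unpacked under
<span class="codefrag">/usr/java/j2sdk1.4.2</span> (an illustrative path; use your
own installation root), the environment could be set like this:</p>
<pre class="code"># hypothetical install path - adjust for your system
export NUTCH_JAVA_HOME=/usr/java/j2sdk1.4.2
export PATH=$NUTCH_JAVA_HOME/bin:$PATH
</pre>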
</div>
<a name="N10035"></a><a name="Getting+Started"></a>
<h2 class="h3">Getting Started</h2>
<div class="section">
<p>First, you need to get a copy of the Nutch code. You can download
a release from <a href="http://lucene.apache.org/nutch/release/">http://lucene.apache.org/nutch/release/</a>.
Unpack the release and change into its top-level directory. Or, check
out the latest source code from <a href="version_control.html">subversion</a> and build it
with <a href="http://ant.apache.org/">Ant</a>.</p>
<p>Try the following command:</p>
<pre class="code">bin/nutch</pre>
<p>This will display the documentation for the Nutch command script.</p>
<p>Now we're ready to crawl. There are two approaches to crawling:</p>
<ol>
<li>Intranet crawling, with the <span class="codefrag">crawl</span> command.</li>
<li>Whole-web crawling, with much greater control, using the lower
level <span class="codefrag">inject</span>, <span class="codefrag">generate</span>, <span class="codefrag">fetch</span>
and <span class="codefrag">updatedb</span> commands.</li>
</ol>
</div>
<a name="N10070"></a><a name="Intranet+Crawling"></a>
<h2 class="h3">Intranet Crawling</h2>
<div class="section">
<p>Intranet crawling is more appropriate when you intend to crawl up to
around one million pages on a handful of web servers.</p>
<a name="N10079"></a><a name="Intranet%3A+Configuration"></a>
<h3 class="h4">Intranet: Configuration</h3>
<p>To configure things for intranet crawling you must:</p>
<ol>
<li>Create a directory with a flat file of root urls (a shell sketch of
this step appears after this list). For example, to
crawl the <span class="codefrag">nutch</span> site you might start with a file named
<span class="codefrag">urls/nutch</span> containing the url of just the Nutch home
page. All other Nutch pages should be reachable from this page. The
<span class="codefrag">urls/nutch</span> file would thus contain:
<pre class="code">
http://lucene.apache.org/nutch/
</pre>
</li>
<li>Edit the file <span class="codefrag">conf/crawl-urlfilter.txt</span> and replace
<span class="codefrag">MY.DOMAIN.NAME</span> with the name of the domain you wish to
crawl. For example, if you wished to limit the crawl to the
<span class="codefrag">apache.org</span> domain, the line should read:
<pre class="code">
+^http://([a-z0-9]*\.)*apache\.org/
</pre>
This will include any url in the domain <span class="codefrag">apache.org</span>.
</li>
<li>Edit the file <span class="codefrag">conf/nutch-site.xml</span>, insert at minimum
following properties into it and edit in proper values for the properties:
<pre class="code">
&lt;property&gt;
&lt;name&gt;http.agent.name&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.description&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;Further description of our bot - this text is used in
the User-Agent header. It appears in parentheses after the agent name.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.url&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;A URL to advertise in the User-Agent header. This will
appear in parentheses after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.email&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to deter spam harvesters.
&lt;/description&gt;
&lt;/property&gt;
</pre>
</li>
</ol>
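<p>As a sketch, creating the seed directory and file from the first step
might look like this in the shell:</p>
<pre class="code">mkdir urls
echo 'http://lucene.apache.org/nutch/' &gt; urls/nutch
</pre>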
<a name="N100B3"></a><a name="Intranet%3A+Running+the+Crawl"></a>
<h3 class="h4">Intranet: Running the Crawl</h3>
<p>Once things are configured, running the crawl is easy. Just use the
crawl command. Its options include:</p>
<ul>
<li>
<span class="codefrag">-dir</span> <em>dir</em> names the directory to put the crawl in.</li>
<li>
<span class="codefrag">-threads</span> <em>threads</em> determines the number of
threads that will fetch in parallel.</li>
<li>
<span class="codefrag">-depth</span> <em>depth</em> indicates the link depth from the root
page that should be crawled.</li>
<li>
<span class="codefrag">-topN</span> <em>N</em> determines the maximum number of pages that
will be retrieved at each level up to the depth.</li>
</ul>
<p>For example, a typical call might be:</p>
<pre class="code">
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
</pre>
<p>Typically one starts testing one's configuration by crawling at
shallow depths, sharply limiting the number of pages fetched at each
level (<span class="codefrag">-topN</span>), and watching the output to check that
desired pages are fetched and undesirable pages are not. Once one is
confident of the configuration, an appropriate depth for a full
crawl is around 10. The number of pages per level
(<span class="codefrag">-topN</span>) for a full crawl can be from tens of thousands to
millions, depending on your resources.</p>
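<p>For instance, a cautious first test run (the directory name and limits
here are arbitrary) might be:</p>
<pre class="code">bin/nutch crawl urls -dir crawl.test -depth 2 -topN 10
</pre>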
<p>Once crawling has completed, one can skip to the Searching section
below.</p>
</div>
<a name="N100F4"></a><a name="Whole-web+Crawling"></a>
<h2 class="h3">Whole-web Crawling</h2>
<div class="section">
<p>Whole-web crawling is designed to handle very large crawls which may
take weeks to complete, running on multiple machines.</p>
<a name="N100FD"></a><a name="Whole-web%3A+Concepts"></a>
<h3 class="h4">Whole-web: Concepts</h3>
<p>Nutch data is composed of:</p>
<ol>
<li>The crawl database, or <em>crawldb</em>. This contains
information about every url known to Nutch, including whether it was
fetched, and, if so, when.</li>
<li>The link database, or <em>linkdb</em>. This contains the list
of known links to each url, including both the source url and anchor
text of the link.</li>
<li>A set of <em>segments</em>. Each segment is a set of urls that are
fetched as a unit. Segments are directories with the following
subdirectories:
<ul>
<li><em>crawl_generate</em> names a set of urls to be fetched</li>
<li><em>crawl_fetch</em> contains the status of fetching each url</li>
<li><em>content</em> contains the content of each url</li>
<li><em>parse_text</em> contains the parsed text of each url</li>
<li><em>parse_data</em> contains outlinks and metadata parsed
from each url</li>
<li><em>crawl_parse</em> contains the outlink urls, used to
update the crawldb</li>
</ul>
</li>
<li>The <em>indexes</em> are Lucene-format indexes.</li>
</ol>
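<p>Put together, a crawl laid out by the commands in this tutorial looks
roughly as follows (the segment name is an invented example; segments are
named by creation time, as described under Fetching below):</p>
<pre class="code">crawl/crawldb/                 # the crawl database
crawl/linkdb/                  # the link database
crawl/segments/20060301120000/ # one segment per fetch round
    crawl_generate/  crawl_fetch/  content/
    parse_text/  parse_data/  crawl_parse/
crawl/indexes/                 # Lucene-format indexes
</pre>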
<a name="N1014A"></a><a name="Whole-web%3A+Boostrapping+the+Web+Database"></a>
<h3 class="h4">Whole-web: Boostrapping the Web Database</h3>
<p>The <em>injector</em> adds urls to the crawldb. Let's inject URLs
from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we
must download and uncompress the file listing all of the DMOZ pages.
(This is a 200+ MB file, so the download will take a few minutes.)</p>
<pre class="code">wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz</pre>
<p>Next we select a random subset of these pages.
(We use a random subset so that everyone who runs this tutorial
doesn't hammer the same sites.) DMOZ contains around three million
URLs. We select one out of every 5000, so that we end up with
around 600 URLs:</p>
<pre class="code">mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 &gt; dmoz/urls</pre>
<p>The parser also takes a few minutes, as it must parse the full
file. Finally, we initialize the crawl db with the selected urls.</p>
<pre class="code">bin/nutch inject crawl/crawldb dmoz</pre>
<p>Now we have a web database with around 600 as-yet unfetched URLs in it.</p>
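<p>To verify the injection, the <span class="codefrag">readdb</span> command can
print statistics about the crawldb, including the number of urls it
contains; a sketch (check <span class="codefrag">bin/nutch</span> for the exact
options in your version):</p>
<pre class="code">bin/nutch readdb crawl/crawldb -stats
</pre>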
<a name="N10170"></a><a name="Whole-web%3A+Fetching"></a>
<h3 class="h4">Whole-web: Fetching</h3>
<p>
Starting with release 0.8, the Nutch user agent identifier must be
configured before fetching. If you have not already done so, edit the
file <span class="codefrag">conf/nutch-site.xml</span> and set, at minimum, the
<span class="codefrag">http.agent.name</span> property and its related
<span class="codefrag">http.agent.*</span> properties, exactly as described in the
Intranet: Configuration section above.
</p>
<pre class="code">
&lt;property&gt;
&lt;name&gt;http.agent.name&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.description&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.url&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.email&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
&lt;/description&gt;
&lt;/property&gt;
</pre>
<p>To fetch, we first generate a fetchlist from the database:</p>
<pre class="code">bin/nutch generate crawl/crawldb crawl/segments
</pre>
<p>This generates a fetchlist for all of the pages due to be fetched.
The fetchlist is placed in a newly created segment directory.
The segment directory is named by the time it's created. We
save the name of this segment in the shell variable <span class="codefrag">s1</span>:</p>
<pre class="code">s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
</pre>
<p>Now we run the fetcher on this segment with:</p>
<pre class="code">bin/nutch fetch $s1</pre>
<p>When this is complete, we update the database with the results of the
fetch:</p>
<pre class="code">bin/nutch updatedb crawl/crawldb $s1</pre>
<p>Now the database has entries for all of the pages referenced by the
initial set.</p>
<p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
<pre class="code">bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2
</pre>
<p>Let's fetch one more round:</p>
<pre class="code">
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3
</pre>
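<p>Since every round is identical, the generate/fetch/update cycle can also
be written as a shell loop; a sketch equivalent to the last two rounds
above:</p>
<pre class="code"># each iteration generates, fetches and merges one more top-1000 segment
for round in 1 2; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done
</pre>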
<p>By this point we've fetched a few thousand pages. Let's index
them!</p>
<a name="N101B4"></a><a name="Whole-web%3A+Indexing"></a>
<h3 class="h4">Whole-web: Indexing</h3>
<p>Before indexing we first invert all of the links, so that we may
index incoming anchor text with the pages.</p>
<pre class="code">bin/nutch invertlinks crawl/linkdb crawl/segments</pre>
<p>To index the segments we use the <span class="codefrag">index</span> command, as follows:</p>
<pre class="code">bin/nutch index indexes crawl/linkdb crawl/segments/*</pre>
<p>Now we're ready to search!</p>
<a name="N101D5"></a><a name="Searching"></a>
<h3 class="h4">Searching</h3>
<p>To search, you need to put the Nutch war file into your servlet
container. (If you checked the sources out of SVN instead of downloading
a Nutch release, you'll first need to build the war file, with
the command <span class="codefrag">ant war</span>.)</p>
<p>Assuming you've unpacked Tomcat at ~/local/tomcat, the Nutch war
file can be installed with the commands:</p>
<pre class="code">rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war
</pre>
<p>The webapp finds its indexes in <span class="codefrag">./crawl</span>, relative
to the directory from which you start Tomcat, so start Tomcat from the
directory that contains your crawl:</p>
<pre class="code">~/local/tomcat/bin/catalina.sh start
</pre>
<p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a>
and have fun!</p>
<p>More detailed tutorials are available on the Nutch Wiki.
</p>
</div>
</div>
<div class="clearboth">&nbsp;</div>
</div>
<div id="footer">
<div class="lastmodified">
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="copyright">
Copyright &copy;
2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
</div>
</div>
</body>
</html>