 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
 <html>
 <head>
 <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
 <meta content="Apache Forrest" name="Generator">
 <meta name="Forrest-version" content="0.8">
 <meta name="Forrest-skin-name" content="lucene">
 <title>Nutch version 0.7 tutorial</title>
 <link type="text/css" href="skin/basic.css" rel="stylesheet">
 <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
 <link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
 <link type="text/css" href="skin/profile.css" rel="stylesheet">
 <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
 <link rel="shortcut icon" href="images/favicon.ico">
 </head>
 <body onload="init()">
 <script type="text/javascript">ndeSetTextSize();</script>
 <div id="top">
 <!--+
     |breadtrail
     +-->
 <div class="breadtrail">
 <a href="http://www.apache.org/">Apache</a> &gt; <a href="http://lucene.apache.org/">Lucene</a> &gt; <a href="http://lucene.apache.org/nutch/">Nutch</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
 </div>
 <!--+
     |header
     +-->
 <div class="header">
 <!--+
     |start group logo
     +-->
 <div class="grouplogo">
 <a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="images/lucene_green_150.gif" title="Apache Lucene"></a>
 </div>
 <!--+
     |end group logo
     +-->
 <!--+
     |start Project Logo
     +-->
 <div class="projectlogo">
 <a href="http://lucene.apache.org/nutch/"><img class="logoImage" alt="Nutch" src="images/nutch-logo.gif" title="Open Source Web Search Software"></a>
 </div>
 <!--+
     |end Project Logo
     +-->
 <!--+
     |start Search
     +-->
 <div class="searchbox">
 <form action="http://search.lucidimagination.com/p:nutch" method="get" class="roundtopsmall">
 <input onFocus="getBlank (this, 'Search the site with Solr');" size="25" name="q" id="query" type="text" value="Search the site with Solr">&nbsp;
                     <input name="Search" value="Search" type="submit">
 </form>
 <div style="position: relative; top: -5px; left: -10px">Powered by <a href="http://www.lucidimagination.com" style="color: #033268">Lucid Imagination</a>
 </div>
 </div>
 <!--+
     |end search
     +-->
 <!--+
     |start Tabs
     +-->
 <ul id="tabs">
 <li class="current">
 <a class="selected" href="index.html">Main</a>
 </li>
 <li>
 <a class="unselected" href="http://wiki.apache.org/nutch/">Wiki</a>
 </li>
 <li>
 <a class="unselected" href="http://issues.apache.org/jira/browse/Nutch">Jira</a>
 </li>
 </ul>
 <!--+
     |end Tabs
     +-->
 </div>
 </div>
 <div id="main">
 <div id="publishedStrip">
 <!--+
     |start Subtabs
     +-->
 <div id="level2tabs"></div>
 <!--+
     |end Endtabs
     +-->
 <script type="text/javascript"><!--
 document.write("Last Published: " + document.lastModified);
 //  --></script>
 </div>
 <!--+
     |breadtrail
     +-->
 <div class="breadtrail">

              &nbsp;
            </div>
 <!--+
     |start Menu, mainarea
     +-->
 <!--+
     |start Menu
     +-->
 <div id="menu">
 <div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Project</div>
 <div id="menu_1.1" class="menuitemgroup">
 <div class="menuitem">
 <a href="index.html">News</a>
 </div>
 <div class="menuitem">
 <a href="about.html">About</a>
 </div>
 <div class="menuitem">
 <a href="credits.html">Credits</a>
 </div>
 <div class="menuitem">
 <a href="http://www.cafepress.com/nutch/">Buy Stuff</a>
 </div>
 </div>
 <div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
 <div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;">
 <div class="menuitem">
 <a href="http://wiki.apache.org/nutch/FAQ">FAQ</a>
 </div>
 <div class="menuitem">
 <a href="http://wiki.apache.org/nutch/">Wiki</a>
 </div>
 <div class="menupage">
 <div class="menupagetitle">Tutorial (0.7.2)</div>
 </div>
 <div class="menuitem">
 <a href="tutorial8.html">Tutorial (0.8.x)</a>
 </div>
 <div class="menuitem">
 <a href="bot.html">Robot     </a>
 </div>
 <div class="menuitem">
 <a href="i18n.html">i18n</a>
 </div>
 <div class="menuitem">
 <a href="apidocs-1.0/index.html">API Docs (1.0)</a>
 </div>
 <div class="menuitem">
 <a href="apidocs-0.9/index.html">API Docs (0.9)</a>
 </div>
 <div class="menuitem">
 <a href="apidocs-0.8.x/index.html">API Docs (0.8.x)</a>
 </div>
 <div class="menuitem">
 <a href="apidocs/index.html">API Docs (0.7.2)</a>
 </div>
 <div class="menuitem">
 <a href="http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/docs/api/index.html">API Docs (nightly)</a>
 </div>
 </div>
 <div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div>
 <div id="menu_1.3" class="menuitemgroup">
 <div class="menuitem">
 <a href="release/">Download</a>
 </div>
 <div class="menuitem">
 <a href="nightly.html">Nightly builds</a>
 </div>
 <div class="menuitem">
 <a href="mailing_lists.html">Mailing Lists</a>
 </div>
 <div class="menuitem">
 <a href="issue_tracking.html">Issue Tracking</a>
 </div>
 <div class="menuitem">
 <a href="version_control.html">Version Control</a>
 </div>
 </div>
 <div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div>
 <div id="menu_1.4" class="menuitemgroup">
 <div class="menuitem">
 <a href="http://lucene.apache.org/java/">Lucene Java</a>
 </div>
 <div class="menuitem">
 <a href="http://lucene.apache.org/hadoop/">Hadoop</a>
 </div>
 <div class="menuitem">
 <a href="http://incubator.apache.org/solr/">Solr</a>
 </div>
 </div>
 <div id="credit"></div>
 <div id="roundbottom">
 <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
 <!--+
   |alternative credits
   +-->
 <div id="credit2"></div>
 </div>
 <!--+
     |end Menu
     +-->
 <!--+
     |start content
     +-->
 <div id="content">
 <div title="Portable Document Format" class="pdflink">
 <a class="dida" href="tutorial.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
         PDF</a>
 </div>
 <h1>Nutch version 0.7 tutorial</h1>
 <div id="minitoc-area">
 <ul class="minitoc">
 <li>
 <a href="#Requirements">Requirements</a>
 </li>
 <li>
 <a href="#Getting+Started">Getting Started</a>
 </li>
 <li>
 <a href="#Intranet+Crawling">Intranet Crawling</a>
 <ul class="minitoc">
 <li>
 <a href="#Intranet%3A+Configuration">Intranet: Configuration</a>
 </li>
 <li>
 <a href="#Intranet%3A+Running+the+Crawl">Intranet: Running the Crawl</a>
 </li>
 </ul>
 </li>
 <li>
 <a href="#Whole-web+Crawling">Whole-web Crawling</a>
 <ul class="minitoc">
 <li>
 <a href="#Whole-web%3A+Concepts">Whole-web: Concepts</a>
 </li>
 <li>
 <a href="#Whole-web%3A+Boostrapping+the+Web+Database">Whole-web: Bootstrapping the Web Database</a>
 </li>
 <li>
 <a href="#Whole-web%3A+Fetching">Whole-web: Fetching</a>
 </li>
 <li>
 <a href="#Whole-web%3A+Indexing">Whole-web: Indexing</a>
 </li>
 <li>
 <a href="#Searching">Searching</a>
 </li>
 </ul>
 </li>
 </ul>
 </div>


 <a name="N1000D"></a><a name="Requirements"></a>
 <h2 class="h3">Requirements</h2>
 <div class="section">
 <ol>

 <li>Java 1.4.x, from either <a href="http://java.sun.com/j2se/downloads.html">Sun</a> or <a href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a>.
  Linux is the preferred platform.  Set <span class="codefrag">NUTCH_JAVA_HOME</span> to the root
  of your JVM installation.
   </li>

 <li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a>
 4.x.</li>

 <li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for
 shell support.  (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)</li>

 <li>Up to a gigabyte of free disk space, a high-speed connection, and
 an hour or so.
   </li>

 </ol>
 </div>

 <a name="N10036"></a><a name="Getting+Started"></a>
 <h2 class="h3">Getting Started</h2>
 <div class="section">
 <p>First, you need to get a copy of the Nutch code.  You can download
 a release from <a href="http://lucene.apache.org/nutch/release/">http://lucene.apache.org/nutch/release/</a>,
 unpack it, and change to its top-level directory.  Alternatively, check
 out the latest source code from <a href="version_control.html">Subversion</a> and build it
 with <a href="http://ant.apache.org/">Ant</a>.</p>
 <p>Try the following command:</p>
 <pre class="code">bin/nutch</pre>
 <p>This will display the documentation for the Nutch command script.</p>
 <p>Now we're ready to crawl.  There are two approaches to crawling:</p>
 <ol>

 <li>Intranet crawling, with the <span class="codefrag">crawl</span> command.</li>

 <li>Whole-web crawling, with much greater control, using the lower
 level <span class="codefrag">inject</span>, <span class="codefrag">generate</span>, <span class="codefrag">fetch</span>
 and <span class="codefrag">updatedb</span> commands.</li>

 </ol>
 </div>

 <a name="N10071"></a><a name="Intranet+Crawling"></a>
 <h2 class="h3">Intranet Crawling</h2>
 <div class="section">
 <p>Intranet crawling is more appropriate when you intend to crawl up to
 around one million pages on a handful of web servers.</p>
 <a name="N1007A"></a><a name="Intranet%3A+Configuration"></a>
 <h3 class="h4">Intranet: Configuration</h3>
 <p>To configure things for intranet crawling you must:</p>
 <ol>


 <li>Create a flat file of root URLs.  For example, to crawl the
 <span class="codefrag">nutch</span> site you might start with a file named
 <span class="codefrag">urls</span> containing just the Nutch home page.  All other
 Nutch pages should be reachable from this page.  The <span class="codefrag">urls</span>
 file would thus look like:
 <pre class="code">
 http://lucene.apache.org/nutch/
 </pre>

 </li>


 <li>Edit the file <span class="codefrag">conf/crawl-urlfilter.txt</span> and replace
 <span class="codefrag">MY.DOMAIN.NAME</span> with the name of the domain you wish to
 crawl.  For example, if you wished to limit the crawl to the
 <span class="codefrag">apache.org</span> domain, the line should read:
 <pre class="code">
 +^http://([a-z0-9]*\.)*apache.org/
 </pre>
 This will include any URL in the domain <span class="codefrag">apache.org</span>.
 </li>


 </ol>
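 <p>The effect of the filter line can be checked outside Nutch.  The sketch
 below is only an approximation: it assumes the pattern behaves the same
 under <span class="codefrag">grep -E</span> as it does in Nutch's own
 regular-expression engine.  The leading <span class="codefrag">+</span> is
 dropped, and the helper name <span class="codefrag">matches</span> and the
 sample URLs are made up for illustration.</p>

```shell
# Pattern from the filter line above, without the leading "+".
pattern='^http://([a-z0-9]*\.)*apache.org/'

# matches URL: succeeds if the URL passes the pattern (hypothetical helper).
matches() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in http://lucene.apache.org/nutch/ http://apache.org/ http://www.example.com/; do
  if matches "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

 <p>Note that the dot in <span class="codefrag">apache.org</span> is not
 escaped, so it matches any single character.  Writing
 <span class="codefrag">apache\.org</span> would make the rule stricter.</p>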
 <a name="N100AA"></a><a name="Intranet%3A+Running+the+Crawl"></a>
 <h3 class="h4">Intranet: Running the Crawl</h3>
 <p>Once things are configured, running the crawl is easy.  Just use the
 crawl command.  Its options include:</p>
 <ul>

 <li>
 <span class="codefrag">-dir</span> <em>dir</em> names the directory to put the crawl in.</li>

 <li>
 <span class="codefrag">-depth</span> <em>depth</em> indicates the link depth from the root
 page that should be crawled.</li>

 <li>
 <span class="codefrag">-delay</span> <em>delay</em> determines the number of seconds
 between accesses to each host.</li>

 <li>
 <span class="codefrag">-threads</span> <em>threads</em> determines the number of
 threads that will fetch in parallel.</li>

 </ul>
 <p>For example, a typical call might be:</p>
 <pre class="code">
 bin/nutch crawl urls -dir crawl.test -depth 3 &gt;&amp; crawl.log
 </pre>
 <p>Typically one starts testing one's configuration by crawling at low
 depths, and watching the output to check that desired pages are found.
 Once one is more confident of the configuration, then an appropriate
 depth for a full crawl is around 10.</p>
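 <p>The progression from a shallow test crawl to a full crawl might be
 scripted as in the sketch below.  The <span class="codefrag">NUTCH_CMD</span>
 variable is not part of Nutch: it is an assumption that makes the sketch
 default to a dry run that only prints each command.  The directory names
 and the delay and thread values are arbitrary illustrations.</p>

```shell
# NUTCH_CMD is a hypothetical switch: by default commands are only printed.
# Set NUTCH_CMD=bin/nutch to run them for real.
NUTCH_CMD=${NUTCH_CMD:-echo bin/nutch}

# run_crawl DIR DEPTH: crawl the root URLs listed in ./urls into DIR.
run_crawl() {
  $NUTCH_CMD crawl urls -dir "$1" -depth "$2" -delay 5 -threads 10
}

run_crawl crawl.test 3    # shallow test crawl; inspect the output first
run_crawl crawl.full 10   # full crawl once the configuration looks right
```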
 <p>Once crawling has completed, one can skip to the Searching section
 below.</p>
 </div>


 <a name="N100E5"></a><a name="Whole-web+Crawling"></a>
 <h2 class="h3">Whole-web Crawling</h2>
 <div class="section">
 <p>Whole-web crawling is designed to handle very large crawls which may
 take weeks to complete, running on multiple machines.</p>
 <a name="N100EE"></a><a name="Whole-web%3A+Concepts"></a>
 <h3 class="h4">Whole-web: Concepts</h3>
 <p>Nutch data is of two types:</p>
 <ol>

 <li>The web database.  This contains information about every
 page known to Nutch, and about links between those pages.</li>

 <li>A set of segments.  Each segment is a set of pages that are
 fetched and indexed as a unit.  Segment data consists of the
 following types:
 <ul>

 <li>a <em>fetchlist</em> is a file
 that names a set of pages to be fetched</li>

 <li>the <em>fetcher output</em> is a
 set of files containing the fetched pages</li>

 <li>the <em>index</em> is a
 Lucene-format index of the fetcher output.</li>

 </ul>
 </li>

 </ol>
 <p>In the following examples we will keep our web database in a directory
 named <span class="codefrag">db</span> and our segments
 in a directory named <span class="codefrag">segments</span>:</p>
 <pre class="code">mkdir db
 mkdir segments</pre>
 <a name="N10124"></a><a name="Whole-web%3A+Boostrapping+the+Web+Database"></a>
 <h3 class="h4">Whole-web: Bootstrapping the Web Database</h3>
 <p>The admin tool is used to create a new, empty database:</p>
 <pre class="code">bin/nutch admin db -create</pre>
 <p>The <em>injector</em> adds URLs to the database.  Let's inject
 URLs from the <a href="http://dmoz.org/">DMOZ</a> Open
 Directory.  First we must download and uncompress the file listing all
 of the DMOZ pages.  (This is a 200+ MB file, so this will take a few
 minutes.)</p>
 <pre class="code">wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
 gunzip content.rdf.u8.gz</pre>
 <p>Next we inject a random subset of these pages into the web database.
  (We use a random subset so that everyone who runs this tutorial
 doesn't hammer the same sites.)  DMOZ contains around three million
 URLs.  We inject one out of every 3000, so that we end up with
 around 1000 URLs:</p>
 <pre class="code">bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</pre>
 <p>This also takes a few minutes, as it must parse the full file.</p>
 <p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>
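 <p>The subset arithmetic is simple division, as the sketch below shows.
 It assumes the round figure of three million URLs quoted above, and the
 helper name <span class="codefrag">expected_urls</span> is made up.
 Because the subset is chosen at random, the real count will only be close
 to the estimate.</p>

```shell
# expected_urls TOTAL SUBSET: rough number of URLs kept when one out of
# every SUBSET entries is injected (integer division).
expected_urls() {
  echo $(( $1 / $2 ))
}

expected_urls 3000000 3000   # prints 1000
expected_urls 3000000 300    # prints 10000
```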
 <a name="N1014D"></a><a name="Whole-web%3A+Fetching"></a>
 <h3 class="h4">Whole-web: Fetching</h3>
 <p>To fetch, we first generate a fetchlist from the database:</p>
 <pre class="code">bin/nutch generate db segments
 </pre>
 <p>This generates a fetchlist for all of the pages due to be fetched.
  The fetchlist is placed in a newly created segment directory,
  which is named for the time at which it was created.  We
 save the name of this segment in the shell variable <span class="codefrag">s1</span>:</p>
 <pre class="code">s1=`ls -d segments/2* | tail -1`
 echo $s1
 </pre>
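 <p>Picking the last entry works because <span class="codefrag">ls</span>
 sorts names alphabetically, and for timestamp names of equal length that
 order is also chronological.  The sketch below demonstrates this in a
 scratch directory.  The segment names are made up, on the assumption that
 real names are fixed-width timestamps starting with
 <span class="codefrag">2</span>, as the glob suggests, and
 <span class="codefrag">latest_segment</span> is a hypothetical helper.</p>

```shell
# Scratch directory with made-up, timestamp-style segment names.
tmp=$(mktemp -d)
mkdir -p "$tmp/segments/20060101120000" \
         "$tmp/segments/20060102093000" \
         "$tmp/segments/20060103180000"

# latest_segment DIR: print the newest segment directory under DIR.
latest_segment() {
  # ls sorts lexicographically, so the last name is the latest timestamp.
  ls -d "$1"/2* | tail -1
}

s=$(latest_segment "$tmp/segments")
echo "$s"
```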
 <p>Now we run the fetcher on this segment with:</p>
 <pre class="code">bin/nutch fetch $s1</pre>
 <p>When this is complete, we update the database with the results of the
 fetch:</p>
 <pre class="code">bin/nutch updatedb db $s1</pre>
 <p>Now the database has entries for all of the pages referenced by the
 initial set.</p>
 <p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
 <pre class="code">bin/nutch generate db segments -topN 1000
 s2=`ls -d segments/2* | tail -1`
 echo $s2

 bin/nutch fetch $s2
 bin/nutch updatedb db $s2
 </pre>
 <p>Let's fetch one more round:</p>
 <pre class="code">
 bin/nutch generate db segments -topN 1000
 s3=`ls -d segments/2* | tail -1`
 echo $s3

 bin/nutch fetch $s3
 bin/nutch updatedb db $s3
 </pre>
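 <p>Since each round repeats the same three commands, the cycle could be
 wrapped in a small function.  This is only a sketch:
 <span class="codefrag">crawl_round</span> and
 <span class="codefrag">NUTCH_CMD</span> are hypothetical names, and by
 default the function prints the commands instead of running them.</p>

```shell
# NUTCH_CMD is a hypothetical switch: by default commands are only printed.
# Set NUTCH_CMD=bin/nutch to run them for real.
NUTCH_CMD=${NUTCH_CMD:-echo bin/nutch}

# crawl_round DB SEGMENTS TOPN: generate a fetchlist, fetch it, update the db.
crawl_round() {
  $NUTCH_CMD generate "$1" "$2" -topN "$3" || return 1
  # The newest segment is the last name in sorted order.
  seg=$(ls -d "$2"/2* 2>/dev/null | tail -1)
  [ -n "$seg" ] || return 1
  $NUTCH_CMD fetch "$seg" || return 1
  $NUTCH_CMD updatedb "$1" "$seg"
}

# Example (two further rounds of 1000 pages each):
#   for round in 2 3; do crawl_round db segments 1000 || break; done
```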
 <p>By this point we've fetched a few thousand pages.  Let's index
 them!</p>
 <a name="N10187"></a><a name="Whole-web%3A+Indexing"></a>
 <h3 class="h4">Whole-web: Indexing</h3>
 <p>To index each segment we use the <span class="codefrag">index</span>
 command, as follows:</p>
 <pre class="code">bin/nutch index $s1
 bin/nutch index $s2
 bin/nutch index $s3</pre>
 <p>Then, before we can search a set of segments, we need to delete
 duplicate pages.  This is done with:</p>
 <pre class="code">bin/nutch dedup segments dedup.tmp</pre>
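 <p>With more than a few segments, indexing and deduplication could be
 driven by a loop instead of one command per shell variable.  In the sketch
 below, <span class="codefrag">index_all</span> and
 <span class="codefrag">NUTCH_CMD</span> are hypothetical names, and the
 default is a dry run that only prints the commands.</p>

```shell
# NUTCH_CMD is a hypothetical switch: by default commands are only printed.
# Set NUTCH_CMD=bin/nutch to run them for real.
NUTCH_CMD=${NUTCH_CMD:-echo bin/nutch}

# index_all SEGMENTS: index every segment under SEGMENTS, then deduplicate.
index_all() {
  for seg in "$1"/2*; do
    [ -d "$seg" ] || continue   # skip the unexpanded glob when nothing matches
    $NUTCH_CMD index "$seg" || return 1
  done
  $NUTCH_CMD dedup "$1" dedup.tmp
}

index_all segments
```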
 <p>Now we're ready to search!</p>
 <a name="N101A2"></a><a name="Searching"></a>
 <h3 class="h4">Searching</h3>
 <p>To search you need to put the Nutch WAR file into your servlet
 container.  (If instead of downloading a Nutch release you checked the
 sources out of SVN, then you'll first need to build the WAR file, with
 the command <span class="codefrag">ant war</span>.)</p>
 <p>Assuming you've unpacked Tomcat as ~/local/tomcat, the Nutch WAR
 file may be installed with the commands:</p>
 <pre class="code">rm -rf ~/local/tomcat/webapps/ROOT*
 cp nutch*.war ~/local/tomcat/webapps/ROOT.war
 </pre>
 <p>The webapp finds its indexes in <span class="codefrag">./segments</span>, relative
 to the directory where you start Tomcat.  If you've done intranet crawling,
 change to your crawl directory first.  If you've done whole-web
 crawling, stay in the current directory.  Then give the command:</p>
 <pre class="code">~/local/tomcat/bin/catalina.sh start
 </pre>
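 <p>Because the webapp resolves <span class="codefrag">./segments</span>
 against the directory Tomcat is started from, a quick check run before
 starting Tomcat can catch a wrong working directory.  The helper name
 <span class="codefrag">has_segments</span> is hypothetical, and the sketch
 only tests that the directory exists.</p>

```shell
# has_segments DIR: succeed only if DIR contains a segments subdirectory.
has_segments() {
  [ -d "$1/segments" ]
}

if has_segments .; then
  echo "segments found: start Tomcat from $(pwd)"
else
  echo "no ./segments here: change to your crawl directory first"
fi
```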
 <p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a>
 and have fun!</p>
 <p>More detailed tutorials are available on the Nutch Wiki.
 </p>
 </div>


 </div>
 <!--+
     |end content
     +-->
 <div class="clearboth">&nbsp;</div>
 </div>
 <div id="footer">
 <!--+
     |start bottomstrip
     +-->
 <div class="lastmodified">
 <script type="text/javascript"><!--
 document.write("Last Published: " + document.lastModified);
 //  --></script>
 </div>
 <div class="copyright">
         Copyright &copy;
          2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
 </div>
 <!--+
     |end bottomstrip
     +-->
 </div>
 </body>
 </html>