<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.7">
<meta name="Forrest-skin-name" content="pelt">
<title>Nutch version 0.7 tutorial</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://lucene.apache.org/">Lucene</a> &gt; <a href="http://lucene.apache.org/nutch/">Nutch</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<div class="header">
<div class="grouplogo">
<a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="http://lucene.apache.org/java/docs/images/lucene_green_150.gif" title="Apache Lucene"></a>
</div>
<div class="projectlogo">
<a href="http://lucene.apache.org/nutch/"><img class="logoImage" alt="Nutch" src="images/nutch-logo.gif" title="Open Source Web Search Software"></a>
</div>
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="lucene.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
<input attr="value" name="Search" value="Search" type="submit">
</form>
</div>
<ul id="tabs">
<li class="current">
<a class="base-selected" href="index.html">Main</a>
</li>
<li>
<a class="base-not-selected" href="http://wiki.apache.org/nutch/">Wiki</a>
</li>
</ul>
</div>
</div>
<div id="main">
<div id="publishedStrip">
<div id="level2tabs"></div>
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="breadtrail">
&nbsp;
</div>
<div id="menu">
<div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Project</div>
<div id="menu_1.1" class="menuitemgroup">
<div class="menuitem">
<a href="index.html">News</a>
</div>
<div class="menuitem">
<a href="about.html">About</a>
</div>
<div class="menuitem">
<a href="credits.html">Credits</a>
</div>
<div class="menuitem">
<a href="http://www.cafepress.com/nutch/">Buy Stuff</a>
</div>
</div>
<div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
<div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;">
<div class="menuitem">
<a href="http://wiki.apache.org/nutch/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/nutch/">Wiki</a>
</div>
<div class="menupage">
<div class="menupagetitle">Tutorial ver. 0.7.2</div>
</div>
<div class="menuitem">
<a href="tutorial8.html">Tutorial ver. 0.8</a>
</div>
<div class="menuitem">
<a href="bot.html">Robot </a>
</div>
<div class="menuitem">
<a href="i18n.html">i18n</a>
</div>
<div class="menuitem">
<a href="apidocs/index.html">API Docs ver. 0.7.2</a>
</div>
<div class="menuitem">
<a href="nutch-nightly/docs/api/index.html">API Docs ver. 0.8</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div>
<div id="menu_1.3" class="menuitemgroup">
<div class="menuitem">
<a href="release/">Download</a>
</div>
<div class="menuitem">
<a href="nightly.html">Nightly builds</a>
</div>
<div class="menuitem">
<a href="mailing_lists.html">Mailing Lists</a>
</div>
<div class="menuitem">
<a href="issue_tracking.html">Issue Tracking</a>
</div>
<div class="menuitem">
<a href="version_control.html">Version Control</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div>
<div id="menu_1.4" class="menuitemgroup">
<div class="menuitem">
<a href="http://lucene.apache.org/java/">Lucene Java</a>
</div>
<div class="menuitem">
<a href="http://lucene.apache.org/hadoop/">Hadoop</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<div id="credit2"></div>
</div>
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="tutorial.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
PDF</a>
</div>
<h1>Nutch version 0.7 tutorial</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#Requirements">Requirements</a>
</li>
<li>
<a href="#Getting+Started">Getting Started</a>
</li>
<li>
<a href="#Intranet+Crawling">Intranet Crawling</a>
<ul class="minitoc">
<li>
<a href="#Intranet%3A+Configuration">Intranet: Configuration</a>
</li>
<li>
<a href="#Intranet%3A+Running+the+Crawl">Intranet: Running the Crawl</a>
</li>
</ul>
</li>
<li>
<a href="#Whole-web+Crawling">Whole-web Crawling</a>
<ul class="minitoc">
<li>
<a href="#Whole-web%3A+Concepts">Whole-web: Concepts</a>
</li>
<li>
<a href="#Whole-web%3A+Boostrapping+the+Web+Database">Whole-web: Boostrapping the Web Database</a>
</li>
<li>
<a href="#Whole-web%3A+Fetching">Whole-web: Fetching</a>
</li>
<li>
<a href="#Whole-web%3A+Indexing">Whole-web: Indexing</a>
</li>
<li>
<a href="#Searching">Searching</a>
</li>
</ul>
</li>
</ul>
</div>
<a name="N1000C"></a><a name="Requirements"></a>
<h2 class="h3">Requirements</h2>
<div class="section">
<ol>
<li>Java 1.4.x, either from <a href="http://java.sun.com/j2se/downloads.html">Sun</a> or from <a href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a>;
Linux is the preferred platform. Set <span class="codefrag">NUTCH_JAVA_HOME</span> to the root
of your JVM installation (see the example after this list).
</li>
<li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a>
4.x.</li>
<li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for
shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)</li>
<li>Up to a gigabyte of free disk space, a high-speed connection, and
an hour or so.
</li>
</ol>
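<p>For example, with a Bourne-style shell you might point
<span class="codefrag">NUTCH_JAVA_HOME</span> at your JVM like this (the path below is
only an illustration; use the root of your own installation):</p>
<pre class="code">
# adjust the path to wherever your 1.4.x JDK lives
export NUTCH_JAVA_HOME=/usr/java/j2sdk1.4.2
</pre>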
</div>
<a name="N10035"></a><a name="Getting+Started"></a>
<h2 class="h3">Getting Started</h2>
<div class="section">
<p>First, you need to get a copy of the Nutch code. You can download
a release from <a href="http://lucene.apache.org/nutch/release/">http://lucene.apache.org/nutch/release/</a>.
Unpack the release and change into its top-level directory. Or, check
out the latest source code from <a href="version_control.html">subversion</a> and build it
with <a href="http://ant.apache.org/">Ant</a>.</p>
<p>Try the following command:</p>
<pre class="code">bin/nutch</pre>
<p>This will display the documentation for the Nutch command script.</p>
<p>Now we're ready to crawl. There are two approaches to crawling:</p>
<ol>
<li>Intranet crawling, with the <span class="codefrag">crawl</span> command.</li>
<li>Whole-web crawling, with much greater control, using the lower
level <span class="codefrag">inject</span>, <span class="codefrag">generate</span>, <span class="codefrag">fetch</span>
and <span class="codefrag">updatedb</span> commands.</li>
</ol>
</div>
<a name="N10070"></a><a name="Intranet+Crawling"></a>
<h2 class="h3">Intranet Crawling</h2>
<div class="section">
<p>Intranet crawling is more appropriate when you intend to crawl up to
around one million pages on a handful of web servers.</p>
<a name="N10079"></a><a name="Intranet%3A+Configuration"></a>
<h3 class="h4">Intranet: Configuration</h3>
<p>To configure things for intranet crawling you must:</p>
<ol>
<li>Create a flat file of root urls. For example, to crawl the
<span class="codefrag">nutch</span> site you might start with a file named
<span class="codefrag">urls</span> containing just the Nutch home page. All other
Nutch pages should be reachable from this page. The <span class="codefrag">urls</span>
file would thus look like:
<pre class="code">
http://lucene.apache.org/nutch/
</pre>
(A shell one-liner that creates this file is shown after this list.)
</li>
<li>Edit the file <span class="codefrag">conf/crawl-urlfilter.txt</span> and replace
<span class="codefrag">MY.DOMAIN.NAME</span> with the name of the domain you wish to
crawl. For example, if you wished to limit the crawl to the
<span class="codefrag">apache.org</span> domain, the line should read:
<pre class="code">
+^http://([a-z0-9]*\.)*apache.org/
</pre>
This will include any url in the domain <span class="codefrag">apache.org</span>.
</li>
</ol>
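<p>Put together, the setup might look like this from the shell (using the same
example URL and domain as above; substitute your own):</p>
<pre class="code">
echo "http://lucene.apache.org/nutch/" &gt; urls
# then edit conf/crawl-urlfilter.txt so the filter line reads, e.g.:
#   +^http://([a-z0-9]*\.)*apache.org/
</pre>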
<a name="N100A9"></a><a name="Intranet%3A+Running+the+Crawl"></a>
<h3 class="h4">Intranet: Running the Crawl</h3>
<p>Once things are configured, running the crawl is easy. Just use the
crawl command. Its options include:</p>
<ul>
<li>
<span class="codefrag">-dir</span> <em>dir</em> names the directory to put the crawl in.</li>
<li>
<span class="codefrag">-depth</span> <em>depth</em> indicates the link depth from the root
page that should be crawled.</li>
<li>
<span class="codefrag">-delay</span> <em>delay</em> determines the number of seconds
between accesses to each host.</li>
<li>
<span class="codefrag">-threads</span> <em>threads</em> determines the number of
threads that will fetch in parallel.</li>
</ul>
<p>For example, a typical call might be:</p>
<pre class="code">
bin/nutch crawl urls -dir crawl.test -depth 3 &gt;&amp; crawl.log
</pre>
<p>Typically one starts by testing the configuration with crawls at low
depths, watching the output to check that the desired pages are found.
Once the configuration looks right, an appropriate depth for a full
crawl is around 10.</p>
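<p>For example, one might first run a shallow test crawl and, once the log
looks right, a deeper one (the directory and log names here are only
illustrations):</p>
<pre class="code">
bin/nutch crawl urls -dir crawl.test -depth 2 &gt;&amp; crawl-test.log
# inspect crawl-test.log, then run a full crawl at greater depth
bin/nutch crawl urls -dir crawl.full -depth 10 &gt;&amp; crawl-full.log
</pre>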
<p>Once crawling has completed, one can skip to the Searching section
below.</p>
</div>
<a name="N100E4"></a><a name="Whole-web+Crawling"></a>
<h2 class="h3">Whole-web Crawling</h2>
<div class="section">
<p>Whole-web crawling is designed to handle very large crawls which may
take weeks to complete, running on multiple machines.</p>
<a name="N100ED"></a><a name="Whole-web%3A+Concepts"></a>
<h3 class="h4">Whole-web: Concepts</h3>
<p>Nutch data is of two types:</p>
<ol>
<li>The web database. This contains information about every
page known to Nutch, and about links between those pages.</li>
<li>A set of segments. Each segment is a set of pages that are
fetched and indexed as a unit. Segment data consists of the
following types:
<ul>
<li>a <em>fetchlist</em> is a file
that names a set of pages to be fetched</li>
<li>the <em>fetcher output</em> is a
set of files containing the fetched pages</li>
<li>the <em>index</em> is a
Lucene-format index of the fetcher output.</li>
</ul>
</li>
</ol>
<p>In the following examples we will keep our web database in a directory
named <span class="codefrag">db</span> and our segments
in a directory named <span class="codefrag">segments</span>:</p>
<pre class="code">mkdir db
mkdir segments</pre>
<a name="N10123"></a><a name="Whole-web%3A+Boostrapping+the+Web+Database"></a>
<h3 class="h4">Whole-web: Boostrapping the Web Database</h3>
<p>The admin tool is used to create a new, empty database:</p>
<pre class="code">bin/nutch admin db -create</pre>
<p>The <em>injector</em> adds urls into the database. Let's inject
URLs from the <a href="http://dmoz.org/">DMOZ</a> Open
Directory. First we must download and uncompress the file listing all
of the DMOZ pages. (This is a 200+ MB file, so this step will take a few
minutes.)</p>
<pre class="code">wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz</pre>
<p>Next we inject a random subset of these pages into the web database.
(We use a random subset so that everyone who runs this tutorial
doesn't hammer the same sites.) DMOZ contains around three million
URLs. We inject one out of every 3000, so that we end up with
around 1000 URLs:</p>
<pre class="code">bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</pre>
<p>This also takes a few minutes, as it must parse the full file.</p>
<p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>
<a name="N1014C"></a><a name="Whole-web%3A+Fetching"></a>
<h3 class="h4">Whole-web: Fetching</h3>
<p>To fetch, we first generate a fetchlist from the database:</p>
<pre class="code">bin/nutch generate db segments
</pre>
<p>This generates a fetchlist for all of the pages due to be fetched.
The fetchlist is placed in a newly created segment directory.
The segment directory is named by the time it's created. We
save the name of this segment in the shell variable <span class="codefrag">s1</span>:</p>
<pre class="code">s1=`ls -d segments/2* | tail -1`
echo $s1
</pre>
<p>Now we run the fetcher on this segment with:</p>
<pre class="code">bin/nutch fetch $s1</pre>
<p>When this is complete, we update the database with the results of the
fetch:</p>
<pre class="code">bin/nutch updatedb db $s1</pre>
<p>Now the database has entries for all of the pages referenced by the
initial set.</p>
<p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
<pre class="code">bin/nutch generate db segments -topN 1000
s2=`ls -d segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb db $s2
</pre>
<p>Let's fetch one more round:</p>
<pre class="code">
bin/nutch generate db segments -topN 1000
s3=`ls -d segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb db $s3
</pre>
<p>By this point we've fetched a few thousand pages. Let's index
them!</p>
<a name="N10186"></a><a name="Whole-web%3A+Indexing"></a>
<h3 class="h4">Whole-web: Indexing</h3>
<p>To index each segment we use the <span class="codefrag">index</span>
command, as follows:</p>
<pre class="code">bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3</pre>
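<p>Equivalently, if you'd rather not track the segment names by hand, the same
<span class="codefrag">index</span> command can be applied to every segment with a
small shell loop:</p>
<pre class="code">
for segment in segments/2*; do
  bin/nutch index $segment
done
</pre>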
<p>Then, before we can search a set of segments, we need to delete
duplicate pages. This is done with:</p>
<pre class="code">bin/nutch dedup segments dedup.tmp</pre>
<p>Now we're ready to search!</p>
<a name="N101A1"></a><a name="Searching"></a>
<h3 class="h4">Searching</h3>
<p>To search you need to put the nutch war file into your servlet
container. (If instead of downloading a Nutch release you checked the
sources out of SVN, then you'll first need to build the war file, with
the command <span class="codefrag">ant war</span>.)</p>
<p>Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war
file may be installed with the commands:</p>
<pre class="code">rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war
</pre>
<p>The webapp finds its indexes in <span class="codefrag">./segments</span>, relative
to the directory from which you start Tomcat. So, if you've done intranet
crawling, first change into your crawl directory; if you've done whole-web
crawling, stay where you are. Then give the command:</p>
<pre class="code">~/local/tomcat/bin/catalina.sh start
</pre>
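<p>For the intranet case, using the <span class="codefrag">crawl.test</span>
directory from the example above, that would be:</p>
<pre class="code">
cd crawl.test
~/local/tomcat/bin/catalina.sh start
</pre>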
<p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a>
and have fun!</p>
<p>More detailed tutorials are available on the Nutch Wiki.
</p>
</div>
</div>
<div class="clearboth">&nbsp;</div>
</div>
<div id="footer">
<div class="lastmodified">
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="copyright">
Copyright &copy;
2005 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
</div>
</div>
</body>
</html>