<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.7">
<meta name="Forrest-skin-name" content="pelt">
<title>Nutch version 0.8 tutorial</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://lucene.apache.org/">Lucene</a> &gt; <a href="http://lucene.apache.org/nutch/">Nutch</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<div class="header">
<div class="grouplogo">
<a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="http://lucene.apache.org/java/docs/images/lucene_green_150.gif" title="Apache Lucene"></a>
</div>
<div class="projectlogo">
<a href="http://lucene.apache.org/nutch/"><img class="logoImage" alt="Nutch" src="images/nutch-logo.gif" title="Open Source Web Search Software"></a>
</div>
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="lucene.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
<input attr="value" name="Search" value="Search" type="submit">
</form>
</div>
<ul id="tabs">
<li class="current">
<a class="base-selected" href="index.html">Main</a>
</li>
<li>
<a class="base-not-selected" href="http://wiki.apache.org/nutch/">Wiki</a>
</li>
</ul>
</div>
</div>
<div id="main">
<div id="publishedStrip">
<div id="level2tabs"></div>
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="breadtrail">
&nbsp;
</div>
<div id="menu">
<div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Project</div>
<div id="menu_1.1" class="menuitemgroup">
<div class="menuitem">
<a href="index.html">News</a>
</div>
<div class="menuitem">
<a href="about.html">About</a>
</div>
<div class="menuitem">
<a href="credits.html">Credits</a>
</div>
<div class="menuitem">
<a href="http://www.cafepress.com/nutch/">Buy Stuff</a>
</div>
</div>
<div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
<div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;">
<div class="menuitem">
<a href="http://wiki.apache.org/nutch/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/nutch/">Wiki</a>
</div>
<div class="menuitem">
<a href="tutorial.html">Tutorial ver. 0.7.2</a>
</div>
<div class="menupage">
<div class="menupagetitle">Tutorial ver. 0.8</div>
</div>
<div class="menuitem">
<a href="bot.html">Robot </a>
</div>
<div class="menuitem">
<a href="i18n.html">i18n</a>
</div>
<div class="menuitem">
<a href="apidocs/index.html">API Docs ver. 0.7.2</a>
</div>
<div class="menuitem">
<a href="http://lucene.apache.org/nutch-nightly/docs/api/index.html">API Docs ver. 0.8</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div>
<div id="menu_1.3" class="menuitemgroup">
<div class="menuitem">
<a href="release/">Download</a>
</div>
<div class="menuitem">
<a href="nightly.html">Nightly builds</a>
</div>
<div class="menuitem">
<a href="mailing_lists.html">Mailing Lists</a>
</div>
<div class="menuitem">
<a href="issue_tracking.html">Issue Tracking</a>
</div>
<div class="menuitem">
<a href="version_control.html">Version Control</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div>
<div id="menu_1.4" class="menuitemgroup">
<div class="menuitem">
<a href="http://lucene.apache.org/java/">Lucene Java</a>
</div>
<div class="menuitem">
<a href="http://lucene.apache.org/hadoop/">Hadoop</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<div id="credit2"></div>
</div>
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="tutorial8.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
PDF</a>
</div>
<h1>Nutch version 0.8 tutorial</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#Requirements">Requirements</a>
</li>
<li>
<a href="#Getting+Started">Getting Started</a>
</li>
<li>
<a href="#Intranet+Crawling">Intranet Crawling</a>
<ul class="minitoc">
<li>
<a href="#Intranet%3A+Configuration">Intranet: Configuration</a>
</li>
<li>
<a href="#Intranet%3A+Running+the+Crawl">Intranet: Running the Crawl</a>
</li>
</ul>
</li>
<li>
<a href="#Whole-web+Crawling">Whole-web Crawling</a>
<ul class="minitoc">
<li>
<a href="#Whole-web%3A+Concepts">Whole-web: Concepts</a>
</li>
<li>
<a href="#Whole-web%3A+Boostrapping+the+Web+Database">Whole-web: Boostrapping the Web Database</a>
</li>
<li>
<a href="#Whole-web%3A+Fetching">Whole-web: Fetching</a>
</li>
<li>
<a href="#Whole-web%3A+Indexing">Whole-web: Indexing</a>
</li>
<li>
<a href="#Searching">Searching</a>
</li>
</ul>
</li>
</ul>
</div>
<a name="N1000C"></a><a name="Requirements"></a>
<h2 class="h3">Requirements</h2>
<div class="section">
<ol>
<li>Java 1.4.x, either from <a href="http://java.sun.com/j2se/downloads.html">Sun</a> or <a href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a>;
Java on Linux is the preferred platform. Set <span class="codefrag">NUTCH_JAVA_HOME</span> to the root
of your JVM installation (see the sketch after this list).
</li>
<li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a>
4.x.</li>
<li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for
shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)</li>
<li>Up to a gigabyte of free disk space, a high-speed connection, and
an hour or so.
</li>
</ol>
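<p>For example, with a Sun JVM unpacked under
<span class="codefrag">/usr/java/j2sdk1.4.2</span> (an illustrative path; use your
own installation root), the environment could be set like this:</p>
<pre class="code"># hypothetical install path - adjust for your system
export NUTCH_JAVA_HOME=/usr/java/j2sdk1.4.2
export PATH=$NUTCH_JAVA_HOME/bin:$PATH
</pre>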
</div>
<a name="N10035"></a><a name="Getting+Started"></a>
<h2 class="h3">Getting Started</h2>
<div class="section">
<p>First, you need to get a copy of the Nutch code. You can download
a release from <a href="http://lucene.apache.org/nutch/release/">http://lucene.apache.org/nutch/release/</a>.
Unpack the release and change into its top-level directory. Or, check
out the latest source code from <a href="version_control.html">subversion</a> and build it
with <a href="http://ant.apache.org/">Ant</a>.</p>
<p>Try the following command:</p>
<pre class="code">bin/nutch</pre>
<p>This will display the documentation for the Nutch command script.</p>
<p>Now we're ready to crawl. There are two approaches to crawling:</p>
<ol>
<li>Intranet crawling, with the <span class="codefrag">crawl</span> command.</li>
<li>Whole-web crawling, with much greater control, using the lower
level <span class="codefrag">inject</span>, <span class="codefrag">generate</span>, <span class="codefrag">fetch</span>
and <span class="codefrag">updatedb</span> commands.</li>
</ol>
</div>
<a name="N10070"></a><a name="Intranet+Crawling"></a>
<h2 class="h3">Intranet Crawling</h2>
<div class="section">
<p>Intranet crawling is more appropriate when you intend to crawl up to
around one million pages on a handful of web servers.</p>
<a name="N10079"></a><a name="Intranet%3A+Configuration"></a>
<h3 class="h4">Intranet: Configuration</h3>
<p>To configure things for intranet crawling you must:</p>
<ol>
<li>Create a directory with a flat file of root urls (a shell sketch of
this step appears after this list). For example, to
crawl the <span class="codefrag">nutch</span> site you might start with a file named
<span class="codefrag">urls/nutch</span> containing the url of just the Nutch home
page. All other Nutch pages should be reachable from this page. The
<span class="codefrag">urls/nutch</span> file would thus contain:
<pre class="code">
http://lucene.apache.org/nutch/
</pre>
</li>
<li>Edit the file <span class="codefrag">conf/crawl-urlfilter.txt</span> and replace
<span class="codefrag">MY.DOMAIN.NAME</span> with the name of the domain you wish to
crawl. For example, if you wished to limit the crawl to the
<span class="codefrag">apache.org</span> domain, the line should read:
<pre class="code">
+^http://([a-z0-9]*\.)*apache\.org/
</pre>
This will include any url in the domain <span class="codefrag">apache.org</span>.
</li>
<li>Edit the file <span class="codefrag">conf/nutch-site.xml</span>, insert at minimum
following properties into it and edit in proper values for the properties:
<pre class="code">
&lt;property&gt;
&lt;name&gt;http.agent.name&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.description&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;Further description of our bot - this text is used in
the User-Agent header. It appears in parentheses after the agent name.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.url&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;A URL to advertise in the User-Agent header. This will
appear in parentheses after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.email&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to deter spam harvesters.
&lt;/description&gt;
&lt;/property&gt;
</pre>
</li>
</ol>
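<p>As a sketch, creating the seed directory and file from the first step
might look like this in the shell:</p>
<pre class="code">mkdir urls
echo 'http://lucene.apache.org/nutch/' &gt; urls/nutch
</pre>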
<a name="N100B3"></a><a name="Intranet%3A+Running+the+Crawl"></a>
<h3 class="h4">Intranet: Running the Crawl</h3>
<p>Once things are configured, running the crawl is easy. Just use the
crawl command. Its options include:</p>
<ul>
<li>
<span class="codefrag">-dir</span> <em>dir</em> names the directory to put the crawl in.</li>
<li>
<span class="codefrag">-threads</span> <em>threads</em> determines the number of
threads that will fetch in parallel.</li>
<li>
<span class="codefrag">-depth</span> <em>depth</em> indicates the link depth from the root
page that should be crawled.</li>
<li>
<span class="codefrag">-topN</span> <em>N</em> determines the maximum number of pages that
will be retrieved at each level up to the depth.</li>
</ul>
<p>For example, a typical call might be:</p>
<pre class="code">
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
</pre>
<p>Typically one starts testing one's configuration by crawling at
shallow depths, sharply limiting the number of pages fetched at each
level (<span class="codefrag">-topN</span>), and watching the output to check that
desired pages are fetched and undesirable pages are not. Once one is
confident of the configuration, an appropriate depth for a full
crawl is around 10. The number of pages per level
(<span class="codefrag">-topN</span>) for a full crawl can be from tens of thousands to
millions, depending on your resources.</p>
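<p>For instance, a cautious first test run (the directory name and limits
here are arbitrary) might be:</p>
<pre class="code">bin/nutch crawl urls -dir crawl.test -depth 2 -topN 10
</pre>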
<p>Once crawling has completed, one can skip to the Searching section
below.</p>
</div>
<a name="N100F4"></a><a name="Whole-web+Crawling"></a>
<h2 class="h3">Whole-web Crawling</h2>
<div class="section">
<p>Whole-web crawling is designed to handle very large crawls which may
take weeks to complete, running on multiple machines.</p>
<a name="N100FD"></a><a name="Whole-web%3A+Concepts"></a>
<h3 class="h4">Whole-web: Concepts</h3>
<p>Nutch data is composed of:</p>
<ol>
<li>The crawl database, or <em>crawldb</em>. This contains
information about every url known to Nutch, including whether it was
fetched, and, if so, when.</li>
<li>The link database, or <em>linkdb</em>. This contains the list
of known links to each url, including both the source url and anchor
text of the link.</li>
<li>A set of <em>segments</em>. Each segment is a set of urls that are
fetched as a unit. Segments are directories with the following
subdirectories:
<ul>
<li><em>crawl_generate</em> names a set of urls to be fetched</li>
<li><em>crawl_fetch</em> contains the status of fetching each url</li>
<li><em>content</em> contains the content of each url</li>
<li><em>parse_text</em> contains the parsed text of each url</li>
<li><em>parse_data</em> contains outlinks and metadata parsed
from each url</li>
<li><em>crawl_parse</em> contains the outlink urls, used to
update the crawldb</li>
</ul>
</li>
<li>The <em>indexes</em> are Lucene-format indexes.</li>
</ol>
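<p>Put together, a crawl laid out by the commands in this tutorial looks
roughly as follows (the segment name is an invented example; segments are
named by creation time, as described under Fetching below):</p>
<pre class="code">crawl/crawldb/                 # the crawl database
crawl/linkdb/                  # the link database
crawl/segments/20060301120000/ # one segment per fetch round
    crawl_generate/  crawl_fetch/  content/
    parse_text/  parse_data/  crawl_parse/
crawl/indexes/                 # Lucene-format indexes
</pre>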
<a name="N1014A"></a><a name="Whole-web%3A+Boostrapping+the+Web+Database"></a>
<h3 class="h4">Whole-web: Boostrapping the Web Database</h3>
<p>The <em>injector</em> adds urls to the crawldb. Let's inject URLs
from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we
must download and uncompress the file listing all of the DMOZ pages.
(This is a 200+ MB file, so the download will take a few minutes.)</p>
<pre class="code">wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz</pre>
<p>Next we select a random subset of these pages.
(We use a random subset so that everyone who runs this tutorial
doesn't hammer the same sites.) DMOZ contains around three million
URLs. We select one out of every 5000, so that we end up with
around 600 URLs:</p>
<pre class="code">mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 &gt; dmoz/urls</pre>
<p>The parser also takes a few minutes, as it must parse the full
file. Finally, we initialize the crawl db with the selected urls.</p>
<pre class="code">bin/nutch inject crawl/crawldb dmoz</pre>
<p>Now we have a web database with around 600 as-yet unfetched URLs in it.</p>
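<p>To verify the injection, the <span class="codefrag">readdb</span> command can
print statistics about the crawldb, including the number of urls it
contains; a sketch (check <span class="codefrag">bin/nutch</span> for the exact
options in your version):</p>
<pre class="code">bin/nutch readdb crawl/crawldb -stats
</pre>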
<a name="N10170"></a><a name="Whole-web%3A+Fetching"></a>
<h3 class="h4">Whole-web: Fetching</h3>
<p>
Starting with release 0.8, the Nutch user agent identifier must be
configured before fetching. If you have not already done so, edit the
file <span class="codefrag">conf/nutch-site.xml</span> and set, at minimum, the
<span class="codefrag">http.agent.name</span> property and its related
<span class="codefrag">http.agent.*</span> properties, exactly as described in the
Intranet: Configuration section above.
</p>
<pre class="code">
&lt;property&gt;
&lt;name&gt;http.agent.name&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.description&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.url&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;http.agent.email&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
&lt;/description&gt;
&lt;/property&gt;
</pre>
<p>To fetch, we first generate a fetchlist from the database:</p>
<pre class="code">bin/nutch generate crawl/crawldb crawl/segments
</pre>
<p>This generates a fetchlist for all of the pages due to be fetched.
The fetchlist is placed in a newly created segment directory.
The segment directory is named by the time it's created. We
save the name of this segment in the shell variable <span class="codefrag">s1</span>:</p>
<pre class="code">s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
</pre>
<p>Now we run the fetcher on this segment with:</p>
<pre class="code">bin/nutch fetch $s1</pre>
<p>When this is complete, we update the database with the results of the
fetch:</p>
<pre class="code">bin/nutch updatedb crawl/crawldb $s1</pre>
<p>Now the database has entries for all of the pages referenced by the
initial set.</p>
<p>Now we fetch a new segment with the top-scoring 1000 pages:</p>
<pre class="code">bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2
</pre>
<p>Let's fetch one more round:</p>
<pre class="code">
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3
</pre>
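<p>Since every round is identical, the generate/fetch/update cycle can also
be written as a shell loop; a sketch equivalent to the last two rounds
above:</p>
<pre class="code"># each iteration generates, fetches and merges one more top-1000 segment
for round in 1 2; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done
</pre>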
<p>By this point we've fetched a few thousand pages. Let's index
them!</p>
<a name="N101B4"></a><a name="Whole-web%3A+Indexing"></a>
<h3 class="h4">Whole-web: Indexing</h3>
<p>Before indexing we first invert all of the links, so that we may
index incoming anchor text with the pages.</p>
<pre class="code">bin/nutch invertlinks crawl/linkdb crawl/segments</pre>
<p>To index the segments we use the <span class="codefrag">index</span> command, as follows:</p>
<pre class="code">bin/nutch index indexes crawl/linkdb crawl/segments/*</pre>
<p>Now we're ready to search!</p>
<a name="N101D5"></a><a name="Searching"></a>
<h3 class="h4">Searching</h3>
<p>To search, you need to put the Nutch war file into your servlet
container. (If you checked the sources out of SVN instead of downloading
a Nutch release, you'll first need to build the war file, with
the command <span class="codefrag">ant war</span>.)</p>
<p>Assuming you've unpacked Tomcat at ~/local/tomcat, the Nutch war
file can be installed with the commands:</p>
<pre class="code">rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war
</pre>
<p>The webapp finds its indexes in <span class="codefrag">./crawl</span>, relative
to the directory from which you start Tomcat, so start Tomcat from the
directory that contains your crawl:</p>
<pre class="code">~/local/tomcat/bin/catalina.sh start
</pre>
<p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a>
and have fun!</p>
<p>More detailed tutorials are available on the Nutch Wiki.
</p>
</div>
</div>
<div class="clearboth">&nbsp;</div>
</div>
<div id="footer">
<div class="lastmodified">
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="copyright">
Copyright &copy;
2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
</div>
</div>
</body>
</html>