| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta content="Apache Forrest" name="Generator"> |
| <meta name="Forrest-version" content="0.8"> |
| <meta name="Forrest-skin-name" content="lucene"> |
| <title>Nutch version 0.7 tutorial</title> |
| <link type="text/css" href="skin/basic.css" rel="stylesheet"> |
| <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet"> |
| <link media="print" type="text/css" href="skin/print.css" rel="stylesheet"> |
| <link type="text/css" href="skin/profile.css" rel="stylesheet"> |
| <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script> |
| <link rel="shortcut icon" href="images/favicon.ico"> |
| </head> |
| <body onload="init()"> |
| <script type="text/javascript">ndeSetTextSize();</script> |
| <div id="top"> |
| <!--+ |
| |breadtrail |
| +--> |
| <div class="breadtrail"> |
| <a href="http://www.apache.org/">Apache</a> > <a href="http://lucene.apache.org/">Lucene</a> > <a href="http://lucene.apache.org/nutch/">Nutch</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script> |
| </div> |
| <!--+ |
| |header |
| +--> |
| <div class="header"> |
| <!--+ |
| |start group logo |
| +--> |
| <div class="grouplogo"> |
| <a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="images/lucene_green_150.gif" title="Apache Lucene"></a> |
| </div> |
| <!--+ |
| |end group logo |
| +--> |
| <!--+ |
| |start Project Logo |
| +--> |
| <div class="projectlogo"> |
| <a href="http://lucene.apache.org/nutch/"><img class="logoImage" alt="Nutch" src="images/nutch-logo.gif" title="Open Source Web Search Software"></a> |
| </div> |
| <!--+ |
| |end Project Logo |
| +--> |
| <!--+ |
| |start Search |
| +--> |
| <div class="searchbox"> |
| <form action="http://search.lucidimagination.com/p:nutch" method="get" class="roundtopsmall"> |
| <input onFocus="getBlank (this, 'Search the site with Solr');" size="25" name="q" id="query" type="text" value="Search the site with Solr"> |
| <input name="Search" value="Search" type="submit"> |
| </form> |
| <div style="position: relative; top: -5px; left: -10px">Powered by <a href="http://www.lucidimagination.com" style="color: #033268">Lucid Imagination</a> |
| </div> |
| </div> |
| <!--+ |
| |end search |
| +--> |
| <!--+ |
| |start Tabs |
| +--> |
| <ul id="tabs"> |
| <li class="current"> |
| <a class="selected" href="index.html">Main</a> |
| </li> |
| <li> |
| <a class="unselected" href="http://wiki.apache.org/nutch/">Wiki</a> |
| </li> |
| <li> |
| <a class="unselected" href="http://issues.apache.org/jira/browse/Nutch">Jira</a> |
| </li> |
| </ul> |
| <!--+ |
| |end Tabs |
| +--> |
| </div> |
| </div> |
| <div id="main"> |
| <div id="publishedStrip"> |
| <!--+ |
| |start Subtabs |
| +--> |
| <div id="level2tabs"></div> |
| <!--+ |
| |end Endtabs |
| +--> |
| <script type="text/javascript"><!-- |
| document.write("Last Published: " + document.lastModified); |
| // --></script> |
| </div> |
| <!--+ |
| |breadtrail |
| +--> |
| <div class="breadtrail"> |
| |
| |
| </div> |
| <!--+ |
| |start Menu, mainarea |
| +--> |
| <!--+ |
| |start Menu |
| +--> |
| <div id="menu"> |
| <div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Project</div> |
| <div id="menu_1.1" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="index.html">News</a> |
| </div> |
| <div class="menuitem"> |
| <a href="about.html">About</a> |
| </div> |
| <div class="menuitem"> |
| <a href="credits.html">Credits</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://www.cafepress.com/nutch/">Buy Stuff</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div> |
| <div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;"> |
| <div class="menuitem"> |
| <a href="http://wiki.apache.org/nutch/FAQ">FAQ</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://wiki.apache.org/nutch/">Wiki</a> |
| </div> |
| <div class="menupage"> |
| <div class="menupagetitle">Tutorial (0.7.2)</div> |
| </div> |
| <div class="menuitem"> |
| <a href="tutorial8.html">Tutorial (0.8.x)</a> |
| </div> |
| <div class="menuitem"> |
| <a href="bot.html">Robot </a> |
| </div> |
| <div class="menuitem"> |
| <a href="i18n.html">i18n</a> |
| </div> |
| <div class="menuitem"> |
| <a href="apidocs-1.0/index.html">API Docs (1.0)</a> |
| </div> |
| <div class="menuitem"> |
| <a href="apidocs-0.9/index.html">API Docs (0.9)</a> |
| </div> |
| <div class="menuitem"> |
| <a href="apidocs-0.8.x/index.html">API Docs (0.8.x)</a> |
| </div> |
| <div class="menuitem"> |
| <a href="apidocs/index.html">API Docs (0.7.2)</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/docs/api/index.html">API Docs (nightly)</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div> |
| <div id="menu_1.3" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="release/">Download</a> |
| </div> |
| <div class="menuitem"> |
| <a href="nightly.html">Nightly builds</a> |
| </div> |
| <div class="menuitem"> |
| <a href="mailing_lists.html">Mailing Lists</a> |
| </div> |
| <div class="menuitem"> |
| <a href="issue_tracking.html">Issue Tracking</a> |
| </div> |
| <div class="menuitem"> |
| <a href="version_control.html">Version Control</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div> |
| <div id="menu_1.4" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="http://lucene.apache.org/java/">Lucene Java</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://lucene.apache.org/hadoop/">Hadoop</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://incubator.apache.org/solr/">Solr</a> |
| </div> |
| </div> |
| <div id="credit"></div> |
| <div id="roundbottom"> |
| <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div> |
| <!--+ |
| |alternative credits |
| +--> |
| <div id="credit2"></div> |
| </div> |
| <!--+ |
| |end Menu |
| +--> |
| <!--+ |
| |start content |
| +--> |
| <div id="content"> |
| <div title="Portable Document Format" class="pdflink"> |
| <a class="dida" href="tutorial.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br> |
| PDF</a> |
| </div> |
| <h1>Nutch version 0.7 tutorial</h1> |
| <div id="minitoc-area"> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Requirements">Requirements</a> |
| </li> |
| <li> |
| <a href="#Getting+Started">Getting Started</a> |
| </li> |
| <li> |
| <a href="#Intranet+Crawling">Intranet Crawling</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Intranet%3A+Configuration">Intranet: Configuration</a> |
| </li> |
| <li> |
| <a href="#Intranet%3A+Running+the+Crawl">Intranet: Running the Crawl</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#Whole-web+Crawling">Whole-web Crawling</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Whole-web%3A+Concepts">Whole-web: Concepts</a> |
| </li> |
| <li> |
| <a href="#Whole-web%3A+Boostrapping+the+Web+Database">Whole-web: Bootstrapping the Web Database</a> |
| </li> |
| <li> |
| <a href="#Whole-web%3A+Fetching">Whole-web: Fetching</a> |
| </li> |
| <li> |
| <a href="#Whole-web%3A+Indexing">Whole-web: Indexing</a> |
| </li> |
| <li> |
| <a href="#Searching">Searching</a> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| </div> |
| |
| |
| <a name="N1000D"></a><a name="Requirements"></a> |
| <h2 class="h3">Requirements</h2> |
| <div class="section"> |
| <ol> |
| |
| <li>Java 1.4.x, either from <a href="http://java.sun.com/j2se/downloads.html">Sun</a> or <a href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a> on |
| Linux is preferred. Set <span class="codefrag">NUTCH_JAVA_HOME</span> to the root |
| of your JVM installation. |
| </li> |
| |
| <li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a> |
| 4.x.</li> |
| |
| <li>On Win32, <a href="http://www.cygwin.com/">Cygwin</a>, for |
| shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package, in the "Devel" category, when you install Cygwin.)</li> |
| |
| <li>Up to a gigabyte of free disk space, a high-speed connection, and |
| an hour or so. |
| </li> |
| |
| </ol> |
| </div> |
| |
| <a name="N10036"></a><a name="Getting+Started"></a> |
| <h2 class="h3">Getting Started</h2> |
| <div class="section"> |
| <p>First, you need to get a copy of the Nutch code. You can download |
| a release from <a href="http://lucene.apache.org/nutch/release/">http://lucene.apache.org/nutch/release/</a>. |
| Unpack the release and change to its top-level directory. Alternatively, check |
| out the latest source code from <a href="version_control.html">Subversion</a> and build it |
| with <a href="http://ant.apache.org/">Ant</a>.</p> |
| <p>Try the following command:</p> |
| <pre class="code">bin/nutch</pre> |
| <p>This will display the documentation for the Nutch command script.</p> |
| <p>Now we're ready to crawl. There are two approaches to crawling:</p> |
| <ol> |
| |
| <li>Intranet crawling, with the <span class="codefrag">crawl</span> command.</li> |
| |
| <li>Whole-web crawling, with much greater control, using the lower |
| level <span class="codefrag">inject</span>, <span class="codefrag">generate</span>, <span class="codefrag">fetch</span> |
| and <span class="codefrag">updatedb</span> commands.</li> |
| |
| </ol> |
| </div> |
| |
| <a name="N10071"></a><a name="Intranet+Crawling"></a> |
| <h2 class="h3">Intranet Crawling</h2> |
| <div class="section"> |
| <p>Intranet crawling is more appropriate when you intend to crawl up to |
| around one million pages on a handful of web servers.</p> |
| <a name="N1007A"></a><a name="Intranet%3A+Configuration"></a> |
| <h3 class="h4">Intranet: Configuration</h3> |
| <p>To configure things for intranet crawling you must:</p> |
| <ol> |
| |
| |
| <li>Create a flat file of root URLs. For example, to crawl the |
| <span class="codefrag">nutch</span> site you might start with a file named |
| <span class="codefrag">urls</span> containing just the Nutch home page. All other |
| Nutch pages should be reachable from this page. The <span class="codefrag">urls</span> |
| file would thus look like: |
| <pre class="code"> |
| http://lucene.apache.org/nutch/ |
| </pre> |
| |
| </li> |
| |
| |
| <li>Edit the file <span class="codefrag">conf/crawl-urlfilter.txt</span> and replace |
| <span class="codefrag">MY.DOMAIN.NAME</span> with the name of the domain you wish to |
| crawl. For example, if you wished to limit the crawl to the |
| <span class="codefrag">apache.org</span> domain, the line should read: |
| <pre class="code"> |
| +^http://([a-z0-9]*\.)*apache.org/ |
| </pre> |
| This will include any URL in the domain <span class="codefrag">apache.org</span>. |
| </li> |
| |
| |
| </ol> |
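| <p>As a rough check of the filter rule above, the sketch below tests a few URLs against the same pattern with <span class="codefrag">grep -E</span>. This only approximates what Nutch does: it assumes the pattern behaves the same way under POSIX extended regular expressions, and the helper name <span class="codefrag">url_accepted</span> and the sample URLs are made up for illustration.</p> |

```shell
# Approximate the crawl-urlfilter.txt rule with grep -E. The leading "+" in the
# Nutch file means "accept" and is not part of the regular expression itself.
pattern='^http://([a-z0-9]*\.)*apache.org/'

# Hypothetical helper: prints "accept" or "reject" for the URL given as $1.
url_accepted() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

url_accepted 'http://lucene.apache.org/nutch/'   # inside the apache.org domain
url_accepted 'http://www.example.com/'           # a different domain
url_accepted 'http://apache-org/'                # "." in the pattern matches "-"
```

| <p>The last URL is accepted because the unescaped dot in <span class="codefrag">apache.org</span> matches any character. Writing <span class="codefrag">apache\.org</span> in the pattern would restrict the match to the literal domain name.</p> |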
| <a name="N100AA"></a><a name="Intranet%3A+Running+the+Crawl"></a> |
| <h3 class="h4">Intranet: Running the Crawl</h3> |
| <p>Once things are configured, running the crawl is easy. Just use the |
| crawl command. Its options include:</p> |
| <ul> |
| |
| <li> |
| <span class="codefrag">-dir</span> <em>dir</em> names the directory to put the crawl in.</li> |
| |
| <li> |
| <span class="codefrag">-depth</span> <em>depth</em> indicates the link depth from the root |
| page that should be crawled.</li> |
| |
| <li> |
| <span class="codefrag">-delay</span> <em>delay</em> determines the number of seconds |
| between accesses to each host.</li> |
| |
| <li> |
| <span class="codefrag">-threads</span> <em>threads</em> determines the number of |
| threads that will fetch in parallel.</li> |
| |
| </ul> |
| <p>For example, a typical call might be:</p> |
| <pre class="code"> |
| bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log |
| </pre> |
| <p>It is best to start testing a configuration by crawling at low |
| depths and watching the output to check that the desired pages are found. |
| Once you are confident of the configuration, a depth of around 10 |
| is appropriate for a full crawl.</p> |
| <p>Once crawling has completed, one can skip to the Searching section |
| below.</p> |
| </div> |
| |
| |
| <a name="N100E5"></a><a name="Whole-web+Crawling"></a> |
| <h2 class="h3">Whole-web Crawling</h2> |
| <div class="section"> |
| <p>Whole-web crawling is designed to handle very large crawls which may |
| take weeks to complete, running on multiple machines.</p> |
| <a name="N100EE"></a><a name="Whole-web%3A+Concepts"></a> |
| <h3 class="h4">Whole-web: Concepts</h3> |
| <p>Nutch data is of two types:</p> |
| <ol> |
| |
| <li>The web database. This contains information about every |
| page known to Nutch, and about links between those pages.</li> |
| |
| <li>A set of segments. Each segment is a set of pages that are |
| fetched and indexed as a unit. Segment data consists of the |
| following types: |
| <ul> |
| |
| <li>a <em>fetchlist</em> is a file |
| that names a set of pages to be fetched</li> |
| |
| <li>the <em>fetcher output</em> is a |
| set of files containing the fetched pages</li> |
| |
| <li>the <em>index</em> is a |
| Lucene-format index of the fetcher output.</li> |
| |
| </ul> |
| </li> |
| |
| </ol> |
| <p>In the following examples we will keep our web database in a directory |
| named <span class="codefrag">db</span> and our segments |
| in a directory named <span class="codefrag">segments</span>:</p> |
| <pre class="code">mkdir db |
| mkdir segments</pre> |
| <a name="N10124"></a><a name="Whole-web%3A+Boostrapping+the+Web+Database"></a> |
| <h3 class="h4">Whole-web: Bootstrapping the Web Database</h3> |
| <p>The admin tool is used to create a new, empty database:</p> |
| <pre class="code">bin/nutch admin db -create</pre> |
| <p>The <em>injector</em> adds URLs to the database. Let's inject |
| URLs from the <a href="http://dmoz.org/">DMOZ</a> Open |
| Directory. First we must download and uncompress the file listing all |
| of the DMOZ pages. (This is a 200+ MB file, so this will take a few |
| minutes.)</p> |
| <pre class="code">wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz |
| gunzip content.rdf.u8.gz</pre> |
| <p>Next we inject a random subset of these pages into the web database. |
| (We use a random subset so that everyone who runs this tutorial |
| doesn't hammer the same sites.) DMOZ contains around three million |
| URLs. We inject one out of every 3000, so that we end up with |
| around 1000 URLs:</p> |
| <pre class="code">bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</pre> |
| <p>This also takes a few minutes, as it must parse the full file.</p> |
| <p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p> |
| <a name="N1014D"></a><a name="Whole-web%3A+Fetching"></a> |
| <h3 class="h4">Whole-web: Fetching</h3> |
| <p>To fetch, we first generate a fetchlist from the database:</p> |
| <pre class="code">bin/nutch generate db segments |
| </pre> |
| <p>This generates a fetchlist for all of the pages due to be fetched. |
| The fetchlist is placed in a newly created segment directory. |
| The segment directory is named for the time at which it was created. We |
| save the name of this segment in the shell variable <span class="codefrag">s1</span>:</p> |
| <pre class="code">s1=`ls -d segments/2* | tail -1` |
| echo $s1 |
| </pre> |
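| <p>The backquoted command above relies on segment directory names sorting in creation order. The sketch below illustrates this in a scratch directory. The segment names are invented for illustration, and <span class="codefrag">latest_segment</span> is a hypothetical helper, not part of Nutch.</p> |

```shell
# Build a scratch directory with made-up, timestamp-style segment names.
demo=$(mktemp -d)
mkdir -p "$demo/segments/20050801120000" \
         "$demo/segments/20050801130000" \
         "$demo/segments/20050802090000"

# Hypothetical helper: prints the newest segment under the directory given as $1.
# ls sorts names lexicographically; assuming fixed-width timestamp names, that
# is also chronological order, so the last line is the most recent segment.
latest_segment() {
  ls -d "$1"/segments/2* | tail -1
}

latest=$(latest_segment "$demo")
basename "$latest"

# Remove the scratch directory again.
rm -rf "$demo"
```

| <p>The same reasoning explains why the command must be run right after <span class="codefrag">generate</span>: it always returns the most recently created segment, whichever that is.</p> |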
| <p>Now we run the fetcher on this segment with:</p> |
| <pre class="code">bin/nutch fetch $s1</pre> |
| <p>When this is complete, we update the database with the results of the |
| fetch:</p> |
| <pre class="code">bin/nutch updatedb db $s1</pre> |
| <p>Now the database has entries for all of the pages referenced by the |
| initial set.</p> |
| <p>Now we fetch a new segment with the top-scoring 1000 pages:</p> |
| <pre class="code">bin/nutch generate db segments -topN 1000 |
| s2=`ls -d segments/2* | tail -1` |
| echo $s2 |
| |
| bin/nutch fetch $s2 |
| bin/nutch updatedb db $s2 |
| </pre> |
| <p>Let's fetch one more round:</p> |
| <pre class="code"> |
| bin/nutch generate db segments -topN 1000 |
| s3=`ls -d segments/2* | tail -1` |
| echo $s3 |
| |
| bin/nutch fetch $s3 |
| bin/nutch updatedb db $s3 |
| </pre> |
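| <p>The rounds above repeat the same generate, fetch and updatedb sequence, so they can be wrapped in a small shell function. This is only a sketch: the function name <span class="codefrag">fetch_round</span> and the <span class="codefrag">NUTCH</span> variable are assumptions made for illustration, and the Nutch commands themselves are exactly those shown above.</p> |

```shell
# Path to the Nutch command script; set NUTCH to point somewhere else if needed.
NUTCH=${NUTCH:-bin/nutch}

# Hypothetical helper that runs one generate/fetch/updatedb round.
# $1 = web database directory, $2 = segments directory, $3 = topN limit
fetch_round() {
  db=$1
  segments=$2
  topn=$3
  # Generate a fetchlist in a newly created segment directory.
  "$NUTCH" generate "$db" "$segments" -topN "$topn" || return 1
  # The newest segment is the last one in sorted order.
  segment=$(ls -d "$segments"/2* | tail -1)
  echo "fetching $segment"
  # Fetch the segment, then fold the results back into the database.
  "$NUTCH" fetch "$segment" || return 1
  "$NUTCH" updatedb "$db" "$segment"
}

# Example (run from the Nutch directory): two further rounds of 1000 pages each.
# for round in 1 2; do fetch_round db segments 1000 || break; done
```

| <p>Stopping the loop when a step fails avoids updating the database from a segment that was never fetched.</p> |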
| <p>By this point we've fetched a few thousand pages. Let's index |
| them!</p> |
| <a name="N10187"></a><a name="Whole-web%3A+Indexing"></a> |
| <h3 class="h4">Whole-web: Indexing</h3> |
| <p>To index each segment we use the <span class="codefrag">index</span> |
| command, as follows:</p> |
| <pre class="code">bin/nutch index $s1 |
| bin/nutch index $s2 |
| bin/nutch index $s3</pre> |
| <p>Then, before we can search a set of segments, we need to delete |
| duplicate pages. This is done with:</p> |
| <pre class="code">bin/nutch dedup segments dedup.tmp</pre> |
| <p>Now we're ready to search!</p> |
| <a name="N101A2"></a><a name="Searching"></a> |
| <h3 class="h4">Searching</h3> |
| <p>To search you need to put the Nutch war file into your servlet |
| container. (If instead of downloading a Nutch release you checked the |
| sources out of SVN, then you'll first need to build the war file, with |
| the command <span class="codefrag">ant war</span>.)</p> |
| <p>Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war |
| file may be installed with the commands:</p> |
| <pre class="code">rm -rf ~/local/tomcat/webapps/ROOT* |
| cp nutch*.war ~/local/tomcat/webapps/ROOT.war |
| </pre> |
| <p>The webapp finds its indexes in <span class="codefrag">./segments</span>, relative |
| to the directory where you start Tomcat. If you've done intranet crawling, |
| change to your crawl directory; if you've done whole-web |
| crawling, stay in the current directory. Then give the command:</p> |
| <pre class="code">~/local/tomcat/bin/catalina.sh start |
| </pre> |
| <p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a> |
| and have fun!</p> |
| <p>More detailed tutorials are available on the Nutch Wiki. |
| </p> |
| </div> |
| |
| |
| </div> |
| <!--+ |
| |end content |
| +--> |
| <div class="clearboth"> </div> |
| </div> |
| <div id="footer"> |
| <!--+ |
| |start bottomstrip |
| +--> |
| <div class="lastmodified"> |
| <script type="text/javascript"><!-- |
| document.write("Last Published: " + document.lastModified); |
| // --></script> |
| </div> |
| <div class="copyright"> |
| Copyright © |
| 2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a> |
| </div> |
| <!--+ |
| |end bottomstrip |
| +--> |
| </div> |
| </body> |
| </html> |