| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <meta content="Apache Forrest" name="Generator"> |
| <meta name="Forrest-version" content="0.7"> |
| <meta name="Forrest-skin-name" content="pelt"> |
| <title>Nutch version 0.8 tutorial</title> |
| <link type="text/css" href="skin/basic.css" rel="stylesheet"> |
| <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet"> |
| <link media="print" type="text/css" href="skin/print.css" rel="stylesheet"> |
| <link type="text/css" href="skin/profile.css" rel="stylesheet"> |
| <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script> |
| <link rel="shortcut icon" href="images/favicon.ico"> |
| </head> |
| <body onload="init()"> |
| <script type="text/javascript">ndeSetTextSize();</script> |
| <div id="top"> |
| <div class="breadtrail"> |
| <a href="http://www.apache.org/">Apache</a> > <a href="http://lucene.apache.org/">Lucene</a> > <a href="http://lucene.apache.org/nutch/">Nutch</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script> |
| </div> |
| <div class="header"> |
| <div class="grouplogo"> |
| <a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="http://lucene.apache.org/java/docs/images/lucene_green_150.gif" title="Apache Lucene"></a> |
| </div> |
| <div class="projectlogo"> |
| <a href="http://lucene.apache.org/nutch/"><img class="logoImage" alt="Nutch" src="images/nutch-logo.gif" title="Open Source Web Search Software"></a> |
| </div> |
| <div class="searchbox"> |
| <form action="http://www.google.com/search" method="get" class="roundtopsmall"> |
| <input value="lucene.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google"> |
<input name="Search" value="Search" type="submit">
| </form> |
| </div> |
| <ul id="tabs"> |
| <li class="current"> |
| <a class="base-selected" href="index.html">Main</a> |
| </li> |
| <li> |
| <a class="base-not-selected" href="http://wiki.apache.org/nutch/">Wiki</a> |
| </li> |
| </ul> |
| </div> |
| </div> |
| <div id="main"> |
| <div id="publishedStrip"> |
| <div id="level2tabs"></div> |
| <script type="text/javascript"><!-- |
| document.write("<text>Last Published:</text> " + document.lastModified); |
| // --></script> |
| </div> |
| <div class="breadtrail"> |
| |
| |
| </div> |
| <div id="menu"> |
| <div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Project</div> |
| <div id="menu_1.1" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="index.html">News</a> |
| </div> |
| <div class="menuitem"> |
| <a href="about.html">About</a> |
| </div> |
| <div class="menuitem"> |
| <a href="credits.html">Credits</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://www.cafepress.com/nutch/">Buy Stuff</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div> |
| <div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;"> |
| <div class="menuitem"> |
| <a href="http://wiki.apache.org/nutch/FAQ">FAQ</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://wiki.apache.org/nutch/">Wiki</a> |
| </div> |
| <div class="menuitem"> |
| <a href="tutorial.html">Tutorial ver. 0.7.2</a> |
| </div> |
| <div class="menupage"> |
| <div class="menupagetitle">Tutorial ver. 0.8</div> |
| </div> |
| <div class="menuitem"> |
<a href="bot.html">Robot</a>
| </div> |
| <div class="menuitem"> |
| <a href="i18n.html">i18n</a> |
| </div> |
| <div class="menuitem"> |
| <a href="apidocs/index.html">API Docs ver. 0.7.2</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://lucene.apache.org/nutch-nightly/docs/api/index.html">API Docs ver. 0.8</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div> |
| <div id="menu_1.3" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="release/">Download</a> |
| </div> |
| <div class="menuitem"> |
| <a href="nightly.html">Nightly builds</a> |
| </div> |
| <div class="menuitem"> |
| <a href="mailing_lists.html">Mailing Lists</a> |
| </div> |
| <div class="menuitem"> |
| <a href="issue_tracking.html">Issue Tracking</a> |
| </div> |
| <div class="menuitem"> |
| <a href="version_control.html">Version Control</a> |
| </div> |
| </div> |
| <div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div> |
| <div id="menu_1.4" class="menuitemgroup"> |
| <div class="menuitem"> |
| <a href="http://lucene.apache.org/java/">Lucene Java</a> |
| </div> |
| <div class="menuitem"> |
| <a href="http://lucene.apache.org/hadoop/">Hadoop</a> |
| </div> |
| </div> |
| <div id="credit"></div> |
| <div id="roundbottom"> |
| <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div> |
| <div id="credit2"></div> |
| </div> |
| <div id="content"> |
| <div title="Portable Document Format" class="pdflink"> |
<a class="dida" href="tutorial8.pdf"><img alt="PDF icon" src="skin/images/pdfdoc.gif" class="skin"><br>
| PDF</a> |
| </div> |
| <h1>Nutch version 0.8 tutorial</h1> |
| <div id="minitoc-area"> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Requirements">Requirements</a> |
| </li> |
| <li> |
| <a href="#Getting+Started">Getting Started</a> |
| </li> |
| <li> |
| <a href="#Intranet+Crawling">Intranet Crawling</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Intranet%3A+Configuration">Intranet: Configuration</a> |
| </li> |
| <li> |
| <a href="#Intranet%3A+Running+the+Crawl">Intranet: Running the Crawl</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#Whole-web+Crawling">Whole-web Crawling</a> |
| <ul class="minitoc"> |
| <li> |
| <a href="#Whole-web%3A+Concepts">Whole-web: Concepts</a> |
| </li> |
| <li> |
<a href="#Whole-web%3A+Boostrapping+the+Web+Database">Whole-web: Bootstrapping the Web Database</a>
| </li> |
| <li> |
| <a href="#Whole-web%3A+Fetching">Whole-web: Fetching</a> |
| </li> |
| <li> |
| <a href="#Whole-web%3A+Indexing">Whole-web: Indexing</a> |
| </li> |
| <li> |
| <a href="#Searching">Searching</a> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| </div> |
| |
| |
| <a name="N1000C"></a><a name="Requirements"></a> |
| <h2 class="h3">Requirements</h2> |
| <div class="section"> |
| <ol> |
| |
<li>Java 1.4.x, either from <a href="http://java.sun.com/j2se/downloads.html">Sun</a> or <a href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a>; on
Linux these are the preferred JVMs. Set <span class="codefrag">NUTCH_JAVA_HOME</span> to the root
of your JVM installation.
</li>
| |
| <li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a> |
| 4.x.</li> |
| |
| <li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for |
| shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)</li> |
| |
| <li>Up to a gigabyte of free disk space, a high-speed connection, and |
| an hour or so. |
| </li> |
| |
| </ol> |
| </div> |
| |
| <a name="N10035"></a><a name="Getting+Started"></a> |
| <h2 class="h3">Getting Started</h2> |
| <div class="section"> |
| <p>First, you need to get a copy of the Nutch code. You can download |
| a release from <a href="http://lucene.apache.org/nutch/release/">http://lucene.apache.org/nutch/release/</a>. |
Unpack the release and change to its top-level directory. Or, check
| out the latest source code from <a href="version_control.html">subversion</a> and build it |
| with <a href="http://ant.apache.org/">Ant</a>.</p> |
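<p>For example, the download and unpack steps might look like the
following (the archive name here is a placeholder; substitute the file
name of whichever release you actually download):</p>
<pre class="code">tar xzf nutch-0.8.tar.gz
cd nutch-0.8
</pre>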
| <p>Try the following command:</p> |
| <pre class="code">bin/nutch</pre> |
| <p>This will display the documentation for the Nutch command script.</p> |
| <p>Now we're ready to crawl. There are two approaches to crawling:</p> |
| <ol> |
| |
| <li>Intranet crawling, with the <span class="codefrag">crawl</span> command.</li> |
| |
| <li>Whole-web crawling, with much greater control, using the lower |
| level <span class="codefrag">inject</span>, <span class="codefrag">generate</span>, <span class="codefrag">fetch</span> |
| and <span class="codefrag">updatedb</span> commands.</li> |
| |
| </ol> |
| </div> |
| |
| <a name="N10070"></a><a name="Intranet+Crawling"></a> |
| <h2 class="h3">Intranet Crawling</h2> |
| <div class="section"> |
| <p>Intranet crawling is more appropriate when you intend to crawl up to |
| around one million pages on a handful of web servers.</p> |
| <a name="N10079"></a><a name="Intranet%3A+Configuration"></a> |
| <h3 class="h4">Intranet: Configuration</h3> |
| <p>To configure things for intranet crawling you must:</p> |
| <ol> |
| |
| |
| <li>Create a directory with a flat file of root urls. For example, to |
| crawl the <span class="codefrag">nutch</span> site you might start with a file named |
| <span class="codefrag">urls/nutch</span> containing the url of just the Nutch home |
| page. All other Nutch pages should be reachable from this page. The |
| <span class="codefrag">urls/nutch</span> file would thus contain: |
| <pre class="code"> |
| http://lucene.apache.org/nutch/ |
| </pre> |
| |
| </li> |
| |
| |
| <li>Edit the file <span class="codefrag">conf/crawl-urlfilter.txt</span> and replace |
| <span class="codefrag">MY.DOMAIN.NAME</span> with the name of the domain you wish to |
| crawl. For example, if you wished to limit the crawl to the |
| <span class="codefrag">apache.org</span> domain, the line should read: |
| <pre class="code"> |
| +^http://([a-z0-9]*\.)*apache.org/ |
| </pre> |
| This will include any url in the domain <span class="codefrag">apache.org</span>. |
| </li> |
| |
<li>Edit the file <span class="codefrag">conf/nutch-site.xml</span>, insert at minimum the
following properties into it, and fill in appropriate values for them:
| <pre class="code"> |
| |
| <property> |
| <name>http.agent.name</name> |
| <value></value> |
| <description>HTTP 'User-Agent' request header. MUST NOT be empty - |
| please set this to a single word uniquely related to your organization. |
| |
| NOTE: You should also check other related properties: |
| |
| http.robots.agents |
| http.agent.description |
| http.agent.url |
| http.agent.email |
| http.agent.version |
| |
| and set their values appropriately. |
| |
| </description> |
| </property> |
| |
| <property> |
| <name>http.agent.description</name> |
| <value></value> |
<description>Further description of our bot - this text is used in
the User-Agent header. It appears in parentheses after the agent name.
| </description> |
| </property> |
| |
| <property> |
| <name>http.agent.url</name> |
| <value></value> |
| <description>A URL to advertise in the User-Agent header. This will |
appear in parentheses after the agent name. Custom dictates that this
| should be a URL of a page explaining the purpose and behavior of this |
| crawler. |
| </description> |
| </property> |
| |
| <property> |
| <name>http.agent.email</name> |
| <value></value> |
| <description>An email address to advertise in the HTTP 'From' request |
| header and User-Agent header. A good practice is to mangle this |
| address (e.g. 'info at example dot com') to avoid spamming. |
| </description> |
| </property> |
| |
| |
| </pre> |
| |
| </li> |
| |
| </ol> |
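<p>The first step above can be done from the shell; for example, using
the Nutch home page url from this tutorial:</p>
<pre class="code">mkdir urls
echo 'http://lucene.apache.org/nutch/' > urls/nutch
</pre>
<p>The remaining two steps are edits to
<span class="codefrag">conf/crawl-urlfilter.txt</span> and
<span class="codefrag">conf/nutch-site.xml</span> with a text editor, as
described above.</p>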
| <a name="N100B3"></a><a name="Intranet%3A+Running+the+Crawl"></a> |
| <h3 class="h4">Intranet: Running the Crawl</h3> |
| <p>Once things are configured, running the crawl is easy. Just use the |
| crawl command. Its options include:</p> |
| <ul> |
| |
| <li> |
| <span class="codefrag">-dir</span> <em>dir</em> names the directory to put the crawl in.</li> |
| |
| <li> |
| <span class="codefrag">-threads</span> <em>threads</em> determines the number of |
| threads that will fetch in parallel.</li> |
| |
| <li> |
| <span class="codefrag">-depth</span> <em>depth</em> indicates the link depth from the root |
| page that should be crawled.</li> |
| |
| <li> |
| <span class="codefrag">-topN</span> <em>N</em> determines the maximum number of pages that |
| will be retrieved at each level up to the depth.</li> |
| |
| </ul> |
| <p>For example, a typical call might be:</p> |
| <pre class="code"> |
| bin/nutch crawl urls -dir crawl -depth 3 -topN 50 |
| </pre> |
| <p>Typically one starts testing one's configuration by crawling at |
| shallow depths, sharply limiting the number of pages fetched at each |
| level (<span class="codefrag">-topN</span>), and watching the output to check that |
desired pages are fetched and undesirable pages are not. Once one is
confident of the configuration, an appropriate depth for a full
crawl is around 10. The number of pages per level
(<span class="codefrag">-topN</span>) for a full crawl can be from tens of thousands to
millions, depending on your resources.</p>
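<p>Such a test run might look like this (the directory name and the
small depth and <span class="codefrag">-topN</span> limits are
arbitrary):</p>
<pre class="code">bin/nutch crawl urls -dir crawl.test -depth 2 -topN 10
</pre>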
| <p>Once crawling has completed, one can skip to the Searching section |
| below.</p> |
| </div> |
| |
| |
| <a name="N100F4"></a><a name="Whole-web+Crawling"></a> |
| <h2 class="h3">Whole-web Crawling</h2> |
| <div class="section"> |
| <p>Whole-web crawling is designed to handle very large crawls which may |
| take weeks to complete, running on multiple machines.</p> |
| <a name="N100FD"></a><a name="Whole-web%3A+Concepts"></a> |
| <h3 class="h4">Whole-web: Concepts</h3> |
| <p>Nutch data is composed of:</p> |
| <ol> |
| |
| |
| <li>The crawl database, or <em>crawldb</em>. This contains |
| information about every url known to Nutch, including whether it was |
| fetched, and, if so, when.</li> |
| |
| |
| <li>The link database, or <em>linkdb</em>. This contains the list |
| of known links to each url, including both the source url and anchor |
| text of the link.</li> |
| |
| |
<li>A set of <em>segments</em>. Each segment is a set of urls that are
fetched as a unit. Segments are directories with the following
subdirectories:
<ul>

<li>a <em>crawl_generate</em> names a set of urls to be fetched</li>

<li>a <em>crawl_fetch</em> contains the status of fetching each url</li>

<li>a <em>content</em> contains the content of each url</li>

<li>a <em>parse_text</em> contains the parsed text of each url</li>

<li>a <em>parse_data</em> contains outlinks and metadata parsed
from each url</li>

<li>a <em>crawl_parse</em> contains the outlink urls, used to
update the crawldb</li>

</ul>
</li>
| |
| |
<li>The <em>indexes</em> are Lucene-format indexes.</li>
| |
| |
| </ol> |
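<p>Put together, a crawl directory might look like the sketch below
(the segment timestamp is just an example; segments are named by
their creation time):</p>
<pre class="code">crawl/
  crawldb/
  linkdb/
  indexes/
  segments/
    20060801120000/
      content/
      crawl_fetch/
      crawl_generate/
      crawl_parse/
      parse_data/
      parse_text/
</pre>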
<a name="N1014A"></a><a name="Whole-web%3A+Boostrapping+the+Web+Database"></a>
<h3 class="h4">Whole-web: Bootstrapping the Web Database</h3>
| <p>The <em>injector</em> adds urls to the crawldb. Let's inject URLs |
| from the <a href="http://dmoz.org/">DMOZ</a> Open Directory. First we |
| must download and uncompress the file listing all of the DMOZ pages. |
(This is a 200+ MB file, so the download will take a few minutes.)</p>
| <pre class="code">wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz |
| gunzip content.rdf.u8.gz</pre> |
| <p>Next we select a random subset of these pages. |
| (We use a random subset so that everyone who runs this tutorial |
| doesn't hammer the same sites.) DMOZ contains around three million |
| URLs. We select one out of every 5000, so that we end up with |
| around 1000 URLs:</p> |
| <pre class="code">mkdir dmoz |
| bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls</pre> |
| <p>The parser also takes a few minutes, as it must parse the full |
| file. Finally, we initialize the crawl db with the selected urls.</p> |
| <pre class="code">bin/nutch inject crawl/crawldb dmoz</pre> |
| <p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p> |
| <a name="N10170"></a><a name="Whole-web%3A+Fetching"></a> |
| <h3 class="h4">Whole-web: Fetching</h3> |
<p>
Starting with version 0.8, the Nutch user agent identifier must be configured
before fetching. To do this, edit the file <span class="codefrag">conf/nutch-site.xml</span>, insert at minimum the
following properties into it, and fill in appropriate values for them:
</p>
| <pre class="code"> |
| |
| <property> |
| <name>http.agent.name</name> |
| <value></value> |
| <description>HTTP 'User-Agent' request header. MUST NOT be empty - |
| please set this to a single word uniquely related to your organization. |
| |
| NOTE: You should also check other related properties: |
| |
| http.robots.agents |
| http.agent.description |
| http.agent.url |
| http.agent.email |
| http.agent.version |
| |
| and set their values appropriately. |
| |
| </description> |
| </property> |
| |
| <property> |
| <name>http.agent.description</name> |
| <value></value> |
<description>Further description of our bot - this text is used in
the User-Agent header. It appears in parentheses after the agent name.
| </description> |
| </property> |
| |
| <property> |
| <name>http.agent.url</name> |
| <value></value> |
| <description>A URL to advertise in the User-Agent header. This will |
appear in parentheses after the agent name. Custom dictates that this
| should be a URL of a page explaining the purpose and behavior of this |
| crawler. |
| </description> |
| </property> |
| |
| <property> |
| <name>http.agent.email</name> |
| <value></value> |
| <description>An email address to advertise in the HTTP 'From' request |
| header and User-Agent header. A good practice is to mangle this |
| address (e.g. 'info at example dot com') to avoid spamming. |
| </description> |
| </property> |
| |
| |
| </pre> |
| <p>To fetch, we first generate a fetchlist from the database:</p> |
| <pre class="code">bin/nutch generate crawl/crawldb crawl/segments |
| </pre> |
| <p>This generates a fetchlist for all of the pages due to be fetched. |
| The fetchlist is placed in a newly created segment directory. |
| The segment directory is named by the time it's created. We |
| save the name of this segment in the shell variable <span class="codefrag">s1</span>:</p> |
| <pre class="code">s1=`ls -d crawl/segments/2* | tail -1` |
| echo $s1 |
| </pre> |
| <p>Now we run the fetcher on this segment with:</p> |
| <pre class="code">bin/nutch fetch $s1</pre> |
| <p>When this is complete, we update the database with the results of the |
| fetch:</p> |
| <pre class="code">bin/nutch updatedb crawl/crawldb $s1</pre> |
| <p>Now the database has entries for all of the pages referenced by the |
| initial set.</p> |
| <p>Now we fetch a new segment with the top-scoring 1000 pages:</p> |
| <pre class="code">bin/nutch generate crawl/crawldb crawl/segments -topN 1000 |
| s2=`ls -d crawl/segments/2* | tail -1` |
| echo $s2 |
| |
| bin/nutch fetch $s2 |
| bin/nutch updatedb crawl/crawldb $s2 |
| </pre> |
| <p>Let's fetch one more round:</p> |
| <pre class="code"> |
| bin/nutch generate crawl/crawldb crawl/segments -topN 1000 |
| s3=`ls -d crawl/segments/2* | tail -1` |
| echo $s3 |
| |
| bin/nutch fetch $s3 |
| bin/nutch updatedb crawl/crawldb $s3 |
| </pre> |
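<p>Since each round is identical, the generate/fetch/update rounds can
also be scripted as a shell loop. A sketch (the round count and
<span class="codefrag">-topN</span> value here are arbitrary):</p>
<pre class="code">for round in 1 2 3
do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done
</pre>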
| <p>By this point we've fetched a few thousand pages. Let's index |
| them!</p> |
| <a name="N101B4"></a><a name="Whole-web%3A+Indexing"></a> |
| <h3 class="h4">Whole-web: Indexing</h3> |
| <p>Before indexing we first invert all of the links, so that we may |
| index incoming anchor text with the pages.</p> |
| <pre class="code">bin/nutch invertlinks crawl/linkdb crawl/segments</pre> |
| <p>To index the segments we use the <span class="codefrag">index</span> command, as follows:</p> |
| <pre class="code">bin/nutch index indexes crawl/linkdb crawl/segments/*</pre> |
| <p>Now we're ready to search!</p> |
| <a name="N101D5"></a><a name="Searching"></a> |
| <h3 class="h4">Searching</h3> |
<p>To search you need to put the Nutch WAR file into your servlet
container. (If instead of downloading a Nutch release you checked the
sources out of SVN, then you'll first need to build the WAR file, with
the command <span class="codefrag">ant war</span>.)</p>
<p>Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch WAR
| file may be installed with the commands:</p> |
| <pre class="code">rm -rf ~/local/tomcat/webapps/ROOT* |
| cp nutch*.war ~/local/tomcat/webapps/ROOT.war |
| </pre> |
| <p>The webapp finds its indexes in <span class="codefrag">./crawl</span>, relative |
| to where you start Tomcat, so use a command like:</p> |
| <pre class="code">~/local/tomcat/bin/catalina.sh start |
| </pre> |
| <p>Then visit <a href="http://localhost:8080/">http://localhost:8080/</a> |
| and have fun!</p> |
| <p>More detailed tutorials are available on the Nutch Wiki. |
| </p> |
| </div> |
| |
| |
| </div> |
| <div class="clearboth"> </div> |
| </div> |
| <div id="footer"> |
| <div class="lastmodified"> |
| <script type="text/javascript"><!-- |
| document.write("<text>Last Published:</text> " + document.lastModified); |
| // --></script> |
| </div> |
| <div class="copyright"> |
Copyright &copy;
| 2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a> |
| </div> |
| </div> |
| </body> |
| </html> |