| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <!-- NewPage --> |
| <html lang="en"> |
| <head> |
| <!-- Generated by javadoc (1.8.0_221) on Tue Jan 19 12:28:00 PST 2021 --> |
| <title>org.apache.nutch.collection (apache-nutch 1.18 API)</title> |
| <meta name="date" content="2021-01-19"> |
| <link rel="stylesheet" type="text/css" href="../../../../stylesheet.css" title="Style"> |
| <script type="text/javascript" src="../../../../script.js"></script> |
| </head> |
| <body> |
| <script type="text/javascript"><!-- |
| try { |
| if (location.href.indexOf('is-external=true') == -1) { |
| parent.document.title="org.apache.nutch.collection (apache-nutch 1.18 API)"; |
| } |
| } |
| catch(err) { |
| } |
| //--> |
| </script> |
| <noscript> |
| <div>JavaScript is disabled on your browser.</div> |
| </noscript> |
| <!-- ========= START OF TOP NAVBAR ======= --> |
| <div class="topNav"><a name="navbar.top"> |
| <!-- --> |
| </a> |
| <div class="skipNav"><a href="#skip.navbar.top" title="Skip navigation links">Skip navigation links</a></div> |
| <a name="navbar.top.firstrow"> |
| <!-- --> |
| </a> |
| <ul class="navList" title="Navigation"> |
| <li><a href="../../../../overview-summary.html">Overview</a></li> |
| <li class="navBarCell1Rev">Package</li> |
| <li>Class</li> |
| <li><a href="package-use.html">Use</a></li> |
| <li><a href="package-tree.html">Tree</a></li> |
| <li><a href="../../../../deprecated-list.html">Deprecated</a></li> |
| <li><a href="../../../../index-all.html">Index</a></li> |
| <li><a href="../../../../help-doc.html">Help</a></li> |
| </ul> |
| </div> |
| <div class="subNav"> |
| <ul class="navList"> |
| <li><a href="../../../../org/apache/nutch/any23/package-summary.html">Prev Package</a></li> |
| <li><a href="../../../../org/apache/nutch/crawl/package-summary.html">Next Package</a></li> |
| </ul> |
| <ul class="navList"> |
| <li><a href="../../../../index.html?org/apache/nutch/collection/package-summary.html" target="_top">Frames</a></li> |
| <li><a href="package-summary.html" target="_top">No Frames</a></li> |
| </ul> |
| <ul class="navList" id="allclasses_navbar_top"> |
| <li><a href="../../../../allclasses-noframe.html">All Classes</a></li> |
| </ul> |
| <div> |
| <script type="text/javascript"><!-- |
| allClassesLink = document.getElementById("allclasses_navbar_top"); |
| if(window==top) { |
| allClassesLink.style.display = "block"; |
| } |
| else { |
| allClassesLink.style.display = "none"; |
| } |
| //--> |
| </script> |
| </div> |
| <a name="skip.navbar.top"> |
| <!-- --> |
| </a></div> |
| <!-- ========= END OF TOP NAVBAR ========= --> |
| <div class="header"> |
| <h1 title="Package" class="title">Package org.apache.nutch.collection</h1> |
| <div class="docSummary"> |
| <div class="block"> |
| Subcollection is a subset of an index.</div> |
| </div> |
| <p>See: <a href="#package.description">Description</a></p> |
| </div> |
| <div class="contentContainer"> |
| <ul class="blockList"> |
| <li class="blockList"> |
| <table class="typeSummary" border="0" cellpadding="3" cellspacing="0" summary="Class Summary table, listing classes, and an explanation"> |
| <caption><span>Class Summary</span><span class="tabEnd"> </span></caption> |
| <tr> |
| <th class="colFirst" scope="col">Class</th> |
| <th class="colLast" scope="col">Description</th> |
| </tr> |
| <tbody> |
| <tr class="altColor"> |
| <td class="colFirst"><a href="../../../../org/apache/nutch/collection/CollectionManager.html" title="class in org.apache.nutch.collection">CollectionManager</a></td> |
| <td class="colLast"> </td> |
| </tr> |
| <tr class="rowColor"> |
| <td class="colFirst"><a href="../../../../org/apache/nutch/collection/Subcollection.html" title="class in org.apache.nutch.collection">Subcollection</a></td> |
| <td class="colLast"> |
| <div class="block">SubCollection represents a subset of index, you can define url patterns that |
| will indicate that particular page (url) is part of SubCollection.</div> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </li> |
| </ul> |
| <a name="package.description"> |
| <!-- --> |
| </a> |
| <h2 title="Package org.apache.nutch.collection Description">Package org.apache.nutch.collection Description</h2> |
| <div class="block"><p> |
| Subcollection is a subset of an index. Subcollections are defined |
| by urlpatterns in form of white/blacklist. So to get the page into |
| subcollection it must match the whitelist and not the blacklist. |
| </p> |
| <p> |
| Subcollection definitions are read from a file subcollections.xml |
| and the format is as follows (imagine here that you are crawling all |
| the virtualhosts from apache.org and you wan't to tag pages with |
| url pattern "http://lucene.apache.org/nutch" and http://wiki.apache.org/nutch/ |
| to be part of subcollection "nutch", this allows you to later search |
| specifically from this subcollection) |
| </p> |
| <p/> |
| <p/> |
| <pre> |
| <?xml version="1.0" encoding="UTF-8"?> |
| <subcollections> |
| <subcollection> |
| <name>nutch</name> |
| <id>lucene</id> |
| <whitelist>http://lucene.apache.org/nutch</whitelist> |
| <whitelist>http://wiki.apache.org/nutch/</whitelist> |
| <blacklist /> |
| </subcollection> |
| </subcollections> |
| </pre> |
| </p> |
| <p>Despite of this configuration you still can crawl any urls |
| as long as they pass through your global url filters. (note that |
| you must also seed your urls in normal nutch way) |
| </p></div> |
| </div> |
| <!-- ======= START OF BOTTOM NAVBAR ====== --> |
| <div class="bottomNav"><a name="navbar.bottom"> |
| <!-- --> |
| </a> |
| <div class="skipNav"><a href="#skip.navbar.bottom" title="Skip navigation links">Skip navigation links</a></div> |
| <a name="navbar.bottom.firstrow"> |
| <!-- --> |
| </a> |
| <ul class="navList" title="Navigation"> |
| <li><a href="../../../../overview-summary.html">Overview</a></li> |
| <li class="navBarCell1Rev">Package</li> |
| <li>Class</li> |
| <li><a href="package-use.html">Use</a></li> |
| <li><a href="package-tree.html">Tree</a></li> |
| <li><a href="../../../../deprecated-list.html">Deprecated</a></li> |
| <li><a href="../../../../index-all.html">Index</a></li> |
| <li><a href="../../../../help-doc.html">Help</a></li> |
| </ul> |
| </div> |
| <div class="subNav"> |
| <ul class="navList"> |
| <li><a href="../../../../org/apache/nutch/any23/package-summary.html">Prev Package</a></li> |
| <li><a href="../../../../org/apache/nutch/crawl/package-summary.html">Next Package</a></li> |
| </ul> |
| <ul class="navList"> |
| <li><a href="../../../../index.html?org/apache/nutch/collection/package-summary.html" target="_top">Frames</a></li> |
| <li><a href="package-summary.html" target="_top">No Frames</a></li> |
| </ul> |
| <ul class="navList" id="allclasses_navbar_bottom"> |
| <li><a href="../../../../allclasses-noframe.html">All Classes</a></li> |
| </ul> |
| <div> |
| <script type="text/javascript"><!-- |
| allClassesLink = document.getElementById("allclasses_navbar_bottom"); |
| if(window==top) { |
| allClassesLink.style.display = "block"; |
| } |
| else { |
| allClassesLink.style.display = "none"; |
| } |
| //--> |
| </script> |
| </div> |
| <a name="skip.navbar.bottom"> |
| <!-- --> |
| </a></div> |
| <!-- ======== END OF BOTTOM NAVBAR ======= --> |
| <p class="legalCopy"><small>Copyright © 2021 The Apache Software Foundation</small></p> |
| </body> |
| </html> |