<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- NewPage -->
<html lang="en">
<head>
<!-- Generated by javadoc (1.8.0_221) on Tue Jan 19 12:28:00 PST 2021 -->
<title>org.apache.nutch.collection (apache-nutch 1.18 API)</title>
<meta name="date" content="2021-01-19">
<link rel="stylesheet" type="text/css" href="../../../../stylesheet.css" title="Style">
<script type="text/javascript" src="../../../../script.js"></script>
</head>
<body>
<script type="text/javascript"><!--
try {
if (location.href.indexOf('is-external=true') == -1) {
parent.document.title="org.apache.nutch.collection (apache-nutch 1.18 API)";
}
}
catch(err) {
}
//-->
</script>
<noscript>
<div>JavaScript is disabled on your browser.</div>
</noscript>
<!-- ========= START OF TOP NAVBAR ======= -->
<div class="topNav"><a name="navbar.top">
<!-- -->
</a>
<div class="skipNav"><a href="#skip.navbar.top" title="Skip navigation links">Skip navigation links</a></div>
<a name="navbar.top.firstrow">
<!-- -->
</a>
<ul class="navList" title="Navigation">
<li><a href="../../../../overview-summary.html">Overview</a></li>
<li class="navBarCell1Rev">Package</li>
<li>Class</li>
<li><a href="package-use.html">Use</a></li>
<li><a href="package-tree.html">Tree</a></li>
<li><a href="../../../../deprecated-list.html">Deprecated</a></li>
<li><a href="../../../../index-all.html">Index</a></li>
<li><a href="../../../../help-doc.html">Help</a></li>
</ul>
</div>
<div class="subNav">
<ul class="navList">
<li><a href="../../../../org/apache/nutch/any23/package-summary.html">Prev&nbsp;Package</a></li>
<li><a href="../../../../org/apache/nutch/crawl/package-summary.html">Next&nbsp;Package</a></li>
</ul>
<ul class="navList">
<li><a href="../../../../index.html?org/apache/nutch/collection/package-summary.html" target="_top">Frames</a></li>
<li><a href="package-summary.html" target="_top">No&nbsp;Frames</a></li>
</ul>
<ul class="navList" id="allclasses_navbar_top">
<li><a href="../../../../allclasses-noframe.html">All&nbsp;Classes</a></li>
</ul>
<div>
<script type="text/javascript"><!--
allClassesLink = document.getElementById("allclasses_navbar_top");
if(window==top) {
allClassesLink.style.display = "block";
}
else {
allClassesLink.style.display = "none";
}
//-->
</script>
</div>
<a name="skip.navbar.top">
<!-- -->
</a></div>
<!-- ========= END OF TOP NAVBAR ========= -->
<div class="header">
<h1 title="Package" class="title">Package&nbsp;org.apache.nutch.collection</h1>
<div class="docSummary">
<div class="block">
Subcollection is a subset of an index.</div>
</div>
<p>See:&nbsp;<a href="#package.description">Description</a></p>
</div>
<div class="contentContainer">
<ul class="blockList">
<li class="blockList">
<table class="typeSummary" border="0" cellpadding="3" cellspacing="0" summary="Class Summary table, listing classes, and an explanation">
<caption><span>Class Summary</span><span class="tabEnd">&nbsp;</span></caption>
<tr>
<th class="colFirst" scope="col">Class</th>
<th class="colLast" scope="col">Description</th>
</tr>
<tbody>
<tr class="altColor">
<td class="colFirst"><a href="../../../../org/apache/nutch/collection/CollectionManager.html" title="class in org.apache.nutch.collection">CollectionManager</a></td>
<td class="colLast">&nbsp;</td>
</tr>
<tr class="rowColor">
<td class="colFirst"><a href="../../../../org/apache/nutch/collection/Subcollection.html" title="class in org.apache.nutch.collection">Subcollection</a></td>
<td class="colLast">
<div class="block">A Subcollection represents a subset of an index. You can define URL patterns that
indicate that a particular page (URL) is part of the Subcollection.</div>
</td>
</tr>
</tbody>
</table>
</li>
</ul>
<a name="package.description">
<!-- -->
</a>
<h2 title="Package org.apache.nutch.collection Description">Package org.apache.nutch.collection Description</h2>
<div class="block"><p>
A subcollection is a subset of an index. Subcollections are defined
by URL patterns in the form of a whitelist and a blacklist: for a page to belong to a
subcollection, its URL must match the whitelist and must not match the blacklist.
</p>
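<p>
The whitelist/blacklist rule can be sketched as follows. This is a minimal
illustration, not the Nutch implementation: the class name
<code>SubcollectionSketch</code> and its method <code>accepts</code> are hypothetical,
and the sketch assumes that a pattern "matches" when it occurs as a substring of the URL.
</p>
<pre>
```java
// Hypothetical illustration class; not part of the Nutch API.
class SubcollectionSketch {
    private final String[] whitelist;
    private final String[] blacklist;

    SubcollectionSketch(String[] whitelist, String[] blacklist) {
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    // Assumption: a pattern matches when it occurs as a substring of the URL.
    private static boolean containsAny(String url, String[] patterns) {
        for (String pattern : patterns) {
            if (url.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    // A URL belongs to the subcollection only if it matches the whitelist
    // and does not match the blacklist.
    boolean accepts(String url) {
        if (!containsAny(url, whitelist)) {
            return false;
        }
        return !containsAny(url, blacklist);
    }

    public static void main(String[] args) {
        SubcollectionSketch nutch = new SubcollectionSketch(
                new String[] {"http://lucene.apache.org/nutch", "http://wiki.apache.org/nutch/"},
                new String[] {});
        System.out.println(nutch.accepts("http://lucene.apache.org/nutch/about.html")); // prints true
        System.out.println(nutch.accepts("http://hadoop.apache.org/"));                 // prints false
    }
}
```
</pre>
<p>
An empty blacklist never rejects anything, so in this sketch membership is then decided by the whitelist alone.
</p>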
<p>
Subcollection definitions are read from the file subcollections.xml.
The example below shows the format. Suppose you are crawling all
the virtual hosts of apache.org and you want to tag pages whose URLs match
"http://lucene.apache.org/nutch" or "http://wiki.apache.org/nutch/"
as part of the subcollection "nutch", so that you can later restrict a search
to this subcollection.
</p>
<pre>
&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;subcollections>
  &lt;subcollection>
    &lt;name>nutch&lt;/name>
    &lt;id>lucene&lt;/id>
    &lt;whitelist>http://lucene.apache.org/nutch&lt;/whitelist>
    &lt;whitelist>http://wiki.apache.org/nutch/&lt;/whitelist>
    &lt;blacklist />
  &lt;/subcollection>
&lt;/subcollections>
</pre>
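<p>
To make the file layout concrete, the following sketch reads a definition file in
the format shown above using the JDK's DOM parser. It is an illustration under
stated assumptions, not the Nutch parser: the class <code>SubcollectionsReader</code>
and its methods are hypothetical, and the sketch counts every non-empty whitelist
and blacklist element, which may differ from how Nutch itself interprets them.
</p>
<pre>
```java
import java.io.InputStream;
import java.util.stream.IntStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the format shown above; not the Nutch parser.
class SubcollectionsReader {

    // Collects the trimmed, non-empty text of every descendant element with the given tag.
    static String[] texts(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return IntStream.range(0, nodes.getLength())
                .mapToObj(i -> nodes.item(i).getTextContent().trim())
                .filter(text -> !text.isEmpty())
                .toArray(String[]::new);
    }

    // Returns one line per subcollection: "name (id): N whitelist, M blacklist".
    static String describe(InputStream xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(xml).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse subcollection definitions", e);
        }
        NodeList subs = root.getElementsByTagName("subcollection");
        StringBuilder out = new StringBuilder();
        IntStream.range(0, subs.getLength()).forEach(i -> {
            Element sub = (Element) subs.item(i);
            out.append(String.join(",", texts(sub, "name")))
               .append(" (").append(String.join(",", texts(sub, "id"))).append("): ")
               .append(texts(sub, "whitelist").length).append(" whitelist, ")
               .append(texts(sub, "blacklist").length).append(" blacklist\n");
        });
        return out.toString();
    }
}
```
</pre>
<p>
In this sketch an empty blacklist element contributes no patterns, because its text content is empty and is filtered out.
</p>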
<p>This configuration does not restrict the crawl: you can still crawl any URLs
as long as they pass your global URL filters. Note that
you must also seed your URLs in the usual Nutch way.
</p></div>
</div>
<!-- ======= START OF BOTTOM NAVBAR ====== -->
<div class="bottomNav"><a name="navbar.bottom">
<!-- -->
</a>
<div class="skipNav"><a href="#skip.navbar.bottom" title="Skip navigation links">Skip navigation links</a></div>
<a name="navbar.bottom.firstrow">
<!-- -->
</a>
<ul class="navList" title="Navigation">
<li><a href="../../../../overview-summary.html">Overview</a></li>
<li class="navBarCell1Rev">Package</li>
<li>Class</li>
<li><a href="package-use.html">Use</a></li>
<li><a href="package-tree.html">Tree</a></li>
<li><a href="../../../../deprecated-list.html">Deprecated</a></li>
<li><a href="../../../../index-all.html">Index</a></li>
<li><a href="../../../../help-doc.html">Help</a></li>
</ul>
</div>
<div class="subNav">
<ul class="navList">
<li><a href="../../../../org/apache/nutch/any23/package-summary.html">Prev&nbsp;Package</a></li>
<li><a href="../../../../org/apache/nutch/crawl/package-summary.html">Next&nbsp;Package</a></li>
</ul>
<ul class="navList">
<li><a href="../../../../index.html?org/apache/nutch/collection/package-summary.html" target="_top">Frames</a></li>
<li><a href="package-summary.html" target="_top">No&nbsp;Frames</a></li>
</ul>
<ul class="navList" id="allclasses_navbar_bottom">
<li><a href="../../../../allclasses-noframe.html">All&nbsp;Classes</a></li>
</ul>
<div>
<script type="text/javascript"><!--
allClassesLink = document.getElementById("allclasses_navbar_bottom");
if(window==top) {
allClassesLink.style.display = "block";
}
else {
allClassesLink.style.display = "none";
}
//-->
</script>
</div>
<a name="skip.navbar.bottom">
<!-- -->
</a></div>
<!-- ======== END OF BOTTOM NAVBAR ======= -->
<p class="legalCopy"><small>Copyright &copy; 2021 The Apache Software Foundation</small></p>
</body>
</html>