| Nutch Change Log |
| |
| Unreleased changes (0.8.2) |
| |
| 1. NUTCH-391 - ParseUtil logs file contents to log file when it |
| cannot find parser (siren) |
| |
| 2. NUTCH-379 - ParseUtil does not pass through the content's URL |
| to the ParserFactory (Chris A. Mattmann via siren) |
| |
| 3. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one |
| partition. (ab) |
| |
| Release 0.8.1 - 2006-09-24 |
| |
| 1. Changed log4j configuration to log to stdout on commandline |
| tools (siren) |
| |
| 2. NUTCH-266 - Updated hadoop.jar to contain patch from HADOOP-387 |
| (siren) |
| |
| 3. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren) |
| |
| 4. Optionally skip pages with abnormally large Crawl-Delay values |
| (Dennis Kubes via ab) |
| |
| 5. Fix incorrect calculation of max and min scores in readdb -stats |
| (Chris Schneider via ab) |
| |
| 6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris |
| Schneider and Stefan Groschupf via ab) |
| |
| 7. NUTCH-338 - Remove the text parser as an option for parsing PDF files |
| in parse-plugins.xml (Chris A. Mattmann via siren) |
| |
| 8. NUTCH-105 - Network error during robots.txt fetch causes file to |
| be ignored (Greg Kim via siren) |
| |
| 9. Use a CombiningCollector when calculating readdb -stats. This |
| drastically reduces the size of intermediate data, resulting in |
| significant speed-ups for large databases (ab) |
| |
| 10. NUTCH-332 - Fix doubling score caused by links to self (Stefan |
| Groschupf via ab) |
| |
| 11. NUTCH-336 - Differentiate between newly discovered pages and newly |
| injected pages (Chris Schneider via ab) NOTE: this changes the |
| scoring API, filter implementations need to be updated. |
| |
| 12. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf |
| via ab) |
| |
| 13. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE |
| (Stefan Groschupf via ab) |
| |
| Release 0.8 - 2006-07-25 |
| |
| 0. Totally new architecture, based on hadoop |
| [http://lucene.apache.org/hadoop] (cutting) |
| |
| 1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross). |
| |
| 2. NUTCH-108 - Log hosts that exceed generate.max.per.host. |
| (Rod Taylor via cutting) |
| |
| 3. NUTCH-88 - Enhance ParserFactory plugin selection policy |
| (jerome) |
| |
| 4. NUTCH-124 - Protocol-httpclient does not follow redirects when |
| fetching robots.txt (cutting) |
| |
| 5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?) |
| (stack@archive.org, cutting) |
| |
| 6. NUTCH-114 - Getting number of urls and links from crawldb |
| (Stefan Groschupf via ab) |
| |
| 7. NUTCH-112 - Link in cached.jsp page to cached content is an |
| absolute link (Chris A. Mattmann via jerome) |
| |
| 8. NUTCH-135 - Http header meta data are case insensitive in the |
| real world (Stefan Groschupf via jerome) |
| |
| 9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due |
| to UTF-8 BOM (KuroSaka TeruHiko via siren) |
| |
| 10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab) |
| |
| 11. Added support for OpenSearch (cutting) |
| |
| 12. NUTCH-142 - NutchConf should use the thread context classloader |
| (Mike Cannon-Brookes via pkosiorowski) |
| |
| 13. NUTCH-160 - Use standard Java Regex library rather than |
| org.apache.oro.text.regex (Rod Taylor via cutting) |
| |
| 14. NUTCH-151 - CommandRunner can hang after the main thread exec is |
| finished and has inefficient busy loop (Paul Baclace via cutting) |
| |
| 15. NUTCH-174 - Problem encountered with ant during compilation |
| |
| 16. NUTCH-190 - ParseUtil drops reason for failed parse |
| (stack@archive.org via ab) |
| |
| 17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab) |
| |
| 18. NUTCH-194 - NUTCH-169 introduced two tiny bugs (Marko Bauhardt via ab) |
| |
| 19. NUTCH-178 - in search.jsp must be session creation "false" |
| (YourSoft via siren) |
| |
| 20. NUTCH-200 - OpenSearch Servlet is broken |
| (Marko Bauhardt via siren) |
| |
| 21. NUTCH-81 - Webapp only works when deployed in root |
| (AJ Banck, Michael Nebel via siren) |
| |
| 22. NUTCH-139 - Standard metadata property names in the ParseData |
| metadata (Chris A. Mattmann, jerome) |
| |
| 23. NUTCH-192 - Meta data support for CrawlDatum |
| (Stefan Groschupf via ab) |
| |
| 24. NUTCH-52 - Parser plugin for MS Excel files |
| (Rohit Kulkarni via jerome) |
| |
| 25. NUTCH-53 - Parser plugin for Zip files |
| (Rohit Kulkarni via jerome) |
| |
| 26. NUTCH-137 - footer is not displayed in search result page |
| (KuroSaka TeruHiko via siren) |
| |
| 27. NUTCH-118 - FAQ link points to invalid URL |
| (Steve Betts via siren) |
| |
| 28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin) |
| translation (Ivan Sekulovic via siren) |
| |
| 29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf |
| via cutting) |
| |
| 30. NUTCH-140 - Add alias capability in parse-plugins.xml file that |
| allows mimeType->extensionId mapping (Chris A. Mattmann via jerome) |
| |
| 31. NUTCH-214 - Added Links to web site to search mailing list |
| (Jake Vanderdray via jerome) |
| |
| 32. NUTCH-204 - Multiple field values in HitDetails |
| (Stefan Groschupf via jerome) |
| |
| 33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed |
| to -1 to be consistent with http (jerome) |
| |
| 34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren) |
| |
| 35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via |
| pkosiorowski) |
| |
| 36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via |
| jerome) |
| |
| 37. NUTCH-229 - Improved handling of plugin folder configuration |
| (Stefan Groschupf via ab) |
| |
| 38. NUTCH-206 - Search server throws InstantiationException (ab) |
| |
| 39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt |
| via ab) |
| |
| 40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab) |
| |
| 41. Update to lucene 1.9.1 (cutting) |
| |
| 42. NUTCH-235 - Duplicate Inlink values (ab) |
| |
| 43. NUTCH-234 - Clustering extension code cleanups and a real |
| JUnit test case for the current implementation (Dawid Weiss via ab) |
| |
| 44. NUTCH-210 - Context.xml file for Nutch web application |
| (Chris A. Mattmann via jerome) |
| |
| 45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome) |
| |
| 46. NUTCH-232 - Search.jsp has multiple search forms creating |
| invalid html / incorrect focus function (jerome) |
| |
| 47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome) |
| |
| 48. NUTCH-244 - Inconsistent handling of property values |
| boundaries / unable to set db.max.outlinks.per.page to |
| infinite (jerome) |
| |
| 49. NUTCH-245 - DTD for plugin.xml configuration files |
| (Chris A. Mattmann via jerome) |
| |
| 50. NUTCH-250 - Generate to log truncation caused by |
| generate.max.per.host (Rod Taylor via cutting) |
| |
| 51. NUTCH-125 - OpenOffice Parser plugin (ab) |
| |
| 52. Switch from using java.io.File to org.apache.hadoop.fs.Path. |
| (cutting) |
| |
| 53. NUTCH-240 - Scoring API: extension point, scoring filters and |
| an OPIC plugin (ab) |
| |
| 54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome) |
| |
| 55. NUTCH-268 - Generator and lib-http use different definitions of |
| "unique host" (ab) |
| |
| 56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser |
| via siren) |
| |
| 57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories |
| (Dennis Kubes via ab) |
| |
| 58. NUTCH-201 - Add support for subcollections |
| (siren) |
| |
| 59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown |
| (Stefan Groschupf via jerome) |
| |
| 60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome) |
| |
| 61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query |
| (Stefan Groschupf via jerome) |
| |
| 62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters |
| (stack@archive.org via siren) |
| |
| 63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space |
| (Stefan Neufeind via siren) |
| |
| 64. NUTCH-307 - Wrong configured log4j.properties (jerome) |
| |
| 65. NUTCH-303 - Logging improvements (jerome) |
| |
| 66. NUTCH-308 - Maximum search time limit (ab) |
| |
| 67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency |
| problem (Grant Glouser via siren) |
| |
| 68. Update to hadoop-0.4 (Milind Bhandarkar, cutting) |
| |
| 69. NUTCH-317 - Clarify what the queryLanguage argument of |
| Query.parse(...) means (jerome) |
| |
| 70. Added alternative experimental web gui in contrib containing |
| extensions like subcollection, keymatch, user preferences, |
| caching, implemented mainly using tiles and jstl (siren) |
| |
| 71. NUTCH-320 - DmozParser does not output list of urls to stdout |
| but to a log file instead. Original functionality restored. |
| |
| 72. NUTCH-271 - Add ability to limit crawling to the set of initially |
| injected hosts (db.ignore.external.links) (Philippe Eugene, |
| Stefan Neufeind via ab) |
| |
| 73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab) |
| |
| 74. NUTCH-327 - Fixed logging directory on cygwin (siren) |
| |
| Release 0.7 - 2005-08-17 |
| |
| 1. Added support for "type:" in queries. Search results are limited/qualified |
| by mimetype or its primary type or sub type. For example, |
| (1) searching with "type:application/pdf" restricts results |
| to pages which were identified to be of mimetype "application/pdf". |
| (2) with "type:application", nutch will return pages of |
| primary type "application". |
| (3) with "type:pdf", only pages of sub type "pdf" will be listed. |
| (John Xing, 20050120) |
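The three "type:" cases above can be illustrated with a minimal sketch. This is not the Nutch implementation; the class and method names (TypeQueryMatcher, matches) are hypothetical, and the sketch only assumes that a query value matches when it equals the full mimetype, its primary type, or its sub type.

```java
import java.util.Locale;

// Hypothetical illustration of the "type:" matching rules described above.
// It is not taken from the Nutch source.
final class TypeQueryMatcher {

    /** Returns true if the mimetype satisfies the value given after "type:". */
    static boolean matches(String queryValue, String mimeType) {
        String q = queryValue.toLowerCase(Locale.ROOT);
        String m = mimeType.toLowerCase(Locale.ROOT);
        int slash = m.indexOf('/');
        if (slash < 0) {
            // No primary/sub split available: only an exact match counts.
            return q.equals(m);
        }
        String primary = m.substring(0, slash);   // e.g. "application"
        String sub = m.substring(slash + 1);      // e.g. "pdf"
        return q.equals(m) || q.equals(primary) || q.equals(sub);
    }

    public static void main(String[] args) {
        System.out.println(matches("application/pdf", "application/pdf"));
        System.out.println(matches("application", "application/pdf"));
        System.out.println(matches("pdf", "application/pdf"));
        System.out.println(matches("pdf", "text/html"));
    }
}
```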
| |
| 2. Added support for "date:" in queries. Last-Modified is indexed. |
| Search results are restricted by lower and upper date (inclusive) |
| as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231 |
| only returns pages with Last-Modified in year 2004. |
| (John Xing, 20050122) |
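As a rough sketch of the inclusive yyyymmdd-yyyymmdd range described above, the following parses a range and tests a Last-Modified date against it. The class DateRangeQuery is hypothetical and uses the modern java.time API, which postdates this release, so it shows the semantics only and not the original code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration of an inclusive "date:yyyymmdd-yyyymmdd" range.
final class DateRangeQuery {
    // BASIC_ISO_DATE parses dates written as yyyyMMdd, e.g. 20040101.
    private static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    final LocalDate lower;
    final LocalDate upper;

    private DateRangeQuery(LocalDate lower, LocalDate upper) {
        this.lower = lower;
        this.upper = upper;
    }

    /** Parses the value that follows "date:", e.g. "20040101-20041231". */
    static DateRangeQuery parse(String value) {
        String[] parts = value.split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected yyyymmdd-yyyymmdd: " + value);
        }
        return new DateRangeQuery(LocalDate.parse(parts[0], FMT),
                                  LocalDate.parse(parts[1], FMT));
    }

    /** Both bounds are inclusive, as stated in the entry above. */
    boolean contains(LocalDate lastModified) {
        return !lastModified.isBefore(lower) && !lastModified.isAfter(upper);
    }

    public static void main(String[] args) {
        DateRangeQuery q = parse("20040101-20041231");
        System.out.println(q.contains(LocalDate.of(2004, 6, 15)));
        System.out.println(q.contains(LocalDate.of(2005, 1, 1)));
    }
}
```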
| |
| 3. Add URLFilter plugin interface and convert existing url filters into |
| plugins. (John Xing, 20050206) |
| |
| 4. Add UpdateSegmentsFromDb tool, which updates the scores and |
| anchors of existing segments with the current values in the web |
| db. This is used by CrawlTool, so that pages are now only fetched |
| once per crawl. (Doug Cutting, 20050221) |
| |
| 5. Moved code into org.apache.nutch sub-packages. Changed license to |
| Apache 2.0. Removed jar files whose licenses do not permit |
| redistribution by Apache. Disabled compilation of plugins which |
| require these libraries. (Doug Cutting 20050301) |
| |
| 6. Index host and title in separate fields. Host was indexed |
| previously only as a part of the URL. Title was indexed as an |
| anchor. Now boosts for matching these fields may be adjusted |
| separately from boosts for matching anchors and url. Also: move |
| site indexing to index-basic plugin to minimize the number of |
| times the URL needs to be parsed; and, stop using anchor analyzer |
| for anything but anchors. (Piotr Kosiorowski via Doug Cutting |
| 20050323) |
| |
| 7. Add servlet Cached.java that serves cached Content of any mime type. |
| Slightly modified are web.xml and cached.jsp. |
| (John Xing, 20050401) |
| |
| 8. Add skipCompressedByteArray() to WritableUtils.java. |
| (John Xing, 20050402) |
| |
| 9. Fixes to jsp and static web pages. These now use relative links, |
| so that the Nutch webapp file can be used in places other than at |
| the root. Also fixed links to the about and help pages. Bug #32. |
| (Jerome Charron via cutting, 20050404) |
| |
| 10. Added some features to DistributedSearch: new segments can be added |
| to search servers without restarting the frontend, defective search |
| servers are not queried until they come back online, and a watchdog keeps |
| an eye on your search servers and writes simple statistics. |
| (Sami Siren, 20050407) |
| |
| 11. Fix for bug #4 - Unbalanced quote in query eats all resources. |
| (Piotr Kosiorowski, Sami Siren, 20050407) |
| |
| 12. Close Issue #33 - MIME content type detector (using magic char sequences). |
| (Jerome Charron and Hari Kodungallur via John Xing, 20050416) |
| |
| 13. Add a servlet that implements A9's OpenSearch RSS web service. |
| (cutting, 20050418) |
| |
| 14. Remove references to link analysis from tutorial, and enable |
| scoring by link count when generating fetchlists and searching. |
| (cutting, 20040419) |
| |
| 15. Make query boosts for host, title, anchor and phrase matches |
| configurable. (Piotr Kosiorowski via cutting, 20050419) |
| |
| 16. Add support for sorting search results and search-time deduping by |
| fields other than site. |
| |
| 17. Automatically convert range queries into cached range filters. |
| This improves the performance and scalability of, e.g., date range |
| searching. |
| |
| 18. Several methods have been renamed due to misspellings. The old |
| methods have been deprecated and will be removed before the 1.0 |
| release. |
| |
| |
| Release 0.6 |
| |
| 1. Added clustering-carrot2 plugin, together with introduction of clustering |
| api and modification to search jsp. (Dawid Weiss via John Xing, 20040809) |
| |
| 2. Make a number of changes to NDFS (Nutch Distributed File System) |
| to fix bugs, add admin tools, etc. |
| |
| Also, modify all command line tools so you can indicate whether to |
| use NDFS or the local filesystem. If you indicate nothing, then |
| it defaults to the local fs. |
| |
| I've used this to do a 35m page crawl via NDFS, distributed over a |
| dozen machines. (Mike Cafarella) |
| |
| 3. Add support for BASE tags in HTML. Outlinks are now correctly |
| extracted when a BASE tag is present. (cutting) |
| |
| 4. Fix two bugs in result pagination. When the last hit on a page |
| was the last hit overall, the "next" button was sometimes shown |
| when the "show all" button should be shown instead. Also, in |
| certain cases, the "show all" button would be shown when the |
| "next" button should have been shown. (cutting) |
| |
| 5. Add config parameter "indexer.max.tokens" that determines the |
| maximum number of tokens indexed per field. (Andy Hedges via cutting) |
| |
| 6. Add parser for mp3 files. (Andy Hedges via cutting) |
| |
| 7. Add RegexUrlNormalizer. This is useful for things like stripping |
| out session IDs from URLs. To use it, add values for |
| urlnormalizer.class and urlnormalizer.regex.file to your |
| nutch-site.xml. The RegexUrlNormalizer class extends the |
| BasicUrlNormalizer, and does basic normalization as well. |
| (Luke Baker via cutting) |
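To illustrate the kind of rule such a normalizer might apply, here is a minimal sketch that strips a session ID from a URL with a regular expression. The class SessionIdStripper and the ";jsessionid=" pattern are assumptions chosen for illustration; they are not the rules shipped with Nutch, and the format of the file named by urlnormalizer.regex.file is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of regex-based URL normalization.
// The pattern below is an assumed example rule, not one taken from Nutch.
final class SessionIdStripper {
    // Matches ";jsessionid=..." up to the start of the query string or fragment.
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("http://example.com/a;jsessionid=ABC123?x=1"));
    }
}
```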
| |
| 8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910) |
| |
| 9. Added Polish translation (Andrzej Bialecki, 20040911) |
| |
| 10. Added 3 more language profiles to language identifier (ru,hu,pl). |
| Other changes to language identifier: Profiles converted to utf8, |
| added some test cases, changed the similarity calculation. |
| (Sami Siren, 20040925) |
| |
| 11. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929) |
| |
| 12. Added plugin index-more and more.jsp (John Xing, 20041003) |
| |
| 13. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced |
| in DistributedSearch.java. text.jsp is added. (John Xing, 20041006) |
| |
| 14. Fixed a bug that caused cached.jsp, explain.jsp, anchors.jsp and text.jsp |
| (but not search.jsp) to fail with NullPointerException in distributed search. |
| It seems that this bug appeared after the "hits per site" code was added. |
| The fix is in Hit.java, making sure String site is never null. |
| Hopefully this fix has no bad effect on the "hits per site" code. |
| (John Xing, 20041006) |
| |
| 15. Fixed a bug that caused fullyDelete() in FileUtil.java to fail for |
| LocalFileSystem.java. This bug also exposes possible incompleteness |
| of NDFSFile.java, where a few methods are not supported, including |
| delete(). Nothing changed in NDFSFile.java though; this is left for future |
| improvement. (John Xing, 20041022) |
| |
| 16. Introduced option -noParsing to Fetcher.java and added ParseSegment.java. |
| A new status code CANT_PARSE is added to FetcherOutput.java. |
| Without option -noParsing, there is no change in fetcher behavior. With |
| option -noParsing, the fetcher only crawls; no parsing is carried out. |
| Then, ParseSegment.java should be used to parse in a separate pass. |
| (John Xing, 20041025) |
| |
| 17. Added ontology plugin. Currently it is used for query refinement, as |
| exemplified in refine-query-init.jsp and refine-query.jsp. By default, |
| query refinement is disabled in search.jsp. Please check |
| ./src/plugin/ontology/README.txt for further description. |
| Ontology plugin certainly can be used for many other things. |
| (Michael J. Pan via John Xing, 20041129) |
| |
| 18. Changed fetcher.server.delay to be a float, so that sub-second |
| delays can be specified. (cutting) |
| |
| 19. Added plugin.includes config parameter that determines which |
| plugins are included. By default now only http, html and basic |
| indexing and search plugins are enabled, rather than all plugins. |
| This should make default performance more predictable and reliable |
| going forward. (cutting) |
| |
| 20. Cleaned up some filesystem code, including: |
| |
| - Replaced BufferedRandomAccessFile with two simpler utilities, |
| NFSDataInputStream and NFSDataOutputStream. |
| |
| - Fixed the bug where SequenceFiles were no longer flushed when |
| created, so that, when fetches crashed, segments were |
| unreadable. Now segments are always readable after crashes. |
| Only the contents of the last buffer is lost. |
| |
| - Simplified the FSOutputStream API to not include seek(). We |
| should never need that functionality. |
| |
| - Simplified LocalFileSystem's implementations of FSInputStream |
| and FSOutputStream and optimized FSInputStream.seek(). |
| |
| (cutting) |
| |
| 21. Fixed BasicUrlNormalizer to better handle relative urls. The file |
| part of a URL is normalized in the following manner: |
| |
| 1. "/aa/../" will be replaced by "/". This is done step by step until |
| the url doesn't change anymore. This ensures that |
| "/aa/bb/../../" will be replaced by "/", too. |
| |
| 2. leading "/../" will be replaced by "/" |
| |
| (Sven Wende via cutting) |
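The two rules above can be sketched as follows. This is an illustration only, not the BasicUrlNormalizer source: the class name RelativePathNormalizer and both regular expressions are assumptions, and repeating rule 2 until the path stops changing is also an assumption (the entry only says that a leading "/../" is replaced).

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the two normalization rules described above.
final class RelativePathNormalizer {
    // Rule 1: one "/segment/../" pair, where the segment is not itself "..".
    private static final Pattern PARENT =
        Pattern.compile("/(?!\\.\\./)[^/]+/\\.\\./");
    // Rule 2: a "/../" at the very start of the file part.
    private static final Pattern LEADING = Pattern.compile("^/\\.\\./");

    static String normalize(String file) {
        String cur = file;
        String prev;
        // Apply rule 1 one step at a time until nothing changes,
        // so that "/aa/bb/../../" collapses to "/".
        do {
            prev = cur;
            cur = PARENT.matcher(cur).replaceFirst("/");
        } while (!cur.equals(prev));
        // Apply rule 2; looping here is an assumption of this sketch.
        do {
            prev = cur;
            cur = LEADING.matcher(cur).replaceFirst("/");
        } while (!cur.equals(prev));
        return cur;
    }

    public static void main(String[] args) {
        System.out.println(normalize("/aa/bb/../../"));
        System.out.println(normalize("/../index.html"));
    }
}
```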
| |
| 22. Fix Page constructors so that next fetch date is less likely to be |
| misconstrued as a float. This patches a problem in WebDBInjector, |
| where new pages were added to the db with nextScore set to the |
| intended nextFetch date. This, in turn, confused link analysis. |
| |
| 23. In ndfs code, replace addLocalFile(), putToLocalFile() with |
| copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and |
| moveToLocalFile(). (John Xing, 20041217) |
| |
| 24. Added new config parameter fetcher.threads.per.host. This is used |
| by the Http protocol. When this is one, behavior is as before. |
| When this is greater than one, multiple threads are permitted |
| to access a host at once. Note that fetcher.server.delay is no |
| longer consistently observed when this is greater than one. |
| (Luke Baker via Doug Cutting) |
| |
| Release 0.5 |
| |
| 1. Changed plugin directory to be a list of directories. |
| |
| 2. Permit Plugin to be the default plugin implementation. |
| |
| 3. Added pluggable interface for network protocols in new package |
| net.nutch.protocol. Moved http code from core into a plugin. |
| |
| 4. Added pluggable interface for content parsing in new package |
| net.nutch.parse. Moved html parsing code from core into a |
| plugin. |
| |
| 5. Fixed a bug in NutchAnalysis where 16-bit characters were not |
| processed correctly. |
| |
| 6. Fixed bug #971731: random summaries on result page. |
| (Daniel Naber via cutting) |
| |
| 7. Made Nutch logo transparent. (Daniel Naber via cutting) |
| |
| 8. Added file protocol plugin. (John Xing via cutting) |
| |
| 9. Added ftp protocol plugin. (John Xing via cutting) |
| |
| 10. Added pdf and msword parser plugins. (John Xing via cutting) |
| |
| 11. Added pluggable indexing interface. By default, url, content, |
| anchors and title are indexed, as before, but now one can easily |
| alter this to, e.g., index metadata. A demonstration is provided |
| which extracts and indexes Creative Commons license urls. (cutting) |
| |
| 12. Add language identification plugin. |
| |
| The process of identification is as follows: |
| |
| 1. html (html only, HTML 4.0 "lang" attribute) |
| 2. meta tags (html only, http-equiv, dc.language) |
| 3. http header (Content-Language) |
| 4. if all above fail "statistical analysis" |
| |
| 1 & 2 are run during the fetching phase and 3 & 4 are run during the |
| indexing phase. |
| |
| Currently supported languages (in "statistical analysis") are |
| da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed |
| from http://www.isi.edu/~koehn/europarl/ and the profiles were |
| built with the tool supplied in the patch. |
| |
| After indexing, the language can be found in the field named "lang". |
| |
| It's not 100% accurate but it's a start. |
| (Sami Siren) |
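The fallback order listed above can be sketched as a simple "first non-empty source wins" chain. The class LanguageFallback and its parameters are hypothetical and not the plugin's API. The sketch also folds all four steps into one call, whereas the plugin splits them between the fetching and indexing phases. The statistical step is passed as a supplier so that it only runs when the three cheaper sources yield nothing.

```java
import java.util.function.Supplier;

// Hypothetical illustration of the identification order described above.
final class LanguageFallback {

    private static boolean present(String s) {
        return s != null && !s.trim().isEmpty();
    }

    /**
     * Returns the first available language code, checking in order:
     * HTML "lang" attribute, meta tags, Content-Language header,
     * and finally the (lazily evaluated) statistical guess.
     */
    static String identify(String htmlLangAttr,
                           String metaTagLang,
                           String contentLanguageHeader,
                           Supplier<String> statisticalGuess) {
        if (present(htmlLangAttr)) return htmlLangAttr;
        if (present(metaTagLang)) return metaTagLang;
        if (present(contentLanguageHeader)) return contentLanguageHeader;
        return statisticalGuess.get();
    }

    public static void main(String[] args) {
        System.out.println(identify(null, "", "fi", () -> "en"));
        System.out.println(identify(null, null, null, () -> "en"));
    }
}
```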
| |
| 13. Added SegmentMergeTool and "mergesegs" command, which remove |
| duplicated or otherwise unused content from several segments and |
| join them together into a single new segment. The tool also |
| optionally performs several other steps required for proper |
| operation of Nutch - such as indexing segments, deleting |
| duplicates, merging indices, and indexing the new single segment. |
| (Andrzej Bialecki) |
| |
| 14. Add the ability to retrieve ParseData of a search hit. ParseData |
| contains many valuable properties of a search hit. |
| |
| This is required (among others) to properly display the cached |
| content because it's not possible to determine the character |
| encoding from the output of the getContent() method (which returns |
| byte[]). The symptoms are that for HTML pages using non-latin1 or |
| non-UTF8 encodings the cached preview will almost certainly look |
| broken. Using the attached patch it is possible to determine the |
| character encoding from the ParseData (for HTTP: Content-Type |
| metadata), and encode the content accordingly. (Andrzej Bialecki) |
| |
| 15. Add a pluggable query interface. By default, the content, anchor |
| and url fields are searched as before. A sample plugin indexes |
| the host name and adds a "site:" keyword to query parsing. |
| |
| 16. Added support for "lang:" in queries. For example, searching with |
| "lang:en" restricts results to pages which were identified to |
| be in English. |
| |
| 17. Automatically optimize field queries to use cached Lucene filters. |
| This makes, for example, searches restricted by languages or sites |
| that are very common much faster. |
| |
| 18. Improved charset handling in jsp pages. (jshin via cutting) |
| |
| 19. Permit topic filtering when injecting DMOZ pages. (jshin via cutting) |
| |
| 20. When parsing crawled pages, interpret charset specifications in |
| html meta tags. (jshin via cutting) |
| |
| 21. Added support for "cc:licensed" in queries, which searches for documents |
| released under Creative Commons licenses. Attributes of the |
| license may also be queried, with, e.g., "cc:by" for |
| attribution-required licenses, "cc:nc" for non-commercial |
| licenses, etc. |
| |
| 22. Relative paths named in plugin.folders are now searched for on the |
| classpath. This makes, e.g., deployment in a war file much simpler. |
| |
| 23. Modifications to Fetcher.java. |
| |
| 1. Make sure it works properly with regard to creation and initialization |
| of plugin instances. The problem was that multiple threads race to |
| startUp() or shutDown() plugin instances. It was solved by synchronizing |
| certain code in PluginRepository.java and Extension.java. |
| (Stefan Groschupf via John Xing) |
| |
| 2. Added code to explicitly shutDown() plugins. Otherwise FetcherThreads |
| may never return (quit) if there are still data or other structures |
| (e.g., persistent socket connections) associated with plugins. (John Xing) |
| |
| 3. Fixed one type of Fetcher "hang" problem by monitoring named |
| FetcherThreads. If all FetcherThreads are gone (finished), |
| Fetcher.java is considered done. The problem was: there could be |
| runaway threads started by external libs via FetcherThreads. |
| Those threads never return, thus keep Fetcher from exiting normally. |
| (John Xing) |
| |
| 24. Eliminate excessive hits from sites. This is done efficiently by |
| adding the site name to Hit instances, and, when needed, |
| re-querying with too-frequent sites prohibited in the query. |
| |
| |
| Release 0.4 |
| |
| 1. Http class refactored. (Kevin Smith via Tom Pierce) |
| |
| 2. Add Finnish translation. (Sampo Syreeni via Doug Cutting) |
| |
| 3. Added Japanese translation. (Yukio Andoh via Doug Cutting) |
| |
| 4. Updated Dutch translation. (Ype Kingma via Doug Cutting) |
| |
| 5. Initial version of Distributed DB code. (Mike Cafarella) |
| |
| 6. Make things more tolerant of crashed fetcher output files. |
| (Doug Cutting) |
| |
| 7. New skin for website. (Frank Henze via Doug Cutting) |
| |
| 8. Added Spanish translation. (Diego Basch via Doug Cutting) |
| |
| 9. Add FTP support to fetcher. (John Xing via Doug Cutting) |
| |
| 10. Added Thai translation. (Pichai Ongvasith via Doug Cutting) |
| |
| 11. Added Robots.txt & throttling support to Fetcher.java. (Mike |
| Cafarella) |
| |
| 12. Added nightly build. (Doug Cutting) |
| |
| 13. Default all link scores to 1.0. (Doug Cutting) |
| |
| 14. Permit one to keep internal links. (Doug Cutting) |
| |
| 15. Fixed dedup to select shortest URL. (Doug Cutting) |
| |
| 16. Changed index merger so that merged index is written to named |
| directory, rather than to a generated name in that directory. |
| (Doug Cutting) |
| |
| 17. Disable coordination weighting of query clauses and other minor |
| scoring improvements. (Doug Cutting) |
| |
| 18. Added a new command, crawl, that constructs a database, injects a |
| url file and performs a few rounds of generate/fetch/updatedb. |
| This simplifies use for intranet sites. Changed some defaults to |
| be more intranet friendly. (Doug Cutting) |
| |
| 19. Fixed a bug where Fetcher.java didn't construct correct relative |
| links when a page was redirected. (Doug Cutting) |
| |
| 20. Fixed a query parser problem with lookahead over plusses and minuses. |
| (Doug Cutting) |
| |
| 21. Add support for HTTP proxy servers. (Sami Siren via Doug Cutting) |
| |
| 22. Permit searching while fetching and/or indexing. |
| (Sami Siren via Doug Cutting) |
| |
| 23. Fix a bug when throttling is disabled. (Sami Siren via Doug Cutting) |
| |
| 24. Updated Bahasa Malaysia translation. (Michael Lim via Doug Cutting) |
| |
| 25. Added Catalan translation. (Xavier Guardiola via Doug Cutting) |
| |
| 26. Added Brazilian Portuguese translation. |
| (A. Moreir via Doug Cutting) |
| |
| 27. Added a French translation. (Julien Nioche via Doug Cutting) |
| |
| 28. Updated to Lucene 1.4RC3. (Doug Cutting) |
| |
| 29. Add capability to boost by link count & use it in crawl tool. |
| (Doug Cutting) |
| |
| 30. Added plugin system. (Stefan Groschupf via Doug Cutting) |
| |
| 31. Add this change log file, for recording significant changes to |
| Nutch. Populate it with changes from the last few months. |