NUTCH-2716 Response headers are not stored for a compressed response

Even when store.http.headers=true, the HTTP headers are not saved for a
gzipped or deflated response, because they may contain an incorrect
content-length header.
This causes WARCExporter to generate "resource" (header-less) entries
instead of "response" entries.
The correct behaviour is to store all the headers; code that uses them
must keep in mind that they describe the original response, not the
stored content.

This fixes protocol-http, protocol-selenium, and protocol-htmlunit to
write the raw response headers, and adds logic to WARCExporter and
CommonCrawlDataDumper to fix these headers.
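
As a rough illustration of what "fixing" the raw headers can mean, here
is a minimal sketch. It is not the code from this commit: the class
HeaderFixSketch and the method fixHeaders are hypothetical names, and
the rule applied (drop Content-Encoding and Transfer-Encoding, rewrite
Content-Length to the size of the stored body) is an assumption about
one reasonable approach, not a description of what WARCExporter or
CommonCrawlDataDumper actually do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class HeaderFixSketch {

  /**
   * Hypothetical helper (not part of Nutch): rewrites a raw HTTP header
   * block so that it describes the stored body of storedLength bytes
   * instead of the compressed bytes that were sent over the wire.
   */
  public static String fixHeaders(String rawHeaders, int storedLength) {
    List<String> out = new ArrayList<>();
    boolean lengthWritten = false;
    String[] lines = rawHeaders.split("\r\n");
    for (int i = 0; i < lines.length; i++) {
      String line = lines[i];
      if (line.isEmpty()) {
        continue; // skip blank lines; the terminator is re-added below
      }
      if (i == 0) {
        out.add(line); // status line is kept unchanged
        continue;
      }
      String lower = line.toLowerCase(Locale.ROOT);
      if (lower.startsWith("content-encoding:")
          || lower.startsWith("transfer-encoding:")) {
        continue; // assumption: the stored body is no longer encoded
      }
      if (lower.startsWith("content-length:")) {
        if (!lengthWritten) {
          out.add("Content-Length: " + storedLength);
          lengthWritten = true;
        }
        continue; // drop the original (possibly wrong) length
      }
      out.add(line);
    }
    if (!lengthWritten) {
      out.add("Content-Length: " + storedLength);
    }
    return String.join("\r\n", out) + "\r\n\r\n";
  }

  public static void main(String[] args) {
    String raw = "HTTP/1.1 200 OK\r\n"
        + "Content-Type: text/html\r\n"
        + "Content-Encoding: gzip\r\n"
        + "Content-Length: 120\r\n\r\n";
    System.out.print(fixHeaders(raw, 512));
  }
}
```

The point of the sketch is only that the raw headers are stored
unchanged at fetch time, and any adjustment happens later, at export
time, where the size of the stored content is known.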

It also fixes NUTCH-2715 (WARCExporter fails on large records) and
upgrades lib-htmlunit to version 3.141.5 of Selenium, since Eclipse
otherwise fails to compile (conflicts with lib-selenium).
7 files changed
tree: ee2c9c06943181f0faba962de674710377b37f37
  1. .github/
  2. conf/
  3. docker/
  4. ivy/
  5. lib/
  6. src/
  7. .gitignore
  8. build.xml
  9. CHANGES.txt
  10. default.properties
  11. eclipse-codeformat.xml
  12. KEYS
  13. LICENSE.txt
  14. NOTICE.txt
  15. README.md
README.md

Apache Nutch README

For the latest information about Nutch, please visit our website at:

http://nutch.apache.org

and our wiki, at:

http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

http://wiki.apache.org/nutch/NutchTutorial

Contributing

To contribute a patch, follow these instructions (note that installing Hub is not strictly required, but is recommended).

0. Download and install Hub from hub.github.com
1. File a JIRA issue for your fix at https://issues.apache.org/jira/browse/NUTCH
- you will get an issue ID of the form NUTCH-xxx, where xxx is the issue number.
2. git clone http://github.com/apache/nutch.git 
3. cd nutch
4. git checkout -b NUTCH-xxx
5. edit files (please try to include a test case if possible)
6. git status (make sure it shows what files you expected to edit)
7. Make sure that your code complies with the [Nutch code formatting template](http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml), which is basically two-space indents
8. git add <files>
9. git commit -m "fix for NUTCH-xxx contributed by <your username>"
10. git fork
11. git push -u <your git username> NUTCH-xxx
12. git pull-request

Export Control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-tika plugin for extracting textual content and metadata from encrypted PDF files. See http://pdfbox.apache.org for more details on PDFBox.