We are currently using a shim (https://github.com/tballison/hadoop-safe-tika) because of binary conflicts between the commons-io version that Hadoop supports and the more recent commons-io features that Apache Tika and Apache POI rely on.
For now, all you have to do is update the fat jar dependencies:
tika-core-shaded in ivy/ivy.xml
tika-parsers-standard-package-shaded in src/plugin/parse-tika/ivy.xml
The library name (which includes the version) for tika-parsers-standard-package-shaded in src/plugin/parse-tika/plugin.xml
Repeat steps 2 and 3 for the language-identifier plugin
Build Nutch and run all unit tests:
$ cd ../../../
$ ant clean runtime test
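Before building, it can help to confirm that the edited files all mention the same Tika version. The following is only a minimal sketch: the helper name `show_tika_refs` is hypothetical (it is not part of Nutch), and it simply prints every line that mentions "tika" so the versions can be compared by eye.

```shell
# Hypothetical helper (not part of Nutch): print every line that mentions
# "tika" in the given files, prefixed with file name and line number,
# so the version strings can be compared by eye.
show_tika_refs() {
  grep -H -n -i 'tika' "$@"
}

# Possible usage from the Nutch source root (paths taken from the steps above):
# show_tika_refs ivy/ivy.xml src/plugin/parse-tika/ivy.xml src/plugin/parse-tika/plugin.xml
```

If the printed versions differ between the ivy.xml files and plugin.xml, the plugin will most likely fail to load its libraries at runtime.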
The following directions are what we used to do with thin jars. Hopefully, we'll be able to get back to these directions once we have version harmony with Hadoop and Tika/POI.
Upgrade Tika dependency (tika-core) in ivy/ivy.xml
Upgrade Tika dependency in src/plugin/parse-tika/ivy.xml
Upgrade Tika's own dependencies in src/plugin/parse-tika/plugin.xml
To get the list of dependencies and their versions, execute:
$ cd src/plugin/parse-tika/
$ ant -f ./build-ivy.xml
$ ls lib | sed 's/^/ <library name="/g' | sed 's/$/"\/>/g'
In the plugin.xml replace all lines between
and
with the output of the command above.
(Optionally) remove overlapping dependencies between parse-tika and Nutch core dependencies:
Remove the locally “installed” dependencies in src/plugin/parse-tika/lib/:
$ rm -rf lib/
Repeat steps 2-5 for the language-identifier plugin which also depends on Tika modules
$ cd ../language-identifier/
Build Nutch and run all unit tests:
$ cd ../../../
$ ant clean runtime test
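The step that turns the contents of lib/ into plugin.xml entries can be sketched as a small shell function. This is an illustrative sketch, not the project's official tooling: the function name `jars_to_library_lines` and the jar names in the example are hypothetical, and the leading indentation is an assumption that can be adjusted to match plugin.xml. The slash in the closing `/>` must be escaped because `/` is also the sed delimiter.

```shell
# Hypothetical helper: read jar file names from stdin (one per line)
# and wrap each one in a <library name="..."/> element.
jars_to_library_lines() {
  # First expression prepends the opening of the element,
  # second appends the closing quote and "/>" (slash escaped for sed).
  sed -e 's/^/ <library name="/' -e 's/$/"\/>/'
}

# Example with made-up jar names; in practice the input would be `ls lib`.
printf '%s\n' example-a-1.0.jar example-b-2.0.jar | jars_to_library_lines
```

In the plugin directory, the equivalent call would be `ls lib | jars_to_library_lines`, with the output pasted into plugin.xml as described in the steps above.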