Index the HTML content of pages in Apache Nutch 2.x (2.2.1)
##Instruction:
###Compile from Source
Download the plugin folder "index-html" and copy it to your Apache Nutch 2 plugin directory (e.g. apache-nutch-2.2.1/src/plugin).
Add the index-html plugin to the plugin folder's build.xml (apache-nutch-2.2.1/src/plugin/build.xml) in both the deploy and clean targets, so the file looks like this:

    <target name="deploy">
      .......
      <ant dir="index-basic" target="deploy"/>
      <ant dir="index-more" target="deploy"/>
      <ant dir="index-html" target="deploy"/>
      <ant dir="language-identifier" target="deploy"/>
      .........
    </target>
    <target name="clean">
      .......
      <ant dir="index-basic" target="clean"/>
      <ant dir="index-more" target="clean"/>
      <ant dir="index-html" target="clean"/>
      <ant dir="language-identifier" target="clean"/>
      .........
    </target>
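
A quick way to double-check the build.xml edit is a small script that parses the file and confirms the plugin appears in both targets. This is only a sketch: the function name `plugin_registered` and the `SAMPLE` document are made up for illustration, and it assumes entries inside the clean target use `target="clean"`, as in the stock Nutch build file.

```python
import xml.etree.ElementTree as ET


def plugin_registered(build_xml_text, plugin="index-html"):
    """Return {target_name: bool} for the deploy and clean targets.

    A target counts as registered when it contains
    <ant dir="<plugin>" target="<target_name>"/>.
    """
    root = ET.fromstring(build_xml_text)
    result = {}
    for target_name in ("deploy", "clean"):
        target = root.find(f"./target[@name='{target_name}']")
        ants = [] if target is None else target.findall("ant")
        result[target_name] = any(
            a.get("dir") == plugin and a.get("target") == target_name
            for a in ants
        )
    return result


# Hypothetical, trimmed-down build.xml used only to exercise the check.
SAMPLE = """
<project name="Nutch" default="deploy">
  <target name="deploy">
    <ant dir="index-basic" target="deploy"/>
    <ant dir="index-html" target="deploy"/>
  </target>
  <target name="clean">
    <ant dir="index-basic" target="clean"/>
    <ant dir="index-html" target="clean"/>
  </target>
</project>
"""

if __name__ == "__main__":
    print(plugin_registered(SAMPLE))
```

To check a real checkout, read apache-nutch-2.2.1/src/plugin/build.xml into a string and pass it to the function instead of `SAMPLE`.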
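
Enable the plugin by appending index-html to the plugin.includes property in your Nutch configuration (conf/nutch-site.xml):

    <configuration>
      ..........
      <property>
        <name>plugin.includes</name>
        <value>...........someplugins....|index-html</value>
      </property>
      ..........
    </configuration>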

Add the rawcontent field to your Solr schema.xml:

    <field name="rawcontent" type="text" stored="true" indexed="true" multiValued="false"/>

Run the crawler and you should see the new field rawcontent in the index!
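
If the index is Solr, one way to confirm the field was populated is to ask for documents that have a value in rawcontent. The sketch below rests on assumptions not stated in this README: the Solr base URL `http://localhost:8983/solr`, the standard `/select` handler, and a JSON response (`wt=json`). The helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(solr_base, field="rawcontent", rows=1):
    """Build a /select URL matching documents that have any value in `field`."""
    params = urlencode({
        "q": f"{field}:[* TO *]",   # range query: any value present
        "fl": f"id,{field}",        # only return the id and the field itself
        "rows": rows,
        "wt": "json",
    })
    return f"{solr_base.rstrip('/')}/select?{params}"


def count_docs_with_field(response_text):
    """Extract numFound from a Solr JSON select response."""
    return json.loads(response_text)["response"]["numFound"]


if __name__ == "__main__":
    # Assumed base URL; adjust to your own Solr instance or core.
    url = build_query_url("http://localhost:8983/solr")
    with urlopen(url) as resp:
        print(count_docs_with_field(resp.read().decode("utf-8")))
```

A count greater than zero suggests the plugin ran during indexing; zero usually means the plugin is not in plugin.includes or the pages were indexed before it was enabled.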
###Use Pre-Compiled Library
###Screen Shot
I'm always glad to help and assist, so if you have an idea that could make this project better:
Submit a git issue or contact me at www.meabed.net
Make a fork, commit to the develop branch, and make a pull request