This plugin allows you to fetch JavaScript pages using Selenium, while relying on the rest of the awesome Nutch stack!
The underlying code is based on the nutch-htmlunit plugin, which was in turn based on nutch-httpclient.
There are essentially two ways in which Nutch can be used with Selenium.
sudo apt-get install firefox
This step is not necessary for the PhantomJS browser and may not be needed for all browsers.
sudo apt-get install xorg synaptic xvfb gtk2-engines-pixbuf xfonts-cyrillic xfonts-100dpi \
xfonts-75dpi xfonts-base xfonts-scalable freeglut3-dev dbus-x11 openbox x11-xserver-utils \
libxrender1 cabextract
sudo /usr/bin/Xvfb :11 -screen 0 1024x768x24 &
export DISPLAY=:11
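The Xvfb start-up above can be wrapped in a small helper so that repeated crawl runs do not try to start a second server on the same display. This is only a sketch: the function name `start_xvfb` is hypothetical, it assumes Xvfb can be run by the current user without `sudo`, and it relies on the lock file `/tmp/.X<n>-lock` that X servers normally create while running.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): start Xvfb on the given display
# number unless a server already holds that display, then export DISPLAY.
start_xvfb() {
  display_num="${1:-11}"

  # Bail out early if Xvfb is not installed.
  if ! command -v Xvfb >/dev/null 2>&1; then
    echo "Xvfb not found on PATH" >&2
    return 1
  fi

  # Assumption: a running X server leaves a lock file at /tmp/.X<n>-lock.
  if [ ! -e "/tmp/.X${display_num}-lock" ]; then
    Xvfb ":${display_num}" -screen 0 1024x768x24 >/dev/null 2>&1 &
  fi

  # export is a shell builtin, so it must run in the current shell (no sudo).
  DISPLAY=":${display_num}"
  export DISPLAY
}
```

Calling `start_xvfb 11` in the shell that later launches the crawl should leave `DISPLAY` set to `:11` for the browser processes started from it.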
Using Selenium Grid allows you to parallelize the job by giving access to several browser instances, whether on one machine or across several machines. Note that the grid supports heterogeneity with regard to the browser types used. However, these steps have been tested using a homogeneous Selenium Grid with Firefox and PhantomJS browsers.
Download the Selenium Standalone Server and follow the installation instructions.
Some important configurations to note while setting up the selenium-hub and the selenium-nodes are:
For the hub:
For the nodes:
Go headless with your Selenium Grid installation. There are different ways to do this. See this resource for further details.
For Nutch efficiency, and optimization of the grid, consider editing the following configs in nutch-site.xml
To activate the full Selenium Grid, edit the $NUTCH_HOME/runtime/local/bin/crawl script:
<!-- NUTCH_HOME/conf/nutch-site.xml -->
<configuration>
...
<property>
<name>plugin.includes</name>
<value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
<!-- protocol-selenium plugin properties -->
<property>
<name>selenium.driver</name>
<value>firefox</value>
<description>
A String value representing the flavour of Selenium
WebDriver() to use. Currently the following options
exist - 'firefox', 'chrome', 'safari', 'opera', 'phantomjs', and 'remote'.
If 'remote' is used it is essential to also set correct properties for
'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host' and
'selenium.hub.protocol'.
</description>
</property>
<property>
<name>selenium.take.screenshot</name>
<value>false</value>
<description>
Boolean property determining whether the protocol-selenium
WebDriver should capture a screenshot of the URL. If set to
true remember to define the 'selenium.screenshot.location'
property as this determines the location screenshots should be
persisted to on HDFS. If that property is not set, screenshots
are simply discarded.
</description>
</property>
<property>
<name>selenium.screenshot.location</name>
<value></value>
<description>
The location on disk where a URL screenshot should be saved
to if the 'selenium.take.screenshot' property is set to true.
By default this is null; in that case screenshots held in memory
are simply discarded.
</description>
</property>
<property>
<name>selenium.hub.port</name>
<value>4444</value>
<description>Selenium Hub Location connection port</description>
</property>
<property>
<name>selenium.hub.path</name>
<value>/wd/hub</value>
<description>Selenium Hub Location connection path</description>
</property>
<property>
<name>selenium.hub.host</name>
<value>localhost</value>
<description>Selenium Hub Location connection host</description>
</property>
<property>
<name>selenium.hub.protocol</name>
<value>http</value>
<description>Selenium Hub Location connection protocol</description>
</property>
<property>
<name>selenium.grid.driver</name>
<value>firefox</value>
<description>A String value representing the flavour of Selenium
WebDriver() used on the selenium grid. Currently the following options
exist - 'firefox' or 'phantomjs' </description>
</property>
<property>
<name>selenium.grid.binary</name>
<value></value>
<description>A String value representing the path to the browser binary
location for each node
</description>
</property>
<!-- lib-selenium configuration -->
<property>
<name>libselenium.page.load.delay</name>
<value>3</value>
<description>
The delay in seconds to use when loading a page with lib-selenium. This
setting is used by protocol-selenium and protocol-interactiveselenium
since they depend on lib-selenium for fetching.
</description>
</property>
If you've selected the 'remote' value for the 'selenium.driver' property, ensure that you've configured the additional properties based on your Selenium Grid installation.
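As a sanity check for the 'remote' setup, it can help to see the hub address that the four `selenium.hub.*` properties describe when put together. The sketch below assumes they combine as protocol, host, port, then path; the function name `hub_url` is hypothetical, and its defaults simply mirror the values in the configuration above.

```shell
#!/bin/sh
# Hypothetical helper: compose a hub URL from the four selenium.hub.* values.
# Assumption: the parts combine as <protocol>://<host>:<port><path>.
hub_url() {
  protocol="${1:-http}"      # selenium.hub.protocol
  host="${2:-localhost}"     # selenium.hub.host
  port="${3:-4444}"          # selenium.hub.port
  path="${4:-/wd/hub}"       # selenium.hub.path
  printf '%s://%s:%s%s\n' "$protocol" "$host" "$port" "$path"
}

# With the default values this prints http://localhost:4444/wd/hub
hub_url
```

If your Grid version exposes a status endpoint under that path, something like `curl -s "$(hub_url)/status"` may be a quick way to confirm that the hub is reachable before starting a crawl.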
Compile Nutch
ant runtime
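After the build finishes, a quick check that the plugin actually landed in the local runtime can save a failed crawl later. This is a minimal sketch: the function name `plugin_built` is hypothetical, and it assumes the build places plugins under `runtime/local/plugins/<plugin-name>` inside your Nutch checkout.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the built runtime contains the plugin directory.
# Assumption: plugins are built into <nutch_home>/runtime/local/plugins/<plugin>.
plugin_built() {
  nutch_home="$1"
  plugin="${2:-protocol-selenium}"
  [ -d "${nutch_home}/runtime/local/plugins/${plugin}" ]
}
```

For example, `plugin_built "$NUTCH_HOME" || echo "protocol-selenium missing, re-run ant runtime"` reports a missing plugin directory.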