This plugin allows you to fetch JavaScript-rendered pages using Selenium, while relying on the rest of the awesome Nutch stack!
The underlying code is based on the nutch-htmlunit plugin, which was in turn based on nutch-httpclient.
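There are essentially two ways in which Nutch can be used with Selenium: locally, with the browser running on the same machine as Nutch, or remotely, through a Selenium Grid (the 'remote' driver option).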
sudo apt-get install firefox
This step is not necessary for the PhantomJS browser and may not be needed for all browsers.
sudo apt-get install xorg synaptic xvfb gtk2-engines-pixbuf xfonts-cyrillic xfonts-100dpi \
    xfonts-75dpi xfonts-base xfonts-scalable freeglut3-dev dbus-x11 openbox x11-xserver-utils \
    libxrender1 cabextract
sudo /usr/bin/Xvfb :11 -screen 0 1024x768x24 &
export DISPLAY=:11
Using the Selenium Grid allows you to parallelize the job by giving Nutch access to several browser instances, whether on one machine or on several machines. Note that the grid supports heterogeneous browser types. However, these steps have been tested using a homogeneous Selenium Grid with Firefox and PhantomJS browsers.
Download the Selenium Standalone Server and follow the installation instructions.
Some important configurations to note while setting up the selenium-hub and the selenium-nodes are:
For the hub:
For the nodes:
Go headless with your Selenium Grid installation. There are different ways to do this. See this resource for further details.
For Nutch efficiency, and optimization of the grid, consider editing the following configs in nutch-site.xml
To activate the full selenium grid, edit $NUTCH_HOME/runtime/local/bin/crawl script:
<!-- NUTCH_HOME/conf/nutch-site.xml -->
<configuration>
  ...
  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>
  <!-- protocol-selenium plugin properties -->
  <property>
    <name>selenium.driver</name>
    <value>firefox</value>
    <description>
      A String value representing the flavour of Selenium
      WebDriver() to use. Currently the following options
      exist - 'firefox', 'chrome', 'safari', 'opera', 'phantomjs', and 'remote'.
      If 'remote' is used it is essential to also set correct properties for
      'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host' and
      'selenium.hub.protocol'.
    </description>
  </property>
  <property>
    <name>selenium.take.screenshot</name>
    <value>false</value>
    <description>
      Boolean property determining whether the protocol-selenium
      WebDriver should capture a screenshot of the URL. If set to
      true remember to define the 'selenium.screenshot.location'
      property as this determines the location screenshots should be
      persisted to on HDFS. If that property is not set, screenshots
      are simply discarded.
    </description>
  </property>
  <property>
    <name>selenium.screenshot.location</name>
    <value></value>
    <description>
      The location on disk where a URL screenshot should be saved
      to if the 'selenium.take.screenshot' property is set to true.
      By default this is null, in this case screenshots held in memory
      are simply discarded.
    </description>
  </property>
  <property>
    <name>selenium.hub.port</name>
    <value>4444</value>
    <description>Selenium Hub Location connection port</description>
  </property>
  <property>
    <name>selenium.hub.path</name>
    <value>/wd/hub</value>
    <description>Selenium Hub Location connection path</description>
  </property>
  <property>
    <name>selenium.hub.host</name>
    <value>localhost</value>
    <description>Selenium Hub Location connection host</description>
  </property>
  <property>
    <name>selenium.hub.protocol</name>
    <value>http</value>
    <description>Selenium Hub Location connection protocol</description>
  </property>
  <property>
    <name>selenium.grid.driver</name>
    <value>firefox</value>
    <description>A String value representing the flavour of Selenium
      WebDriver() used on the selenium grid. Currently the following options
      exist - 'firefox' or 'phantomjs'
    </description>
  </property>
  <property>
    <name>selenium.grid.binary</name>
    <value></value>
    <description>A String value representing the path to the browser binary
      location for each node
    </description>
  </property>

  <!-- lib-selenium configuration -->
  <property>
    <name>libselenium.page.load.delay</name>
    <value>3</value>
    <description>
      The delay in seconds to use when loading a page with lib-selenium. This
      setting is used by protocol-selenium and protocol-interactiveselenium
      since they depend on lib-selenium for fetching.
    </description>
  </property>
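As an illustration of how the two screenshot properties work together, the fragment below enables screenshots and sets a target location. It is a minimal sketch: the path /tmp/nutch-screenshots is a hypothetical placeholder, not a value shipped with the plugin, and the fragment is meant to sit inside the existing <configuration> element of nutch-site.xml.

```xml
<!-- Sketch only: enable screenshots and tell the plugin where to store them. -->
<property>
  <name>selenium.take.screenshot</name>
  <value>true</value>
</property>
<property>
  <name>selenium.screenshot.location</name>
  <!-- Hypothetical path (assumption); replace with a location valid in your deployment. -->
  <value>/tmp/nutch-screenshots</value>
</property>
```

Leaving selenium.screenshot.location empty while selenium.take.screenshot is true means the screenshots are discarded, as the property descriptions state.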
If you've selected the 'remote' value for the 'selenium.driver' property, ensure that you've configured the additional properties based on your Selenium Grid installation.
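For instance, a 'remote' setup might override the defaults roughly as follows. This is a sketch under stated assumptions: the host name grid-hub.example.com is a placeholder for wherever your hub runs, and the port, path, and protocol simply repeat the default values listed above.

```xml
<!-- Sketch only: point protocol-selenium at a Selenium Grid hub. -->
<property>
  <name>selenium.driver</name>
  <value>remote</value>
</property>
<property>
  <name>selenium.hub.protocol</name>
  <value>http</value>
</property>
<property>
  <name>selenium.hub.host</name>
  <!-- Placeholder host name (assumption); use the address of your own hub. -->
  <value>grid-hub.example.com</value>
</property>
<property>
  <name>selenium.hub.port</name>
  <value>4444</value>
</property>
<property>
  <name>selenium.hub.path</name>
  <value>/wd/hub</value>
</property>
<property>
  <name>selenium.grid.driver</name>
  <!-- Browser used on the grid nodes: 'firefox' or 'phantomjs'. -->
  <value>firefox</value>
</property>
```

Taken together, the four hub properties in this sketch describe the address http://grid-hub.example.com:4444/wd/hub.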
Compile Nutch
ant runtime