Define your custom normalizers here. Every normalizer class
must implement the org.apache.nutch.core.jsoup.extractor.normalizer.Normalizable interface.
Every normalizer must have a name and a class attribute. See conf/jsoup-extractor-example.xml for an example.
<normalizer name="simpleNormalizer" class="org.apache.nutch.core.jsoup.extractor.normalizer.SimpleStringNormalizer" />
Every document must have a url-pattern attribute, which contains the URL regex used for filtering.
<document url-pattern=".*" >
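The effect of a url-pattern value can be tried out in isolation with a minimal Java sketch. The class name UrlPatternSketch and the method accepts are hypothetical, and the sketch assumes the regex must match the whole URL; the extractor itself may apply the pattern differently.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks a URL against a url-pattern value.
final class UrlPatternSketch {

    // Assumption: the pattern has to match the entire URL, not just a part of it.
    static boolean accepts(String urlPattern, String url) {
        return Pattern.compile(urlPattern).matcher(url).matches();
    }

    public static void main(String[] args) {
        String newsOnly = "https?://example\\.org/news/.*";
        // ".*" is a catch-all pattern that accepts every URL.
        System.out.println(accepts(".*", "https://example.org/any/page.html"));
        // A narrower pattern accepts only URLs under /news/.
        System.out.println(accepts(newsOnly, "https://example.org/news/1.html"));
        System.out.println(accepts(newsOnly, "https://example.org/about.html"));
    }
}
```

Note that a literal dot in a host name has to be escaped as \. in the regex, otherwise it matches any character.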
A <field> tag can have the following properties:
1. attribute - 'name', the name of the field for indexing (mandatory).
2. tag - <css-selector>, the jsoup selector syntax used to find content with the jsoup select() API (mandatory).
3. tag - <attribute>, the HTML attribute name used to find content with the jsoup attr() API together with the select() API
(optional; if <attribute> is defined, select(<css-selector>).attr(<attribute>) is used to extract content, otherwise select(<css-selector>).ownText() is used).
4. tag - <default-value>, the default value used when the jsoup selection finds nothing (optional).
5. tag - <normalizer>, the name of a normalizer defined in the <normalizers> section (optional).
See conf/jsoup-extractor-example.xml for an example.
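The way these properties combine can be modelled with a small Java sketch that does not depend on jsoup. Everything in it is hypothetical: the class FieldRuleSketch, the method resolve, and the plain map and string that stand in for the attributes and own text of the selected element. It also assumes that the normalizer is applied to the final value, including the default; the extractor may order these steps differently.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical model of the field rules described above, not the extractor's code.
final class FieldRuleSketch {

    // attributes and ownText stand in for the element found by <css-selector>.
    static String resolve(Map<String, String> attributes, String ownText,
                          String attribute, String defaultValue,
                          UnaryOperator<String> normalizer) {
        // If <attribute> is defined, read that attribute; otherwise use the own text.
        String value = (attribute != null)
                ? attributes.getOrDefault(attribute, "")
                : ownText;
        // Fall back to <default-value> when the selection produced nothing.
        if (value == null || value.trim().isEmpty()) {
            value = defaultValue;
        }
        // Assumption: the optional normalizer runs last, on whatever value remains.
        if (value != null && normalizer != null) {
            value = normalizer.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> link = Collections.singletonMap("href", "/a");
        Map<String, String> none = Collections.emptyMap();
        System.out.println(resolve(link, "Link", "href", "none", null));
        System.out.println(resolve(none, "  Title ", null, "A placeholder Title", String::trim));
        System.out.println(resolve(none, "", null, "A placeholder Title", null));
    }
}
```

The three calls cover the attribute case, the own-text case with a trimming normalizer, and the default-value fallback.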
Sample field:
<field name="title">
<default-value>A placeholder Title</default-value>