src/plugin/urlfilter-fast - nutch

tree: c83f8ff8ebbab6e05d42d11eaa77c9ef1e879113 [path history] [tgz]

src/plugin/urlfilter-fast/README.md

Filters URLs based on a file of regular expressions using host/domains matching first. The default policy is to accept a URL if no matches are found.

Rule Format:

Host www.example.org
  DenyPath /path/to/be/excluded
  DenyPath /some/other/path/excluded

# Deny everything from *.example.com and example.com
Domain example.com
  DenyPath .*

Domain example.org
  DenyPathQuery /resource/.*?action=exclude

Host rules are evaluated before Domain rules. For Host rules the entire host name of a URL must match while the domain names in Domain rules are considered as matches if the domain is a suffix of the host name (consisting of complete host name parts). Shorter domain suffixes are checked first, a single dot “.” as “domain name” can be used to specify global rules applied to every URL.

E.g., for “www.example.com” the rules given above are looked up in the following order:

check “www.example.com” whether host-based rules exist and whether one of them matches
check “www.example.com” for domain-based rules
check “example.com” for domain-based rules
check “com” for domain-based rules
check for global rules (domain name is “.”)

The first matching rule will reject the URL and no further rules are checked. If no rule matches the URL is accepted. URLs without a host name (e.g., file:/path/file.txt are checked for global rules only. URLs which fail to be parsed as java.net.URL are always rejected.

For rules either the URL path (DenyPath) or path and query (DenyPathQuery) are checked whether the given Java Regular expression is found (see Matcher.find()) in the URL path (and query).

Rules are applied in the order of their definition. For better performance, regular expressions which are simpler/faster or match more URLs should be defined earlier.

Comments in the rule file start with the # character and reach until the end of the line.

The rules file is defined via the property urlfilter.fast.file, the default name is fast-urlfilter.txt.