Module to provide a fast HTML5/XML tokenizer. TagStream takes an InputStream and tokenizes it into Tag or Text elements, which can be inspected or transformed using Java's Stream API.
This provides:
There are multiple ways to process an HTML/XML source.
The TagIterator allows you to iterate over the HTML/XML document using a pull methodology. Whenever you request the next element, that element is tokenized from the InputStream.
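To illustrate the pull methodology, here is a minimal sketch of an iterator that reads from its input only when the next token is requested. It is a toy written for this explanation, not the module's TagIterator: the class name PullTokenizerSketch is hypothetical, tokens are plain strings, and it does not handle comments, CDATA, or a '>' inside a quoted attribute value.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical toy tokenizer (an assumption for illustration only).
// It splits "<...>" runs from the text between them, one token per pull.
class PullTokenizerSketch implements Iterator<String> {
    private static final int NONE = -2;

    private final Reader in;
    private String next;        // token read ahead by hasNext(), or null
    private int pending = NONE; // one pushed-back character, or NONE

    PullTokenizerSketch(Reader in) {
        this.in = in;
    }

    // Returns the pushed-back character if there is one, else reads a new one.
    private int read() {
        if (pending != NONE) {
            int c = pending;
            pending = NONE;
            return c;
        }
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (next != null) {
            return true;
        }
        int c = read();
        if (c == -1) {
            return false;
        }
        StringBuilder token = new StringBuilder();
        if (c == '<') {
            // Tag token: consume up to and including the closing '>'.
            token.append('<');
            while ((c = read()) != -1) {
                token.append((char) c);
                if (c == '>') {
                    break;
                }
            }
        } else {
            // Text token: consume until the next '<', which is pushed back.
            token.append((char) c);
            while ((c = read()) != -1) {
                if (c == '<') {
                    pending = c;
                    break;
                }
                token.append((char) c);
            }
        }
        next = token.toString();
        return true;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String token = next;
        next = null;
        return token;
    }
}
```

Because input is consumed only inside hasNext() and next(), a caller that stops iterating early never reads the rest of the source.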
The Tag class wraps the TagIterator to provide a Stream<Element> provider.
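How an iterator can be exposed as a Stream is sketched below using only standard JDK classes. This shows the general technique, not necessarily how Tag.stream is implemented; the helper class IteratorStreams is a hypothetical name, and plain strings stand in for Element instances.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper (an assumption for illustration only).
class IteratorStreams {
    // Wraps any iterator in a sequential Stream. Elements are pulled from
    // the iterator only as the terminal operation asks for them.
    static <T> Stream<T> stream(Iterator<T> iterator) {
        Spliterator<T> spliterator = Spliterators.spliteratorUnknownSize(
                iterator, Spliterator.ORDERED | Spliterator.NONNULL);
        return StreamSupport.stream(spliterator, false);
    }
}
```

With such a wrapper, any token iterator can be filtered, mapped, and counted like the examples in this document, for instance `IteratorStreams.stream(tokens.iterator()).filter(t -> t.startsWith("<")).count()`.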
Examples:
    // count the number of start tags
    Tag.stream(inputStream)
        .filter(elem -> elem.getType() == ElementType.START_TAG)
        .count();

    // find any elements that have an href attribute that doesn't point anywhere
    // print out the bad links as an html list of anchors
    stream.filter(elem -> elem.getType() == ElementType.START_TAG && elem.hasAttribute("href"))
        .filter(elem -> isLinkBad(elem.getAttribute("href")))
        .map(HtmlStreams.TO_HTML)
        .forEach(System.out::println);

    // feed every element to a SAX event generator backed by a custom handler
    HtmlSAXSupport saxEventGenerator = new HtmlSAXSupport(customHandler);
    Tag.stream(inputStream).forEach(saxEventGenerator);
TagStream works by using the W3C's HTML5 parsing rules to properly identify a tag. These rules are separate from the guidelines that define valid HTML. The TagStream parser emits a Tag or Text element and then proceeds to the next section of the input. It does not attempt to create a DOM tree, and it does not validate the tags it finds. It assumes that you know what you are doing and won't judge you.
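The difference between tokenizing and validating can be sketched with a toy classifier. This is not the module's parser and does not implement the HTML5 tokenization rules: the class NoValidationSketch, its Kind enum, and the regular expression are assumptions made for illustration. The point is only that tokens are reported in document order, with no tree built and no check that tags are properly nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy classifier (an assumption for illustration only).
class NoValidationSketch {
    enum Kind { START_TAG, END_TAG, TEXT }

    // Matches either a "<...>" run or a run of text up to the next '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    static Kind classify(String token) {
        if (token.startsWith("</")) {
            return Kind.END_TAG;
        }
        if (token.startsWith("<")) {
            return Kind.START_TAG;
        }
        return Kind.TEXT;
    }

    // Reports the kind of each token in order. Nesting is never checked,
    // so mismatched tags produce no error.
    static List<Kind> kinds(String html) {
        List<Kind> out = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(html);
        while (matcher.find()) {
            out.add(classify(matcher.group()));
        }
        return out;
    }
}
```

Calling `NoValidationSketch.kinds("<b><i>text</b></i>")` yields two start tags, a text token, and two end tags, even though the end tags close in the wrong order. Deciding whether that matters is left to the caller.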