To fit in with existing components we were looking for something that was Java based and also allowed good integration with Saxonica’s Saxon XSLT processor. Another key requirement was that the parser solution should conform to the HTML5 specification. One of the great strengths of HTML is that it is very forgiving when confronted with poorly formed tag content in the source, HTML5 therefore provides a parsing specification for both valid and invalid HTML to ensure that the same DOM is generated for any conforming parser implementation.
Given the above constraints, we selected the Mozilla backed Validator.nu parser, an HTML5 serializer that complements the behaviour of the parser is also included with this component.
The sample code excerpt below shows the Java code used to work interoperably between the HTML5 and XML components:
From above, we can see that
org.xml.sax.XMLReader, whilst HTML5Serializer implements
org.xml.sax.ContentHandler, allowing us make Saxon9 API calls to methods with arguments that implement these interfaces. For each HTML5 input, we can therefore create an XdmNode instance, which is then processed by Core.
The XML output from Core is also an XdmNode instance, this is used to construct the HTML5Serializer object, which in turn is supplied to the
setDestination method of the
XSLTTransformer instance. Executing the
transform() method of this instance (in this case using an XLST identity transform) gives us standard serialized HTML5 output.
Please note that a design constraint required the use of XdmNodes in this particular example, the solution would have been simpler and allowed parallel parsing if we had used two
SaxSource instances, with each instance constructed using its own independent XmlReader object.
If you’re interested in more information on the use of SAX interfaces in pipelines, please see Powering Pipelines with JAXP on our Articles and Papers page.
Using this HTML5 parser/serializer combination with its SAX API allowed us to conveniently process the HTML5 DOM almost as if we were dealing with XHTML. Once parsed, HTML5 elements are even in the XHTML namespace. You may have noticed from the code extract above that the class that represents the pipeline for Core is
XHTMLCompare, this is because we were actually using a customized version of an exsiting pipeline designed for XHTML. There are of course important differences with XHTML even once parsed, one is the absence of default attributes. For example, to use XSLT filters designed for XHTML we needed to add
xml:space="preserve" attributes to
We found that by selecting standards-based parsing and serializing solutions, processing HTML5 using an XML-based pipeline can be relatively straightforwards. Existing resources for processing XHTML can also be used with only minor adjustments.