Posted 28 November 2013
by Phil G. Fearon

Processing HTML5 in an XML Pipeline

My last blog post was about HTML5 Compare, an online comparison tool for HTML5. The comparison process we used for this relies on an XML pipeline built around DeltaXML’s Core product. In this blog post I describe the solution we used for parsing HTML5 for the pipeline input and serializing to HTML5 from the pipeline output.

Selecting an HTML5 Parser/Serializer

To fit in with existing components, we were looking for something Java-based that also integrated well with Saxonica’s Saxon XSLT processor. Another key requirement was that the parser should conform to the HTML5 specification. One of the great strengths of HTML is that it is very forgiving when confronted with poorly formed markup in the source. HTML5 therefore specifies parsing behaviour for both valid and invalid HTML, so that every conforming parser implementation generates the same DOM.

Given the above constraints, we selected the Mozilla-backed Validator.nu parser. This component also includes an HTML5 serializer that complements the behaviour of the parser.

DXP Pipeline

Integrating the Parser and Serializer into the pipeline

The sample code excerpt below shows the Java code used to work interoperably between the HTML5 and XML components:

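import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.DocumentBuilder;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.SAXDestination;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.XdmNode;
import net.sf.saxon.s9api.XsltCompiler;
import net.sf.saxon.s9api.XsltExecutable;
import net.sf.saxon.s9api.XsltTransformer;
import nu.validator.htmlparser.sax.HtmlParser;
import nu.validator.htmlparser.sax.HtmlSerializer;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import com.deltaxml.xhtml.XhtmlCompare;

// errorHandler and xsltStringReader are fields of the enclosing class (not shown).
public void compare(InputStream is1, String systemId1, InputStream is2,
                    String systemId2, OutputStream result) throws SaxonApiException {

    // The HTML5 parser is used through its SAX XMLReader interface.
    HtmlParser htmlParser = new HtmlParser();
    htmlParser.setErrorHandler(errorHandler);
    XMLReader xmlReader = htmlParser;

    Processor saxonProcessor = new Processor(true);
    InputSource in1 = new InputSource(is1);
    InputSource in2 = new InputSource(is2);

    in1.setSystemId(systemId1);
    in2.setSystemId(systemId2);

    DocumentBuilder db = saxonProcessor.newDocumentBuilder();

    // Parse each HTML5 input into an XdmNode (sequentially, sharing one reader).
    SAXSource saxSource1 = new SAXSource(xmlReader, in1);
    SAXSource saxSource2 = new SAXSource(xmlReader, in2);
    XdmNode inputNode1 = db.build(saxSource1);
    XdmNode inputNode2 = db.build(saxSource2);

    // Run the comparison pipeline built around Core.
    XhtmlCompare xhtmlCompare = new XhtmlCompare();
    XdmNode compareOutputNode = xhtmlCompare.compare(inputNode1, inputNode2);

    XsltCompiler comp = saxonProcessor.newXsltCompiler();

    StreamSource finalXslt = new StreamSource(xsltStringReader);
    XsltExecutable execFilter = comp.compile(finalXslt);
    XsltTransformer transformer = execFilter.load();
    transformer.setInitialContextNode(compareOutputNode);

    // The HTML5 serializer receives the transform output as SAX events.
    ContentHandler ch = new HtmlSerializer(result);
    transformer.setDestination(new SAXDestination(ch));
    transformer.transform();
}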

From the above, we can see that HtmlParser implements org.xml.sax.XMLReader, whilst HtmlSerializer implements org.xml.sax.ContentHandler, allowing us to make Saxon 9 API calls to methods that accept these interfaces as arguments. For each HTML5 input, we can therefore create an XdmNode instance, which is then processed by Core.
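The value of these two interfaces is that any XMLReader can push events into any ContentHandler. The following is a minimal sketch of that plumbing using only JDK classes: the JDK's XML parser stands in for HtmlParser and a JAXP identity TransformerHandler stands in for HtmlSerializer. The class and method names are illustrative assumptions, not part of the original pipeline.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxPlumbingSketch {

    // Parses the input with an XMLReader and serializes it with a ContentHandler.
    static String roundTrip(String xml) throws Exception {
        // Stand-in for HtmlParser: the JDK's namespace-aware XML parser.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Stand-in for HtmlSerializer: an identity TransformerHandler, which is
        // a ContentHandler that serializes the SAX events it receives.
        SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = stf.newTransformerHandler();
        serializer.getTransformer()
                  .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        // Connect the two ends and run the parse.
        reader.setContentHandler(serializer);
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("<p>Hello <b>world</b></p>"));
    }
}
```

Swapping either end for a different implementation of the same interface leaves the rest of the code unchanged, which is what lets the HTML5 components slot into an XML pipeline.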

The XML output from Core is also an XdmNode instance, which is set as the initial context node of the XsltTransformer instance. The HtmlSerializer object is constructed around the result output stream, wrapped in a SAXDestination, and supplied to the transformer’s setDestination method. Executing the transform() method (in this case using an XSLT identity transform) then gives us standard serialized HTML5 output.
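For readers unfamiliar with the identity transform, here is a small sketch of one supplied through a StringReader, in the same spirit as xsltStringReader in the excerpt. It assumes the JDK's built-in XSLT 1.0 processor rather than Saxon, and the stylesheet shown is the generic identity template, not necessarily the exact stylesheet used in the original pipeline.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IdentityTransformSketch {

    // The generic identity template: copy every node and attribute unchanged.
    static final String IDENTITY_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String copy(String xml) throws Exception {
        // Plays the role of xsltStringReader in the excerpt.
        StringReader xsltStringReader = new StringReader(IDENTITY_XSLT);
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(xsltStringReader));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(copy("<div class=\"a\"><p>text</p></div>"));
    }
}
```

Because the stylesheet changes nothing, the transform serves only to drive the tree through the chosen destination, which in the pipeline is the HTML5 serializer.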

Please note that a design constraint required the use of XdmNodes in this particular example. The solution would have been simpler, and would have allowed parallel parsing, if we had used two SAXSource instances, each constructed with its own independent XMLReader object.
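As a rough illustration of that alternative, the sketch below gives each SAXSource its own XMLReader and parses the two inputs on separate threads. It is an assumption-laden stand-in: the JDK's XML parser replaces HtmlParser, a DOM Document replaces the downstream consumer, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class ParallelParseSketch {

    // Builds a SAXSource backed by its own XMLReader, so no parser state is shared.
    static SAXSource newSource(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        // Stand-in for "new HtmlParser()" in the original excerpt.
        XMLReader reader = spf.newSAXParser().getXMLReader();
        return new SAXSource(reader, new InputSource(new StringReader(xml)));
    }

    // Drains a SAXSource into a DOM tree using a JAXP identity transformer.
    static Document build(SAXSource source) throws Exception {
        DOMResult result = new DOMResult();
        TransformerFactory.newInstance().newTransformer().transform(source, result);
        return (Document) result.getNode();
    }

    // Parses both inputs concurrently; each task owns its reader and transformer.
    static Document[] parseBoth(String xml1, String xml2) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Document> first = pool.submit(() -> build(newSource(xml1)));
            Future<Document> second = pool.submit(() -> build(newSource(xml2)));
            return new Document[] { first.get(), second.get() };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Document[] docs = parseBoth("<a><x/></a>", "<b><y/></b>");
        System.out.println(docs[0].getDocumentElement().getNodeName());
        System.out.println(docs[1].getDocumentElement().getNodeName());
    }
}
```

The key point is that a single XMLReader cannot safely run two parses at once, so independent readers are what make the concurrent version possible.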

If you’re interested in more information on the use of SAX interfaces in pipelines, please see Powering Pipelines with JAXP on our Articles and Papers page.

Processing HTML5

Using this HTML5 parser/serializer combination with its SAX API allowed us to conveniently process the HTML5 DOM almost as if we were dealing with XHTML. Once parsed, HTML5 elements are even in the XHTML namespace. You may have noticed from the code extract above that the class that represents the pipeline for Core is XhtmlCompare. This is because we were actually using a customized version of an existing pipeline designed for XHTML. There are of course important differences from XHTML even once parsed. One is the absence of default attributes: for example, to use XSLT filters designed for XHTML we needed to add xml:space="preserve" attributes to pre elements.
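The post does not say how the xml:space attributes were added. One possible approach, sketched here purely as an assumption, is a SAX filter placed between the parser and the rest of the pipeline. The example uses the JDK's XML parser in place of HtmlParser, and PreserveSpaceFilter is a hypothetical class name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

class PreserveSpaceFilter extends XMLFilterImpl {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    PreserveSpaceFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // Add xml:space="preserve" to XHTML pre elements that lack it.
        if (XHTML_NS.equals(uri) && "pre".equals(localName)
                && atts.getIndex(XML_NS, "space") < 0) {
            AttributesImpl extended = new AttributesImpl(atts);
            extended.addAttribute(XML_NS, "space", "xml:space", "CDATA", "preserve");
            atts = extended;
        }
        super.startElement(uri, localName, qName, atts);
    }

    // Runs the filter over an XHTML string and returns the serialized result.
    static String apply(String xhtml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader filtered = new PreserveSpaceFilter(spf.newSAXParser().getXMLReader());

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(
                new SAXSource(filtered, new InputSource(new StringReader(xhtml))),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(apply(
            "<body xmlns=\"http://www.w3.org/1999/xhtml\"><pre>a  b</pre></body>"));
    }
}
```

Since the filter is itself an XMLReader, it could be passed to a SAXSource in place of the raw parser, so downstream XHTML filters would see the attribute as if it had been defaulted.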

Conclusion

We found that by selecting standards-based parsing and serializing solutions, processing HTML5 using an XML-based pipeline can be relatively straightforward. Existing resources for processing XHTML can also be used with only minor adjustments.
