Understanding and explaining the differences between two large XML documents presents significant challenges. This is true, even with a fully optimised XML comparison system such as version 8.2 of XML Compare – released two weeks ago.
The focus for our XML Compare product (formerly “DeltaXML Core” in our business literature) is to generate a raw ‘Delta’ XML comparison result. This result contains embedded annotations and is optimised for further conversion. More recently, we’ve added support for the tracked change formats for a few XML editors using XML Compare’s Document Comparator. There is however a standalone alternative, the ‘DiffReport’, which can be rendered in any browser.
The DiffReport is an HTML rendering of a comparison result, produced by an XSLT filter at the end of the comparison pipeline. The XSLT code for this is distributed with XML Compare so it can be tailored by an integrator to meet specific customer needs. There are actually two versions of this: a ‘side-by-side’ and a ‘folding’ DiffReport.
In this post I describe how we’ve updated the folding DiffReport, allowing for a more effective initial review of XML comparison results, especially when input documents are large and/or the difference count is high.
A browser screenshot of a DiffReport output from an XML Compare comparison
The priority for this visualisation is to help with the scanning and analysis of a significant number of differences between two relatively large XML files, say 5MB in size. A further goal is to provide a good basis for more advanced implementations – such as an XML merge tool.
An XML Syntax View
This DiffReport shows XML syntax as opposed to a ‘What you see is what you get’ (WYSIWYG) view, allowing changes in any XML markup to be viewed without any additional styling information. This XML syntax view should be useful when the comparison pipeline is being optimised, even if the final formatting will ultimately be a WYSIWYG view.
XML formatting in the DiffReport is used to show tree structure, but importantly, CSS styling is used here in preference to whitespace characters for nesting-level associated indentation and line-breaks. Whitespace characters are still explicitly preserved for cases where their context indicates that they are significant.
Because this is a syntax view, the rendering must be capable of representing lexical content that is not part of the standard XML infoset; this includes CDATA sections and internal DocType subsets. The DiffReport therefore understands the way DeltaXML’s DeltaV2 format represents these lexical artefacts and converts them back to the appropriate syntax so that the rendered view shows well-formed XML.
Lexical artefacts are decoded back to their normal serialised form in the DiffReport.
Information about where namespaces should be declared in the DiffReport can only be inferred from the XML infoset. Work on namespace declarations is still in progress, but the chosen approach is to declare all namespaces and their prefixes on the root element of the rendered result. This approach minimises the clutter caused by repeated namespace declarations. It can still be overridden using a special parameter setting (see parameters list below).
The DiffReport layout includes a horizontal toolbar, with an XML path ‘breadcrumbs’ view shown immediately below. The main part of the view is the XML syntax view, but immediately to the left of this is a vertical panel showing a list of all the XML differences.
The toolbar has two buttons on the left for folding/expanding all the nested XML nodes in the XML document. Next is a ‘Switch Style’ button, used to flip between the standard style and a ‘classic’ style. The final two buttons select the previous/next change in the differences list and the XML view.
The DiffReport toolbar
When changes are dispersed throughout a large document, reviewing them effectively depends on being able to skip easily from the current change to a nearby one.
A change is selected either directly from the XML view, from the differences list, by pressing the previous/next buttons on the toolbar, or by using the Up or Down keys on a keyboard. As soon as a new change is selected, the differences list and XML view scroll if the change is not already in view. All parent elements of the change are also unfolded to ensure the change is not hidden.
Folding Tree Nodes
It is helpful to be able to ‘fold’ nodes of an XML document whose content is not of immediate interest. Nodes are rendered as foldable blocks provided they are not found to be within mixed content. The initial view of the DiffReport shows all such nodes collapsed, except those that contain differences. Buttons on the toolbar can be used to expand or collapse all foldable nodes in the XML document in one go.
When element nodes have only a single line of text that is relatively short, it is reasonable to leave such elements always unfolded. By default, the DiffReport leaves elements unfolded when their text content is less than 50 characters in length. This threshold can be changed using the no-fold-size XSLT parameter.
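The folding rules above can be sketched in Java as follows. This is only an illustration, not the DiffReport stylesheet's own logic, which this post does not show. It assumes that changes are marked with a deltaV2 attribute in the namespace given in the code and that the value "A=B" means unchanged. The class and method names are hypothetical, and the mixed-content exclusion is left out for brevity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FoldingSketch {
    // Assumption: namespace used for DeltaV2 annotations in the comparison result.
    static final String DELTA_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    /** True when the element holds exactly one text node shorter than noFoldSize. */
    static boolean alwaysUnfolded(Element e, int noFoldSize) {
        NodeList children = e.getChildNodes();
        return children.getLength() == 1
                && children.item(0).getNodeType() == Node.TEXT_NODE
                && children.item(0).getNodeValue().length() < noFoldSize;
    }

    /** True when the element or any descendant element carries a delta value other than "A=B". */
    static boolean containsChange(Element e) {
        String delta = e.getAttributeNS(DELTA_NS, "deltaV2"); // "" when the attribute is absent
        if (!delta.isEmpty() && !delta.equals("A=B")) {
            return true;
        }
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && containsChange((Element) n)) {
                return true;
            }
        }
        return false;
    }

    /** Initial state: collapsed unless the element contains a change or is short enough to stay open. */
    static boolean initiallyFolded(Element e, int noFoldSize) {
        return !containsChange(e) && !alwaysUnfolded(e, noFoldSize);
    }

    /** Namespace-aware parse helper for trying the sketch on small strings. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Calling `FoldingSketch.initiallyFolded(element, 50)` mirrors the default threshold; a smaller value makes more single-text elements foldable.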
Processing Very Large Files
When the input files for a comparison are very large, the DiffReport file will be larger still and the responsiveness of the browser can suffer. In such cases, the minimize-unchanged-display XSLT parameter can be set to hide XML subtrees that are the same in both input files.
Standard and Classic view styles are available in the DiffReport. Styles are switched from the toolbar and are used to add meaning to the XML view.
Both views show the colour of markup for changed nodes in blue, with unchanged nodes rendered in grey. The Standard view shows XML values coloured according to their node type, as with standard XML syntax-highlighters. It then uses red and green background colours to highlight adds and deletes.
The Classic view instead colours text red or green, in combination with a strikeout or underline, to indicate a deletion or an addition respectively. The Classic view is more suitable for end-users who have difficulties with colour perception, though others may simply prefer it.
1. Standard style
2. Classic style
When rendering XML syntax in HTML, special attention should be paid to preserving whitespace characters where they are determined to be significant. It helps that the view uses CSS styling alone for XML formatting.
Whitespace is preserved if different formatting to that of the parent element is detected.
The rendering stage (XSLT) first tests the input XML to see whether it contains whitespace that appears to be for XML formatting only. If it does, whitespace is normalized by the HTML rendering engine in the usual way, and CSS is used to prevent this behaviour where whitespace is obviously significant. All whitespace within elements marked with xml:space attributes is preserved; likewise, whitespace in CDATA sections is kept.
To assist with normalization of whitespace, extra information is needed to help establish whether whitespace can be collapsed or removed, or whether it must be preserved. In an ideal XML comparison scenario, a referenced XML Schema or DTD would be available to XML Compare’s built-in lexical preservation and normalization filters to provide this information. These filters are found in the comparator’s input pipeline. A custom filter can be added for cases where an XML Schema or DTD is not available. Failing this, a ‘whitespace-detection’ filter is used to analyse whitespace patterns and make an educated guess as to their significance.
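As a rough illustration of the xml:space rule, the Java sketch below checks whether a node sits inside an element whose nearest xml:space declaration is "preserve". The inheritance behaviour follows the XML specification. Whether the DiffReport stylesheet treats an inner xml:space="default" the same way is an assumption, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlSpaceSketch {
    /** Returns true when the nearest xml:space declaration in scope is "preserve". */
    static boolean preservesWhitespace(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:space")) {
                    // The closest declaration wins, so "default" switches preservation off again.
                    return "preserve".equals(e.getAttribute("xml:space"));
                }
            }
        }
        return false; // no declaration in scope
    }

    /** Parse helper for trying the sketch on small strings. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```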
This extra information would allow marking of those elements where whitespace should be preserved. When lexical preservation and normalisation is not used, the DiffReport makes an informed choice on how to render the XML, based on factors such as any apparent indentation and mixed content with non-whitespace text nodes.
Differences in whitespace nodes that the DiffReport deems non-significant are not shown in the XML syntax view. These differences are still shown in the differences list, but they are styled as disabled items. This behaviour can be overridden using the supress-formatting-only-changes parameter.
By default, when XML formatting is detected in the input documents and an element’s content contains formatting inconsistent with that of its parent element, the content is assumed to have special formatting, and its whitespace characters are therefore preserved.
When an alternate method is used within the pipeline for controlling whitespace, the auto-detect feature can be disabled by setting the smart-whitespace-normalization parameter to false. In this case, a CSS rule is applied to ensure no whitespace is normalized in the whole document.
Elements are only assumed to have a mixed content model if they have at least one text node with non-whitespace characters. All child nodes within an identified mixed content container element are rendered inline, that is, with no newlines.
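The mixed-content test described above is simple enough to sketch in Java. This is a minimal illustration of the stated rule, not code from the DiffReport stylesheet, and the class and method names are hypothetical. Only text nodes are inspected, so CDATA sections that the parser reports as separate nodes are ignored here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class MixedContentSketch {
    /** True when at least one child text node contains a non-whitespace character. */
    static boolean isMixedContent(Element e) {
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE && hasNonWhitespace(n.getNodeValue())) {
                return true;
            }
        }
        return false;
    }

    /** Checks for any character other than the four XML whitespace characters. */
    private static boolean hasNonWhitespace(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
                return true;
            }
        }
        return false;
    }

    /** Parse helper for trying the sketch on small strings. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

With this rule, an indented list container holding only whitespace between its items is not treated as mixed content, so its children are rendered as blocks.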
XML Processing Instructions and Comments
Whitespace within XML processing-instructions is always preserved as-is. Whitespace characters inside XML comment nodes are treated slightly differently: the approach here is to keep alignment of formatted lines, but to trim leading whitespace characters that might lead to excessive indentation – given that the CSS already indents nodes anyway. The trim calculation takes into account any changes within the comments also.
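One way to trim a multi-line comment while keeping its alignment is to remove only the indentation shared by every continuation line. The Java sketch below shows that idea under stated assumptions: it is not the stylesheet's actual calculation, it ignores changes inside the comment, it counts a tab as one character, and its names are hypothetical.

```java
class CommentIndentSketch {
    /** Removes the indentation common to all non-blank lines after the first one. */
    static String trimCommonIndent(String comment) {
        String[] lines = comment.split("\n", -1);
        int common = Integer.MAX_VALUE;
        for (int i = 1; i < lines.length; i++) {
            if (!lines[i].trim().isEmpty()) {
                common = Math.min(common, leadingWhitespace(lines[i]));
            }
        }
        if (common == Integer.MAX_VALUE) {
            return comment; // single line, or only blank continuation lines
        }
        StringBuilder out = new StringBuilder(lines[0]);
        for (int i = 1; i < lines.length; i++) {
            // Blank lines may be shorter than the common indent, so clamp the cut point.
            int cut = Math.min(common, lines[i].length());
            out.append('\n').append(lines[i].substring(cut));
        }
        return out.toString();
    }

    /** Counts leading space and tab characters. */
    private static int leadingWhitespace(String line) {
        int count = 0;
        while (count < line.length() && (line.charAt(count) == ' ' || line.charAt(count) == '\t')) {
            count++;
        }
        return count;
    }
}
```

The first line is left untouched because it normally starts straight after the comment's opening delimiter and so carries no indentation of its own.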
XML Attribute formatting
The formatting of attribute values can improve the readability of certain types of XML when the raw syntax is viewed. The XML Compare comparator normalizes newline characters in attribute values, replacing each with a space character; this behaviour conforms to the XML specification. Information on whether attributes start on a new line is also discarded by the XML processor.
The DiffReport compensates for the loss of attribute formatting by automatically starting attributes whose values are longer than 80 characters on a new line. When an element has several attributes, the lengths of all attributes are aggregated when determining whether each should appear on a new line. The newline-att-size XSLT parameter can be used to change the threshold at which attributes are placed on a new line. To remedy the loss of newline characters inside attribute values, the DiffReport detects any sequence of 4 or more space characters and inserts a linefeed before each sequence, inferring that such sequences result from the indentation normally found in multi-line attribute values. An example of attribute formatting behaviour is shown in the DiffReport screenshot below:
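The normalization itself is standard parser behaviour and can be observed with the JDK's built-in XML parser, independently of XML Compare. In the small Java demonstration below (class and method names are hypothetical), a literal newline inside an attribute value comes back as a space, while the indentation spaces that followed it survive.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeNormalizationDemo {
    /** Parses the XML and returns the named attribute of the root element as the parser reports it. */
    static String parsedAttribute(String xml, String attributeName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(attributeName);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // The value contains a newline followed by six spaces of indentation.
        String xml = "<path d=\"M 0 0\n      L 10 10\"/>";
        String value = parsedAttribute(xml, "d");
        System.out.println(value.contains("\n")); // the newline does not survive parsing
        System.out.println(value);
    }
}
```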
Smart attribute formatting can be exploited in the rendered view
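The two attribute heuristics can be sketched in Java as below. This is an illustration under assumptions rather than the stylesheet's implementation: the aggregation is read here as the sum of all attribute value lengths compared against the threshold, attribute names are not counted, and the class and method names are hypothetical.

```java
import java.util.List;

class AttributeLayoutSketch {
    /** Inserts a linefeed before every run of four or more spaces, keeping the run as indentation. */
    static String breakBeforeIndentRuns(String value) {
        return value.replaceAll(" {4,}", "\n$0");
    }

    /**
     * One reading of the aggregation rule: attributes start on new lines when the
     * combined length of their values exceeds the threshold (80 by default).
     */
    static boolean startOnNewLines(List<String> attributeValues, int newlineAttSize) {
        int total = 0;
        for (String value : attributeValues) {
            total += value.length();
        }
        return total > newlineAttSize;
    }
}
```

Applied to a value such as `"M 0 0       L 10 10"`, where a newline was normalized to the first of seven spaces, the first method restores a line break in front of the indentation run.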
A major strength of XML Compare is the flexibility derived from its architecture, which comprises a processing pipeline built from a set of discrete filters. When the DiffReport stylesheet is used, it is placed as the final filter in the pipeline; it is therefore important that it faithfully represents its input without attempting further refinements. Note: the exception to this is whitespace normalization by the DiffReport, but this feature can be turned off.
Any weaknesses in the configuration of the comparator pipeline should therefore be immediately apparent in the output, which is a good thing in most contexts in which this rendering will be used.
Creating a DiffReport
The DiffReport output can be selected from the XML Compare command-line or the built-in GUI, or by adding its XSLT stylesheet as the final filter in a DXP file or via the Java or .NET APIs.
Here’s an example of how the DocumentComparator in XML Compare can be set up to produce a DiffReport. Note how it makes specific settings for lexical preservation and whitespace to keep as much information as possible in the result:
DocumentComparator comparator =
ResultReadabilityOptions resultReadabilityOptions = comparator.getResultReadabilityOptions();
FilterStepHelper fsHelper= comparator.newFilterStepHelper();
FilterStep diffreportStep= fsHelper.newFilterStepFromResource(
FilterChain diffreportChain= fsHelper.newFilterChain();
comparator.compare(input1, input2, resultFile);
Summary of Features
In addition to the layout changes in the DiffReport there are some more subtle changes, some of which may only be obvious for certain kinds of XML input. I’ve summarised the main features below:
- The current change can be selected from:
  - the XML view
  - the Differences panel
  - the ‘previous’/‘next’ buttons on the toolbar
- The Differences panel lists all XML changes in document order
- The XML view supports folding of nested elements – to help review long documents
- XML namespaces are declared in-place, or on the root element
- A ‘breadcrumbs’ bar shows the XML path to the current change
- Elements are unfolded and scrolled so the current change is always visible
- Initially, all elements are folded, except those containing changes
- ‘Block’ elements/nodes are indented according to nesting and shown on new lines
- When word-wrap occurs (for normalized content), indentation is maintained
- ‘Fold’ and ‘Expand’ buttons in the toolbar collapse/expand all nested XML nodes
- Elements within mixed content are shown inline
- Inputs are scanned for indentation to determine when to preserve/normalize whitespace
- xml:space attributes are observed for whitespace preservation
- Whitespace is preserved within CDATA sections
- All lexical information that is preserved by the comparison pipeline will be rendered
- Two colouring styles are available:
  - ‘Standard’ – background colours show adds/deletes in the XML view
  - ‘Classic’ – foreground colours and text styles show adds/deletes in the XML view
- Both colouring styles colourize the node markup:
  - grey foreground – indicates an unchanged node
  - blue foreground – indicates a node that contains changes
- The ‘Standard’ colouring style uses syntax-highlighting for node values
- Whitespace is preserved within XML comments and XML processing-instructions
- Excessive leading whitespace in multi-line XML comments is trimmed
Parameters available for the DiffReport
As mentioned earlier, a number of parameters can be set to override the DiffReport’s default behaviour. These are summarised below:
- smart-whitespace-normalization: Allows whitespace normalization, except where whitespace is most likely to be significant.
- supress-formatting-only-changes: [only applies when smart-whitespace-normalization is in effect] Hides changes to whitespace-only text nodes that appear to be for formatting only.
- To avoid clutter, namespaces are only added to the root element by default. Override this to see namespace declarations on all elements where they are required for well-formedness (they may not be in the same position as in the input documents).
- no-fold-size: When an element has only one text node and this has a character-count less than the no-fold-size, the element will always be shown expanded. The default setting is 50 characters.
- newline-att-size: Attributes with a value with a character-count greater than the newline-att-size will be rendered on a new line, as will all following attributes. The default setting is 80 characters.
- minimize-unchanged-display: By default this is disabled. Setting this parameter results in unchanged subtrees being suppressed from the DiffReport; only the top-level element of an unchanged subtree is included.
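How these parameters reach the stylesheet depends on the route used (command line, DXP or API), which this post does not detail. As a generic illustration, the Java sketch below sets XSLT parameters through the standard JAXP API. The stylesheet in it is a stand-in that only echoes two parameter values and is not the DiffReport stylesheet. The parameter names and defaults are taken from the list above, while the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltParameterSketch {
    // Stand-in stylesheet (an assumption for illustration): declares two parameters
    // with their documented defaults and writes their effective values as text.
    static final String STAND_IN_STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:param name='no-fold-size' select='50'/>"
          + "<xsl:param name='newline-att-size' select='80'/>"
          + "<xsl:template match='/'>"
          + "<xsl:value-of select=\"concat($no-fold-size, ',', $newline-att-size)\"/>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Runs the stand-in stylesheet with the given parameter overrides and returns its text output. */
    static String transform(String inputXml, Map<String, Object> parameters) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STAND_IN_STYLESHEET)));
            // Each entry overrides the stylesheet's default for the parameter of the same name.
            parameters.forEach(transformer::setParameter);
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(inputXml)), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Passing an empty map leaves both defaults in place; passing `Map.of("no-fold-size", "120")` overrides only that parameter.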