Enhanced Whitespace Handling

1. Introduction

Whitespace processing is often a key consideration when XML documents are compared. This document outlines how the whitespace-processing features of DeltaXML Core have been enhanced for Release 8.2.

A more complete description of whitespace-processing can be found in the following guides:

2. Detail

2.1. Ignorable Whitespace

LexicalPreservation

'Ignorable Whitespace' refers to whitespace-only text nodes found in places in the XML tree where text is not allowed (these are most frequently added for formatting purposes). Core's LexicalPreservation filter can identify such nodes and treat them specially:

  • Previously, only referenced DTDs were supported. Core can now also exploit a referenced XML Schema to identify ignorable whitespace nodes.
  • If 'preserveIgnorableWhitespace' is set to 'true', text nodes identified as 'ignorableWhitespace' are now wrapped in a 'preserve:ignorable' element. Previously they were kept but not marked in any special way.
  • There are now 'ProcessingMode' and 'OutputType' LexicalPreservation options for specifying how ignorable whitespace nodes are treated in the output pipeline.
  • A 'grammar' attribute is added to the root element of each input XML document, with a value of 'dtd' or 'schema'. This is added to indicate whether the LexicalPreservation filter exploits DTD or XML Schema information respectively for whitespace processing.
  • To prevent normalization by a subsequent NormalizeSpace filter, a 'mixed-content' attribute with a value of 'true' is now added to any element where text nodes are allowed by the DTD or Schema.
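
The wrapping of ignorable whitespace described above can be illustrated with a minimal Python sketch. This is not Core's implementation: the function name 'wrap_ignorable' and the 'element_only_names' set (standing in for what a DTD or XML Schema would report as element-only content) are hypothetical, and binding the 'preserve' prefix on the root element is an assumption made so that the result serializes cleanly.

```python
from xml.dom import minidom

PRESERVE_NS = "http://www.deltaxml.com/ns/preserve"
XML_WS = " \t\r\n"  # the four XML whitespace characters


def wrap_ignorable(doc, element_only_names):
    """Wrap whitespace-only text nodes in a preserve:ignorable element.

    element_only_names is a hypothetical stand-in for grammar information:
    the names of elements whose content model does not allow text.
    """
    # Assumption: declare the prefix on the root so the output is namespace-valid.
    doc.documentElement.setAttribute("xmlns:preserve", PRESERVE_NS)
    for elem in list(doc.getElementsByTagName("*")):
        if elem.tagName not in element_only_names:
            continue
        for child in list(elem.childNodes):
            is_text = child.nodeType == child.TEXT_NODE
            if is_text and child.data.strip(XML_WS) == "":
                wrapper = doc.createElementNS(PRESERVE_NS, "preserve:ignorable")
                elem.replaceChild(wrapper, child)  # put the wrapper where the text was
                wrapper.appendChild(child)         # keep the original whitespace inside it
    return doc


doc = minidom.parseString("<list>\n  <item>a b</item>\n</list>")
wrap_ignorable(doc, {"list"})
print(doc.documentElement.toxml())
```

Only the whitespace directly inside 'list' is wrapped; the text inside 'item' is left alone because 'item' is not in the element-only set.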

NormalizeSpace

As in previous releases, the 'NormalizeSpace' filter will not normalize whitespace-only text nodes if a 'mixed-content' attribute with a value of 'true' is found on the parent element. All other whitespace-only text nodes are treated as 'ignorable whitespace' and removed. Behaviour has been enhanced in the following ways for cases where no DTD or XML Schema for the input XML has been loaded:

  • For each element name found in the input XML, a 'pre-normalization' filter determines whether 'non-whitespace' text can occur in immediate child text nodes. All elements identified as such are marked with a mixed-content="true" attribute.
  • If no DTD or XML Schema information was added by LexicalPreservation, and no 'grammar' attribute is found on the root element of the input XML, the 'pre-normalization' filter will add a 'grammar' attribute with a value of 'inferred'.
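
The inference performed by the 'pre-normalization' filter can be sketched roughly as follows. This is an illustration only, not the 'whitespace-detection.xsl' logic itself: the function names are hypothetical, and placing the attributes in the 'deltaxml' namespace is an assumption chosen for the example.

```python
import xml.etree.ElementTree as ET

DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
XML_WS = " \t\r\n"


def has_text(value):
    """True if a text or tail value contains non-whitespace characters."""
    return value is not None and value.strip(XML_WS) != ""


def mark_mixed_content(root):
    """Mark every element whose name is seen with non-whitespace child text."""
    # Pass 1: collect element names that carry non-whitespace text in an
    # immediate child text node anywhere in the document.
    mixed = set()
    for elem in root.iter():
        if has_text(elem.text) or any(has_text(child.tail) for child in elem):
            mixed.add(elem.tag)
    # Pass 2: mark all elements with those names, including instances that
    # happen to contain only whitespace.
    for elem in root.iter():
        if elem.tag in mixed:
            elem.set(f"{{{DELTAXML_NS}}}mixed-content", "true")
    # Record that the content model was inferred, unless a grammar is already noted.
    grammar = f"{{{DELTAXML_NS}}}grammar"
    if grammar not in root.attrib:
        root.set(grammar, "inferred")
    return mixed


root = ET.fromstring(
    "<doc>\n <p>Hello <b>w</b> there</p>\n <p> </p>\n <sec>\n  <p>x</p>\n </sec>\n</doc>"
)
print(sorted(mark_mixed_content(root)))
```

Because the decision is made per element name, the second 'p' element is marked even though it contains only a space, which protects that space from later normalization.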

2.2. Preserving Whitespace in Text Content

In previous releases, the NormalizeSpace filter reduced any sequence of whitespace characters in significant text content to a single space character unless an xml:space="preserve" attribute was found on an ancestor element. This behaviour has now been extended as follows:

  • The 'pre-normalization' filter analyses whitespace occurring in significant text content in the input XML. A deltaxml:space="preserve" attribute is then added to the parent element if the whitespace characteristics differ significantly from those expected for a normally indented text node in formatted XML.
  • Whitespace-only text nodes that are significant (i.e. not 'ignorable') are considered to have special meaning and are therefore marked by the pre-normalization filter with a deltaxml:space="preserve" attribute to prevent normalization.
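
The effect of these 'preserve' markers on normalization can be sketched as below. This is a simplified model rather than the NormalizeSpace filter: it only collapses whitespace runs in text content, it does not remove ignorable whitespace, and treating deltaxml:space as inherited by descendants in the same way as xml:space is an assumption made for the example.

```python
import re
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
DELTAXML_SPACE = "{http://www.deltaxml.com/ns/well-formed-delta-v1}space"


def collapse(text):
    """Reduce each run of XML whitespace characters to a single space."""
    return re.sub(r"[ \t\r\n]+", " ", text)


def normalize(elem, preserve=False):
    """Collapse whitespace in text content unless a 'preserve' marker applies."""
    if elem.get(XML_SPACE) == "preserve" or elem.get(DELTAXML_SPACE) == "preserve":
        preserve = True
    if not preserve and elem.text:
        elem.text = collapse(elem.text)
    for child in elem:
        normalize(child, preserve)
        # A child's tail is text content of the current element.
        if not preserve and child.tail:
            child.tail = collapse(child.tail)


root = ET.fromstring(
    '<doc><p>a  \n  b</p><pre xml:space="preserve">x   y</pre></doc>'
)
normalize(root)
print(ET.tostring(root, encoding="unicode"))
```

The text in 'p' is collapsed to a single space between words, while the text in 'pre' keeps its three spaces.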

2.3. Keeping Whitespace Metadata

As described in the previous sections, information is added by the LexicalPreservation and NormalizeSpace filters to the input XML to assist with whitespace processing. This information, kept as 'grammar', 'mixed-content' and 'space' attributes, may in some cases be useful for formatting the comparison result (our own 'folding' DiffReport exploits this).

For this reason there is now a new LexicalPreservation 'PreserveContentModel' setting:

  • When the PreserveContentModel setting is 'true', the 'grammar', 'mixed-content' and 'space' attributes are preserved throughout the pipeline; otherwise they are removed.
  • The PreserveContentModel setting works by selecting which namespace is used for the 'content model' attributes. The 'deltaxml' namespace is used when PreserveContentModel is 'false'; otherwise, the 'preserve' namespace is used.
  • Attributes used for content model information are likely to be removed by the Core pipeline if they are in the 'deltaxml' namespace, but will always be preserved if they are in the 'preserve' namespace.
  • The option of using 'preserve' and 'deltaxml' namespaces* also provides backwards compatibility with the 'mixed-content' attribute which can be added by a custom input filter.

* The 'preserve' and 'deltaxml' namespaces are respectively:

  • http://www.deltaxml.com/ns/preserve
  • http://www.deltaxml.com/ns/well-formed-delta-v1
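
The namespace selection described above amounts to a simple switch. The following Python sketch models it with hypothetical helper names; it is not part of the Core API and only uses the two namespace URIs listed above.

```python
PRESERVE_NS = "http://www.deltaxml.com/ns/preserve"
DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"

# Local names of the 'content model' attributes named in this section.
CONTENT_MODEL_ATTRS = ("grammar", "mixed-content", "space")


def content_model_namespace(preserve_content_model):
    """Return the namespace used for content model attributes."""
    return PRESERVE_NS if preserve_content_model else DELTAXML_NS


def content_model_attribute(local_name, preserve_content_model):
    """Return the attribute name in Clark notation, i.e. '{namespace}local'."""
    if local_name not in CONTENT_MODEL_ATTRS:
        raise ValueError(f"not a content model attribute: {local_name}")
    return f"{{{content_model_namespace(preserve_content_model)}}}{local_name}"


print(content_model_attribute("mixed-content", True))
print(content_model_attribute("mixed-content", False))
```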

3. Whitespace Processing In Practice

The type of comparator used for comparison determines whether whitespace-processing filters are controlled implicitly through comparator properties or by adding the filters explicitly. This is summarised below.

Filters affecting whitespace processing - '✓' indicates implicit control via comparator properties

#  Filter name                        DocumentComparator  PipelinedComparatorS9  PipelinedComparator
1  LexicalPreservation
2  lexical-whitespace.xsl
3  DocumentComparator ExtensionPoint  -                   -                      -
4  whitespace-detection.xsl
5  NormalizeSpace
6  Comparator
7  LexicalPreservation (out-filters)

DocumentComparator (or DCP)

The whitespace processing changes are fully integrated into the DocumentComparator.

PipelinedComparatorS9 (or DXP)

For LexicalPreservation, whitespace processing is managed automatically through LexicalPreservationConfig properties.

Because the NormalizeSpace filter is added to the pipeline explicitly, the 'pre-normalization' filter should be added immediately before it (unless an XML Schema or DTD will always be loaded). The 'pre-normalization' filter is added as the resource XSLT filter 'whitespace-detection.xsl'.

PipelinedComparator

Here, the LexicalPreservation filter is added explicitly. It should therefore be followed immediately by the 'lexical-whitespace.xsl' resource XSLT filter.

The NormalizeSpace filter is also added explicitly. It should therefore be immediately preceded by the 'whitespace-detection.xsl' resource XSLT filter.
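
The two ordering rules for explicitly added filters can be expressed as a small check. The Python sketch below is purely illustrative: it treats an input filter chain as a list of filter names and uses a hypothetical function name; it does not use or represent the PipelinedComparator API.

```python
def check_filter_order(filters):
    """Return a list of ordering problems found in an input filter chain."""
    problems = []
    for i, name in enumerate(filters):
        following = filters[i + 1] if i + 1 < len(filters) else None
        preceding = filters[i - 1] if i > 0 else None
        # Rule 1: LexicalPreservation is followed immediately by lexical-whitespace.xsl.
        if name == "LexicalPreservation" and following != "lexical-whitespace.xsl":
            problems.append(
                "LexicalPreservation is not immediately followed by lexical-whitespace.xsl"
            )
        # Rule 2: NormalizeSpace is immediately preceded by whitespace-detection.xsl.
        if name == "NormalizeSpace" and preceding != "whitespace-detection.xsl":
            problems.append(
                "NormalizeSpace is not immediately preceded by whitespace-detection.xsl"
            )
    return problems


chain = [
    "LexicalPreservation",
    "lexical-whitespace.xsl",
    "whitespace-detection.xsl",
    "NormalizeSpace",
]
print(check_filter_order(chain))
print(check_filter_order(["LexicalPreservation", "NormalizeSpace"]))
```

The first chain follows both rules and yields no problems; the second omits both resource XSLT filters and yields one problem per rule.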