Getting the processing of input whitespace right before comparison is important. The best approach depends largely on the type of XML files you are comparing and what you want to do with the result.
The W3C XML specification describes how an XML processor should treat whitespace in documents: "An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content."
When comparing two XML trees, it is important to consider the implications of including whitespace nodes in the comparison. Whitespace nodes can take up a huge percentage of space in an in-memory tree, particularly if the documents were "pretty-printed". This affects not only the memory consumption of the comparison process but also its speed and accuracy. Unless there is a good reason for retaining whitespace in the inputs, it is a good idea to remove it prior to comparison.
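To get a feel for how much of a pretty-printed tree consists of whitespace, the following minimal sketch counts whitespace-only text nodes using Python's standard xml.dom.minidom parser. It is only an illustration and not part of DeltaXML; the function name whitespace_node_share and the sample documents are assumptions made for this example.

```python
from xml.dom import minidom


def whitespace_node_share(xml_text):
    """Return (whitespace_only_text_nodes, total_nodes) below the root element."""
    doc = minidom.parseString(xml_text)
    whitespace_only = 0
    total = 0
    stack = [doc.documentElement]
    while stack:
        node = stack.pop()
        total += 1
        # A text node made up solely of whitespace characters.
        if node.nodeType == node.TEXT_NODE and node.data.strip() == "":
            whitespace_only += 1
        stack.extend(node.childNodes)
    return whitespace_only, total


# The same content, once pretty-printed with newlines and tabs, once compact.
pretty = """<document-root>
\t<title>Document Title</title>
\t<paragraph>Some paragraph text</paragraph>
</document-root>"""
compact = (
    "<document-root><title>Document Title</title>"
    "<paragraph>Some paragraph text</paragraph></document-root>"
)

print(whitespace_node_share(pretty))   # (3, 8)
print(whitespace_node_share(compact))  # (0, 5)
```

In this small sample, three of the eight nodes in the pretty-printed tree are whitespace-only text nodes, while the compact form has none.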
When using the LexicalPreservation filter (standard for PipelinedComparatorS9 and DocumentComparator), the PreserveItem IgnorableWhitespace can be used to explicitly exclude or include ignorable whitespace. See the Lexical Preservation Guide for more detail.
Release 8.2 of DeltaXML Core provides enhanced features for managing both ignorable and 'non-ignorable' whitespace. The Enhanced Whitespace Handling guide provides further information on the new features of the lexical preservation and normalization filters, along with a description of the additional built-in filter for inferring whitespace significance.
Because whitespace has many uses within an XML document, we need to find a way of referring to different types of whitespace nodes so that we can potentially treat them differently. This section will use examples to show where whitespace nodes can occur.
Inter-element whitespace nodes are typically used to pretty-print XML files to make them easier to read. The following example highlights inter-element carriage returns (CR) and tabs (T) that are typically used to pretty-print the XML. Note that any whitespace can be used; tabs are often replaced by multiple space characters.
<document-root>CR
T<title>Document Title</title>CR
T<paragraph>Some paragraph text</paragraph>CR
</document-root>
An element is said to contain mixed content if it can contain PCDATA and, optionally, element content. This definition occurs in the W3C XML Specification. In mixed content, whitespace nodes typically appear in between consecutive elements and are usually significant because they define the space between words. If the highlighted whitespace node in the following example (space shown as _) were to be removed, the two bold words would appear to be joined together in a typical renderer.
<!DOCTYPE document-root [
  <!ELEMENT document-root (title, paragraph*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT paragraph (#PCDATA | b)*>
]>
<document-root>
  <title>Document Title</title>
  <paragraph>Some paragraph text with <b>two</b>_<b>bold</b> words.</paragraph>
</document-root>
It is important to note that PCDATA whitespace nodes also qualify as mixed content whitespace. Because they are not inter-element whitespace, they may need to be treated as significant. The following example highlights space characters (shown as _ characters) that appear in PCDATA-only content:
<!DOCTYPE document-root [
  <!ELEMENT document-root (title, paragraph*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT paragraph (#PCDATA | b)*>
]>
<document-root>
  <title>____</title>
  <paragraph>Some paragraph text with <b>bold</b> words.</paragraph>
</document-root>
Please see the mixed content section below to see how these types of whitespace are differentiated by an XML Parser.
As mentioned in the introduction, the best way of handling whitespace depends on what type of XML files you are comparing and what the result file will be used for. Listed below are some typical scenarios along with recommendations on what to do with whitespace.
If your XML files are data files as opposed to documents, it is highly likely that almost all whitespace nodes are insignificant. It is also likely that there will be a DTD or XML Schema associated with the files that defines the data format. In this scenario the best practice would be to remove input whitespace using the NormalizeSpace Java filter.
If your XML files are documents, e.g. DocBook, DITA or OpenDocument, then some of the whitespace will be significant and some of it won't. Typically, whitespace that occurs where PCDATA is allowed should be treated as significant but other inter-element whitespace should not. If you are processing these files in order to highlight changes, you will probably not be too concerned about how the resultant XML is output, i.e. there is no need to indent or pretty-print. The best approach for this scenario is to remove inter-element whitespace and normalize significant whitespace by setting the 'IgnorableWhitespace' PreserveItem in LexicalPreservation to false and using the NormalizeSpace Java filter.
If you edit your files in an XML view in an XML editor, you will understand that inter-element whitespace is significant for human-readable documents. If you are comparing documents and then continuing to work on the result files in an editor, you will probably want the result whitespace to look as close to the input whitespace as is possible. In this scenario, you will probably not want to use the NormalizeSpace filter but you need to be aware that differences in whitespace nodes will be highlighted in the delta file and, if you are post-processing it with further output filters, you may need to decide how to handle modified whitespace. Please see the handling whitespace differences section below for an example approach.
The NormalizeSpace filter is a Java implementation of an XML filter that normalizes whitespace and PCDATA nodes that occur in XML documents. The DocumentComparator invokes the NormalizeSpace filter automatically when the ModifiedWhitespaceBehaviour property of ResultReadabilityOptions is set to 'NORMALIZE'. For other comparators, this filter should be added explicitly to the input pipeline when normalization is required.
Normalization is the process of removing inter-element whitespace nodes (known as 'ignorable whitespace') or converting multiple consecutive whitespace characters in a PCDATA node into a single space character. You can read more about the filter in the DeltaXML Core API documentation. It is important to note that, by default, when NormalizeSpace receives a characters() SAX event that contains only whitespace characters, it will ignore it, thus removing it from the input completely. For these characters to be normalized rather than removed, the containing element must be defined as containing mixed content. See the mixed content section for how this can be achieved.
The W3C specification for XPath defines the normalize-space() string function that can be used in XSLT filters for normalizing space on a string. The NormalizeSpace filter follows this definition, except that it does not remove leading and trailing spaces because this can lead to the removal of significant whitespace characters. Consider the string " words and ", which appears between the b and i elements in the following example:
<para>This paragraph contains <b>bold</b> words and <i>italic</i> words.</para>
If the leading and trailing spaces were removed from this string, as would be the case if you were to use the normalize-space() function on it, there would no longer be spaces between words that appear in the bold and italic elements before and after it.
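The difference between the two behaviours can be sketched in a few lines of Python. This is an illustration of the rules described above rather than the NormalizeSpace implementation; the function names xpath_normalize_space and collapse_only are hypothetical.

```python
import re

# XML whitespace characters: space, tab, carriage return, line feed.
_XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def xpath_normalize_space(text):
    """Mimic XPath normalize-space(): collapse runs, then trim both ends."""
    return _XML_WS_RUN.sub(" ", text).strip(" ")


def collapse_only(text):
    """Collapse whitespace runs to one space but keep leading/trailing space."""
    return _XML_WS_RUN.sub(" ", text)


# The text between </b> and <i>, with some extra whitespace added.
fragment = " words   and\n\t"

print(repr(xpath_normalize_space(fragment)))  # 'words and'
print(repr(collapse_only(fragment)))          # ' words and '
```

With the trimming variant, the text would butt up against the neighbouring b and i elements; the collapse-only variant keeps one space on each side.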
The NormalizeSpace filter should be preceded by the built-in XSLT 'whitespace-detection.xsl' filter or a user-defined equivalent (whitespace-detection.xsl should not be added for the DocumentComparator as this is used implicitly). The responsibility of this filter is to ensure that mixed-content elements (that may contain significant whitespace nodes) or elements that should not be normalized at all are properly marked.
Treatment of whitespace is determined in the following ways:
The NormalizeSpace filter treats xml:space attributes in accordance with their definition in the XML specification. If you want to normalize space in most of your document but keep all spaces within specific subtrees, you can add the xml:space="preserve" attribute to those elements where you want whitespace to remain untouched. If you don't want these attributes to be persisted in the output, you can use deltaxml:space="preserve" attributes instead. Note that such attributes are often added as a default attribute when parsing a document that refers to a DTD or XML Schema with certain elements in document formats, e.g.
The W3C XML Specification includes a section on attribute value normalization. In order to comply with this, NormalizeSpace leaves attribute values alone. If you want attribute values to be normalized (regardless of whether they were defined as CDATA or NMTOKEN), you can configure NormalizeSpace to do so by passing true to the setnormalizeAttValues() method. Note that when using the DocumentComparator, the NormalizeSpace filter is used internally so this method is
For NormalizeSpace to correctly handle mixed content whitespace-only nodes (see the definition above), the containing element must be defined correctly. There are four ways of defining whitespace nodes as appearing in mixed content:
deltaxml:mixed-content="true" to mark elements where mixed content can occur
Consider the whitespace node that appears in the following example (marked with _):
<root> <para>This paragraph has <b>bold words and</b>_<i>italic words</i>.</para> </root>
The following examples show how to use two of these options to inform NormalizeSpace to treat this space as mixed content rather than inter-element (ignorable) whitespace and therefore not remove it.
Example 1: Use a DTD
This example uses an inline DTD but the same effect can be achieved using an external document.
<!DOCTYPE root [
  <!ELEMENT root (para*)>
  <!ELEMENT para (#PCDATA | b | i)*>
]>
<root>
  <para>This paragraph has <b>bold words and</b> <i>italic words</i>.</para>
</root>
Example 2: Use the deltaxml attribute
<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"> <para deltaxml:mixed-content="true">This paragraph has <b>bold words and</b> <i>italic words</i>.</para> </root>
The deltaxml:mixed-content attribute only applies to the element on which it appears; it does not apply to child elements. If you want them to be treated as mixed content, you must explicitly add the attribute to them as well.
The example above would typically not be something seen by a user or author of the content. It is possible to dynamically add attributes such as this using an XSLT input filter. We would recommend such filters be created by a mechanistic analysis of a DTD or other schema associated with the content being processed.
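The rules described in this section can be illustrated with a small Python sketch built on the standard xml.dom.minidom parser. It is a simplified model under stated assumptions, not the NormalizeSpace implementation: it removes whitespace-only text nodes unless the parent element carries deltaxml:mixed-content="true", collapses whitespace runs without trimming, and leaves subtrees marked xml:space="preserve" untouched. The function name normalize and the sample document (including the code element) are hypothetical.

```python
import re
from xml.dom import minidom

DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
XML_NS = "http://www.w3.org/XML/1998/namespace"
_WS_RUN = re.compile(r"[ \t\r\n]+")


def normalize(element, preserve=False):
    """Illustrative only: normalize whitespace below `element` in place."""
    # xml:space is inherited by descendants until it is overridden.
    space = element.getAttributeNS(XML_NS, "space")
    if space == "preserve":
        preserve = True
    elif space == "default":
        preserve = False
    # The mixed-content flag applies to this element only, not its children.
    mixed = element.getAttributeNS(DELTAXML_NS, "mixed-content") == "true"
    for child in list(element.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            normalize(child, preserve)
        elif child.nodeType == child.TEXT_NODE and not preserve:
            collapsed = _WS_RUN.sub(" ", child.data)
            if collapsed == " " and not mixed:
                element.removeChild(child)  # ignorable inter-element whitespace
            else:
                child.data = collapsed      # collapse runs, never trim


source = (
    '<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">\n'
    '  <para deltaxml:mixed-content="true">This   paragraph has '
    '<b>bold words and</b> <i>italic words</i>.</para>\n'
    '  <code xml:space="preserve">a   b</code>\n'
    '</root>'
)
doc = minidom.parseString(source)
normalize(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

After normalization, the indentation around para and code is gone, the space between the b and i elements survives as a single space, and the content of the preserved code element is unchanged.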
The way in which differences in whitespace should be handled is context dependent. The DocumentComparator has a ModifiedWhitespaceBehaviour property (a ResultReadabilityOptions setting) for controlling this. For other comparators a custom filter is required.
For the purposes of this discussion we shall focus on one example. Suppose we have a document where whitespace is being used to indent the XML for readability of multi-line paragraphs, and that we want the existing document's indentation to be kept. In this context, the precise location of the line breaks and indents within two versions of a paragraph may change, which may require us to align a single space with a 'line break' and/or 'indent'. Such changes in whitespace are typically irrelevant and can be removed from the result by a post processing filter.
Removing changed whitespace from the output appears to be straightforward. All that needs to be done is to identify modified text whose change is only in whitespace and then select either the 'A' or 'B' (old or new) version of the input. However, this approach does not handle the cases where:
The following XSLT template can be used to perform the whitespace modification on text. The notInsidePreservationSubtree function is used to determine whether this node can have changes in its whitespace removed, and the whitespace-mode variable specifies whether to keep the first or second (i.e. 'A' or 'B') document's whitespace.
<xsl:template match="deltaxml:textGroup[@deltaxml:deltaV2='A!=B']
                     [matches(string(.), '^\s+$')]
                     [deltaxml:notInsidePreservationSubtree(.)]">
  <xsl:copy>
    <xsl:attribute name="deltaxml:ignore-changes"
                   select="if ($whitespace-mode = 'keepA') then 'A' else 'B'" />
    <xsl:apply-templates select="@* except @deltaxml:ignore-changes, node()" />
  </xsl:copy>
</xsl:template>
Note that in order for such processing to have an effect, the whitespace within the input documents should not be normalized.
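The selection logic behind such a filter can be sketched outside XSLT as follows. This is a hedged illustration in Python, not DeltaXML code: the function resolve_whitespace_change is hypothetical, and it only covers the simple case in which both versions of the text consist solely of whitespace.

```python
import re

# XML whitespace only: space, tab, carriage return, line feed.
_WS_ONLY = re.compile(r"[ \t\r\n]+")


def resolve_whitespace_change(text_a, text_b, whitespace_mode="keepA"):
    """Return the text to keep, or None when the change is not whitespace-only."""
    both_whitespace = (
        _WS_ONLY.fullmatch(text_a) is not None
        and _WS_ONLY.fullmatch(text_b) is not None
    )
    if text_a == text_b or not both_whitespace:
        return None  # leave the change for normal difference reporting
    return text_a if whitespace_mode == "keepA" else text_b


# A single space in 'A' aligned with a line break plus indent in 'B'.
print(repr(resolve_whitespace_change(" ", "\n    ", "keepA")))  # ' '
print(repr(resolve_whitespace_change(" ", "\n    ", "keepB")))  # '\n    '
print(resolve_whitespace_change("old text", "new text"))        # None
```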
When elements are marked with either the deltaxml:mixed-content or deltaxml:space attribute, we are helping define their content and how they should be processed. A deltaxml:grammar attribute is also added to the root element to describe where this information came from; it may have the values 'inferred', 'dtd' or 'schema'. This information is used by the whitespace-detection.xsl and NormalizeSpace filters, but it may also be useful at other stages in the pipeline, or at the serialization stage.
Such attributes in the 'deltaxml' namespace are likely to be removed either within the pipeline (by NormalizeSpace for example) or as part of the clean-up filter. To ensure these attributes are persisted throughout but keep the same behaviour, they can instead be placed in the 'preserve' namespace. This behaviour is controlled by the PreserveContentModel setting of the LexicalPreservationConfig class and also the 'preserve-content-model' parameter of the whitespace-detection.xsl filter, but may also be exploited by custom filters.
Note that preserved 'ContentModel' attributes are kept in the comparison output, even if they were only added to one of the input documents.
An example of how this content model information can be exploited can be found in the built-in 'dx2-deltaxml-folding-html.xsl' filter used to produce the folding DiffReport view. This information helps determine the layout and CSS properties for the HTML rendered view.