How to Manage White Space
Handling of "white space" is a maddeningly frequent cause of problems when handling XML. Let us begin with the W3C spec [1]: "An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content."
For many applications, particularly those dealing with "document-centric" XML, the default behaviour is exactly as expected: for example, poetry keeps its line breaks. For "data-centric" applications, though, this can be infuriating, since a file such as "purchaseOrders.xml" is very commonly pretty-printed to improve readability. The XML WG concluded that the best compromise was to assume that the "data-centric" people would in general be validating their documents, hence the second sentence in the quote above. When a validating parser is used with a document that has a DTD, white space in element content can be flagged as "ignorable", so that the application can discard it rather than treat it as data. Some parsers, such as Xerces-J, can also use a W3C XML Schema to determine which white space nodes can be ignored and which should be considered relevant.
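As a rough illustration of the validating case, the sketch below uses the standard JAXP DOM API rather than any DeltaXML API. The class name, the sample document and its tiny internal DTD are all hypothetical. It parses the same pretty-printed document twice, once ignoring the DTD for white space purposes and once validating with element-content white space suppressed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class IgnorableWhiteSpaceDemo {

    // Hypothetical sample: a pretty-printed document with an internal DTD subset.
    static final String XML =
        "<!DOCTYPE a [\n"
        + "  <!ELEMENT a (b)>\n"
        + "  <!ELEMENT b EMPTY>\n"
        + "]>\n"
        + "<a>\n"
        + "  <b/>\n"
        + "</a>";

    /** Parses XML and returns the number of child nodes of the root element. */
    static int rootChildCount(boolean validate) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validate);
        // Only meaningful when validating: the DTD tells the parser that <a>
        // has element content, so white space inside it carries no data.
        factory.setIgnoringElementContentWhitespace(validate);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler ignores warnings and recoverable errors silently.
        builder.setErrorHandler(new DefaultHandler());
        Document doc = builder.parse(new InputSource(new StringReader(XML)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("non-validating: " + rootChildCount(false));
        System.out.println("validating:     " + rootChildCount(true));
    }
}
```

With the JDK's built-in parser, the non-validating run should see three children of the root (two white space text nodes around the element), while the validating run should see only the element itself.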
What has this to do with us? To perform a comparison we first load the two documents into our highly efficient in-memory "micro-DOM" representation. When a document without an associated schema or DTD is parsed, a logical "node" is created for each element, attribute, comment, processing instruction and PCDATA chunk (sometimes called a "text node"), including white space. This means that a white space node is created (in each input document) for every run of newlines, tabs or spaces between tags. DeltaXML then compares the two node trees, and when these trees are cluttered with irrelevant white space nodes there can be drastic effects on:
- speed;
- memory consumption;
- accuracy.
Since these are three of the key reasons for using DeltaXML, we need to look at this rather more closely!
First, speed: even with the algorithms we use, doubling the number of nodes (which can easily happen when a document gets pretty-printed) will typically halve the speed or worse. Consider the "matching" problem when trying to align children between two documents, and now consider how much more complex this becomes with intervening white space nodes of differing content: some newlines, some single characters, some with multiple characters.
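To make the doubling concrete, here is a minimal sketch that counts nodes in a compact and a pretty-printed version of the same data. It uses the standard JAXP DOM as a stand-in, since the internal "micro-DOM" is not shown here; the class name and sample documents are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeCountDemo {

    /** Counts the given node plus all of its descendants (attributes excluded). */
    static int countNodes(Node node) {
        int count = 1;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    /** Parses with a non-validating parser and counts nodes from the root element down. */
    static int countNodes(String xml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
        return countNodes(root);
    }

    public static void main(String[] args) throws Exception {
        String compact = "<order><item/><item/><item/></order>";
        String pretty = "<order>\n  <item/>\n  <item/>\n  <item/>\n</order>";
        System.out.println("compact: " + countNodes(compact)); // 4 element nodes
        System.out.println("pretty:  " + countNodes(pretty));  // 4 elements + 4 white space text nodes
    }
}
```

In this small case pretty-printing takes the tree from four nodes to eight, with every pair of sibling elements now separated by a text node that the comparison has to align.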
Second, memory consumption: every white space node takes up memory of its own. Since an optimal comparison requires both trees to be in memory simultaneously, we want to remove extraneous nodes.
Less obvious perhaps is the effect on accuracy. If (irrelevant) white space actually differs between the two input documents, and the parser is non-validating or cannot use a schema or DTD, you will unsurprisingly see changes reported that you actually want to exclude. More subtly, the extra "spurious" white space nodes give more opportunity for a non-optimal alignment. Technically, the result will still be "correct", but it may not be "as expected".
So what options do we have? In brief:
- Associate a DTD or schema with your document.
- Use XSLT to strip white space.
- Use a high-performance Java filter.
The preferred solution is to use schema association, either by referencing a DTD (by DOCTYPE) or schema (by schemaLocation), or by using a feature setting on your parser. When you cannot do this, we recommend stripping the white space in your documents. The <xsl:strip-space> element is designed for this purpose. For example, we ship a simple "normalize-space.xsl" XSLT stylesheet which uses <xsl:strip-space> and also normalizes PCDATA and attribute content. You may need to process both documents first to ensure that
<a><b/></a>
matches
<a> <b/> </a>
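The XSLT route can be sketched as follows, using the JDK's built-in XSLT 1.0 processor. The stylesheet here is an illustrative identity transform with <xsl:strip-space>, not the shipped "normalize-space.xsl", and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class StripSpaceDemo {

    // Illustrative stylesheet: copy everything, but drop white-space-only
    // text nodes from every element of the source document.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:output method='xml' omit-xml-declaration='yes' indent='no'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the stylesheet over the given XML string and returns the serialized result. */
    static String strip(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(strip("<a><b/></a>"));
        System.out.println(strip("<a> <b/> </a>"));
    }
}
```

Both inputs should serialize to the same string once the white-space-only text nodes are stripped. Note that elements="*" strips such nodes everywhere; a document with mixed content may need a narrower element list or an <xsl:preserve-space> declaration.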
For maximum performance and memory efficiency, try removing white space with a Java pre-process filter before it reaches the comparator:
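import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

class WhitespaceFilter extends XMLFilterImpl {
    // Forward character data only if it contains something other than white space.
    // Note: a parser may deliver text in several chunks, and each chunk is tested separately.
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (!new String(ch, start, length).trim().isEmpty())
            super.characters(ch, start, length);
    }

    // White space already flagged as ignorable by a validating parser is never forwarded.
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // always ignore
    }
}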
Using DTD or schema association offers a simpler solution. If you do want to handle this preprocessing yourself using SAX, you may want to study the SAX ignorableWhitespace callback [2], used here, for more detail.
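One possible way to wire such a filter in front of a plain SAX content handler is sketched below. It uses only the standard SAX and JAXP classes, not the DeltaXML pipeline API; the filter logic is repeated from the filter above so that the example compiles on its own, and the wrapper class, method name and sample input are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

final class FilterChainDemo {

    /** Same logic as the filter above, nested here to keep the sketch self-contained. */
    static class WhitespaceFilter extends XMLFilterImpl {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (!new String(ch, start, length).trim().isEmpty()) {
                super.characters(ch, start, length);
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // always ignore
        }
    }

    /** Parses the XML through the filter and returns the text chunks that got through. */
    static List<String> textEvents(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        WhitespaceFilter filter = new WhitespaceFilter();
        filter.setParent(reader); // the filter sits between the parser and the handler
        final List<String> events = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                events.add(new String(ch, start, length));
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textEvents("<a>\n  <b>x</b>\n</a>"));
    }
}
```

For the sample input, only the text "x" should reach the downstream handler; the indentation around the inner element is dropped before any tree is built.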
With the DeltaXML pipeline approach, it is straightforward to chain together as many pre-process steps as necessary. You may also want to add white space back during post-processing - see [3].
Finally, a quick note about an alternative and most unusual method for pretty-printing, allowing for easier readability, that does not change the "infoset". The fragment
<a><b/></a>
is identical, when parsed by an XML parser (validating or not), to
<a
><b
/></a
>
Here the line breaks are placed inside the begin-tags and end-tags, and so do not appear as white space nodes. This format, though, is very seldom used - perhaps because few see this as "pretty" printing!
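This claim is easy to check with the standard JAXP DOM API. The sketch below (hypothetical class and method names) compares the parsed trees using Node.isEqualNode, with each line break placed inside a tag and, for the empty element, before the closing "/>".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class InTagLineBreakDemo {

    /** Parses the string with a non-validating parser and returns its root element. */
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
    }

    /** True if both strings parse to structurally equal trees (names, attributes, children). */
    static boolean sameTree(String first, String second) throws Exception {
        return parseRoot(first).isEqualNode(parseRoot(second));
    }

    public static void main(String[] args) throws Exception {
        String compact = "<a><b/></a>";
        String breaksInsideTags = "<a\n><b\n/></a\n>";
        String breaksBetweenTags = "<a>\n<b/>\n</a>";
        System.out.println(sameTree(compact, breaksInsideTags));  // no extra text nodes
        System.out.println(sameTree(compact, breaksBetweenTags)); // white space text nodes differ
    }
}
```

The first comparison should report equal trees and the second should not, since only line breaks between tags become text nodes.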
Weblinks:
- [1] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-white-space -- W3C XML 1.0 specification on white space
- [2] http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html#ignorableWhitespace(char[],%20int,%20int) -- SAX API documentation for ContentHandler.ignorableWhitespace