DeltaXML Newsletter - March 2005
Welcome to our March newsletter. This month we look at how CSW have used DeltaXML within an XML publishing and knowledge management application, in use by the UK Royal Pharmaceutical Society and others. We also have a Technical Corner article looking at the handling of white space when using DeltaXML.
DeltaXML continues to develop as a world leader in XML differencing and change control. DeltaXML technology is currently used in a wide range of fields including banking and finance, electronic and print publishing, the energy industries and aviation. A glance at our list of existing customers shows the diversity of uses to which DeltaXML technology has already been adapted. In each case it has delivered precisely what it advertises - XML change control, in XML.
We are always on the lookout for talented people to join us in the further development of DeltaXML. We work with the very best - if you know of someone who has appropriate skills to offer, we would be delighted to hear from them.
We also welcome your feedback, whether it is about aspects of DeltaXML technology or about our website or newsletter. Please get in touch if there's anything you'd like to suggest or discuss. Evaluation downloads of the DeltaXML Core API, together with full documentation and online demos, are always available at: http://www.deltaxml.com/
- The DeltaXML Team.
Contents
In this newsletter:
- Recent DeltaXML Customers
- CSW - Integrated Healthcare Publishing
- Technical Corner: Handling White Space with DeltaXML
- Diary Dates
Customer Focus: Recent DeltaXML Customers
Recent customers include:
- Ordnance Survey (UK) - Britain's national mapping agency.
CSW - Integrated Healthcare Publishing
CSW is playing a key role in the UK government's "Electronic Care Records" initiative, working to build a "data spine" connecting 55 million patient records for timely and secure delivery of clinical information. With government backing, the UK healthcare industry as a whole is embracing a standards-driven strategy based upon XML, both for real-time EPR initiatives and for more traditional publishing. CSW developed their "KMS" Knowledge Management System in collaboration with the Pharmaceutical Press, the publishing arm of the Royal Pharmaceutical Society (RPS), and are now rolling it out to other leading healthcare and pharmaceutical publishers. DeltaXML plays a key role in managing the content within these systems.
Technical architects at Pharmaceutical Press were already comfortable with an XML strategy, and CSW were able to integrate their existing work with the new infrastructure to manage production of key publications earmarked for the KMS system, such as Stockley's Drug Interactions and Clarke's Analysis of Drugs and Poisons. Using KMS as a content repository and publishing framework allows for "single-source publishing" in a variety of formats - a traditional benefit of XML-based publishing. More interesting from a technical perspective is the enrichment of the data through use of clinical ontologies such as SNOMED, with a vocabulary of 500,000 clinical terms, which allow more complex "fuzzy" associations to be used. Using these tools, it is possible to build searches which return "related" information without a fixed predefinition of these relations, crucially important in this fast-changing field.
In contrast to the Pharmaceutical Press, other publishers using KMS are less familiar with XML technologies and are accustomed to working with documents submitted in Microsoft Word. Although in some cases there is an XML schema, it is generally tied to a specific presentation, which CSW refactors to a presentation-neutral grammar.
Where users more familiar with XML might be encouraged to use an XML editor such as Arbortext EPIC instead, CSW decided to build upon the existing expertise and provided an integration layer allowing documents in WordML - the Microsoft Word 2003 XML syntax - to be transformed into this format, one more appropriate for the "semantically rich" repository within KMS. This sophisticated transformation technology, built using XSLT 2.0, allows for full "round-tripping" between WordML and the repository schema without information loss - a non-trivial task.
Through careful design of MS Word "styles" it is now possible for content authors and editors to continue to work in a familiar and productive environment, with all the benefits of single-source publishing and with the use of clinical ontologies to enhance the raw content to produce output that is easy to use, accessible in many formats and very rich.
CSW architects realised at an early stage that they needed to allow for content evolution, for multiple revisions of documents within review and approval workflows. DeltaXML provided the solution they needed, allowing robust version control and auditing of documents within the repository with clear identification of changes and automated change processing. As Niki Dinsey, Lead Architect on both projects, says:
"DeltaXML has allowed KMS to produce highly precise XML deltas on publication content with the greatest of ease. We are now able to provide HTML and PDF rendered change control documents that greatly improve production processes for our customers. DeltaXML makes a complex task very simple and very very fast. Thanks DeltaXML!"
CSW is moving towards a vision of integrated health care information, combining the routine clinical record with knowledge bases held in XML and coded with SNOMED. Back in the 1980s, CSW's founder Dr John Chelsom gained his PhD for work on building medical reasoning systems and the application of knowledge based systems in medicine. As this vision becomes a reality, the new challenge for CSW and the publishers using KMS is to move from the publications they currently handle, with longer publication cycles of one or two years, to continuous updates, and to delivery of information directly to patient management systems. With such rapidly changing content, solid change management is proving critical to success.
Weblinks:
http://www.rpsgb.org.uk/ - The Royal Pharmaceutical Society of Great Britain
http://www.csw.co.uk/ - CSW Group Limited
Technical Corner: Handling White Space with DeltaXML
Handling of "white space" is a maddeningly frequent cause of problems when
handling XML. If we begin with the W3C spec[1]:
"An XML processor MUST always pass all characters in a document that are not
markup through to the application. A validating XML processor MUST also inform
the application which of these characters constitute white space appearing in
element content."
For many applications, particularly those dealing with "document-centric" XML, the default behaviour is exactly as expected: for example, poetry keeps its line breaks. For "data-centric" applications, though, this can be infuriating, since for a "purchaseOrders.xml" file, for example, it is very common to use pretty-printing to improve readability. The XML WG concluded that the best compromise was to assume that the "data-centric" people would in general be validating their documents, and hence the second sentence in the quote above. When using a validating parser with a document which has a DTD, white space in element content can be flagged as "ignorable", so the application can discard it rather than treat it as text. Some parsers, such as Xerces-J, can also use a W3C XML Schema to determine which white space nodes can be ignored and which should be considered relevant.
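As a rough illustration of that second sentence of the spec, the sketch below parses a small document with a validating SAX parser and counts how many characters arrive as ordinary text and how many arrive through the ignorableWhitespace callback. The sample document, its inline DTD and the class names are invented for this example, and the code assumes the JAXP parser bundled with the JDK; other parsers may report things differently.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class IgnorableWhitespaceDemo {

    /** Counts characters reported as text and as ignorable white space. */
    static class Counter extends DefaultHandler {
        int textChars = 0;
        int ignorableChars = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            textChars += length;
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            ignorableChars += length;
        }
    }

    // Hypothetical sample: "order" has element-only content, declared in an internal DTD subset.
    static final String SAMPLE =
        "<!DOCTYPE order [\n"
      + "  <!ELEMENT order (item+)>\n"
      + "  <!ELEMENT item (#PCDATA)>\n"
      + "]>\n"
      + "<order>\n"
      + "  <item>pen</item>\n"
      + "  <item>ink</item>\n"
      + "</order>";

    /** Parses the XML string and returns the counts gathered by the handler. */
    static Counter parse(String xml, boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        SAXParser parser = factory.newSAXParser();
        Counter counter = new Counter();
        parser.parse(new InputSource(new StringReader(xml)), counter);
        return counter;
    }

    public static void main(String[] args) throws Exception {
        Counter c = parse(SAMPLE, true);
        System.out.println("text=" + c.textChars + " ignorable=" + c.ignorableChars);
    }
}
```

The pretty-printing inside "order" should show up in the ignorable count, while the text of each "item" is still reported as character data.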
What has this to do with us? To perform a comparison we first load the two documents into our highly efficient in-memory "micro-DOM" representation. When a document without an associated schema or DTD is parsed, a logical "node" is created for each element, attribute, comment and processing instruction, and for each PCDATA chunk (sometimes called a "text node"), including white space. This means a white space node is created (in each input document) for every newline and tab. DeltaXML then performs a comparison of node-trees, and when these trees are cluttered with irrelevant white space nodes there can be drastic effects on
- speed;
- memory consumption;
- accuracy.
Since these are three of the key reasons for using DeltaXML, we need to look at this rather more closely!
First, speed: even with the algorithms we use, doubling the number of nodes (which can easily happen when a document gets pretty-printed) will typically halve the speed or worse. Consider the "matching" problem when trying to align children between two documents - and now consider how much more complex this becomes with intervening white space nodes whose content differs: some newlines, some single characters, some runs of several characters.
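To get a feel for how quickly pretty-printing inflates a tree, the sketch below counts element and text nodes for the same data in compact and indented form. It uses the standard W3C DOM from the JDK as a stand-in, not DeltaXML's own micro-DOM, and the sample documents and class name are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeCountDemo {

    // Hypothetical sample data: the same content, compact and pretty-printed.
    static final String COMPACT =
        "<orders><order><id>1</id><qty>2</qty></order></orders>";
    static final String PRETTY =
        "<orders>\n"
      + "  <order>\n"
      + "    <id>1</id>\n"
      + "    <qty>2</qty>\n"
      + "  </order>\n"
      + "</orders>";

    /** Recursively counts a node and all of its descendants. */
    static int countNodes(Node node) {
        int count = 1;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    /** Parses without a DTD or schema, so every white space run becomes a text node. */
    static int countNodes(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Node root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        return countNodes(root);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("compact: " + countNodes(COMPACT));
        System.out.println("pretty:  " + countNodes(PRETTY));
    }
}
```

For this sample the indented form yields 11 nodes against 6 for the compact form, and every one of the extra nodes is a white space text node.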
Second, memory consumption: this is evidently an issue. Since an optimal comparison requires both trees to be in memory simultaneously, we want to remove extraneous nodes.
Less obvious perhaps is the effect on accuracy. If (irrelevant) white space is actually different in the two input documents, you will unsurprisingly see changes reported (by a non-validating parser or one that cannot use a schema or DTD) that you actually want to exclude. More subtly, the extra "specious" white space nodes give more opportunity for a non-optimal alignment. Technically, the result will still be "correct", but it may not be "as expected".
So what options do we have? In brief:
1. Associate a DTD or schema with your document.
2. Use XSLT to strip white space.
3. Use a high-performance Java filter.
The preferred solution is to use schema association, either by referencing a
DTD (by DOCTYPE) or schema (by schemaLocation), or by using a feature setting on
your parser. When you cannot do this, we recommend stripping the white space in
your documents. The <xsl:strip-space> element is designed for this
purpose. For example, we ship a simple "normalize-space.xsl" XSLT stylesheet
which uses <xsl:strip-space> and also normalizes PCDATA and attribute
content. You may need to process both documents first to ensure that
<a><b/></a>
matches
<a> <b/> </a>
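A minimal sketch of the XSLT route, using the JAXP transformer that ships with the JDK, might look like the following. The stylesheet is a plain identity transform with <xsl:strip-space> written for this example; it is not the shipped "normalize-space.xsl", and it does not normalize PCDATA or attribute content. The class and method names are likewise hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StripSpaceDemo {

    // Hypothetical identity stylesheet that strips white-space-only text nodes.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:strip-space elements='*'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Runs the stylesheet over the input and returns the serialized result. */
    static String strip(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Both inputs should serialize to the same text once white space is stripped.
        System.out.println(strip("<a><b/></a>"));
        System.out.println(strip("<a> <b/> </a>"));
    }
}
```

Running both inputs through the same stylesheet before comparison is what makes the two fragments line up.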
For maximum performance and memory efficiency, try removing white space with a Java pre-process filter before it reaches the comparator:
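import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// SAX filter that drops white-space-only text before it reaches the comparator.
class WhitespaceFilter extends XMLFilterImpl {
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Forward the chunk only if it contains something other than white space.
        if (!new String(ch, start, length).trim().equals(""))
            super.characters(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // always ignore
    }
}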
Using DTD or schema association offers a simpler solution - and if you want to handle this preprocessing yourself using SAX, you may want to study the SAX ignorableWhitespace callback[2], used here, for more detail.
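One way to try the filter outside DeltaXML is to sit it between a standard JAXP XMLReader and a handler that records the text events it receives. The sketch below repeats the filter so that it compiles on its own. The collector class, the method names and the sample input are invented for illustration, and nothing here shows how the filter is wired into the DeltaXML pipeline itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class FilterUsageDemo {

    /** Same logic as the filter above, repeated so this sketch is self-contained. */
    static class WhitespaceFilter extends XMLFilterImpl {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (!new String(ch, start, length).trim().isEmpty()) {
                super.characters(ch, start, length);
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // always ignore
        }
    }

    /** Hypothetical downstream consumer: records every text event it receives. */
    static class TextCollector extends DefaultHandler {
        final List<String> chunks = new ArrayList<>();

        @Override
        public void characters(char[] ch, int start, int length) {
            chunks.add(new String(ch, start, length));
        }
    }

    /** Parses the XML through the filter and returns the text events that survive. */
    static List<String> textEvents(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        WhitespaceFilter filter = new WhitespaceFilter();
        filter.setParent(reader);               // events flow: reader -> filter -> collector
        TextCollector collector = new TextCollector();
        filter.setContentHandler(collector);
        filter.parse(new InputSource(new StringReader(xml)));
        return collector.chunks;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textEvents("<a>\n  <b>text</b>\n</a>"));
    }
}
```

Note that SAX parsers may deliver text in several chunks, so a filter of this kind tests each chunk separately; in mixed content a chunk that happens to be pure white space would also be dropped.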
With the DeltaXML pipeline approach, it is straightforward to chain together
as many pre-process steps as necessary.
You may also want to add white space back during post-processing.
Finally, a quick note about an alternative and most unusual method for
pretty-printing, allowing for easier readability, that does not change the
"infoset". The fragment
<a><b/></a>
is identical, when parsed by an XML parser (validating or not) to
<a
><b
/></a >
Here the line breaks are placed inside the tags themselves, and so do not appear as white space nodes. This format, though, is very seldom used - perhaps because few see this as "pretty" printing!
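This is easy to check with any DOM parser. The sketch below, which assumes the JDK's built-in parser and uses an invented class name, compares the compact fragment first with a version whose line breaks sit inside the tags (with the break before the "/>" of the empty-element tag, which must stay together), and then with a version whose line breaks sit between the tags.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class InTagBreaksDemo {

    /** Parses an XML string and returns its document element. */
    static Element parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
    }

    /** True if both strings parse to structurally equal element trees. */
    static boolean sameTree(String left, String right) throws Exception {
        return parse(left).isEqualNode(parse(right));
    }

    public static void main(String[] args) throws Exception {
        String compact = "<a><b/></a>";
        String breaksInsideTags = "<a\n><b\n/></a\n>";
        String breaksBetweenTags = "<a>\n<b/>\n</a>";

        // Line breaks inside tags are markup, so no text nodes are created.
        System.out.println(sameTree(compact, breaksInsideTags));
        // Line breaks between tags become white space text nodes.
        System.out.println(sameTree(compact, breaksBetweenTags));
    }
}
```

The first comparison reports equal trees and the second does not, since only the second variant introduces text nodes.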
Weblinks:
http://www.deltaxml.com/newsletters/DXNewsletter-2004-02.html#performance -- Performance Tuning for DeltaXML
http://www.deltaxml.com/core/tutorial/coretutor-6-6.html -- DeltaXML Core Tutorial on white space handling
[1] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-white-space -- W3C XML 1.0 specification on white space
[2] http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html#ignorableWhitespace(char[],%20int,%20int) -- SAX ContentHandler ignorableWhitespace callback
Diary Dates
http://www.xtech-conference.org/ - XTech 2005, Amsterdam, 24-27 May 2005
Weblinks:
DeltaXML news: http://www.deltaxml.com/news/
Please let us know whether this newsletter has been useful to you; we welcome any suggestions about information you'd like discussed in future editions. We'll be back next month with another edition.
Copyright © 2005 DeltaXML and Monsell EDM Ltd.
Newsletter archive:
http://www.deltaxml.com/newsletters/
Newsletter subscription management:
http://lists.deltaxml.com/mailman/listinfo/open-newsletter