Advanced DeltaXML Techniques

Chapter 1. Configuring DeltaXML

1.1. Setting API features and properties

The DeltaXML Core API follows the SAX standard of using features and properties to select options during comparisons and recombinations. Properties and Features are used to configure the XMLComparator and XMLCombiner through the comparator.setProperty, comparator.setFeature, combiner.setProperty and combiner.setFeature methods as appropriate. The following features and properties are available:

Full details of all options are included in the DeltaXML Documentation.

Example - setting the "full delta" feature

PipelinedComparator pc= new PipelinedComparator(); 
pc.setComparatorFeature("http://deltaxml.com/api/feature/isFullDelta", true);       
pc.compare(new File(args[0]), 
           new File(args[1]), 
           new File(args[2])); 
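
For readers unfamiliar with the SAX convention mentioned above, the following sketch shows the same feature-setting pattern on a plain SAX XMLReader from the JDK. It uses only standard JDK classes and the standard SAX2 namespaces feature URI; it does not call any DeltaXML API, and the class name SaxFeatureDemo is purely illustrative.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class SaxFeatureDemo {
    // Standard SAX2 feature URI (defined by SAX, not by DeltaXML).
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    // Sets a feature by URI, then reads it back to confirm the setting.
    static boolean enableNamespaces() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);
            return reader.getFeature(NAMESPACES);
        } catch (Exception e) {
            // Wrap checked parser/SAX exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("namespaces feature enabled: " + enableNamespaces());
    }
}
```

As in the DeltaXML example, an option is identified by a URI string and, for features, takes a boolean value.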

1.2. Using XSLT - input and output filters

Customizing DeltaXML to your requirements is most simply achieved by automatically pre-processing your documents - perhaps removing extraneous whitespace or marking some elements as orderless - and then post-processing the output into the format you require. To do this you use XSL input and output filters.

Typical uses of input filters include:

Input and output filters can be combined with the DeltaXML Core differencing engine to create a processing pipeline. Conceptually, a pipeline is a sequence of processing elements, each of which does something to the data it receives and then passes that data (possibly transformed in some way) on to the next element in the pipeline. As such it is a little like piping the output from one command into the input of another in Unix.

To make advanced use of the XML pipeline architecture requires a basic understanding of TrAX. The Apache Xalan TrAX pages provide an excellent overview of this standard as well as detailed working code, mostly generic code that will work with any TrAX implementation.
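
As a rough illustration of what a TrAX pipeline looks like, the sketch below chains two XSLT stages using only the JDK's javax.xml.transform classes. The two inline stylesheets (whitespace normalization, then renaming <b> to <strong>) and the class name TraxPipelineDemo are invented for this example; no DeltaXML comparison takes place between the stages.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TraxPipelineDemo {
    // Shared stylesheet header plus an identity template.
    static final String HEAD =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'><xsl:copy>"
      + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>";

    // Stage 1 (hypothetical filter): normalize whitespace in text nodes.
    static final String NORMALIZE = HEAD
      + "<xsl:template match='text()' priority='1'>"
      + "<xsl:value-of select='normalize-space(.)'/></xsl:template></xsl:stylesheet>";

    // Stage 2 (hypothetical filter): rename <b> elements to <strong>.
    static final String RENAME = HEAD
      + "<xsl:template match='b'><strong>"
      + "<xsl:apply-templates select='@*|node()'/></strong></xsl:template></xsl:stylesheet>";

    static String run(String xml) {
        try {
            SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler first =
                stf.newTransformerHandler(new StreamSource(new StringReader(NORMALIZE)));
            TransformerHandler second =
                stf.newTransformerHandler(new StreamSource(new StringReader(RENAME)));
            StringWriter out = new StringWriter();
            // Wire the stages together: first -> second -> serialized output.
            first.setResult(new SAXResult(second));
            second.setResult(new StreamResult(out));
            // An identity transformer parses the input and feeds SAX events to stage 1.
            stf.newTransformer().transform(
                new StreamSource(new StringReader(xml)), new SAXResult(first));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<p>  hello   <b> world </b></p>"));
    }
}
```

Each stage receives SAX events from the previous one, so no intermediate document is written to disk.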

However, to simplify this process DeltaXML have provided the PipelinedComparator class, which takes a list of input and output XSL filters. These filters are used internally to create the appropriate TrAX pipeline structures, which then pre- and post-process the data. An example of using a number of predefined input and output filters is presented in the following section.

1.3. Building a pipeline

Using the PipelinedComparator class, it is possible to create a pipeline of input and output filters merely by specifying the Java class and/or XSLT file that defines each filter. For example, the following Java code instantiates a new PipelinedComparator and configures it to use two filters. The first is an input filter that normalizes whitespace; it is defined by a Java class called NormalizeSpace and referenced using the ".class" extension. The second is an output filter that post-processes the delta generated by the DeltaXML differencing engine to create an HTML version of the differences.

   PipelinedComparator pc= new PipelinedComparator(); 
   List infilters= new ArrayList(); 
   infilters.add(NormalizeSpace.class); 
   pc.setInputFilters(infilters); 
   List outFilters= new ArrayList(); 
   outFilters.add(new File("deltaxml-tables.xsl")); 
   pc.setOutputFilters(outFilters); 
   pc.compare(new File("a.xml"), 
              new File("b.xml"), 
              new File("out.html"));

Two methods are used to set up the input and output filter chains in this pipeline: setInputFilters and setOutputFilters. Each takes a list of filters.

Note that the Java source for NormalizeSpace and the XSLT source for deltaxml-tables.xsl are provided in the DeltaXML Core distribution.

1.4. Custom comparisons - XHTML, Schema, word-by-word

Using the pipeline approach it is possible to chain together transformations on both input and output to give customized comparisons for particular document types. For example, XHTML comparisons can ignore whitespace except inside <pre> elements. For Schema, the contents of a <choice> element are conceptually orderless - we flag them as such (adding a deltaxml:ordered="false" attribute - see Orderless comparisons) using an input filter. Many other such optimizations have been included.

For textual comparisons, a "word-by-word" pipeline is available which identifies changes to individual words. This creates a Microsoft Word(TM) style markup of additions, changes and deletions within PCDATA. An example of adding word-by-word pre- and post-processing to the PipelinedComparator is presented below:

   PipelinedComparator pc= new PipelinedComparator(); 
   // Set up input filters (applied in list order)
   List infilters= new ArrayList(); 
   infilters.add(NormalizeSpace.class); 
   infilters.add(WordByWordInfilter.class); 
   pc.setInputFilters(infilters); 
   // Set up output filters (applied in list order)
   List outFilters= new ArrayList(); 
   outFilters.add(WordByWordOutfilter1.class); 
   outFilters.add(WordByWordOutfilter2.class); 
   outFilters.add(new File("deltaxml-tables.xsl")); 
   // Register the output filters with the comparator
   pc.setOutputFilters(outFilters); 
   // Start the pipeline processing
   pc.compare(new File("a.xml"), 
              new File("b.xml"), 
              new File("out.html"));

This example also illustrates the ability to specify more than one input or output filter. Note that the filters are applied in the order in which they are defined in the lists that hold them.

If you'd like to process any of these types, or another generic document type such as SOAP or BizTalk, please contact us to discuss our currently available filters.

Chapter 2. Advanced Features

2.1. Orderless comparisons

When comparing two versions of a document, changes in ordering may be found which you need to ignore. For example, an external addressList feed may contain unsorted <person> elements, each of which contains sorted child elements. In this case, specify an orderless comparison:

<addressList deltaxml:ordered="false"> 
   <person id="1"> 
      ... 
   </person> 
   <person id="2"> 
       ... 
   </person> 
</addressList> 

DeltaXML will report no differences between this document and one having the <person> elements the other way round.

An element which you specify as orderless should not contain any text data or whitespace; it should have only elements as children. Note also that the "ordered" property is not communicated to the child elements - so you can nest ordered elements within orderless within ordered, etc. For real-world problems this is far more useful than a global "ignore order when comparing these documents" switch offered by some products.

These attributes need not be added manually - see Using filters to automate keyed and orderless comparisons. For detailed instructions on making effective use of orderless comparisons, see our white paper on Key-assisted Comparisons.
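
To make the idea of an orderless comparison concrete, here is a deliberately simplified sketch using only JDK classes. It is not DeltaXML's algorithm: it looks only at the id attributes of the root's child elements (an assumption made for brevity) and treats them as a sequence in the ordered case and as an unordered collection in the orderless case. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OrderlessDemo {
    // Collects the id attribute of each child element of the root, in document order.
    static List<String> childIds(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            List<String> ids = new ArrayList<>();
            NodeList kids = root.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node n = kids.item(i);
                if (n instanceof Element) {
                    ids.add(((Element) n).getAttribute("id"));
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Ordered comparison: the sequence of children must match exactly.
    static boolean equalOrdered(String a, String b) {
        return childIds(a).equals(childIds(b));
    }

    // Orderless comparison: sort both sides first so position no longer matters.
    static boolean equalOrderless(String a, String b) {
        List<String> x = childIds(a), y = childIds(b);
        Collections.sort(x);
        Collections.sort(y);
        return x.equals(y);
    }

    public static void main(String[] args) {
        String a = "<addressList><person id='1'/><person id='2'/></addressList>";
        String b = "<addressList><person id='2'/><person id='1'/></addressList>";
        System.out.println("ordered:   " + equalOrdered(a, b));
        System.out.println("orderless: " + equalOrderless(a, b));
    }
}
```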

2.2. Using keys for precision control

When comparing two lists, two essential operations are involved. The first is alignment, which decides which items should be compared to which; the second is comparison. This is made more complex since XML is a tree-structured language, but at each level the same operations must be applied.

DeltaXML allows fine control over the alignment phase through use of keys. For example, when comparing legal documents you need to be sure that corresponding paragraphs are always aligned - by specifying a unique key for each paragraph, paragraph additions, deletions and changes are correctly reported. The syntax for specifying a key is straightforward:

<para deltaxml:key="para1"> 
   ... 
</para> 

These keys need not be added manually, as the next section shows. Detailed instructions on using keys are available in our white paper on Key-assisted Comparisons.
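
The effect of keys on the alignment phase can be sketched in a few lines of plain Java. This is an illustration of the idea only, not DeltaXML's implementation: each list is modelled as a map from key to content (a simplification assumed here), items with the same key are aligned, and only aligned items are then compared. The class name KeyedAlignmentDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KeyedAlignmentDemo {
    // Aligns items by key, then compares the content of each aligned pair.
    static List<String> diff(Map<String, String> a, Map<String, String> b) {
        List<String> report = new ArrayList<>();
        // Union of keys: those from a first, then any that appear only in b.
        Set<String> keys = new LinkedHashSet<>(a.keySet());
        keys.addAll(b.keySet());
        for (String key : keys) {
            if (!b.containsKey(key)) {
                report.add("deleted " + key);
            } else if (!a.containsKey(key)) {
                report.add("added " + key);
            } else if (!a.get(key).equals(b.get(key))) {
                report.add("changed " + key);
            } else {
                report.add("unchanged " + key);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, String> a = new LinkedHashMap<>();
        a.put("para1", "First paragraph.");
        a.put("para2", "Second paragraph.");
        Map<String, String> b = new LinkedHashMap<>();
        b.put("para1", "First paragraph, revised.");
        b.put("para3", "A new paragraph.");
        for (String line : diff(a, b)) {
            System.out.println(line);
        }
    }
}
```

Because alignment is decided by key alone, a heavily edited paragraph is still reported as a change to that paragraph rather than as a deletion plus an unrelated addition.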

2.3. Using filters to automate keyed and orderless comparisons

The deltaxml:key and deltaxml:ordered features described previously allow great flexibility in managing comparisons to get the results you want. An approach taken by many users is to modify the systems producing these documents to add appropriate key/ordered attributes when the documents are generated. For other users, the document format is fixed, and cannot easily be modified - yet they still want to use these features.

By using an input filter to pre-process the incoming file, appropriate attributes can be added without changing the input documents. For our previous examples, we need an input filter which adds deltaxml:ordered="false" to all addressList items, a simple process. Adding keys typically requires a little more care. Some unique property of the element being processed must be available - this may be an attribute value, PCDATA contents of a child (or other descendant) element, a child's attributes, or any combination of these.
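
A minimal sketch of such an input filter follows, run here through the JDK's standard XSLT processor so it can be tried in isolation. Several things are assumptions for illustration: the namespace URI bound to the deltaxml prefix should be checked against the DeltaXML Documentation, the id attribute of each <person> is assumed to be unique and therefore usable as a key, and the class name InputFilterDemo is invented.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class InputFilterDemo {
    // Assumed namespace URI for the deltaxml prefix; verify before use.
    static final String DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    static final String FILTER =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:deltaxml='" + DELTAXML_NS + "'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      // Identity template: copy everything not matched below.
      + "<xsl:template match='@*|node()'><xsl:copy>"
      + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>"
      // Mark every addressList as orderless.
      + "<xsl:template match='addressList'><xsl:copy>"
      + "<xsl:attribute name='deltaxml:ordered'>false</xsl:attribute>"
      + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>"
      // Key each person by its id attribute (assumed unique).
      + "<xsl:template match='person[@id]'><xsl:copy>"
      + "<xsl:attribute name='deltaxml:key'><xsl:value-of select='@id'/></xsl:attribute>"
      + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the filter stylesheet to an XML string and returns the result.
    static String apply(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(FILTER)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(apply(
            "<addressList><person id='1'><name>Ann</name></person></addressList>"));
    }
}
```

Saved as a stylesheet file, a filter along these lines could presumably be added to the input filter list of a PipelinedComparator in the same way that deltaxml-tables.xsl is added to the output filter list in Chapter 1.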

Chapter 3. Using DeltaXML as a Merge Tool

3.1. Merging

DeltaXML offers three types of merging:

A paper delivered by DeltaXML CTO Robin La Fontaine at XML Europe 2002 discusses the issues and details a practical solution - see Merging XML files (PDF).

If you wish to be informed of ongoing DeltaXML developments, please subscribe to our newsletter.