Managing change in an XML environment

How to Compare Orderless Elements

1 Introduction

Any element in an XML file can be declared as orderless and DeltaXML will match up and compare the child elements regardless of the order in which they appear. What counts about orderless data is membership in a set, not relative position. Arbitrary rearrangement leaves the meaning of the set unaffected. To detect changes in sets, DeltaXML must correlate matching elements in spite of rearrangements. Keys help DeltaXML perform this correlation. Keys remove any need for sorting. (Orderless comparisons have been supported since version 2.0 of DeltaXML.)

The special-purpose attributes called keys simplify DeltaXML comparisons in many situations. The structure of XML is not completely rigid; as with machine parts, a little flexibility lends strength. For instance XML permits both ordered and orderless data. Keys help DeltaXML track changes in both categories. Keys increase both the quality of the matches made and the runtime speed of DeltaXML comparisons. For orderless data keys should be provided whenever possible to get the best results.

Key attributes impose no requirements on data formats. Extensible Stylesheet Language (XSLT) scripts can generate keys on the fly, whenever needed, and discard them in post-processing. The original XML remains unaffected by their use.

One point that should be noted, before reading further, is that it is very important to remove whitespace only PCDATA nodes during orderless comparison (for example using the com.deltaxml.pipe.filters.NormalizeSpace Java XML Filter). Also, note that an orderless element must not contain PCDATA as an immediate child: it will cause an exception (com.deltaxml.api.UnorderedElementContainingPCDATAException). The reason is that while elements are separated unambiguously by tags, if two PCDATA items are put next to each other they are in effect merged into one PCDATA item, so orderless PCDATA makes no sense.

2 How to declare orderless data

Ordering is often unimportant during data capture or storage. For example, entries in an electronic address book can be stored in any order. Assume that we have the following entries stored as our address data and wish to track any modifications using DeltaXML:

Example 1: an address book XML document (documentA.xml in the sample directory)

<addressList>
   <person customerid="15">
     <name>Joe Bloggs</name>
     <email>jblogs@msn.com</email>
     <address>
       <line>1234 Green Lane</line>
       <line>London</line>
       <zip>SW1 7WJ</zip>
     </address>
   </person>
  <person customerid="14">
     <name>John Doe</name>
     <email>jdoe@hotmail.com</email>
     <address>
       <line>1234 East Beverley Drive</line>
       <line>Phoenix</line>
       <zip>AZ 12345</zip>
     </address>
   </person>
   <person customerid="62">
     <name>Joe Bloggs</name>
     <email>jblogs@hotmail.com</email>
     <address>
       <line>1234 Green Lane</line>
       <line>London</line>
       <zip>SW1 7WJ</zip>
     </address>
  </person>
</addressList>

The first step is to tell DeltaXML that <address> contents are orderless

Example 2: Adding the orderless attribute using an XSLT template (part of address-key.xsl in the sample directory)

<xsl:template match="addressList">
    <xsl:copy>
        <xsl:attribute name="deltaxml:ordered" select="'false'"/>
        <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
</xsl:template>

The attribute deltaxml:ordered="false" is the signal. It tells the DeltaXML comparison program that the set of <person> elements nested within the <addressList> element are orderless. (Actually, any direct child element of <addressList> will be considered orderless, not just <person>s.)

The second step is to provide DeltaXML with a means of matching elements across the two input files. In our example the id attribute can serve as a matching key. Two elements, one from each input file, are considered to match if, and only if, their respective keys match. When the keys match DeltaXML scans the element in question and records any changes found within its contents. Note how we record the key information using an attribute that is clearly flagged as being in the deltaxml namespace:

Example 3: Adding keys using an XSLT template (part of address-key.xsl in the sample directory)

<xsl:template match="person">
    <xsl:copy>
        <xsl:attribute name="deltaxml:key" select="@customerid"/>
        <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
</xsl:template>

While it might seem more sensible to reference the id attribute than to duplicate its value, concrete reasons exist for such a protocol. Keys can exhibit much more complexity than shown here.

The final result after applying these templates together is the following file.

Example 4: the keyed, orderless address book

<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
             deltaxml:ordered="false">
   <person customerid="15" deltaxml:key="15">
     <name>Joe Bloggs</name>
     <email>jblogs@msn.com</email>
     <address>
       <line>1234 Green Lane</line>
       <line>London</line>
       <zip>SW1 7WJ</zip>
     </address>
   </person>
  <person customerid="14" deltaxml:key="14">
     <name>John Doe</name>
     <email>jdoe@hotmail.com</email>
     <address>
       <line>1234 East Beverley Drive</line>
       <line>Phoenix</line>
       <zip>AZ 12345</zip>
     </address>
   </person>
   <person customerid="62" deltaxml:key="62">
     <name>Joe Bloggs</name>
     <email>jblogs@hotmail.com</email>
     <address>
       <line>1234 Green Lane</line>
       <line>London</line>
       <zip>SW1 7WJ</zip>
     </address>
  </person>
</addressList>

3 Comparing orderless data with keys

3.1 Ignoring data shuffling

DeltaXML is now equipped to perform an orderless comparison. The test will involve a simple rearrangement of the <person> elements:

Example 5: Rearranged address book (documentB.xml in the sample directory)

<addressList>
  <person customerid="15">
    <name>Joe Bloggs</name>
    <email>jblogs@msn.com</email>
    <address>
       <line>1234 Green Lane</line>
       <line>London</line>
       <zip>SW1 7WJ</zip>
     </address>
  </person>
  <person customerid="62">
    <name>Joe Bloggs</name>
    <email>jblogs@hotmail.com</email>
    <address>
       <line>1234 Green Lane</line>
       <line>London</line>
       <zip>SW1 7WJ</zip>
     </address>
  </person>
  <person customerid="14">
    <name>John Doe</name>
    <email>jdoe@hotmail.com</email>
    <address>
       <line>1234 East Beverley Drive</line>
       <line>Phoenix</line>
       <zip>AZ 12345</zip>
     </address>
  </person>
</addressList>

The comparison result shows no changes, as expected:

Example 6: Delta showing no differences between original and rearranged address books

<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"             
             deltaxml:deltaV2="A=B"
             deltaxml:ordered="false"
             deltaxml:version="2.0"
             deltaxml:content-type="full-context">
   <person deltaxml:key="15" customerid="15">
      <name>Joe Bloggs</name>
      <email>jblogs@msn.com</email>
      <address>
         <line>1234 Green Lane</line>
         <line>London</line>
         <zip>SW1 7WJ</zip>
      </address>
   </person>
   <person deltaxml:key="14" customerid="14">
      <name>John Doe</name>
      <email>jdoe@hotmail.com</email>
      <address>
         <line>1234 East Beverley Drive</line>
         <line>Phoenix</line>
         <zip>AZ 12345</zip>
      </address>
   </person>
   <person deltaxml:key="62" customerid="62">
      <name>Joe Bloggs</name>
      <email>jblogs@hotmail.com</email>
      <address>
         <line>1234 Green Lane</line>
         <line>London</line>
         <zip>SW1 7WJ</zip>
      </address>
   </person>
</addressList>

3.2 Detecting changes after shuffling

The preceding example showed that DeltaXML correctly ignores rearrangement of orderless sets. The next file introduces actual changes to address book:

Example 7: Address book with data change (documentC.xml in the sample directory)

<addressList>
  <person customerid="14">
    <name>John Doe</name>
    <email>jdoe@hotmail.com</email>
    <address>
       <line>1234 E. Beverley Drive</line>
       <line>Phoenix</line>
       <zip>AZ 12345 6789</zip>
     </address>
  </person>
  <person customerid="15">
    <name>Joe Bloggs</name>
    <email>jblogs@hotmail.com</email>
    <address>
       <line>1234 Green Lane</line>
       <line>London</line>
       <zip>SW1 7WJ</zip>
     </address>
  </person>
</addressList>

Now DeltaXML has something more to report, because embedded data has changed:

Example 8: Delta showing data change detected in the orderless set

<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"             
             deltaxml:deltaV2="A!=B"
             deltaxml:ordered="false"
             deltaxml:version="2.0"
             deltaxml:content-type="full-context">
   <person deltaxml:deltaV2="A!=B" deltaxml:key="15" customerid="15">
      <name deltaxml:deltaV2="A=B">Joe Bloggs</name>
      <email deltaxml:deltaV2="A!=B">
         <deltaxml:textGroup deltaxml:deltaV2="A!=B">
            <deltaxml:text deltaxml:deltaV2="A">jblogs@msn.com</deltaxml:text>
            <deltaxml:text deltaxml:deltaV2="B">jblogs@hotmail.com</deltaxml:text>
         </deltaxml:textGroup>
      </email>
      <address deltaxml:deltaV2="A=B">
         <line>1234 Green Lane</line>
         <line>London</line>
         <zip>SW1 7WJ</zip>
      </address>
   </person>
   <person deltaxml:deltaV2="A!=B" customerid="14" deltaxml:key="14">
      <name deltaxml:deltaV2="A=B">John Doe</name>
      <email deltaxml:deltaV2="A=B">jdoe@hotmail.com</email>
      <address deltaxml:deltaV2="A!=B">
         <line deltaxml:deltaV2="A!=B">
            <deltaxml:textGroup deltaxml:deltaV2="A!=B">
               <deltaxml:text deltaxml:deltaV2="A">1234 East Beverley Drive</deltaxml:text>
               <deltaxml:text deltaxml:deltaV2="B">1234 E. Beverley Drive</deltaxml:text>
            </deltaxml:textGroup>
         </line>
         <line deltaxml:deltaV2="A=B">Phoenix</line>
         <zip deltaxml:deltaV2="A!=B">
            <deltaxml:textGroup deltaxml:deltaV2="A!=B">
               <deltaxml:text deltaxml:deltaV2="A">AZ 12345</deltaxml:text>
               <deltaxml:text deltaxml:deltaV2="B">AZ 12345 6789</deltaxml:text>
            </deltaxml:textGroup>
         </zip>
      </address>
   </person>
   <person deltaxml:deltaV2="A" deltaxml:key="62" customerid="62">
      <name>Joe Bloggs</name>
      <email>jblogs@hotmail.com</email>
      <address>
         <line>1234 Green Lane</line>
         <line>London</line>
         <zip>SW1 7WJ</zip>
      </address>
   </person>
</addressList>

In this case there were changes to <person> elements. DeltaXML accurately detected and recorded the change even though the order of elements was different.

3.3 Rules for keys in orderless comparisons

The following rules apply to the use of key attributes in the orderless case:

  • Parents of orderless elements must be assigned a deltaxml:ordered="false" attribute.

  • XML elements are considered ordered by default (in the absence of this attribute).

  • Two elements that are identical in all respects except the deltaxml:ordered attribute are considered distinct and will not be compared with each other. This unusual event triggers a warning in DeltaXML verbose mode.

  • Orderless elements may not contain parsed character data.

  • Any element may be designated as orderless, regardless of the ordering status of its parent or child elements.

3.4 Rules for orderless elements lacking keys

Generally, members of orderless sets should all have a key attribute. Orderless sets with a mixture of keyed and unkeyed elements may yield obscure results during comparisons.

Using no keys at all is preferable to a mixed case. Nevertheless, DeltaXML's rules for the mixed case are simple:

  • If the comparison file contains an identical element (perfect match, but neither element has a key attribute) DeltaXML assumes these are actually the same. No change is recorded.

  • If no identical element exists, the keyless element is considered a delta change. It is recorded as an addition or deletion.

3.5 Anchoring position of one element in an orderless set

Orderless sets sometimes incorporate a header element, such as <setheader> below:

Example 9: Orderless set with header

<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
             deltaxml:ordered="false"> 
   <setheader> 
      <date>20 January 2003</date> 
      <owner>MegaCorp</owner> 
   </setheader> 
   <item> Some data </item> 
   <item> More data </item> 
   <item> Extra data </item> 
</MyDataItems> 

Technically <setheader> is a member of the orderless set <MyDataItems>. Yet unlike the <item> members, <setheader> will not move around. It has a special status within the orderless set. So we would like DeltaXML to match it across revisions, not treating changes to this one element in the same way as changes to other members. Therefore we assign a deltaxml:key attribute to the <setheader> element:

Example 10: Orderless set with keyed header

<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
             deltaxml:ordered="false"> 
   <setheader deltaxml:key="headerkey"> 
      <date>20 January 2003</date> 
      <owner>MegaCorp</owner> 
   </setheader> 
   <item> Some data </item> 
   <item> More data </item> 
   <item> Extra data </item> 
</MyDataItems>

Of course, a better design would employ keys for all elements:

Example 11: Orderless set with keyed header and keyed elements

<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
             deltaxml:ordered="false"> 
   <setheader deltaxml:key="headerkey"> 
      <date>20 January 2003</date> 
      <owner>MegaCorp</owner> 
   </setheader> 
   <item deltaxml:key="q43"> Some data </item> 
   <item deltaxml:key="342"> More data </item> 
   <item deltaxml:key="09r"> Extra data </item> 
</MyDataItems>

3.6 Ideas for designing keys

The key to an element can take many forms, including:

  • a child element

  • part of a child element

  • a combination of attributes

  • a combination of attributes and child elements

  • a random number such as a GUID

  • arbitrary data from external sources (other files, databases, user input, etc).

The choices are almost unlimited. Moreover, an element class may apply different key formulations to different instances of that class. For example, some instances could use child data while others use attribute data.

Key generation has to be equally flexible. One possibility is to manufacture keys as part of the XML file. In this case they are stored as a permanent component of the XML data. However this technique leaves behind a footprint that may be inappropriate. It is often better to factor DeltaXML-specific data out. This factoring is possible by creating DeltaXML keys on a temporary, as-needed basis. A simple XSL script can generate keys just prior to DeltaXML processing, while another strips them upon completion. A separate paper on the use of XSL Filters details this technique.

We refer to key assignment based on element data as data-driven. For example, consider the bare address list without any identifiers or keys:

Example 12: Address book with no "id" attributes or keys

<addressList>
  <person>
    <firstName>John</firstName>
    <lastName>Smith</lastName> 
    <address>
      <line>26 High Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0JK</zip>
    </address>        
  </person>
  <person>
    <firstName>Adam</firstName>
    <lastName>Smith</lastName> 
    <address>
      <line>26 High Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0JK</zip>
    </address>
  </person>
  <person>
    <firstName>David</firstName>
    <lastName>Jones</lastName>
    <address>
      <line>12 New Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0PL</zip>
    </address>
  </person>
  <person>
    <firstName>Mark</firstName>
    <lastName>Lane</lastName>
    <address>
      <line>99 Bow Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0LL</zip>
    </address>
  </person> 
</addressList>

We expect the combination of firstName, lastName, and postCode to form a unique key. Such a key can be easily generated by XSL using XPath expressions. The results might look as follows:

Example 13: Data-driven keys added to address book elements

<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
             deltaxml:ordered="false">
  <person deltaxml:key="John Smith WR13 0JK">
    <firstName>John</firstName>
    <lastName>Smith</lastName> 
    <address>
      <line>26 High Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0JK</zip>
    </address>        
  </person>
  <person deltaxml:key="Adam Smith WR13 0JK">
    <firstName>Adam</firstName>
    <lastName>Smith</lastName> 
    <address>
      <line>26 High Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0JK</zip>
    </address>
  </person>
  <person deltaxml:key="David Jones WR13 0PL">
    <firstName>David</firstName>
    <lastName>Jones</lastName>
    <address>
      <line>12 New Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0PL</zip>
    </address>
  </person>
  <person deltaxml:key="Mark Lane WR13 0LL">
    <firstName>Mark</firstName>
    <lastName>Lane</lastName>
    <address>
      <line>99 Bow Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0LL</zip>
    </address>
  </person> 
</addressList>

This new design is more natural than id attributes because the keys derive from actual data. Furthermore the filter supports key generation on the fly, leaving permanent data untouched by DeltaXML markup. The XSL definition for this is presented below:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
                version="2.0">
  <xsl:output method="xml" indent="no" />

  <xsl:template match="@*, node()">
    <xsl:copy>
      <xsl:apply-templates select="@*,node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="person">
    <xsl:copy>
      <xsl:attribute name="deltaxml:key"
                     select="concat(firstName/text(), ' ', lastName/text(), ' ', address/zip/text())"/>
      <xsl:apply-templates select="@*,node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="addressList">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ordered" select="'false'"/>
      <xsl:apply-templates select="@*,node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

In the above example, the deltaxml:ordered="false" attribute is added to the addressList element as specified in the template's match attribute. If you wanted it to apply to a range of different elements the XSL match attribute could be extended to include further XPaths, for example

<xsl:template match="addressList | elem1 | elem2/elem3">
  <xsl:copy>
    <xsl:attribute name="deltaxml:ordered" select="'false'"/>
    <xsl:apply-templates select="@*,node()"/>
  </xsl:copy>
</xsl:template> 

Here the attribute would be added to all <addressList> elements, all <elem1> elements, and <elem3> elements whose immediate parents where <elem2>.

It should be noted that we only suggest adding the deltaxml:ordered="false" element to data where order really does not matter. If you have an element whose sub-elements are a mixture of ordered and orderless elements and wish to manage these then please see our "how to" reference on "How to configure DeltaXML to handle mixed ordered and orderless data within a single element".

3.7 Nesting orderless sets

Orderless elements may nest, but child elements do not inherit the orderless quality automatically. It must be declared explicitly. For example, each person in the address list might have several phone numbers:

Example 14: Nested orderless sets

<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" deltaxml:ordered="false">
  <person>
    <firstName>John</firstName>
    <lastName>Smith</lastName>
    <address>
      <line>26 High Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0JK</zip>
    </address>
    <phoneNumbers deltaxml:ordered="false">
      <phone>1 234</phone>
      <phone>2 345</phone>
      <phone>3 456</phone>
    </phoneNumbers>
  </person>
  <person>
    <firstName>David</firstName>
    <lastName>Jones</lastName>
    <address>
      <line>12 New Street</line>
      <line>Newtown</line>
      <line>Malvern</line>
      <zip>WR13 0PL</zip>
    </address>
    <phoneNumbers deltaxml:ordered="false">
      <phone>4 567</phone>
      <phone>5 678</phone>
      <phone>6 789</phone>
    </phoneNumbers>
  </person>
</addressList>

Comparison works the same no matter how deeply nested the elements. The rules for keys are the same across all levels.

3.8 Should everything be treated as orderless?

The default XML behaviour is that element ordering is significant. The previous example data, for example, the order of the <line> elements in the <address> is significant. Print them out in the wrong order and your postman, or rather the automated post-office sorting machines, may have a few problems! Thus by default you should not try to treat XML data as orderless.

Another example, imagine a simple XML vector graphics format:

  <line> 
    <point x="0" y="0"/> 
    <point x="1" y="1"/> 
  </line> 
 
  <closedPolygon> 
    <point x="0" y="0"/> 
    <point x="1" y="0"/> 
    <point x="1" y="1"/> 
    <point x="0.5" y="1.5"/> 
    <point x="0" y="1"/> 
  </closedPolygon>

Now in the case of a line, perhaps it would be fine to add deltaxml:ordered="false" to the <line> element. It doesn't matter which end you start at the resulting lines are the same and so the comparator should perhaps treat them so. But the <closedPolygon> is a different matter, reorder the points and you end up with different shapes. These should not be compared as equal.

So what approach should you take? We believe that understanding the semantics of the data is important. In some cases good 'clues' are present in DTDs, XMLSchema or RNG grammars for the data, for example the lack of a * or + repetition operator is a good indication not to add orderless. But there are cases where human expertise is needed to complement the grammars.

4 Running the sample

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output files resultAB.xml and resultAC.xml.

run ant

If you don't have Ant installed, you can run the sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct slashes for your operating system).

java -jar ../../command.jar compare orderless documentA.xml documentB.xml resultAB.xml
java -jar ../../command.jar compare orderless documentA.xml documentC.xml resultAC.xml

To clean up the sample directory, run the following command in Ant.

ant clean