How to Compare Orderless Elements
Table of Contents
1. Introduction
Any element in an XML file can be declared as orderless and DeltaXML will match up and compare the child elements regardless of the order in which they appear. What counts about orderless data is membership in a set, not relative position. Arbitrary rearrangement leaves the meaning of the set unaffected. To detect changes in sets, DeltaXML must correlate matching elements in spite of rearrangements. Keys help DeltaXML perform this correlation. Keys remove any need for sorting. (Orderless comparisons have been supported since version 2.0 of DeltaXML.)
The special-purpose attributes called keys simplify DeltaXML comparisons in many situations. The structure of XML is not completely rigid; as with machine parts, a little flexibility lends strength. For instance XML permits both ordered and orderless data. Keys help DeltaXML track changes in both categories. Keys increase both the quality of the matches made and the runtime speed of DeltaXML comparisons. For orderless data keys should be provided whenever possible to get the best results.
Key attributes impose no requirements on data formats. Extensible Stylesheet Language (XSLT) scripts can generate keys on the fly, whenever needed, and discard them in postprocessing. The original XML remains unaffected by their use.
One point that should be noted, before reading further, is that it is very important to remove whitespace only PCDATA nodes during orderless comparison (for example using the normalize-space.xsl XSLT file or the more efficient com.deltaxml.pipe.filters.NormalizeSpace.class Java XML Filter).
2. How to declare orderless data
Ordering is often unimportant during data capture or storage. For example, entries in an electronic address book can be stored in any order. Assume that we have the following entries stored as our address data and wish to track any modifications using DeltaXML:
Example 1. Orderless address book
<?xml version="1.0" standalone="yes"?>
<addressList>
<person id="a1">
<firstName>John</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person id="a2">
<firstName>Adam</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person id="a3">
<firstName>David</firstName>
<lastName>Jones</lastName>
<addressLine>12 New Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0PL</postCode>
</person>
<person id="a4">
<firstName>Mark</firstName>
<lastName>Lane</lastName>
<addressLine>99 Bow Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0LL</postCode>
</person>
</addressList>
The first step is to tell DeltaXML that <address> contents are orderless
Example 2. Adding the orderless attribute
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
The attribute deltaxml:ordered="false" is the signal. It tells the DeltaXML comparison program that the set of <person> elements nested within the <addressList> element are orderless. (Actually, any direct child element of <addressList> will be considered orderless, not just <person>s.)
The second step is to provide DeltaXML with a means of matching elements across the two input files. In our example the id attribute can serve as a matching key. Two elements, one from each input file, are considered to match if, and only if, their respective keys match. When the keys match DeltaXML scans the element in question and records any changes found within its contents. Note how we record the key information using an attribute that is clearly flagged as being in the deltaxml namespace:
Example 3. Orderless address book with keys
<?xml version="1.0" standalone="yes"?>
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<person id="a1" deltaxml:key="a1">
<firstName>John</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person id="a2" deltaxml:key="a2">
<firstName>Adam</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person id="a3" deltaxml:key="a3">
<firstName>David</firstName>
<lastName>Jones</lastName>
<addressLine>12 New Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0PL</postCode>
</person>
<person id="a4" deltaxml:key="a4">
<firstName>Mark</firstName>
<lastName>Lane</lastName>
<addressLine>99 Bow Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0LL</postCode>
</person>
</addressList>
While it might seem more sensible to reference the id attribute than to duplicate its value, concrete reasons exist for such a protocol. Keys can exhibit much more complexity than shown here.
3. Comparing orderless data with keys
3.1. Ignoring data shuffling
DeltaXML is now equipped to perform an orderless comparison. The test will involve a simple rearrangement of the <person> elements:
Example 4. Rearranged address book with keys
<?xml version="1.0" standalone="yes"?>
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<person id="a3" deltaxml:key="a3">
<firstName>David</firstName>
<lastName>Jones</lastName>
<addressLine>12 New Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0PL</postCode>
</person>
<person id="a1" deltaxml:key="a1">
<firstName>John</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person id="a4" deltaxml:key="a4">
<firstName>Mark</firstName>
<lastName>Lane</lastName>
<addressLine>99 Bow Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0LL</postCode>
</person>
<person id="a2" deltaxml:key="a2">
<firstName>Adam</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
</addressList>
The comparison result shows no changes, as expected:
Example 5. No differences between original and rearranged address books
<?xml version="1.0"?> <addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" deltaxml:delta="unchanged"/>
This is a very short result. The DeltaXML format is a change format; it records only modifications (unless qualified by the full context delta setting). In this example DeltaXML tells us that nothing changed even though <person> elements were rearranged.
DeltaXML distinguishes ordered from orderless changes with special attributes. Ordered changes use the value "Wfmodify" for the deltaxml:delta attribute while orderless changes use the value to "WFmodifyUnordered". In both cases the WF stands for "Well-Formed".
3.2. Detecting changes after shuffling
The preceding example showed that DeltaXML correctly ignores rearrangement of orderless sets. The next file introduces actual changes to one of the elements (see the entry with the id of "a4"):
Example 6. Address book with data change
<?xml version="1.0" standalone="yes"?>
<!-- Re-ordered person elements, change to person with deltaxml:key="a4" -->
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<person id="a3" deltaxml:key="a3">
<firstName>David</firstName>
<lastName>Jones</lastName>
<addressLine>12 New Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0PL</postCode>
</person>
<person id="a1" deltaxml:key="a1">
<firstName>John</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person id="a4" deltaxml:key="a4">
<firstName>Mark</firstName>
<lastName>Lane</lastName>
<addressLine>99 Bow Street!!</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0LL!!</postCode>
</person>
<person id="a2" deltaxml:key="a2">
<firstName>Adam</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
</addressList>
Now DeltaXML has something more to report, because embedded data has changed:
Example 7. Data change detected in the orderless set
<?xml version="1.0"?>
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:delta="WFmodifyUnordered">
<person deltaxml:deltaV2="A!=B" id="a4" deltaxml:key="a4">
<firstName deltaxml:deltaV2="A=B">
Mark
</firstName>
<lastName deltaxml:deltaV2="A=B">
Lane
</lastName>
<addressLine deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">
99 Bow Street
</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">
99 Bow Street!!
</deltaxml:text>
</deltaxml:textGroup>
</addressLine>
<addressLine deltaxml:deltaV2="A=B">
Newtown
</addressLine>
<addressLine deltaxml:deltaV2="A=B">
Malvern
</addressLine>
<postCode deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">
WR13 0LL
</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">
WR13 0LL!!
</deltaxml:text>
</deltaxml:textGroup>
</postCode>
</person>
</addressList>
In this case only one <person> element changed. DeltaXML accurately detected and recorded the change. Note that here the unchanged data is not shown, but if it is needed DeltaXML can generate this using the Full Context delta.
3.3. Rules for keys in orderless comparisons
The following rules apply to the use of key attributes in the orderless case:
-
Parents of orderless elements must be assigned a deltaxml:ordered="false" attribute.
-
XML elements are considered ordered by default (in the absence of this attribute).
-
Two elements that are identical in all respects except the deltaxml:ordered attribute are considered distinct and will not be compared with each other. This unusual event triggers a warning in DeltaXML verbose mode. (Note: Warning available only in version 2.1 and higher.)
-
Orderless elements may not contain parsed character data.
-
Recombination of orderless elements is possible only if the change record has the attribute deltaxml:delta="WFmodifyUnordered".
-
Any element may be designated as orderless, regardless of the ordering status of its parent or child elements.
3.4. Rules for orderless elements lacking keys
Generally, members of orderless sets should all have a key attribute. Orderless sets with a mixture of keyed and unkeyed elements may yield obscure results during comparisons.
Using no keys at all is preferable to a mixed case. Nevertheless, DeltaXML's rules for the mixed case are simple:
-
If the comparison file contains an identical element (perfect match, but neither element has a key attribute) DeltaXML assumes these are actually the same. No change is recorded.
-
If no identical element exists, the keyless element is considered a delta change. It is recorded as an addition or deletion.
3.5. Anchoring position of one element in an orderless set
Orderless sets sometimes incorporate a header element, such as <setheader> below:
Example 8. Orderless set with header
<?xml version="1.0" standalone="yes"?>
<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<setheader>
<date>20 January 2003</date>
<owner>MegaCorp</owner>
</setheader>
<item> Some data </item>
<item> More data </item>
<item> Extra data </item>
</MyDataItems>
Technically <setheader> is a member of the orderless set <MyDataItems>. Yet unlike the <item> members, <setheader> will not move around. It has a special status within the orderless set. So we would like DeltaXML to match it across revisions, not treating changes to this one element in the same way as changes to other members. Therefore we assign a deltaxml:key attribute to the <setheader> element:
Example 9. Orderless set with keyed header
<?xml version="1.0" standalone="yes"?>
<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<setheader deltaxml:key="headerkey">
<date>20 January 2003</date>
<owner>MegaCorp</owner>
</setheader>
<item> Some data </item>
<item> More data </item>
<item> Extra data </item>
</MyDataItems>
Of course, a better design would employ keys for all elements:
Example 10. Orderless set with keyed header and keyed elements
<?xml version="1.0" standalone="yes"?>
<MyDataItems xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<setheader deltaxml:key="headerkey">
<date>20 January 2003</date>
<owner>MegaCorp</owner>
</setheader>
<item deltaxml:key="q43"> Some data </item>
<item deltaxml:key="342"> More data </item>
<item deltaxml:key="09r"> Extra data </item>
</MyDataItems>
3.6. Ideas for designing keys
The key to an element can take many forms, including:
-
a child element
-
part of a child element
-
a combination of attributes
-
a combination of attributes and child elements
-
a random number such as a GUID
-
arbitrary data from external sources (other files, databases, user input, etc).
The choices are almost unlimited. Moreover, an element class may apply different key formulations to different instances of that class. For example, some instances could use child data while others use attribute data.
Key generation has to be equally flexible. One possibility is to manufacture keys as part of the XML file. In this case they are stored as a permanent component of the XML data. However this technique leaves behind a footprint that may be inappropriate. It is often better to factor DeltaXML-specific data out. This factoring is possible by creating DeltaXML keys on a temporary, as-needed basis. A simple XSL script can generate keys just prior to DeltaXML processing, while another strips them upon completion. A separate paper on the use of XSL Filters details this technique.
We refer to key assignment based on element data as data-driven. For example, consider the bare address list without any identifiers or keys:
Example 11. Address book with no "id" attributes or keys
<?xml version="1.0" standalone="yes"?>
<addressList>
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person>
<firstName>Adam</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person>
<firstName>David</firstName>
<lastName>Jones</lastName>
<addressLine>12 New Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0PL</postCode>
</person>
<person>
<firstName>Mark</firstName>
<lastName>Lane</lastName>
<addressLine>99 Bow Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0LL</postCode>
</person>
</addressList>
We expect the combination of firstName, lastName, and postCode to form a unique key. Such a key can be easily generated by XSL using XPath expressions. The results might look as follows:
Example 12. Data-driven keys added to address book elements
<?xml version="1.0" standalone="yes"?>
<addressList
xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<person deltaxml:key="John Smith WR13 0JK">
<firstName>John</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person deltaxml:key="Adam Smith WR13 0JK">
<firstName>Adam</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
</person>
<person deltaxml:key="David Jones WR13 0PL">
<firstName>David</firstName>
<lastName>Jones</lastName>
<addressLine>12 New Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0PL</postCode>
</person>
<person deltaxml:key="Mark Lane WR13 0LL">
<firstName>Mark</firstName>
<lastName>Lane</lastName>
<addressLine>99 Bow Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0LL</postCode>
</person>
</addressList>
This new design is more natural than id attributes because the keys derive from actual data. Furthermore the filter supports key generation on the fly, leaving permanent data untouched by DeltaXML markup. The XSL definition for this is presented below:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
<xsl:output method="xml" indent="no" />
<xsl:template match="*|@*">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="person">
<xsl:copy>
<xsl:attribute name="deltaxml:key">
<xsl:value-of select="@customerid"/>
</xsl:attribute>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="addressList">
<xsl:copy>
<xsl:attribute name="deltaxml:ordered">false</xsl:attribute>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
In the above example, the deltaxml:ordered="false" attribute is added to the addressList element as specified in the template's
match attribute. If you wanted it to apply to a range of different elements the XSL match attribute could be extended to include further XPaths, for example
<xsl:template match="addressList|elem1|elem2/elem3">
<xsl:copy>
<xsl:attribute name="deltaxml:ordered">false</xsl:attribute>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
Here the attribute would be added to all addressList elements, all elem1 elements, and elem3 elements whose immediate parents where elem2.
It should be noted that we only suggest adding the deltaxml:ordered="false"element to data where order realy does not matter. If you have an element whose sub-elements are a mixture of ordered and orderless elements and wish to manage these then please see our "how to" reference on "How to configure DeltaXML to handle mixed ordered and orderless data within a single element".
3.7. Nesting orderless sets
Orderless elements may nest, but child elements do not inherit the orderless quality automatically. It must be declared explicitly. For example, each person in the address list might have several phone numbers:
Example 13. Nested orderless sets
<?xml version="1.0" standalone="yes"?>
<addressList xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:ordered="false">
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<addressLine>26 High Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0JK</postCode>
<phoneNumbers deltaxml:ordered="false">
<phone>1 234</phone>
<phone>2 345</phone>
<phone>3 456</phone>
</phoneNumbers>
</person>
<person>
<firstName>David</firstName>
<lastName>Jones</lastName>
<addressLine>12 New Street</addressLine>
<addressLine>Newtown</addressLine>
<addressLine>Malvern</addressLine>
<postCode>WR13 0PL</postCode>
<phoneNumbers deltaxml:ordered="false">
<phone>4 567</phone>
<phone>5 678</phone>
<phone>6 789</phone>
</phoneNumbers>
</person>
</addressList>
Comparison works the same no matter how deeply nested the elements. The rules for keys are the same across all levels.
3.8. Should eveything be treated as orderless?
The default XML behaviour is that element ordering is significant. The previous example data, for example, the order of the <*Line> elements in the <address> is significant. Print them out in the wrong order and your postman, or rather the automated post-office sorting machines, may have a few problems! Thus by default you should not try to treat XML data as orderless.
Another example, imagine a simple XML vector graphics format:
<line>
<point x="0" y="0"/>
<point x="1" y="1"/>
</line>
<closedPolygon>
<point x="0" y="0"/>
<point x="1" y="0"/>
<point x="1" y="1"/>
<point x="0.5" y="1.5"/>
<point x="0" y="1"/>
</closedPolygon>
Now in the case of a line, perhaps it would be fine to add deltaxml:ordered="false" to the <line> element. It doesn't matter which end you start at the resulting lines are the same and so the comparator should perhaps treat them so. But the closedPolygon is a different matter, reorder the points and you end up with different shapes. These should not be compared as equal.
So what aproach shoudl you take? We believe that understanding the semantics of the data is important. In some cases good 'clues' are present in DTDs, XMLSchema or RNG grammars for the data, for example the lack of a * or + repetition operator is a good indication not to add orderless. But there are cases where human expertise is needed to complement the grammars.