Using Keys with Ordered Data

Adding keys to your data lets you control how DeltaXML aligns the elements at each level of the documents you are comparing. Keys are useful for both ordered and orderless data. This document describes how to use them with ordered data.

1. Comparing ordered data with keys

Keys often improve an ordered data comparison because they control how DeltaXML aligns elements. Even without keys, DeltaXML always produces a correct difference file. The issue is that several correct answers may be possible, and the preferred answer is the one that best matches human understanding.

The next example illustrates this idea. Paragraphs form an ordered collection of data. Suppose that we have a small section of a book like this:

Example 1: First book draft (documentA.xml in the sample directory)

<book> 
  <p>The first advantage of DeltaXML is ..</p> 
  <p>The second advantage of DeltaXML is ..</p> 
</book>

Now we create an introductory paragraph and place it at the start of the file, while modifying the other paragraphs:

Example 2: Second book draft (documentB.xml in the sample directory)

<book> 
  <p>DeltaXML has many advantages:</p> 
  <p>The most important advantage of DeltaXML is ..</p> 
  <p>And the next advantage of DeltaXML is ..</p> 
</book>

We made three changes: one paragraph was added and two were modified. Because we did not use key attributes, DeltaXML produces a more verbose, though still correct, output:

Example 3: Unkeyed comparison of first and second book drafts showing mismatches

<book xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"      
      deltaxml:deltaV2="A!=B"
      deltaxml:version="2.0"
      deltaxml:content-type="full-context">
   <p deltaxml:deltaV2="A!=B">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
         <deltaxml:text deltaxml:deltaV2="A">The first advantage of DeltaXML is ..</deltaxml:text>
         <deltaxml:text deltaxml:deltaV2="B">DeltaXML has many advantages:</deltaxml:text>
      </deltaxml:textGroup>
   </p>
   <p deltaxml:deltaV2="A!=B">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
         <deltaxml:text deltaxml:deltaV2="A">The second advantage of DeltaXML is ..</deltaxml:text>
         <deltaxml:text deltaxml:deltaV2="B">The most important advantage of DeltaXML is ..</deltaxml:text>
      </deltaxml:textGroup>
   </p>
   <p deltaxml:deltaV2="B">And the next advantage of DeltaXML is ..</p>
</book>

All paragraphs were mismatched. The delta file is correct, but a human reader would have matched the paragraphs differently: the new introductory paragraph as an addition, and the two original paragraphs as modifications. Without further information, DeltaXML cannot always tell which paragraphs should be aligned.

NOTE: DeltaXML Core version 4 and above includes an Enhanced Matcher that can achieve a much better match between elements based on their content. It is especially effective for documents compared with the word-by-word option. For this example, however, we have not used the word-by-word feature, in order to show how keys work.

A better comparison requires hints. The hints take the form of key attributes. Here key attributes are applied to the paragraphs:

Example 4: First book draft with keys (documentA-keyed.xml in the sample directory)

<book xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"> 
  <p deltaxml:key="P1">The first advantage of DeltaXML is ..</p> 
  <p deltaxml:key="P2">The second advantage of DeltaXML is ..</p> 
</book>

Example 5: Second book draft with keys (documentB-keyed.xml in the sample directory)

<book xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"> 
  <p>DeltaXML has many advantages:</p> 
  <p deltaxml:key="P1">The most important advantage of DeltaXML is ..</p> 
  <p deltaxml:key="P2">And the next advantage of DeltaXML is ..</p> 
</book>

Running DeltaXML under these conditions produces a more natural result:

Example 6: Keyed comparison of first and second book drafts showing no mismatches

<book xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"      
      deltaxml:deltaV2="A!=B"
      deltaxml:version="2.0"
      deltaxml:content-type="full-context">
   <p deltaxml:deltaV2="B">DeltaXML has many advantages:</p>
   <p deltaxml:deltaV2="A!=B" deltaxml:key="P1">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
         <deltaxml:text deltaxml:deltaV2="A">The first advantage of DeltaXML is ..</deltaxml:text>
         <deltaxml:text deltaxml:deltaV2="B">The most important advantage of DeltaXML is ..</deltaxml:text>
      </deltaxml:textGroup>
   </p>
   <p deltaxml:deltaV2="A!=B" deltaxml:key="P2">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
         <deltaxml:text deltaxml:deltaV2="A">The second advantage of DeltaXML is ..</deltaxml:text>
         <deltaxml:text deltaxml:deltaV2="B">And the next advantage of DeltaXML is ..</deltaxml:text>
      </deltaxml:textGroup>
   </p>
</book>

Paragraphs with the same key values have been matched up. The first paragraph is now shown as added, and the others as modified. This delta file corresponds to the edits as a human would understand them. Note that not all the paragraphs have keys: DeltaXML works with whatever keys are provided. It is not necessary to key every paragraph, although doing so gives the best results as further edits are made.
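The effect of keys on alignment can be illustrated with a toy model. The sketch below is not DeltaXML's matching algorithm and does not reproduce its unkeyed behavior. It assumes only that keyed elements are identified by their key, and it uses Python's standard difflib to align the children of two elements. The function names identity and align_children are hypothetical.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET

# Namespace URI taken from the examples in this document.
DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
KEY_ATTR = f"{{{DELTAXML_NS}}}key"


def identity(elem):
    """Toy identity: keyed elements by tag and key, unkeyed ones by tag and text.

    The "key"/"text" marker keeps a keyed element from ever matching an
    unkeyed one.
    """
    key = elem.get(KEY_ATTR)
    if key is not None:
        return (elem.tag, "key", key)
    return (elem.tag, "text", (elem.text or "").strip())


def align_children(parent_a, parent_b):
    """Return (status, text_a, text_b) rows for the children of two elements."""
    kids_a, kids_b = list(parent_a), list(parent_b)
    matcher = SequenceMatcher(
        None,
        [identity(e) for e in kids_a],
        [identity(e) for e in kids_b],
        autojunk=False,
    )
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same identity on both sides: report as unchanged or modified.
            for a, b in zip(kids_a[i1:i2], kids_b[j1:j2]):
                rows.append(("A=B" if a.text == b.text else "A!=B", a.text, b.text))
        else:
            # No counterpart: present in A only or in B only.
            rows.extend(("A", a.text, None) for a in kids_a[i1:i2])
            rows.extend(("B", None, b.text) for b in kids_b[j1:j2])
    return rows


if __name__ == "__main__":
    doc_a = ET.parse("documentA-keyed.xml").getroot()
    doc_b = ET.parse("documentB-keyed.xml").getroot()
    for row in align_children(doc_a, doc_b):
        print(row)
```

Run against the two keyed drafts, this toy model reports the introductory paragraph as present in B only and pairs the P1 and P2 paragraphs as modified, which is the same alignment as Example 6.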

1.1. What data can be used for keys?

You can use any text data as a key, and often your document or data file already contains suitable information. For example, if there are ID attributes, these can be used: simply copy the value of the ID attribute into the deltaxml:key attribute. You can also construct the key value from two or more existing attributes, or from some other content. If you do not want the keys in your final result, they can be stripped out, because they are only needed for the comparison process.
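As a minimal sketch of this preprocessing step, the following Python code copies an existing ID attribute into deltaxml:key and removes the keys again afterwards. The attribute name id and the function names add_keys_from_ids and strip_keys are assumptions made for illustration; they are not part of DeltaXML.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the examples in this document.
DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
KEY_ATTR = f"{{{DELTAXML_NS}}}key"

# Serialize the namespace with the familiar "deltaxml" prefix.
ET.register_namespace("deltaxml", DELTAXML_NS)


def add_keys_from_ids(root, id_attr="id"):
    """Copy each element's ID attribute (assumed to be named "id") into deltaxml:key."""
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            elem.set(KEY_ATTR, value)
    return root


def strip_keys(root):
    """Remove every deltaxml:key attribute, e.g. from a comparison result."""
    for elem in root.iter():
        elem.attrib.pop(KEY_ATTR, None)
    return root


if __name__ == "__main__":
    source = '<book><p id="P1">The first advantage ..</p><p>No id here</p></book>'
    book = add_keys_from_ids(ET.fromstring(source))
    print(ET.tostring(book, encoding="unicode"))
    print(ET.tostring(strip_keys(book), encoding="unicode"))
```

Elements without an ID attribute are left unkeyed, which is acceptable because DeltaXML works with whatever keys are provided.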

It is best to keep the keys unique across child elements of a particular type, e.g. all <p> elements in a <section>. This is not essential but will give more predictable results.
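A quick check for repeated keys can be run before comparing. The sketch below reports any key that appears more than once among child elements of the same type under one parent. The function name duplicate_keys is hypothetical, and the check is an illustration rather than a DeltaXML feature.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace URI taken from the examples in this document.
DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
KEY_ATTR = f"{{{DELTAXML_NS}}}key"


def duplicate_keys(root):
    """Return (parent tag, child tag, key) for keys repeated among same-type siblings."""
    duplicates = []
    for parent in root.iter():
        # Count each (child tag, key) pair among this parent's keyed children.
        counts = Counter(
            (child.tag, child.get(KEY_ATTR))
            for child in parent
            if child.get(KEY_ATTR) is not None
        )
        for (tag, key), n in counts.items():
            if n > 1:
                duplicates.append((parent.tag, tag, key))
    return duplicates


if __name__ == "__main__":
    root = ET.parse("documentA-keyed.xml").getroot()
    for parent_tag, child_tag, key in duplicate_keys(root):
        print(f"key {key!r} is repeated on <{child_tag}> children of <{parent_tag}>")
```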

1.2. Key rules in ordered comparisons

These basic rules apply to the use of keys in ordered comparisons:

  • DeltaXML never records a key change, because it considers elements with different keys as different elements.
  • An element lacking a key will never match any keyed element.
  • Key strings must be normalized: no leading or trailing whitespace, and only single spaces between words.
  • Matching of keyed elements takes precedence over matching of unkeyed elements.
  • Elements with keys can be nested, and keys can be used at any level.
  • The deltaxml:key attribute may be used to identify ordered data only in DeltaXML Version 2.1 or higher. Earlier DeltaXML versions allow keys for orderless elements only.
  • Version 4 and higher of DeltaXML have an enhanced matcher which will give improved results when it is not possible to use keys.
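The normalization rule above can be applied when key values are generated. This is a minimal sketch in Python; the helper name normalize_key is hypothetical.

```python
def normalize_key(raw):
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    # str.split() with no argument splits on any run of whitespace and
    # discards empty strings, so joining with " " yields a normalized key.
    return " ".join(raw.split())


if __name__ == "__main__":
    print(repr(normalize_key("  chapter   1\t intro \n")))
```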

2. Running the sample

If you have Ant installed, use the build script provided to run the sample. Type the following command to run the pipeline and produce the output files unkeyed-result.xml and keyed-result.xml.

ant run

If you don't have Ant installed, you can run the sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct directory separators for your operating system).

java -jar ../../command.jar compare ordered documentA.xml documentB.xml unkeyed-result.xml
java -jar ../../command.jar compare ordered documentA-keyed.xml documentB-keyed.xml keyed-result.xml

To clean up the sample directory, run the following Ant command.

ant clean