Detecting and Handling Moves

1. Handling moves

This "how to" document discusses how you can detect a move rather than a delete and an add. It is included in the Core release in the samples/HandleMoves sub-directory.

In some situations, an element is moved from one place in an XML file to another. When comparing files where such a move has occurred, DeltaXML will not recognise the move as such, but will identify the change as a 'delete' (denoted by deltaxml:deltaV2="A") from the original position and an 'add' of an element (denoted by deltaxml:deltaV2="B") in the new position.

Although DeltaXML does not identify a move directly, it does generate a delta file in XML and it is possible to process this delta file to match delete/add or A/B pairs and so identify moves. This is described below, but first it is worth considering why DeltaXML adopts this approach, and some of the inherent complexities of a 'move' operation.

2. Background

There are a number of reasons for DeltaXML adopting the approach of identifying additions and deletions but not moves.

The first reason is that, in most situations, the delete and add behaviour is what is required. This is almost invariably the case where the XML represents data rather than a text document.

Secondly, the DeltaXML delta file contains no pointers or XPath expressions: all the elements have the same set of ancestors as in the original file(s). In order to represent a move, an XPath expression or a pointer is needed. This adds a level of complexity and makes processing of the delta files more difficult. It was a design goal of the delta format to keep it as simple as possible and also that the delta should reflect the structure of the files being compared.

A third reason relates to the control complexity that is inherent in identifying moves. Here are some considerations:

  • how far away can an element be moved? Anywhere in the file? Anywhere in the same parent? grandparent?
  • how can we detect a move when the element has also been changed?
  • what happens when an element is moved and also copied?
  • when an element is deleted once and added in many places, is that a move?
  • when an element is added once and deleted in many places, is that a move? which elements correspond?

This leads us to the conclusion that, for reliable move detection, each element should be identified so that moves could be detected accurately. In this case it is in fact quite possible to use DeltaXML and, by simply processing the delta file, to detect and mark these moves.

3. How to identify moves by processing the delta file

If you have, in your files, a reliable way to detect which elements are the same in the two files, then moves can be identified by processing the delta file. For example, if you have an ID attribute on elements, these can be used in this way. Let's look at an example where a set of work and home contacts are represented in an XML file.

Example 1: an XML file representing a set of contacts (r1.xml in the sample directory)

<records xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"> 
 <work-contacts> 
  <contact ID="johnsmith" deltaxml:key="johnsmith"> 
   <name>John Smith</name> 
   <phone type="office">+44 200 1234 567</phone> 
   <phone type="fax">+44 200 1234 568</phone> 
   <phone type="mobile">+44 200 1234 569</phone> 
  </contact> 
  <contact ID="markjones" deltaxml:key="markjones"> 
   <name>Mark Jones</name> 
   <phone type="office">+44 200 1234 599</phone> 
   <phone type="fax">+44 200 1234 599</phone> 
   <phone type="mobile">+44 200 1234 500</phone> 
  </contact> 
 </work-contacts> 
 <home-contacts> 
 </home-contacts> 
</records>

And, in a modified version we have moved John Smith to be a home contact, and we have updated his phone number.

Example 2: an updated version of the contacts file (r2.xml in the sample directory)

<records xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"> 
 <work-contacts> 
  <contact ID="markjones" deltaxml:key="markjones"> 
   <name>Mark Jones</name> 
   <phone type="office">+44 200 1234 599</phone> 
   <phone type="fax">+44 200 1234 599</phone> 
   <phone type="mobile">+44 200 1234 500</phone> 
  </contact> 
 </work-contacts> 
 <home-contacts> 
  <contact ID="johnsmith"  deltaxml:key="johnsmith"> 
   <name>John Smith</name> 
   <phone type="office">+44 200 1234 555</phone> 
   <phone type="fax">+44 200 1234 568</phone> 
   <phone type="mobile">+44 200 1234 569</phone> 
  </contact> 
 </home-contacts> 
</records>

When these are compared, we get the following delta file.

Example 3: a delta file representing the changes between the contact list versions

<records deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
 <work-contacts deltaxml:deltaV2="A!=B">
  <contact deltaxml:deltaV2="A" ID="johnsmith" deltaxml:key="johnsmith">
   <name>John Smith</name>
   <phone type="office">+44 200 1234 567</phone>
   <phone type="fax">+44 200 1234 568</phone>
   <phone type="mobile">+44 200 1234 569</phone>
  </contact>
  <contact deltaxml:deltaV2="A=B" ID="markjones" deltaxml:key="markjones">
   <name>Mark Jones</name>
   <phone type="office">+44 200 1234 599</phone>
   <phone type="fax">+44 200 1234 599</phone>
   <phone type="mobile">+44 200 1234 500</phone>
  </contact>
 </work-contacts>
 <home-contacts deltaxml:deltaV2="A!=B">
  <contact deltaxml:deltaV2="B" ID="johnsmith" deltaxml:key="johnsmith">
   <name>John Smith</name>
   <phone type="office">+44 200 1234 555</phone>
   <phone type="fax">+44 200 1234 568</phone>
   <phone type="mobile">+44 200 1234 569</phone>
  </contact>
 </home-contacts>
</records>

The move is obvious by inspection here because we have an element with an ID="johnsmith" and a deltaxml:delta="B" attribute on it, and another with the same ID attribute and a deltaxml:delta="A" attribute. By processing the delta file, these situations can be detected and the elements marked to indicate that they represent a move operation. Note that the phone number is not identified as changed, but the new number is present in the moved data, so the data is correct.

The above example also demonstrates the use of keys in the matching process. We would recommend that moved elements should also be keyed for consistent move results. For details of keying please see Using Keys with Ordered Data.

From DeltaXML Core Release 5.3.3 onwards it is possible to call the compare function from within an XSLT stylesheet, and this would enable an actual comparison to be made for these moved objects during the processing of the delta file. An XSLT stylesheet for processing this example data is presented below.

Example 4: an XSLT stylesheet that detects moves in the contact list (handle-moves.xsl in the sample directory)

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
  xmlns:deltajava="java:com.deltaxml.ext.xslt.saxon.Comparison"
  version="2.0">
  
  <!-- This filter detects and processes contact 'moves'.  A move is defined
    as exactly one added and one deleted contact with the same ID attribute.
    Keys are used to provide efficiency and count adds/deletes.
    We support moves anywhere within the document.  -->
  
  <xsl:key name="deletedContacts" match="contact[@deltaxml:deltaV2='A']" use="@ID"/>
  <xsl:key name="addedContacts" match="contact[@deltaxml:deltaV2='B']" use="@ID"/>
  
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="contact[@deltaxml:deltaV2='B']
                              [count(key('deletedContacts', @ID)) eq 1 and count(key('addedContacts', @ID)) eq 1]">
    <xsl:copy>
      <!-- if you don't care about changes to moved contacts you could just apply-templates now,
            otherwise need to re-compare the added/deleted contact using the compare extension function -->
      <xsl:variable name="a" select="key('deletedContacts', @ID)" as="node()"/>
      <xsl:variable name="b" select="." as="node()"/>
      <xsl:variable name="delta" as="node()"
          select="deltajava:compare($a, $b, resolve-uri('process-moves.dxp', static-base-uri()))"/>
      <xsl:apply-templates select="$delta/*/@*"/>
      <xsl:comment>This contact element was detected as a move</xsl:comment>
      <xsl:apply-templates select="$delta/*/node()"/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="contact[@deltaxml:deltaV2='A']
                              [count(key('deletedContacts', @ID)) eq 1 and count(key('addedContacts', @ID)) eq 1]">
    <xsl:comment>A contact element with id <xsl:value-of select="@ID"/> was at this location but has been moved</xsl:comment>
  </xsl:template>
</xsl:stylesheet>

When this stylesheet is used to process the previous delta, the improved delta with the moves detected and compared becomes:

<records xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" deltaxml:deltaV2="A!=B" 
         deltaxml:version="2.0" deltaxml:content-type="full-context">
  <work-contacts deltaxml:deltaV2="A!=B">
    <!--A contact element with id johnsmith was at this location but has been moved-->
    <contact deltaxml:deltaV2="A=B" ID="markjones" deltaxml:key="markjones">
      <name>Mark Jones</name>
      <phone type="office">+44 200 1234 599</phone>
      <phone type="fax">+44 200 1234 599</phone>
      <phone type="mobile">+44 200 1234 500</phone>
    </contact>
  </work-contacts>
  <home-contacts deltaxml:deltaV2="A!=B">
    <contact deltaxml:deltaV2="A!=B" ID="johnsmith" deltaxml:key="johnsmith">
      <!--This contact element was detected as a move-->
      <name deltaxml:deltaV2="A=B">John Smith</name>
      <phone deltaxml:deltaV2="A!=B" type="office">
        <deltaxml:textGroup deltaxml:deltaV2="A!=B">
          <deltaxml:text deltaxml:deltaV2="A">+44 200 1234 567</deltaxml:text>
          <deltaxml:text deltaxml:deltaV2="B">+44 200 1234 555</deltaxml:text>
        </deltaxml:textGroup>
      </phone>
      <phone deltaxml:deltaV2="A=B" type="fax">+44 200 1234 568</phone>
      <phone deltaxml:deltaV2="A=B" type="mobile">+44 200 1234 569</phone>
    </contact>
  </home-contacts>
</records>

DeltaXML allows keys to control the comparison process and these would typically be useful for detecting moves, as shown in the use of keys in the above example.

4. How to identify moves by pre-processing the data

Another approach to this problem is to generate, from the original data, a file containing the elements that you know may have moved and that you wish to be compared.

Using the above example, we could filter the original files to generate these files. Here we have removed the <work-contacts> and <home-contacts> wrapper elements and just listed the <contact> elements. We have added a path attribute (or we could have used a sub-element) to record the original ancestors.

<records xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" 
    deltaxml:ordered="false"> 
 <contact ID="johnsmith" deltaxml:key="johnsmith" path="records/work-contacts/"> 
   <name>John Smith</name> 
   <phone type="office">+44 200 1234 567</phone> 
   <phone type="fax">+44 200 1234 568</phone> 
   <phone type="mobile">+44 200 1234 569</phone> 
  </contact> 
  <contact ID="markjones" deltaxml:key="markjones" path="records/work-contacts/"> 
   <name>Mark Jones</name> 
   <phone type="office">+44 200 1234 599</phone> 
   <phone type="fax">+44 200 1234 599</phone> 
   <phone type="mobile">+44 200 1234 500</phone> 
  </contact> 
</records>

And, in the modified version we have this:

<records xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" 
    deltaxml:ordered="false"> 
 <contact ID="markjones" deltaxml:key="markjones" path="records/work-contacts/"> 
   <name>Mark Jones</name> 
   <phone type="office">+44 200 1234 599</phone> 
   <phone type="fax">+44 200 1234 599</phone> 
   <phone type="mobile">+44 200 1234 500</phone> 
  </contact> 
 <contact ID="johnsmith"  deltaxml:key="johnsmith" path="records/home-contacts/"> 
   <name>John Smith</name> 
   <phone type="office">+44 200 1234 555</phone> 
   <phone type="fax">+44 200 1234 568</phone> 
   <phone type="mobile">+44 200 1234 569</phone> 
  </contact> 
</records>

Note that we have added deltaxml:ordered="false" to the root element because the <contact> elements could appear in any order. An alternative is to sort them by key and this may be easier to check. When these are compared we will see the changes to each contact in terms of its contents and its position in the original file.

<records deltaxml:deltaV2="A!=B" deltaxml:ordered="false" deltaxml:version="2.0" 
 deltaxml:content-type="full-context">
 <contact deltaxml:deltaV2="A!=B" deltaxml:key="johnsmith" ID="johnsmith">
  <deltaxml:attributes deltaxml:deltaV2="A!=B">
   <dxa:path deltaxml:deltaV2="A!=B">
    <deltaxml:attributeValue deltaxml:deltaV2="A">records/work-contacts/</deltaxml:attributeValue>
    <deltaxml:attributeValue deltaxml:deltaV2="B">records/home-contacts/</deltaxml:attributeValue>
   </dxa:path>
  </deltaxml:attributes>
  <name deltaxml:deltaV2="A=B">John Smith</name>
  <phone deltaxml:deltaV2="A!=B" type="office">
   <deltaxml:textGroup deltaxml:deltaV2="A!=B">
    <deltaxml:text deltaxml:deltaV2="A">+44 200 1234 567</deltaxml:text>
    <deltaxml:text deltaxml:deltaV2="B">+44 200 1234 555</deltaxml:text>
   </deltaxml:textGroup>
  </phone>
  <phone deltaxml:deltaV2="A=B" type="fax">+44 200 1234 568</phone>
  <phone deltaxml:deltaV2="A=B" type="mobile">+44 200 1234 569</phone>
 </contact>
 <contact deltaxml:deltaV2="A=B" ID="markjones" deltaxml:key="markjones" path="records/work-contacts/">
  <name>Mark Jones</name>
  <phone type="office">+44 200 1234 599</phone>
  <phone type="fax">+44 200 1234 599</phone>
  <phone type="mobile">+44 200 1234 500</phone>
 </contact>
</records>

This delta provides details only of the changed contacts and those that have been moved. The moves are indicated by a change in the path attribute.

DeltaXML provides a versatile solution to handling additions and deletions and can be configured to show moves as indicated above.

5. Running the sample

The sample code shows how to detect and handle moves. Two different ways of achieving the same result are provided. The first method shown is how to use the Pipelined Comparator and a DXP file to configure the pipeline. The second method shows how to use the Java API and the Document Comparator.

If you have Ant installed, use the build script provided to run the sample. Simply type the following command.

ant run

If you don't have Ant installed, you can run the sample from a command line by issuing commands from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).

5.1. Pipelined Comparator

The input data files, r1.xml and r2.xml, are located in the pipelined directory.

If you have Ant installed, you may use the build script provided to run the sample by simply typing the following command. This will run the pipeline and produce the output file result.xml in the pipelined directory.

ant run-dxp

If you don't have Ant installed, you can run the sample from a command line by issuing the following command from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).

java -jar ../../command.jar compare moves pipelined/r1.xml pipelined/r2.xml pipelined/result.xml

5.2. Document Comparator

The input files for the Document Comparator sample are in the document directory. They are called doc1.xml and doc2.xml and are in DocBook format.

To run just the Document Comparator sample if you have Ant installed, simply type the following command to run the pipeline and produce the output file result.xml in the document directory.

ant run-dc

If you don't have Ant installed, you can run the sample from a command line by issuing the following command from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).

mkdir bin
javac -cp bin:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/HandleMovesSample.java
java -cp bin:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin/com/deltaxml/samples/ com.deltaxml.samples.HandleMovesSample document/doc1.xml document/doc2.xml document/result.xml

To clean up the sample directory, run the following Ant command.

ant clean