Ignoring Changes and Creating a Merged Document

1. Introduction

The XML files that you are comparing may contain data that you expect to change. You may wish to ignore these changes. From release 5.1 of DeltaXML Core, XSLT filters are provided to allow you to ignore selected changes. This makes it easy to generate some forms of output from the delta file. The last section in this document describes some use cases for this which allow you to "merge" two documents in a controlled way.

1.1. What does "ignore" really mean?

First, we need to ask the question: What is meant by "ignore"?

Consider this very simple example of attribute change:

Input 1:

<x y='1'/> 

Input 2:

<x y='2'/>

Ignore could mean:

  1. remove it completely from the result: <x/>
  2. prefer the 'A' or 'old' value: <x y='1'/>
  3. prefer the 'B' or 'new' value: <x y='2'/>
  4. take the average of any values with numerical/time data types: <x y='1.5'/>
  5. put in a difference marker: <x y='changed'/>
  6. find some way to represent them both: <x y='1|2'/> 

All of these approaches are possible using an output filter; however, this document concentrates on a generic approach and describes the filters included in DeltaXML Core since release 5.1, which implement the first three strategies above.

2. Example data

This document discusses how you might handle merges using two sets of input data; one data-centric and one document-centric. Two practical solutions are presented, one for each input data set, with each solution using a different comparator and method for customising a comparison:

  • Pipelined Comparator (DXP) - Uses a filter pipeline defined by an XML file called a 'DXP' to customise the comparison.
  • Document Comparator - Uses Java API calls to customise a pre-existing pipeline with a number of extension points. The Document Comparator provides a solution tailored to comparing structured documents.

2.1. Pipelined Comparator

Imagine comparing the following two inputs, with the intention of ignoring the change made to the lastUpdated attribute:

Example 1.1: a small address book as an XML file (documentA.xml in the sample directory)

<addressBook>
  <person lastUpdated="01012008">
    <log/>
    <name>Joe Blogs</name>
    <telephone>01234 567890</telephone>
    <email>joe@blogs.com</email>
  </person>
</addressBook>

Example 2.1: an updated version of the address book (documentB.xml in the sample directory)

<addressBook>
  <person lastUpdated="01022008">
    <log>
      <lastLoggedIn>01032008</lastLoggedIn>
    </log>
    <name>Joe Blogs</name>
    <telephone>01235 467890</telephone>
    <email>joe@blogs.co.uk</email>
  </person>
</addressBook>

DeltaXML Core will produce the following delta:

<addressBook xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
             xmlns:dxx="http://www.deltaxml.com/ns/xml-namespaced-attribute"
             xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
             deltaxml:deltaV2="A!=B"
             deltaxml:version="2.0"
             deltaxml:content-type="full-context">
   <person deltaxml:deltaV2="A!=B">
      <deltaxml:attributes deltaxml:deltaV2="A!=B">
         <dxa:lastUpdated deltaxml:deltaV2="A!=B">
            <deltaxml:attributeValue deltaxml:deltaV2="A">01012008</deltaxml:attributeValue>
            <deltaxml:attributeValue deltaxml:deltaV2="B">01022008</deltaxml:attributeValue>
         </dxa:lastUpdated>
      </deltaxml:attributes>
      <log deltaxml:deltaV2="A!=B">
         <lastLoggedIn deltaxml:deltaV2="B">01032008</lastLoggedIn>
      </log>
      <name deltaxml:deltaV2="A=B">Joe Blogs</name>
      <telephone deltaxml:deltaV2="A!=B">
         <deltaxml:textGroup deltaxml:deltaV2="A!=B">
            <deltaxml:text deltaxml:deltaV2="A">01234 567890</deltaxml:text>
            <deltaxml:text deltaxml:deltaV2="B">01235 467890</deltaxml:text>
         </deltaxml:textGroup>
      </telephone>
      <email deltaxml:deltaV2="A!=B">
         <deltaxml:textGroup deltaxml:deltaV2="A!=B">
            <deltaxml:text deltaxml:deltaV2="A">joe@blogs.com</deltaxml:text>
            <deltaxml:text deltaxml:deltaV2="B">joe@blogs.co.uk</deltaxml:text>
         </deltaxml:textGroup>
      </email>
   </person>
</addressBook>

These are the changes represented in our deltaV2 format. While this may look overly complicated for such a simple change, it makes the job of processing it a lot easier. A side-effect of representing attribute changes as elements is the addition of the dxa namespace. This is needed because a non-qualified attribute does not belong to the namespace of the document but to an anonymous one, and this anonymous namespace has to be represented. The implication is that when promoting such an attribute back onto its element, we need to make sure that it is placed in the correct namespace.

2.2. Document Comparator

Imagine comparing the following two inputs, with the intention of ignoring the change made to the revision attribute of the author, and also the date elements:

Example 1.2: the author information from a DocBook file (document/documentA.xml in the sample directory)

<article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink"
         version="5.0">
  <info>
    <title>Ignore Changes Sample</title>
      <author revision="1.0">
        <personname>Joe Bloggs</personname>
        <address>
          <phone>+44 200 1234 567</phone> 
          <email>joe@blogs.com</email>
        </address>
        <personblurb><info></info><para></para></personblurb>
      </author>
  </info>
  <sect1>
    <title>Ignore Changes</title>
    <para><date>20141229</date>The input document for the ignore changes sample.</para>
  </sect1>
</article>

Example 2.2: an updated version of the author information with changed telephone numbers and updated dates (document/documentB.xml in the sample directory)

<article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink"
         version="5.0">
  <info>
    <title>Ignore Changes Sample</title>
      <author revision="1.1">
        <personname>Joe Bloggs</personname>
        <address>
          <phone>+44 200 1235 890</phone> 
          <email>joe@blogs.co.uk</email>
        </address>
        <personblurb><info><date>01032008</date></info><para></para></personblurb>
      </author>
  </info>
  <sect1>
    <title>Ignore Changes</title>
    <para><date>20150105</date>The input document for the ignore changes sample.</para>
  </sect1>
</article>

DeltaXML Core will produce the following delta:

<article xmlns="http://docbook.org/ns/docbook"
  xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" deltaxml:deltaV2="A!=B"
  deltaxml:word-by-word="false" version="5.0" deltaxml:version="2.1"
  deltaxml:content-type="full-context">
  <preserve:xmldecl xmlns:preserve="http://www.deltaxml.com/ns/preserve" deltaxml:ignore-changes="B"
    deltaxml:deltaV2="A=B" xml-version="1.0" encoding="UTF-8"/>
  <info deltaxml:deltaV2="A!=B">
    <title deltaxml:deltaV2="A=B">Ignore Changes Sample</title>
    <author deltaxml:deltaV2="A!=B">
      <deltaxml:attributes deltaxml:deltaV2="A!=B">
        <dxa:revision xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
          deltaxml:deltaV2="A!=B">
          <deltaxml:attributeValue deltaxml:deltaV2="A">1.0</deltaxml:attributeValue>
          <deltaxml:attributeValue deltaxml:deltaV2="B">1.1</deltaxml:attributeValue>
        </dxa:revision>
      </deltaxml:attributes>
      <personname deltaxml:deltaV2="A=B">Joe Bloggs</personname>
      <address deltaxml:deltaV2="A!=B">
        <phone deltaxml:deltaV2="A!=B">
          <deltaxml:textGroup deltaxml:deltaV2="A!=B">
            <deltaxml:text deltaxml:deltaV2="A">+44 200 1234 567</deltaxml:text>
            <deltaxml:text deltaxml:deltaV2="B">+44 200 1235 890</deltaxml:text>
          </deltaxml:textGroup>
        </phone>
        <email deltaxml:deltaV2="A!=B">
          <deltaxml:textGroup deltaxml:deltaV2="A!=B">
            <deltaxml:text deltaxml:deltaV2="A">joe@blogs.com</deltaxml:text>
            <deltaxml:text deltaxml:deltaV2="B">joe@blogs.co.uk</deltaxml:text>
          </deltaxml:textGroup>
        </email>
      </address>
      <personblurb deltaxml:deltaV2="A!=B">
        <info deltaxml:deltaV2="A!=B">
          <date deltaxml:deltaV2="B">01032008</date>
        </info>
        <para deltaxml:deltaV2="A=B"/>
      </personblurb>
    </author>
  </info>
  <sect1 deltaxml:deltaV2="A!=B">
    <title deltaxml:deltaV2="A=B">Ignore Changes</title>
    <para deltaxml:deltaV2="A!=B">
      <date deltaxml:deltaV2="A!=B">
        <deltaxml:textGroup deltaxml:deltaV2="A!=B">
          <deltaxml:text deltaxml:deltaV2="A">20141229</deltaxml:text>
          <deltaxml:text deltaxml:deltaV2="B">20150105</deltaxml:text>
        </deltaxml:textGroup>
      </date>The input document for the ignore changes sample.</para>
  </sect1>
</article>

These are the changes represented in our deltaV2 format. While this may look overly complicated for such a simple change, it makes the job of processing it a lot easier. A side-effect of representing attribute changes as elements is the addition of the dxa namespace. This is needed because a non-qualified attribute does not belong to the namespace of the document but to an anonymous one, and this anonymous namespace has to be represented. The implication is that when promoting such an attribute back onto its element, we need to make sure that it is placed in the correct namespace.

3. Marking data that needs to be ignored

Next we need to mark the data to be ignored. This is achieved by placing the deltaxml:ignore-changes attribute on one of the following:

  • to ignore an attribute change: on the appropriate child of deltaxml:attributes which is representing the attribute you wish to ignore,
  • to ignore a sub-tree change: on the top most node in the sub-tree with a deltaxml:deltaV2 attribute,
  • to ignore a text change: on the deltaxml:textGroup.

By adding the attribute deltaxml:ignore-changes='B,A', you instruct the apply-ignore-changes.xsl filter to change the delta of the modification to unchanged and to copy the new (B) version. If there is no new version (i.e. in the case of a deletion), the old (A) version is used. This behaviour can be controlled by using a different value for the deltaxml:ignore-changes attribute; the legal values are shown below:

deltaxml:ignore-changes value   Description
"B,A" or "true"                 Default. Copy new value if it exists, otherwise copy old value.
"A,B"                           Copy old value if it exists, otherwise copy new value.
"A"                             Copy old value if it exists, otherwise don’t output.
"B"                             Copy new value if it exists, otherwise don’t output.
""                              Don’t copy under any circumstances (but process the subtree if present).

The ignore-changes attribute can be added using an XSLT stylesheet.

Note that if you want to ignore specific changes to comments or processing instructions, you will need to change the lexical preservation settings on the Comparator. See the Preserving Processing Instructions and Comments sample for more information.
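
To make the values in the table above concrete, here is a sketch based on the introductory example (<x y='1'/> compared with <x y='2'/>). The delta below is written by hand in the style of the deltas shown earlier rather than produced by a comparison, and the outcomes listed in the trailing comment are simply read off the table of legal values.

```xml
<x xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
   xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
   deltaxml:deltaV2="A!=B">
  <deltaxml:attributes deltaxml:deltaV2="A!=B">
    <!-- The marker goes on the child of deltaxml:attributes that represents y -->
    <dxa:y deltaxml:deltaV2="A!=B" deltaxml:ignore-changes="B,A">
      <deltaxml:attributeValue deltaxml:deltaV2="A">1</deltaxml:attributeValue>
      <deltaxml:attributeValue deltaxml:deltaV2="B">2</deltaxml:attributeValue>
    </dxa:y>
  </deltaxml:attributes>
</x>
<!-- Expected value of y for each ignore-changes value, per the table:
     "B,A" or "true"   y="2"
     "A,B"             y="1"
     "A"               y="1"
     "B"               y="2"
     ""                attribute not output
-->
```

The first three strategies from section 1.1 correspond to the values "" (remove), "A" or "A,B" (prefer old) and "B" or "B,A" (prefer new).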

3.1. Pipelined Comparator

An example for ignoring changes to the lastUpdated attribute and lastLoggedIn element is included below.

Example 3.1: an XSLT stylesheet to mark parts of the address book to be ignored (mark-ignore-changes.xsl in the sample directory)

<xsl:stylesheet version="2.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
  
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="deltaxml:attributes/dxa:lastUpdated">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'B,A'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="log/lastLoggedIn[@deltaxml:deltaV2]">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'B,A'"></xsl:attribute>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
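
The stylesheet above marks an attribute and an element. A text change is marked in the same way, by matching the deltaxml:textGroup. The following is a minimal sketch only: the choice of the telephone element and of the value 'A,B' (keep the old number) are assumptions made for illustration and are not part of the shipped sample.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">

  <!-- Identity template: copy everything that is not explicitly matched -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Mark the changed text of telephone so that the old (A) value is preferred -->
  <xsl:template match="telephone/deltaxml:textGroup">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'A,B'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the marker sits on the textGroup rather than on telephone itself, only the text change is affected; any other change inside telephone would still be reported.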

3.2. Document Comparator

An example for ignoring changes to the revision attribute and date elements is included below.

Example 3.2: an XSLT stylesheet to mark parts of the DocBook document to be ignored (document/mark-ignore-changes.xsl in the sample directory)

<xsl:stylesheet version="2.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
                xmlns:docbook="http://docbook.org/ns/docbook"
  >
  
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="deltaxml:attributes/dxa:revision">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'true'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="docbook:personblurb/docbook:info[@deltaxml:deltaV2]">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'true'"></xsl:attribute>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="docbook:para/docbook:date[@deltaxml:deltaV2]">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="''"></xsl:attribute>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

After the delta has been marked with the changes that should be ignored, using a filter similar to the one above, running apply-ignore-changes.xsl and then propagate-ignore-changes.xsl will process the delta, ignoring the marked data. The filter dx2-extract-version-moded.xsl is imported by apply-ignore-changes.xsl. All of these filters are supplied with versions of DeltaXML Core 5.1 and later.

The examples used in this document are available for your own experimentation in the samples directory of the DeltaXML Core release in versions 5.1 and above. The sample shows how to ignore both element and attribute changes and provides two examples, one using the Pipelined Comparator and one using the Document Comparator, of how to construct the pipeline of output filters described here.

4. Running the sample code

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output files result.xml and document/result.xml.

ant run

If you don't have Ant installed, you can run the sample from a command line by issuing commands from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).

4.1. Pipelined Comparator

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output file result.xml.

ant run-dxp

If you don't have Ant installed, you can run the sample from a command line by issuing the following command from the sample directory (ensuring that you use the correct directory separators for your operating system).

java -jar ../../command.jar compare ignore documentA.xml documentB.xml result.xml

4.2. Document Comparator

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output file document/result.xml.

ant run-dc

If you don't have Ant installed, you can run the sample from a command line by issuing the following command from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).

mkdir bin
javac -cp bin:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/IgnoreChangesSample.java
java -cp bin:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin/com/deltaxml/samples/ com.deltaxml.samples.IgnoreChangesSample document/documentA.xml document/documentB.xml document/result.xml

To clean up the sample directory, run the following Ant command.

ant clean

5. Ignore processing in further detail

This section provides some rules and further details about how ignore-changes processing works, and in particular how the apply-ignore-changes.xsl filter works.

Every element in the post-comparison XML tree has an 'effective' deltaxml:deltaV2 attribute which (a) specifies which of the inputs it was present in and (b) whether or not the elements were identical, if present in both inputs. The word effective is used because if you are in an unchanged, added or deleted sub-tree the deltaV2 attribute may only be on an ancestor element.

An element may also be governed by an ignore-changes attribute, either its own or one on an ancestor; the closest such attribute is used when determining whether the element is included in the result.

Like most filters, some data flows through unaffected. In this case, if an element does not have an ancestor ignore-changes attribute it is copied to the result as-is.

When it does have an ancestor ignore-changes attribute, the following table specifies whether that element appears in the result:

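delta / ignore-changes   ''    A     B     A,B   B,A/true
A                        -     yes   -     yes   yes
B                        -     -     yes   yes   yes
A=B                      -     yes   yes   yes   yes
A!=B                     -     yes   yes   yes   yes

(yes = the element appears in the result; - = it does not)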

The only difference in behaviour for A,B vs. B,A occurs at the leaves of the XML tree (i.e. for changed text and attributes).  When there are two possible text values in a textGroup or two possible attribute values then the choice between these settings determines which of two values is used in the result.
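
The inclusion rules can also be expressed as a small XSLT 2.0 function. This is only a sketch derived from the descriptions of the legal ignore-changes values in section 3; it is not the code used by apply-ignore-changes.xsl, and the function name local:appears and its namespace URI are hypothetical.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:local="urn:example:ignore-changes-sketch"
                exclude-result-prefixes="xs local">

  <!-- Hypothetical helper: true when an element with the given effective
       deltaV2 value is output under the given closest ignore-changes value -->
  <xsl:function name="local:appears" as="xs:boolean">
    <xsl:param name="delta" as="xs:string"/>   <!-- 'A', 'B', 'A=B' or 'A!=B' -->
    <xsl:param name="ignore" as="xs:string"/>  <!-- '', 'A', 'B', 'A,B', 'B,A' or 'true' -->
    <!-- 'true' is a synonym for 'B,A'; tokenizing '' gives the empty sequence -->
    <xsl:variable name="versions" as="xs:string*"
                  select="tokenize(if ($ignore = 'true') then 'B,A' else $ignore, ',')"/>
    <!-- Present in both inputs: any requested version will do.
         Present in one input only: that input must be among those requested. -->
    <xsl:sequence select="if ($delta = ('A=B', 'A!=B'))
                          then exists($versions)
                          else $delta = $versions"/>
  </xsl:function>

  <!-- Driver: list the outcome for every combination, whatever the input document -->
  <xsl:template match="/">
    <outcomes>
      <xsl:for-each select="('A', 'B', 'A=B', 'A!=B')">
        <xsl:variable name="delta" select="."/>
        <xsl:for-each select="('', 'A', 'B', 'A,B', 'B,A', 'true')">
          <outcome delta="{$delta}" ignore-changes="{.}"
                   appears="{local:appears($delta, .)}"/>
        </xsl:for-each>
      </xsl:for-each>
    </outcomes>
  </xsl:template>

</xsl:stylesheet>
```

The function deliberately models only whether an element appears. Which of two text or attribute values is chosen at a leaf is the separate A,B versus B,A decision described above.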

5.1. Ignore changes and attributes

There are some issues related to the closest ancestor rule outlined above when considering attributes. Attributes need to be attached to their parent element. If the ignore-changes settings specify that an element is not included, none of its attributes are included either, irrespective of their own ignore-changes settings. Here is an example:

<x deltaxml:deltaV2='A!=B' deltaxml:ignore-changes=''>
  <deltaxml:attributes deltaxml:deltaV2='A!=B'>
    <dxa:y deltaxml:deltaV2='A!=B' deltaxml:ignore-changes='B'>
      <deltaxml:attributeValue deltaxml:deltaV2="A">12</deltaxml:attributeValue>
      <deltaxml:attributeValue deltaxml:deltaV2="B">24</deltaxml:attributeValue>
    </dxa:y>
  </deltaxml:attributes>
</x>

Normally we would expect y='24' to appear in the result if we look solely at the attribute and its local ignore-changes and deltaV2 attributes. However, the ignore-changes setting on the element x means that the attribute has lost its associated parent element and therefore cannot appear in the result.

5.2. Ignore changes and element removal

It is possible to use ignore changes at the element level as well as for simple attribute and text data. This is used for merging, as discussed below, and can also be used to remove elements from the result. Here are two examples, the first removing a child element:

<x deltaxml:ignore-changes="true" deltaxml:deltaV2="A!=B">
  <y deltaxml:deltaV2="A">
     <z deltaxml:ignore-changes='B'/>
  </y>
</x>

In the above example, the ignore-changes setting prevents the z element from appearing in the result. Note that as well as occurring at the bottom of a hierarchy, this can also occur within a hierarchy. Here is another example:

<chapter deltaxml:deltaV2="A!=B">
  <section deltaxml:deltaV2="A" deltaxml:ignore-changes='B'>
    <pagebreak deltaxml:ignore-changes='A'/>
  </section>
</chapter>

The ignore-changes settings preclude the section from appearing in the result, but the same is not true for the pagebreak element, which is effectively promoted in the result of the filter:

<chapter deltaxml:deltaV2="A!=B">
  <pagebreak deltaxml:deltaV2='A=B'/>
</chapter>

6. How to merge two documents using deltaxml:ignore-changes

In the examples above, the deltaxml:ignore-changes attribute is applied to individual elements or attributes in the data. However, it can also be applied to a subtree or indeed the entire document. When applied to a subtree, changes in that subtree are removed and therefore the only deltaxml:deltaV2 attribute will be located at the top of the subtree. If other parts of the file are not marked and processed then any deltaV2 markup remains, describing the changes to those parts of the tree.

If, for example, you place the attribute deltaxml:ignore-changes="B,A" on the root element, then you will get a merge of the two documents, with the B values of data (attributes, text etc.) being used in preference to the A values when both are present. The following 'mark' stylesheet will match the root element and add this attribute:

<xsl:stylesheet version="2.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
  
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'B,A'"/>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

When this is processed with the usual 'mark', 'apply', 'propagate' chain of filters, the result will only have a deltaxml:deltaV2='A=B' attribute on the root element of the result tree, and all other change markup will have been removed. The clean-house.xsl filter can then be used to remove this attribute and other delta attributes, giving a result very close in style to the original inputs but including as much content from both inputs as possible.
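
As an illustration of what such a merge looks like, consider the address book from section 2.1 with deltaxml:ignore-changes="A,B" on the root element instead of "B,A". The listing below is derived by hand from the rules in this document and is not captured tool output; it shows the shape of the result that would be expected after the mark, apply, propagate and clean-house.xsl steps.

```xml
<addressBook>
  <!-- lastUpdated, telephone and email exist in both inputs: the A values win -->
  <person lastUpdated="01012008">
    <log>
      <!-- lastLoggedIn exists only in B, so it is still included -->
      <lastLoggedIn>01032008</lastLoggedIn>
    </log>
    <name>Joe Blogs</name>
    <telephone>01234 567890</telephone>
    <email>joe@blogs.com</email>
  </person>
</addressBook>
```

With "B,A" the same reasoning would be expected to reproduce documentB.xml for this particular pair of inputs, because B only adds to or changes the content of A.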