Managing change in an XML environment

How to Ignore Formatting Element Changes

1 Introduction

Many XML languages describe documents; XHTML, DocBook and ODF are a few examples. One common feature of these languages is that some of their elements do not define structure but mark text as having a certain format. Examples include <strong> and <em> in XHTML, <emphasis> in DocBook and <text:span> in ODF.

DeltaXML Core makes no distinction between structural elements and formatting elements when comparing two versions of a document. Because of this, changes to formatting elements can generate more change than expected in a delta file. Consider the following examples of a very simple documentation language.

Example 1: A simple XML document (input1.xml in the sample directory)

<document>
  <para>This paragraph will have words made bold in the following version.</para>
  <para>In this sentence, new words will be added and some made bold.</para>
</document>

Example 2: The same document with text changes and text formatting added (input2.xml in the sample directory)

<document>
  <para>This paragraph will have <bold>words made bold</bold> in the following version.</para>
  <para>In this sentence, <bold>new bold words</bold> will be added and some made bold.</para>
</document>

When these files are compared, the resulting delta file reports each newly bolded phrase as deleted text plus an added <bold> element, which is far more change than the edit involved.

Example 3: The delta file without taking formatting elements into account (text is being compared word by word)

<document xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
          deltaxml:deltaV2="A!=B"
          deltaxml:version="2.0"
          deltaxml:content-type="full-context">
  <para deltaxml:deltaV2="A!=B">
    This paragraph will have 
    <deltaxml:textGroup deltaxml:deltaV2="A">
      <deltaxml:text deltaxml:deltaV2="A">words</deltaxml:text>
    </deltaxml:textGroup>
    <bold deltaxml:deltaV2="B">words made bold</bold> 
    <deltaxml:textGroup deltaxml:deltaV2="A">
      <deltaxml:text deltaxml:deltaV2="A">made bold </deltaxml:text>
    </deltaxml:textGroup>
    in the following version.
  </para>
  <para deltaxml:deltaV2="A!=B">
    In this sentence, 
    <deltaxml:textGroup deltaxml:deltaV2="A">
      <deltaxml:text deltaxml:deltaV2="A">new</deltaxml:text>
    </deltaxml:textGroup>
    <bold deltaxml:deltaV2="B">new bold words</bold> 
    <deltaxml:textGroup deltaxml:deltaV2="A">
      <deltaxml:text deltaxml:deltaV2="A">words </deltaxml:text>
    </deltaxml:textGroup>
    will be added and some made bold.
  </para>
</document>

Although this is a correct representation of what has changed between the two versions of the document, it may not be intuitive to somebody making changes in a WYSIWYG editor. For a document editor, the most important changes are textual changes, not format changes, and this delta file shows more text change than actually occurred. DeltaXML Core includes some XSLT filters that improve this result by taking into account those elements that are used merely for textual formatting.

2 Marking up formatting elements

The first step in improving this result is to mark up the elements that are used for text formatting so that DeltaXML Core can identify them. This is achieved by adding the deltaxml:format="true" attribute to those elements. In the example documents above, the <bold> element needs to be marked in this way. The following XSLT template could be used to do this.

Example 4: An XSLT template to mark bold elements (defined in mark-formatting.xsl in the sample directory)

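<xsl:template match="bold">
  <xsl:copy>
    <!-- The select attribute on xsl:attribute requires XSLT 2.0 -->
    <xsl:attribute name="deltaxml:format" select="'true'"/>
    <!-- Only child nodes are processed; attributes on <bold> are not copied here -->
    <xsl:apply-templates select="node()"/>
  </xsl:copy>
</xsl:template>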
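
For context, the template above needs a surrounding stylesheet that declares the deltaxml namespace and copies everything else through unchanged. The following is a minimal sketch of such a stylesheet, assuming an XSLT 2.0 processor; it is not the actual content of mark-formatting.xsl, which may differ. The italic element is a hypothetical second formatting element that does not appear in the sample documents, and copying the existing attributes of a marked element is an assumption added here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the real mark-formatting.xsl in the sample directory may differ. -->
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">

  <!-- Identity template: copy every node that no other template handles. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Mark formatting elements. 'italic' is a hypothetical example
       and is not part of the sample documents. -->
  <xsl:template match="bold | italic">
    <xsl:copy>
      <xsl:attribute name="deltaxml:format" select="'true'"/>
      <!-- Assumption: keep existing attributes; they are written
           before any child nodes, as XSLT requires. -->
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Listing every formatting element in a single match pattern keeps the marking rules in one place, so supporting another document language only means changing that pattern.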

3 Flattening formatting elements

The next step is to flatten the marked elements to remove the structure they create in the document. This, together with a word-by-word comparison (see the word-by-word sample for more details), allows the text to be given a higher matching priority than formatting. To flatten the elements, the document is processed using the dx2-format-infilter.xsl filter included in DeltaXML Core, which produces the following document:

Example 5: The document with flattened formatting elements (indented to improve readability)

<document xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
  <para>
    This paragraph will have 
    <deltaxml:format-start>
      <deltaxml:element>
        <bold/>
      </deltaxml:element>
    </deltaxml:format-start>
    words made bold
    <deltaxml:format-end/> 
    in the following version.
  </para>
  <para>
    In this sentence, 
    <deltaxml:format-start>
      <deltaxml:element>
        <bold/>
      </deltaxml:element>
    </deltaxml:format-start>
    new bold words
    <deltaxml:format-end/>
    will be added and some made bold.
  </para>
</document>

The <deltaxml:element> element is a wrapper that stores the formatting element, along with any attributes it may have.

Now that the formatting has been flattened, all text within the paragraph is at the same level of hierarchy and so can be compared in a more appropriate way.
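
To make the flattening step concrete, the following stylesheet is a rough sketch of how a marked element could be turned into the start and end markers shown in Example 5. It is an illustration written for this document, assuming XSLT 2.0, and is not the source of dx2-format-infilter.xsl, whose actual implementation may be quite different. Dropping the deltaxml:format marker from the stored element is an assumption based on the <bold/> shown in Example 5.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; not the dx2-format-infilter.xsl shipped with the product. -->
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">

  <!-- Identity template: copy every node that no other template handles. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace each marked element with a start marker, its content
       and an end marker, so the content moves up to the parent level. -->
  <xsl:template match="*[@deltaxml:format = 'true']">
    <deltaxml:format-start>
      <deltaxml:element>
        <!-- Store an empty copy of the element with its attributes,
             leaving out the deltaxml:format marker (assumption). -->
        <xsl:copy>
          <xsl:copy-of select="@* except @deltaxml:format"/>
        </xsl:copy>
      </deltaxml:element>
    </deltaxml:format-start>
    <!-- The former children become siblings of the markers. -->
    <xsl:apply-templates select="node()"/>
    <deltaxml:format-end/>
  </xsl:template>

</xsl:stylesheet>
```

Because the content of the formatting element is emitted between the two markers rather than inside them, the text of the paragraph ends up as one flat sequence of siblings.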

4 Reconstructing formatting

Once the comparison has taken place, the flattened formatting elements need to be reconstructed. Because it may not be possible to reconstruct elements from both versions of the document (there is no guarantee that the nesting will be correct for a well-formed XML document), only the formatting elements from the second input are reconstructed. This means that any flattened formatting elements with a delta value of 'A' are ignored. This has the effect of ignoring all formatting changes and showing the formatting of the second document in the result but with textual changes marked.

To reconstruct the formatting, the document should be processed with the dx2-format-outfilter.xsl XSLT stylesheet, included with DeltaXML Core. Since we are ignoring certain changes, there may be cases where a delta value of 'A!=B' is no longer correct for a paragraph (because the only changes were to formatting). In order to correct these delta values, the final filter from the ignore-changes solution, propagate-ignore-changes.xsl, should be run.

The final result for the example files above is as follows:

Example 6: The final result

<document xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
          deltaxml:deltaV2="A!=B"
          deltaxml:version="2.0"
          deltaxml:content-type="full-context">
  <para deltaxml:deltaV2="A=B">
    This paragraph will have <bold>words made bold</bold> in the following version.
  </para>
  <para deltaxml:deltaV2="A!=B">
    In this sentence, <bold deltaxml:deltaV2="A!=B">new 
    <deltaxml:textGroup deltaxml:deltaV2="B">
      <deltaxml:text deltaxml:deltaV2="B">bold </deltaxml:text>
    </deltaxml:textGroup>
    words</bold> will be added and some made bold.
  </para>
</document>

Note that the first paragraph is marked as unchanged because there have been no textual changes. The second paragraph shows the new word added within the newly bolded text.
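
The reconstruction idea described in this section can be sketched in XSLT 2.0 as follows. This is a simplified illustration under stated assumptions, not the source of dx2-format-outfilter.xsl: it assumes the markers carry their delta value in a deltaxml:deltaV2 attribute, it handles only formatting that is not nested, and it does not recompute any delta attributes (that is the job of propagate-ignore-changes.xsl in the real pipeline).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; not the dx2-format-outfilter.xsl shipped with the product. -->
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">

  <!-- Identity template: copy every node that no other template handles. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rebuild formatting inside any element that contains start markers. -->
  <xsl:template match="*[deltaxml:format-start]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Drop markers that exist only in the first input (delta 'A'),
           then start a new group at every remaining marker. -->
      <xsl:for-each-group
          select="node()[not((self::deltaxml:format-start or self::deltaxml:format-end)
                             and @deltaxml:deltaV2 = 'A')]"
          group-starting-with="deltaxml:format-start | deltaxml:format-end">
        <xsl:choose>
          <!-- Group opened by a start marker: wrap what follows it
               in a copy of the stored formatting element. -->
          <xsl:when test="self::deltaxml:format-start">
            <xsl:variable name="content" select="current-group()[position() gt 1]"/>
            <xsl:for-each select="deltaxml:element/*">
              <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:apply-templates select="$content"/>
              </xsl:copy>
            </xsl:for-each>
          </xsl:when>
          <!-- Group opened by an end marker: discard the marker and
               keep the content that follows it. -->
          <xsl:when test="self::deltaxml:format-end">
            <xsl:apply-templates select="current-group()[position() gt 1]"/>
          </xsl:when>
          <!-- Content before the first marker is copied as it is. -->
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Filtering out the 'A' markers before grouping is what makes the formatting of the second input win: only markers that are present in the second document can open or close a reconstructed element.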

5 Summary

  • Changes to formatting elements are shown as structural changes unless the elements are marked.
  • Formatting elements can be marked and flattened using filters provided as part of DeltaXML Core.
  • Once documents have been compared, flattened formatting is reconstructed.
  • The result is a delta file that focuses on textual change.

6 Running the sample

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output files no-format-result.xml and format-result.xml.

ant run

If you don't have Ant installed, you can run the sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct slashes for your operating system).

java -jar ../../command.jar compare format input1.xml input2.xml no-format-result.xml mark-formatting=false
java -jar ../../command.jar compare format input1.xml input2.xml format-result.xml mark-formatting=true

To clean up the sample directory, run the following Ant command.

ant clean