Managing change in an XML environment

How to Use Word By Word Text Comparison

1 Introduction

When comparing documents containing text, DeltaXML Core treats each block of text as a single node. This can lead to large amounts of change being reported when in fact only certain words within the text have changed. Consider the following document and the changes made to it:

Example 1: an XML document containing text (input1.xml in the sample directory)

<document>
  <para>A pangram uses all the letters of the alphabet. It is often used to test typewriters 
    or computer keyboards, for example: The quick brown fox jumps over the lazy dog.</para>
</document>

Example 2: a modified version of the document (input2.xml in the sample directory)

<document>
  <para>A pangram uses all the letters of the alphabet and is often used to test typewriters
    or computer keyboards, for example: A quick movement of the enemy will jeopardize six gunboats.</para>
</document>

If these inputs are compared as they are, we get the following result (the actual delta file is converted to a colour-coded result to make it easier to read):

Example 3: the result of comparing the documents above

<document>
  <para>A pangram uses all the letters of the alphabet. It is often used to test typewriters or computer keyboards, 
        for example: The quick brown fox jumps over the lazy dog.A pangram uses all the letters of the alphabet and is 
        often used to test typewriters or computer keyboards, for example: A quick movement of the enemy will jeopardize
        six gunboats.</para>
</document>

2 Comparing text word by word

While the result above is technically correct, it is not particularly useful for displaying what has actually changed. A much better approach would be to compare the text on a word by word basis.

2.1 Word by word filters

DeltaXML Core includes Java filters to split text into individual words before comparing them and also to convert the split words back into larger chunks of text. This allows the comparison to show only those words that have changed. These filters are com.deltaxml.pipe.filters.dx2.wbw.WordInfilter and com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter. The following result shows the effect they have on the sample data:

Example 4: the comparison result with word by word filters added

<document>
  <para>A pangram uses all the letters of the alphabet.alphabetItand is often used to test typewriters or computer 
        keyboards, for example: TheA quick brownmovementfoxofjumps over the lazyenemydog.will jeopardize six gunboats.</para>
</document>

2.2 Punctuation definition

While this has improved the result and shows the changes in more detail, it could still be improved further. One problem is that it shows 'alphabet.' as changing to 'alphabet' when all that has changed is the removal of the full stop at the end. This is because the word infilter only differentiates between whitespace and non-whitespace text by default, so the full stop is seen as being part of the word. The solution to this problem is to define a list of punctuation characters so that the infilter can differentiate them from text. This can be achieved by adding an attribute called deltaxml:punctuation to the XML document, whose value is a space-separated list of characters.

The following XSLT template would add this attribute to the root element of our input documents:

Example 5: a template to define punctuation on the root element (defined in define-punctuation.xsl in the sample directory)

<xsl:template match="/*">
  <xsl:copy>
    <xsl:apply-templates select="@*"/>
    <xsl:attribute name="deltaxml:punctuation" select="'. , ;'"/>
    <xsl:apply-templates select="node()"/>
  </xsl:copy>
</xsl:template>

When the punctuation definition is added to the pipeline, the result becomes:

Example 6: the comparison result with punctuation defined

<document>
  <para>A pangram uses all the letters of the alphabet.Itand is often used to test typewriters or computer keyboards,
        for example: TheA quick brownmovementfoxofjumps over the lazyenemydogwill jeopardize six gunboats.</para>
</document>
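
The effect of the punctuation list on how text is split can be illustrated with a small, self-contained Java sketch. This is not the WordInfilter implementation: the class WordSplitSketch and its split method are hypothetical names used only for illustration, and the sketch works on plain strings rather than on XML events.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not the DeltaXML WordInfilter implementation. */
final class WordSplitSketch {

    /**
     * Splits text into word, whitespace and punctuation tokens.
     * The punctuation argument is a space-separated list of characters,
     * mirroring the value of the deltaxml:punctuation attribute.
     */
    static List<String> split(String text, String punctuation) {
        // Drop the separating spaces to get the set of punctuation characters.
        String punct = punctuation.replace(" ", "");
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean currentIsSpace = false;
        for (char c : text.toCharArray()) {
            if (punct.indexOf(c) >= 0) {
                // A punctuation character always becomes a token of its own.
                flush(tokens, current);
                tokens.add(String.valueOf(c));
            } else {
                boolean isSpace = Character.isWhitespace(c);
                // Start a new token when switching between whitespace and text.
                if (current.length() > 0 && isSpace != currentIsSpace) {
                    flush(tokens, current);
                }
                current.append(c);
                currentIsSpace = isSpace;
            }
        }
        flush(tokens, current);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Default behaviour: whitespace versus non-whitespace only.
        System.out.println(split("the alphabet. It is", ""));
        // With the punctuation list from Example 5.
        System.out.println(split("the alphabet. It is", ". , ;"));
    }
}
```

With an empty punctuation list the full stop stays attached to 'alphabet.', so the word cannot match 'alphabet' in the other input. With the list '. , ;' the full stop becomes a separate token and only that token is reported as changed.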

2.3 Orphaned Words

The next improvement to make is to the changed sentence at the end of the text. This is made up of fragmented added/deleted words alongside the occasional unchanged word that was common between the two sentences. This is not easy to read due to the fragmentation. The OrphanedWordOutfilter is a post-processing filter that will detect unchanged words in the middle of changes and duplicate them as an added and deleted copy of the same word. In the example above, the words 'quick' and 'the' would be treated in this way. Applying this filter gives the following result: 

Example 7: the comparison result with orphaned words detected

<document>
  <para>A pangram uses all the letters of the alphabet.Itand is often used to test typewriters or computer keyboards, 
        for example: TheAquickquickbrownmovementfoxofjumps over thethelazyenemydogwill jeopardize six gunboats.</para>
</document>

This filter does not appear to have improved the result much, but its effect can only be seen once the final filter has been applied.

2.4 Unchanged spaces

The final improvement to make is to handle the unchanged spaces that appear throughout the fragmented final sentence. These are present because spaces between words typically match up as unchanged items regardless of the matching of the words themselves. The WordSpaceFixup filter works in a similar way to the OrphanedWordOutfilter but on spaces instead of words. It repeats these unchanged spaces as added/deleted spaces and will also gather together the added and deleted items so that they read better (it is this step that would not make such an improvement if the OrphanedWordOutfilter had not been applied). Applying the WordSpaceFixup filter gives the following result:

<document>
  <para>A pangram uses all the letters of the alphabet. It and is often used to test typewriters or computer keyboards, 
        for example: The quick brown fox jumps over the lazy dogA quick movement of the enemy will jeopardize six gunboats.</para>
</document>

This final result shows the changes in more detail while remaining readable.
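
The interplay between the orphaned word step and the space fixup step can be sketched with a toy model in Java. Everything in it is an assumption made for illustration: the class FixupSketch, the Token type, and the [-deleted-]{+added+} notation are hypothetical and are not part of DeltaXML Core. The sketch only handles single orphaned words, ignores the configuration parameters, and folds the gathering of added and deleted items into the render method for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the two fixup steps; not the DeltaXML filter code. */
final class FixupSketch {
    enum Status { UNCHANGED, DELETED, ADDED }

    static final class Token {
        final String text;
        final Status status;
        Token(String text, Status status) { this.text = text; this.status = status; }
        boolean isSpace() { return text.trim().isEmpty(); }
    }

    /** Replaces qualifying unchanged tokens with a deleted copy and an added copy. */
    static List<Token> duplicateUnchanged(List<Token> in, boolean spaces) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Token t = in.get(i);
            boolean candidate = t.status == Status.UNCHANGED && t.isSpace() == spaces;
            // Words look past unchanged spaces for a changed neighbour;
            // spaces only look at the directly adjacent tokens.
            if (candidate && changedAt(in, i, -1, !spaces) && changedAt(in, i, 1, !spaces)) {
                out.add(new Token(t.text, Status.DELETED));
                out.add(new Token(t.text, Status.ADDED));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    private static boolean changedAt(List<Token> in, int i, int step, boolean skipSpaces) {
        int j = i + step;
        while (skipSpaces && j >= 0 && j < in.size()
                && in.get(j).status == Status.UNCHANGED && in.get(j).isSpace()) {
            j += step;
        }
        return j >= 0 && j < in.size() && in.get(j).status != Status.UNCHANGED;
    }

    /** Renders tokens, gathering each run of changes into one deleted and one added chunk. */
    static String render(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        StringBuilder del = new StringBuilder();
        StringBuilder add = new StringBuilder();
        for (Token t : tokens) {
            if (t.status == Status.DELETED) {
                del.append(t.text);
            } else if (t.status == Status.ADDED) {
                add.append(t.text);
            } else {
                flush(out, del, add);
                out.append(t.text);
            }
        }
        flush(out, del, add);
        return out.toString();
    }

    private static void flush(StringBuilder out, StringBuilder del, StringBuilder add) {
        if (del.length() > 0) out.append("[-").append(del).append("-]");
        if (add.length() > 0) out.append("{+").append(add).append("+}");
        del.setLength(0);
        add.setLength(0);
    }

    /** "The quick brown" compared with "A quick movement", as a word by word result. */
    static List<Token> sample() {
        return Arrays.asList(
            new Token("The", Status.DELETED), new Token("A", Status.ADDED),
            new Token(" ", Status.UNCHANGED), new Token("quick", Status.UNCHANGED),
            new Token(" ", Status.UNCHANGED),
            new Token("brown", Status.DELETED), new Token("movement", Status.ADDED));
    }

    public static void main(String[] args) {
        List<Token> raw = sample();
        List<Token> orphaned = duplicateUnchanged(raw, false); // orphaned word step
        List<Token> fixed = duplicateUnchanged(orphaned, true); // space fixup step
        System.out.println(render(raw));
        System.out.println(render(orphaned));
        System.out.println(render(fixed));
    }
}
```

In this model the orphaned word step on its own still leaves three separate changes, because the unchanged spaces keep them apart. Only after the spaces have also been duplicated does the whole phrase collapse into a single deleted chunk followed by a single added chunk, which mirrors the dependency between the two filters described above.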

3 Filter ordering

The order of the word by word filters is important. Because they convert each word, space and potentially punctuation character into an element that contains the word, they can dramatically increase the size of the XML document tree. Processing this large tree will then often consume much larger amounts of memory than before if the processing involves holding the entire tree in memory (as it does with XSLT filtering). For that reason, the word filters are written as Java streaming filters that do not load the tree into memory. They are also placed as close as possible on either side of the Java-implemented comparison stage.

3.1 Input Filters

The WordInfilter is typically used as the last input filter before the comparison on each input filter chain. If punctuation is required, it must be defined in the inputs by an earlier filter.

3.2 Output Filters

The minimal requirement for output filters is to use the WordOutfilter as the first output filter. This will leave changed text fragmented by the unchanged spaces in between the words. A better option is to use WordSpaceFixup as the first output filter, followed by the WordOutfilter. This will stop changed text being broken by unchanged spaces and, where there are no orphaned words, will gather together the added and deleted text into a continuous chunk of text.

If the orphaned word processing is required, OrphanedWordOutfilter should be used as the first output filter, followed by WordSpaceFixup and then WordOutfilter.

All other output filters (particularly XSLT filters) should go after the word filters.
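
The ordering rules in this section can be summarised in a short Java sketch. It is illustrative only: the filter names are handled as plain strings and no DeltaXML API is called. The class FilterOrderSketch, its methods, and the filter name some-xslt-filter.xsl are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: filter names are plain strings; no DeltaXML API is called. */
final class FilterOrderSketch {

    /** Input chain: earlier filters first (such as a punctuation definition), WordInfilter last. */
    static List<String> inputChain(List<String> earlierFilters) {
        List<String> chain = new ArrayList<>(earlierFilters);
        chain.add("WordInfilter");
        return chain;
    }

    /** Output chain: word filters first, in the order described above, then all other filters. */
    static List<String> outputChain(boolean orphanedWords, boolean spaceFixup,
                                    List<String> laterFilters) {
        List<String> chain = new ArrayList<>();
        if (orphanedWords) {
            chain.add("OrphanedWordOutfilter");
        }
        if (spaceFixup) {
            chain.add("WordSpaceFixup");
        }
        chain.add("WordOutfilter"); // always required to rebuild the text
        chain.addAll(laterFilters);
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(inputChain(Arrays.asList("define-punctuation.xsl")));
        // some-xslt-filter.xsl is a hypothetical later output filter.
        System.out.println(outputChain(true, true, Arrays.asList("some-xslt-filter.xsl")));
    }
}
```

Keeping the word filters adjacent to the comparison means that any XSLT filter in the chain sees ordinary text rather than one element per word.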

4 Configuring the Orphaned Words Filter

The OrphanedWordOutfilter has two configuration parameters that change its behaviour: orphanedLengthLimit and orphanedThresholdPercentage.

4.1 The orphanedLengthLimit Parameter

This parameter specifies the maximum number of consecutive unchanged words that could be treated as orphaned words. Its default value is 2 and it should be kept as a fairly small number to avoid all words being treated as orphans and the advantage of word by word comparison being lost.

4.2 The orphanedThresholdPercentage Parameter

This specifies a threshold that is used in a calculation that assesses the size of the unchanged word section in relation to the changed words on either side of it. The default value is 20, i.e. the unchanged words must account for less than 20 percent of the total count of changed and unchanged words. The calculated value that is compared against this percentage is:

unchanged words / (changed words before + unchanged words + changed words after) * 100

This calculated value must be less than the value of orphanedThresholdPercentage for the unchanged words to be treated as orphans.
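
The two parameters can be combined into a single check, sketched below in Java. This is a sketch of the rules as described in this section, not the filter's actual code: the class OrphanCheckSketch and its methods are hypothetical names. It assumes that the length limit is inclusive and that there must be changed words on both sides. How the filter counts changed words (for example, whether an added and a deleted word count separately) is not stated here, so the counts are simply passed in.

```java
/** Illustrative only: a sketch of the checks described above, not DeltaXML's code. */
final class OrphanCheckSketch {
    static final int DEFAULT_LENGTH_LIMIT = 2;             // orphanedLengthLimit default
    static final double DEFAULT_THRESHOLD_PERCENTAGE = 20; // orphanedThresholdPercentage default

    /** unchanged / (changed before + unchanged + changed after) * 100 */
    static double orphanPercentage(int changedBefore, int unchanged, int changedAfter) {
        int total = changedBefore + unchanged + changedAfter;
        return total == 0 ? 0.0 : 100.0 * unchanged / total;
    }

    static boolean isOrphaned(int changedBefore, int unchanged, int changedAfter,
                              int lengthLimit, double thresholdPercentage) {
        // Assumption: the run must sit between changes on both sides.
        if (changedBefore < 1 || changedAfter < 1) {
            return false;
        }
        // Assumption: the length limit is inclusive.
        if (unchanged < 1 || unchanged > lengthLimit) {
            return false;
        }
        // The calculated value must be strictly less than the threshold.
        return orphanPercentage(changedBefore, unchanged, changedAfter) < thresholdPercentage;
    }

    public static void main(String[] args) {
        // One unchanged word between 2 and 4 changed words: 1 / 7 * 100, about 14.3.
        System.out.println(isOrphaned(2, 1, 4, DEFAULT_LENGTH_LIMIT, DEFAULT_THRESHOLD_PERCENTAGE));
        // One unchanged word between 2 and 2 changed words: exactly 20, so not an orphan.
        System.out.println(isOrphaned(2, 1, 2, DEFAULT_LENGTH_LIMIT, DEFAULT_THRESHOLD_PERCENTAGE));
    }
}
```

With the default values, a single unchanged word therefore needs more than four changed words around it in total before it is treated as an orphan.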

5 Running the sample

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output files non-wbw-result.html, wbw-result.html, punctuation-result.html, orphaned-words-result.html and full-result.html.

ant run

If you don't have Ant installed, you can run the sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct slashes for your operating system).

java -jar ../../command.jar compare wbw input1.xml input2.xml non-wbw-result.html
java -jar ../../command.jar compare wbw input1.xml input2.xml wbw-result.html word-by-word=true
java -jar ../../command.jar compare wbw input1.xml input2.xml punctuation-result.html word-by-word=true punctuation=true
java -jar ../../command.jar compare wbw input1.xml input2.xml orphaned-words-result.html word-by-word=true punctuation=true orphaned-words=true
java -jar ../../command.jar compare wbw input1.xml input2.xml full-result.html word-by-word=true punctuation=true orphaned-words=true space-fixup=true

If you wish to see the XML delta file result rather than an HTML result for any of the samples, simply add the parameter 'convert-to-html=false' to the end of any of the commands.

To clean up the sample directory, run the following Ant command.

ant clean