Word by Word Text Comparison

1. Introduction

When comparing documents containing text, DeltaXML Core treats each block of text as a single node. This can lead to large amounts of change when in fact only certain words within the text have been changed. Consider the following document and the changes made to it:

Example 1: an XML document containing text (input1.xml in the sample directory)

<document>
  <para>A pangram uses all the letters of the alphabet. It is often used to test typewriters 
    or computer keyboards, for example: The quick brown fox jumps over the lazy dog.</para>
</document>

Example 2: a modified version of the document (input2.xml in the sample directory)

<document>
  <para>A pangram uses all the letters of the alphabet and is often used to test typewriters
    or computer keyboards, for example: A quick movement of the enemy will jeopardize six gunboats.</para>
</document>

If these inputs are compared as they are, we get the following result. (The actual delta file has been converted to a colour-coded result to make it easier to read.)

Example 3: the result of comparing the documents above

<document>
  <para>A pangram uses all the letters of the alphabet. It is often used to test typewriters or computer keyboards, 
        for example: The quick brown fox jumps over the lazy dog.A pangram uses all the letters of the alphabet and is 
        often used to test typewriters or computer keyboards, for example: A quick movement of the enemy will jeopardize
        six gunboats.</para>
</document>

The included sample code shows how the same comparison can be run using either the Pipelined Comparator or the Document Comparator. The description of specific filters here applies only to the Pipelined Comparator; the equivalent 'word-by-word' and 'orphaned word' features are built into the Document Comparator and are controlled via its API or a DCP XML configuration file. Because the Document Comparator has word-by-word comparison enabled by default, the sample disables it where necessary by applying a special 'disable-word-by-word.xsl' stylesheet, which adds a 'deltaxml:word-by-word="false"' attribute to the root element of the input files.

2. Comparing text word by word

While the result above is technically correct, it is not particularly useful for displaying what has actually changed. A much better approach would be to compare the text on a word by word basis.

2.1. Word by word filters

DeltaXML Core includes Java filters to split text into individual words before comparing them and also to convert the split words back into larger chunks of text. This allows the comparison to show only those words that have changed. These filters are com.deltaxml.pipe.filters.dx2.wbw.WordInfilter and com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter. The following result shows the effect they have on the sample data:

Example 4: the comparison result with word by word filters added

<document>
  <para>A pangram uses all the letters of the alphabet. It and is often used to test typewriters or computer 
        keyboards, for example: TheA quick brown fox jumps overmovement of the lazy dogenemy will jeopardize six gunboats.</para>
</document>
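To make the idea of a word-level comparison concrete, the following is a minimal sketch in plain Java. It does not use the DeltaXML filters and is not their algorithm: it assumes a simple longest-common-subsequence alignment over whitespace-separated words, and the names WordDiffSketch and diff are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a generic LCS-based word diff, not DeltaXML's comparison algorithm.
class WordDiffSketch {

    // Returns one entry per word: "  w" (unchanged), "- w" (deleted), "+ w" (added).
    static List<String> diff(String[] a, String[] b) {
        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk the table from the start, emitting unchanged, deleted and added words.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                out.add("  " + a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a[i++]);
            } else {
                out.add("+ " + b[j++]);
            }
        }
        while (i < a.length) out.add("- " + a[i++]);
        while (j < b.length) out.add("+ " + b[j++]);
        return out;
    }

    public static void main(String[] args) {
        String[] before = "The quick brown fox".split(" ");
        String[] after = "A quick red fox".split(" ");
        // Prints: "- The", "+ A", "  quick", "- brown", "+ red", "  fox" (one per line)
        diff(before, after).forEach(System.out::println);
    }
}
```

Only the words that differ are reported as added or deleted; the shared words 'quick' and 'fox' stay unchanged, which is the effect the word by word filters aim for.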

2.2. Word by word attributes

When word by word filters are in place, the default is for the word by word feature to affect all parts of the document. However, it is possible to specify which parts of a document are affected or unaffected through the use of a deltaxml:word-by-word attribute. This attribute has permitted values of true or false and may be attached to any element in an input document.

The deltaxml:word-by-word attribute affects the element on which it occurs. Any descendant elements are also affected unless they have a deltaxml:word-by-word attribute of their own, which overrides the inherited behaviour.
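The inheritance rule can be sketched as a small tree walk. This is an illustration using the standard Java DOM API, not the filters' implementation; the element names in the sample and the class name WordByWordScope are hypothetical, and the namespace URI bound to the deltaxml prefix is an assumption that should be checked against your own input files.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: resolves the effective word-by-word setting for each element.
class WordByWordScope {
    static final String ATTR = "deltaxml:word-by-word";

    // Hypothetical input; the namespace URI is an assumption.
    static final String SAMPLE =
        "<document xmlns:deltaxml=\"http://www.deltaxml.com/ns/well-formed-delta-v1\">"
        + "<para>Compared word by word.</para>"
        + "<listing deltaxml:word-by-word=\"false\">"
        + "<line>Compared as a single block of text.</line>"
        + "<note deltaxml:word-by-word=\"true\">Compared word by word again.</note>"
        + "</listing></document>";

    // Maps each element name to its effective setting (names are unique in SAMPLE).
    static Map<String, Boolean> effectiveSettings(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            Map<String, Boolean> result = new LinkedHashMap<>();
            walk(root, true, result); // default: word by word applies everywhere
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    private static void walk(Element e, boolean inherited, Map<String, Boolean> out) {
        // An explicit attribute overrides whatever was inherited from the ancestors.
        boolean effective = e.hasAttribute(ATTR)
                ? Boolean.parseBoolean(e.getAttribute(ATTR))
                : inherited;
        out.put(e.getTagName(), effective);
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) walk((Element) c, effective, out);
        }
    }

    public static void main(String[] args) {
        // Prints: {document=true, para=true, listing=false, line=false, note=true}
        System.out.println(effectiveSettings(SAMPLE));
    }
}
```

In this sample the line element inherits false from listing, while note switches word by word comparison back on for itself and anything below it.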

2.3. Orphaned Words

The next improvement to make is to the changed sentence at the end of the text. It is made up of fragmented added and deleted words, interspersed with the occasional unchanged word that happens to be common to the two sentences, and this fragmentation makes it hard to read. The OrphanedWordOutfilter is a post-processing filter that detects unchanged words in the middle of changes and duplicates each of them as an added and a deleted copy of the same word. In the example above, the words 'quick' and 'the' would be treated in this way. Applying this filter gives the following result:

Example 7: the comparison result with orphaned words detected

<document>
  <para>A pangram uses all the letters of the alphabet. It and is often used to test typewriters or computer keyboards, 
        for example: The quick brown fox jumps over the lazy dogA quick movement of the enemy will jeopardize six gunboats.</para>
</document>

This final result still shows the changes in detail but is also readable.
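The transformation described above can be sketched in plain Java as follows. This is a simplified illustration and not the OrphanedWordOutfilter itself: it assumes that any run of unchanged words with changes on both sides is orphaned when it is no longer than a given limit, it ignores the percentage threshold described later, and the names OrphanedWordSketch, Token and expandOrphans are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: turns short unchanged runs between changes into deleted + added copies.
class OrphanedWordSketch {
    enum Kind { UNCHANGED, ADDED, DELETED }

    static final class Token {
        final String text;
        final Kind kind;
        Token(String text, Kind kind) { this.text = text; this.kind = kind; }
    }

    static List<Token> expandOrphans(List<Token> in, int lengthLimit) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            if (in.get(i).kind != Kind.UNCHANGED) { out.add(in.get(i++)); continue; }
            int j = i; // find the end of this run of unchanged words
            while (j < in.size() && in.get(j).kind == Kind.UNCHANGED) j++;
            boolean surrounded = i > 0 && j < in.size(); // changed words on both sides
            boolean orphaned = surrounded && (j - i) <= lengthLimit;
            for (int k = i; k < j; k++) {
                Token t = in.get(k);
                if (orphaned) { // duplicate as a deleted copy and an added copy
                    out.add(new Token(t.text, Kind.DELETED));
                    out.add(new Token(t.text, Kind.ADDED));
                } else {
                    out.add(t);
                }
            }
            i = j;
        }
        return out;
    }

    // Joins the words of one kind, in order, separated by single spaces.
    static String textOf(List<Token> tokens, Kind kind) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (t.kind != kind) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.text);
        }
        return sb.toString();
    }

    private static void add(List<Token> list, Kind kind, String words) {
        for (String w : words.split(" ")) list.add(new Token(w, kind));
    }

    // The changed sentence from the sample, as a hand-written word-level result.
    static List<Token> sample() {
        List<Token> t = new ArrayList<>();
        add(t, Kind.DELETED, "The");
        add(t, Kind.ADDED, "A");
        add(t, Kind.UNCHANGED, "quick");
        add(t, Kind.DELETED, "brown fox jumps over");
        add(t, Kind.ADDED, "movement of");
        add(t, Kind.UNCHANGED, "the");
        add(t, Kind.DELETED, "lazy dog");
        add(t, Kind.ADDED, "enemy will jeopardize six gunboats");
        return t;
    }

    public static void main(String[] args) {
        List<Token> out = expandOrphans(sample(), 2);
        System.out.println("deleted: " + textOf(out, Kind.DELETED));
        System.out.println("added:   " + textOf(out, Kind.ADDED));
    }
}
```

With 'quick' and 'the' duplicated, the deleted words read as the complete old sentence and the added words as the complete new one, which is what makes the final result easier to follow.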

3. Filter ordering

The order of the word by word filters is important. Because they convert each word, space and potentially punctuation character into an element that contains the word, they can dramatically increase the size of the XML document tree. Processing this large tree will then often consume much larger amounts of memory than before if the processing involves holding the entire tree in memory (as it does with XSLT filtering). For that reason, the word filters are written as Java streaming filters that do not load the tree into memory. They are also placed as close as possible on either side of the Java-implemented comparison stage.
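To give a feel for how much the tree grows, here is a rough sketch that splits a piece of text into word, space and punctuation tokens. It is an assumption-laden illustration: the regular expression is a guess at a reasonable splitting rule rather than the rule the WordInfilter applies, and the class name WordSplitSketch is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows how one text node can become many tokens.
class WordSplitSketch {
    // A run of whitespace, a run of word characters, or a single other character
    // (treated here as punctuation). This splitting rule is an assumption.
    private static final Pattern TOKEN = Pattern.compile("\\s+|\\w+|[^\\s\\w]");

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        List<String> tokens = split(text);
        // One text node becomes 18 tokens: 9 words, 8 spaces and 1 full stop.
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

If every token is wrapped in its own element, a single sentence already produces 18 elements where there was one text node, so a filter that builds the whole tree in memory has far more nodes to hold.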

3.1. Input Filters

The WordInfilter is typically used as the last input filter before the comparison on either input filter chain. Punctuation needs to be defined in the inputs in a previous filter if required.

3.2. Output Filters

The minimal requirement for output filters is to use the WordOutfilter as the first output filter. This will highlight changes at the finer granularity and, where there are no orphaned words, will gather together the added and deleted text into a continuous chunk of text.

If the orphaned word processing is required, OrphanedWordOutfilter should be used as the first output filter, followed by WordOutfilter.

All other output filters (particularly XSLT filters) should go after the word filters.

4. Configuring the Orphaned Words Filter

The OrphanedWordOutfilter has two configuration parameters that change its behaviour: orphanedLengthLimit and orphanedThresholdPercentage.

4.1. The orphanedLengthLimit Parameter

This parameter specifies the maximum number of consecutive unchanged words that can be treated as orphaned words. Its default value is 2. It should be kept fairly small; otherwise most words end up being treated as orphans and the advantage of word by word comparison is lost.

4.2. The orphanedThresholdPercentage Parameter

This specifies a threshold that is used in a calculation that assesses the size of the unchanged word section in relation to the changed words either side of it. The default value is 20, i.e. the unchanged words must account for less than 20 percent of the total count of changed and unchanged words. The calculated value that is compared against this percentage is:

unchanged words / (changed words before + unchanged words + changed words after) * 100

This calculated value must be less than the value of orphanedThresholdPercentage for the unchanged words to be treated as orphans.
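The two parameters can be combined into a single check, sketched below in plain Java. This is an illustration of the rule as described in this section, not the filter's source: exactly how the filter counts the changed words on either side (for example, whether added and deleted words are counted separately) is not specified here, so the counts are simply passed in, and the names OrphanThresholdSketch and isOrphaned are invented.

```java
// Illustrative only: applies the length limit and percentage threshold described above.
class OrphanThresholdSketch {

    // unchanged words / (changed before + unchanged + changed after) * 100
    static double unchangedPercentage(int unchanged, int changedBefore, int changedAfter) {
        int total = changedBefore + unchanged + changedAfter;
        return total == 0 ? 0.0 : 100.0 * unchanged / total;
    }

    static boolean isOrphaned(int unchanged, int changedBefore, int changedAfter,
                              int lengthLimit, double thresholdPercentage) {
        return unchanged > 0
                && unchanged <= lengthLimit // at most orphanedLengthLimit words
                && unchangedPercentage(unchanged, changedBefore, changedAfter)
                        < thresholdPercentage; // strictly less than the threshold
    }

    public static void main(String[] args) {
        // 1 unchanged word between 2 and 6 changed words: 1 / 9 * 100 is about 11.1
        System.out.println(isOrphaned(1, 2, 6, 2, 20)); // true
        // 1 unchanged word between 2 and 2 changed words: exactly 20, not less than 20
        System.out.println(isOrphaned(1, 2, 2, 2, 20)); // false
        // 3 unchanged words exceed the default length limit of 2
        System.out.println(isOrphaned(3, 10, 10, 2, 20)); // false
    }
}
```

The second case shows why the comparison being strict matters: a run that accounts for exactly the threshold percentage is left as unchanged text.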

5. Running the sample

The sample code (in the samples/WordByWord directory of the DeltaXML Core release) illustrates three methods for performing word by word text comparison. The first method uses the Pipelined Comparator, and specifies input and output filters with a DXP file (identified by the wbw configuration id). The remaining two methods use the Document Comparator: the first of these uses a DCP file (configuration id dcp-wbw) to configure the comparator pipeline, whilst the second uses the Java API.

If you're using any Java version of DeltaXML Core and have Apache Ant installed, use the build script provided to run the complete sample by typing the following command. This will generate output for the three methods in the result directory.

ant run

If you don't have Ant installed, you can run the samples from a command line by issuing commands from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).

5.1. Pipelined Comparator (DXP)

To run just the Pipelined Comparator sample and produce the output files non-wbw-result.html, wbw-result.html and orphaned-words-result.html in the result/PipelinedComparator directory, type the following command.

ant run-dxp

When using any Java version of DeltaXML Core, the commands to run only the Pipelined Comparator sample code are as follows.

java -jar ../../command.jar compare wbw input1.xml input2.xml non-wbw-result.html
java -jar ../../command.jar compare wbw input1.xml input2.xml wbw-result.html word-by-word=true
java -jar ../../command.jar compare wbw input1.xml input2.xml orphaned-words-result.html word-by-word=true orphaned-words=true

If using the .NET version of DeltaXML Core, a batch file can be used. The command to run only the Pipelined Comparator sample is as follows.

run.bat

Alternatively, for the .NET version of DeltaXML Core, the commands to run only the Pipelined Comparator (DXP) sample are as follows.

..\..\bin\deltaxml.exe compare wbw input1.xml input2.xml non-wbw-result.html
..\..\bin\deltaxml.exe compare wbw input1.xml input2.xml wbw-result.html word-by-word=true
..\..\bin\deltaxml.exe compare wbw input1.xml input2.xml orphaned-words-result.html word-by-word=true orphaned-words=true

5.2. Document Comparator (DCP)

To run just the Document Comparator DCP sample and produce the output files non-wbw-result.html, wbw-result.html and orphaned-words-result.html in the result/DCP directory, type the following command.

ant run-dcp

The commands to run only the Document Comparator DCP sample code are the same as for the DXP sample above, but the 'dcp-wbw' configuration-id is used instead of 'wbw', for example:

java -jar ../../command.jar compare dcp-wbw input1.xml input2.xml non-wbw-result.html

If using the .NET version of DeltaXML Core, a batch file can be used. The command to run only the Document Comparator sample is as follows.

run-dcp.bat

Alternatively, for the .NET version of DeltaXML Core, the command to run only the DCP sample is as follows.

..\..\bin\deltaxml.exe compare dcp-wbw input1.xml input2.xml non-wbw-result.html

If you wish to see the XML delta file result rather than an HTML result for the Pipelined Comparator samples, simply add the parameter 'convert-to-html=false' to the end of any of the commands.

5.3. Document Comparator (Java API)

To run just the Document Comparator API sample and produce the output files non-wbw-result.html, wbw-result.html and orphaned-words-result.html in the result/DocumentComparator directory, type the following command.

ant run-dc

The commands to compile and run only the Document Comparator sample code are as follows.

mkdir bin
javac -cp bin:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/DocumentComparatorSample.java
java -cp bin:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin/com/deltaxml/samples/ com.deltaxml.samples.DocumentComparatorSample

If you wish to see the XML delta file result rather than an HTML result for the Comparator samples above, simply add the parameter 'convert-to-html=false' to the end of any of the commands.

To clean up the sample directory, run the following Ant command.

ant clean

5.4. Document Comparator (C# API)

We provide Visual Studio solution (.sln) files for the C# samples in the dotnet-api directory. The Visual Studio project (.csproj) files can be found within each of the three sample directories.