Character by Character Comparison

1. Introduction

This "how to" document discusses how you can perform character by character comparison, using the sample 'cbc-compare.xsl' post comparison output filter. Two practical solutions are presented here, with each using a different comparator and method for customising a comparison:

  • Pipelined Comparator (DXP) - Uses a filter pipeline defined by an XML file called a 'DXP' to customise the comparison.
  • Document Comparator - Uses Java API calls to customise a pre-existing pipeline with a number of extension points.

The document comprises three sections:

  1. Background: some background to the character by character processing, including why it is implemented as a post comparison output filter and at what stage in the output filtering it should be applied;
  2. Tutorial: a discussion on what results to expect from the character by character processing, how to make use of the code and how to run the sample; and
  3. Design: an overview of the character by character design, focusing on where it would be appropriate to add user specific character by character result adjustment filters.

Before continuing it is worth mentioning this sample's scope. Specifically it is designed to process modified deltaxml:textGroups and produce new deltaxml:textGroups that contain the character level comparison results. Other changes to the input files are not handled by this sample. It is expected that the character by character comparison code presented here will be incorporated into a general comparison pipeline, which handles other types of change.

2. Background

One significant problem with performing character by character comparison, is that it is relatively easy to accidentally align two completely different words or phrases which happen to share some characters. Previous experiments with adding character by character comparison as part of the main DeltaXML comparison have produced poor results. Therefore, for this sample we have taken a different approach. Here we perform the character by character comparison as a post comparison filter on regions of text that have already been aligned, i.e. any modified deltaxml:textGroup elements.

It would have been possible to choose to apply the character by character analysis to modified words, instead of text groups, but this would not enable simple word splitting and joining to be detected. Further, it would require the character by character comparison to happen before other word and block level filters had improved the alignment of the text to be compared. For example, the dx2-threshold.xsl filter separates out blocks of text that have been aligned by the main comparison but would be better viewed as separate blocks.

It is recommended that character by character comparison be performed once other filters that adjust the alignment of deltaxml:textGroups have been run. In general, this approach should reduce the number of modified text groups and increase the likelihood that these modifications contain similar text which can sensibly be character by character compared.

3. Tutorial

3.1. Expected results

The sample output file contains many examples of the output of character by character processing, along with an explanation of why those results were produced. For the purposes of this document, we will choose to discuss a few of these examples.

Consider the following two paragraphs:

  1. The quick brown fox jumps over the lazy dog
  2. The dirty brown foxes jump over the lazy hogs

The raw character by character comparison of these two paragraphs would produce:

  • The qudickrty brown foxes jumps over the lazy dhogs

Here the words 'quick' and 'dirty' have been aligned due to their sharing an 'i'. In order to prevent this type of result, the raw comparison is filtered. We have implemented a simple 'maximum number of changes' filter, which limits the number of changes (insertions and deletions) to two. The comparison between the words 'quick' and 'dirty' produces four changes (two insertions and two deletions), therefore these words are separated as illustrated by the following filter output:

  • The quickdirty brown foxes jumps over the lazy doghogs

The observant reader will have noticed that the words 'dog' and 'hogs' have also been judged to have too many changes, and thus are separated. This is because there are three changes (two insertions and one deletion), which is one more than is allowed by the 'max number of changes' filter. This is arguably inappropriate, but when further examples of character by character comparison are considered (such as those in the sample inputs/output) this measure generally produces better results than setting the tolerance to three changes. The sample output document also contains a discussion on the potential impact of replacing the max-change filter with a filter that uses the proportion of unchanged characters as a threshold to determine whether a region of text has too many changes.

Phrase comparison: The previous example illustrated how two strings that happened to be single words can be character by character compared. In general, the character by character comparison is going to be given two 'phrases' - sequences of characters that may contain several words. Therefore, it is possible to detect and represent word splitting and joining. For example:

  • Comparing 'in to' and 'into' results in 'in to'
  • Comparing 'into' and 'in to' results in 'in to'

When the phrases become longer, it is worth splitting then into 'sub-phrases' or 'text regions' that can be independently analyzed. This sample uses shared white-space characters to split the 'raw' output of the character by character comparison into 'text regions', so that the 'amount of change' filtering can be independently applied to each 'text' region, rather than the phrase as a whole. For example:

  • Comparing 'one green car' and 'two red cars' results in 'twonegreenred cars'

However, if you turn off the text region 'chunking' then the above phrase has five changes, which is higher than the two permitted, so the output would be filtered to:

  • Comparing 'one green car' and 'two red cars' results in 'one green cartwo red cars'

3.2. Using the pipelined comparator

The character by character comparison implemented by this sample requires several files in order to work. First it requires the 'cbc-compare.xsl' output filter, which exploits a (built-in) XSLT extension function to invoke an inner character by character comparison pipeline 'cbc-pipeline.dxp'; this in turn requires the 'dx2-red-green-outfilter.xsl' and all the filters in the 'cbc-xsl' directory.

Assuming that you have your own pipeline 'myPipeline.dxp' in a directory called 'myPipeline', then you would:

  1. copy the 'cbc-compare.xsl', 'cbc-pipeline.dxp', and the 'dx2-red-green-outfilter.xsl' files to the 'myPipeline' directory;
  2. recursively copy the 'cbc-xsl' directory to the directory to the 'myPipeline' directory;
  3. add the 'cbc-compare.xsl' filter to the 'myPipeline.dxp' file in a similar manner to which it is added to the 'cbc-compare.dxp' pipeline (in this sample).

3.3. Using the document comparator

The document comparator solution for this sample uses an outer pipeline with just one filter named 'cbc-compare-dc.xsl', this invokes the character by character comparison via a custom XSLT extension function implemented here in Java. The remaining .xsl files used are for the inner pipeline and are the same as for the inner pipeline described above for the pipelined comparator.

The main difference between this solution and that for the pipelined comparator is that the document comparator uses two .java source files 'CBCComparator.java' and 'CBCCompare.java' (instead of .dxp files) to define the outer and inner pipelines respectively.

N.B. If you wish to use the document comparator sample in your own code, you must first compile the two .java files (you can use 'ant compile' to do that) and then include the compiled .class files on your classpath. Failing to do this will result in ClassNotFoundException being thrown.

3.4. Running the sample (in Java)

The sample code illustrates two methods of performing character by character comparison. The first method uses the Pipelined Comparator, and specifies input and output filters with a DXP file. The second method uses the API exposed by the Document Comparator.

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output-dxp.xml and output-dc.xml files and overwrite the output-dxp.html and output.html files. Here, the output-dxp.xml and output-dc.xml files represent the result of the comparison, which has then been converted into HTML markup.

ant run

Pipelined Comparator

To run just the Pipelined Comparator sample and produce the output files

ant run-dxp

Document Comparator

To run just the Document Comparator sample and produce the output files

ant run-dc

Note that the HTML markup script is not formally part of the character by character sample, but is included to simplify the presentation of the sample results. It converts the deltaxml:textGroup additions and deletions into HTML insertions and deletions respectively, and removes all other DeltaXML markup. It is not intended to be used as a general purpose DeltaV2 to HTML filter, as discussed in the 'extra' directory's Read Me document.

Running the sample from a command line

If you don't have Ant installed, you can run the sample from a command line by issuing the following command from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).

Pipelined Comparator
java -jar ../../command.jar compare cbc input1.html input2.html output-dxp.xml
Document Comparator
mkdir bin
javac -cp bin:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/CBCCompare.java
javac -cp bin:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/CBCComparator.java
java -cp bin:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin/com/deltaxml/samples/ com.deltaxml.samples.CBCCompare inpu1.xml input2.xml output-dc.xml

Having produced the comparison result it may be worth adding the HTML markup, as discussed above. This can be achieved by issuing the following command.

java -jar ../../saxon9pe.jar -config:extra/saxon.config \
  -xsl:extra/text-groups-to-html.xsl -s:output.xml -o:output.html

To clean up the sample directory, run the following Ant command.

ant clean

3.5. Running the sample (in .NET)

The sample can be run via a run.bat batch file, so long as this is issued from the sample directory. Alternatively, as this batch file contains a single command, the command can be executed directly:

..\..\bin\deltaxml.exe compare cbc input1.html input2.html output.xml

Having produced the comparison result it may be worth adding the HTML markup, as discussed above. This can be achieved by processing this command with a simple XSLT-2 transformation script text-groups-to-html.xsl, which is located in the extra subdirectory of this sample.

4. Design

The character by character sample is built around the 'cbc-compare.xsl' script, which prepares the input for comparison, calls the character by character comparison pipeline 'cbc-pipeline.dxp' or compare method from 'CBCComparator.java', and then returns the content of the comparison as the result.

From the user perspective it is the 'cbc-pipeline.dxp'  pipeline or CBCComparator.java' that is intended to be modified to change the character by character comparison behavior. It runs the following post comparison filters:

  1. dx2-red-green-outfilter.xsl - to normalise the raw character-by-character output.
  2. cbc-xsl/chunking.xsl - to split the compared text into regions that are separated by shared whitespace.
  3. cbc-xsl/merge-adjacent-changes.xsl - to produce inserted, deleted, and unchanged strings.
  4. cbc-xsl/max-changes.xsl - to filter out changes that should not be represented at the character level.
  5. cbc-xsl/convert-to-deltaV2.xsl - to convert the result back into deltaxml:textGroups
  6. dx2-red-green-outfilter.xsl - to normalise the filtered character-by-character output

Note: only the chunking and max-change filters, (2 and 4 in the above list), are intended to be removed or updated.

The remainder of this section discuses removing the chunking filter, the purpose of the max-change filter, and how to adapt the max-change filter to use a percentage based threshold.

Removing the chunking scheme: In order to remove the chunking scheme the cbc-pipeline.dxp or CBCComparator.java files need to be modified, say by setting the default value of the chunking parameter to 'false'. Alternatively, both this parameter and the chunking filter can be removed from the pipeline. The remaining filters in the pipeline have been designed to work on either chunked or non-chunked comparison output.

Maximum number of changes mechanism: The cbc-xsl/max-changes.xsl filter currently determines whether the character by character comparison has produced too much change, by checking if the number of insertions and deletions within a 'region of text' (a chunk if chunking is enable, otherwise the whole phrase) is greater than two. In such cases, the text region is converted into a single deletion followed by a single insertion, as the character by character change has been deemed to be inappropriate at this point. Otherwise the text region is left unchanged.

Updating maximum number of changes filter: The decision algorithm of this filter is implemented by the following template.

<xsl:template match="*[delete or insert or unchanged]">
  <xsl:variable name="number-of-changes" select="count(*[self::insert or self::delete])" as="xs:integer"/>
  <xsl:variable name="max-changes" select="xs:integer($max-number-of-changes)" as="xs:integer"/>
  
  <xsl:choose>
    <xsl:when test="$number-of-changes gt $max-changes">
      <xsl:call-template name="convertToOneDeleteAndInsert" />
    </xsl:when>
    <xsl:otherwise>
      <xsl:apply-templates select="." mode="copy"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

In order to change the algorithm convert the first logical test to say when the algorithm has detected too much change. For example, the measure of an acceptable change could be converted to the percentage of changed text, using the formula:

100 * Sum(unchanged characters)/Sum(total characters)

In order to implement this change the previous XSL template could be adapted as follows (where the $max-number-of-changes variable is now interpreted as the threshold of change percentage).

<xsl:template match="*[delete or insert or unchanged]">
  <xsl:variable name="all-chars" select="sum(for $n in (node()) return string-length(string($n)))" as="xs:integer"/>
  <xsl:variable name="unchanged-chars" select="sum(for $n in (unchanged) return string-length(string($n)))" as="xs:integer"/>
  <xsl:variable name="threshold-percentage" select="xs:integer($max-number-of-changes)" as="xs:integer"/>
  
  <xsl:choose>
    <xsl:when test="$threshold-percentage gt (100*$unchanged-chars) idiv $all-chars">
      <xsl:call-template name="convertToOneDeleteAndInsert" />      
    </xsl:when>
    <xsl:otherwise>
      <xsl:apply-templates select="." mode="copy"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

Here the appropriate threshold (ratio between common and total characters) should be greater than 20%, otherwise the words 'bat' and 'car' will match and produce the following change 'bcatr' (100 * 1/5), but less than 33%, otherwise comparing 'use' and 'using' would result in the two words being repeated, rather than 'useing' (100 * 2/6). However, setting the threshold below 33% would mean that comparing the words 'overstated' and 'understand' would result in 'ovunderstatend' as this represents 43% (100 * 6/14) unchanged text content. Therefore this percentage measure may not provide a significant improvement over the simple counting mechanism.