This "how to" document discusses how you can perform character by character comparison, using the sample 'cbc-compare.xsl' post comparison output filter. Two practical solutions are presented here, with each using a different comparator and method for customising a comparison:
The document comprises three sections:
Before continuing it is worth mentioning this sample's scope. Specifically it is designed to
deltaxml:textGroups and produce new
deltaxml:textGroups that contain the character level comparison results. Other
changes to the input files are not handled by this sample. It is expected that the
character by character comparison code presented here will be incorporated into a general
comparison pipeline, which handles other types of change.
One significant problem with performing character by character comparison, is that it is
relatively easy to accidentally align two completely different words or phrases which happen
to share some characters. Previous experiments with adding character by character comparison
as part of the main DeltaXML comparison have produced poor results. Therefore, for this sample
we have taken a different approach. Here we perform the character by character comparison as a
post comparison filter on regions of text that have already been aligned, i.e. any modified
It would have been possible to choose to apply the character by character analysis to modified words, instead of text groups, but this would not enable simple word splitting and joining to be detected. Further, it would require the character by character comparison to happen before other word and block level filters had improved the alignment of the text to be compared. For example, the dx2-threshold.xsl filter separates out blocks of text that have been aligned by the main comparison but would be better viewed as separate blocks.
It is recommended that character by character comparison be performed once other filters that
adjust the alignment of
deltaxml:textGroups have been run. In general, this
approach should reduce the number of modified text groups and increase the likelihood that
these modifications contain similar text which can sensibly be character by character
The sample output file contains many examples of the output of character by character processing, along with an explanation of why those results were produced. For the purposes of this document, we will choose to discuss a few of these examples.
Consider the following two paragraphs:
The raw character by character comparison of these two paragraphs would produce:
Here the words 'quick' and 'dirty' have been aligned due to their sharing an 'i'. In order to prevent this type of result, the raw comparison is filtered. We have implemented a simple 'maximum number of changes' filter, which limits the number of changes (insertions and deletions) to two. The comparison between the words 'quick' and 'dirty' produces four changes (two insertions and two deletions), therefore these words are separated as illustrated by the following filter output:
The observant reader will have noticed that the words 'dog' and 'hogs' have also been judged to have too many changes, and thus are separated. This is because there are three changes (two insertions and one deletion), which is one more than is allowed by the 'max number of changes' filter. This is arguably inappropriate, but when further examples of character by character comparison are considered (such as those in the sample inputs/output) this measure generally produces better results than setting the tolerance to three changes. The sample output document also contains a discussion on the potential impact of replacing the max-change filter with a filter that uses the proportion of unchanged characters as a threshold to determine whether a region of text has too many changes.
Phrase comparison: The previous example illustrated how two strings that happened to be single words can be character by character compared. In general, the character by character comparison is going to be given two 'phrases' - sequences of characters that may contain several words. Therefore, it is possible to detect and represent word splitting and joining. For example:
When the phrases become longer, it is worth splitting then into 'sub-phrases' or 'text regions' that can be independently analyzed. This sample uses shared white-space characters to split the 'raw' output of the character by character comparison into 'text regions', so that the 'amount of change' filtering can be independently applied to each 'text' region, rather than the phrase as a whole. For example:
However, if you turn off the text region 'chunking' then the above phrase has five changes, which is higher than the two permitted, so the output would be filtered to:
The character by character comparison implemented by this sample requires several files in
order to work. First it requires the '
cbc-compare.xsl' output filter, which
exploits a (built-in) XSLT extension function to invoke an inner character by character
comparison pipeline '
cbc-pipeline.dxp'; this in turn requires the
dx2-red-green-outfilter.xsl' and all the filters in the '
Assuming that you have your own pipeline '
myPipeline.dxp' in a directory called
myPipeline', then you would:
cbc-pipeline.dxp', and the '
dx2-red-green-outfilter.xsl' files to the '
cbc-xsl' directory to the directory to the 'myPipeline' directory;
cbc-compare.xsl' filter to the '
myPipeline.dxp' file in a similar manner to which it is added to the '
cbc-compare.dxp' pipeline (in this sample).
The document comparator solution for this sample uses an outer pipeline with just one filter
cbc-compare-dc.xsl', this invokes the character by character comparison
via a custom XSLT extension function implemented here in Java. The remaining .xsl files used
are for the inner pipeline and are the same as for the inner pipeline described above for the
The main difference between this solution and that for the pipelined comparator is that the
document comparator uses two .java source files '
CBCCompare.java' (instead of .dxp files) to define the outer and inner
N.B. If you wish to use the document comparator sample in your own code, you must first compile the
two .java files (you can use '
ant compile' to do that) and then include the compiled
.class files on your classpath. Failing to do this will result in
The sample code illustrates two methods of performing character by character comparison. The first method uses the Pipelined Comparator, and specifies input and output filters with a DXP file. The second method uses the API exposed by the Document Comparator.
If you have Ant installed, use the build script provided to run the sample. Simply type the
following command to run the pipeline and produce the
output-dc.xml files and overwrite the
output.html files. Here, the
output-dc.xml files represent the result of the comparison, which has then been
converted into HTML markup.
To run just the Pipelined Comparator sample and produce the output files
To run just the Document Comparator sample and produce the output files
Note that the HTML markup script is not formally part of the character by character sample,
but is included to simplify the presentation of the sample results. It converts the
deltaxml:textGroup additions and deletions into HTML insertions and deletions
respectively, and removes all other DeltaXML markup. It is not intended to be used as a
general purpose DeltaV2 to HTML filter, as discussed in the '
Read Me document.
If you don't have Ant installed, you can run the sample from a command line by issuing the following command from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).
java -jar ../../command.jar compare cbc input1.html input2.html output-dxp.xml
mkdir bin javac -cp bin:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/CBCCompare.java javac -cp bin:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/CBCComparator.java java -cp bin:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin/com/deltaxml/samples/ com.deltaxml.samples.CBCCompare inpu1.xml input2.xml output-dc.xml
Having produced the comparison result it may be worth adding the HTML markup, as discussed above. This can be achieved by issuing the following command.
java -jar ../../saxon9pe.jar -config:extra/saxon.config \ -xsl:extra/text-groups-to-html.xsl -s:output.xml -o:output.html
To clean up the sample directory, run the following Ant command.
The sample can be run via a
run.bat batch file, so long as this is issued from
the sample directory. Alternatively, as this batch file contains a single command, the command
can be executed directly:
..\..\bin\deltaxml.exe compare cbc input1.html input2.html output.xml
Having produced the comparison result it may be worth adding the HTML markup, as discussed
above. This can be achieved by processing this command with a simple XSLT-2 transformation
text-groups-to-html.xsl, which is located in the
subdirectory of this sample.
The character by character sample is built around the '
which prepares the input for comparison, calls the character by character comparison pipeline
cbc-pipeline.dxp' or compare method from '
and then returns the content of the comparison as the result.
From the user perspective it is the '
cbc-pipeline.dxp' pipeline or
CBCComparator.java' that is intended to be modified to change the character by
character comparison behavior. It runs the following post comparison filters:
dx2-red-green-outfilter.xsl- to normalise the raw character-by-character output.
cbc-xsl/chunking.xsl- to split the compared text into regions that are separated by shared whitespace.
cbc-xsl/merge-adjacent-changes.xsl- to produce inserted, deleted, and unchanged strings.
cbc-xsl/max-changes.xsl- to filter out changes that should not be represented at the character level.
cbc-xsl/convert-to-deltaV2.xsl- to convert the result back into
dx2-red-green-outfilter.xsl- to normalise the filtered character-by-character output
Note: only the chunking and max-change filters, (2 and 4 in the above list), are intended to be removed or updated.
The remainder of this section discuses removing the chunking filter, the purpose of the max-change filter, and how to adapt the max-change filter to use a percentage based threshold.
Removing the chunking scheme: In order to remove the chunking scheme the
CBCComparator.java files need to be modified, say by setting the default value of the chunking parameter to
'false'. Alternatively, both this parameter and the chunking filter can be removed from the
pipeline. The remaining filters in the pipeline have been designed to work on either chunked
or non-chunked comparison output.
Maximum number of changes mechanism: The
cbc-xsl/max-changes.xsl filter currently determines whether the character by character comparison has produced
too much change, by checking if the number of insertions and deletions within a 'region of
text' (a chunk if chunking is enable, otherwise the whole phrase) is greater than two. In such
cases, the text region is converted into a single deletion followed by a single insertion, as
the character by character change has been deemed to be inappropriate at this point. Otherwise
the text region is left unchanged.
Updating maximum number of changes filter: The decision algorithm of this filter is implemented by the following template.
<xsl:template match="*[delete or insert or unchanged]"> <xsl:variable name="number-of-changes" select="count(*[self::insert or self::delete])" as="xs:integer"/> <xsl:variable name="max-changes" select="xs:integer($max-number-of-changes)" as="xs:integer"/> <xsl:choose> <xsl:when test="$number-of-changes gt $max-changes"> <xsl:call-template name="convertToOneDeleteAndInsert" /> </xsl:when> <xsl:otherwise> <xsl:apply-templates select="." mode="copy"/> </xsl:otherwise> </xsl:choose> </xsl:template>
In order to change the algorithm convert the first logical test to say when the algorithm has detected too much change. For example, the measure of an acceptable change could be converted to the percentage of changed text, using the formula:
100 * Sum(unchanged characters)/Sum(total characters)
In order to implement this change the previous XSL template could be adapted as follows
$max-number-of-changes variable is now interpreted as the threshold of
<xsl:template match="*[delete or insert or unchanged]"> <xsl:variable name="all-chars" select="sum(for $n in (node()) return string-length(string($n)))" as="xs:integer"/> <xsl:variable name="unchanged-chars" select="sum(for $n in (unchanged) return string-length(string($n)))" as="xs:integer"/> <xsl:variable name="threshold-percentage" select="xs:integer($max-number-of-changes)" as="xs:integer"/> <xsl:choose> <xsl:when test="$threshold-percentage gt (100*$unchanged-chars) idiv $all-chars"> <xsl:call-template name="convertToOneDeleteAndInsert" /> </xsl:when> <xsl:otherwise> <xsl:apply-templates select="." mode="copy"/> </xsl:otherwise> </xsl:choose> </xsl:template>
Here the appropriate threshold (ratio between common and total characters) should be greater
than 20%, otherwise the words 'bat' and 'car' will match and produce the following change
ba t' (100 * 1/5), but less than 33%, otherwise comparing 'use' and 'using' would
result in the two words being repeated, rather than 'us e' (100 * 2/6). However, setting the threshold below 33% would mean that
comparing the words 'overstated' and 'understand' would result in ' oversta ted' as this represents 43% (100 * 6/14) unchanged text content. Therefore this
percentage measure may not provide a significant improvement over the simple counting