Image and Binary Comparison

1. Introduction

Many document formats provide mechanisms to refer to or include images, and possibly other forms of binary content, in documents. A typical example is the image element in HTML, where the src attribute is typically a URI that refers to an image. Many applications and publishing processes that handle documentation formats support a range of image types, for example .gif, .jpg and .png.

1.1. Referencing

Most formats also support both relative and absolute referencing. For example, it is possible to refer to an image using a relative URI:

<image src="diagram1.png"/>

In this case the location of the image is relative to the location of the file containing the image element and src URI.

It is also possible to use an absolute URI. These come in various forms or 'schemes', for example:

<image src="http://www.company.com/images/diagram2.gif" alt="flow diagram"/>

or

<image src="file:///c:/Users/Joe/images/diagram3.png" alt="flow diagram"/>

2. Comparing Images

Looking solely at the XML files, it is possible to determine when the value of an attribute has changed. However, a comparison can do more that is useful to its user when it can:

  1. access a filesystem or similar hierarchical store to determine or resolve file locations, and then
  2. read the files to analyze their contents

With filesystem and similar access it is possible to resolve a relative URI into an absolute URI.

Consider the following example:

  • file1.html is located inside the /Users/Joe/Documents directory and contains the following image reference:
    <image src="images/pic1.png"/>
  • file2.html is located in the same directory, but contains
    <image src="../Documents/images/pic1.png"/>

If you are given just the two files without any knowledge of their location, it's only possible to say that the src attribute has changed. However, with the knowledge of where the files are located (Joe's Documents directory) it is possible to resolve the URIs and determine that both src attributes are actually referring to the same file.
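The resolution step in this example can be sketched with java.net.URI. This is a minimal illustration only, not the sample's own code: the class and method names are hypothetical, and the file: URIs are assumed from the example above.

```java
import java.net.URI;

/** Sketch: resolve each src value against the URI of the file that contains it. */
class ResolveImageReference {

    /** Resolves a (possibly relative) src value and removes "." and ".." segments. */
    static URI resolve(String documentUri, String src) {
        return URI.create(documentUri).resolve(src).normalize();
    }

    public static void main(String[] args) {
        // Both documents are assumed to live in /Users/Joe/Documents.
        URI a = resolve("file:///Users/Joe/Documents/file1.html", "images/pic1.png");
        URI b = resolve("file:///Users/Joe/Documents/file2.html", "../Documents/images/pic1.png");

        System.out.println(a.getPath());
        System.out.println(b.getPath());
        // The attribute values differ, but the resolved targets are equal.
        System.out.println("same target: " + a.equals(b));
    }
}
```

Comparing the resolved URIs rather than the raw attribute strings is what allows the two different src values to be recognised as references to the same file.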

The above example demonstrates that a change to an image attribute does not necessarily imply a change to the image. The converse is also true: the attribute value can be unchanged while the image has changed. This can occur, for example, where the two XML input files are stored in different locations in the tree (not the same directory) and each has its own associated images, referenced with local relative references.

To summarize our processing approach:

  1. If we don't have access to the filesystem or navigation tree, we can only compare attribute values.
  2. When we do have tree access, we resolve the references relative to the base URIs of the two input files.
    1. When the references resolve to the same location, we know the image is the same at that point. The comparison result can contain either the absolute reference or one of the two input relative references, with the proviso that when a relative reference is used the result file must be located in the tree such that the reference still works.
    2. If the references resolve to different locations, the images could be identical copies or they could be different, so we perform a byte-by-byte comparison of the images. If every byte is identical, the images are identical and we only need to provide one of them in the result. If they differ, or if we cannot fully compare them byte-wise, we report them as changed and provide both image elements in the result (one marked with an A, or deleted, delta and the other marked with a B, or added, delta).
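The decision steps above can be sketched in Java as follows. This is an illustrative outline under stated assumptions, not the filter shipped with the sample: the class, the Result enum and the bytesIdentical hook are all hypothetical names, and a null base URI is used here to stand for "no tree access".

```java
import java.net.URI;
import java.util.function.BiPredicate;

/** Sketch of the processing approach; all names here are hypothetical. */
class ImageChangeDecision {

    enum Result { UNCHANGED, CHANGED }

    /**
     * @param baseA          base URI of input A, or null when its location is unknown
     * @param baseB          base URI of input B, or null when its location is unknown
     * @param bytesIdentical hypothetical hook that performs the byte-by-byte check
     */
    static Result decide(URI baseA, String srcA, URI baseB, String srcB,
                         BiPredicate<URI, URI> bytesIdentical) {
        // Step 1: without tree access only the attribute values can be compared.
        if (baseA == null || baseB == null) {
            return srcA.equals(srcB) ? Result.UNCHANGED : Result.CHANGED;
        }
        // Step 2: resolve both references against their own base.
        URI a = baseA.resolve(srcA).normalize();
        URI b = baseB.resolve(srcB).normalize();
        // Step 2.1: same location means the same image.
        if (a.equals(b)) {
            return Result.UNCHANGED;
        }
        // Step 2.2: different locations, so only a full byte match counts as unchanged.
        return bytesIdentical.test(a, b) ? Result.UNCHANGED : Result.CHANGED;
    }

    public static void main(String[] args) {
        URI base1 = URI.create("file:///Users/Joe/Documents/file1.html");
        URI base2 = URI.create("file:///Users/Joe/Documents/file2.html");
        // The byte check is stubbed to "not identical"; it is never reached here.
        System.out.println(decide(base1, "images/pic1.png",
                                  base2, "../Documents/images/pic1.png",
                                  (x, y) -> false));
    }
}
```

Passing the byte comparison in as a parameter keeps the decision logic separate from file access, so the conservative default (report a change) stays in one place.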

We have tried to provide a conservative implementation, in that we always assume change unless we can be absolutely certain that the images or other binary content are identical. At the same time we would like an optimal and fast implementation. Here are some implementation notes:

  • If we have file system access we can ask for the sizes of the files (without reading their entire contents); if the sizes differ we report the files as different without reading their content.
  • If there are any failures in the process (file permissions etc.) we assume the worst, namely that the files differ.
  • The byte-by-byte comparison extension function is fail-fast: it reports not-equal as soon as the first differing byte is found. Correspondingly, it can only report equal once the last bytes of both files have been read.
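The three notes above could be combined along the following lines. This is a hedged sketch using java.nio.file, not the extension function provided with the sample; the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of a conservative, fail-fast binary comparison; names are hypothetical. */
class BinaryCompare {

    /** Returns true only when both files were read to the end and every byte matched. */
    static boolean identical(Path a, Path b) {
        try {
            // Cheap pre-check: files of different sizes cannot be identical.
            if (Files.size(a) != Files.size(b)) {
                return false;
            }
            try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
                 InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
                int byteA;
                do {
                    byteA = inA.read();
                    int byteB = inB.read();
                    if (byteA != byteB) {
                        return false; // fail fast on the first differing byte
                    }
                } while (byteA != -1); // -1 on both streams means both ended together
                return true;
            }
        } catch (IOException | SecurityException e) {
            // Any failure (missing file, permissions, ...) is treated as "different".
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java BinaryCompare.java <fileA> <fileB>");
            return;
        }
        System.out.println(identical(Path.of(args[0]), Path.of(args[1])));
    }
}
```

Returning false from the catch block is what makes the sketch conservative: an unreadable or missing file can never lead to two images being reported as identical.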

One final aspect of the image comparison process that should be considered is how the base URI (or xml:base) of the two input files is determined so that the relative references can be resolved. Where the compare function uses java.io.File, String/URI or similar inputs, the code has, or can easily determine, the URI or systemId of the inputs. When other forms of input are used there are often ways of providing a systemId (e.g. javax.xml.transform.stream.StreamSource#setSystemId). The sample pipeline uses an input filter (add-xml-base.xsl) to determine the base and add it to the root element of the input using the xml:base attribute. This is done for both input files, and the filter should be as early in the input filter chain as possible. It is also possible to provide information about the base URIs of the input files to the output filter which handles the image processing.
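As an illustration of supplying a systemId for a stream-based input, the following sketch uses the standard JAXP StreamSource class. The wrapper class, the helper method and the example URI are assumptions made for the illustration; how the resulting source is passed to a comparator is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.stream.StreamSource;

/** Sketch: attach a base location to an input that is only available as a stream. */
class SystemIdExample {

    /** Wraps the stream and records where its content notionally lives. */
    static StreamSource sourceWithBase(InputStream in, String systemId) {
        StreamSource source = new StreamSource(in);
        source.setSystemId(systemId); // relative references can be resolved against this
        return source;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "<html/>".getBytes(StandardCharsets.UTF_8));
        // The URI below is a made-up location used only for this example.
        StreamSource source =
                sourceWithBase(in, "file:///Users/Joe/Documents/file1.html");
        System.out.println(source.getSystemId());
    }
}
```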

3. The test cases

For this core sample we have needed to use a slightly more complex structure than some of our other samples. We needed to create a test case (test2) where the input files are located in different subdirectories. The test cases are written in XHTML so that the results and inputs can be easily viewed in a browser.

There is one further aspect to test2 that is worth considering. The two input files that are used in the comparison are actually identical byte-for-byte copies of one another. The differences that appear in the result come about because of differences in the associated referenced files.

4. Running the sample

The sample code illustrates two methods of performing image comparison. The first method uses the Pipelined Comparator, and specifies input and output filters with a DXP file. The second method uses the API exposed by the Document Comparator.

If you have Ant installed, use the build script provided to run the sample. This will generate output for the two methods in the results directories. Type the following command to run both methods:

ant run

To run just the Pipelined Comparator sample and produce the output files in the PipelinedComparatorResult directory, type the following command.

ant run-dxp

This Ant build script will generate a jar file that acts as a kind of extension or 'plugin' to command.jar. Please examine the script to see how this mechanism works. It is particularly useful when DXP pipelines make use of Java code as either filters or XSLT extension functions, and it allows the simpler 'java -jar' invocation to be used.

To run just the Document Comparator API sample and produce the output files in the DocumentComparatorResult directory, type the following command.

ant run-dc

If you don't have Ant installed, you can run the Document Comparator API sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).

javac -cp class:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/ImageCompare.java
java -cp class:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin com.deltaxml.samples.ImageCompare test1/f1.xhtml test1/f2.xhtml test1/DocumentComparatorResult/f1-f2-result.html
java -cp class:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin com.deltaxml.samples.ImageCompare test2/doc1.xhtml test2/doc2.xhtml test2/DocumentComparatorResult/doc1-doc2-result.html

To clean up the sample directory, run the following command in Ant.

ant clean

5. Applying the sample to other formats and data

This sample for XHTML has an output filter that uses two templates to match images with unchanged and with modified source attributes:

match="xhtml:image[@src]" ...
match="xhtml:image[deltaxml:attributes/dxa:src[@deltaxml:deltaV2='A!=B']]" ...

The templates must match the element containing the attribute so that the element and its other attributes can be duplicated when there is an image change. To support other formats, we recommend modifying the xhtml-binary-image-compare.xsl filter by changing the match patterns on both major templates to include any new image-related elements and attributes in a consistent way.