Many document formats provide mechanisms to refer to or include images and possibly other
forms of binary content in documents. A typical example is the image element in HTML, where the
src attribute is usually a URI that refers to an image. Many applications or publishing
processes which handle documentation formats support a range of image types, examples
Most formats also support both relative and absolute references. An image can be referred to using a relative URI, for example:
In this case the location of the image is relative to the location of the file containing the image element and src URI.
It is also possible to use an absolute URI. Absolute URIs come in various forms or 'schemes', for example:
<image src="http://www.company.com/images/diagram2.gif" alt="flow diagram"/>
<image src="file:///c:/Users/Joe/images/diagram3.png" alt="flow diagram"/>
Looking solely at the XML files, it is possible to determine when the value of an attribute has changed. However, further operations that would be useful to a comparison user become possible when you can:
With filesystem and similar access it is possible to resolve a relative URI into an absolute URI.
Consider the following example:
file1.html is located inside the /Users/Joe/Documents directory and contains the following image reference:
file2.html is located in the same directory, but contains:
If you are given just the two files without any knowledge of their location, it is only possible to say that the src attribute has changed. However, with the knowledge of where the files are located (Joe's Documents directory) it is possible to resolve the URIs and determine that the src attributes are actually referring to the same file.
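The resolution step described above can be sketched with the standard java.net.URI class. This is a minimal illustration rather than the code used by the sample: the class name ResolveImageUri and the two src values are hypothetical, chosen only so that the attribute values differ textually while pointing at the same file.

```java
import java.net.URI;

class ResolveImageUri {

    /** Resolves an image reference against the URI of the document that contains it. */
    static URI resolve(String documentUri, String src) {
        // An absolute src is returned as given; a relative src is resolved
        // against the document URI. normalize() removes "." and ".." segments.
        return URI.create(documentUri).resolve(src).normalize();
    }

    public static void main(String[] args) {
        String file1 = "file:///Users/Joe/Documents/file1.html";
        String file2 = "file:///Users/Joe/Documents/file2.html";

        // Hypothetical attribute values: different as strings...
        URI a = resolve(file1, "images/diagram1.png");
        URI b = resolve(file2, "../Documents/images/diagram1.png");

        // ...but equal once resolved against the document locations.
        System.out.println(a.equals(b));
    }
}
```

Comparing the resolved URIs only shows that both references name the same location. It says nothing about whether the content at that location is identical between two different locations, which is a separate check.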
The above example demonstrates that a change to the image attribute does not necessarily imply a change to the image. The converse is also true: it is possible to have an unchanged attribute value where the image does change. This can occur, for example, where the two XML input files are stored in different locations in the tree (not the same directory) and each has its associated images with local relative references.
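The converse case can be illustrated with a small sketch that compares the referenced content rather than the attribute value. It is written under stated assumptions and is not the sample's own implementation: the class BinaryContentCheck, the method certainlyIdentical and the file name images/logo.png are all hypothetical. Any read failure is treated as a change, so the check errs on the side of reporting a difference.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class BinaryContentCheck {

    /** Returns true only when both files could be read and are byte-for-byte equal. */
    static boolean certainlyIdentical(Path a, Path b) {
        try {
            // Cheap check first: files of different length cannot be identical.
            if (Files.size(a) != Files.size(b)) {
                return false;
            }
            return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
        } catch (IOException e) {
            // Conservative: missing or unreadable content is reported as changed.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Two hypothetical document directories, each with its own images/logo.png.
        Path dirA = Files.createTempDirectory("docA");
        Path dirB = Files.createTempDirectory("docB");
        String src = "images/logo.png"; // the same attribute value in both documents
        Files.createDirectories(dirA.resolve("images"));
        Files.createDirectories(dirB.resolve("images"));
        Files.write(dirA.resolve(src), new byte[] {1, 2, 3});
        Files.write(dirB.resolve(src), new byte[] {1, 2, 4});

        // The attribute is unchanged, yet the referenced content differs.
        System.out.println(certainlyIdentical(dirA.resolve(src), dirB.resolve(src)));
    }
}
```

Reading both files fully into memory keeps the sketch short. For large binary content a streamed comparison would be a reasonable alternative.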
To summarize our processing approach:
We have tried to provide a conservative implementation, in that we always assume change unless we can be absolutely certain that the images or other binary content are identical. At the same time we would like an optimal and fast implementation. Here are some implementation notes:
One final aspect of the image comparison process that should be considered is how the base URI (xml:base) of the two input files is determined so that the relative references can be resolved. Where the compare function uses java.io.File, String/URI or similar inputs, the code has or can easily determine the URI or systemId of the inputs. When other forms of input are used there are often ways of providing a systemId (e.g. javax.xml.transform.stream.StreamSource#setSystemId). The sample pipeline uses an input filter to determine the base, and adds it to the root element of the input using the xml:base attribute. This is done for both input files, and the filter should be as early in the input filter chain as possible. It is also possible to provide information about the base URIs of the input files to the output filter which is handling the image processing.
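As a rough illustration of why the systemId matters, the following sketch shows a stream-based input that cannot resolve a relative reference until a systemId has been supplied. The class BaseUriFromSystemId, the method resolveImage and the document location are hypothetical; only StreamSource, its setSystemId and getSystemId methods, and java.net.URI are standard JDK APIs. It does not show the sample's input filter or the xml:base attribute it adds.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.transform.stream.StreamSource;

class BaseUriFromSystemId {

    /** Resolves src against the input's systemId, or returns empty when no base is known. */
    static Optional<URI> resolveImage(StreamSource input, String src) {
        URI ref = URI.create(src);
        if (ref.isAbsolute()) {
            return Optional.of(ref); // an absolute reference needs no base
        }
        String systemId = input.getSystemId();
        if (systemId == null) {
            return Optional.empty(); // a stream without a systemId has no location
        }
        return Optional.of(URI.create(systemId).resolve(ref));
    }

    public static void main(String[] args) {
        byte[] xml = "<html/>".getBytes(StandardCharsets.UTF_8);
        StreamSource input = new StreamSource(new ByteArrayInputStream(xml));

        // No systemId yet, so the relative reference cannot be resolved.
        System.out.println(resolveImage(input, "images/logo.png").isPresent());

        // Hypothetical location of the input document.
        input.setSystemId("file:///docs/test2/dirA/doc1.xhtml");
        System.out.println(resolveImage(input, "images/logo.png").isPresent());
    }
}
```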
For this core sample we have needed to use a slightly more complex structure than some of our other samples. We needed to create a test case (test2) where the input files are located in different subdirectories. The test cases are written in XHTML so that the results and inputs can be easily viewed in a browser.
There is one further aspect to test2 that is worth considering. The two input files that are used in the comparison are identical byte-for-byte copies of one another. The differences that appear in the result come about because of differences in the associated referenced files.
The sample code illustrates two methods of performing image comparison. The first method uses the Pipelined Comparator, and specifies input and output filters with a DXP file. The second method uses the API exposed by the Document Comparator.
If you have Ant installed, use the build script provided to run the sample. This will generate output for the two methods in the results directory. Simply type the following command to run the comparison pipeline:
To run just the Pipelined Comparator sample and produce the output files in the PipelinedComparatorResult directory, type the following command.
This Ant build script will generate a jar file that acts as a kind of extension or 'plugin' to command.jar. Please examine the script to see how this mechanism works. It is particularly useful when DXP pipelines make use of Java code as either filters or XSLT extension functions, and it allows the simpler 'java -jar' invocation to be used.
To run just the Document Comparator API sample and produce the output files in the DocumentComparatorResult directory, type the following command.
If you don't have Ant installed, you can run the Document Comparator API sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct directory and class path separators for your operating system).
javac -cp class:../../deltaxml.jar:../../saxon9pe.jar -d bin ./src/java/com/deltaxml/samples/ImageCompare.java
java -cp class:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin/com/deltaxml/samples/ com.deltaxml.samples.ImageCompare test1/f1.xhtml test1/f2.xhtml test1/DocumentComparatorResult/f1-f2-result.html
java -cp class:../../deltaxml.jar:../../saxon9pe.jar:../../icu4j.jar:../../resolver.jar:./bin/com/deltaxml/samples/ com.deltaxml.samples.ImageCompare test2/doc1.xhtml test2/doc2.xhtml test2/DocumentComparatorResult/doc1-doc2-result.html
To clean up the sample directory, run the following command in Ant.
This sample for xhtml has an output filter that uses two templates to match images both with unchanged and modified source attributes:
match="xhtml:image[@src]" ...
match="xhtml:image[deltaxml:attributes/dxa:src[@deltaxml:deltaV2='A!=B']]" ...
It is a requirement to match the element containing the attribute so that it and its other attributes can be duplicated when there is an image change. To support other image elements, we recommend modifying the xhtml-binary-image-compare.xsl filter by changing the match statements on both major templates to include any new image-related elements and attributes in a consistent way.