The Document Comparator is a top-level component in DeltaXML Core, first introduced in version 7.0. It provides a set of document comparison features that can be customized to meet your specific needs. Designed for integration with other systems, the Document Comparator is accessed through the DocumentComparator class in the Java and .NET APIs for DeltaXML Core. A DocumentComparator instance can also be created from DCP XML via the DCPConfiguration class.
This guide serves as an overview of the Document Comparator. It explains when you should use it, its key capabilities, how it can be configured and extended, how comparisons are run, and how the comparison result can be used.
The DeltaXML Core User Guide provides more general information on DeltaXML Core as a whole. It covers subjects such as comparison concepts, system configuration, and licensing. More detailed information on specific Document Comparator features and how they can be customized can be found in the respective Java API documentation and .NET API documentation. The DCP Schema Guide and DCP User Guide together describe how the Document Comparator Pipeline (DCP) XML format is used to customize the Document Comparator.
Besides the Document Comparator, the other main component in the DeltaXML Core API is the Pipelined Comparator.
The Pipelined Comparator is a flexible XML comparison toolkit. It provides an XML comparator placed within a pipeline to which XSLT or Java filters can be added at any point within the input or output streams. A set of documented filters is supplied for use in the pipeline.
The Document Comparator is an extension of the Pipelined Comparator designed specifically for comparing documents. As a more specialist solution, it therefore incorporates many filters directly into the pipeline that would otherwise have to be added manually. It also allows for more sophisticated capabilities to be exposed through a simple API. Considerable extensibility is still available should it be required, using DCP or through the API, but the default settings are tailored for document use so that the user is freed from having to understand all the settings.
If the XML you're comparing is predominantly document-based then the Document Comparator would in most cases be preferred to the Pipelined Comparator. That said, the flexibility of the Pipelined Comparator may still offer benefits if you have experience and resources available from previous projects based on this component.
While XML is used for conveying information for a range of diverse purposes, the Document Comparator is optimized specifically for XML content that is predominantly narrative in nature, such as an article or book. To understand why a specialist comparison is needed for this type of XML, it's helpful to first look at the different characteristics of XML elements typically encountered in a document:
The list of document element characteristics above is grouped according to whether they're related to the structure or content of a document. A document comparison can be customized so that these characteristics are identified for the elements of a specific document type, thereby improving the effectiveness of the result.
Comparisons using the Document Comparator can be invoked either from the command line or GUI using DCP, or by using the .NET or Java API via the compare method of a DocumentComparator object instance. This method is overloaded to cater for a wide range of use cases.
Before invoking the compare method, the DocumentComparator instance is configured for the specific comparison required. Options can either be set on the DocumentComparator object directly using Java, or through a DCP XML file using a DCPConfiguration object.
Using Java Only
    // inhibit 'Cannot find CatalogManager.properties' warning
    System.setProperty("xml.catalog.files", "");

    // create DocumentComparator instance
    DocumentComparator dc = new DocumentComparator();

    // setting a readability option
    dc.getResultReadabilityOptions().setElementSplittingEnabled(false);

    // initialize files
    File f1 = new File("input/file1.xml");
    File f2 = new File("input/file2.xml");
    File result = new File("output/result.xml");

    // invoke comparison
    dc.compare(f1, f2, result);
Using Java with DCP XML for configuration
    // inhibit 'Cannot find CatalogManager.properties' warning
    System.setProperty("xml.catalog.files", "");

    // reference an existing DCP configuration file
    File dcpFile = new File("configurations/standard-1.dcp");

    // initialize configured DocumentComparator instance
    DCPConfiguration dcpConfig = new DCPConfiguration(dcpFile);
    dcpConfig.generate();

    // initialize files
    File f1 = new File("input/file1.xml");
    File f2 = new File("input/file2.xml");
    File result = new File("output/result.xml");

    // invoke comparison
    dcpConfig.getDocumentComparator().compare(f1, f2, result);
Under certain conditions document comparisons can take a while to complete. In such cases, the end-user can be informed of progress using a 'listener'. A listener is an instantiated object that implements the DocumentProgressListener interface; it can use the data passed in call-back methods from the DocumentComparator to relay progress back to the user. The listener is then associated with the DocumentComparator instance.
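The call-back idea can be illustrated with a small self-contained sketch. The names here (ProgressListener, onProgress, SlowComparison, setListener) are hypothetical stand-ins invented for this illustration; they are not the DeltaXML API, and the actual methods of DocumentProgressListener should be taken from the API documentation.

```java
import java.util.ArrayList;
import java.util.List;

class ProgressListenerSketch {

    /** Hypothetical call-back interface; NOT the real DocumentProgressListener. */
    @FunctionalInterface
    interface ProgressListener {
        void onProgress(String stage, int percentComplete);
    }

    /** Hypothetical long-running task that reports each stage to its listener. */
    static final class SlowComparison {
        private ProgressListener listener = (stage, pct) -> { };  // no-op by default

        void setListener(ProgressListener listener) {
            this.listener = listener;
        }

        void run() {
            String[] stages = {"load inputs", "compare", "write result"};
            for (int i = 0; i < stages.length; i++) {
                // ... the real work for this stage would happen here ...
                listener.onProgress(stages[i], (i + 1) * 100 / stages.length);
            }
        }
    }

    /** Registers a listener that collects progress messages, then runs the task. */
    static List<String> demo() {
        List<String> messages = new ArrayList<>();
        SlowComparison comparison = new SlowComparison();
        comparison.setListener((stage, pct) -> messages.add(stage + ": " + pct + "%"));
        comparison.run();
        return messages;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In a real application the listener would typically update a progress bar or log rather than collect strings.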
When developing new filters to add to a pipeline, it is often useful to isolate the behaviour of each filter. To achieve this, the DocumentComparator can serialize the output from every filter in the pipeline to a separate file. These files are named to match filter steps and arranged in a directory structure to reflect that of the filter chains in the pipeline.
By default, pipeline diagnosis is disabled. To enable it, set the debugFiles property of the DocumentComparator to true. This property can also be controlled as a DeltaXML Core configuration property using its fully qualified name 'com.deltaxml.cores9api.DocumentComparator.debugFiles'. See the Core user guide for more information on this property.
The Document Comparator comprises a standard comparator engine that lies at the centre of a processing pipeline that links together a series of input and output filters. The filters are simple transforms that each modify the input for a specific purpose to produce the required output. Filters can be implemented in XSLT or Java.
The document comparison pipeline
For customization, filters can either be internal and controlled via DocumentComparator properties, or they can be external XSLT or Java resources. The external filters are managed via the Document Comparator API as FilterChain objects that are added at named extension points in the comparison pipeline. The extension points allow external filters to fit in with the behaviour of internal filters and are illustrated in the diagram below:
Extension points allow the insertion of custom filters into the pipeline
A FilterChain consists of a sequence of one or more FilterStep objects. Because a FilterChain is a type of FilterStep, it can in turn be added to another FilterChain.
A filter, normally coded in XSLT, is typically dedicated to a single task such as adding an attribute for elements that match a certain pattern. More detail on writing and adding a filter to the comparison pipeline can be found at Appendix I - Custom Filters.
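The nesting of chains within chains can be pictured with a simplified model. The Step and Chain types below are hypothetical stand-ins written for this sketch, operating on plain strings; they only mirror the relationship between FilterStep and FilterChain and are not the DeltaXML classes.

```java
import java.util.ArrayList;
import java.util.List;

class FilterChainSketch {

    /** Hypothetical stand-in for a filter step: transforms a document (here a String). */
    @FunctionalInterface
    interface Step {
        String apply(String input);
    }

    /** Hypothetical stand-in for a chain: runs its steps in order and is itself a Step. */
    static final class Chain implements Step {
        private final List<Step> steps = new ArrayList<>();

        Chain add(Step step) {
            steps.add(step);
            return this;
        }

        @Override
        public String apply(String input) {
            String current = input;
            for (Step step : steps) {
                current = step.apply(current);  // output of one step feeds the next
            }
            return current;
        }
    }

    /** Builds an inner chain, nests it in an outer chain, and runs the result. */
    static String demo() {
        Chain normalise = new Chain()
                .add(String::trim)
                .add(s -> s.replaceAll("\\s+", " "));
        Chain outer = new Chain()
                .add(normalise)                  // a chain used as a single step
                .add(s -> s.toUpperCase());
        return outer.apply("  hello   world ");
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints "HELLO WORLD"
    }
}
```

Treating a chain as a step keeps each filter small and single-purpose while still allowing groups of filters to be reused as a unit.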
Customization improves comparisons for a range of purposes that can be broadly categorized as: Input, Readability and Output. The sections below describe each of these categories in turn.
The Document Comparator provides 'out of the box' comparison behaviour designed for general input document characteristics. However, as outlined in the Document Comparison section, a comparison tailored to the element characteristics of a specific document type will yield significantly improved results.
deltaxml:word-by-word="false" attribute is found on the element. The Word By Word Text Comparison sample shows the result of an XSLT filter being used to disable the word-by-word feature. Note that word-by-word exploits word-splitting based on the icu4j Java library, from the International Components for Unicode (ICU).
xml:space="preserve" attribute is in scope on a containing element. Text nodes that contain only whitespace characters are treated differently when they are found to be children of mixed-content elements. The Managing Whitespace guide covers this subject in more detail.
deltaxml:formatting="true" attribute are 'flattened' for comparison purposes and then reconstructed post-comparison. This allows changes to formatting elements to be marked differently and therefore treated specially in an output filter. The default behaviour is to only show the formatting elements from the 'B' document; this is demonstrated in the Formatting Element Changes sample. The OutputFormatConfiguration.ModifiedFormatOutput property defaults to 'B', but has alternative values of 'A' or 'CHANGE'. When ModifiedFormatOutput is set to 'CHANGE', the formatting elements for both documents are represented using special DeltaV2.1 format elements.
DocumentComparator is used to customize or disable this functionality. For HTML table elements to be processed specially, they must be in the XHTML namespace.
For CALS table processing, the TableConfiguration object has a set of additional properties:
The standard behaviour is for the comparator to match elements based on their document order. This behaviour can be tailored to match different elements via any of the following methods.
deltaxml:ordered="false" attribute on an element whose child elements can appear in any order without affecting their meaning.
deltaxml:key="value" attribute, where the value is a unique identifier for aligning a specific element occurring in both input documents.
The result of orderless comparisons is controlled by the 'OrderlessPresentationMode' property of the OutputFormatConfiguration class. See the API documentation for full details on this.
The principle of using deltaxml:key="value" attributes with ordered data and mixed ordered/orderless data is shown in the Using Keys with Ordered Data sample and Mixed Ordered and Orderless Data guide respectively.
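To show why keys help with alignment, the sketch below pairs items from two documents by key instead of by position. It is a toy model built for this guide, with documents reduced to key/text maps and hypothetical names throughout; it does not reproduce the comparator's matching algorithm.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyedAlignmentSketch {

    /** Classifies each key as unchanged, modified, deleted (only in A) or added (only in B). */
    static Map<String, String> align(Map<String, String> docA, Map<String, String> docB) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : docA.entrySet()) {
            String key = entry.getKey();
            if (!docB.containsKey(key)) {
                result.put(key, "deleted");
            } else if (entry.getValue().equals(docB.get(key))) {
                result.put(key, "unchanged");
            } else {
                result.put(key, "modified");
            }
        }
        for (String key : docB.keySet()) {
            result.putIfAbsent(key, "added");  // keys seen only in B
        }
        return result;
    }

    /** Two small 'documents' whose items share keys but appear in a different order. */
    static Map<String, String> demo() {
        Map<String, String> docA = new LinkedHashMap<>();
        docA.put("k1", "Apples");
        docA.put("k2", "Pears");
        docA.put("k3", "Plums");

        Map<String, String> docB = new LinkedHashMap<>();
        docB.put("k3", "Plums");
        docB.put("k1", "Apples and oranges");
        docB.put("k4", "Figs");

        return align(docA, docB);
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints {k1=modified, k2=deleted, k3=unchanged, k4=added}
    }
}
```

Because the pairing uses the key, the item keyed k3 is still recognised as unchanged even though it has moved to the start of the second document.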
deltaxml:ignore-changes attribute, where the attribute value determines the keep/override behaviour. A practical example of how this feature is used can be found in the Ignoring Changes and Creating a Merged Document sample.
Note 1: The sample referenced above uses the Pipelined Comparator, so although the filters are the same, they are added to the pipeline and run differently when using the Document Comparator.
Note 2: This feature is not intended for elements marked as 'formatting elements' - use the 'Ignoring Formatting Changes' feature instead.
The description of standard input behaviours above covered the attributes or 'markers' that can be added to customize behaviour. With the exception of Ignored/Merged Changes, these are all added prior to comparison, i.e. in 'marker' input filters.
In some contexts, a non-optimized comparison result may be cluttered or hard to interpret. The Document Comparator exploits a number of features to enhance the readability of results without affecting their correctness. These features can be fine-tuned to specific content via the ResultReadabilityOptions class; its default property values are given in Appendix II. An overview of the readability options is given below:
Elements from two input documents may align mainly due to structure, with minimal shared text content. Moreover, text matches that do occur may be coincidental.
In such cases, it can cause confusion if the result is represented as a single element with many text changes, interspersed with matching text. Here the strategy is to split the modified element into two when the amount of shared text falls below a given percentage.
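The splitting decision can be sketched as a simple threshold test. The way shared text is measured below (whitespace-separated words matched regardless of position, relative to the longer text) is an assumption made purely for illustration; the comparator's actual measure is not described here. The threshold of 10 matches the ElementSplittingThreshold default listed in Appendix II.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ElementSplittingSketch {

    /**
     * Illustrative measure only: percentage of words the two texts have in common,
     * relative to the longer text. Each word in B can be matched at most once.
     */
    static double sharedTextPercentage(String textA, String textB) {
        List<String> wordsA = new ArrayList<>(Arrays.asList(textA.trim().split("\\s+")));
        List<String> wordsB = new ArrayList<>(Arrays.asList(textB.trim().split("\\s+")));
        int total = Math.max(wordsA.size(), wordsB.size());
        int shared = 0;
        for (String word : wordsA) {
            if (wordsB.remove(word)) {  // consume the matched word from B
                shared++;
            }
        }
        return 100.0 * shared / total;
    }

    /** Split the modified element in two when shared text falls below the threshold. */
    static boolean shouldSplit(String textA, String textB, int thresholdPercent) {
        return sharedTextPercentage(textA, textB) < thresholdPercent;
    }

    public static void main(String[] args) {
        int threshold = 10;  // default ElementSplittingThreshold from Appendix II
        System.out.println(shouldSplit(
                "Install the package before running the tests",
                "Install the package before running any tests", threshold));
        System.out.println(shouldSplit(
                "Contact the sales team for pricing details today",
                "Our offices are closed during public winter holidays", threshold));
    }
}
```

The first pair shares most of its words and is kept as one modified element; the second pair shares none and would be shown as a deleted element followed by an added one.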
Note that, to preserve change information on keyed elements, element splitting is never applied to elements with a deltaxml:key attribute.
The ChangeGatheringEnabled property can be set to false to disable this behaviour at the element level; however, word-level change gathering is always enabled.
A further reason for customization is to produce an output format designed for a specific purpose or tool.
LexicalPreservationConfig property that is initially set to
This standard lexical preservation should meet the requirements for many cases, but customization is available for specific output formats. This is described in the Lexical Preservation guide.
paragraph is present in the 'A' document but not in the 'B' document (i.e. the paragraph was deleted as a whole). This is initialized to
ModifiedFormatOutput property of the
The 'raw' result of the Document Comparator is the 'Delta'. It has the look and feel of the original input documents, but with annotations added to describe the differences. This is the standard output, but other output formats can be produced:
The OutputFormatConfiguration object has a ResultFormat property that is used to set the tracked changes format. This object has other properties for setting author and date-time data in the tracked-changes format.
The raw Delta output from the Document Comparator needs to be transformed to a suitable output format to allow the end-user to interpret the result most effectively. This format can be for viewing in a web browser, opening in an XML authoring tool, opening in a specialist reviewing tool, or perhaps ultimately printing to paper.
The Output Formats section above provides more detail on the pre-defined output formats available from the Document Comparator.
Changes can be viewed in standard XML Authoring tools using the 'Tracked Changes' format options
Most filters are coded in XSLT and will normally exploit DeltaXML's internal Saxon processor, which supports XSLT 2.0. Here is a sample input filter (XSLT 2.0) that performs an 'Identity Transform', except that it matches li elements with xml:id attributes and adds a deltaxml:key attribute to hold the xml:id value:
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">

      <xsl:template match="@* | node()">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
      </xsl:template>

      <xsl:template match="li[@xml:id]">
        <xsl:copy>
          <xsl:attribute name="deltaxml:key" select="@xml:id"/>
          <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
      </xsl:template>

    </xsl:stylesheet>
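When developing a filter like this, it can help to check its effect in isolation before adding it to a pipeline. The sketch below does that with the XSLT processor built into the JDK rather than DeltaXML's Saxon processor. Because the JDK processor only supports XSLT 1.0, the stylesheet is restated in 1.0 syntax (xsl:value-of inside xsl:attribute instead of the select attribute); the class name, method name and sample input are invented for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class KeyFilterDemo {

    // XSLT 1.0 restatement of the key-adding filter, for the JDK's built-in processor.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:deltaxml='http://www.deltaxml.com/ns/well-formed-delta-v1'>"
        + "<xsl:template match='@* | node()'>"
        + "<xsl:copy><xsl:apply-templates select='@* | node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='li[@xml:id]'>"
        + "<xsl:copy>"
        + "<xsl:attribute name='deltaxml:key'><xsl:value-of select='@xml:id'/></xsl:attribute>"
        + "<xsl:apply-templates select='@* | node()'/>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the filter over an XML string and returns the serialized result. */
    static String applyFilter(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Filter could not be applied", e);
        }
    }

    public static void main(String[] args) {
        // Only the first li has an xml:id, so only it should gain a deltaxml:key attribute.
        System.out.println(applyFilter("<ul><li xml:id='a1'>One</li><li>Two</li></ul>"));
    }
}
```

Running the filter on a small input like this makes it easy to confirm that only the intended elements are marked before the filter is placed in the comparison pipeline.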
Once written, the XSLT filter needs to be added to the Document Comparator pipeline using the DocumentComparator API. Here's some sample Java code showing this:
    DocumentComparator dc = new DocumentComparator(saxonProcessor);
    FilterStepHelper fsh = dc.newFilterStepHelper();
    FilterChain outFc = fsh.newFilterChain();
    FilterStep fsSBS = fsh.newFilterStepFromResource(
        "xsl/side-by-side/dx2-side-by-side.xsl", "side-by-side");
    outFc.addStep(fsSBS);
    dc.setExtensionPoint(ExtensionPoint.OUTPUT_FINAL, outFc);
Note that in the above example the FilterStep is created using the
newFilterStepFromResource method to access one of the XSLT filter files
included within DeltaXML Core; the
newFilterStep method is used to access
Screenshot of a side-by-side DiffReport created with the dx2-side-by-side.xsl output filter:
Filters can also be added using a DCP pipeline definition. The following DCP uses the 'folding' DiffReport stylesheet to format the output:
    <documentComparator version="1.0" id="dcp-folding"
        description="Render result as folding html view." >
      <extensionPoints>
        <outputExtensionPoints>
          <finalPoint>
            <filter>
              <resource name="xsl/dx2-deltaxml-folding-html.xsl"/>
            </filter>
          </finalPoint>
        </outputExtensionPoints>
      </extensionPoints>
    </documentComparator>
Screenshot of a 'folding' DiffReport created with the dx2-deltaxml-folding-html.xsl output filter:
The Document Comparator is pre-configured using default settings, which are summarized in the table below. More details can be found in the API documentation; links to the corresponding classes can be found in the 'Class' column.
|LexicalPreservationConfig||PresetPreservationMode||ROUND_TRIP||(see API docs)|
|ElementSplittingThreshold||10||1 to 100|
|OrphanedWordLengthLimit||2||1 to unlimited|
|OrphanedWordMaxPercentage||20||1 to 100|
|ModifiedWhitespaceBehaviour||AUTOMATIC||(See API docs)|
|ModifiedAttributeMode||AUTOMATIC||(See API docs)|
|OrderlessPresentationMode||B_DELETES||(See API docs)|
The Samples section of DeltaXML Core's documentation provides links to all samples and guides; these are designed for use with one or more of DeltaXML Core's comparators. The following list highlights samples of particular relevance to the Document Comparator.