DCP User Guide

1. Introduction

1.1. The Document Comparator Pipeline

DeltaXML Core's Document Comparator specializes in the comparison of XML documents with narrative content - as opposed to more data-centric XML documents. Features such as table-processing and formatting-element handling exploit a relatively complex XML processing system; this is broken down into a number of smaller, simpler components (also referred to as filters) arranged in a pipeline.

1.2. What is DCP?

The DCP (Document Comparator Pipelines) format is an XML language used for configuring the Document Comparator component and its built-in pipeline. It is the counterpart to the DXP format, which is used for configuring the Pipelined Comparator.

With DCP you define chains of XML processing filters that are inserted at specified extension points in the Document Comparator's built-in pipeline. You can set properties of the DocumentComparator and low-level built-in components; DCP does not require any knowledge of Java/C# programming.

The DCPConfiguration Class

The ability to embed DCP processing is also available for you to use in your applications. The DCPConfiguration class makes DCP available to a wide range of Java/.NET applications, simplifying configuration and adding flexibility in the use of DeltaXML Core's Document Comparator. Details of this class can be found in the Java API documentation and .NET API documentation; a working example is included in the Folding DiffReport with DCP sample.

Using the DCPConfiguration API, DCP capabilities can be integrated directly into a GUI or command-line interface. Examples for each of these are included in the DeltaXML Core distribution:

Command-line
When command.jar (or deltaxml.exe for .NET) is invoked it shows a list of DCP files and their descriptions. DCP files are then selected by an end-user by specifying the 'configuration-id' which corresponds to the 'id' attribute on the documentComparator root element in the DCP file. Command-line parameters are used to control comparison settings for a specific configuration.
Graphical User Interface (GUI)
This simple GUI (not included for .NET) is invoked using the deltaxml-gui.sh startup script, or the deltaxml-gui.exe directly if using the Windows distribution of Core. A configuration dialog is used to control different aspects of the comparison. From here, a drop-down list of the available DCP configurations is shown. Once a configuration is selected from this list, specific parameters are set from a properties grid to fine-tune the comparison.

Screenshot of a DCP configuration loaded into the GUI

1.3. When to use DCP

DCP allows a Document Comparator pipeline to be specified in declarative XML. Comparisons based on this pipeline can be initialized through the command-line, the GUI, or through the DCPConfiguration class's simple high-level Java/.NET API. As such, DCP can be used in most cases where you would use Java/C#. Because of its declarative nature, a DCP file should be easier to maintain than the equivalent Java or C#.

XPath expressions embedded within DCP allow for relatively sophisticated conditional processing. However, in more complex cases where the processing pipeline is dependent on many external factors, projects may benefit from the flexibility and extra diagnostics and testing that low-level coding in Java/C# brings.

1.4. Editing DCP with the DCP Schema

The XML vocabulary used for DCP is defined in the DCP XML Schema and is summarized in the DCP Schema Guide. XML Schema (XSD) 1.0 and 1.1 versions of the DCP schema are included with the Core distribution. Auto-completion and context-assistance features can be exploited when editing a DCP file in your XML editor by associating the DCP file with the schema. Many XML editors observe the 'xsi:noNamespaceSchemaLocation' attribute, which provides a 'hint' to the XSD file location. The XSD 1.1 version of the schema is preferred as it provides additional checking, for example type-checking on values referenced using 'parameterRef' attributes.

An example showing the XSD schema file associated with a DCP XML file:

<documentComparator
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="../schemas/core-dcp-v1_0.xsd"
    version="1.0"
    id="example"
    description="Example of a DCP definition">...

1.5. Summary of DCP

Here is a quick summary of DCP:

  • It is a tool customization/extension language, not a general purpose XML pipelining language.
  • DCP is a data-driven way of configuring and extending a DocumentComparator object which can then be used by a Java/C# program.
  • DCP defines extensions to the Document Comparator pipeline using filters at specified extensionPoints.
  • Using DCP is generally much simpler than Java/C# programming.
  • All Document Comparator features - barring low-level components such as progress listeners - are accessible via DCP.
  • Parameters with default values can be defined in DCP; such values can then be overridden externally.
  • All significant values within DCP can reference declared parameters instead of literal values.
  • XPath 2.0 expressions that reference declared parameters can be used instead of literal values.

2. The Document Comparator Pipeline Model

Underlying the DCP is a model. The example below illustrates how key parts of the model are used to produce a solution for comparing documents of two custom types: 'major' and 'minor'.

2.1. An example

In this particular example, we will:

  • Add XML attributes on the input pipeline so whitespace-normalization is optimized for the type of XML.
  • Add XML attributes on the input pipeline to mark formatting-only elements.
  • Optionally convert the XML delta format of the comparison output to a folding-html rendering.
  • Enable all lexical preservation - and keep information on changes.
  • Define parameters so different behaviours can be achieved using the same DCP but with different parameter overrides.

Details of the features and concepts used in this example are described after the example DCP.

DCP for example pipeline

<documentComparator 
    version="1.0" 
    id="example" 
    description="Example of a DCP definition" >

  <pipelineParameters>    
    <stringParameter name="orphan-threshold" defaultValue="20"/>
    <stringParameter name="document-type" defaultValue="major"/>
    
    <booleanParameter name="orphaned-words" defaultValue="false"/>
    <booleanParameter name="normalize-whitespace" defaultValue="true"/>
    <booleanParameter name="formatting-elements" defaultValue="true"/>
    <booleanParameter name="render-as-folding-html" defaultValue="true"/>
  </pipelineParameters>
  
  <advancedConfig>
    <outputProperties>
      <property name="indent" literalValue="no"/>
    </outputProperties>
    <parserFeatures>
      <feature
        name="http://apache.org/xml/features/nonvalidating/load-external-dtd"
        literalValue="false"/>
    </parserFeatures>
  </advancedConfig>
  
  <standardConfig>
    <resultReadabilityOptions>
      <modifiedWhitespaceBehaviour xpath="if ($normalize-whitespace) 
                                          then 'normalize' else 'show'"/>
      <orphanedWordDetectionEnabled parameterRef="orphaned-words"/>
      <orphanedWordLengthLimit literalValue="2"/>
      <orphanedWordMaxPercentage parameterRef="orphan-threshold"/>
      <elementSplittingEnabled literalValue="false"/>
    </resultReadabilityOptions>
    
    <lexicalPreservation>
      <defaults>
        <retain literalValue="true"/>
        <processingMode literalValue="change"/>
      </defaults>
    </lexicalPreservation>
    
  </standardConfig>
  
  <extensionPoints>
    <inputPreFlatteningPoint>
      <filter when="$formatting-elements 
                    and $document-type eq 'major'">
        <file path="mark-major-formatting.xsl" relBase="dxp"/>
      </filter>
      <filter when="$formatting-elements 
                    and $document-type eq 'minor'">
        <file path="mark-minor-formatting.xsl" relBase="dxp"/>
      </filter>
      <filter if="normalize-whitespace">
        <file path="mark-mixed-content.xsl" relBase="dxp"/>
      </filter>
      <filter if="normalize-whitespace">
        <file path="mark-ws-preserved.xsl" relBase="dxp"/>
      </filter>
    </inputPreFlatteningPoint>

    <outputExtensionPoints>
      <finalPoint>
        <filter if="render-as-folding-html">
          <resource name="xsl/dx2-deltaxml-folding-html" />
          <parameter name="smart-whitespace-normalization"
            xpath="not($normalize-whitespace)"/>
        </filter>
      </finalPoint>
    </outputExtensionPoints>
  </extensionPoints>
  
</documentComparator>
    

2.2. Document Comparator

The root element for the DCP is documentComparator. The description and id attributes here can be used by applications to summarize a DCP and help select it from a set of other DCP files.

A fullDescription child element could also be used here to provide a longer description of the DCP for use by external systems.

<documentComparator 
    version="1.0" 
    id="example" 
    description="Example of a DCP definition" >...

2.3. Pipeline Parameters

By using parameters we allow a DCP-defined pipeline to be reconfigured by an external system or even an end-user, avoiding the need to construct several similar pipeline definitions. There are potential performance benefits also, because only a single set of XSLT filters needs to be compiled ready for running different types of comparison.

Extract from the example showing DCP pipeline parameter declarations

  <pipelineParameters>    
    <stringParameter name="orphan-threshold" defaultValue="20"/>
    <stringParameter name="document-type" defaultValue="major"/>
    ...
    <booleanParameter name="formatting-elements" defaultValue="true"/>
    ...
  </pipelineParameters>

The pipelineParameters element contains a set of named string and boolean parameters (elements stringParameter and booleanParameter). These parameters set the default behaviour of this example. In this case, some parameters are referenced directly by name in attributes, while others are referenced as XPath variables within attributes containing XPath expressions.

Note: For advanced use, XPath expressions in the form of XSLT attribute value templates can be embedded in the defaultValue attribute of stringParameter elements. Here, XPath variables may reference previously defined parameters.

When using the Java/C# API, the DCPConfiguration object can be initialized with two maps supplied as arguments, one map for string parameters and the other for boolean parameters. A setParams method can be called on this object to supply a new set of parameter value overrides.
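The idea of layering externally supplied values over the declared defaults can be sketched in plain Java as follows. This is only an illustration of the two-map arrangement described above: the DcpParameterOverrides class and its effective method are hypothetical helpers, not part of the DeltaXML API, and the parameter names and defaults are taken from the example DCP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the DeltaXML API): layers caller-supplied
// overrides on top of the defaults declared in <pipelineParameters>.
final class DcpParameterOverrides {

    // Returns a new map holding every default, replaced by an override where one is given.
    static <V> Map<String, V> effective(Map<String, V> defaults, Map<String, V> overrides) {
        Map<String, V> result = new LinkedHashMap<>(defaults);
        result.putAll(overrides);
        return result;
    }

    public static void main(String[] args) {
        // Defaults copied from the example DCP: one map per parameter type.
        Map<String, String> stringDefaults = new LinkedHashMap<>();
        stringDefaults.put("orphan-threshold", "20");
        stringDefaults.put("document-type", "major");

        Map<String, Boolean> booleanDefaults = new LinkedHashMap<>();
        booleanDefaults.put("orphaned-words", false);
        booleanDefaults.put("normalize-whitespace", true);

        // Overrides that an application or end-user might supply.
        Map<String, String> stringOverrides = Map.of("document-type", "minor");
        Map<String, Boolean> booleanOverrides = Map.of("orphaned-words", true);

        System.out.println(effective(stringDefaults, stringOverrides));
        System.out.println(effective(booleanDefaults, booleanOverrides));
    }
}
```

Maps like the two override maps above are the kind of values an application would pass on to the DCPConfiguration object; consult the API documentation for the exact constructor and setParams signatures.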

2.4. Setting DCP Property Values

Throughout the DCP file, one of three possible attributes must be used to set DCP properties or filter parameter values on an element. In the example XML snippet below, the resultReadabilityOptions child properties have values set using all three of these in turn:

1      <modifiedWhitespaceBehaviour xpath="if ($normalize-whitespace) 
                                           then 'normalize' else 'show'"/>
2      <orphanedWordDetectionEnabled parameterRef="orphaned-words"/>
3      <orphanedWordLengthLimit literalValue="2"/>

The attributes:

  1. xpath contains an XPath expression that references the 'normalize-whitespace' boolean parameter as an XPath variable to conditionally set the value to 'normalize' or 'show'.
  2. parameterRef contains the name of the boolean parameter 'orphaned-words' that is used to set this value.
  3. literalValue contains '2', the actual value for the property.

One of these three attributes must always be used when setting a DCP property or filter parameter. They are mutually exclusive, so validation of the DCP will fail if you attempt to use more than one of these attributes on the same element.
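The "exactly one of three" rule can be illustrated with a small check written against the standard JDK DOM API. This is a sketch only: in practice the rule is enforced by validating against the DCP schema, and the ValueAttributeCheck class is a hypothetical name introduced here for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Illustrative check only; real validation is performed against the DCP schema.
final class ValueAttributeCheck {
    private static final String[] VALUE_ATTRIBUTES = {"xpath", "parameterRef", "literalValue"};

    // Counts how many of the three value-setting attributes appear on the element.
    static int countValueAttributes(Element element) {
        int count = 0;
        for (String name : VALUE_ATTRIBUTES) {
            if (element.hasAttribute(name)) {
                count++;
            }
        }
        return count;
    }

    // True when exactly one of xpath, parameterRef or literalValue is present
    // on the root element of the supplied XML fragment.
    static boolean isValid(String elementXml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(elementXml)))
                    .getDocumentElement();
            return countValueAttributes(root) == 1;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse: " + elementXml, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<orphanedWordLengthLimit literalValue=\"2\"/>"));
        System.out.println(isValid(
                "<orphanedWordLengthLimit literalValue=\"2\" parameterRef=\"orphaned-words\"/>"));
    }
}
```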

Attribute Value Templates

For attributes other than 'xpath' and 'when', the '{' and '}' characters have special significance: they delimit XSLT 'attribute value templates' (AVTs). So, if you need to use these characters literally, they should be escaped as '{{' and '}}' respectively.

Note: filter names, paths, URLs or classes can potentially be set using AVTs. However, these are only evaluated with the initialising set of parameters in the evaluation context, because all filters are loaded only once. Filter parameters though are re-evaluated each time the parameter set changes.
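When DCP attribute values are generated by a program, the brace-escaping rule above can be applied with a one-line helper. The sketch below assumes only the rule as stated ('{' becomes '{{' and '}' becomes '}}'); the AvtEscaping class is a hypothetical name and is not part of DeltaXML Core.

```java
// Hypothetical helper: prepares a literal string for use in an attribute
// that is interpreted as an attribute value template (AVT).
final class AvtEscaping {

    // Doubles each curly brace so that it is read as a literal character.
    static String escapeLiteral(String value) {
        return value.replace("{", "{{").replace("}", "}}");
    }

    public static void main(String[] args) {
        // A literal value containing braces, for example part of a file name.
        System.out.println(escapeLiteral("report-{draft}.xsl"));
    }
}
```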

2.5. Advanced Configuration

The advancedConfig element is used to set properties and features of low-level components used by the Document Comparator. In this example, to prevent indentation of the XML output, the child outputProperties element is used to set the 'indent' property of the built-in Saxon Serializer instance to 'no'. Also, to prevent issues when DTDs are not available, we can prevent the parser from attempting to load the DTD; here, the relevant Apache parser feature is set to 'false' within the parserFeatures element.

  <advancedConfig>
    <outputProperties>
      <property name="indent" literalValue="no"/>
    </outputProperties>
    <parserFeatures>
      <feature
        name="http://apache.org/xml/features/nonvalidating/load-external-dtd"
        literalValue="false"/>
    </parserFeatures>
  </advancedConfig>

Child elements of the advancedConfig element determine factors such as how DTDs or schemas are loaded and used, what collations are used for sorting and how XML is serialized. Full details can be found in the referenced documentation in the table below:

Properties and features managed via the advancedConfig element

Element                              Short Description             Reference
outputProperties                     xsl:output instruction        Saxon Serializer.setOutputProperty
parserFeatures                       Parser Features               https://xerces.apache.org/xerces2-j/features.html
parserProperties                     Parser Properties             https://xerces.apache.org/xerces2-j/properties.html
transformerConfigurationProperties   Saxon Configuration Options   net.sf.saxon.lib/FeatureKeys

2.6. Standard Configuration

The standardConfig element is used for setting properties that would otherwise be set via the DocumentComparator API. In this example, the child resultReadabilityOptions and lexicalPreservation elements are used to configure corresponding properties available in the DocumentComparator class.

  <standardConfig>
    <resultReadabilityOptions>
      ...
      <orphanedWordMaxPercentage parameterRef="orphan-threshold"/>
    </resultReadabilityOptions>
    
    <lexicalPreservation>
      <defaults>
        <retain literalValue="true"/>
        <processingMode literalValue="change"/>
      </defaults>
    </lexicalPreservation>
    
  </standardConfig>

To help illustrate the relationship between DCP and the DocumentComparator API, here is the equivalent Java code for setting the 'orphan-threshold' value:

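// Equivalent of <orphanedWordMaxPercentage parameterRef="orphan-threshold"/>
DocumentComparator dc = new DocumentComparator();
int orphanThreshold = 20; // default value of the 'orphan-threshold' parameter
dc.getResultReadabilityOptions().setOrphanedWordMaxPercentage(orphanThreshold);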

2.7. Extension Points

The extensionPoints element contains elements defining all filters to be inserted into the XML processing pipeline. The parent of each filter element determines the extension point at which the filter is inserted. With the exception of the 'inputPreFlatteningPoint' element, each extension point element needs a further parent element to specify the general extension point group within the pipeline, that is: both-inputs, input-A, input-B or output.

  <extensionPoints>
    <inputPreFlatteningPoint>
      <filter ...
    </inputPreFlatteningPoint>

    <outputExtensionPoints>
      <finalPoint>
        <filter ...
      </finalPoint>
    </outputExtensionPoints>
  </extensionPoints>

The diagram below shows the basic DCP pipeline model, with two input pipes (A and B), a comparator in the middle and a single output pipe. The location of named extension points is also shown.

Diagram of the DCP pipeline model. Labelled extension points: inputAExtensionPoints, inputBExtensionPoints, inputPreFlatteningPoint, outputExtensionPoints, preAttributePoint, finalPoint.

Filters are added to the Document Comparator pipeline at the extension points labelled in the diagram above.

Both of the XML inputs to a comparison are passed through chains of input filters. These filters can add, remove or change information as data passes through them. Each filter operates by modifying a stream of SAX events (or callbacks to a SAX ContentHandler).

The operation of these filters can be defined using Java or XSLT. The input filters can be symmetrical (the same filters for each input) through the use of inputExtensionPoints or asymmetrical with the separate inputAExtensionPoints and inputBExtensionPoints elements used to specify the filters for each input. Input filters intended to affect word-by-word and formatting-element features are applied to both inputs at the inputPreFlatteningPoint.

In our example above, we only use the extension points: inputPreFlatteningPoint and finalPoint.

2.8. Filters

A DCP filter is represented by a filter element that must be contained within an element representing an extension point in the pipeline to which the filter should be added.

More generally, a filter is a component in a pipeline which processes XML data in some way.

Input and output filters can be implemented using XSLT or Java. The use of Java for output filtering is facilitated by the use of the XMLOutputFilter class and associated adapters provided in the DeltaXML Core API. These supplant the JAXP mechanism and are described in more detail in Powering Pipelines with JAXP.

Java filters

A Java filter is one which implements the org.xml.sax.XMLFilter interface, typically by extending the XMLFilterImpl class. It is used in compiled form, and the associated class file must be available to the classloader of the application. To use a Java filter, its fully qualified class name is specified in a class element added as a child of the filter element, as in the following example. This example uses one of the filters provided in the deltaxml.jar file included in the release.

Using a Java filter

<filter>
  <class name="com.deltaxml.pipe.filters.WordByWordInfilter"/>
</filter>
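As a minimal sketch of the pattern described above, the class below extends XMLFilterImpl and adds an attribute to selected elements as SAX events pass through it. MarkBoldFilter, the 'b' element name and the 'marked' attribute are hypothetical and are not part of DeltaXML Core; the demo class uses only standard JDK SAX classes to drive the filter outside of any pipeline.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: adds marked="true" to every <b> element it sees.
class MarkBoldFilter extends XMLFilterImpl {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("b".equals(qName)) {
            // Copy the incoming attributes, then append the marker attribute.
            AttributesImpl extended = new AttributesImpl(atts);
            extended.addAttribute("", "marked", "marked", "CDATA", "true");
            atts = extended;
        }
        // Pass the (possibly modified) event on to the next handler in the chain.
        super.startElement(uri, localName, qName, atts);
    }
}

// Drives the filter with a plain JDK SAX parser and counts the marked elements.
final class MarkBoldFilterDemo {
    static int countMarked(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            MarkBoldFilter filter = new MarkBoldFilter();
            filter.setParent(reader);

            final int[] marked = {0};
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes atts) {
                    if (atts.getValue("marked") != null) {
                        marked[0]++;
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return marked[0];
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMarked("<p>a<b>x</b><i>y</i><b>z</b></p>"));
    }
}
```

A filter of this kind would be referenced from DCP through its fully qualified class name in a class element, provided the compiled class is on the application's classpath.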

XSLT filters

There are a number of ways to locate an XSLT filter, including:

  • Specify a URL in a http element
  • Specify a file path in a file element
  • Include the filter in a Jar file and use a resource element

HTTP URL support is based on the java.net.URL class. The following example shows how a filter can be addressed using a URL.

Referring to an XSLT filter by HTTP URL

<filter>
  <http url="http://www.example.com/samples/filter.xsl"/>
</filter>

Files can also be used to specify XSLT filter locations. The underlying support for this type of filter specification is based on the java.io.File class, and any file specifications should be compatible with the pathnames used with this Java class. See the following example.

Referring to an XSLT filter by File location

<filter if="normalize-whitespace">
  <file path="mark-ws-preserved.xsl" relBase="dxp"/>
</filter>

The above example uses a relative path to specify the location of the file. For such relative paths, the relBase attribute is used to specify how the path is resolved. This attribute takes one of these three values:

  • current - resolve using the current working directory, obtained from the Java user.dir system property
  • home - resolve using the user's home directory, corresponding to the Java property user.home
  • dxp - resolve using the directory containing the DCP file, when it is loaded from a File (note: 'dxp' is used here to maintain compatibility with the filter element structure used in DXP).
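The three relBase values listed above can be modelled with java.io.File and the standard system properties. The sketch below is an illustration of the resolution rules as described, not the code used by DeltaXML Core; RelBaseResolver and its resolve method are hypothetical names.

```java
import java.io.File;

// Hypothetical helper illustrating how a relative filter path could be
// resolved for each relBase value.
final class RelBaseResolver {

    // path    - the relative path from the file element
    // relBase - one of "current", "home" or "dxp"
    // dcpFile - the DCP file being loaded (only used for "dxp")
    static File resolve(String path, String relBase, File dcpFile) {
        File base;
        switch (relBase) {
            case "current":
                base = new File(System.getProperty("user.dir"));
                break;
            case "home":
                base = new File(System.getProperty("user.home"));
                break;
            case "dxp":
                base = dcpFile.getAbsoluteFile().getParentFile();
                break;
            default:
                throw new IllegalArgumentException("Unknown relBase: " + relBase);
        }
        return new File(base, path);
    }

    public static void main(String[] args) {
        File dcp = new File("config/example.dcp");
        System.out.println(resolve("mark-ws-preserved.xsl", "dxp", dcp));
        System.out.println(resolve("mark-ws-preserved.xsl", "current", dcp));
    }
}
```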

The final way of locating XSLT scripts is the resource mechanism. This allows XSLT files to be located on the classpath, and in particular in .jar files. The path used is the location of the XSLT script within the jar file, and more precisely is the path used as an argument to the ClassLoader.getResource(String) method.

This mechanism is provided so that you can deliver, to an end-user, a single jar file containing both code and data for one or more DCP pipelines. The following example XML snippet shows how a reference to a filter located in a jar file is added.

Referring to an XSLT filter inside a Jar file

<filter if="render-as-folding-html">
    <resource name="xsl/dx2-deltaxml-folding-html" />
    ...
</filter>
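To check in advance whether a resource name will be found on the classpath, the same ClassLoader.getResource(String) call can be made directly. The sketch below uses the system class loader, which is an assumption; the class loader actually used by DeltaXML Core may differ. ResourceLookup is a hypothetical class name.

```java
import java.net.URL;

// Hypothetical helper: looks up a resource name on the classpath.
final class ResourceLookup {

    // Returns the URL of the named classpath resource, or null if it is not found.
    static URL locate(String resourceName) {
        return ClassLoader.getSystemClassLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        // The name is a path within the jar or classpath directory, with no leading '/'.
        URL url = locate("xsl/dx2-deltaxml-folding-html");
        System.out.println(url == null ? "not on classpath" : url.toString());
    }
}
```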

Filter Parameters

The operation of a filter may be controlled by parameters passed to the filter. Any number of parameters may be supplied to a filter, but their names must match those defined within the filter. Parameters are listed as child parameter elements within the filter element. An example:

<filter if="render-as-folding-html">
  <resource name="xsl/dx2-deltaxml-folding-html" />
  <parameter name="smart-whitespace-normalization"
             xpath="not($normalize-whitespace)"/>
</filter>
      
    

DCP filter parameter values are set using 'literalValue', 'parameterRef' or 'xpath' attributes as described in the Setting DCP Property Values section above.

When setting parameter values for XSLT filters, the 'xpath' attribute has special significance because the result of evaluating the expression is passed directly to the XSLT filter as an XPath Data Model value (Saxon XdmValue). This means that parameter values are not restricted to simple strings; they may, for example, evaluate to a sequence of xs:integer values. When using non-string values, the corresponding xsl:param instruction in the XSLT should be typed with an appropriate type, for example:

<parameter name="heading-levels" xpath="(1,2,3)"/>

This parameter should have a corresponding declaration in the XSLT, such as:

<xsl:param name="heading-levels" as="xs:integer*"/>

To supply parameters to a Java filter, a parameter-setting (set) method should be provided. This method must conform to certain requirements: its name must be the string 'set' followed by the exact DCP parameter name string, and it should take a single boolean or String parameter.
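The naming requirement can be sketched as follows. ConfigurableFilter and its two parameters, 'MarkerName' and 'Enabled', are hypothetical; the names start with a capital letter so that 'set' plus the exact parameter name gives conventional Java setter names. The SetterLookup class shows one way such a method could be found by reflection; it is an illustration only, and the mechanism actually used by DeltaXML Core is not described in this guide.

```java
import java.lang.reflect.Method;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical Java filter exposing one String and one boolean parameter.
class ConfigurableFilter extends XMLFilterImpl {
    private String markerName = "marked";
    private boolean enabled = true;

    // Set method for a DCP filter parameter named exactly "MarkerName".
    public void setMarkerName(String value) {
        this.markerName = value;
    }

    // Set method for a DCP filter parameter named exactly "Enabled".
    public void setEnabled(boolean value) {
        this.enabled = value;
    }

    public String getMarkerName() {
        return markerName;
    }

    public boolean isEnabled() {
        return enabled;
    }
}

// Illustration of locating a set method from a parameter name.
final class SetterLookup {

    // Invokes set<parameterName>(value), choosing boolean or String by the value's type.
    static void apply(Object target, String parameterName, Object value) {
        Class<?> type = (value instanceof Boolean) ? boolean.class : String.class;
        try {
            Method setter = target.getClass().getMethod("set" + parameterName, type);
            setter.invoke(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No usable set method for " + parameterName, e);
        }
    }

    public static void main(String[] args) {
        ConfigurableFilter filter = new ConfigurableFilter();
        apply(filter, "MarkerName", "flag");
        apply(filter, "Enabled", false);
        System.out.println(filter.getMarkerName() + " " + filter.isEnabled());
    }
}
```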

Please consult the sample filters and pipelines provided in the release for examples.

Conditional Filter Processing

While filters are always loaded when a DCP is first initialized, externally supplied pipeline parameter values can be used to enable or disable these filters for specific comparisons.

Two attributes, 'if' and 'unless', may be added to any filter element. Each value should refer to one boolean pipeline parameter by name. With the if attribute, the filter is applied when the referenced parameter is true. Conversely, with the unless attribute, the filter is applied when the referenced parameter is false. If both attributes are used (normally referring to different parameters), the filter is applied only when both conditions hold, that is, the boolean-and of the two conditions.
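The if/unless rules above amount to a small boolean function over the current parameter values. The sketch below models that logic in plain Java; FilterCondition is a hypothetical class, and treating a filter with neither attribute as always enabled is an assumption made for the illustration.

```java
import java.util.Map;

// Hypothetical model of the if/unless decision for a single filter.
final class FilterCondition {

    // Looks up a declared boolean parameter, failing clearly if it is missing.
    private static boolean value(String name, Map<String, Boolean> params) {
        Boolean v = params.get(name);
        if (v == null) {
            throw new IllegalArgumentException("Undeclared boolean parameter: " + name);
        }
        return v;
    }

    // ifParam and unlessParam are parameter names, or null when the attribute is absent.
    static boolean isEnabled(String ifParam, String unlessParam, Map<String, Boolean> params) {
        boolean ifHolds = ifParam == null || value(ifParam, params);
        boolean unlessHolds = unlessParam == null || !value(unlessParam, params);
        // When both attributes are present the result is the boolean-and of both conditions.
        return ifHolds && unlessHolds;
    }

    public static void main(String[] args) {
        Map<String, Boolean> params = Map.of(
                "normalize-whitespace", true,
                "render-as-folding-html", false);
        // Models <filter if="normalize-whitespace">
        System.out.println(isEnabled("normalize-whitespace", null, params));
        // Models <filter if="render-as-folding-html">
        System.out.println(isEnabled("render-as-folding-html", null, params));
    }
}
```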

The 'when' attribute must be used on its own. Its value should be an XPath expression that evaluates to an xs:boolean. All pipeline parameters are part of the evaluation context, but there is no context item so expressions should be 'context-free'.

The following snippet from the full example shows how filters in the 'inputPreFlatteningPoint' extension point are enabled or disabled according to the values of the pipeline parameters 'formatting-elements', 'document-type' and 'normalize-whitespace'.

Conditional Filter Example

<inputPreFlatteningPoint>
  <filter when="$formatting-elements 
                and $document-type eq 'major'">
    <file path="mark-major-formatting.xsl" relBase="dxp"/>
  </filter>
  <filter when="$formatting-elements 
                and $document-type eq 'minor'">
    <file path="mark-minor-formatting.xsl" relBase="dxp"/>
  </filter>
  <filter if="normalize-whitespace">
    <file path="mark-mixed-content.xsl" relBase="dxp"/>
  </filter>
  <filter if="normalize-whitespace">
    <file path="mark-ws-preserved.xsl" relBase="dxp"/>
  </filter>
</inputPreFlatteningPoint>

2.9. Processing Instructions

For applications exploiting different DCP configurations, it may help to be able to read application-specific information from a DCP file. To assist with this, processing instructions added as immediate children of the root element may be read using the getProcessingInstruction(String name) method of the DCPConfiguration class.

For example, using this processing-instruction as a child of the documentComparator element:

<?deltaxml.outputType xml?>

The following Java code can be used to retrieve the value:

DCPConfiguration dcp = new DCPConfiguration(new File("sample.dxp"));
String piValue = dcp.getProcessingInstruction("deltaxml.outputType");
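To make the phrase "immediate children of the root element" concrete, the sketch below reads such a processing instruction using only the standard JDK DOM API. It is an illustration of which nodes are meant, not a description of how DCPConfiguration is implemented; RootProcessingInstructions is a hypothetical class name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper: finds a processing instruction that is an immediate
// child of the root element and returns its data.
final class RootProcessingInstructions {

    // Returns the data of the first matching processing instruction, or null if none.
    static String read(String xml, String target) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node child = doc.getDocumentElement().getFirstChild();
            for (; child != null; child = child.getNextSibling()) {
                if (child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) child;
                    if (target.equals(pi.getTarget())) {
                        return pi.getData();
                    }
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse DCP content", e);
        }
    }

    public static void main(String[] args) {
        String dcp = "<documentComparator version=\"1.0\" id=\"example\">"
                + "<?deltaxml.outputType xml?>"
                + "</documentComparator>";
        System.out.println(read(dcp, "deltaxml.outputType"));
    }
}
```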

3. Differences between DCP (DocumentComparator) and DXP (PipelinedComparator)

This section describes the differences between the DCP covered in this document and DXP, described in the DXP User Guide document.

DXP is an XML format used to define the filters and settings for a PipelinedComparator object instead of the DocumentComparator object configured with DCP. The PipelinedComparator object is a more light-weight XML comparator, more suited to the comparison of XML data documents. The main differences between DCP and DXP are outlined below:

  • DCP defines a DocumentComparator object; DXP defines a PipelinedComparator object.
  • DCP was introduced to DeltaXML Core after DXP.
  • The DCP grammar is defined using XML Schema (1.0 or 1.1) instead of the DTD used for DXP.
  • DCP configures all options for the DocumentComparator as well as defining the pipeline itself; DXP supports a more limited set of options for the PipelinedComparator.
  • Filters are defined in the same way in DCP, but they are applied only at specific extension points in the pipeline.
  • DCP allows XPath 2.0 evaluation, either in dedicated attributes or attribute value templates. Unlike DXP, DCP does not support XQuery expressions.
  • A DocumentComparator defined using DCP has a highly featured built-in pipeline and therefore takes longer to initialize than a PipelinedComparator.

4. Initiating a DocumentComparator with DCP

A DCP file can be loaded when performing a comparison using one of the following methods:

  1. With DeltaXML Core's command.jar or deltaxml.exe command-line processor
  2. From DeltaXML Core's Graphical User Interface (not available for .NET distributions)
  3. Using the Java API, construct a DCPConfiguration object using the DCP file
  4. Using the .NET API, construct a DCPConfiguration object using the DCP file

5. DCP Samples

The following included samples use DCP as one of the methods for defining a Document Comparator pipeline: