
Guide to filters and pipelines

Contents

1. Introduction
2. Word By Word Changes
3. Worked Example
4. Handling textual formatting
5. Further document-centric optimizations
6. Pipeline filter ordering
7. Specific formats
8. Other misc filters

1 Introduction

The DeltaXML comparison engine is designed to compare well-formed XML content.  It generally does not understand any particular DTD or schema, or concepts common to readers/writers of documents such as words or sentences.

We use filters to add these concepts/semantics and introduce a finer granularity of processing into the comparison process.  Using filters provides more flexibility/extensibility than building these concepts into the comparison engine.  Consider the following requirements:

2 Word By Word Changes

The Word By Word pipeline is general purpose and is a very good starting point for document-centric comparison pipelines.

2.1 The word by word concept/introduction

At the centre of the pipeline is a pair of filters: one subdivides text into smaller chunks so that the comparison engine can process change at a smaller, 'word' granularity, and the other reconstitutes the text from the words.  The comparison engine processes data in accordance with the XML specification, which treats a PCDATA segment as a contiguous sequence of characters.  XML itself has no concept of a sentence, a word, or punctuation, and the comparison engine follows the XML specification.

The WordInfilter converts document-centric markup, such as the first example below, into the form shown in the second.  Note that the result has been indented here to make it easier to read:

<para>The quick brown fox jumps over the lazy dog.</para>

<para>
  <deltaxml:word>The</deltaxml:word>
  <deltaxml:space> </deltaxml:space>
  <deltaxml:word>quick</deltaxml:word>
  <deltaxml:space> </deltaxml:space>
  <deltaxml:word>brown</deltaxml:word>
  <deltaxml:space> </deltaxml:space>
  <deltaxml:word>fox</deltaxml:word>
  <deltaxml:space> </deltaxml:space>
  <deltaxml:word>jumps</deltaxml:word>
  <deltaxml:space> </deltaxml:space>
  <deltaxml:word>over</deltaxml:word>
  <deltaxml:space> </deltaxml:space>
  <deltaxml:word>the</deltaxml:word>
  <deltaxml:space> </deltaxml:space>
  <deltaxml:word>lazy</deltaxml:word>
  <deltaxml:space> </deltaxml:space>
  <deltaxml:word>dog</deltaxml:word>
  <deltaxml:punctuation>.</deltaxml:punctuation>
</para>
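The subdivision step can be illustrated with a minimal Python sketch. This is not the WordInfilter implementation (which is a Java SAX filter); the function names `split_words` and `to_xml` are hypothetical, and it assumes that whitespace runs become space tokens and that only characters listed in a `punctuation` argument are split off from words.

```python
import re
from xml.sax.saxutils import escape


def split_words(text, punctuation=""):
    """Split text into (kind, value) tokens: 'word', 'space' or 'punctuation'.

    Hypothetical helper: only characters in `punctuation` are split from words.
    """
    tokens = []
    # Alternate between runs of whitespace and runs of non-whitespace.
    for chunk in re.findall(r"\s+|\S+", text):
        if chunk.isspace():
            tokens.append(("space", chunk))
            continue
        word = ""
        for ch in chunk:
            if ch in punctuation:
                if word:
                    tokens.append(("word", word))
                    word = ""
                tokens.append(("punctuation", ch))
            else:
                word += ch
        if word:
            tokens.append(("word", word))
    return tokens


def to_xml(tokens):
    """Serialize tokens as deltaxml:word/space/punctuation elements."""
    return "".join(
        f"<deltaxml:{kind}>{escape(value)}</deltaxml:{kind}>"
        for kind, value in tokens
    )
```

With an empty punctuation set, `split_words("lazy dog.")` keeps `dog.` as a single word; passing `punctuation="."` yields a separate punctuation token for the full stop.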

The corresponding WordOutfilter is designed to reverse this process and deal with any delta information added by the comparator.  Further details of the operation of these filters are included in their documentation.  However, some quick notes are provided here:

To see this process in action, please visit our feature walkthroughs, in particular, visit the textual changes example and experiment with the word-by-word setting.

2.2 The unchanged space problem

When spaces are used to separate words, the comparator will identify many unchanged spaces in what appear, to a human reader, to be completely unrelated sentences of text.  Again, the comparison engine only understands well-formed XML and as such cannot attach any more significance to a word element than to a space element.  The result of comparing different strings of text, after word processing, will appear as follows.  In this case we will use red/green background colours to signify deleted/added (A/B) text respectively in the results, to make them more concise:

<para>Hello World!</para>  compared with: <para>The quick brown ...</para>

After processing with WordInfilter, comparison, WordOutfilter the result would appear to be:

<para>HelloThe Worldquick! brown ...</para>

Such a result, while having a mathematically optimal edit-path length, is difficult for a human to interpret.  The WordSpaceFixup filter is one part of the process of converting this result from the form above into a less mathematically optimal, but easier to interpret, form such as:

<para>Hello World!The quick brown ...</para>
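The idea behind the space fixup can be sketched in Python. This is an illustration under stated assumptions rather than the WordSpaceFixup implementation: tokens are modelled as hypothetical `(kind, text, delta)` tuples where `delta` is `"A"` (deleted), `"B"` (added) or `"A=B"` (unchanged), and the rule applied is that an unchanged space with changed tokens on both sides is replaced by a deleted space and an added space.

```python
def fix_spaces(tokens):
    """Replace unchanged spaces that sit between changed tokens.

    tokens: list of (kind, text, delta) tuples; delta is "A", "B" or "A=B".
    Assumed rule: an unchanged space flanked by changed tokens on both
    sides becomes a deleted space followed by an added space.
    """
    fixed = []
    for i, (kind, text, delta) in enumerate(tokens):
        prev_changed = i > 0 and tokens[i - 1][2] != "A=B"
        next_changed = i + 1 < len(tokens) and tokens[i + 1][2] != "A=B"
        if kind == "space" and delta == "A=B" and prev_changed and next_changed:
            fixed.append((kind, text, "A"))
            fixed.append((kind, text, "B"))
        else:
            fixed.append((kind, text, delta))
    return fixed
```

Once the space is no longer reported as unchanged, a later grouping step is free to collect the deleted words and the added words into two contiguous runs.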

To see a more detailed example of this and the other optimizations please see the Worked Example section below.

2.3 Orphaned Words

One of the functions of the DeltaXML comparison engine is to identify unchanged data between the two input XML trees.  For data-centric XML and coarse granularity documents this approach works well.  However, when working with fine or word granularity documents, unrelated sentences or paragraphs of text often contain the same words.  In the English language these words could typically include: and, or, in, of, the, but, to, is.

These common words may fragment the differences as shown above or, in extreme cases, cause mis-matching of the containing element, such as a paragraph.  The OrphanedWordOutfilter is another optimizing filter which makes the result easier to interpret.

An orphaned word is defined as an unchanged word surrounded by a certain (parameterizable) threshold of changed (added/deleted/modified) words.  The Worked Example section below provides an example of how this filter works.
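The definition above can be made concrete with a small Python sketch. It is not the OrphanedWordOutfilter itself: the function name `split_orphans` and its parameters `max_orphan_len` and `min_changed` are hypothetical, spaces are ignored for simplicity, and the exact way the real filter counts surrounding changes may differ.

```python
from itertools import groupby


def split_orphans(words, max_orphan_len=1, min_changed=2):
    """Turn orphaned unchanged words into deleted/added pairs.

    words: list of (text, delta) tuples; delta is "A", "B" or "A=B".
    Assumed rule: a run of at most `max_orphan_len` unchanged words is
    orphaned when the changed runs on both sides each contain at least
    `min_changed` words.
    """
    # Runs alternate between unchanged and changed words.
    runs = [
        (unchanged, list(group))
        for unchanged, group in groupby(words, key=lambda w: w[1] == "A=B")
    ]
    result = []
    for i, (unchanged, run) in enumerate(runs):
        before = len(runs[i - 1][1]) if i > 0 else 0
        after = len(runs[i + 1][1]) if i + 1 < len(runs) else 0
        if (unchanged and len(run) <= max_orphan_len
                and before >= min_changed and after >= min_changed):
            # Report the orphan as deleted and then added again.
            result.extend((text, "A") for text, _ in run)
            result.extend((text, "B") for text, _ in run)
        else:
            result.extend(run)
    return result
```

An unchanged word at the start or end of the sequence has no changed run on one side, so this sketch never treats it as orphaned.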

[TODO: Takes a very localize view of changes, threshold takes a more global, top-down view.?]

3 Worked Example

In the following example we will compare the following two simple textual XML examples with various filters.

Input 1: <p>A pangram uses all the letters of the alphabet and is often used to test typewriters or computer keyboards, for example: The quick brown fox jumps over the lazy dog.</p>

Input 2: <p>A pangram uses all the letters of the alphabet, some repeatedly, and is often used to test typewriters or computer keyboards, for example: A quick movement of the enemy will jeopardize six gunboats.</p>

3.1 Applying the WordInfilter

Firstly we will process these examples using the WordInfilter and WordOutfilter.  The result is very verbose, consisting of many word and space elements, but can be represented precisely and concisely using coloured text as below:

<p>A pangram uses all the letters of the alphabetalphabet, some repeatedly, and is often used to test typewriters or computer keyboards, for example: TheA quick brownmovement foxof jumps over the lazyenemy dog.will jeopardize six gunboats.</p>

3.2 Adjusting punctuation

The previous result used the default (empty) definition of punctuation.  Because of this we appear, at first glance, to have changed the word alphabet, and the final full stop appears twice, firstly with the word dog and then with gunboats.  We can address this by introducing the deltaxml:punctuation attribute into the input, for example with the following setting:

xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" deltaxml:punctuation=", : ."

The output is improved:

<p>A pangram uses all the letters of the alphabet, some repeatedly, and is often used to test typewriters or computer keyboards, for example: TheA quick brownmovement foxof jumps over the lazyenemy dogwill jeopardize six gunboats.</p>

3.3 Adding the OrphanedWordOutfilter

The following result demonstrates the effect of adding the OrphanedWordOutfilter as the first output filter after the comparator.  It detects orphaned unchanged words and produces a pair of added/deleted words in their place.  In this example the words quick and the are duplicated in this way:

<p>A pangram uses all the letters of the alphabet, some repeatedly, and is often used to test typewriters or computer keyboards, for example: TheA quickquick brownmovement foxof jumps over thethe lazyenemy dogwill jeopardize six gunboats.</p>

3.4 Processing unchanged spaces with WordSpaceFixup

If we now add the WordSpaceFixup outfilter to our pipeline, the result becomes:

<p>A pangram uses all the letters of the alphabet,  some repeatedly, and is often used to test typewriters or computer keyboards, for example: TheA  quickquick  brownmovement  foxof  jumps over thethe  lazyenemy  dogwill jeopardize six gunboats.</p>

As with the previous filter, the benefit of this processing becomes clear when we group the added and deleted words/spaces together, in this case with the WordOutfilter which follows below.

3.5 Converting back into text with the WordOutfilter

The next output filter is WordOutfilter which, as well as converting the word, punctuation and space elements back into text, also groups deleted and added sections (between any unchanged words) back together.

<p>A pangram uses all the letters of the alphabet , some repeatedly, and is often used to test typewriters or computer keyboards, for example: The quick brown fox jumps over the lazy dogA quick movement of the enemy will jeopardize six gunboats.</p>
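The grouping behaviour described in this section can be sketched as follows. This is a simplified Python illustration, not the WordOutfilter code: the function name `group_changes` is hypothetical, tokens are modelled as `(text, delta)` tuples with `delta` being `"A"`, `"B"` or `"A=B"`, and the output is a list of merged text segments rather than XML.

```python
def group_changes(tokens):
    """Merge tokens into text segments, grouping deletions before additions.

    tokens: list of (text, delta) tuples; delta is "A", "B" or "A=B".
    Between two unchanged tokens, all deleted text is emitted as one
    segment, followed by all added text as one segment.
    """
    segments = []
    deleted, added = [], []

    def flush():
        # Emit the pending deleted run, then the pending added run.
        if deleted:
            segments.append(("".join(deleted), "A"))
        if added:
            segments.append(("".join(added), "B"))
        deleted.clear()
        added.clear()

    for text, delta in tokens:
        if delta == "A":
            deleted.append(text)
        elif delta == "B":
            added.append(text)
        else:
            flush()
            # Merge consecutive unchanged tokens into one segment.
            if segments and segments[-1][1] == "A=B":
                segments[-1] = (segments[-1][0] + text, "A=B")
            else:
                segments.append((text, "A=B"))
    flush()
    return segments
```

Applied to a token sequence in which the intervening words and spaces have already been turned into deleted/added pairs, this produces one deleted phrase followed by one added phrase, which matches the shape of the result shown above.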

4 Handling textual formatting

A common way of handling the formatting of text (things like boldness, italics and font/font-size) is to use inline formatting elements around the text which has a formatting change.  Typical elements include: <b>, <i> and <span>.  An informal view of this is often that some inline 'tags' have been added to the text.  However, from an XML perspective, what has happened is that a new element has been added with some content.  Let us consider some examples:

<p>hello world</p>  <p><b>hello world</b></p>

In this case the informal view is that we've added an open-b tag and a close-b tag.  However, from an XML perspective, what has happened is that the PCDATA 'hello world' has been removed from the <p> element and replaced by a single, new b element containing some PCDATA.  This tree-structured view is the one used by the comparison engine and followed in the delta representation.  If we were to follow this representation of deleted text and added element, we might have a simplistic representation of this result as:

<p>hello worldhello world</p>

A common objection is that showing the text with the old and new formatting becomes increasingly difficult to understand and interpret as the length of the formatted text increases.  Consider, for example, the case of a large paragraph of text which has a format change, such as a font change, and also one or two words changed.  The formatting change is represented by a new element which contains the text of the new paragraph, and the old text is also provided.  But the granularity of the small number of word changes has been lost amongst the structural change.

Our solution to these issues is to make formatting changes 'non-structural'. The technique used is to flatten the formatting elements.  For example:

<p><b>hello world</b></p>

could be represented (in a reversible manner) by:

<p><format-start><b/></format-start>hello world<format-end><b/></format-end></p>

What is important to note is that the hello world textual PCDATA is at the same level of the XML hierarchy as the original, non-bold version.  We have flattened the formatting elements to be siblings of the text.  So when these are compared, ideally but not necessarily with the word-by-word filters, the result reports that the words remain unchanged, but that some format start/end elements have been added.
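The flattening step can be sketched in Python with the standard library's ElementTree. This is an illustration only, not the XSLT format infilter: the function name `flatten` is hypothetical, and formatting elements are identified here by a fixed set of tag names, which is an assumption made to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def flatten(elem, format_tags=("b", "i")):
    """Return a copy of elem with formatting elements flattened.

    Each formatting element is replaced by a <format-start> marker, its
    own content, and a <format-end> marker, all as siblings.
    """
    out = ET.Element(elem.tag, dict(elem.attrib))
    last = None  # most recently appended child of `out`

    def add_text(text):
        # Text goes into out.text before any child, else into a tail.
        if not text:
            return
        if last is None:
            out.text = (out.text or "") + text
        else:
            last.tail = (last.tail or "") + text

    def add_elem(child):
        nonlocal last
        out.append(child)
        last = child

    def marker(name, source):
        # The marker wraps an empty copy of the formatting element.
        m = ET.Element(name)
        ET.SubElement(m, source.tag, dict(source.attrib))
        return m

    def walk(node):
        add_text(node.text)
        for child in node:
            if child.tag in format_tags:
                add_elem(marker("format-start", child))
                walk(child)
                add_elem(marker("format-end", child))
            else:
                add_elem(flatten(child, format_tags))
            add_text(child.tail)

    walk(elem)
    return out
```

For `<p><b>hello world</b></p>` the result has two children, `format-start` and `format-end`, with the text `hello world` sitting between them as a direct child of `p`, so it lines up with the text of the unformatted `<p>hello world</p>`.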

For many examples we can reconstruct the flattened formatting information in an output filter; in the case of the example above we can reconstruct the b element.  However, there are two issues to consider with the reconstruction:

When text changes we can use formatting, for example colouring text red and green, to represent the changes.  However, changes to formatting are hard to represent.  For example, in our boldening example, how do we show this result to the user?  If our 'hello world' text is shown as green and bold, how would the user be able to differentiate the change of words from the change of format?

There is another problem when trying to reconstruct a hierarchical result, referred to as the 'overlapping hierarchies' problem.  SGML attempted to solve this issue with the CONCUR mechanism, with limited success.

Because of both of these issues, when there are any problems reconstructing the hierarchical/formatted result, we take the approach of favouring and using the 'B' or new comparator input and, if necessary, losing some formatting information from the 'A' or old input.

If your requirements are to report absolutely all changes then the formatting filters may not be appropriate, as some formatting information is not retained.  However, many use-cases are primarily concerned with the text of the document and the formatting is a secondary concern; in these cases the flattening process gives results which are often easier for the user to interpret, even at the expense of occasionally losing information about how the 'A' or old text was formatted.

When the formatting filters are used in conjunction with the word-by-word filters, it is recommended that the format conversion is applied before the word conversion on input, and the format reconstruction after the word reconstruction on output (see below for the suggested filter order).  This is for performance reasons: the formatting filters are implemented in XSLT, and more CPU time and heap space would be required to process the typically larger amount of XML data needed for the word-level representation.

5 Further document-centric optimizations

5.1 Threshold filters

As some of the simple textual/word-based examples above may have demonstrated, an object with lots of intermixed added and deleted content, and with little unchanged content, is often hard to interpret.  The example above demonstrates some optimizations which are applied to sequences of words.  However, the same phenomenon occurs in larger, more structured objects such as paragraphs or table rows.  The threshold filter (dx2-threshold-outfilter.xsl) is designed to optimize these non-word situations.  Unlike some word-level filters such as WordSpaceFixup and OrphanedWordOutfilter, which take a very localized view of change, this filter takes a top-down view of an object.  It measures the amount of added and deleted textual or PCDATA content relative to the amount of unchanged PCDATA.  Changes to attributes, which in most document-centric XML are meta-data, are ignored in this analysis.  When the amount of changed content, relative to the unchanged content, reaches a certain threshold, the filter will create two copies of the object being processed: one containing the added and unchanged content, the second containing the deleted and unchanged content.
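The thresholding decision can be illustrated with a short Python sketch. It is not the XSLT threshold filter: the function name `apply_threshold`, the default value of 0.5, the use of character counts, and the measurement of changed content as a fraction of all content are assumptions chosen for illustration.

```python
def apply_threshold(segments, threshold=0.5):
    """Split an object into old and new copies when it has changed a lot.

    segments: list of (text, delta) tuples; delta is "A", "B" or "A=B".
    Assumed measure: changed characters as a fraction of all characters.
    Returns [segments] unchanged when below the threshold, otherwise
    [old_copy, new_copy].
    """
    changed = sum(len(text) for text, delta in segments if delta != "A=B")
    total = sum(len(text) for text, _ in segments)
    if total == 0 or changed / total < threshold:
        return [segments]
    # Old copy: deleted plus unchanged content, all marked as deleted.
    old = [(t, "A") for t, d in segments if d in ("A", "A=B")]
    # New copy: added plus unchanged content, all marked as added.
    new = [(t, "B") for t, d in segments if d in ("B", "A=B")]
    return [old, new]
```

A paragraph with a single changed word stays as one object with inline changes, while a paragraph that has been largely rewritten is reported as a deleted paragraph and an added paragraph.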

The application of this filter can be controlled via attributes on the instance data, which, with an appropriate in-filter, can provide per element-type or per element-instance control over this process.  Consideration should be given to the DTD or schema of the format being processed.  For example, if a table-cell element is only allowed to contain a single para or p child element, then it would be inappropriate to apply this process.  In this case an in-filter could be told to add appropriate attributes to elements with a table-cell/para XPath so that they are not considered for thresholding.

Here is an example which demonstrates the effect of thresholding (taken from a DeltaXML comparison pipeline for XHTML, processing two versions of the XML 1.0 edition 4 specification - the 14 June 2006 Proposal and the 29 September 2006 Recommendation).  Figure 1 shows a fragment of the result without thresholding and Figure 2 shows the result with this filter applied.

Figure 1: Specification fragment before threshold filter

Figure 2: Specification fragment after threshold filter

5.2 The red-green filters

At the heart of the DeltaXML comparison engine is code which executes an optimization to produce the shortest edit-path.  In this optimization matching unchanged content reduces the edit-path length.  Some processes such as orphaned words are intended to reverse the optimization.  Red-green is then used to make the results more understandable to a human being.  It works on this premise:  it is possible to move the added/deleted elements within a delta around as long as they do not move past any unchanged elements.

Consider the following example of an XML document structure:

<section deltaxml:deltaV2="A!=B">
  <para deltaxml:deltaV2="A=B">One</para>
  <para deltaxml:deltaV2="B">two</para>
  <para deltaxml:deltaV2="A">ten</para>
  <para deltaxml:deltaV2="B">three</para>
  <para deltaxml:deltaV2="B">four</para>
  <para deltaxml:deltaV2="A">twenty</para>
  <para deltaxml:deltaV2="A=B">finished</para>
  <para deltaxml:deltaV2="A">...</para>
</section>

This can be converted into:

<section deltaxml:deltaV2="A!=B">
  <para deltaxml:deltaV2="A=B">One</para>
  <para deltaxml:deltaV2="B">two</para>
  <para deltaxml:deltaV2="B">three</para>
  <para deltaxml:deltaV2="B">four</para>
  <para deltaxml:deltaV2="A">ten</para>
  <para deltaxml:deltaV2="A">twenty</para>
  <para deltaxml:deltaV2="A=B">finished</para>
  <para deltaxml:deltaV2="A">...</para>
</section>
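The reordering shown in this example can be sketched in Python. This is an illustration, not the red-green XSLT filter: the function name `red_green` and its `first` parameter are hypothetical, children are modelled as `(text, delta)` tuples, and any child that is not purely added or deleted is treated as a barrier that nothing moves past.

```python
from itertools import groupby


def red_green(children, first="B"):
    """Group added and deleted children between unchanged children.

    children: list of (text, delta) tuples; delta is "A", "B", "A=B", etc.
    Within each run of consecutive "A"/"B" children, those whose delta
    equals `first` are moved to the front; relative order is otherwise
    preserved.  Other children act as barriers and are left in place.
    """
    result = []
    for movable, run in groupby(children, key=lambda c: c[1] in ("A", "B")):
        run = list(run)
        if movable:
            # sorted() is stable, so this is a stable partition of the run.
            run = sorted(run, key=lambda c: c[1] != first)
        result.extend(run)
    return result
```

With the default `first="B"` the sketch reproduces the order shown in the converted example above; passing `first="A"` would place the deleted items ahead of the added ones instead.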

It is difficult to demonstrate the effect of this process on a small fragment of XML (partly because the WordOutfilter also provides a red-green like function at the word level).  Instead we will use the previous example from the XML 1.0 specification.  Figure 3, below, shows how the red-green filter can make the thresholded result, shown above in Figure 2, easier to interpret/understand.

Figure 3: Specification fragment after red-green filter

6 Pipeline filter ordering

The order in which the filters appear below is the order in which we suggest they be used in most document-orientated XML comparison pipelines.  The DXP file for the XHTML pipeline, included in the samples directory of the Core 5.0 release, demonstrates a similar pipeline configuration with additional format-specific input and output filters.

| Component Name | Type | Infilter/Outfilter | Optional | Notes |
|---|---|---|---|---|
| dx2-format-infilter.xsl | XSLT | Infilter | yes[1] | Flattens all elements that are flagged with a deltaxml:format='true' attribute so that they do not pollute the results as structural changes.  These attributes are typically added by a prior infilter. |
| WordInfilter[2] | Java SAX | Infilter | no | Controlled via the xml:space='preserve' and xml:word-by-word attributes |
| DeltaXML Comparator | Comparator | N/A | no | |
| OrphanedWordOutfilter[2] | Java SAX | Outfilter | yes | Parameters for orphaned word sequence length and size of surrounding modified words |
| WordSpaceFixup[2] | Java SAX | Outfilter | no | |
| WordOutfilter[2] | Java SAX | Outfilter | no | |
| dx2-threshold.xsl | XSLT | Outfilter | yes | Controlled via the deltaxml:threshold attribute typically added by an infilter.  Parameters for threshold size |
| dx2-red-green-outfilter.xsl | XSLT | Outfilter | yes | Controlled via the deltaxml:red-green attribute typically added by an infilter. |
| dx2-format-outfilter.xsl | XSLT | Outfilter | yes[1] | |

[1] Required if you want to ignore formatting changes.
[2] Located in the com.deltaxml.pipe.filters.dx2.wbw package.

7 Specific formats

So far we have talked about document-orientated content without going into the specifics of any particular format or language.  Some of the filters discussed previously, or the comparator itself, are controlled using various attributes.  These include attributes describing orderless elements and punctuation characters.  These attributes are easily added using a language-specific input filter.  For example, the xhtml input filter will add deltaxml:keys based on xhtml id attributes and prevent word-by-word processing for <pre> elements.  An output filter is often also needed to convert delta attributes and elements into something more useful to the user.  In the case of XHTML, an output filter will colour added or deleted text green and red respectively.

In some cases the format- or language-specific filters are designed to be used as a pair: there may be conversions done in an input filter which are designed to be reversed by an output filter.  The format flattening filters demonstrate this pairing of conversion filters, but further language-specific conversions could be provided in language-specific filter pairs.

7.1 xhtml

The xhtml input and output filters are wrapped around the generic word/document pipeline discussed previously.  The output filter converts any delta elements and attributes so that changes are represented in xhtml/CSS styling.  The output from this pipeline should be a valid xhtml file.

7.2 docbook

We have similar pairs of filters available to support Docbook.  Currently support is provided for Docbook version 4.  Support is also planned for Docbook v5 using the new docbook namespace.

8 Other misc filters

8.1 Normalize Space

The space normalization filter is very useful when comparing elements which are typically rendered as part of their publication or presentation, for example xhtml is rendered by a browser (it normalizes the content prior to presentation).  By normalizing the input prior to comparison any whitespace differences will be removed.

The NormalizeSpace filter implements a type of normalization that is useful in documents and which is similar to that performed by browsers.  Please refer to the specific documentation for further details.  This filter is normally positioned as the first input filter in a pipeline.

8.2 Clean House

The clean house filter is designed to clean-up or remove any remaining attributes in any deltaxml namespaces.  It would normally be used after any result formatting or conversion process and is typically the final stage of the output pipeline.
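The clean-up step can be sketched in Python with the standard library's ElementTree. This is an illustration rather than the Clean House filter itself: the function name `clean_house` is hypothetical, and it assumes that every deltaxml namespace URI begins with `http://www.deltaxml.com/ns/`, which is inferred from the single namespace URI shown earlier in this guide.

```python
import xml.etree.ElementTree as ET

# Assumption: all deltaxml namespace URIs share this prefix.
# ElementTree stores namespaced attribute names as "{uri}local".
DELTAXML_NS_PREFIX = "{http://www.deltaxml.com/ns/"


def clean_house(root):
    """Remove, in place, every attribute in a deltaxml namespace."""
    for elem in root.iter():
        # Copy the names first so the dict can be modified while looping.
        for name in list(elem.attrib):
            if name.startswith(DELTAXML_NS_PREFIX):
                del elem.attrib[name]
    return root
```

Attributes in other namespaces, or in no namespace, are left untouched, so format-specific attributes such as an xhtml id survive the clean-up.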