Guide to filters and pipelines
Contents
1. Introduction
2. Word By Word Changes
3. Worked Example
4. Handling textual formatting
5. Further document-centric optimizations
6. Pipeline filter ordering
7. Specific formats
8. Other misc filters
1 Introduction
The DeltaXML comparison engine is designed to compare well-formed XML content. It generally does not understand any particular DTD or schema, or concepts common to readers/writers of documents such as words or sentences.
We use filters to add these concepts/semantics and introduce a finer granularity of processing into the comparison process. Using filters provides more flexibility/extensibility than building these concepts into the comparison engine. Consider the following requirements:
- The definition of punctuation could vary according to the locale, for example when processing Spanish text the ¿ may be a punctuation character.
- There may be certain elements or contexts where you wish to preserve formatting or perhaps do a line-based change, for example <pre>.
- The different output requirements of the format being processed, for example xhtml vs docbook vs DITA.
2 Word By Word Changes
The Word By Word pipeline is general purpose and is a very good starting point for document centric comparison pipelines.
2.1 The word by word concept/introduction
At the centre of the pipeline is a pair of filters: one subdivides text into smaller chunks so that the comparison engine can process change at a smaller or 'word' granularity, and the other reconstitutes the text from the words. The comparison engine processes data in accordance with the XML specification, which considers a PCDATA segment to be a contiguous sequence of characters. XML itself has no concept of a sentence, a word, or punctuation; the comparison engine follows the XML specification.
The WordInfilter converts document-centric markup such as
the first fragment below into the second. Note that the result has been indented here to make it
easier to read:
<para>The quick brown fox jumps over the lazy
dog.</para>
<para>
<deltaxml:word>The</deltaxml:word>
<deltaxml:space> </deltaxml:space>
<deltaxml:word>quick</deltaxml:word>
<deltaxml:space> </deltaxml:space>
<deltaxml:word>brown</deltaxml:word>
<deltaxml:space> </deltaxml:space>
<deltaxml:word>fox</deltaxml:word>
<deltaxml:space> </deltaxml:space>
<deltaxml:word>jumps</deltaxml:word>
<deltaxml:space> </deltaxml:space>
<deltaxml:word>over</deltaxml:word>
<deltaxml:space> </deltaxml:space>
<deltaxml:word>the</deltaxml:word>
<deltaxml:space> </deltaxml:space>
<deltaxml:word>lazy</deltaxml:word>
<deltaxml:space> </deltaxml:space>
<deltaxml:word>dog</deltaxml:word>
<deltaxml:punctuation>.</deltaxml:punctuation>
</para>
The corresponding WordOutfilter is designed to reverse this
process and deal with any delta information added by the comparator. Further
details of the operation of these filters are included in their documentation.
However, some quick notes are provided here:
- The use of the space elements allows spacing changes to be recorded. Many formats/applications which normalize text prior to display (such as browsers) may not need such detailed information.
- Whitespace follows the W3C XML Recommendation for whitespace; punctuation defaults to the empty set of characters, but can be adjusted/configured; words are contiguous sequences of characters separated by space or punctuation.
- The punctuation can be adjusted on a per format/element-type or element-instance basis through the use of an appropriate input filter.
- WordInfilter usually applies this splitting process to all PCDATA in an XML file, but can be configured not to, through the use of attributes typically added with a pre-filter.
To see this process in action, please visit our feature walkthroughs, in particular, visit the textual changes example and experiment with the word-by-word setting.
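The splitting rules in the notes above can be sketched in a few lines of Python. This is only an illustration of the concept, not the WordInfilter implementation: the function names split_words and join_words are hypothetical, tokens are returned as plain tuples rather than deltaxml:word, deltaxml:space and deltaxml:punctuation elements, and each punctuation character is assumed to form its own token.

```python
import re

XML_SPACE = r" \t\r\n"  # the four whitespace characters defined by XML

def split_words(text, punctuation=""):
    """Return (kind, value) tokens where kind is 'space', 'punctuation' or 'word'."""
    punct = re.escape(punctuation)
    parts = ["(?P<space>[" + XML_SPACE + "]+)"]
    if punctuation:
        # Assumption: every punctuation character becomes a separate token.
        parts.append("(?P<punctuation>[" + punct + "])")
    # A word is any run of characters that is neither space nor punctuation.
    parts.append("(?P<word>[^" + XML_SPACE + punct + "]+)")
    return [(m.lastgroup, m.group()) for m in re.finditer("|".join(parts), text)]

def join_words(tokens):
    """Reverse split_words by concatenating the token values."""
    return "".join(value for _, value in tokens)

tokens = split_words("the lazy dog.", punctuation=".")
```

With the default (empty) punctuation set, "dog." stays a single word; once "." is declared as punctuation, the full stop becomes a token of its own, which is what allows the comparator to align "dog" independently of it.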
2.2 The unchanged space problem
When spaces are used to separate words, the comparator will identify lots of unchanged spaces in what appear, to a human reader, to be completely unrelated sentences of text. Again, the comparison engine only understands well-formed XML and as such cannot attach any more significance to a word element than to a space element. The result of comparing different strings of text after word processing will appear as follows. In this case we will use red/green background colours to signify deleted/added text (A/B respectively) in the results to make them more concise:
<para>Hello World!</para> compared with:
<para>The quick brown ...</para>
After processing with WordInfilter, comparison and
WordOutfilter, the result would appear to be:
<para>HelloThe
Worldquick!
brown ...</para>
Such a result, while having a mathematically optimal edit-path length, is
difficult for a human to interpret. The WordSpaceFixup filter is
one part of the process of converting this result from the form above into a
less mathematically optimal, but easier to interpret, form such as:
<para>Hello
World!The quick
brown ...</para>
To see a more detailed example of this and the other optimizations please see the Worked Example section below.
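A minimal sketch of the idea follows. The rule used here (an unchanged space with a changed token on both sides is replaced by a deleted copy and an added copy) is an assumption made for illustration; the conditions actually applied by WordSpaceFixup may differ. Tokens are modelled as hypothetical (delta, kind, text) tuples, where delta is "A" for deleted, "B" for added and "A=B" for unchanged.

```python
def fix_unchanged_spaces(tokens):
    """tokens: list of (delta, kind, text); delta is 'A', 'B' or 'A=B'."""
    result = []
    for i, (delta, kind, text) in enumerate(tokens):
        # Treat the start and end of the sequence as unchanged neighbours.
        before = tokens[i - 1][0] if i > 0 else "A=B"
        after = tokens[i + 1][0] if i + 1 < len(tokens) else "A=B"
        if kind == "space" and delta == "A=B" and before != "A=B" and after != "A=B":
            # Assumed rule: split the shared space into a deleted and an added copy.
            result.append(("A", kind, text))
            result.append(("B", kind, text))
        else:
            result.append((delta, kind, text))
    return result

compared = [
    ("A", "word", "Hello"), ("B", "word", "The"),
    ("A=B", "space", " "),
    ("A", "word", "World!"), ("B", "word", "quick"),
]
fixed = fix_unchanged_spaces(compared)
old_text = "".join(t for d, _, t in fixed if d == "A")
new_text = "".join(t for d, _, t in fixed if d == "B")
```

After the fixup no unchanged token separates the deleted and added words, so a later grouping step can present the old text and the new text as two contiguous runs.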
2.3 Orphaned Words
One of the functions of the DeltaXML comparison engine is to identify
unchanged data between the two input XML trees. For data-centric XML
and coarse granularity documents this approach works well. However when working
with fine or word granularity documents unrelated sentences or paragraphs of
text do often contain the same words. In the English language these words
could typically include: and, or, in, of,
the, but, to, is.
These common words may fragment the differences as shown above, or in extreme
cases cause mis-matching of the containing element such as a paragraph. The
OrphanedWordOutfilter is another optimizing filter which makes the
result easier to interpret.
An orphaned word is defined as an unchanged word surrounded by a certain (parameterizable) threshold of changed (added/deleted/modified) words. The Worked Example section below provides an example of how this filter works.
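The definition above can be sketched as follows. This is an illustration under stated assumptions rather than the OrphanedWordOutfilter itself: the parameter names max_orphan_run and min_changed are hypothetical, spaces are left out of the model, and words are (delta, text) tuples with delta "A" (deleted), "B" (added) or "A=B" (unchanged).

```python
def split_orphans(words, max_orphan_run=1, min_changed=2):
    """Replace orphaned unchanged words by a deleted copy and an added copy."""
    def changed_run(start, step):
        # Count consecutive changed words from index start, moving by step.
        count, i = 0, start
        while 0 <= i < len(words) and words[i][0] != "A=B":
            count += 1
            i += step
        return count

    result = []
    i = 0
    while i < len(words):
        if words[i][0] != "A=B":
            result.append(words[i])
            i += 1
            continue
        # Find the end of this run of unchanged words.
        j = i
        while j < len(words) and words[j][0] == "A=B":
            j += 1
        run = words[i:j]
        orphaned = (len(run) <= max_orphan_run
                    and changed_run(i - 1, -1) >= min_changed
                    and changed_run(j, 1) >= min_changed)
        if orphaned:
            result.extend(("A", text) for _, text in run)
            result.extend(("B", text) for _, text in run)
        else:
            result.extend(run)
        i = j
    return result

words = [("A", "The"), ("B", "A"), ("A=B", "quick"), ("A", "brown"), ("B", "movement")]
orphan_free = split_orphans(words)
```

Here the unchanged word "quick" has two changed words on each side, so it is reported as deleted and added; an unchanged word at the start of the sequence has no changed words before it and is left alone.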
3 Worked Example
In the following example we will compare the following two simple textual XML examples with various filters.
Input 1: <p>A pangram uses all the letters of the
alphabet and is often used to test typewriters or computer keyboards, for
example: The quick brown fox jumps over the lazy dog.</p>
Input 2: <p>A pangram uses all the letters of the
alphabet, some repeatedly, and is often used to test typewriters or computer
keyboards, for example: A quick movement of the enemy will jeopardize six
gunboats.</p>
3.1 Applying the WordInfilter
Firstly we will process these examples using the WordInfilter
and WordOutfilter. The result is very verbose, consisting of many
word and space elements, but can be represented precisely and concisely using
coloured text as below:
<p>A pangram uses all the letters of
the alphabetalphabet, some
repeatedly, and is often used to test typewriters or computer keyboards,
for
example: TheA quick brownmovement foxof jumps
over the lazyenemy dog.will
jeopardize six gunboats.</p>
3.2 Adjusting punctuation
The previous result used the default (empty) definition of punctuation. We
can see that because of this we appear, at first glance, to have changed the
word alphabet and the final full-stop appears twice, firstly with the
word dog and then with gunboats. By introducing the
deltaxml:punctuation attribute into the input, with for example the
following setting:
xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
deltaxml:punctuation=", : ."
The output is improved:
<p>A pangram uses all the letters of the
alphabet, some
repeatedly, and is often used to test
typewriters or computer keyboards, for example:
TheA quick
brownmovement
foxof jumps
over the lazyenemy
dogwill jeopardize six
gunboats.</p>
3.3 Adding the OrphanedWordOutfilter
The following result demonstrates the effect of adding the OrphanedWordOutfilter as the first output filter after the comparator. It detects orphaned unchanged words and produces a pair of added/deleted words in their place; in this example the words quick and the are duplicated in this way:
<p>A pangram uses all the letters of the
alphabet, some
repeatedly, and is often used to test
typewriters or computer keyboards, for example:
TheA
quickquick
brownmovement
foxof jumps
over thethe lazyenemy dogwill
jeopardize six gunboats.</p>
3.4 Processing unchanged spaces with WordSpaceFixup
If we now add the WordSpaceFixup outfilter to our pipeline, the result becomes:
<p>A pangram uses all the letters of the
alphabet, some
repeatedly, and is often used to test
typewriters or computer keyboards, for example:
TheA quickquick brownmovement foxof jumps
over
thethe lazyenemy dogwill
jeopardize six gunboats.</p>
As with the previous filter, the benefit of this processing becomes clear when we group the added and deleted words/spaces together, in this case with the WordOutfilter which follows below.
3.5 Converting back into text with the WordOutfilter
The next output filter is WordOutfilter which, as well as converting the word, punctuation and space elements back into text, also groups deleted and added sections (between any unchanged words) back together.
<p>A pangram uses all the letters of the
alphabet
,
some
repeatedly, and is often used to test
typewriters or computer keyboards, for example: The quick
brown fox jumps over the lazy dogA quick movement of
the enemy will jeopardize six
gunboats.</p>
4 Handling textual formatting
A common way of handling the formatting of text (things like boldness,
italics and font/font-size) is to use inline formatting elements around the text
which has a formatting change. Typical elements include:
<b>, <i> and <span>. An
informal view of this is often that some inline 'tags' have been added to the
text. However, from an XML perspective, what has happened is that a new element
has been added with some content. Let us consider some examples:
<p>hello world</p> <p><b>hello world</b></p>
In this case the informal view is that we've added an open-b tag and a close-b tag. However, from an XML perspective what has happened is that the PCDATA 'hello world' has been removed from the <p> element and replaced by a single, new b element containing some PCDATA. This tree-structured view is the one used by the comparison engine and followed in the delta representation. If we were to follow this representation of deleted text and added element, we might have a simplistic representation of this result as:
<p>hello worldhello world</p>
A common objection is that showing the text with the old and new formatting becomes increasingly difficult to understand and interpret as the length of the formatted text increases. Consider for example the case of a large paragraph of text which has a format change, such as a font change, and also one or two words changed. The formatting change is represented by a new element which contains the text of the new paragraph, and the old text is also provided. But the granularity of the small number of word changes has been lost amongst the structural change.
Our solution to these issues is to make formatting changes 'non-structural'. The technique used is to flatten the formatting elements. For example:
<p><b>hello world</b></p>
could be represented (in a reversible manner) by:
<p><format-start><b/></format-start>hello world<format-end><b/></format-end></p>
What is important to note is that the hello world textual PCDATA is at the same level of the XML hierarchy as the original, non-bold version. We have flattened the formatting elements to be siblings of the text. So when these are compared, ideally with the word-by-word filters but not necessarily so, the result reports that the words remain unchanged, but that some format start/end elements have been added.
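The flattening step can be sketched with Python's standard xml.dom.minidom module. This is an illustration only: the real filter is an XSLT stylesheet that selects elements flagged with an attribute, whereas this hypothetical flatten function simply takes a set of tag names to treat as formatting.

```python
from xml.dom import minidom

def flatten(parent, format_tags):
    """Replace formatting elements below parent by start/end marker siblings."""
    doc = parent.ownerDocument
    for child in list(parent.childNodes):
        if child.nodeType != child.ELEMENT_NODE:
            continue
        flatten(child, format_tags)  # handle nested formatting first
        if child.tagName not in format_tags:
            continue
        start = doc.createElement("format-start")
        start.appendChild(child.cloneNode(False))  # empty copy, e.g. <b/>
        end = doc.createElement("format-end")
        end.appendChild(child.cloneNode(False))
        parent.insertBefore(start, child)
        # Move the content up so that it becomes a sibling of the markers.
        while child.firstChild is not None:
            parent.insertBefore(child.firstChild, child)
        parent.insertBefore(end, child)
        parent.removeChild(child)

doc = minidom.parseString("<p><b>hello world</b></p>")
flatten(doc.documentElement, {"b"})
flat = doc.documentElement.toxml()
```

Because the empty copy placed in each marker keeps the original element name and attributes, an output filter has the information it needs to rebuild the element around the text between the two markers.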
For many examples we can reconstruct the flattened formatting information in an output filter; in the case of the example above we can reconstruct the b element. However, there are two issues to consider with the reconstruction:
- how to represent the formatting changes
- how to deal with overlapping hierarchies
When text changes we can use formatting, for example colouring text red and green, to represent the changes. However, changes to formatting are hard to represent: in our bold example, how do we show this result to the user? If our 'hello world' text is shown as green and bold, how would the user be able to differentiate a change of words from a change of format? There is another problem when trying to reconstruct a hierarchical result, referred to as the 'overlapping hierarchies' problem. SGML attempted to solve this issue with the CONCUR mechanism, with limited success. Because of both of these issues, when there are any problems reconstructing the hierarchical/formatted result, we take the approach of favouring the 'B' or new comparator input and, if necessary, losing some formatting information from the 'A' or old input.
If your requirements are to report absolutely all changes then the formatting filters may not be appropriate, as some formatting information is not retained. However, many use-cases are primarily concerned with the text of the document and the formatting is a secondary concern; in these cases the flattening process gives results which are often easier for the user to interpret, even at the expense of occasionally losing information about how the 'A' or old text was formatted.
When the formatting filters are used in conjunction with the word-by-word filters, it is recommended that the format flattening is applied before the word conversion on input, and the format reconstruction after the word reconstitution on output (see below for the suggested filter order). This is for performance reasons: the formatting filters are implemented in XSLT, and more CPU time and heap space would be required to process the typically larger amount of XML data needed for the word-level representation.
5 Further document-centric optimizations
5.1 Threshold filters
As some of the simple textual/word-based examples above may have
demonstrated, an object with lots of intermixed added and deleted content, and
with little unchanged content, is often hard to interpret. The example above
demonstrates some optimizations which are applied to sequences of words.
However the same situation/phenomenon is true in larger, more structured,
objects such as paragraphs or table-rows. The threshold filter
(dx2-threshold-outfilter.xsl) is designed to optimize these
non-word situations. Unlike some word-level filters such as
WordSpaceFixup and OrphanedWordOutfilter, which take a
very localized view of change, this filter takes a top-down view of an object.
It measures the amount of added and deleted textual or PCDATA content relative
to the amount of unchanged PCDATA. Changes to attributes, which in most
document-centric XML are meta-data, are ignored in this analysis. When the
amount of changed content, relative to the unchanged content, reaches a certain
threshold, the filter will create two copies of the object being processed: one
containing the added and unchanged content, the second containing the deleted
and unchanged.
The application of this filter can be controlled via attributes on the
instance data, which with an appropriate in-filter can provide per element-type
or per element-instance control over this process. Consideration should be
given to the DTD or schema of the format being processed, for example, if a
table-cell element is only allowed to contain a single para
or p child element, then it would be inappropriate to apply this
process. In this case an in-filter could be told to add appropriate attributes
to elements with a table-cell/para XPath so that they are not
considered for thresholding.
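The decision made by a threshold filter can be sketched as below. The exact measure and default used by the real stylesheet are not stated here, so both are assumptions in this sketch: the changed fraction is taken as changed characters divided by all characters, and the threshold of 0.5 is arbitrary. Content is modelled as hypothetical (delta, text) segments for one object, with delta "A" (deleted), "B" (added) or "A=B" (unchanged).

```python
def apply_threshold(segments, threshold=0.5):
    """Return the object unchanged, or as two whole copies when heavily changed."""
    changed = sum(len(text) for delta, text in segments if delta != "A=B")
    total = sum(len(text) for _, text in segments)
    if total == 0 or changed / total < threshold:
        return [segments]  # keep the fine-grained, intermixed result
    # One copy holds the added and unchanged content, the other the deleted and unchanged.
    new_version = "".join(t for d, t in segments if d in ("B", "A=B"))
    old_version = "".join(t for d, t in segments if d in ("A", "A=B"))
    return [[("B", new_version)], [("A", old_version)]]

heavily_changed = [("A=B", "A "), ("A", "quick brown fox"), ("B", "sudden movement")]
copies = apply_threshold(heavily_changed)
```

Attributes do not appear in this model at all, which mirrors the statement above that attribute changes are ignored in the analysis.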
Here is an example which demonstrates the effect of thresholding (taken from a DeltaXML comparison pipeline for XHTML, processing two versions of the XML 1.0 edition 4 specification - the 14 June 2006 Proposal and the 29 September 2006 Recommendation). Figure 1 shows a fragment of the result without thresholding and Figure 2 shows the result with this filter applied.
Figure 1: Specification fragment before threshold filter
Figure 2: Specification fragment after threshold filter
5.2 The red-green filters
At the heart of the DeltaXML comparison engine is code which executes an optimization to produce the shortest edit-path. In this optimization matching unchanged content reduces the edit-path length. Some processes such as orphaned words are intended to reverse the optimization. Red-green is then used to make the results more understandable to a human being. It works on this premise: it is possible to move the added/deleted elements within a delta around as long as they do not move past any unchanged elements.
Consider the following example of an XML document structure:
<section deltaxml:deltaV2="A!=B">
  <para deltaxml:deltaV2="A=B">One</para>
  <para deltaxml:deltaV2="B">two</para>
  <para deltaxml:deltaV2="A">ten</para>
  <para deltaxml:deltaV2="B">three</para>
  <para deltaxml:deltaV2="B">four</para>
  <para deltaxml:deltaV2="A">twenty</para>
  <para deltaxml:deltaV2="A=B">finished</para>
  <para deltaxml:deltaV2="A">...</para>
</section>
This can be converted into:
<section deltaxml:deltaV2="A!=B">
  <para deltaxml:deltaV2="A=B">One</para>
  <para deltaxml:deltaV2="B">two</para>
  <para deltaxml:deltaV2="B">three</para>
  <para deltaxml:deltaV2="B">four</para>
  <para deltaxml:deltaV2="A">ten</para>
  <para deltaxml:deltaV2="A">twenty</para>
  <para deltaxml:deltaV2="A=B">finished</para>
  <para deltaxml:deltaV2="A">...</para>
</section>
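The reordering in this example can be sketched as a stable partition within each run of changed siblings. This is an illustration of the premise only, not the stylesheet: sibling elements are modelled as hypothetical (delta, text) pairs, and the choice to place added ("B") elements before deleted ("A") ones simply follows the example above.

```python
from itertools import groupby

def red_green(children):
    """Group added and deleted siblings without moving them past unchanged ones."""
    result = []
    for unchanged, run in groupby(children, key=lambda c: c[0] == "A=B"):
        run = list(run)
        if unchanged:
            result.extend(run)
        else:
            # Stable partition of the changed run: additions, then deletions.
            result.extend(c for c in run if c[0] == "B")
            result.extend(c for c in run if c[0] == "A")
    return result

section = [("A=B", "One"), ("B", "two"), ("A", "ten"), ("B", "three"),
           ("B", "four"), ("A", "twenty"), ("A=B", "finished"), ("A", "...")]
grouped = red_green(section)
```

Unchanged elements act as fixed points: the final deleted element stays after "finished" because moving it would take it past an unchanged sibling.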
It is difficult to demonstrate the effect of this process on a small fragment
of XML (partly because the WordOutfilter also provides a red-green
like function at the word level). Instead we will use the previous example from
the XML 1.0 specification. Figure 3, below, shows how the red-green filter
can make the thresholded result, shown above in Figure 2, easier to
interpret/understand.
Figure 3: Specification fragment after red-green filter
6 Pipeline filter ordering
The order in which the filters appear below is the order in which we suggest they be used in most document-orientated XML comparison pipelines. The DXP file for the XHTML pipeline, included in the samples directory of the Core 5.0 release, demonstrates a similar pipeline configuration with additional format-specific input and output filters.
| Component Name | Type | Infilter/Outfilter | Optional | Notes |
|---|---|---|---|---|
| dx2-format-infilter.xsl | XSLT | Infilter | yes[1] | Flattens all elements that are flagged with a deltaxml:format='true' attribute so that they do not pollute the results as structural changes. These attributes are typically added by a prior infilter. |
| WordInfilter[2] | Java SAX | Infilter | no | Controlled via the |
| DeltaXML Comparator | Comparator | N/A | no | |
| OrphanedWordOutfilter[2] | Java SAX | Outfilter | yes | Parameters for orphaned word sequence length and size of surrounding modified words |
| WordSpaceFixup[2] | Java SAX | Outfilter | no | |
| WordOutfilter[2] | Java SAX | Outfilter | no | |
| dx2-threshold.xsl | XSLT | Outfilter | yes | Controlled via the |
| dx2-red-green-outfilter.xsl | XSLT | Outfilter | yes | Controlled via the |
| dx2-format-outfilter.xsl | XSLT | Outfilter | yes[1] | |
[1] Required if you want to ignore formatting changes.
[2] Located in the com.deltaxml.pipe.filters.dx2.wbw package.
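The ordering rules in the table can be summarised in a short sketch. It is not a DXP file and does not run any filters; the function document_pipeline and its flags are hypothetical, and the component names are copied from the table. It only shows that the optional stages slot into fixed positions and that the two format filters are enabled or disabled together.

```python
def document_pipeline(ignore_formatting=True, orphaned_words=True,
                      threshold=True, red_green=True):
    """Return the suggested component order for the chosen optional stages."""
    infilters = []
    if ignore_formatting:
        infilters.append("dx2-format-infilter.xsl")
    infilters.append("WordInfilter")

    outfilters = []
    if orphaned_words:
        outfilters.append("OrphanedWordOutfilter")
    outfilters += ["WordSpaceFixup", "WordOutfilter"]
    if threshold:
        outfilters.append("dx2-threshold.xsl")
    if red_green:
        outfilters.append("dx2-red-green-outfilter.xsl")
    if ignore_formatting:
        # The format outfilter reverses the flattening done by the format infilter.
        outfilters.append("dx2-format-outfilter.xsl")
    return infilters + ["DeltaXML Comparator"] + outfilters

minimal = document_pipeline(ignore_formatting=False, orphaned_words=False,
                            threshold=False, red_green=False)
```

With every optional stage switched off, only the four components marked as not optional remain, in the same relative order.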
7 Specific formats
So far we have talked about document-orientated content without going into the specifics of any particular format or language. Some of the filters discussed previously, or the comparator itself, are controlled using various attributes. These include attributes describing orderless elements and punctuation characters. These attributes are easily added using a language-specific input filter. For example, the xhtml input filter will add deltaxml:keys based on xhtml id attributes and prevent word-by-word processing for <pre> elements. An output filter is often also needed to convert delta attributes and elements into something more useful to the user. In the case of XHTML, an output filter will colour added or deleted text green and red respectively. In some cases the format or language-specific filters are designed to be used as a pair: there may be conversions done in an input filter which are designed to be reversed by an output filter. The format flattening filters demonstrate this pairing of conversion filters, but further language-specific conversions could be provided in language-specific filter pairs.
7.1 xhtml
The xhtml input and output filters are wrapped around the generic word/document pipeline discussed previously. The output filter converts any delta elements and attributes so that changes are represented in xhtml/CSS styling. The output from this pipeline should be a valid xhtml file.
7.2 docbook
We have similar pairs of filters available to support Docbook. Currently support is provided for Docbook version 4. Support is also planned for Docbook v5 using the new docbook namespace.
8 Other misc filters
8.1 Normalize Space
The space normalization filter is very useful when comparing elements which are typically rendered as part of their publication or presentation, for example xhtml is rendered by a browser (it normalizes the content prior to presentation). By normalizing the input prior to comparison any whitespace differences will be removed.
The NormalizeSpace filter implements a type of normalization that is useful in documents and which is similar to that performed by browsers; please refer to the specific documentation for further details. This filter is normally positioned as the first input filter in a pipeline.
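As a rough illustration of why normalization removes whitespace-only differences, the following sketch collapses runs of XML whitespace in the style of the XPath normalize-space() function. It is an assumption that this approximates the filter; the precise rules of NormalizeSpace (for example around inline elements) are described in its own documentation and are not reproduced here.

```python
import re

def normalize_space(text):
    # Collapse each run of XML whitespace to a single space, then trim the ends.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

a = normalize_space("The  quick\n   brown\tfox ")
b = normalize_space("The quick brown fox")
```

Both inputs normalize to the same string, so a comparison made after this step would report no change between them.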
8.2 Clean House
The clean house filter is designed to clean-up or remove any remaining attributes in any deltaxml namespaces. It would normally be used after any result formatting or conversion process and is typically the final stage of the output pipeline.
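A minimal sketch of such a clean-up, using Python's standard xml.etree.ElementTree, is given below. The function clean_house is hypothetical, and matching on the common URI prefix http://www.deltaxml.com/ns/ is an assumption about how "any deltaxml namespace" might be recognised.

```python
import xml.etree.ElementTree as ET

# Assumption: all deltaxml namespace URIs share this prefix.
DELTAXML_NS_PREFIX = "{http://www.deltaxml.com/ns/"

def clean_house(root):
    """Remove every attribute whose namespace starts with the deltaxml prefix."""
    for element in root.iter():
        for name in list(element.attrib):
            # ElementTree stores namespaced attribute names as '{uri}local'.
            if name.startswith(DELTAXML_NS_PREFIX):
                del element.attrib[name]
    return root

root = ET.fromstring(
    '<p xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" '
    'deltaxml:punctuation=", : ." class="intro">text</p>')
clean_house(root)
```

Attributes outside the deltaxml namespaces, such as class in this example, are left untouched.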


