Table of Contents
- 1 Introduction
- 2 Inputs
- 3 Basic Pipeline Definition
- 4 Whitespace changes
- 5 Changes-only vs full delta
- 6 Highlighting changes
- 7 Word by word comparison
- 8 Using keys to manage alignment
- 9 Formatting Changes
- 10 Gathering adds and deletes together
- 11 Preserving the doctype
- 12 Preserving Comments
- 13 Summary
- 14 Running the sample pipeline
Pipeline Construction Tutorial
1 Introduction
This tutorial is intended to guide you through the process of creating a pipeline from scratch. The pipeline created in the tutorial is intended to compare XHTML files, producing an XHTML file as output that includes changes highlighted by colored styles. While it is not necessarily a complete pipeline for processing any XHTML input, the idea is to show common issues that need to be solved by using configuration settings or existing input and output filters as well as giving an idea of how to start writing your own filters to solve particular problems.
2 Inputs
Example 1: input 1, a list of news headlines and summaries from Google News (document-v1.xhtml in the sample directory)
<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>DeltaXML Pipeline Tutorial Sample</title>
</head>
<body>
<h1 id="title">World News Headlines</h1>
<!-- headlines from news.google.com on Tuesday 19th October 2010 -->
<h2 id="chechnya-headline">Militants storm Chechen parliament in deadly standoff</h2>
<p id="chechnya-summary">GROZNY, Russia - Militants stormed parliament in Russia's conflict-torn Chechnya Tuesday,
seizing deputies and gunning down guards, before being killed in a bloody standoff with
security forces.</p>
<h2 id="flights-headline">Flights expected to be canceled in France amid strikes</h2>
<p id="flights-summary">A striker joins the blockade of a fuel storage depots to protest against pension reform
on October 18 in Frontignan, France.</p>
<h2 id="megi-headline">Typhoon Megi Kills 10 in Philippines, Heads for China</h2>
<p id="megi-summary">Typhoon Megi, which left at least 10 people dead as it crossed the Philippines yesterday,
strengthened as it churned over the South China Sea on a path to Hong Kong, the US Navy
Joint Typhoon ...</p>
</body>
</html>
Example 2: the second input is the same document with some modifications (document-v2.xhtml in the sample directory)
<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>DeltaXML Pipeline Tutorial Sample</title>
</head>
<body>
<h1 id="title">World News Headlines</h1>
<!-- headlines from http://news.google.com on Tuesday 19th October 2010 -->
<h2 id="chechnya-headline">Militants storm Chechen parliament in deadly standoff</h2>
<p id="chechnya-summary">Militants stormed parliament in Russia's conflict-torn Chechnya on Tuesday,
seizing deputies and gunning down guards before being killed in a bloody standoff with
security forces.</p>
<h2 id="nobel-headline">Nobel winner's brother refused China prison visit </h2>
<p id="nobel-summary">Chinese authorities refused to allow the brother of jailed Nobel Peace Prize winner
Liu Xiaobo to visit him in prison in apparent violation of the rules, a Hong Kong-based rights group said Tuesday.</p>
<h2 id="megi-headline">Typhoon Megi Kills 10 in Philippines, Heads for China</h2>
<p id="megi-summary">Typhoon Megi, which left at least ten people dead as it crossed the Philippines yesterday,
strengthened as it churned over the South China Sea on a path to Hong Kong, the US Navy
<em>Joint Typhoon Warning Center</em> said.</p>
</body>
</html>
3 Basic Pipeline Definition
The simplest pipeline that can be defined in a DXP file is declared using the following XML:
Example 3: the simplest pipeline definition
<comparatorPipeline description="Pipeline Tutorial" id="tutorial"/>
When run, this pipeline will cause the inputs to be compared as they are, using an unconfigured PipelinedComparator. The result is the direct output from the comparison.
The following example shows the output from the command-line interface when the input files are compared with this pipeline:
Example 4: command-line output from the pipeline
com.deltaxml.api.DeltaXMLProcessingException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
This is not the output we were expecting. The comparison has thrown an Exception and did not produce any sensible output. Fortunately, the Exception message gives us useful information about what has happened: the DTD pointed to in the doctype of the inputs was requested during parsing and the HTTP request returned a 503 response code. In this case, it is because the W3C website, in order to reduce traffic to its servers, does not allow access to the DTDs in this way.
There are two ways to solve this problem: 1) use a catalog resolver and a catalog file to point to local copies of the DTD files to be used in parsing; 2) configure the parser so that it doesn't request the DTD in the first place. Solution 1 may be preferable if you wish to validate the inputs as they are parsed. This solution is covered in the sample 'How to use a catalog resolver with DeltaXML'. For the purposes of this tutorial, we will use the second solution and configure the parser.
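For comparison, solution 1 relies on an OASIS XML catalog that maps the remote DTD's system identifier to a local copy. The following is a minimal sketch only: the local path dtd/xhtml1-transitional.dtd is a hypothetical location, and how the catalog file is registered with the resolver is covered in the sample mentioned above rather than here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of an OASIS XML catalog; the uri value is a hypothetical local path -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- map the system identifier used in the sample inputs to a local copy of the DTD -->
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
          uri="dtd/xhtml1-transitional.dtd"/>
</catalog>
```

The XHTML 1.0 DTDs reference three entity files (xhtml-lat1.ent, xhtml-symbol.ent and xhtml-special.ent) by relative system identifiers, so local copies of those would be expected to sit alongside the local DTD.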
The Apache Xerces parser has a feature setting that allows the loading of external DTD files to be turned off. N.B. this feature cannot be turned off if validation is turned on. The following example shows this parser feature being added to the DXP file:
Example 5: setting the parser feature that stops the DTD from being loaded
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
</comparatorPipeline>
The comparison now runs without throwing the Exception and produces some output.
4 Whitespace changes
The following example shows an extract from the output of the pipeline:
Example 6: output showing changes to whitespace
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">
</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">
</deltaxml:text>
</deltaxml:textGroup>
This section of the result file shows that there have been changes to whitespace in the two input files. In the case of XHTML, whitespace like this is not significant as it will be rendered as a single space (or none at all, depending on its location) when the page is viewed in a browser. It would be better if we could process the input files to remove any unnecessary whitespace before comparing them.
The NormalizeSpace filter, included in the DeltaXML release, is intended to do just that. Inter-element whitespace is removed completely and multiple whitespace characters within PCDATA are converted into a single space character. The following example shows how this filter is added to the pipeline:
Example 7: adding the NormalizeSpace filter to the pipeline
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
</inputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
</comparatorPipeline>
5 Changes-only vs full delta
Now that insignificant whitespace changes have been removed, it is easier to see the changes we are interested in within the result file. One thing about the following example is immediately noticeable: where there are no changes to the text (e.g. the first two heading elements), the element content has been removed. This is not useful considering we wish to eventually display the result as an XHTML page. The reason for this is that DeltaXML is in changes-only mode by default. This mode produces results that remove the sub-trees of unchanged elements and only show the child elements where change has occurred. This is useful for large data files where too much context makes it difficult to see changes but is not useful for document-based formats, such as XHTML, where the end result will be the full document with changes.
Example 8: part of the result file showing the result of changes-only mode
<html xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" xmlns="http://www.w3.org/1999/xhtml" deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="changes-only">
  <head deltaxml:deltaV2="A=B">
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  </head>
  <body deltaxml:deltaV2="A!=B">
    <h1 deltaxml:deltaV2="A=B"/>
    <h2 deltaxml:deltaV2="A=B"/>
    <p deltaxml:deltaV2="A!=B">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
        <deltaxml:text deltaxml:deltaV2="A">GROZNY, Russia - Militants stormed parliament in Russia's conflict-torn Chechnya Tuesday, seizing deputies and gunning down guards, before being killed in a bloody standoff with security forces. </deltaxml:text>
        <deltaxml:text deltaxml:deltaV2="B">Militants stormed parliament in Russia's conflict-torn Chechnya on Tuesday, seizing deputies and gunning down guards before being killed in a bloody standoff with security forces. </deltaxml:text>
      </deltaxml:textGroup>
    </p>
    ...
  </body>
</html>
In order to change this, we need to configure DeltaXML to produce a full-context result using a comparator feature. The following example shows how this is added to the pipeline:
Example 9: configuring the comparison to produce a full context result
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
</inputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
<comparatorFeatures>
<feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
</comparatorFeatures>
</comparatorPipeline>
With this setting the result now includes the full content of all unchanged elements, as can be seen in the following example:
Example 10: the full context result
<html xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" xmlns="http://www.w3.org/1999/xhtml" deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
  <head deltaxml:deltaV2="A=B">
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>DeltaXML Pipeline Tutorial Sample</title>
  </head>
  <body deltaxml:deltaV2="A!=B">
    <h1 deltaxml:deltaV2="A=B" id="title">World News Headlines</h1>
    <h2 deltaxml:deltaV2="A=B" id="chechnya-headline">Militants storm Chechen parliament in deadly standoff</h2>
    <p deltaxml:deltaV2="A!=B" id="chechnya-summary">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
        <deltaxml:text deltaxml:deltaV2="A">GROZNY, Russia - Militants stormed parliament in Russia's conflict-torn Chechnya Tuesday, seizing deputies and gunning down guards, before being killed in a bloody standoff with security forces. </deltaxml:text>
        <deltaxml:text deltaxml:deltaV2="B">Militants stormed parliament in Russia's conflict-torn Chechnya on Tuesday, seizing deputies and gunning down guards before being killed in a bloody standoff with security forces. </deltaxml:text>
      </deltaxml:textGroup>
    </p>
    ...
  </body>
</html>
6 Highlighting changes
The aim of this pipeline is to produce a result XHTML file that has changes
highlighted using styling. At the moment, the result file contains changes
marked using deltaxml:textGroup elements and
deltaxml:deltaV2 attributes. The next step is to write an output
filter that will convert these changes into something that makes sense in XHTML.
One approach is to create two CSS styles that can be applied to content
contained within spans that have a particular class attribute. An output filter
that wraps all added or deleted content in a span with the appropriate class
attribute is fairly simple to write. It could also add the CSS to the header of
the result file. An example template to wrap deleted text is shown below:
Example 11: an XSLT template to wrap deleted text in a span with an appropriate class attribute
<!-- wrap all deleted text in a span with class attribute value 'deltaxml-old' -->
<xsl:template match="text()[ancestor::*[@deltaxml:deltaV2='A']]">
<xsl:element name="span" namespace="http://www.w3.org/1999/xhtml">
<xsl:attribute name="class" namespace="">deltaxml-old</xsl:attribute>
<xsl:value-of select="."/>
</xsl:element>
</xsl:template>
This matches all text that appears under a deleted element (or the deleted
part of a deltaxml:textGroup construct) and wraps it in a
span with class="deltaxml-old". A similar template can
be used for added text. A template match for the head element can
be used to add the CSS style information into the result.
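The two additional templates described above might look something like the following sketch. It is not the content of the shipped xhtml-outfilter.xsl: the class name deltaxml-new, the colour values, and the choice of red for deleted and green for added text are assumptions, and the sketch presumes that the xhtml prefix is bound to http://www.w3.org/1999/xhtml and that an identity-style template elsewhere in the stylesheet copies everything else.

```xml
<!-- wrap all added text in a span with class attribute value 'deltaxml-new'
     (the class name is an assumption, mirroring 'deltaxml-old') -->
<xsl:template match="text()[ancestor::*[@deltaxml:deltaV2='B']]">
  <xsl:element name="span" namespace="http://www.w3.org/1999/xhtml">
    <xsl:attribute name="class" namespace="">deltaxml-new</xsl:attribute>
    <xsl:value-of select="."/>
  </xsl:element>
</xsl:template>

<!-- copy the head element and append a style element holding the CSS rules
     (the colour values are illustrative only) -->
<xsl:template match="xhtml:head">
  <xsl:copy>
    <xsl:apply-templates select="@*, node()"/>
    <xsl:element name="style" namespace="http://www.w3.org/1999/xhtml">
      <xsl:attribute name="type">text/css</xsl:attribute>
      <xsl:text>.deltaxml-old { background-color: #ff9999; } .deltaxml-new { background-color: #99ff99; }</xsl:text>
    </xsl:element>
  </xsl:copy>
</xsl:template>
```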
The only other processing that needs to be done in this filter is to decide
which attributes to output where an element has attribute changes. Changed
attributes are all held within a deltaxml:attributes element. A
simple approach is to use a moded copy filter, dx2-extract-version-moded.xsl
(included in DeltaXML) to copy out attributes present in the latest version of
the document. This is achieved by importing the copy filter and using the
following template:
Example 12: a template to output the latest version of all attributes
<!-- If any attributes have been changed, output the new version -->
<xsl:template match="deltaxml:attributes">
  <xsl:apply-templates select="." mode="B"/>
</xsl:template>
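As a rough sketch of how the import and this template could sit together in one stylesheet: the href value below is an assumption about where dx2-extract-version-moded.xsl can be resolved from, and the real xhtml-outfilter.xsl in the sample directory may be organised differently.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">

  <!-- xsl:import must appear before any other top-level element;
       the href is an assumed location for the moded copy filter -->
  <xsl:import href="dx2-extract-version-moded.xsl"/>

  <!-- If any attributes have been changed, output the new version -->
  <xsl:template match="deltaxml:attributes">
    <xsl:apply-templates select="." mode="B"/>
  </xsl:template>

</xsl:stylesheet>
```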
The full filter can be found in the sample directory (xhtml-outfilter.xsl). In order for the output file to be a valid XHTML document, all namespaces, attributes and elements added by DeltaXML need to be removed. The xhtml outfilter will remove a certain number of these but to ensure that they are all removed, the clean-house.xsl filter should be used. The following example shows how these filters are added into the pipeline:
Example 13: adding the xhtml and clean house outfilters to the pipeline.
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
</inputFilters>
<outputFilters>
<filter>
<file path="xhtml-outfilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/clean-house.xsl"/>
</filter>
</outputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
<comparatorFeatures>
<feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
</comparatorFeatures>
</comparatorPipeline>
The output file is now a valid XHTML document and contains changes highlighted with red and green backgrounds. The following is part of the output as rendered in a browser:
Example 14: part of the rendered result
[deleted: GROZNY, Russia - Militants stormed parliament in Russia's conflict-torn Chechnya Tuesday, seizing deputies and gunning down guards, before being killed in a bloody standoff with security forces.] [added: Militants stormed parliament in Russia's conflict-torn Chechnya on Tuesday, seizing deputies and gunning down guards before being killed in a bloody standoff with security forces.]
7 Word by word comparison
As the example above shows, the granularity of change is limited to whole blocks of text. The real changes are much smaller than the single large change shown, but achieving that finer granularity requires some more filters. These are the WordByWord filters, and their use is explained fully in the sample 'How to Use Word by Word Text Comparison'. The following example shows the filters being added to the pipeline:
Example 15: word by word filters being added to the pipeline
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordInfilter"/>
</filter>
</inputFilters>
<outputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.OrphanedWordOutfilter"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordSpaceFixup"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter"/>
</filter>
<filter>
<file path="xhtml-outfilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/clean-house.xsl"/>
</filter>
</outputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
<comparatorFeatures>
<feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
</comparatorFeatures>
</comparatorPipeline>
The example change shown above now looks like this:
Example 16: word by word in action
[deleted: GROZNY, Russia - ]Militants stormed parliament in Russia's conflict-torn Chechnya [added: on ]Tuesday, seizing deputies and gunning down [deleted: guards,][added: guards] before being killed in a bloody standoff with security forces.
This is still not ideal as the removal of a single comma character after the word 'guards' causes too much change to be shown. We need to define what characters are treated as punctuation as described in 'How to Use Word by Word Text Comparison'. The following template in an input filter will achieve this for us:
Example 17: a template to add a punctuation definition to the input body element
<xsl:template match="xhtml:body">
<xsl:copy>
<xsl:attribute name="deltaxml:punctuation" select="'. , ; ! ?'"/>
<xsl:apply-templates select="@*, node()"/>
</xsl:copy>
</xsl:template>
This instructs the WordByWord filters to treat period, comma, semi-colon, exclamation mark and question mark characters as punctuation. The filter is added to the pipeline as follows:
Example 18: adding the xhtml infilter to the pipeline
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
<filter>
<file path="xhtml-infilter.xsl" relBase="dxp"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordInfilter"/>
</filter>
</inputFilters>
<outputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.OrphanedWordOutfilter"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordSpaceFixup"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter"/>
</filter>
<filter>
<file path="xhtml-outfilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/clean-house.xsl"/>
</filter>
</outputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
<comparatorFeatures>
<feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
</comparatorFeatures>
</comparatorPipeline>
This has the desired effect:
Example 19: punctuation definition taking effect
[deleted: GROZNY, Russia - ]Militants stormed parliament in Russia's conflict-torn Chechnya [added: on ]Tuesday, seizing deputies and gunning down guards[deleted: ,] before being killed in a bloody standoff with security forces.
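The template in Example 17 is only a fragment. A minimal sketch of a complete stylesheet that could hold it is shown below, assuming an XSLT 2.0 processor (the select attribute on xsl:attribute and the "@*, node()" sequence both require 2.0). The deltaxml namespace URI is the one that appears in the result files earlier in this tutorial; the identity template is an assumption about how unmatched content gets copied, and the actual xhtml-infilter.xsl in the sample directory may differ.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">

  <!-- identity template (assumed): copy anything not matched more specifically -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- add the punctuation definition to the body element (Example 17) -->
  <xsl:template match="xhtml:body">
    <xsl:copy>
      <xsl:attribute name="deltaxml:punctuation" select="'. , ; ! ?'"/>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the input documents place their elements in the XHTML namespace, the match pattern has to use a prefix bound to that namespace; an unprefixed match="body" would not match in a stylesheet like this one.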
8 Using keys to manage alignment
Another part of the result looks like this:
Example 20: mismatched headings in the result
[deleted: Flights expected to be canceled in France amid strikes][added: Nobel winner's brother refused China prison visit]
This has been caused by h2 elements in the input documents being matched with each other. The text content of these elements is completely different and so, even with word by word processing, all of the text is shown as changed. The h2 elements in the input look like this:
Example 21: the matched h2 elements from the two inputs
<h2 id="flights-headline">Flights expected to be canceled in France amid strikes</h2>
<h2 id="nobel-headline">Nobel winner's brother refused China prison visit </h2>
These elements represent different headlines, the first being deleted as part
of the edit and the second being added. Matching them together and representing
a change of content in the result is not the correct behaviour in this case.
What is needed is a way to inform DeltaXML that these are different elements and
should not be matched. The solution is to use the deltaxml:key
attribute. An element with a deltaxml:key attribute is only ever
matched with a corresponding element with the same name and key value. If we
were to use the id attributes on the h2 elements in the inputs as a key value,
they would not be matched during the comparison and the result would display a
deletion and addition of the whole element as would be expected.
Key attributes can be added by a new template in the existing xhtml-infilter.xsl file:
Example 22: XSLT template to copy id attribute values to keys
<!-- copy the value of all id attributes to deltaxml:key attributes -->
<xsl:template match="@id">
  <xsl:copy/>
  <xsl:attribute name="deltaxml:key" select="."/>
</xsl:template>
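To illustrate the effect, and assuming the template is reached through an identity-style copy of the surrounding elements, the two h2 elements from Example 21 would be expected to reach the comparator looking roughly like this. The deltaxml prefix is assumed to be declared on an ancestor element, and the attribute order shown is not significant.

```xml
<!-- illustrative only: each id value has been copied into a deltaxml:key attribute -->
<h2 id="flights-headline" deltaxml:key="flights-headline">Flights expected to be canceled in France amid strikes</h2>
<h2 id="nobel-headline" deltaxml:key="nobel-headline">Nobel winner's brother refused China prison visit </h2>
```

Since the two key values differ, the elements are no longer candidates for matching with each other.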
9 Formatting Changes
The final paragraph of the result file looks like this:
Example 23: the final paragraph in the result file
Typhoon Megi, which left at least [deleted: 10][added: ten] people dead as it crossed the Philippines yesterday, strengthened as it churned over the South China Sea on a path to Hong Kong, the US Navy [deleted: Joint][added: Joint Typhoon Warning Center][deleted: Typhoon ...][added: said.]
The changes to 'Joint Typhoon Warning Center' look incorrect as the words
'Joint Typhoon' are common and should not be shown as changed. The problem here
is that, as well as adding words to the phrase, the whole phrase has been
italicised using the <em> element. This affects the structure
of the paragraph and so the result appears as above. The solution to this is to
mark the <em> element as a formatting element. The
'How to Handle Formatting Element Changes' sample
explains in detail how and why this approach solves the problem. The following
example shows the template needed in the xhtml infilter to perform the task:
Example 24: an XSLT template to mark <em> elements as formatting
<!-- mark em elements as formatting elements -->
<xsl:template match="xhtml:em">
<xsl:copy>
<xsl:attribute name="deltaxml:format" select="'true'"/>
<xsl:apply-templates select="@*, node()"/>
</xsl:copy>
</xsl:template>
In a fully functional pipeline, this template should also add the attribute
to other formatting elements such as xhtml:strong,
xhtml:span etc.
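One way to cover several formatting elements at once is a union pattern, as in the sketch below. The list of element names is illustrative rather than exhaustive, and it assumes the same xhtml and deltaxml prefix bindings as the template above.

```xml
<!-- mark a set of inline elements as formatting elements;
     extend the union pattern with any other inline elements that apply -->
<xsl:template match="xhtml:em | xhtml:strong | xhtml:span">
  <xsl:copy>
    <xsl:attribute name="deltaxml:format" select="'true'"/>
    <xsl:apply-templates select="@*, node()"/>
  </xsl:copy>
</xsl:template>
```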
Two XSLT filters must also be added to the pipeline, to process these elements before and after comparison. These are shown in the following example:
Example 25: adding the format element processing filters to the pipeline
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
<filter>
<file path="xhtml-infilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/dx2-format-infilter.xsl"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordInfilter"/>
</filter>
</inputFilters>
<outputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.OrphanedWordOutfilter"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordSpaceFixup"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter"/>
</filter>
<filter>
<resource name="xsl/dx2-format-outfilter.xsl"/>
</filter>
<filter>
<file path="xhtml-outfilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/clean-house.xsl"/>
</filter>
</outputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
<comparatorFeatures>
<feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
</comparatorFeatures>
</comparatorPipeline>
The final paragraph now looks like this:
Example 26: the final paragraph in the result with the new format element handling
Typhoon Megi, which left at least [deleted: 10][added: ten] people dead as it crossed the Philippines yesterday, strengthened as it churned over the South China Sea on a path to Hong Kong, the US Navy Joint Typhoon [added: Warning Center][deleted: ...] [added: said.]
10 Gathering adds and deletes together
Having converted ids into keys to ensure that added and deleted headings and paragraphs are not compared against each other, the result now shows the following:
Example 27: added and deleted headings and paragraphs in the result
[deleted: Flights expected to be canceled in France amid strikes]
[added: Nobel winner's brother refused China prison visit]
[deleted: A striker joins the blockade of a fuel storage depots to protest against pension reform on October 18 in Frontignan, France.]
[added: Chinese authorities refused to allow the brother of jailed Nobel Peace Prize winner Liu Xiaobo to visit him in prison in apparent violation of the rules, a Hong Kong-based rights group said Tuesday.]
This would read better if the deleted summary was under the deleted headline and the added summary under the added headline. This can be achieved by using the red-green filter. This filter sorts consecutive, mixed added and deleted items so that all of the deleted items appear together and all of the added items appear together. The following example shows it being added to the pipeline:
Example 28: adding the dx2-red-green-outfilter.xsl to the pipeline
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
<filter>
<file path="xhtml-infilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/dx2-format-infilter.xsl"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordInfilter"/>
</filter>
</inputFilters>
<outputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.OrphanedWordOutfilter"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordSpaceFixup"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter"/>
</filter>
<filter>
<resource name="xsl/dx2-format-outfilter.xsl"/>
</filter>
<filter>
<resource name="xsl/dx2-red-green-outfilter.xsl"/>
</filter>
<filter>
<file path="xhtml-outfilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/clean-house.xsl"/>
</filter>
</outputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
<comparatorFeatures>
<feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
</comparatorFeatures>
</comparatorPipeline>
The result then looks like this:
Example 29: the result after applying the red-green filter
[deleted: Flights expected to be canceled in France amid strikes]
[deleted: A striker joins the blockade of a fuel storage depots to protest against pension reform on October 18 in Frontignan, France.]
[added: Nobel winner's brother refused China prison visit]
[added: Chinese authorities refused to allow the brother of jailed Nobel Peace Prize winner Liu Xiaobo to visit him in prison in apparent violation of the rules, a Hong Kong-based rights group said Tuesday.]
11 Preserving the doctype
You may have noticed that the result file no longer has a doctype definition,
even though the inputs do. To preserve the doctype, it needs to first be
converted into XML so that it can be processed as part of the comparison and
then converted back into a doctype as part of the post-processing. This is
described in more detail in the 'How to preserve input
doctype information' sample but essentially consists of adding a pair of
filters to the filter chain (one input and one output) to preserve the doctype.
As the doctype-outfilter MUST be the last filter in the filter chain, it needs
to include the functionality of the clean-house filter added earlier. This is
achieved using the <xsl:include> element in the
doctype-outfilter. Clean house could not appear before the doctype-outfilter as
it would remove the doctype information stored as XML in the result file.
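As a rough illustration of the mechanism only (this is not the actual content of doctype-outfilter.xsl, which is not reproduced in this tutorial), a stylesheet can absorb the templates of another with xsl:include. The href below is an assumed relative location.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- pull in the clean-house templates so that one filter performs both jobs;
       the href is an assumption about where the file can be resolved -->
  <xsl:include href="clean-house.xsl"/>

  <!-- templates that turn the doctype-as-XML information back into
       a doctype declaration would follow here -->

</xsl:stylesheet>
```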
Example 30: adding filters to preserve doctype to the pipeline
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.DoctypeToXML"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
<filter>
<file path="xhtml-infilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/dx2-format-infilter.xsl"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordInfilter"/>
</filter>
</inputFilters>
<outputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.OrphanedWordOutfilter"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordSpaceFixup"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter"/>
</filter>
<filter>
<resource name="xsl/dx2-format-outfilter.xsl"/>
</filter>
<filter>
<resource name="xsl/dx2-red-green-outfilter.xsl"/>
</filter>
<filter>
<file path="xhtml-outfilter.xsl" relBase="dxp"/>
</filter>
<!-- doctype-outfilter includes clean-house functionality -->
<filter>
<resource name="xsl/doctype-outfilter.xsl"/>
</filter>
</outputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
<comparatorFeatures>
<feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
</comparatorFeatures>
</comparatorPipeline>
The final step is to add a template to the xhtml outfilter to keep only the doctype information from the latest version of the file:
Example 31: a template to add to the xhtml outfilter to keep only the latest doctype information
<!-- keep only the latest doctype information -->
<xsl:template match="deltaxml:doctype">
  <xsl:apply-templates select="." mode="B"/>
</xsl:template>
The doctype is now output in the result file.
12 Preserving Comments
The inputs contain comments which you may wish to preserve in the result file. By default, comments are removed but it is possible to preserve them in a similar way to the doctype preservation. For full details see the 'How to Preserve Processing Instructions and Comments' sample.
Again, the solution is to add a pair of filters to the pipeline:
Example 32: adding filters to preserve comments to the pipeline
<comparatorPipeline description="Pipeline Tutorial" id="tutorial">
<inputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.DoctypeToXML"/>
</filter>
<filter>
<resource name="/xsl/pi2xml.xsl"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.NormalizeSpace"/>
</filter>
<filter>
<file path="xhtml-infilter.xsl" relBase="dxp"/>
</filter>
<filter>
<resource name="xsl/dx2-format-infilter.xsl"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordInfilter"/>
</filter>
</inputFilters>
<outputFilters>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.OrphanedWordOutfilter"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordSpaceFixup"/>
</filter>
<filter>
<class name="com.deltaxml.pipe.filters.dx2.wbw.WordOutfilter"/>
</filter>
<filter>
<resource name="xsl/dx2-format-outfilter.xsl"/>
</filter>
<filter>
<resource name="xsl/dx2-red-green-outfilter.xsl"/>
</filter>
<filter>
<resource name="/xsl/xml2pi.xsl"/>
</filter>
<filter>
<file path="xhtml-outfilter.xsl" relBase="dxp"/>
</filter>
<!-- doctype-outfilter includes clean-house functionality -->
<filter>
<resource name="xsl/doctype-outfilter.xsl"/>
</filter>
</outputFilters>
<parserFeatures>
<!-- w3.org gives 503 errors for DTD requests, we don't actually need it for this sample -->
<feature name="http://apache.org/xml/features/nonvalidating/load-external-dtd" literalValue="false"/>
</parserFeatures>
<comparatorFeatures>
<feature name="http://deltaxml.com/api/feature/isFullDelta" literalValue="true"/>
</comparatorFeatures>
</comparatorPipeline>
By placing the xml2pi filter in front of the xhtml-outfilter, we avoid adding
<span> elements to modified text within the comments
themselves.
13 Summary
By adding various filters and configuration settings, we have constructed a pipeline to compare XHTML files and produce a result XHTML file that displays changes using CSS. This is by no means a production-level solution but should give an idea of how to approach building your own pipeline. Reuse of existing filters to solve common problems is a simple way of building functionality into a pipeline without having to write much, if any, XSLT of your own. Most of these filters have a sample page explaining their use in more detail. See the Core documentation page for more details.
14 Running the sample pipeline
If you have Ant installed, use the build script provided to run the sample.
Simply type the following command to run the pipeline and produce the output
file result.xhtml.
ant run
If you don't have Ant installed, you can run the sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct slashes for your operating system).
java -jar ../../command.jar compare tutorial document-v1.xhtml document-v2.xhtml result.xhtml
To clean up the sample directory, run the following Ant command.
ant clean
