How to Ignore Formatting Element Changes
1 Introduction
Many XML languages describe documents; XHTML, DocBook and ODF are a few examples.
One common feature of these languages is that some of their elements are used
not to define structure but to mark text as having a certain format. Examples
of such elements are <strong> and <em> in XHTML, <emphasis> in DocBook and
<text:span> in ODF.
DeltaXML Core makes no distinction between structural elements and formatting elements when comparing two versions of a document. Because of this, changes to formatting elements can generate more change than expected in a delta file. Consider the following examples of a very simple documentation language.
Example 1: A simple XML document (input1.xml in the sample directory)
<document>
  <para>This paragraph will have words made bold in the following version.</para>
  <para>In this sentence, new words will be added and some made bold.</para>
</document>
Example 2: The same document with text changes and text formatting added (input2.xml in the sample directory)
<document>
  <para>This paragraph will have <bold>words made bold</bold> in the following version.</para>
  <para>In this sentence, <bold>new bold words</bold> will be added and some made bold.</para>
</document>
When these files are compared, they generate a delta file that shows a lot of change.
Example 3: The delta file without taking formatting elements into account (text is being compared word by word)
<document xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
          deltaxml:deltaV2="A!=B"
          deltaxml:version="2.0"
          deltaxml:content-type="full-context">
  <para deltaxml:deltaV2="A!=B">
    This paragraph will have
    <deltaxml:textGroup deltaxml:deltaV2="A">
      <deltaxml:text deltaxml:deltaV2="A">words</deltaxml:text>
    </deltaxml:textGroup>
    <bold deltaxml:deltaV2="B">words made bold</bold>
    <deltaxml:textGroup deltaxml:deltaV2="A">
      <deltaxml:text deltaxml:deltaV2="A">made bold </deltaxml:text>
    </deltaxml:textGroup>
    in the following version.
  </para>
  <para deltaxml:deltaV2="A!=B">
    In this sentence,
    <deltaxml:textGroup deltaxml:deltaV2="A">
      <deltaxml:text deltaxml:deltaV2="A">new</deltaxml:text>
    </deltaxml:textGroup>
    <bold deltaxml:deltaV2="B">new bold words</bold>
    <deltaxml:textGroup deltaxml:deltaV2="A">
      <deltaxml:text deltaxml:deltaV2="A">words </deltaxml:text>
    </deltaxml:textGroup>
    will be added and some made bold.
  </para>
</document>
Although this is a correct representation of what has changed between the two versions of the document, it may not be intuitive to somebody making changes in a WYSIWYG editor. For a document editor, the most important changes are textual changes rather than format changes, and this delta file shows more text change than actually occurred. DeltaXML Core includes some XSLT filters that improve this result by taking into account those elements that are merely used for textual formatting.
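To see how little text actually changed between the two inputs, the words of each paragraph can be compared with the inline markup ignored. The following is a minimal Python sketch using only the standard library; it is not part of DeltaXML Core, and the function names para_words and word_changes are invented for this illustration.

```python
import difflib
import xml.etree.ElementTree as ET

# The two sample documents from Examples 1 and 2.
INPUT1 = """<document>
<para>This paragraph will have words made bold in the following version.</para>
<para>In this sentence, new words will be added and some made bold.</para>
</document>"""

INPUT2 = """<document>
<para>This paragraph will have <bold>words made bold</bold> in the following version.</para>
<para>In this sentence, <bold>new bold words</bold> will be added and some made bold.</para>
</document>"""


def para_words(xml_text):
    """Return one word list per <para>, ignoring any inline markup."""
    root = ET.fromstring(xml_text)
    return ["".join(para.itertext()).split() for para in root.iter("para")]


def word_changes(old_words, new_words):
    """List (operation, old words, new words) for every changed region."""
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = [
    word_changes(old, new)
    for old, new in zip(para_words(INPUT1), para_words(INPUT2))
]
for number, change in enumerate(changes, start=1):
    print(number, change)
```

With formatting ignored, the first paragraph has no textual change and the second has a single inserted word, which is the result the filters described in the following sections aim to reproduce in the delta file.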
2 Marking up formatting elements
The first step in improving this result is to mark up the elements that are
used for text formatting so that DeltaXML Core can identify them. This is
achieved by adding the deltaxml:format="true" attribute to those
elements. In the example documents above, the <bold> element
needs to be marked in this way. The following XSLT template could be used to do
this.
Example 4: an XSLT template to mark bold elements (defined in mark-formatting.xsl in the sample directory)
<xsl:template match="bold">
  <xsl:copy>
    <!-- Mark this element as a formatting element -->
    <xsl:attribute name="deltaxml:format" select="'true'"/>
    <xsl:apply-templates select="node()"/>
  </xsl:copy>
</xsl:template>
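For readers who want to experiment outside an XSLT pipeline, the same marking step can be sketched in Python with the standard library. This is an illustration only, not the sample's mark-formatting.xsl; the names FORMATTING_ELEMENTS and mark_formatting are assumptions, and the namespace URI is the one used in the delta files shown above.

```python
import xml.etree.ElementTree as ET

DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
ET.register_namespace("deltaxml", DELTAXML_NS)
FORMAT_ATTR = f"{{{DELTAXML_NS}}}format"

# Assumption: <bold> is the only formatting element in the sample language.
FORMATTING_ELEMENTS = {"bold"}


def mark_formatting(root):
    """Add deltaxml:format="true" to every formatting element under root."""
    for element in root.iter():
        if element.tag in FORMATTING_ELEMENTS:
            element.set(FORMAT_ATTR, "true")
    return root


root = mark_formatting(
    ET.fromstring(
        "<para>In this sentence, <bold>new bold words</bold> "
        "will be added and some made bold.</para>"
    )
)
marked = ET.tostring(root, encoding="unicode")
print(marked)
```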
3 Flattening formatting elements
The next step is to flatten the marked elements to remove the structure they
create in the document. This, together with a word by word comparison (see the
word by word sample for more details), allows the text to be given a higher
matching priority than formatting. To flatten the elements, the document is
processed using the dx2-format-infilter.xsl stylesheet included in
DeltaXML Core to produce the following document:
Example 5: the document with flattened formatting elements (indentation to improve readability)
<document xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
  <para>
    This paragraph will have
    <deltaxml:format-start>
      <deltaxml:element>
        <bold/>
      </deltaxml:element>
    </deltaxml:format-start>
    words made bold
    <deltaxml:format-end/>
    in the following version.
  </para>
  <para>
    In this sentence,
    <deltaxml:format-start>
      <deltaxml:element>
        <bold/>
      </deltaxml:element>
    </deltaxml:format-start>
    new bold words
    <deltaxml:format-end/>
    will be added and some made bold.
  </para>
</document>
The <deltaxml:element> element is a wrapper that stores the
formatting element, along with any attributes it may have.
Now that the formatting has been flattened, all text within the paragraph is at the same level of hierarchy and so can be compared in a more appropriate way.
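The flattening idea can be sketched in a few lines of Python. This is a simplified illustration of the output shape shown in Example 5, not the logic of dx2-format-infilter.xsl; the function name flatten is invented, and dropping the deltaxml:format marker from the wrapped copy is an assumption based on the example output.

```python
import xml.etree.ElementTree as ET

NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
ET.register_namespace("deltaxml", NS)
FORMAT_ATTR = f"{{{NS}}}format"
START = f"{{{NS}}}format-start"
WRAPPER = f"{{{NS}}}element"
END = f"{{{NS}}}format-end"


def flatten(parent):
    """Replace marked children of parent with start/end markers, in place."""
    flattened = []
    for child in list(parent):
        flatten(child)  # handle nested formatting first
        if child.get(FORMAT_ATTR) != "true":
            flattened.append(child)
            continue
        # The start marker wraps an empty copy of the formatting element.
        start = ET.Element(START)
        wrapper = ET.SubElement(start, WRAPPER)
        kept = {k: v for k, v in child.attrib.items() if k != FORMAT_ATTR}
        ET.SubElement(wrapper, child.tag, kept)
        # The element's text and children move up to the parent's level.
        start.tail = child.text
        end = ET.Element(END)
        end.tail = child.tail
        flattened.extend([start, *child, end])
    parent[:] = flattened
    return parent


marked = ET.fromstring(
    f'<para xmlns:deltaxml="{NS}">In this sentence, '
    '<bold deltaxml:format="true">new bold words</bold>'
    " will be added and some made bold.</para>"
)
flat = flatten(marked)
print(ET.tostring(flat, encoding="unicode"))
```

After flattening, the paragraph's text is unchanged and sits entirely at the paragraph level, with empty marker elements recording where the formatting started and ended.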
4 Reconstructing formatting
Once the comparison has taken place, the flattened formatting elements need to be reconstructed. Because it may not be possible to reconstruct elements from both versions of the document (there is no guarantee that the nesting will be correct for a well-formed XML document), only the formatting elements from the second input are reconstructed. This means that any flattened formatting elements with a delta value of 'A' are ignored. This has the effect of ignoring all formatting changes and showing the formatting of the second document in the result but with textual changes marked.
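The rule of ignoring markers with a delta value of 'A' can be illustrated with a small Python sketch. It is not the logic of the DeltaXML output filter: the function names are invented, the <italic> element is hypothetical, and the sketch assumes that both start and end markers carry a deltaxml:deltaV2 attribute. The input deliberately contains overlapping formatting from the two versions, which is the case that cannot be nested correctly.

```python
import xml.etree.ElementTree as ET

NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
ET.register_namespace("deltaxml", NS)
START = f"{{{NS}}}format-start"
END = f"{{{NS}}}format-end"
DELTA = f"{{{NS}}}deltaV2"


def append_text(container, text):
    """Append text at the current end of container's content."""
    if not text:
        return
    if len(container):
        container[-1].tail = (container[-1].tail or "") + text
    else:
        container.text = (container.text or "") + text


def reconstruct(parent):
    """Rebuild formatting from markers, skipping those that exist only in A."""
    flat_children = list(parent)
    parent[:] = []
    stack = [parent]  # stack[-1] is the innermost open element
    for node in flat_children:
        if node.tag not in (START, END):
            stack[-1].append(node)  # ordinary element keeps its own tail
            continue
        if node.get(DELTA) != "A":
            if node.tag == START:
                template = node[0][0]  # the element stored in the wrapper
                rebuilt = ET.SubElement(
                    stack[-1], template.tag, dict(template.attrib)
                )
                stack.append(rebuilt)
            elif len(stack) > 1:
                stack.pop()
        # Text that followed the marker belongs to the current container.
        append_text(stack[-1], node.tail)
    return parent


flat = ET.fromstring(
    f'<para xmlns:deltaxml="{NS}">This is '
    '<deltaxml:format-start deltaxml:deltaV2="A"><deltaxml:element>'
    "<italic/></deltaxml:element></deltaxml:format-start>some "
    '<deltaxml:format-start deltaxml:deltaV2="B"><deltaxml:element>'
    "<bold/></deltaxml:element></deltaxml:format-start>important"
    '<deltaxml:format-end deltaxml:deltaV2="A"/> text'
    '<deltaxml:format-end deltaxml:deltaV2="B"/> here.</para>'
)
rebuilt = ET.tostring(reconstruct(flat), encoding="unicode")
print(rebuilt)
```

Only the bold formatting from the second version survives; the italic span from the first version is dropped, while all of the text is retained.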
To reconstruct the formatting, the document should be processed with the
dx2-format-outfilter.xsl XSLT stylesheet, included with DeltaXML Core. Since we
are ignoring certain changes, there may be cases where a delta value of 'A!=B'
is no longer correct for a paragraph (because the only changes were to
formatting). In order to correct these delta values, the final filter from the
ignore-changes solution, propagate-ignore-changes.xsl, should be run.
The final result for the example files above is as follows:
Example 6: the final result
<document xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
          deltaxml:deltaV2="A!=B"
          deltaxml:version="2.0"
          deltaxml:content-type="full-context">
  <para deltaxml:deltaV2="A=B">
    This paragraph will have
    <bold>words made bold</bold>
    in the following version.
  </para>
  <para deltaxml:deltaV2="A!=B">
    In this sentence,
    <bold deltaxml:deltaV2="A!=B">new
      <deltaxml:textGroup deltaxml:deltaV2="B">
        <deltaxml:text deltaxml:deltaV2="B">bold </deltaxml:text>
      </deltaxml:textGroup>
      words</bold>
    will be added and some made bold.
  </para>
</document>
Note that the first paragraph is marked as unchanged because there have been no textual changes. The second paragraph shows the new word added within the newly bolded text.
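A result such as Example 6 can be checked programmatically by reading the deltaxml:deltaV2 attributes. The following Python sketch is illustrative only; the function names para_status and added_text are invented, and the embedded document is Example 6 with whitespace condensed.

```python
import xml.etree.ElementTree as ET

NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
DELTA = f"{{{NS}}}deltaV2"

# Example 6, condensed onto fewer lines.
RESULT = f"""<document xmlns:deltaxml="{NS}" deltaxml:deltaV2="A!=B"
 deltaxml:version="2.0" deltaxml:content-type="full-context">
<para deltaxml:deltaV2="A=B">This paragraph will have
 <bold>words made bold</bold> in the following version.</para>
<para deltaxml:deltaV2="A!=B">In this sentence,
 <bold deltaxml:deltaV2="A!=B">new <deltaxml:textGroup deltaxml:deltaV2="B">
 <deltaxml:text deltaxml:deltaV2="B">bold </deltaxml:text>
 </deltaxml:textGroup> words</bold> will be added and some made bold.</para>
</document>"""


def para_status(root):
    """Return the delta value of each <para>, in document order."""
    return [para.get(DELTA) for para in root.iter("para")]


def added_text(root):
    """Return the text fragments that exist only in the second version."""
    return [
        (text.text or "").strip()
        for text in root.iter(f"{{{NS}}}text")
        if text.get(DELTA) == "B"
    ]


root = ET.fromstring(RESULT)
statuses = para_status(root)
added = added_text(root)
print(statuses, added)
```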
5 Summary
- Changes to formatting elements are shown as structural changes unless the elements are marked.
- Formatting elements can be marked and flattened using filters provided as part of DeltaXML Core.
- Once documents have been compared, flattened formatting is reconstructed.
- The result is a delta file that focuses on textual change.
6 Running the sample
If you have Ant installed, use the build script provided to run the sample.
Simply type the following command to run the pipeline and produce the output
files no-format-result.xml and format-result.xml.
ant run
If you don't have Ant installed, you can run the sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct slashes for your operating system).
java -jar ../../command.jar compare format input1.xml input2.xml no-format-result.xml mark-formatting=false
java -jar ../../command.jar compare format input1.xml input2.xml format-result.xml mark-formatting=true
To clean up the sample directory, run the following command in Ant.
ant clean
