
How to Preserve Doctype Information

1 Introduction

While the PipelinedComparator and DXP pipelines can specify a doctype through output properties, they use fixed values or, at best, values that must be supplied as parameters. If the input files themselves contain a doctype, it is often preferable to reuse it for the output document without having to specify it. This sample pipeline demonstrates how to output a dynamic doctype based on the doctypes in the input documents.

2 Converting a Doctype into XML

The first step is to convert the doctype into XML inside the document. As it stands, the doctype is not technically part of the XML document itself and will not pass from the parser to the Comparator unless we intervene. To preserve this information, we need to use the Java LexicalHandler interface (in the org.xml.sax.ext package). This interface allows the parser to report doctype information to the first filter in the input chain. This filter must be a Java filter; there is currently no mechanism to achieve this in XSLT. The sample directory includes the source code for the DoctypeToXML filter. This filter extends XMLFilterImpl2, which in turn implements the LexicalHandler interface. The purpose of DoctypeToXML is to convert the reported doctype information into an element that is added as the first child of the root element. This element is then passed through to the Comparator and makes its way through to the output filters.

Example 1 shows an input document with a doctype declaration, and Example 2 shows the same document after it has passed through the DoctypeToXML filter.

Example 1: input document with doctype (transitional.xhtml in the sample directory)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>A transitional document</title>
    </head>
    <body>
      <p>A document that may become strict in the future, but is currently transitional</p>
    </body>
</html>

Example 2: the input document after passing through DoctypeToXML

<html xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" 
      xmlns="http://www.w3.org/1999/xhtml">
   <deltaxml:doctype name="html" 
                     publicId="-//W3C//DTD XHTML 1.0 Transitional//EN" 
                     systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
   <head>
      <title>A transitional document</title>
   </head>
   <body>
      <p>A document that may become strict in the future, but is currently transitional</p>
   </body>
</html>
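
The conversion described above can be sketched with the standard SAX classes alone. The following is a minimal illustration, not the DoctypeToXML source shipped in the sample directory: the class name DoctypeSketchFilter and the describe helper are hypothetical, and the sketch extends the standard XMLFilterImpl rather than XMLFilterImpl2. It also declares the deltaxml namespace on the injected element itself rather than on the root element as in Example 2. The entity resolver that returns an empty DTD is only there so that the demonstration does not fetch the DTD over the network.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical stand-in for DoctypeToXML; not the source shipped with the sample. */
class DoctypeSketchFilter extends XMLFilterImpl implements LexicalHandler {
    static final String DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    private String name, publicId, systemId;
    private boolean pending; // true once a doctype is reported, until the root element is seen

    DoctypeSketchFilter(XMLReader parent) { super(parent); }

    @Override
    public void parse(InputSource input) throws SAXException, IOException {
        // Ask the underlying parser to report lexical events (including the doctype) to this filter.
        getParent().setProperty("http://xml.org/sax/properties/lexical-handler", this);
        super.parse(input);
    }

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        this.name = name;
        this.publicId = publicId;
        this.systemId = systemId;
        pending = true;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
        super.startElement(uri, local, qName, atts);
        if (!pending) return;
        pending = false;
        // The first element after the doctype is the root: emit the doctype as its first child.
        AttributesImpl a = new AttributesImpl();
        a.addAttribute("", "name", "name", "CDATA", name);
        if (publicId != null) a.addAttribute("", "publicId", "publicId", "CDATA", publicId);
        if (systemId != null) a.addAttribute("", "systemId", "systemId", "CDATA", systemId);
        super.startPrefixMapping("deltaxml", DELTAXML_NS);
        super.startElement(DELTAXML_NS, "doctype", "deltaxml:doctype", a);
        super.endElement(DELTAXML_NS, "doctype", "deltaxml:doctype");
        super.endPrefixMapping("deltaxml");
    }

    // The remaining LexicalHandler events are not needed for this sketch.
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }
    @Override public void comment(char[] ch, int start, int length) { }

    /** Demonstration helper (hypothetical): lists the start tags seen downstream of the filter. */
    static String describe(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        DoctypeSketchFilter filter = new DoctypeSketchFilter(factory.newSAXParser().getXMLReader());
        // Demonstration only: supply an empty DTD so that nothing is fetched from the network.
        filter.setEntityResolver((pub, sys) -> new InputSource(new StringReader("")));
        StringBuilder out = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append(' ').append(atts.getQName(i)).append("=\"").append(atts.getValue(i)).append('"');
                }
                out.append('>');
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe(
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head/></html>"));
    }
}
```

Registering the lexical handler inside parse() keeps the filter self-contained: whatever handler the caller may have set on the parser, the doctype is reported to this filter first.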

3 Handling Changes To the Doctype

Now that the doctype is held as XML within the document, it will be compared as part of the comparison process. This means that if the input documents have different doctypes, that change will be reflected in the result document. In order to output the doctype correctly, we must decide which version is going to be used. We can use the generic ignore-changes filters to do this. While this may be a little more complicated than the sample pipeline warrants, it is included as a filter that could be adapted to ignore other types of change at the same time. The most important point is that the <deltaxml:doctype> element should be marked as unchanged by the time we reach the final filter. For more information on ignoring changes, see the ignore-changes sample.

Example 3 shows how a doctype may change. The XML shown is a snippet from the immediate output of the comparator when comparing transitional.xhtml and strict.xhtml (included in the sample directory).

Example 3: a modified <deltaxml:doctype> element

<deltaxml:doctype deltaxml:deltaV2="A!=B" name="html">
    <deltaxml:attributes deltaxml:deltaV2="A!=B">
        <dxa:publicId deltaxml:deltaV2="A!=B">
            <deltaxml:attributeValue deltaxml:deltaV2="A">-//W3C//DTD XHTML 1.0 Transitional//EN</deltaxml:attributeValue>
            <deltaxml:attributeValue deltaxml:deltaV2="B">-//W3C//DTD XHTML 1.0 Strict//EN</deltaxml:attributeValue>
        </dxa:publicId>
        <dxa:systemId deltaxml:deltaV2="A!=B">
            <deltaxml:attributeValue deltaxml:deltaV2="A">http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd</deltaxml:attributeValue>
            <deltaxml:attributeValue deltaxml:deltaV2="B">http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd</deltaxml:attributeValue>
        </dxa:systemId>
    </deltaxml:attributes>
</deltaxml:doctype>

As can be seen above, the publicId and systemId for the doctype have changed between the inputs. In order to output a doctype in the result, we need to decide which version we are going to use. Example 4 shows an XSLT filter that marks the parts of the modified doctype on which changes should be ignored.

Example 4: filter to mark the changes to ignore (mark-ignore-doctype-changes.xsl in the sample directory)

<xsl:stylesheet version="2.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
  
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- specify which DOCTYPE systemId should be used if changes present -->
  <!-- see:  http://www.deltaxml.com/dxml/library/ignore-changes.html for more info on possible usage -->
  <xsl:template match="deltaxml:doctype/deltaxml:attributes/dxa:systemId">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'B,A'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- specify which DOCTYPE publicId should be used if changes present -->
  <xsl:template match="deltaxml:doctype/deltaxml:attributes/dxa:publicId">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'B,A'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- specify which DOCTYPE root element name should be used if changes present -->
  <xsl:template match="deltaxml:doctype/deltaxml:attributes/dxa:name">
    <xsl:copy>
      <xsl:attribute name="deltaxml:ignore-changes" select="'B,A'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Once this filter has been applied, the standard ignore-changes filters (apply-ignore-changes.xsl and propagate-ignore-changes.xsl) can also be applied. This will result in the <deltaxml:doctype> element being marked as unchanged and having the values present in input B.

4 Outputting the Doctype in the Result

The final step is to output the doctype in the result file. This must be performed as the last stage of the pipeline. The doctype is output using an XSLT filter that makes use of a Saxon extension instruction, saxon:doctype. It can therefore only run on a version of Saxon that has Saxon extensions enabled (for Saxon 9.2 and above, this must be either a Professional or Enterprise licensed edition). Because the filter that outputs the doctype must be the last filter, any 'clean house' functionality that is required must be included in that filter. An XSLT implementation of the CleanHouse Java filter is included for this purpose.

Example 5 shows the output filter that converts the <deltaxml:doctype> element back into a doctype. This implementation will only handle an unchanged <deltaxml:doctype> element, hence the previous step. It is possible to rewrite it to select which version to output at this stage if required.

Example 5: filter using saxon:doctype to output a doctype (doctype-outfilter.xsl in the sample directory)

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
                xmlns:xs="http://www.w3.org/2001/XMLSchema" 
                xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
                xmlns:saxon="http://saxon.sf.net/"
                xmlns:dtd="http://saxon.sf.net/dtd"
                exclude-result-prefixes="#all" 
                version="2.0">
  
  <!-- include clean-house functionality if necessary -->
  <xsl:include href="clean-house.xsl"/> 
  
  <xsl:template match="/*">
    <xsl:apply-templates select="deltaxml:doctype"/>
    <xsl:copy copy-namespaces="no">
      <xsl:apply-templates select="@*, node() except deltaxml:doctype"/>
    </xsl:copy>
  </xsl:template>
  
  <!-- output the doctype using the saxon:doctype instruction -->
  <xsl:template match="deltaxml:doctype">
    <saxon:doctype xsl:extension-element-prefixes="saxon">
      <dtd:doctype name="{@name}">
        <xsl:if test="@systemId">
          <xsl:attribute name="system" select="@systemId"/>
        </xsl:if>
        <xsl:if test="@publicId">
          <xsl:attribute name="public" select="@publicId"/>
        </xsl:if>
      </dtd:doctype>
    </saxon:doctype>
  </xsl:template>
  
</xsl:stylesheet>
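
For comparison, the same last step can be sketched outside the pipeline using only the standard JAXP output properties, which need no Saxon extensions. This is an illustrative alternative rather than part of the sample: the class name DoctypeOutputSketch and its serialize method are hypothetical, and the sketch assumes the result has already been loaded as a string containing an unchanged <deltaxml:doctype> element. Note two differences from Example 5: the serializer takes the doctype name from the root element rather than from the name attribute, and a public identifier is only written when a system identifier is also present.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical illustration; not one of the filters in the sample directory. */
class DoctypeOutputSketch {
    static final String DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    static String serialize(String resultXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(resultXml)));

        // An identity transformer is used purely as a serializer.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // Take the first deltaxml:doctype element, copy its identifiers into the
        // output properties, then remove it from the tree.
        NodeList found = doc.getDocumentElement().getElementsByTagNameNS(DELTAXML_NS, "doctype");
        if (found.getLength() > 0) {
            Element doctype = (Element) found.item(0);
            if (doctype.hasAttribute("systemId")) {
                serializer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, doctype.getAttribute("systemId"));
            }
            if (doctype.hasAttribute("publicId")) {
                serializer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, doctype.getAttribute("publicId"));
            }
            doctype.getParentNode().removeChild(doctype);
        }

        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(
            "<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:deltaxml=\"" + DELTAXML_NS + "\">"
            + "<deltaxml:doctype name=\"html\" publicId=\"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "systemId=\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"/>"
            + "<head/></html>"));
    }
}
```

Unlike the saxon:doctype filter, this approach builds the whole document in memory and leaves the deltaxml namespace declaration in place, so it is better suited to small post-processing steps than to use inside a filter chain.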

5 Running the sample

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output files transitional-strict-result.xhtml and strict-transitional-result.xhtml.

ant run

If you don't have Ant installed, you can run the sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct slashes for your operating system).

java -jar ../../command.jar compare doctypes transitional.xhtml strict.xhtml transitional-strict-result.xhtml
java -jar ../../command.jar compare doctypes strict.xhtml transitional.xhtml strict-transitional-result.xhtml

To clean up the sample directory, run the following Ant command.

ant clean