Preserving Doctype Information

1. Introduction

While the PipelinedComparator and DXP pipelines have facilities for specifying a doctype using output properties, these use fixed values or, at best, values that must be supplied as parameters. If the input files themselves contain a doctype, it is often preferable to use it as the doctype of the output document without having to specify it. This sample pipeline demonstrates how to output a dynamic doctype based on the doctypes in the input documents.

Note that it may be easier to select one of the pre-configured lexical preservation modes as discussed in the Guide to Lexical Preservation, as many of them include the dynamic preservation of doctypes.

2. Simple API approach

A simple approach for retaining doctypes is to enable the built-in lexical preservation on our 'Core S9 API' comparators, such as the PipelinedComparatorS9, which are configured by passing them a LexicalPreservationConfig object. The following code extract illustrates how to enable just doctype preservation on a LexicalPreservationConfig object.

LexicalPreservationConfig lpc = new LexicalPreservationConfig("base");
lpc.setPreserveDoctype(true);

Having enabled preservation, the next step is to specify how changes in the doctype and its optional internal subset should be handled. Unchanged doctypes are straightforward: their input value is passed through to the output, possibly with a different whitespace layout, as this is not reported by the parser. The difficulty lies in handling inconsistent doctypes, as an output XML document cannot have more than one doctype. One answer is to choose one of the input doctypes and hope that they are compatible. The 'B' input can be chosen as follows:

lpc.setDoctypeProcessingMode(PreservationProcessingMode.B);

One problem with the above 'input selection' approach is that a doctype's internal subset can declare elements, attributes and entities that are used in the document. Removing these declarations could therefore cause the output document to become invalid. Hence, the lexical preservation scheme provides a special output mode that keeps the internal subset declarations from both inputs, except where they conflict, in which case one is chosen. Setting the doctype output mode to 'BdA' has the effect of choosing the 'B' version of all doctype information, and the 'A' version of any declaration for which there is no 'B' version.

lpc.setDoctypeProcessingMode(PreservationProcessingMode.BdA);
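To make the 'BdA' rule concrete, the following standalone sketch models internal subset declarations as a plain map from declaration name to content model. It is only an illustration of the selection rule described above, not the DeltaXML implementation; the class BdaMergeSketch and the method mergeBdA are hypothetical names, and the sample data is modelled on the declarations used later in this sample. The ordering of the merged map is a property of this sketch only and says nothing about the order used in real output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: models the 'BdA' rule on plain maps, not the DeltaXML implementation. */
final class BdaMergeSketch {

    /**
     * Keeps every 'B' declaration and adds each 'A' declaration that has no 'B' version.
     * Keys are declaration names; values are content models.
     */
    static Map<String, String> mergeBdA(Map<String, String> a, Map<String, String> b) {
        Map<String, String> merged = new LinkedHashMap<>(a); // start from the 'A' declarations
        merged.putAll(b); // 'B' replaces conflicting entries and adds 'B'-only entries
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> a = new LinkedHashMap<>();
        a.put("testDelete", "(EMPTY)");
        a.put("testCommon", "(EMPTY)");
        a.put("testConflict", "(EMPTY)");

        Map<String, String> b = new LinkedHashMap<>();
        b.put("testCommon", "(EMPTY)");
        b.put("testAdd", "(EMPTY)");
        b.put("testConflict", "(ANY)");

        // testDelete survives from 'A'; testConflict takes its 'B' model.
        System.out.println(mergeBdA(a, b));
    }
}
```

By contrast, the plain 'B' mode corresponds to discarding the 'A' map entirely, which is why a declaration such as testDelete would be lost in that mode.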

For further information on the representation of the doctype information please refer to the Explanation section of this sample.

3. Simple command-line approach

It is possible to specify the lexical preservation options when running a lexically enabled pipeline comparator on the command line, by setting the configuration properties discussed in the Configuration Properties section of the Lexical Preservation guide. For example, the following configuration properties file can be used to set up the lexical preservation configuration in the same manner as discussed in the Simple API approach above.

<!DOCTYPE deltaxmlConfig SYSTEM "deltaxml-config.dtd">
<deltaxmlConfig>
  <configProperty name="com.deltaxml.lexicalPreservation.base"
                  value="base" />
  <configProperty name="com.deltaxml.lexicalPreservation.in.items" 
                  value="doctype" />
  <configProperty name="com.deltaxml.lexicalPreservation.out.items" 
                  value="default:change, doctype:BdA" />
</deltaxmlConfig>

Note that this approach actually changes the default lexical properties configuration, but it has the desired effect when running a single comparison from the command line. Care must be taken to ensure that such configuration property files are not used accidentally.

Note that the above approach has been superseded as of Core V7.2. A DXP file can now use a lexicalPreservation element to configure lexical preservation properties. For further information, see the Preserving Processing Instructions and Comments sample.

4. Running the sample

4.1. Using Java

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output files result-BdA.xhtml and result-B.xhtml.

ant run

If you don't have Ant installed, you can run the sample from a command line by issuing the following commands from the sample directory (ensuring that you use the correct copy command and directory separator for your operating system).

copy config/deltaxmlConfig-BdA.xml deltaxmlConfig.xml
java -jar ../../command.jar compare doctypes input1.xhtml input2.xhtml result-BdA.xhtml
copy config/deltaxmlConfig-B.xml deltaxmlConfig.xml
java -jar ../../command.jar compare doctypes input1.xhtml input2.xhtml result-B.xhtml

To clean up the sample directory, run the following command in Ant.

ant clean

4.2. Using .NET Framework

Here, a batch file is used to execute the comparison. The command to run the Pipelined Comparator sample is simply:

run.bat

5. Explanation

5.1. Converting a Doctype into XML

The first step is to convert the doctype into XML inside the document. As it stands, the doctype is not technically part of the XML document itself and will not pass through from the parser to the Comparator unless we intervene. To preserve this information, we make use of the Java LexicalHandler interface (in the org.xml.sax.ext package). This interface allows the parser to report doctype information to the first filter in the input chain. This filter must be a Java filter; there is currently no mechanism to achieve this in XSLT. Included in deltaxml.jar is a class called com.deltaxml.pipe.filters.LexicalPreservation. This filter extends XMLFilterImpl3, which in turn implements the LexicalHandler interface. The purpose of LexicalPreservation, amongst other things, is to convert the reported doctype information into an element that is added as the first child of the root element. This element is then passed to the Comparator and makes its way through to the output filters.
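As a rough illustration of the parser-side mechanism only (this is not the source of the LexicalPreservation filter), the following standalone sketch uses the JDK's SAX parser with org.xml.sax.ext.DefaultHandler2, which implements both LexicalHandler and DeclHandler. It records the doctype name, public and system identifiers, and the element declarations from the internal subset. The class name DoctypeCaptureSketch, its fields, and the capture method are hypothetical; the sketch also assumes that supplying an empty external subset is acceptable, which it does to avoid fetching the DTD over the network.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

/** Illustrative only: shows how SAX reports doctype details; not the LexicalPreservation filter. */
final class DoctypeCaptureSketch extends DefaultHandler2 {
    String name;
    String publicId;
    String systemId;
    final List<String> elementDecls = new ArrayList<>();

    @Override // LexicalHandler callback: reports the DOCTYPE name and identifiers
    public void startDTD(String name, String publicId, String systemId) {
        this.name = name;
        this.publicId = publicId;
        this.systemId = systemId;
    }

    @Override // DeclHandler callback: reports each <!ELEMENT ...> declaration
    public void elementDecl(String name, String model) {
        elementDecls.add(name + " " + model);
    }

    @Override // Assumption of this sketch: substitute an empty external subset
    public InputSource resolveEntity(String name, String publicId, String baseURI, String systemId) {
        return new InputSource(new StringReader(""));
    }

    /** Parses the given XML text and returns the handler holding the captured doctype details. */
    static DoctypeCaptureSketch capture(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            DoctypeCaptureSketch handler = new DoctypeCaptureSketch();
            // Lexical and declaration events are only delivered if registered as properties.
            parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            parser.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\" "
            + "[<!ELEMENT testDelete (EMPTY)>]>"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"/>";
        DoctypeCaptureSketch d = capture(xml);
        System.out.println(d.name + " | " + d.publicId + " | " + d.systemId + " | " + d.elementDecls);
    }
}
```

A filter in a real pipeline would go one step further and emit the captured values as an element in the event stream, as the paragraph above describes; this sketch stops at capturing them.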

Example 1 shows an input document with a doctype declaration, and Example 2 shows the same document after it has passed through the LexicalPreservation filter.

Example 1: input document with doctype (input1.xhtml in the sample directory)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
  <!ELEMENT testDelete (EMPTY)>
  <!ELEMENT testCommon (EMPTY)>
  <!ELEMENT testConflict (EMPTY)>
]>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>A xhtml document</title>
    </head>
    <body>
      <p>A xhtml document (view source for DOCTYPE and internal subset details).</p>
    </body>
</html>

Example 2: the input document after passing through LexicalPreservation

<html xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
      xmlns:preserve="http://www.deltaxml.com/ns/preserve"
      xmlns:er="http://www.deltaxml.com/ns/entity-references"
      xmlns:pi="http://www.deltaxml.com/ns/processing-instructions"
      xmlns="http://www.w3.org/1999/xhtml">
  <preserve:doctype name="html" publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
    systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <preserve:elementDecl name="testDelete" 
      deltaxml:key="element_testDelete" model="(EMPTY)"/>
    <preserve:elementDecl name="testCommon" 
      deltaxml:key="element_testCommon" model="(EMPTY)"/>
    <preserve:elementDecl name="testConflict" 
      deltaxml:key="element_testConflict" model="(EMPTY)"/>
  </preserve:doctype>
  <head>
    <title>A xhtml document</title>
  </head>
  <body>
    <p>A xhtml document (view source for DOCTYPE and internal subset details).</p>
  </body>
</html>

5.2. Handling Changes to the Doctype

Now that the doctype is held as XML within the document, it will be compared as part of the comparison process. This means that if the input documents have different doctypes, that change will be reflected in the result document. In order to output the doctype correctly, we must decide which version is going to be used. We can make use of the generic ignore-changes filters to process this. While this may be a little more complicated than the sample pipeline warrants, it is included as a filter that could be adapted to ignore other change types at the same time. The most important point is that the <preserve:doctype> element should be marked as unchanged by the time we reach the final filter. For more information on ignoring changes, see the ignore-changes sample.

Example 3 shows how a doctype may change. The XML shown is a snippet from the immediate output of the comparator when comparing input1.xhtml and input2.xhtml (included in the sample directory).

Example 3: a modified <preserve:doctype> element

<html xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
      xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
      xmlns:preserve="http://www.deltaxml.com/ns/preserve"
      deltaxml:deltaV2="A!=B" deltaxml:version="2.0" 
      deltaxml:content-type="full-context">
   <preserve:doctype deltaxml:deltaV2="A!=B" name="html">
      <deltaxml:attributes deltaxml:deltaV2="A!=B">
         <dxa:publicId deltaxml:deltaV2="A!=B">
            <deltaxml:attributeValue deltaxml:deltaV2="A">
              -//W3C//DTD XHTML 1.0 Transitional//EN
            </deltaxml:attributeValue>
            <deltaxml:attributeValue deltaxml:deltaV2="B">
              -//W3C//DTD XHTML 1.0 Strict//EN
            </deltaxml:attributeValue>
         </dxa:publicId>
         <dxa:systemId deltaxml:deltaV2="A!=B">
            <deltaxml:attributeValue deltaxml:deltaV2="A">
              http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
            </deltaxml:attributeValue>
            <deltaxml:attributeValue deltaxml:deltaV2="B">
              http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
            </deltaxml:attributeValue>
         </dxa:systemId>
      </deltaxml:attributes>
      <preserve:elementDecl deltaxml:deltaV2="A" name="testDelete" 
        deltaxml:key="element_testDelete" model="(EMPTY)"/>
      <preserve:elementDecl deltaxml:deltaV2="A=B" name="testCommon"
        deltaxml:key="element_testCommon" model="(EMPTY)"/>
      <preserve:elementDecl deltaxml:deltaV2="B" name="testAdd"
        deltaxml:key="element_testAdd" model="(EMPTY)"/>
      <preserve:elementDecl deltaxml:deltaV2="A!=B" name="testConflict"
         deltaxml:key="element_testConflict">
         <deltaxml:attributes deltaxml:deltaV2="A!=B">
            <dxa:model deltaxml:deltaV2="A!=B">
               <deltaxml:attributeValue deltaxml:deltaV2="A">(EMPTY)</deltaxml:attributeValue>
               <deltaxml:attributeValue deltaxml:deltaV2="B">(ANY)</deltaxml:attributeValue>
            </dxa:model>
         </deltaxml:attributes>
      </preserve:elementDecl>
   </preserve:doctype>
   <head deltaxml:deltaV2="A=B">
      <title>A xhtml document</title>
   </head>
   <body deltaxml:deltaV2="A=B">
      <p>A xhtml document (view source for DOCTYPE and internal subset details).</p>
   </body>
</html>

As can be seen above, the publicId and systemId for the doctype have changed between the inputs. In order to output a doctype in the result, we need to decide which version we are going to use. An instance of the 'Core S9 API' LexicalPreservationConfig class can be used to choose which version of the doctype to output in the event of a change (as discussed previously). Example 4 shows the output when the doctype output mode is set to 'BdA', that is, to output the 'B' doctype where present, or the 'A' doctype if input B has none.

Example 4: the resulting file

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
  <!ELEMENT testDelete (EMPTY)>
  <!ELEMENT testCommon (EMPTY)>
  <!ELEMENT testAdd (EMPTY)>
  <!ELEMENT testConflict (ANY)>
]>
<html xmlns="http://www.w3.org/1999/xhtml"
  xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
  deltaxml:deltaV2="A=B" deltaxml:version="2.0"
  deltaxml:content-type="full-context">
    <head>
        <title>A xhtml document</title>
    </head>
    <body>
      <p>A xhtml document (view source for DOCTYPE and internal subset details).</p>
    </body>
</html>

Note that the testDelete element declaration was only in the 'A' input, while the modified testConflict element declaration was in both the 'A' and 'B' inputs. Therefore, according to the 'BdA' behaviour, the output contains the deleted declaration (testDelete) and the 'B' version of the modified declaration (testConflict).
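The selection step that lies between Example 3 and Example 4 can be sketched with the JDK's DOM API. The following standalone code reads delta markup of the form shown in Example 3 and returns the 'B' value of an attribute of an element present in both inputs, falling back to the attribute on the element itself when it did not change. This is a minimal sketch under stated assumptions, not the filter shipped with the product: the class ChooseBSketch and its methods child, chooseB and parseRoot are hypothetical names, and only the namespace URIs and element names are taken from Example 3.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Illustrative only: picks 'B' values from delta markup; not the DeltaXML output filter. */
final class ChooseBSketch {
    static final String DELTA_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";
    static final String DXA_NS = "http://www.deltaxml.com/ns/non-namespaced-attribute";

    /** Finds the first direct child element with the given namespace and local name. */
    static Element child(Element parent, String ns, String local) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && ns.equals(n.getNamespaceURI())
                    && local.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Returns the 'B' value of the named attribute, or null if 'B' has no such attribute. */
    static String chooseB(Element owner, String attrName) {
        Element attrs = child(owner, DELTA_NS, "attributes");
        Element changed = attrs == null ? null : child(attrs, DXA_NS, attrName);
        if (changed == null) {
            // Unchanged attributes remain on the element itself.
            return owner.hasAttribute(attrName) ? owner.getAttribute(attrName) : null;
        }
        for (Node n = changed.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && "B".equals(((Element) n).getAttributeNS(DELTA_NS, "deltaV2"))) {
                return n.getTextContent().trim(); // drop pretty-printing whitespace
            }
        }
        return null; // the attribute existed only in 'A'
    }

    /** Parses XML text with namespace awareness and returns the document element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse delta XML", e);
        }
    }

    public static void main(String[] args) {
        String delta = "<preserve:doctype xmlns:preserve='http://www.deltaxml.com/ns/preserve'"
            + " xmlns:deltaxml='" + DELTA_NS + "' xmlns:dxa='" + DXA_NS + "'"
            + " deltaxml:deltaV2='A!=B' name='html'>"
            + "<deltaxml:attributes deltaxml:deltaV2='A!=B'><dxa:publicId deltaxml:deltaV2='A!=B'>"
            + "<deltaxml:attributeValue deltaxml:deltaV2='A'>-//W3C//DTD XHTML 1.0 Transitional//EN"
            + "</deltaxml:attributeValue>"
            + "<deltaxml:attributeValue deltaxml:deltaV2='B'>-//W3C//DTD XHTML 1.0 Strict//EN"
            + "</deltaxml:attributeValue></dxa:publicId></deltaxml:attributes></preserve:doctype>";
        Element doctype = parseRoot(delta);
        System.out.println(chooseB(doctype, "name") + " / " + chooseB(doctype, "publicId"));
    }
}
```

Choosing 'A' instead would only require matching the other deltaV2 value; the 'BdA' mode additionally keeps whole declarations that exist only in 'A', which this attribute-level sketch does not cover.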