Preserving Processing Instructions and Comments

1. Introduction

XML documents often contain processing instructions (PIs) or comments as well as the normal elements and attributes. These parts of the document are reported by the parser and processed during a comparison by default when the DocumentComparator or PipelinedComparatorS9 comparator classes are used. To achieve this, these node types are converted into XML elements within the document and then converted back again after comparison.

Sometimes, however, it is necessary to control the processing of comments and processing instructions in a more granular way, for example so that changes in these node types can be handled differently (by default the 'B' version of a changed node is output). This sample explains how this can be achieved using filters provided in the DeltaXML product.

Note that, if using the API, it may be easier to select one of the pre-configured lexical preservation modes as discussed in the Guide to Lexical Preservation, as many of them include the preservation of comments and processing instructions.

2. API approach

A simple approach for retaining PIs and comments is to enable the built-in lexical preservation on our 'Core S9 API' comparators, such as the PipelinedComparatorS9, which are configured by passing them a LexicalPreservationConfig object. The following code extract illustrates how to enable just PI and comment preservation on a LexicalPreservationConfig object.

LexicalPreservationConfig lpc = new LexicalPreservationConfig("base");
lpc.setPreserveProcessingInstructions(true);
lpc.setPreserveComments(true);

Having enabled preservation, the next step is to specify how changes in PIs and comments should be handled. PIs and comments that are unchanged, or that appear inside an added or deleted XML element, are relatively straightforward: they appear in the output as they did in the inputs. The difficulty lies in PIs or comments that are modified, or that appear in only one of the sources but not inside an added or deleted XML element. In some cases, such as when the output format has no way of representing a change in a PI or comment, it may be appropriate to output the newer 'B' version; this is illustrated in the following code extract.

lpc.setProcessingInstructionProcessingMode(PreservationProcessingMode.B);
lpc.setCommentProcessingMode(PreservationProcessingMode.B);
lpc.setOuterPiAndCommentProcessingMode(PreservationProcessingMode.CHANGE);

Where the desired output format can represent a change to a comment or processing instruction, further details on the lexical preservation format and scheme are required. This more advanced usage is discussed in the Explanation section of this sample.

3. DXP or DCP Approach

DXP and DCP pipeline configuration files can be used to configure the PipelinedComparatorS9 or the DocumentComparator respectively, allowing comparisons to be invoked with special configurations from a simple GUI or the command-line.

3.1. The lexicalPreservation DXP element

The DXP and DCP pipeline configuration formats support the lexicalPreservation element, which can be used to set lexical preservation options. The approach here is first to set the default options that apply to all lexical preservation artifacts, and then to set overrides for specific lexical artifact types. This is illustrated in the sample preserve-pis-and-comments-lp.dxp; a snippet is shown below:

  ...
  <lexicalPreservation>
    <defaults>
      <retain literalValue="false"/>
    </defaults>
    <overrides>
      <preserveItems>
        <comments>
          <retain literalValue="true"/>
          <processingMode literalValue="B"/>
        </comments>
        <processingInstructions>
          <retain literalValue="true"/>
          <processingMode literalValue="B"/>
        </processingInstructions>
      </preserveItems>
    </overrides>
  </lexicalPreservation>
  ...

This approach is very flexible because settings can be parameterised by using parameters declared in a pipelineParameters element.

4. Running the sample

4.1. Using Java

If you have Ant installed, use the build script provided to run the sample. To use the DXP configuration exploiting a lexicalPreservation element, simply type the following command to run the pipeline and produce the dxp-lp-result.xml output file.

ant run

Use the following command to compile and run the sample with the API approach and produce the output file api-result.xml.

ant run-api

If you don't have Ant installed, you can run the sample DXP from a command line by issuing the following command from the sample directory (ensuring that you use the correct directory separators for your operating system).

java -jar ../../command.jar compare preserve input1.xml input2.xml result.xml

To clean up the sample directory, run the following command in Ant.

ant clean

4.2. Using .NET Framework

Here, a batch file is used to execute the comparison. The command to run the Pipelined Comparator sample is simply:

run.bat

5. Explanation

5.1. Converting Processing Instructions and Comments into XML

The first step in preserving PIs and comments is to convert them into XML elements. The following example shows an XML document that contains PIs and comments.

Example 1: an XML file containing PIs and comments (input1.xml in the sample directory)

<!-- document comment outside of the root element -->
<?pi_target pre-root processing instruction ?>
<root>
  <!-- the following paragraph is a pangram -->
  <para>The quick brown fox jumps over the lazy dog.</para>
  <?pi_target processing instructions may be modified ?>
  <para>A quick movement of the enemy will jeopardize six gunboats.</para>
  <!-- comments may be deleted -->
  <para>A final paragraph</para>
</root>
<!-- comments can appear after the root element -->
<?pi_target so can processing instructions ?>

The DeltaXML Core product includes a lexical preservation feature, which can be configured to enable the processing of PIs and comments. The following example shows the same file after the lexical preservation input processing has been applied. Notice that the PIs and comments that appeared outside of the root element have been moved inside it, wrapped in preserve:pi-and-comment container elements whose region attribute records where they came from.

Example 2: the XML file after passing through the LexicalPreservation input filter

<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
      xmlns:pi="http://www.deltaxml.com/ns/processing-instructions"
      xmlns:preserve="http://www.deltaxml.com/ns/preserve">
  <preserve:pi-and-comment region="BEFORE_DTD">
    <preserve:comment> document comment outside of the root element </preserve:comment>
    <pi:pi_target>pre-root processing instruction </pi:pi_target>
  </preserve:pi-and-comment>
  <preserve:comment> the following paragraph is a pangram </preserve:comment>
  <para>The quick brown fox jumps over the lazy dog.</para>
  <pi:pi_target>processing instructions may be modified </pi:pi_target>
  <para>A quick movement of the enemy will jeopardize six gunboats.</para>
  <preserve:comment> comments may be deleted </preserve:comment>
  <para>A final paragraph</para>
  <preserve:pi-and-comment region="AFTER_BODY">
    <preserve:comment> comments can appear after the root element </preserve:comment>
    <pi:pi_target>so can processing instructions </pi:pi_target>
  </preserve:pi-and-comment>
</root>

These elements now take part in the comparison like any other element, and any changes to them will appear in the delta file.
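To make the conversion concrete, the following is a minimal sketch in plain Java DOM of the idea behind Example 2. It is not the DeltaXML input filter: the class name PiCommentEncoderSketch and its methods are hypothetical, only the namespace URIs and element names are taken from Example 2, and it handles only comments and PIs inside the root element (it does not build the preserve:pi-and-comment containers for nodes outside the root).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical sketch: NOT the DeltaXML filter, only the same idea in plain DOM.
public class PiCommentEncoderSketch {
    // Namespace URIs as they appear in Example 2.
    static final String PRESERVE_NS = "http://www.deltaxml.com/ns/preserve";
    static final String PI_NS = "http://www.deltaxml.com/ns/processing-instructions";

    /** Parses an XML string into a namespace-aware DOM document. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Replaces every comment and PI below the given node with an element. */
    public static void encode(Node parent) {
        Document doc = parent.getOwnerDocument();
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before replacing
            if (child.getNodeType() == Node.COMMENT_NODE) {
                Element e = doc.createElementNS(PRESERVE_NS, "preserve:comment");
                e.setTextContent(child.getNodeValue());
                parent.replaceChild(e, child);
            } else if (child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) child;
                // The PI target becomes the element's local name, the data its text.
                Element e = doc.createElementNS(PI_NS, "pi:" + pi.getTarget());
                e.setTextContent(pi.getData());
                parent.replaceChild(e, child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                encode(child); // recurse into ordinary elements
            }
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<root><!-- note --><para>text</para><?pi_target some data ?></root>");
        encode(doc.getDocumentElement());
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(n.getNodeName() + " -> [" + n.getTextContent() + "]");
        }
    }
}
```

Once encoded this way, a comment or PI is simply an element with text content, which is why a comparator that understands elements and text can report changes to it.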

5.2. Converting back after comparison

The following example shows the delta file produced after comparing input1.xml and input2.xml from the sample directory.

Example 3: a delta file showing changes to PIs and comments

<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
      xmlns:pi="http://www.deltaxml.com/ns/processing-instructions"
      xmlns:preserve="http://www.deltaxml.com/ns/preserve"
      deltaxml:deltaV2="A!=B"
      deltaxml:version="2.0"
      deltaxml:content-type="full-context">
   <preserve:pi-and-comment deltaxml:deltaV2="A=B" region="BEFORE_DTD">
      <preserve:comment> document comment outside of the root element </preserve:comment>
      <pi:pi_target>pre-root processing instruction </pi:pi_target>
   </preserve:pi-and-comment>
   <preserve:comment deltaxml:deltaV2="A!=B">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
         <deltaxml:text deltaxml:deltaV2="A"> the following paragraph is a pangram </deltaxml:text>
         <deltaxml:text deltaxml:deltaV2="B"> the following two paragraphs are pangrams </deltaxml:text>
      </deltaxml:textGroup>
   </preserve:comment>
   <para deltaxml:deltaV2="A=B">The quick brown fox jumps over the lazy dog.</para>
   <pi:pi_target deltaxml:deltaV2="A!=B">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
         <deltaxml:text deltaxml:deltaV2="A">processing instructions may be modified </deltaxml:text>
         <deltaxml:text deltaxml:deltaV2="B">processing instructions may be changed </deltaxml:text>
      </deltaxml:textGroup>
   </pi:pi_target>
   <para deltaxml:deltaV2="A=B">A quick movement of the enemy will jeopardize six gunboats.</para>
   <preserve:comment deltaxml:deltaV2="A"> comments may be deleted </preserve:comment>
   <para deltaxml:deltaV2="A=B">A final paragraph</para>
   <preserve:pi-and-comment deltaxml:deltaV2="A=B" region="AFTER_BODY">
      <preserve:comment> comments can appear after the root element </preserve:comment>
      <pi:pi_target>so can processing instructions </pi:pi_target>
   </preserve:pi-and-comment>
</root>

The lexical preservation output chain can be used to convert this information into processing instructions and comments, as discussed in the API approach section. However, it is also possible for you to convert the lexically preserved processing instructions and comments into other custom formats, using custom XSLT filters. Further details on the format are presented in the Lexical Preservation Format document.
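As a starting point for custom processing, the sketch below walks a delta such as Example 3 and lists each encoded comment and PI together with its change status. It uses only the standard Java DOM API rather than an XSLT filter; the class EncodedNodeReport and its methods are hypothetical and not part of the DeltaXML API. It assumes that an element without its own deltaxml:deltaV2 attribute takes the value of its nearest ancestor, which is how the container elements in Example 3 read.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for inspecting a delta; not part of the DeltaXML API.
public class EncodedNodeReport {
    static final String DELTA_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";
    static final String PRESERVE_NS = "http://www.deltaxml.com/ns/preserve";
    static final String PI_NS = "http://www.deltaxml.com/ns/processing-instructions";

    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the deltaV2 value of the element or, if absent, of its nearest ancestor. */
    public static String status(Element e) {
        for (Node n = e; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            String v = ((Element) n).getAttributeNS(DELTA_NS, "deltaV2");
            if (!v.isEmpty()) {
                return v;
            }
        }
        return ""; // assumption: no deltaV2 attribute anywhere above this element
    }

    /** Lists encoded comments and PIs in document order, one "kind status" entry each. */
    public static List<String> report(Element root) {
        List<String> lines = new ArrayList<>();
        collect(root, lines);
        return lines;
    }

    private static void collect(Node parent, List<String> lines) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element e = (Element) c;
            boolean isComment = PRESERVE_NS.equals(e.getNamespaceURI())
                    && "comment".equals(e.getLocalName());
            boolean isPi = PI_NS.equals(e.getNamespaceURI());
            if (isComment) {
                lines.add("comment " + status(e));
            } else if (isPi) {
                lines.add("pi:" + e.getLocalName() + " " + status(e));
            } else {
                collect(e, lines); // containers and ordinary elements are descended into
            }
        }
    }
}
```

A custom output step could use such a list to decide, per node, whether to emit the 'A' version, the 'B' version, or some markup that shows both.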

The following example shows the effect of configuring lexical preservation to choose the 'B' version of the encoded processing instructions and comments in the XML of Example 3.

Example 4: output that chooses the 'B' version of PIs and comments

<!-- document comment outside of the root element -->
<?pi_target pre-root processing instruction ?>
<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
  <!-- the following two paragraphs are pangrams -->
  <para deltaxml:deltaV2="A=B">The quick brown fox jumps over the lazy dog.</para>
  <?pi_target processing instructions may be changed ?>
  <para deltaxml:deltaV2="A=B">A quick movement of the enemy will jeopardize six gunboats.</para>
  <deltaxml:textGroup deltaxml:deltaV2="A"><deltaxml:text deltaxml:deltaV2="A">
  </deltaxml:text></deltaxml:textGroup><para deltaxml:deltaV2="A=B">A final paragraph</para>
</root><!-- comments can appear after the root element -->
<?pi_target so can processing instructions ?>
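The 'B' selection can be sketched in plain Java DOM as follows. This is an illustration of the idea only, not the lexical preservation output chain itself: the class ChooseBSketch and its methods are hypothetical. It assumes that changed text is wrapped in a single deltaxml:textGroup as in Example 3, and it simply removes encoded nodes that exist only in 'A'. It does not move the preserve:pi-and-comment containers back outside the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: NOT the DeltaXML output filter chain.
public class ChooseBSketch {
    static final String DELTA_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";
    static final String PRESERVE_NS = "http://www.deltaxml.com/ns/preserve";
    static final String PI_NS = "http://www.deltaxml.com/ns/processing-instructions";

    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the 'B' text of an encoded node, or null if it exists only in 'A'. */
    public static String bText(Element encoded) {
        if ("A".equals(encoded.getAttributeNS(DELTA_NS, "deltaV2"))) {
            return null; // deleted: no 'B' version
        }
        NodeList texts = encoded.getElementsByTagNameNS(DELTA_NS, "text");
        if (texts.getLength() == 0) {
            return encoded.getTextContent(); // unchanged or added: text is used as-is
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.getLength(); i++) {
            Element t = (Element) texts.item(i);
            if ("B".equals(t.getAttributeNS(DELTA_NS, "deltaV2"))) {
                sb.append(t.getTextContent());
            }
        }
        return sb.toString();
    }

    /** Turns encoded comments and PIs below the given node back into real nodes. */
    public static void decode(Node parent) {
        Document doc = parent.getOwnerDocument();
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before replacing
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) child;
                boolean isComment = PRESERVE_NS.equals(e.getNamespaceURI())
                        && "comment".equals(e.getLocalName());
                boolean isPi = PI_NS.equals(e.getNamespaceURI());
                if (isComment || isPi) {
                    String text = bText(e);
                    if (text == null) {
                        parent.removeChild(e);
                    } else if (isComment) {
                        parent.replaceChild(doc.createComment(text), e);
                    } else {
                        parent.replaceChild(
                            doc.createProcessingInstruction(e.getLocalName(), text), e);
                    }
                } else {
                    decode(e); // recurse into containers and ordinary elements
                }
            }
            child = next;
        }
    }
}
```

Applied to a delta like Example 3, decode(doc.getDocumentElement()) leaves the changed comment and PI carrying their 'B' text and drops the comment marked deltaxml:deltaV2="A".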