Preserving Entity References

1. Introduction

XML documents sometimes contain entity references. Although entity references can be either expanded or left as references within an XML document, by default they are not processed during a comparison. This is not always a problem, but sometimes the entity references must appear in the result. To achieve this, they are converted into XML elements within the document before the comparison and converted back again afterwards. This sample explains how to do this using filters provided in the DeltaXML product.
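The effect of expanding references before a comparison can be illustrated with the standard JAXP parser, independently of DeltaXML. The following is a minimal sketch, not part of the sample: the class name DefaultExpansionDemo and its helper method are hypothetical, and it only shows that a default parse leaves no trace of the reference, so a document using &city1; and one containing the literal text look identical afterwards.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses only the JDK's built-in JAXP parser.
class DefaultExpansionDemo {

    /** Parses the XML with default JAXP settings and returns the root element's text. */
    static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String withReference =
                "<!DOCTYPE p [<!ENTITY city1 \"Bath\">]><p>City 1 is &city1;</p>";
        String withLiteral = "<p>City 1 is Bath</p>";
        // Both inputs yield the same text once the reference has been expanded.
        System.out.println(rootText(withReference).equals(rootText(withLiteral)));
    }
}
```

Because the two inputs are indistinguishable after a default parse, any tool working on the parsed content needs the references to be encoded explicitly if they are to survive into the result.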

Note that it may be easier to select one of the pre-configured lexical preservation modes as discussed in the Guide to Lexical Preservation, as some of them include entity preservation (e.g. 'roundTrip' preservation mode).

2. Simple API approach

A simple approach for retaining entity references is to enable the built-in lexical preservation on our 'Core S9 API' comparators, such as the PipelinedComparatorS9, which are configured by passing them a LexicalPreservationConfig object. The following code extract illustrates how to enable only entity reference preservation on a LexicalPreservationConfig object.

LexicalPreservationConfig lpc= new LexicalPreservationConfig("base");
lpc.setPreserveEntityReplacementText(false);
lpc.setPreserveEntityReferences(true);

Having enabled the preservation, the next step is to specify how changes in entity references should be handled. It is relatively straightforward to handle entity references that are either unchanged, or appear in an added or deleted XML element. In these cases, the entity references appear as they would in the inputs. The difficulty comes in working out how to handle entity references that are modified or only appear in one of the sources, but not in an added or deleted XML element. In some cases, such as when there is no way of representing a change in an entity reference, it may be appropriate to output the newer 'B' version; this is illustrated in the following code extract.

lpc.setAdvancedEntityReferenceUsage(AdvancedEntityRefUsage.SPLIT);

It is also possible to detect changes in the content of an entity reference, by retaining both the entity reference and its replacement text, before the comparison is performed. Following the comparison, any changes in the replacement text will be identified using the usual scheme. This can then be used to highlight that the entity reference's content has changed by adding and deleting the associated entity reference in the output, as discussed in the Predefined Preservation Modes section of the Guide to Lexical Preservation.

LexicalPreservationConfig lpc= new LexicalPreservationConfig("base");
lpc.setPreserveEntityReferences(true);
lpc.setPreserveEntityReplacementText(true);
lpc.setPreserveNestedEntityReferences(true);

Note that if both the entity reference and replacement text are being preserved then it is possible to choose which should be retained in the output using the lpc.setUseEntityReferences method. The default behaviour is to preserve the entity references, rather than use the associated replacement text.

For further information on the representation of the entity declarations and references please refer to the Explanation section of this sample.

3. Running the sample

If you have Ant installed, use the build script provided to run the sample. Simply type the following command to run the pipeline and produce the output file result.xml.

ant run

If you don't have Ant installed, you can run the sample from a command line by issuing the following command from the sample directory (ensuring that you use the correct directory separators for your operating system).

java -jar ../../command.jar compare preserve input1.xml input2.xml result.xml

To clean up the sample directory, run the following command in Ant.

ant clean

4. Explanation

The following explanation makes use of a simplified variant of the sample input files. The key changes are that the DOCTYPE has been changed from XHTML to one specified solely by an internal subset, and that the explanatory text has been removed (as it is contained in this document). This allows us to focus on how changes in the entity declarations and use (via entity references) are handled.

Note: The output is not valid HTML, as it contains an internal subset, but it illustrates the changes in entity references in several web browsers (including Internet Explorer, Safari, Firefox, and Chrome), as long as the file extension is '.html'.

4.1. Entity references into XML

The first step in preserving entity references is to convert them into XML elements. The following example shows an XML document that contains entity references.

Example 1: an XML file containing entity references

<!DOCTYPE root [
  <!ELEMENT root (p*)>
  <!ELEMENT p (#PCDATA)>
  <!ENTITY city1 "Bath">
  <!ENTITY city2 "York">
  <!ENTITY city3 "Bath">
  <!ENTITY p1 "<p>From &city1; to &city2;.</p>">
  <!ENTITY p2 "<p>From &city1; to &city3;.</p>">
]>
<root>
  <p>City 1 is &city1;</p>
  <p>City 2 is &city2;</p>
  &p1;
  &p2;
</root>

The DeltaXML Core product includes filters for 'lexical preservation', which cover the processing of entity references. The following example shows the same file after being loaded with lexical preservation enabled.

Example 2: the XML file after passing through the lexical preservation input processing.

Part 1 - Encoding the doctype and internal subset declarations

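<root xmlns:preserve="http://www.deltaxml.com/ns/preserve"
      xmlns:er="http://www.deltaxml.com/ns/entity-references"
      xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
  <preserve:doctype name="root">
    <preserve:elementDecl name="root" deltaxml:key="element_root" model="(p*)" />
    <preserve:elementDecl name="p" deltaxml:key="element_p" model="(#PCDATA)" />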
 
    <preserve:internalParsedGeneralEntityDecl name="city1" value="Bath"
      deltaxml:key="entity_gen_city1" />
    <preserve:internalParsedGeneralEntityDecl name="city2" value="York"
      deltaxml:key="entity_gen_city2" />
    <preserve:internalParsedGeneralEntityDecl name="city3" value="Bath"
      deltaxml:key="entity_gen_city3" />

    <preserve:internalParsedGeneralEntityDecl name="p1" 
      value="!(*lt!)p!(*gt!)From !(*amp!)city1; to !(*amp!)city2;.!(*lt!)/p!(*gt!)"
      deltaxml:key="entity_gen_p1" />
    <preserve:internalParsedGeneralEntityDecl name="p2" 
      value="!(*lt!)p!(*gt!)From !(*amp!)city1; to !(*amp!)city3;.!(*lt!)/p!(*gt!)"
      deltaxml:key="entity_gen_p2" />
  </preserve:doctype>

Note that the encoded values of the p1 and p2 entity declarations make use of a non-standard XML entity character encoding scheme, as this simplifies some of the lexical preservation processing and assists with debugging.
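The marker scheme can be pictured as a plain textual substitution. The sketch below is an illustration only: it assumes that the three markers visible in the listing, !(*lt!), !(*gt!) and !(*amp!), stand for the characters <, > and &, and the class MarkerEncoding is hypothetical rather than part of the DeltaXML API. The actual filters may use further markers or different rules.

```java
// Hypothetical helper illustrating the marker substitution seen in the listing.
class MarkerEncoding {

    /** Replaces the XML-significant characters with their marker strings. */
    static String encode(String text) {
        // The markers contain none of '&', '<' or '>', so the order is not significant.
        return text.replace("&", "!(*amp!)")
                   .replace("<", "!(*lt!)")
                   .replace(">", "!(*gt!)");
    }

    /** Reverses encode(); assumes the input did not already contain marker strings. */
    static String decode(String text) {
        return text.replace("!(*lt!)", "<")
                   .replace("!(*gt!)", ">")
                   .replace("!(*amp!)", "&");
    }

    public static void main(String[] args) {
        String declared = "<p>From &city1; to &city2;.</p>";
        String encoded = encode(declared);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(declared));
    }
}
```

Storing the replacement text in this form keeps markup characters out of the attribute value, so the declaration can travel through the comparison as an ordinary attribute.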

Part 2a - Encoding the body in the preconfigured 'roundTrip' preservation mode

  <er:city1/>
  <er:city2/>
  <er:p1/>
  <er:p2/>
</root>

Part 2b - Encoding the body in the preconfigured 'entityRef' preservation mode

  <er:city1>Bath</er:city1>
  <er:city2>York</er:city2>
  <er:p1><p>From Bath to York.</p></er:p1>
  <er:p2><p>From Bath to Bath.</p></er:p2>
</root>

Part 2c - Encoding the body in the preconfigured 'nestedEntityRef' preservation mode

  <er:city1>Bath</er:city1>
  <er:city2>York</er:city2>
  <er:p1><p>From <er:city1>Bath</er:city1> to <er:city2>York</er:city2>.</p></er:p1>
  <er:p2><p>From <er:city1>Bath</er:city1> to <er:city3>Bath</er:city3>.</p></er:p2>
</root>
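The 'roundTrip' style of encoding, where each reference becomes an empty element in the er namespace, can be approximated with the JDK's DOM parser. This is a hedged sketch and not the DeltaXML implementation: the class EntityRefEncoder and its methods are hypothetical, it relies on DocumentBuilderFactory.setExpandEntityReferences(false) to keep EntityReference nodes in the tree, and it ignores the doctype encoding shown in Part 1.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of the reference-to-element step; not the DeltaXML filter.
class EntityRefEncoder {
    static final String ER_NS = "http://www.deltaxml.com/ns/entity-references";

    /** Parses the XML while keeping entity references as DOM nodes. */
    static Document parseKeepingReferences(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setExpandEntityReferences(false);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    /** Replaces each entity reference below root with an empty er:NAME element. */
    static List<String> encode(Element root) {
        List<String> names = new ArrayList<>();
        encode(root, names);
        return names;
    }

    private static void encode(Node parent, List<String> names) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before replacing
            if (child.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                String name = child.getNodeName();
                Element marker =
                        parent.getOwnerDocument().createElementNS(ER_NS, "er:" + name);
                parent.replaceChild(marker, child);
                names.add(name);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                encode(child, names);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE root [<!ENTITY city1 \"Bath\">"
                + "<!ENTITY p1 \"<p>From &city1;.</p>\">]>"
                + "<root><p>City 1 is &city1;</p>&p1;</root>";
        Document doc = parseKeepingReferences(xml);
        System.out.println(encode(doc.getDocumentElement()));
    }
}
```

The sketch replaces a reference wholesale and does not descend into it, which corresponds to the empty er elements of Part 2a. Keeping the replacement text inside the element, as in Parts 2b and 2c, would need additional handling.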

4.2. Raw intermediate comparison result

Having encoded the entity references for the purposes of comparison, the next step is to perform the comparison and see how changes in the encoded entity references are represented. We continue the above example by creating a second input in which the inner (nested) city entity references in the definitions of both &p1; and &p2; have been swapped, as illustrated below.

  <!ENTITY p1 "<p>From &city2; to &city1;.</p>">
  <!ENTITY p2 "<p>From &city3; to &city1;.</p>">

This leads to the following post-comparison intermediate result, which is split into two sections. The first section presents the common doctype and internal subset result, and the second section presents the changes to the XML content.

Part 1 - The raw comparison intermediate result for the doctype and internal subset declarations

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
  xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute"
  xmlns:er="http://www.deltaxml.com/ns/entity-references"
  xmlns:preserve="http://www.deltaxml.com/ns/preserve" 
  deltaxml:deltaV2="A!=B" deltaxml:version="2.0" 
  deltaxml:content-type="full-context">
  <preserve:doctype deltaxml:deltaV2="A!=B" name="root">
    <preserve:elementDecl deltaxml:deltaV2="A=B" name="root" 
      deltaxml:key="element_root" model="(p*)"/>
    <preserve:elementDecl deltaxml:deltaV2="A=B" name="p" deltaxml:key="element_p"
      model="(#PCDATA)"/>
    <preserve:internalParsedGeneralEntityDecl deltaxml:deltaV2="A=B" name="city1"
      deltaxml:key="entity_gen_city1" value="Bath"/>
    <preserve:internalParsedGeneralEntityDecl deltaxml:deltaV2="A=B" name="city2"
      deltaxml:key="entity_gen_city2" value="York"/>
    <preserve:internalParsedGeneralEntityDecl deltaxml:deltaV2="A=B" name="city3"
      deltaxml:key="entity_gen_city3" value="Bath"/>
    <preserve:internalParsedGeneralEntityDecl deltaxml:deltaV2="A!=B" name="p1"
      deltaxml:key="entity_gen_p1">
      <deltaxml:attributes deltaxml:deltaV2="A!=B">
        <dxa:value deltaxml:deltaV2="A!=B">
          <deltaxml:attributeValue deltaxml:deltaV2="A">!(*lt!)p!(*gt!)From
            !(city1!) to !(city2!).!(*lt!)/p!(*gt!)</deltaxml:attributeValue>
          <deltaxml:attributeValue deltaxml:deltaV2="B">!(*lt!)p!(*gt!)From
            !(city2!) to !(city1!).!(*lt!)/p!(*gt!)</deltaxml:attributeValue>
        </dxa:value>
      </deltaxml:attributes>
    </preserve:internalParsedGeneralEntityDecl>
    <preserve:internalParsedGeneralEntityDecl deltaxml:deltaV2="A!=B" name="p2"
      deltaxml:key="entity_gen_p2">
      <deltaxml:attributes deltaxml:deltaV2="A!=B">
        <dxa:value deltaxml:deltaV2="A!=B">
          <deltaxml:attributeValue deltaxml:deltaV2="A">!(*lt!)p!(*gt!)From
            !(city1!) to !(city3!).!(*lt!)/p!(*gt!)</deltaxml:attributeValue>
          <deltaxml:attributeValue deltaxml:deltaV2="B">!(*lt!)p!(*gt!)From
            !(city3!) to !(city1!).!(*lt!)/p!(*gt!)</deltaxml:attributeValue>
        </dxa:value>
      </deltaxml:attributes>
    </preserve:internalParsedGeneralEntityDecl>
  </preserve:doctype>

Part 2a - The raw comparison intermediate result for the body in the preconfigured 'roundTrip' preservation mode

  <p deltaxml:deltaV2="A=B">Cities: <er:city1/>, <er:city2/>, <er:city3/></p>
  <er:p1 deltaxml:deltaV2="A=B"/>
  <er:p2 deltaxml:deltaV2="A=B"/>
</root>

Part 2b - The raw comparison intermediate result for the body in the preconfigured 'entityRef' preservation mode

  <p deltaxml:deltaV2="A=B">Cities: <er:city1>Bath</er:city1>,
    <er:city2>York</er:city2>, <er:city3>Bath</er:city3></p>
  <er:p1 deltaxml:deltaV2="A!=B">
    <p deltaxml:deltaV2="A!=B">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
        <deltaxml:text deltaxml:deltaV2="A">From Bath to York.</deltaxml:text>
        <deltaxml:text deltaxml:deltaV2="B">From York to Bath.</deltaxml:text>
      </deltaxml:textGroup>
    </p>
  </er:p1>
  <er:p2 deltaxml:deltaV2="A=B">
    <p>From Bath to Bath.</p>
  </er:p2>
</root>

Part 2c - The raw comparison intermediate result for the body in the preconfigured 'nestedEntityRef' preservation mode

  <p deltaxml:deltaV2="A=B">Cities: <er:city1>Bath</er:city1>,
    <er:city2>York</er:city2>, <er:city3>Bath</er:city3></p>
  <er:p1 deltaxml:deltaV2="A!=B">
    <p deltaxml:deltaV2="A!=B">From 
      <er:city1 deltaxml:deltaV2="A">Bath</er:city1>
      <er:city2 deltaxml:deltaV2="B">York</er:city2> to 
      <er:city2 deltaxml:deltaV2="A">York</er:city2>
      <er:city1 deltaxml:deltaV2="B">Bath</er:city1>.</p>
  </er:p1>
  <er:p2 deltaxml:deltaV2="A!=B">
    <p deltaxml:deltaV2="A!=B">From 
      <er:city1 deltaxml:deltaV2="A">Bath</er:city1>
      <er:city3 deltaxml:deltaV2="B">Bath</er:city3> to 
      <er:city3 deltaxml:deltaV2="A">Bath</er:city3>
      <er:city1 deltaxml:deltaV2="B">Bath</er:city1>.</p>
  </er:p2>
</root>
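When inspecting a raw result such as the ones above, it can help to list the elements that the comparison marked as modified. The following sketch uses only the JDK's DOM API; the class DeltaScanner is hypothetical, and the namespace URI and the deltaV2 values are taken from the listings in this section.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for exploring a raw DeltaV2 result.
class DeltaScanner {
    static final String DELTA_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    /** Returns the qualified names of all elements marked deltaxml:deltaV2="A!=B". */
    static List<String> changedElements(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*"); // every element, document order
            List<String> changed = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                if ("A!=B".equals(element.getAttributeNS(DELTA_NS, "deltaV2"))) {
                    changed.add(element.getNodeName());
                }
            }
            return changed;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse delta result", e);
        }
    }

    public static void main(String[] args) {
        String delta = "<root xmlns:deltaxml=\"" + DELTA_NS + "\""
                + " xmlns:er=\"http://www.deltaxml.com/ns/entity-references\""
                + " deltaxml:deltaV2=\"A!=B\">"
                + "<er:p1 deltaxml:deltaV2=\"A!=B\"><p deltaxml:deltaV2=\"A!=B\">x</p></er:p1>"
                + "<er:p2 deltaxml:deltaV2=\"A=B\"><p>From Bath to Bath.</p></er:p2>"
                + "</root>";
        System.out.println(changedElements(delta));
    }
}
```

Applied to a fragment shaped like Part 2b, this reports er:p1 but not er:p2, which matches the observation that only the replacement text of p1 differs in that mode.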

4.3. XML to Entity references

Before the raw comparison result can be converted into an output format, that format needs to be chosen. For the purposes of this sample and our explanation we use HTML as our output format, where changes are marked up using HTML's ins and del elements. Recall that this explanation uses a simplified version of the sample inputs, which are themselves (X)HTML.

The transformation from this DeltaV2 markup to the HTML markup is performed by an XSL transformation that is designed to cope with the types of change found in the sample. Note that it is intended only for the purposes of illustrating this sample code; it is not a general purpose DeltaV2 markup to HTML change markup filter.

We now continue the above example by illustrating the resolved output in two sections: the doctype and XML content sections.

Part 1 - The final result for the doctype and internal subset declarations.

<!DOCTYPE root  [
  <!ELEMENT root (p*)>
  <!ELEMENT p (#PCDATA)>
  <!ENTITY city1 "Bath">
  <!ENTITY city2 "York">
  <!ENTITY city3 "Bath">
  <!ENTITY p1 "<p>From &city2; to &city1;.</p>">
  <!ENTITY p2 "<p>From &city3; to &city1;.</p>">
]>

Part 2a - The final result for the body in the preconfigured 'roundTrip' preservation mode.

<root>
  <p>Cities: &city1;, &city2;, &city3;</p>
  &p1;
  &p2;
</root>

With 'roundTrip' processing, no differences are reported between the two inputs, as the changes to the DOCTYPE cannot be reported and there are no changes to the body of the XML document itself.

Part 2b - The final result for the body in the preconfigured 'entityRef' preservation mode.

<root>
  <p>Cities: &city1;, &city2;, &city3;</p>
  <del>&p1;</del><ins>&p1;</ins>
  &p2;
</root>

With 'entityRef' preservation enabled, it is now possible to detect that the content of the entity referred to by &p1; has changed. This is represented by a deletion and an insertion of that entity reference.
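The mapping from an encoded reference and its delta status to the HTML change markup can be sketched as a small function. The sample itself performs this step with an XSL transformation; the Java below is only an illustration of the rule, and the class EntityRefRenderer and its method are hypothetical.

```java
// Hypothetical illustration of the reference-to-markup rule; the sample uses XSLT.
class EntityRefRenderer {

    /**
     * Renders an entity reference according to its deltaV2 status:
     * "A" (only in the first input), "B" (only in the second input),
     * "A!=B" (replacement text changed), anything else is treated as unchanged.
     */
    static String render(String entityName, String delta) {
        String ref = "&" + entityName + ";";
        switch (delta) {
            case "A":
                return "<del>" + ref + "</del>";
            case "B":
                return "<ins>" + ref + "</ins>";
            case "A!=B":
                // The reference text is identical on both sides; only its content differs.
                return "<del>" + ref + "</del><ins>" + ref + "</ins>";
            default:
                return ref;
        }
    }

    public static void main(String[] args) {
        System.out.println(render("p1", "A!=B"));
        System.out.println(render("p2", "A=B"));
    }
}
```

Rendering a modified reference as a deletion followed by an insertion of the same reference is what produces the p1 line in the Part 2b result.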

Part 2c - The final result for the body in the preconfigured 'nestedEntityRef' preservation mode.

<root>
  <p>Cities: &city1;, &city2;, &city3;</p>
  <del>&p1;</del><ins>&p1;</ins>
  <del>&p2;</del><ins>&p2;</ins>
</root>

With 'nestedEntityRef' preservation enabled, it is now possible to detect that the definition of the entity referred to by &p2; has changed. This is represented by a deletion and an insertion of that entity reference. In other words, neither the name nor the value of &p2; has changed, but the means by which that value is calculated has changed.
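The distinction between a changed value and a changed definition can be checked by resolving the nested references by hand. The following sketch is illustrative only: the class ReplacementTextDemo is hypothetical, its regular expression covers only simple ASCII entity names, and it assumes the declarations are not recursive (which XML forbids).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that resolves nested general entity references in a string.
class ReplacementTextDemo {
    private static final Pattern REF = Pattern.compile("&([A-Za-z_][A-Za-z0-9_.-]*);");

    /** Expands every declared reference in text, recursing into replacement text. */
    static String expand(String text, Map<String, String> declarations) {
        Matcher matcher = REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(text, last, matcher.start());
            String value = declarations.get(matcher.group(1));
            // Undeclared references are left as they are.
            out.append(value == null ? matcher.group() : expand(value, declarations));
            last = matcher.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cities = new HashMap<>();
        cities.put("city1", "Bath");
        cities.put("city2", "York");
        cities.put("city3", "Bath");
        // p2 before and after the swap: different definitions, same resolved value.
        String before = expand("<p>From &city1; to &city3;.</p>", cities);
        String after = expand("<p>From &city3; to &city1;.</p>", cities);
        System.out.println(before.equals(after));
    }
}
```

Since city1 and city3 both resolve to Bath, the two definitions of p2 produce the same text, so a mode that compares only the resolved replacement text sees no change; the difference only becomes visible once the nested references themselves are preserved.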

Note that, in the cases where the replacement text is kept in the encoded entity references, it is possible to choose to show the changes in the replacement text, instead of preserving the entity references themselves. There are some contexts in which this is desirable, such as when supporting multiple output formats through a single pipeline, where some of the output formats can contain entity references and others cannot. The following output illustrates the changes if the replacement text is kept.

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <p>Cities: Bath, York, Bath</p>
  <p>
    <del>From Bath to York.</del>
    <ins>From York to Bath.</ins>
  </p>
  <p>From Bath to Bath.</p>
</root>