Lexical Preservation Format

1. Introduction

The underpinning XML Compare comparator only understands three sorts of XML node: element, attribute, and text nodes. All other types of node, in particular processing instruction and comment nodes, therefore need to be converted into these types if they are to be retained.

In addition to the XML nodes within a document, there are other sorts of information that you might want to retain, such as the document's DOCTYPE, CDATA sections, and entity references. This information is frequently available from an XML parser, say via its SAX parsing event interface, and thus can be converted into XML nodes that the underpinning comparator understands.

In general the retained information is stored in element nodes that are added to one of three namespaces:

  1. xmlns:pi='http://www.deltaxml.com/ns/processing-instructions' - for retaining processing instructions (PIs), where the element's name and content mirror the PI's tag and content respectively.
  2. xmlns:er='http://www.deltaxml.com/ns/entity-references' - for retaining entity references, where the element's name and content mirror the entity reference's name and content respectively.
  3. xmlns:preserve='http://www.deltaxml.com/ns/preserve' - for retaining all other sorts of information, where the element's name represents the sort of information being retained (e.g. 'doctype', 'cdata', or 'comment'). Here, a mixture of the element's content and attributes are used to encode the retained data.
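
As a minimal sketch of how these three namespaces partition the retained information, the following Python fragment classifies an element name given in '{namespace}local' form. The namespace URIs come from the list above; the function name and the returned labels are hypothetical and are not part of the product API.

```python
# Namespace URIs are taken from the list above. Everything else here
# (function name, returned labels) is illustrative only.
NAMESPACES = {
    "pi": "http://www.deltaxml.com/ns/processing-instructions",
    "er": "http://www.deltaxml.com/ns/entity-references",
    "preserve": "http://www.deltaxml.com/ns/preserve",
}


def retained_kind(clark_name):
    """Classify a '{namespace}local' element name by the data it retains."""
    if not clark_name.startswith("{"):
        return None  # no namespace: ordinary document content
    uri, _, local = clark_name[1:].partition("}")
    if uri == NAMESPACES["pi"]:
        return "processing-instruction:" + local
    if uri == NAMESPACES["er"]:
        return "entity-reference:" + local
    if uri == NAMESPACES["preserve"]:
        return local  # e.g. 'doctype', 'cdata', or 'comment'
    return None  # some other namespace: ordinary document content
```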

The Lexical Preservation Guide and the lexical-preservation.rng RELAX NG schema provide a top-level overview of what can be preserved and a semi-formal definition of the format, respectively. The remainder of this document supplements them by defining the format informally, by example. Specifically, it shows how the 'retained data' in the following XML document is converted into element nodes.

(01)  <?xml version="1.0" encoding="UTF-8"?>
(02)  <!-- A pre DOCTYPE comment -->
(03)  <!DOCTYPE article SYSTEM "http://www.docbook.org/xml/4.5/docbookx.dtd" 
(04)  [ <!ENTITY % paramEnt "
(05)      <!ATTLIST simpara level (unknown|novice|trainee|practitioner|expert) 'unknown'>
(06)    ">
(07)    <!ELEMENT exampleElement (#PCDATA)>
(08)    <!ATTLIST exampleElement yesNo (yes|no) 'no'>
(09)    %paramEnt;
(10)    <!ENTITY genEnt "an <emphasis role='bold'>internal (parsed) general</emphasis> entity.">
(11)  ]>
(12)  <?myPI Content of the processing instruction.?>
(13)  <article>
(14)    <title>Lexical Preservation Filter Demo</title>
(15)    <!-- In the following paragraph we reference the entity &genEnt; -->
(16)    <para>This paragraph references &genEnt;</para>
(17)    <para><![CDATA[Content of the CDATA Section text]]></para>
(18)    <simpara>An overridden simpara with a defaulted level attribute.</simpara>
(19)  </article>
(20)  <!-- A post XML body comment -->

The lexical preservation conversion of the example file is discussed in stages where the lines under consideration are reproduced along with the output where everything is being preserved. For clarity, the whitespace aspect of the preservation is not maintained.

2. XML Declaration

The XML declaration data is stored in a preserve:xmldecl element, which is a child of the root element. Here the XML declaration 'attributes' are converted into element attributes; in the example below, encoding keeps its name while version is stored as xml-version.

(01)  <?xml version="1.0" encoding="UTF-8"?>
(13)  <article>

In order to compare and preserve the XML declaration, it is added to the body of the document, as illustrated by the line labelled (01b) below. Further, the namespaces that are used by this lexical preservation filter (13b to 13e) are attached to the root element (13a).

(01a)  <?xml version="1.0" encoding="UTF-8"?>
(13a)  <article 
(13b)    xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" 
(13c)    xmlns:preserve="http://www.deltaxml.com/ns/preserve" 
(13d)    xmlns:er="http://www.deltaxml.com/ns/entity-references" 
(13e)    xmlns:pi="http://www.deltaxml.com/ns/processing-instructions">
(01b)    <preserve:xmldecl xml-version="1.0" encoding="UTF-8"/>

Note that for convenience (and later clarity) all three namespaces for the retained information are declared, along with the main DeltaXML versioning namespace.

Warning: although it is possible to detect whether the XML declaration has set a standalone 'attribute', this information is currently not preserved by the output processing.
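
The conversion in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the filter's implementation: the function name is hypothetical, the declaration is parsed with a simple regular expression, and the attribute names are copied from line (01b). The standalone value is dropped, mirroring the warning above.

```python
import re
from xml.sax.saxutils import quoteattr


def xmldecl_element(declaration):
    """Build a preserve:xmldecl element string from an XML declaration."""
    # Collect the name="value" pairs of the declaration.
    pseudo = dict(re.findall(r"""(\w+)\s*=\s*["']([^"']*)["']""", declaration))
    attrs = []
    if "version" in pseudo:
        # Line (01b) of the example stores the version as 'xml-version'.
        attrs.append("xml-version=" + quoteattr(pseudo["version"]))
    if "encoding" in pseudo:
        attrs.append("encoding=" + quoteattr(pseudo["encoding"]))
    # 'standalone' is deliberately not emitted (see the warning above).
    return "<preserve:xmldecl " + " ".join(attrs) + "/>"
```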

3. Comments and Processing Instructions

XML comments and processing instructions (PIs) are encoded into preserve:comment and pi:tag elements, where tag is replaced by the name of the PI. If they are not encoded, then the XML comments and PIs are left in situ. Encoded XML comments and processing instructions (PIs) that appear outside the root element are contained in a preserve:pi-and-comment element, which is a child of the root element. There can be up to three preserve:pi-and-comment elements, which are distinguished by their region attribute value:

  • BEFORE_DTD - PIs and comments before the DOCTYPE/Internal-Subset declaration.
  • AFTER_DTD - PIs and comments after the DOCTYPE/Internal-Subset declaration, but before the root element.
  • AFTER_BODY - PIs and comments after the root element (XML body has been completed).

Processing Instructions and Comments before the DOCTYPE declaration

(02)  <!-- A pre DOCTYPE comment -->

In order to compare and preserve comments and processing instructions that occur before the DOCTYPE declaration, a preserve:pi-and-comment block is introduced, with its region attribute set to BEFORE_DTD.

(02a)    <preserve:pi-and-comment region="BEFORE_DTD">
(02b)      <preserve:comment> A pre DOCTYPE comment </preserve:comment>
(02c)    </preserve:pi-and-comment>
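
The assignment of top-level comments and PIs to the three regions can be sketched with Python's standard xml.dom.minidom parser. This is only an approximation of the behaviour described in this section: the function name and the tuple layout are hypothetical, and the handling of a document that has no DOCTYPE is an assumption rather than documented behaviour.

```python
from xml.dom import minidom, Node


def top_level_regions(xml_text):
    """Return (region, element name, content) for each top-level PI/comment."""
    doc = minidom.parseString(xml_text)
    # Assumption: without a DOCTYPE, items before the root stay in BEFORE_DTD.
    region = "BEFORE_DTD"
    found = []
    for node in doc.childNodes:
        if node.nodeType == Node.DOCUMENT_TYPE_NODE:
            region = "AFTER_DTD"
        elif node.nodeType == Node.ELEMENT_NODE:
            region = "AFTER_BODY"  # the root element has been passed
        elif node.nodeType == Node.COMMENT_NODE:
            found.append((region, "preserve:comment", node.data))
        elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found.append((region, "pi:" + node.target, node.data))
    return found


# A cut-down version of the example document (no internal subset).
SAMPLE = (
    "<!-- A pre DOCTYPE comment -->"
    '<!DOCTYPE article SYSTEM "http://www.docbook.org/xml/4.5/docbookx.dtd">'
    "<?myPI Content of the processing instruction.?>"
    "<article><title>Lexical Preservation Filter Demo</title></article>"
    "<!-- A post XML body comment -->"
)
```

Calling top_level_regions(SAMPLE) yields one entry per region, each of which would become a child of the corresponding preserve:pi-and-comment element.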

4. DOCTYPE and the Internal Subset including Entity Refs.

DOCTYPE information is converted into a preserve:doctype element that is a child of the root element, where the root element name, public identifier, and system identifier are stored as attributes (if they exist). Further, the content of the element holds the retained internal subset information.

The internal subset can contain DTD declarations and entity references. Here each DTD declaration type is given its own representation element:

  • Element Decl - converted into a preserve:elementDecl element, where the element's attributes are used to store its name and model data.
  • Attribute Decl - converted into a preserve:attributeDecl element, where the element's attributes are used to store its name, associated element name, type, and default value.
  • Internal Parsed Parameter Entity Declaration - converted into a preserve:internalParsedParameterEntityDecl element, where the element's attributes are used to store the entity declaration's name and value.
  • Internal Parsed General Entity Declaration - converted into a preserve:internalParsedGeneralEntityDecl element, in a similar manner to the other entity declarations.
  • Entity Reference (usage) - converted into an er:n element, where n is the name of the entity reference and the parameter attribute records whether this entity reference refers to a parameter entity declaration. The content of these elements may contain the entity replacement text, if this has been requested.

All the DTD declaration conversion elements also carry a deltaxml:key attribute that uniquely identifies the declaration by its name. This ensures that the underpinning comparator can only align such declarations with other declarations of the same name.

The content of the entity declarations is escaped using the ASCII exclamation mark (!) character, as follows:

  Encoding      Original
  !(entRef!)    &entRef
  !(*lt!)       <
  !(*gt!)       >
  !(*amp!)      &
  !(*apos!)     '
  !(*quot!)     "
  !!            !

This special form of escaping ensures that it does not interfere with standard XML entity encoding mechanisms. It also provides a straightforward mechanism to detect whether the entity's replacement text is encoded, which can be useful when creating and debugging complex output filter chains.
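
The escaping scheme above can be sketched as a pair of Python functions. This is a minimal illustration built only from the encoding table: the function names are hypothetical, and the pattern used to recognise an entity reference (&name;) is a simplification of the XML name rules.

```python
import re

_CHAR_CODES = {"<": "lt", ">": "gt", "&": "amp", "'": "apos", '"': "quot"}
_CODE_CHARS = {code: char for char, code in _CHAR_CODES.items()}

# Either an entity reference (&name;) or one character that needs escaping.
_ENCODE_RE = re.compile(r"&([A-Za-z_:][\w.:\-]*);|[!<>&'\"]")
# Either a doubled '!' or a '!( ... !)' group, optionally starred.
_DECODE_RE = re.compile(r"!!|!\((\*?)([^!]*)!\)")


def encode(text):
    """Escape entity declaration content using the '!' scheme."""
    def repl(match):
        if match.group(1):  # entity reference: keep only its name
            return "!(%s!)" % match.group(1)
        char = match.group(0)
        return "!!" if char == "!" else "!(*%s!)" % _CHAR_CODES[char]
    return _ENCODE_RE.sub(repl, text)


def decode(text):
    """Reverse encode(); a starred group is a character, otherwise a reference."""
    def repl(match):
        if match.group(0) == "!!":
            return "!"
        starred, name = match.group(1), match.group(2)
        return _CODE_CHARS[name] if starred else "&%s;" % name
    return _DECODE_RE.sub(repl, text)
```

Because every literal '!' is doubled, an encoded value can be decoded in a single left-to-right pass without ambiguity.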

The doctype and internal subset conversions discussed above are illustrated in lines (3) to (11) of our example:

(03)  <!DOCTYPE article SYSTEM "http://www.docbook.org/xml/4.5/docbookx.dtd" 
(04)  [ <!ENTITY % paramEnt "
(05)      <!ATTLIST simpara level (unknown|novice|trainee|practitioner|expert) 'unknown'>
(06)    ">
(07)    <!ELEMENT exampleElement (#PCDATA)>
(08)    <!ATTLIST exampleElement yesNo (yes|no) 'no'>
(09)    %paramEnt;
(10)    <!ENTITY genEnt "an <emphasis role='bold'>internal (parsed) general</emphasis> entity.">
(11)  ]>

These lines are converted into the following XML.

(03a)    <preserve:doctype name="article" systemId="http://www.docbook.org/xml/4.5/docbookx.dtd">
(04a)      <preserve:internalParsedParameterEntityDecl name="paramEnt" deltaxml:key="entity_par_paramEnt" 
(05a)        value="
    !(*lt!)!!ATTLIST simpara level (unknown|novice|trainee|practitioner|expert) !(*apos!)unknown!(*apos!)!(*gt!)
  "
(06a)      />
(07a)      <preserve:elementDecl name="exampleElement" deltaxml:key="element_exampleElement" model="(#PCDATA)"/>
(08a)      <preserve:attributeDecl name="yesNo" deltaxml:key="attribute(exampleElement,yesNo)"
(08b)        eName="exampleElement" type="(yes|no)" value="no"/>
(09a)      <er:paramEnt parameter="yes">
(05b)      <preserve:attributeDecl name="level" deltaxml:key="attribute(simpara,level)"
(05c)        eName="simpara" type="(unknown|novice|trainee|practitioner|expert)" value="unknown"/>
(09b)      </er:paramEnt>
(10a)      <preserve:internalParsedGeneralEntityDecl name="genEnt" deltaxml:key="entity_gen_genEnt"
(10b)        value="an !(*lt!)emphasis role=!(*apos!)bold!(*apos!)!(*gt!)internal (parsed) general!(*lt!)/emphasis!(*gt!) entity."/>
(03b)    </preserve:doctype>

Note that the entity reference, in line (09), is transformed into four lines (09a), (05b), (05c), and (09b); the key point is that the definition of the entity has been expanded, and so can be compared.
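
The deltaxml:key values in the listing above appear to follow a simple naming pattern. The sketch below reproduces that pattern in Python; it is inferred from this one example only, so the function name, the kind labels, and the assumption that the pattern holds in general are all hypothetical.

```python
def declaration_key(kind, name, element_name=None):
    """Build a key string in the style shown in lines (04a) to (10a)."""
    if kind == "parameter-entity":
        return "entity_par_" + name
    if kind == "general-entity":
        return "entity_gen_" + name
    if kind == "element":
        return "element_" + name
    if kind == "attribute":
        # Attribute declarations are keyed by owning element and attribute name.
        return "attribute(%s,%s)" % (element_name, name)
    raise ValueError("unknown declaration kind: %r" % (kind,))
```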

5. Comments, Processing Instructions and Entity References revisited

In this section we look at some more examples of comments, processing instructions, and entity references. First we examine a processing instruction that is declared between the DOCTYPE and the XML body.

(12)  <?myPI Content of the processing instruction.?>

This is encoded as follows:

(12a)    <preserve:pi-and-comment region="AFTER_DTD">
(12b)      <pi:myPI>Content of the processing instruction.</pi:myPI>
(12c)    </preserve:pi-and-comment>

The main content of the document can now be processed. The following three lines contain some normal content, a comment, and a line that contains an entity reference.

(14)    <title>Lexical Preservation Filter Demo</title>
(15)    <!-- In the following paragraph we reference the entity &genEnt; -->
(16)    <para>This paragraph references &genEnt;</para>

The normal content is left unchanged, the comment is encoded into a preserve:comment element, and the entity reference is encoded into an er:genEnt element that contains its replacement text.

(14a)    <title>Lexical Preservation Filter Demo</title>
(15a)    <preserve:comment> In the following paragraph we reference the entity &genEnt; </preserve:comment>
(16a)    <para>This paragraph references <er:genEnt>an <emphasis role="bold">internal (parsed) general</emphasis> entity.</er:genEnt></para>
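
To show why this encoding is reversible, the following Python sketch turns the encoded elements back into their lexical form. It is not the shipped preservation-outfilter.xsl, only a hypothetical reverse mapping written with xml.dom.minidom: the function names are assumptions, namespace declarations are simply dropped, and only comments, PIs, and entity references are handled.

```python
from xml.dom import minidom, Node
from xml.sax.saxutils import escape, quoteattr

PRESERVE_NS = "http://www.deltaxml.com/ns/preserve"
ER_NS = "http://www.deltaxml.com/ns/entity-references"
PI_NS = "http://www.deltaxml.com/ns/processing-instructions"


def text_of(node):
    """Concatenate the direct text children of an element."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == Node.TEXT_NODE)


def restore(node):
    """Serialise a node, mapping encoded elements back to lexical form."""
    if node.nodeType == Node.TEXT_NODE:
        return escape(node.data)
    if node.nodeType != Node.ELEMENT_NODE:
        return ""
    ns, local = node.namespaceURI, node.localName
    if ns == ER_NS:
        return "&%s;" % local  # discard the expanded replacement text
    if ns == PI_NS:
        return "<?%s %s?>" % (local, text_of(node))
    if ns == PRESERVE_NS and local == "comment":
        return "<!--%s-->" % text_of(node)
    # Ordinary element: keep its attributes, minus namespace declarations.
    attrs = "".join(" %s=%s" % (name, quoteattr(value))
                    for name, value in node.attributes.items()
                    if not name.startswith("xmlns"))
    body = "".join(restore(child) for child in node.childNodes)
    return "<%s%s>%s</%s>" % (node.tagName, attrs, body, node.tagName)


ENCODED = (
    '<article xmlns:preserve="http://www.deltaxml.com/ns/preserve"'
    ' xmlns:er="http://www.deltaxml.com/ns/entity-references"'
    ' xmlns:pi="http://www.deltaxml.com/ns/processing-instructions">'
    "<preserve:comment> A comment </preserve:comment>"
    '<para>This paragraph references <er:genEnt>an <emphasis role="bold">'
    "internal (parsed) general</emphasis> entity.</er:genEnt></para>"
    "</article>"
)
restored = restore(minidom.parseString(ENCODED).documentElement)
```

Here the er:genEnt element, including its expanded content, collapses back to the reference &genEnt;, while the comparator was still able to compare the replacement text.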

6. CDATA section and Defaulted attributes

The CDATA section is converted into a preserve:cdata element whose content contains the CDATA characters.

The preserve:defaultAttributes attribute is added to any element that contains defaulted attributes. Its job is to record the attribute names, so that they can be stripped out later.

The following two lines illustrate the use of a CDATA section and of defaulted attributes. Here the simpara element has been provided with a defaulted attribute whose name is level and whose value is unknown. Note that this comes from lines (05) and (09) of the example input.

(17)    <para><![CDATA[Content of the CDATA Section text]]></para>
(18)    <simpara>An overridden simpara with a defaulted level attribute.</simpara>

These lines are converted to:

(17a)    <para><preserve:cdata>Content of the CDATA Section text</preserve:cdata></para>
(18a)    <simpara level="unknown" preserve:defaultAttributes="{}level">An overridden simpara with a defaulted level attribute.</simpara>
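
The later stripping of defaulted attributes can be sketched with Python's xml.etree.ElementTree. This is an illustration, not the product's output filter: the function name is hypothetical, and the sketch assumes that preserve:defaultAttributes holds a whitespace-separated list of names in '{namespace}local' form, which is inferred from the single value {}level shown in line (18a).

```python
import xml.etree.ElementTree as ET

PRESERVE_NS = "http://www.deltaxml.com/ns/preserve"
DEFAULTS_ATTR = "{%s}defaultAttributes" % PRESERVE_NS


def strip_defaulted(elem):
    """Remove the attributes listed in preserve:defaultAttributes."""
    names = elem.attrib.pop(DEFAULTS_ATTR, "")
    for name in names.split():
        # ElementTree stores un-namespaced attributes without the '{}' prefix.
        key = name[2:] if name.startswith("{}") else name
        elem.attrib.pop(key, None)
    return elem


simpara = ET.fromstring(
    '<simpara xmlns:preserve="http://www.deltaxml.com/ns/preserve"'
    ' level="unknown" preserve:defaultAttributes="{}level">'
    "An overridden simpara with a defaulted level attribute.</simpara>"
)
strip_defaulted(simpara)
```

After the call, simpara carries neither the level attribute nor the bookkeeping attribute, matching line (18) of the original input.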

7. Ignorable whitespace

Ignorable whitespace nodes, when identified by an XML Schema or DTD, are wrapped in a preserve:ignorable element. The following XML snippet shows an ignorable whitespace node, annotated with 'LF' and 'SP' to indicate 'linefeed' and 'space' respectively:

(13)  <article>LF
(14)  SP<title>Lexical Preservation Filter Demo</title>

These lines are converted to:

(13a)  <article><preserve:ignorable>LF
(14a)  SP</preserve:ignorable><title>Lexical Preservation Filter Demo</title>

8. Completing the example file

The last few lines of the example input complete the body of the document and add a final post comment.

(19)  </article>
(20)  <!-- A post XML body comment -->

These lines are converted thus:

(20a)    <preserve:pi-and-comment region="AFTER_BODY">
(20b)      <preserve:comment> A post XML body comment </preserve:comment>
(20c)    </preserve:pi-and-comment>
(19a)  </article>

9. Implementation Notes

The LexicalPreservationBase class provides the underpinning implementation of the input filtering. Its JavaDoc provides a detailed account (based on the same example) of how this works and can be configured in practice.

The LexicalPreservationBase base class is specialised for use in both the Document Comparator and the S9 version of the Toolkit Comparator. These specialisations enable the lexical preservation to be configured by the LexicalPreservationConfig class.

Legacy API: Note that the LexicalPreservationBase class can be specialised for use with the original core (and raw API) versions of XML Compare, as it is implemented by a SAX2 filter. However, in order to make use of the associated lexical preservation output filter (preservation-outfilter.xsl) a Saxon processor with the doctype extension is required.

When using the pipeline configuration formats DXP or DCP, the lexicalPreservation element is used to set the properties of the LexicalPreservationConfig class declaratively.