Lexical Preservation

1. Introduction

When parsing an XML document with a typical XML parser into a model, such as a Document Object Model (DOM) tree, most if not all of its 'lexical' aspects (i.e. syntactic form) are lost. Examples of 'lexical' data include CDATA sections, entity references and 'ignorable' whitespace. However, when configured to do so our 'Core S9API' comparators (PipelinedComparatorS9 and DocumentComparator) can preserve much of this lexical information by:

  1. converting such markup so it can be compared (i.e. encode the 'lexical' markup into XML elements),
  2. providing an opportunity for identified differences to be resolved, and
  3. reinstating the original markup (i.e. decode the encoded elements, which should have been resolved).

Note: Our underpinning comparison technology only processes XML elements, attributes, and text nodes. This means that processing instruction and comment nodes also need to be transformed into 'lexical' preservation elements before the comparison, if they are to be kept.
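
The loss of lexical form described above can be seen with the JDK's built-in JAXP DOM parser alone. The following is a minimal sketch, not part of the comparator API: the class name LexicalLossDemo and its helper method are illustrative assumptions. It parses three documents that differ only in lexical form (escaped text, a CDATA section, and an entity reference) and shows that their text content is indistinguishable after parsing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative class (not part of any DeltaXML API).
class LexicalLossDemo {

    /** Parses the XML with default JAXP settings and returns the root element's text content. */
    static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<p>London &lt; Paris</p>";
        String cdata = "<p><![CDATA[London < Paris]]></p>";
        String entity = "<!DOCTYPE p [<!ENTITY city \"London\">]><p>&city; &lt; Paris</p>";

        // All three inputs yield the same text once parsed.
        System.out.println(rootText(escaped).equals(rootText(cdata)));  // prints true
        System.out.println(rootText(escaped).equals(rootText(entity))); // prints true
    }
}
```

Because the parsed forms are identical, a comparison that works on the parsed model cannot report or restore these differences unless they are first encoded as elements.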

Some aspects of XML are not reported by typical XML parsers; it is therefore not generally feasible to ensure complete preservation of all lexical aspects of an input file. These aspects include:

  • whether single or double quotes are used for attribute values
  • attribute order within a start tag
  • any whitespace within a start tag or end tag, for example whitespace or line breaks between attributes
  • any whitespace outside of the root element, including whitespace in the DTD internal subset
  • whether or not an XML Declaration was present in the input
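
As a hedged illustration of the first three points, the sketch below (JDK only; the class and method names are assumptions made for this example) parses two start tags that differ in quote style, attribute order and whitespace, and checks whether the resulting DOM elements can be told apart.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Illustrative class (not part of any DeltaXML API).
class StartTagDetailDemo {

    /** Parses the XML with default JAXP settings and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    /** True when the two inputs produce equal DOM root elements. */
    static boolean indistinguishableAfterParsing(String first, String second) {
        return parseRoot(first).isEqualNode(parseRoot(second));
    }

    public static void main(String[] args) {
        String first = "<item id='1' lang=\"en\"/>";
        String second = "<item   lang='en'\n      id=\"1\" />";
        // Quote style, attribute order and in-tag whitespace are not visible in the DOM.
        System.out.println(indistinguishableAfterParsing(first, second)); // prints true
    }
}
```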

Some of the things that are reported by the parser include:

  • doctype declarations
  • information about the file encoding and XML version (whether from the XML Declaration or otherwise)
  • entity reference information (although the parser expands entity references, it still reports where each reference occurred)
  • subset declarations for elements, attributes and entities
  • use of CDATA sections
  • ignorable whitespace
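
Several of these items are exposed by the standard SAX LexicalHandler interface. The following minimal sketch shows the kind of events a parser reports; it uses only the JDK, and the class name and event labels are assumptions made for this example rather than anything used by the comparators.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Illustrative class (not part of any DeltaXML API).
class LexicalEventsDemo {

    /** Parses the XML and records the lexical events reported by the SAX parser. */
    static List<String> lexicalEvents(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override public void startDTD(String name, String publicId, String systemId) {
                events.add("doctype " + name);
            }
            @Override public void startEntity(String name) {
                events.add("entity-ref " + name);
            }
            @Override public void startCDATA() {
                events.add("cdata");
            }
            @Override public void comment(char[] ch, int start, int length) {
                events.add("comment " + new String(ch, start, length));
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Standard SAX2 property for registering a LexicalHandler.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE p [<!ENTITY city \"London\">]>"
                + "<p><!--note-->&city;<![CDATA[a < b]]></p>";
        System.out.println(lexicalEvents(xml));
    }
}
```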

2. When to Preserve?

XML document comparison can be performed for a number of reasons, such as to highlight changes between two versions of a document for:

  1. Onward review and editing. The result of the comparison is a document that is intended to be used for onward editing. Therefore, it is useful if this document is as close to the original input documents as possible. For example, it is important not to expand entity references and CDATA sections.
  2. Publication to end users. The result of the comparison is a document that is intended for onward publication. Therefore it is useful to remove differences that are not going to affect the end publication before comparison. For example, differences in the quantity of whitespace between words in a paragraph may not be important (e.g. for HTML output), and can thus be normalised.
  3. Amendment and efficient archiving. The result of the comparison is a patch file that can be used to go from the output to either of the input documents, given the other input document. Here the intention is not to view the changes, but to accurately restore documents. Therefore, the granularity of comparison and change will be configured for faithful restoration, rather than highlighting change. For example, it may be safer to patch at the block level, rather than on in-line elements.

Note that some aspects of lexical preservation may work for patching if suitable care is taken. This is discussed in the Using Deltas for XML Versioning (diff and patch) sample documentation.

3. What to Preserve?

The choice of which lexical elements to preserve depends upon the context. The lexical preservation comparator feature (and supporting filters) can be configured to preserve all or some of the lexical data, including:

  • XML Declaration, Doctype and internal subset. It is possible to preserve most of the XML Declaration and Doctype information, including internal subset entity reference declarations. However, the standalone 'attribute' of the XML Declaration is not preserved due to limitations in the serializing technology that is used. For further details on doctype preservation please see the Preserving Doctypes sample.
  • Entity References. It is possible to both preserve entity references and know if their underpinning content has changed. For further details on entity reference preservation please see the Preserving Entity References sample.
  • Processing Instructions and Comments. It is possible to preserve both processing instructions and comments. This might be expected default behaviour, however the underpinning comparison technology compares only element, attribute, and text nodes. For further details on processing instruction and comment preservation please see the Preserving PIs and Comments sample.
  • Whitespace. It is possible to preserve much of the whitespace as discussed in the Managing White Space guide. Here we present a sample filter for ignoring whitespace modifications, so long as the whitespace is deemed to be insignificant.

3.1. Predefined Preservation Modes

For the moment we present a simplified view of lexical preservation which was first introduced in our format specific products (such as DITA Compare) and is now available in Core also. The implementation details section contains the lower level account, if this is required.

The table below summarises the modes of preservation that are supported by Core in terms of their effect on how various items in the file are preserved. Note that the latter five modes of preservation are also used by our format specific products. The default preservation mode is 'roundTrip'.

Preservation Mode | Preserve Comments & Processing Instructions | Preserve XML Declaration & Doctype | Preserve defaulted attributes | Preserve CDATA sections & whitespace | Preserve entity replacement text | Preserve entity references
base              | off | off | off | off | on  | off
document          | on  | on  | off | off | on  | off
docAndAttrib      | on  | on  | on  | off | on  | off
roundTrip         | on  | on  | on  | on  | off | on
entityRef         | on  | on  | on  | on  | off | on (cont.)
nestedEntityRef   | on  | on  | on  | on  | off | on (nest.)
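
For readers who prefer to see the table as data, the sketch below restates it as a lookup from mode label to the set of items that are 'on'. The enum and class are hypothetical and exist only for this illustration; they are not the LexicalPreservationConfig API, and the '(cont.)' and '(nest.)' refinements of the last column are deliberately not modelled.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical restatement of the table above; not a DeltaXML class.
class PreservationModeTable {

    enum Item {
        COMMENTS_AND_PIS, XML_DECL_AND_DOCTYPE, DEFAULTED_ATTRIBUTES,
        CDATA_AND_WHITESPACE, ENTITY_REPLACEMENT_TEXT, ENTITY_REFERENCES
    }

    /** Returns the items marked 'on' for the given preservation mode label. */
    static Set<Item> itemsFor(String mode) {
        switch (mode) {
            case "base":
                return EnumSet.of(Item.ENTITY_REPLACEMENT_TEXT);
            case "document":
                return EnumSet.of(Item.COMMENTS_AND_PIS, Item.XML_DECL_AND_DOCTYPE,
                        Item.ENTITY_REPLACEMENT_TEXT);
            case "docAndAttrib":
                return EnumSet.of(Item.COMMENTS_AND_PIS, Item.XML_DECL_AND_DOCTYPE,
                        Item.DEFAULTED_ATTRIBUTES, Item.ENTITY_REPLACEMENT_TEXT);
            case "roundTrip":
            case "entityRef":       // additionally compares entity content (cont.)
            case "nestedEntityRef": // additionally compares nested references (nest.)
                return EnumSet.of(Item.COMMENTS_AND_PIS, Item.XML_DECL_AND_DOCTYPE,
                        Item.DEFAULTED_ATTRIBUTES, Item.CDATA_AND_WHITESPACE,
                        Item.ENTITY_REFERENCES);
            default:
                throw new IllegalArgumentException("Unknown preservation mode: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(itemsFor("base"));
        System.out.println(itemsFor("roundTrip"));
    }
}
```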

The effects of turning these preservation items 'on' or 'off' are now discussed in the following list, where the use of 'this column' in an item's description refers to the corresponding column in the above table.

Preserve Comments, Processing Instructions
Comments and Processing Instructions (PIs) in the 'B' document are preserved in the result, whereas comments and PIs in the 'A' document (that are not also in the 'B' document) do not appear in the result. The exception here is that PIs that represent oXygen tracked changes are removed prior to comparison so that they do not get confused with the changes identified by the comparator. Further, neither comments nor PIs in the internal DTD subset are currently preserved.
Preserve XML Declaration & Document Type (DTD & internal subset)
Most of the XML declaration, doctype and internal subset data is preserved (for the preservation modes that contain an 'on' in this column). A current limitation is that comments and processing instructions within an internal subset are lost. Another limitation is that the XML declaration's standalone marking is not preserved.
Preserve defaulted attributes
Default attribute values can be specified in a DTD and these are automatically put onto the elements in the document by the parser. If they are preserved as defaulted attributes (i.e. an 'on' in this column), then these default values will not appear in the result document as long as they were default attributes in both inputs and their value is unchanged.
Preserve CDATA sections and whitespace
CDATA (character data) sections are preserved in the result (for the preservation modes that contain an 'on' in this column). When DTD or XML Schema grammars are available, ignorable whitespace nodes are wrapped in a preserve:ignorable element; this prevents them from being affected by the NormalizeSpace filter and allows their serialization to be controlled through the PreservationOutputType and PreservationProcessingMode settings.
Preserve entity replacement text
The default behaviour of an XML parser is to replace entity references with their content; we refer to such content as the entity's (or entity reference's) replacement text. This is a slightly unusual preservation item, as its logic should really be the other way around; i.e. it should be called remove entity replacement text. If this were the case, disabling all preservation items would have the expected effect of loading a document's content. However, disabling all preservation items has the effect of completely removing entity references from the output, as neither the reference nor the replacement text will be kept. This is why we have introduced the 'base' mode of preservation, which turns off all the preservation items other than the replacement text preservation item.
Preserve entity references
General parsed entities are preserved as entities - rather than expanded (i.e. replaced by their content) - in the result document when an 'on' is in this column. This is usually what you want when you continue to edit the document. For example, consider two documents that differ in how the name of a city - London - is represented: in the first document the city is written as the string 'London', and in the second document the city is written as an entity reference '&city;' whose value is the string 'London'. In this case, modes with an 'on' in this column mark the two representations of the city London as different, because the unexpanded entity is different from the text, whereas modes with an 'off' in this column mark the two representations as the same, because the expanded entity reference is the same as the text.
Preserve entity references (content)
This is intended only for expert users who understand how entities work. In roundTrip mode you will not see changes in entity references in the (unusual) situation where the definition of these entities is different in the two documents. For example, consider two documents containing the entity reference '&city;' that differ only in the value of the 'city' entity, which has changed from 'London' in one document to 'Birmingham' in the other. Both of these documents use the same '&city;' entity reference, which would be marked as unmodified as it is identical from the round trip (source document) perspective. If you need to see such changes, then use a mode with an 'on' in this column. In the result document, there can only be one entity definition and this will be either from the original ('A' document) or new ('B' document). Therefore the entities are guaranteed to be the same in the result document, and so any difference is shown by adding and removing an identical element.
Preserve entity references (nested)
This is intended only for expert users who understand the way one entity can reference another. An 'on' in this column means that subtle changes in entity reference structure are shown. The full structure of nested entities is preserved and compared, and any changes are shown. This is useful in some complex cases where the overall semantics of an entity does not change, but the way in which it is defined changes. For example, consider a document that contains a reference to the entity '<!ENTITY ent "&nested1;">', where the 'nested1' entity has the value 'val'. Let a second version of the document be the same as the first, except that the nested entity reference is renamed to '&nested2;'. In this case, both the syntactic and semantic analyses will miss this change, as the syntactic analysis compares '&ent;' against itself and the semantic analysis compares the text 'val' against itself. An 'on' in this column means the comparator will detect such changes in the internal definition of an entity, and mark them using the same scheme as above: the addition and deletion of an identical entity reference.
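
The London and Birmingham examples above can be restated as three equality rules. The sketch below is a simplified model written for this guide's examples, not the comparator's implementation: the Piece type and the three method names are assumptions, and nested entity references are not modelled.

```java
import java.util.Objects;

// Simplified, hypothetical model of the entity reference comparisons described above.
class EntityRefCompareSketch {

    /** Literal text when entityName is null; otherwise a reference such as &city; plus its replacement text. */
    static final class Piece {
        final String entityName;
        final String text;
        Piece(String entityName, String text) {
            this.entityName = entityName;
            this.text = text;
        }
    }

    static Piece literal(String text) {
        return new Piece(null, text);
    }

    static Piece reference(String name, String replacementText) {
        return new Piece(name, replacementText);
    }

    /** 'off' in the entity references column: only the expanded text is compared. */
    static boolean sameExpandedText(Piece a, Piece b) {
        return a.text.equals(b.text);
    }

    /** roundTrip: references are compared by name, literal text by value. */
    static boolean sameReference(Piece a, Piece b) {
        if (a.entityName == null && b.entityName == null) {
            return a.text.equals(b.text);
        }
        return Objects.equals(a.entityName, b.entityName);
    }

    /** 'on (cont.)': the reference name and its replacement text must both match. */
    static boolean sameReferenceAndContent(Piece a, Piece b) {
        return Objects.equals(a.entityName, b.entityName) && a.text.equals(b.text);
    }

    public static void main(String[] args) {
        Piece text = literal("London");
        Piece cityLondon = reference("city", "London");
        Piece cityBirmingham = reference("city", "Birmingham");

        System.out.println(sameExpandedText(text, cityLondon));                  // prints true
        System.out.println(sameReference(text, cityLondon));                     // prints false
        System.out.println(sameReference(cityLondon, cityBirmingham));           // prints true
        System.out.println(sameReferenceAndContent(cityLondon, cityBirmingham)); // prints false
    }
}
```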

Providing one of these preservation mode labels as a string input to the LexicalPreservationConfig class's constructor creates a LexicalPreservationConfig object that is configured to support that preservation mode. This configuration object can then be set on a PipelinedComparatorS9 or a DocumentComparator by calling their setLexicalPreservationConfig method. For more advanced use, the Pipelined Comparator's getLexicalPreservationOutputFilters method can be used to control the position in the output filter chain at which lexical preservation occurs (see the Preserving Entity References sample for an example of this use case).

3.2. Pipeline Definition Elements

The DXP and DCP XML file formats allow configuration for the PipelinedComparatorS9 and DocumentComparator respectively. Here, the lexicalPreservation element controls LexicalPreservationConfig properties. This is achieved by first declaring the default behaviours for all lexical 'artifacts' and then providing a set of overrides for specific types of lexical artifact.

An overview of the DXP/DCP lexicalPreservation element structure is provided in the DCP Schema Guide.

4. Lexical Preservation Format

We now briefly outline the way in which comments, processing instructions, and parser-reported information are added to the document before comparison. Further details are presented in the Lexical Preservation Format document; however, these details are not required for straightforward usage.

4.1. Namespaces

In order to keep the elements that are used to retain the 'lexical' information separate from other elements in the document, we introduce three namespaces.

Usual prefix | Namespace URI | Description
preserve | http://www.deltaxml.com/ns/preserve | All generated lexical preservation markup uses this namespace unless one of those mentioned below
er | http://www.deltaxml.com/ns/entity-references | Entity references are represented as elements using this namespace, where the entity's name is used for the local name
pi | http://www.deltaxml.com/ns/processing-instructions | Processing instructions are represented as elements using this namespace, where the PI tag is used for the local name

Some examples of prefixed element names using the above namespaces include preserve:doctype, preserve:cdata, er:myEnt, and pi:myPi, which are used for representing a DOCTYPE, a CDATA Section, a myEnt entity reference, and a myPi processing instruction respectively.
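
To make the naming scheme concrete, the sketch below builds elements with these qualified names using the standard DOM API. It is only an illustration of the names: the class and helper methods are assumptions, the 'PI tag' is assumed to mean the processing instruction's target, and the attributes and content that the real encoding places on such elements are described in the Lexical Preservation Format document and are not reproduced here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative class (not part of any DeltaXML API).
class EncodedNamesSketch {

    static final String ER_NS = "http://www.deltaxml.com/ns/entity-references";
    static final String PI_NS = "http://www.deltaxml.com/ns/processing-instructions";

    /** Creates an empty DOM document to own the example elements. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create a DOM document", e);
        }
    }

    /** Names an element after an entity reference, e.g. &myEnt; becomes er:myEnt. */
    static Element entityRefElement(Document doc, String entityName) {
        return doc.createElementNS(ER_NS, "er:" + entityName);
    }

    /** Names an element after a processing instruction target, e.g. <?myPi ...?> becomes pi:myPi. */
    static Element piElement(Document doc, String target) {
        return doc.createElementNS(PI_NS, "pi:" + target);
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        System.out.println(entityRefElement(doc, "myEnt").getTagName()); // prints er:myEnt
        System.out.println(piElement(doc, "myPi").getTagName());         // prints pi:myPi
    }
}
```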

4.2. Data relocation

Some marked up items cannot be placed at their original locations whilst maintaining a well-formed result. This primarily relates to information outside the root element. For these areas the markup is moved inside the root element and contained in the first few children of the root element or the last child. Generally only comments and processing instructions can appear outside the root element. However, the internal subset contains other items, as does the XML declaration. When all types of information are present the output will have this structure:

<root>
  <preserve:xmldecl xml-version="1.0" encoding="UTF-8" standalone="no"/>
  <preserve:comments-and-pis region="BEFORE_DTD"> ... </preserve:comments-and-pis>
  <preserve:doctype name="root"> ... </preserve:doctype>
  <preserve:comments-and-pis region="AFTER_DTD"> ... </preserve:comments-and-pis>
  <child> first child element of original root element ... </child>
  ...
  <child> last child element of original root element ... </child>
  <preserve:comments-and-pis region="AFTER_BODY"> ... </preserve:comments-and-pis>
</root>
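
If encoded output is left in this form, a consumer can locate the relocated items by namespace and local name. The following sketch uses only the JDK; the class and method names are assumptions, and the example input declares the preserve prefix on the root element purely so that the fragment is well-formed on its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative class (not part of any DeltaXML API).
class RelocatedItemsSketch {

    static final String PRESERVE_NS = "http://www.deltaxml.com/ns/preserve";

    /** Lists the region attribute of each preserve:comments-and-pis element, in document order. */
    static List<String> regions(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based lookup
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getDocumentElement()
                    .getElementsByTagNameNS(PRESERVE_NS, "comments-and-pis");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(((Element) nodes.item(i)).getAttribute("region"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse encoded document", e);
        }
    }

    public static void main(String[] args) {
        String encoded = "<root xmlns:preserve='" + PRESERVE_NS + "'>"
                + "<preserve:comments-and-pis region='BEFORE_DTD'/>"
                + "<preserve:doctype name='root'/>"
                + "<preserve:comments-and-pis region='AFTER_DTD'/>"
                + "<child/>"
                + "<preserve:comments-and-pis region='AFTER_BODY'/>"
                + "</root>";
        System.out.println(regions(encoded)); // prints [BEFORE_DTD, AFTER_DTD, AFTER_BODY]
    }
}
```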

5. Implementation Details

5.1. Compatibility with Releases prior to 7.0

Lexical preservation is now a feature setting on a PipelinedComparatorS9 or DocumentComparator class, rather than being an XMLFilter that is added at the start of the input pipelines. This method of preserving items replaces the previous LexicalPreservation filter which has been deprecated and may be removed in a future release. Note that (as of Release 7.2) if the original LexicalPreservation filter is manually added to the input FilterChain, it should be followed immediately by the built-in XSLT filter 'lexical-whitespace.xsl'. This filter is responsible for rationalising added whitespace information, according to whether it was determined from a DTD or XML Schema.

5.2. Exploiting XML Schemas

As mentioned earlier, Lexical Preservation allows ignorable whitespace to be treated specially: it can either be ignored (the most common need) or preserved using special markup. Information used to determine whether whitespace is ignorable can be extracted from a DTD or XML Schema.
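
The way a grammar makes whitespace 'ignorable' can be observed with a plain SAX parser. The sketch below is JDK-only and is not the comparator's implementation: the class and method names are assumptions, and DTD validation is switched on here simply to make the parser's use of the DTD explicit. It counts the characters the parser reports as ignorable whitespace separately from ordinary character data.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class (not part of any DeltaXML API).
class IgnorableWhitespaceDemo {

    /** Returns { ignorable whitespace character count, ordinary character count }. */
    static int[] countCharacters(String xml) {
        final int[] counts = new int[2];
        DefaultHandler handler = new DefaultHandler() {
            @Override public void ignorableWhitespace(char[] ch, int start, int length) {
                counts[0] += length; // whitespace inside element-only content
            }
            @Override public void characters(char[] ch, int start, int length) {
                counts[1] += length; // all other character data
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // assumption for this sketch: validate against the DTD
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
        return counts;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE list [<!ELEMENT list (item)*><!ELEMENT item (#PCDATA)>]>"
                + "<list>\n  <item>a</item>\n</list>";
        int[] counts = countCharacters(xml);
        System.out.println("ignorable: " + counts[0] + ", other: " + counts[1]);
    }
}
```

The DTD declares that list contains only item elements, so the line breaks and indentation between them are reported as ignorable, while the 'a' inside item is reported as ordinary character data.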

In the case of a DTD declaration in the input XML, this information is available by default. If, however, an XML Schema is used to define the input XML grammar instead, a parser property must be set to either:

  • associate a namespace (or no-namespace) with an XSD file, or
  • use an xsi:schemaLocation or noNamespaceSchemaLocation 'hint' attribute in the input XML

For example, using the Document Comparator API:

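// Associate the mini-xhtml namespace with its XSD file so that
// schema information (e.g. ignorable whitespace) is available.
DocumentComparator dc = new DocumentComparator();
dc.setParserProperty("http://apache.org/xml/properties/schema/external-schemaLocation",
                     "http://www.deltaxml.com/ns/mini-xhtml mini-xhtml.xsd");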

Similarly, with the DCP format, the following XML would have the same effect:

...
<advancedConfig>
  <parserProperties>
    <property name="http://apache.org/xml/properties/schema/external-schemaLocation"
      literalValue="http://www.deltaxml.com/ns/mini-xhtml mini-xhtml.xsd"/>
  </parserProperties>
</advancedConfig>
...

The full set of parser properties available is listed in the Apache Xerces documentation.

5.3. Specifying Preserved Input Items

It is possible to use one of the simplified lexical preservation modes as discussed earlier, but if this is not sufficient for your needs, it is possible to have more control. The following table specifies the preservation items that can be either enabled or disabled (i.e. set to true or false respectively) on a LexicalPreservationConfig object.

Preserve Item* | Description | Input filtering details
attributes | Enables defaulted attributes to be identified (and then removed from the output). | Adds a preserve:defaultAttributes attribute to any element that has defaulted attributes, which contains the name(s) of the defaulted attributes.
cdata | Enables CDATA sections to be preserved. | Transforms CDATA sections into preserve:cdata elements for comparison.
comment | Enables comments to be preserved. | Transforms comments into preserve:comment elements for comparison.
contentModel | Enables DTD/Schema information about the element content model to be preserved. | Adds a preserve:grammar attribute to the root element and preserve:mixed-content attributes where required.
doctype | Enables the DOCTYPE and its internal subset data to be preserved. | Transforms the DOCTYPE and its internal subset to a preserve:doctype XML element.
entityRef | Enables entity references to be preserved. | Transforms entity references into an er:entityname element for comparison.
entityReplacementText | Enables entity reference expansion. | Expands entity references either in the text or inside an er:entityname element.
ignorableWhitespace | Ensures that 'ignorable' whitespace is preserved. | Wraps ignorable whitespace characters in a preserve:ignorable element.
nestedEntityRef | Enables changes in entityRef definitions to be detected. | Transforms entity references that appear within an expanded entity reference into an er:entityname element for comparison.
processInst | Enables processing instructions to be preserved. | Transforms a processing instruction into a pi:tag element for comparison.
xmlDecl | Enables the version and encoding XML declaration attributes to be preserved. | Transforms the XML declaration into a preserve:xmldecl element.

*WARNING: at least one of the entityReplacementText and entityRef items should be enabled; if neither of these items is enabled then entity references are removed completely from the document. Further, it only makes sense to enable the nestedEntityRef item if the entityRef item is enabled, as it is only in this context that the nestedEntityRef item has any effect.
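
The two constraints in this warning are easy to check before a configuration is applied. The helper below is a hypothetical sketch: it works on a plain set of item names from the table above and is not part of the LexicalPreservationConfig API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-flight check; not a DeltaXML class.
class PreserveItemChecks {

    /** Returns a warning for each constraint that the set of enabled items violates. */
    static List<String> warnings(Set<String> enabledItems) {
        List<String> result = new ArrayList<>();
        boolean entityRef = enabledItems.contains("entityRef");
        boolean replacementText = enabledItems.contains("entityReplacementText");
        if (!entityRef && !replacementText) {
            result.add("Entity references would be removed completely: "
                    + "enable entityRef or entityReplacementText.");
        }
        if (enabledItems.contains("nestedEntityRef") && !entityRef) {
            result.add("nestedEntityRef has no effect unless entityRef is enabled.");
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> items = new HashSet<>(Arrays.asList("cdata", "comment", "nestedEntityRef"));
        for (String warning : warnings(items)) {
            System.out.println(warning);
        }
    }
}
```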

5.4. Specifying the Processing Modes for the Preserved Items

When one of the simplified lexical preservation modes is not sufficient for your needs, it is possible to have more control over the 'lexical' difference resolution processing. Here, you can specify whether and how each 'lexical' element is resolved by setting its corresponding LexicalPreservationConfig object processing mode to one of the PreservationProcessingMode values. These values include 'CHANGE' (also known as 'skip'), which leaves the change markings unresolved, and 'B', which chooses the second input document's version of the 'lexical' markup.

Note that if the changes in an encoded lexical item are skipped then it is your responsibility to either: appropriately resolve these changes before they are decoded by the final output-processing/serialisation filter; or configure the output processing to leave these changes encoded by setting their 'lexical output type' to 'encoded'.

The following table specifies the 'properties' used to specify whether and how each of the preservation items is resolved. Here, the processing mode names are as used in a configuration properties file. The equivalent API method calls are of the form 'setProcessingModeNameProcessingMode', where the ProcessingModeName is either the same as, or an expanded version of, the name provided in the table (ignoring capitalisation).

(Output) Processing Mode Name | Description
default | The mode that all other output processing mode items default to.
xmldecl | The processing mode used for handling XML declaration data.
doctype | The processing mode for the handling of the doctype data and internal subset as a whole.
outerPiCom | The processing mode for handling processing instructions and comments outside of the document's body.*
processInst | The processing mode for handling processing instructions in the document's body.*
comment | The processing mode for handling comments in the document's body.*
ignorableWhitespace | The processing mode for handling ignorable whitespace.
attributes | The processing mode for handling all attributes (this is not preserve:defaultAttributes specific). See the DefaultAttProcessingMode for more details.
cdata | The processing mode for handling CDATA Sections.
entityRef | The processing mode for handling entity references in the document's body.
advancedERU | An advanced mode for controlling some specialist use cases, where both the entity references and their replacement text are compared. See the AdvancedEntityRefUsage for more details.
*If the outerPiCom mode is set to 'change' then the PIs and comments are handled in accordance with the pi and comment output modes respectively.
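
The role of the 'default' entry can be pictured as a simple fallback lookup. The sketch below is a hypothetical model that uses a plain map keyed by the names in the table; it does not reproduce the actual properties file format or the LexicalPreservationConfig API, and it does not model the outerPiCom footnote.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the 'default' fallback; not a DeltaXML class.
class ProcessingModeDefaults {

    /** Returns the mode configured for the item, or the 'default' mode when none is set. */
    static String modeFor(Map<String, String> configured, String item) {
        String explicit = configured.get(item);
        return explicit != null ? explicit : configured.get("default");
    }

    public static void main(String[] args) {
        Map<String, String> configured = new HashMap<>();
        configured.put("default", "B");      // resolve to the second input's version
        configured.put("doctype", "CHANGE"); // leave doctype changes unresolved

        System.out.println(modeFor(configured, "doctype")); // prints CHANGE
        System.out.println(modeFor(configured, "cdata"));   // prints B
    }
}
```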

5.5. Specifying the Output Types for the Preserved Items

When one of the simplified lexical preservation modes, as discussed earlier, is not sufficient for your needs, it is possible to specify whether these items should remain encoded or be decoded; i.e. converted back into their original 'lexical' forms. Here, you specify the output type of each 'lexical' construct by setting its corresponding LexicalPreservationConfig object output type to one of the PreservationOutputType values.

The following table specifies whether the preservation items should be decoded. Here, the output type names are as used in a configuration properties file. The equivalent API method calls are of the form 'setOutputTypeNameOutputType', where the OutputTypeName is either the same as, or an expanded version of, the name provided in the table (ignoring capitalisation).

Output Type Name | Description
default | The output type that all other output type items default to.
xmldecl | The output type used for handling XML declaration data.
doctype | The output type for the handling of the doctype data and internal subset as a whole.
outerPiCom | The output type for handling processing instructions and comments outside of the document's body.*
processInst | The output type for handling processing instructions in the document's body.*
comment | The output type for handling comments in the document's body.*
ignorableWhitespace | The output type for handling ignorable whitespace.
attributes | The output type for handling all attributes (this is not preserve:defaultAttributes specific).
cdata | The output type for handling CDATA Sections.
entityRef | The output type for handling entity references in the document's body.
*If the outerPiCom mode is set to 'change' then the PIs and comments are handled in accordance with the pi and comment output modes respectively.