Using Deltas for XML Versioning (diff and patch)

1. Introduction

DeltaXML can be used to provide a 'diff and patch' capability for any XML document or data file. The diff would be generated by comparing two versions and producing a delta that contains only the differences, known as a changes-only delta. The patch capability is provided by the recombine operation.

This 'diff and patch' capability can be useful where updates to XML files need to be sent using minimum bandwidth, for example in mobile or satellite applications. Another application is in situations where it is necessary to keep a full audit trail of changes to an XML file using minimum storage space.

The DeltaXML DeltaV2 format comes in two main flavours: full context and changes-only.

The full context delta efficiently stores the entire contents of the input files. It can be used to create a difference report or a document with changes marked-up.

The changes-only delta format stores only what has changed and the necessary context for those changes. It is an ideal format for storing differences between different versions of XML documents.

Using DeltaXML Core to Changes-only Delta

DeltaXML Core comes bundled with a recombiner which can recreate Document ‘B’ with Document ‘A’ and the changes-only delta. The recombiner can also be used in reverse to recreate Document ‘A’ from Document ‘B’ and the changes-only delta. This process is heavily used at DeltaXML as part of our comprehensive roundtrip test suite (to ensure the comparator is 100% accurate).

Using DeltaXML Core for Recombination

More usefully, these tools can be used when there are many versions of a document. There are various strategies that can be used to achieve this.

1.1. Using Changes-only Generated Deltas Against Version 1

The most obvious approach is storing the original version of the document, and the changes-only delta between the versions.

XML Versioning Against Version 1

Any version can then easily be recreated by using the recombiner with the stored version and the relevant changes-only delta.

Using Recombine to Regenerate Version 3 from Version 1

The big issue with this approach is that as the document gets more revised the deltas will get larger and larger, potentially to the point where the delta is effectively the entire document.

1.2. Using Changes-only Deltas Generated Against the Previous Version

Another possible approach is to store the first and last versions, and changes-only deltas between each version and the last.

XML Versioning Against Previous Version

Any version can then be recreated by either forward recombining from version 1, or reverse recombining form the latest version. The issue here is that the more versions you have, the more recombinations it will potentially take to reconstruct a version.

Using Multiple Recombines to Regenerate Version 3 from Version 1

The sensible solution here would be to store the first, latest and each nth version (what n is depends on your system’s use/data).

2. Lexical Preservation

DeltaXML Core expects to be run in an environment where either its inputs or its output may have been processed by XSLT filters, which in particular conforms to the XSLT processing model's XDM tree model (as specified in XQuery 1.0 and XPath 2.0 Data Model). This model does not contain entries that correspond to all the XML node types, such as 'DOCTYPEs', 'entities', and 'CDATA sections and ignorable whitespace', which are removed, expanded and converted to text characters respectively as part of the XSLT parsing.

Our lexical preservation filters can preserve the four mentioned XML node types, by converting them to and from XML element nodes. These preservation filters also enable processing instructions and comments, which are otherwise ignored by our comparator technology, to be retained and compared in a similar manner. For further information please refer to the How to Preserve Doctype Information sample, the How to Preserve Processing Instructions and Comments sample, and the LexicalPreservation Java API documentation.

In order for the recombiner to work with filtered documents, its inputs need to be identical to those used to generate the delta. The sample ensures this by storing the marked-up version (i.e. after filtering with the LexicalPreservation filter), see graphic below (the inputs with '(Preserved)' are marked-up with the lexical preservation).

Lexical Preservation Input Filtering

The output filter is only run as the last stage when retrieving a version from the system, see below:
Lexical Preservation Output Filtering

3. Implementation

The sample, which is bundled with DeltaXML Core, is an implementation of the second approach discussed in the previous section. It is made up of three classes:

  • DeltasForVersioning.java - this class stores a list of Versions and offers methods for management. Its constructor offers a quick way to disable the lexical preservation filtering (the default is for it to be enabled). The important methods are:
    • addVersion(File) - this adds the File as a version of the document.
    • verifyDeltaForInput(File, File, File) - validates that the changes-only deltaV2 is valid and can be used to recreate either of the inputs, this is called by addVersion(File). This throws a InvalidChangesOnlyDeltaException when the changes-only deltaV2 fails the validation.
    • retrieveVersionForwards(int, File) - this retrieves the requested version of the document using forward recombine, starting with the first version of the document.
    • retrieveVersionBackwards(int, File) - this retrieves the requested version of the document using reverse recombine, starting with the latest version.
    • runLexicalPreservationInfilter(File input) - this runs the lexical preservation input filter, and returns the marked-up version of the input.
    • runLexicalPreservationOutfilter(File input, File output) - this runs the lexical preservation output filter, and returns the non-marked-up version of the input (i.e. the expanded lexical content is back in its original form.
  • Version.java - instances of this class store details about a version of the document (mainly the generated changes-only delta file).
  • InvalidChangesOnlyDeltaException.java - an exception thrown when the generated changes-only deltaV2 is invalid.

3.1. Limitations

There are a few limitations with the sample implementaton, including:

  • Currently the sample is only implemented in Java, but the same approaches will work in the .NET version of DeltaXML Core.
  • The Version object is storing the full copy of each document version, this should be a trivial thing to clean up.
  • There is no persistence, so the version model needs to be built up each run.
  • The LexicalPreservation output filter requires a Saxon PE license as the DeltaXML OEM license cannot be used in sample code. If this is a problem, please contact us.
  • The sample is making use of an older method for LexicalPreservation used for Core releases prior to Core 7.0 as the newer method is not compatible with the XMLCombiner.

4. Running the sample

The sample loads 5 versions (version 1, version 2, version 3, version 4 and version 5) of a document, generates change-only deltas between the versions and reconstructs the requested version (version 3 by default).

The sample is designed to be built and run via the Ant build technology. The provided build.xml script has two main targets

  • run - the default target which compiles and runs the sample code.
  • clean - returns the sample to its original state.

It can be run by issuing either the ant or the ant run commands. If you wish to override which version of the document is included then you should run ant -Dversion-to-retrieve=n (where n is the version you wish to see retrieved).

Alternatively, the sample can be manually compiled and run using the following Java commands, asuming that both the Java compiler and runtime platforms are avaialble on the command-line.

javac -cp ../../deltaxml.jar:../../saxon9pe.jar DeltasForVersioning.java Version.java
java -cp .:../../deltaxml.jar:../../saxon9pe.jar DeltasForVersioning 3

Note that you need to ensure that you use the correct directory and class path separators for your operating system.