
DeltaXML Support Forums

want to know how the comparison logic works
Joined: 03-July-2008
Posts: 1
Posted: 03-July-2008 17:01
I tested two XML samples, which are given below.
In sample 1, document 1 has a single empty element. Document 2 adds a new element in the first position, so the empty element from document 1 appears in the second position in document 2. The DeltaXML comparison result marks the changes as follows:
a) the first element is new (added)
b) the second element (the empty element) is unchanged

I created another sample by adding a new element to both documents (document 1 and 2). The value of this new element is the same in both documents. When I compare documents 1 and 2, the result changes: the first element is now marked as "modified", whereas in the previous comparison it was marked as "new". I do not understand how adding the same element to both documents changes the result. Please help me understand the logic behind the comparison.

Sample 1 - Document 1
-------------------------

<root>
<a></a>
</root>

Sample 1 - Document 2
-----------------------

<root>
<a>value0</a>
<a></a>
</root>

Sample 1 - Result
------------------

<root deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
<a deltaxml:deltaV2="B">value0</a>
<a deltaxml:deltaV2="A=B" />
</root>


Sample 2 - Document 1
---------------------

<root>
<a></a>
<a>value3></a>
</root>

Sample 2 - Document 2
---------------------

<root>
<a>value0</a>
<a></a>
<a>value3></a>
</root>

Sample 2 - Result
------------------

<root deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
<a deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="B">
<deltaxml:text deltaxml:deltaV2="B">value0</deltaxml:text>
</deltaxml:textGroup>
</a>
<a deltaxml:deltaV2="B" />
<a deltaxml:deltaV2="A=B">value3&gt;</a>
</root>
Comparison - subsequences vs. substrings
Joined: 27-March-2007
Posts: 54
Location: Malvern, United Kingdom
Posted: 08-July-2008 16:49
Hello Skrish,


Sorry it took a while to examine/debug this one (we spent some hours
looking at this yesterday).

You only get this odd result using the 'enhanced' or 'document centric' matcher.

If you turn it off (it defaults to 'on' in the sandbox and command line), for example:

$ java -jar /usr/local/DeltaXMLCore-5_0/command.jar compare delta f1.xml f2.xml f12.xml "Enhanced Match 1=false"


You will see an improved result:

<root xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" xmlns:dxx="http://www.deltaxml.com/ns/xml-namespaced-attribute" xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute" deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
  <a deltaxml:deltaV2="B">value0</a>
  <a deltaxml:deltaV2="A=B" />
  <a deltaxml:deltaV2="A=B">value3</a>
</root>


The 'enhanced matcher' result is 'correct' but, as you observed, it is not obvious or simple.  It is designed to optimize the matching in document-centric XML (think of multiple paragraphs of text with some paras added/deleted/modified).  It takes into account all of the PCDATA in an element's subtree (which can be a large amount when word-by-word splitting/reconstruction is used).

In both cases the optimization function used in the matching or alignment is the 'Longest Common Subsequence' (closely related to the 'edit path' or 'Levenshtein distance').

It is subtly different from the Longest Common 'Substring'.  Wikipedia has a good example of the difference:

http://en.wikipedia.org/wiki/Substring

http://en.wikipedia.org/wiki/Longest_common_substring_problem

http://en.wikipedia.org/wiki/Levenshtein_distance
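
To make the subsequence/substring difference concrete, here is a minimal Python sketch. It is a textbook dynamic-programming illustration only, not DeltaXML's implementation, and the function names are made up for this post. A subsequence may skip items in either input, whereas a substring must be contiguous.

```python
def lcs_length(a, b):
    """Length of the longest common SUBSEQUENCE (gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            # Extend a match diagonally, otherwise carry the best neighbour.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_common_substring_length(a, b):
    """Length of the longest common SUBSTRING (must be contiguous)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            # A mismatch resets the run to zero: no gaps are allowed.
            cur.append(prev[j - 1] + 1 if x == y else 0)
        best = max(best, max(cur))
        prev = cur
    return best


print(lcs_length("ABCBDAB", "BDCABA"))                       # 4 (e.g. "BCBA")
print(longest_common_substring_length("ABCBDAB", "BDCABA"))  # 2 (e.g. "AB")
```

The same functions work on lists of tokens as well as strings, which is closer to how an XML comparison would use them.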

In the case of your test data, there are two optimal subsequences of equal length when processing the flattened data structure used by the enhanced matcher.  Unfortunately, the algorithm returned the subsequence which leads to the non-intuitive result.  While this result may appear complicated, it is nevertheless 'correct', and more generally there can be multiple 'correct' answers for any pair of inputs.  It is always possible to regenerate either input from a full-context delta.
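
As a rough illustration of how several equally good alignments can arise, the Python sketch below flattens the two Sample 2 documents into token lists and enumerates every optimal common-subsequence alignment. The flattening shown here (start tag, text, end tag) is an assumption made for this example, with the text simplified to "value3"; the enhanced matcher's real internal data structure is not described in this thread and may give a different count. The function name is hypothetical.

```python
from functools import lru_cache


def optimal_alignments(a, b):
    """Return every maximum-length alignment as a tuple of (i, j) index pairs."""

    @lru_cache(maxsize=None)
    def length(i, j):
        # LCS length of the suffixes a[i:] and b[j:].
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + length(i + 1, j + 1)
        return max(length(i + 1, j), length(i, j + 1))

    def walk(i, j, need):
        # Pick the first matched pair, then recurse on what remains.
        if need == 0:
            yield ()
            return
        for p in range(i, len(a)):
            for q in range(j, len(b)):
                if a[p] == b[q] and 1 + length(p + 1, q + 1) == need:
                    for rest in walk(p + 1, q + 1, need - 1):
                        yield ((p, q),) + rest

    return list(walk(0, 0, length(0, 0)))


# Hypothetical flattening of Sample 2 (children of <root> only).
doc_a = ["<a>", "</a>", "<a>", "value3", "</a>"]
doc_b = ["<a>", "value0", "</a>", "<a>", "</a>", "<a>", "value3", "</a>"]

for alignment in optimal_alignments(doc_a, doc_b):
    print(alignment)
```

With this flattening, one optimal alignment pairs the empty element of document 1 with the tags around value0 (so that element reads as 'modified'), while another pairs it with the second, empty element of document 2 (so the first element reads as 'added'). All of them have the same length, so an LCS-based matcher has no reason to prefer one over another.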

So in summary:

  - the enhanced matcher works well with document-like data, whereas your data works better with this setting turned off.

  - the result is correct, if not obvious

  - it should work better with larger amounts of PCDATA and word-by-word turned on.  Your example exhibits some of the problems of microbenchmarks.  We'd be reluctant to look at optimizations/improvements unless you can show us a more realistic or larger example of mismatching.
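
The second point, that the result is correct if not obvious, can be checked by hand with a small Python sketch that pulls the 'A' and 'B' versions back out of the Sample 2 delta. It assumes that a deltaV2 value names the versions in which a node is present ("A", "B", "A=B", "A!=B"), that nodes without the attribute inherit presence from their parent, and that the delta declares the namespace shown in the result above. It ignores attributes, mixed-content tails and whitespace, so treat it as an illustration rather than a general extractor; the function name is made up.

```python
import xml.etree.ElementTree as ET

DX_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
DELTA = f"{{{DX_NS}}}deltaV2"
TEXT_GROUP = f"{{{DX_NS}}}textGroup"


def extract(elem, version):
    """Serialize the `version` ("A" or "B") view of a delta element as a string."""
    parts = [f"<{elem.tag}>", (elem.text or "").strip()]
    for child in elem:
        # Assumption: a missing deltaV2 attribute means "same as parent".
        if version not in child.get(DELTA, version):
            continue
        if child.tag == TEXT_GROUP:
            # Keep only the text alternative belonging to this version.
            parts += [t.text or "" for t in child
                      if version in t.get(DELTA, version)]
        else:
            parts.append(extract(child, version))
    parts.append(f"</{elem.tag}>")
    return "".join(parts)


delta = ET.fromstring(
    f'<root xmlns:deltaxml="{DX_NS}" deltaxml:deltaV2="A!=B">'
    '<a deltaxml:deltaV2="A!=B">'
    '<deltaxml:textGroup deltaxml:deltaV2="B">'
    '<deltaxml:text deltaxml:deltaV2="B">value0</deltaxml:text>'
    '</deltaxml:textGroup></a>'
    '<a deltaxml:deltaV2="B" />'
    '<a deltaxml:deltaV2="A=B">value3&gt;</a>'
    '</root>'
)

print(extract(delta, "A"))  # <root><a></a><a>value3></a></root>
print(extract(delta, "B"))  # <root><a>value0</a><a></a><a>value3></a></root>
```

Both outputs match the original Sample 2 documents, which is the sense in which the 'modified' result is still a valid delta.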

I hope this answers your question.  I would suggest experimenting with larger test data and if you see any similar problems please get back to us.

Thanks,

Nigel