DeltaXML Support Forums
want to know how the comparison logic works
Posted: 03-July-2008 17:01

I tested two XML samples, both given below.

In sample 1, document 1 has an empty element. Document 2 has a new element added in the first position, and the empty element from document 1 is present in document 2 at the second position. The DeltaXML comparison result marks the changes as:
a) the first element is a new (added) element
b) the second element (the empty element) is unchanged

I created a second sample by adding a new element to both documents (document 1 and 2). The value of the new element is the same in both documents. If I compare documents 1 and 2 now, the results change: the first element is marked as "modified", whereas in the previous comparison it was marked as "new".

I do not understand how adding the same new element to both documents changes the results. Please help me understand the logic behind the comparison.

Sample 1 - Document 1
-------------------------
<root>
  <a></a>
</root>

Sample 1 - Document 2
-----------------------
<root>
  <a>value0</a>
  <a></a>
</root>

Sample 1 - Result
------------------
<root deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
  <a deltaxml:deltaV2="B">value0</a>
  <a deltaxml:deltaV2="A=B" />
</root>

Sample 2 - Document 1
---------------------
<root>
  <a></a>
  <a>value3></a>
</root>

Sample 2 - Document 2
---------------------
<root>
  <a>value0</a>
  <a></a>
  <a>value3></a>
</root>

Sample 2 - Result
------------------
<root deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context">
  <a deltaxml:deltaV2="A!=B">
    <deltaxml:textGroup deltaxml:deltaV2="B">
      <deltaxml:text deltaxml:deltaV2="B">value0</deltaxml:text>
    </deltaxml:textGroup>
  </a>
  <a deltaxml:deltaV2="B" />
  <a deltaxml:deltaV2="A=B">value3></a>
</root>
Comparison - subsequences vs. substrings
Posted: 08-July-2008 16:49

Hello Skrish,

Sorry it took a while to examine and debug this one (we spent some hours looking at it yesterday).

You only get this odd result when using the 'enhanced' or 'document centric' matcher. If you turn it off (it defaults to 'on' in the sandbox and on the command line), for example:

You will see an improved result:
The 'enhanced matcher' result is 'correct' but, as you observed, it is not obvious or simple. It is designed to optimize the matching in document centric XML (think of multiple paragraphs of text with some paragraphs added, deleted or modified). It takes into account all of the PCDATA in an element's subtree (which can be a large number of items when word-by-word splitting/reconstruction is used).

In both cases the optimization function used in the matching or alignment is the 'Longest Common Subsequence' (closely related to the 'edit path' or 'Levenshtein distance'). It is subtly different from the Longest Common 'Substring': a subsequence may skip items, whereas a substring must be contiguous. Wikipedia has a good example of the difference:

http://en.wikipedia.org/wiki/Substring
http://en.wikipedia.org/wiki/Longest_common_substring_problem
http://en.wikipedia.org/wiki/Levenshtein_distance

In the case of your test data, there are 2 optimal, equal-length subsequences when processing the flattened data structure used by the enhanced matcher. Unfortunately, the algorithm returned the subsequence that leads to the non-intuitive result. While this result may appear complicated, it is nevertheless 'correct', and more generally there can be multiple 'correct' answers for any pair of inputs. It is always possible to regenerate either input from a full-context delta.

So in summary:
- the enhanced matcher works well with document-like data, and your data works better with this setting turned off.
- the result is correct, if not obvious.
- it should work better with larger amounts of PCDATA and with word-by-word turned on.

Your example exhibits some of the problems of microbenchmarks. We'd be reluctant to look at optimizations or improvements unless you can show us a more realistic or larger example of mismatching.

I hope this answers your question. I would suggest experimenting with larger test data, and if you see any similar problems please get back to us.

Thanks,
Nigel
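
To make the subsequence/substring distinction and the "several equally good answers" point concrete, here is a minimal Python sketch. It is not DeltaXML's implementation: the function names are made up for illustration, and the token lists are only a guess at what a flattened view of Sample 2 might look like (the real internal structure used by the enhanced matcher is not described in this thread).

```python
from functools import lru_cache


def lcs_length(a, b):
    """Length of the longest common subsequence (items may be skipped)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_common_substring_length(a, b):
    """Length of the longest common run of adjacent items (no skipping)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            run = prev[j - 1] + 1 if x == y else 0
            cur.append(run)
            best = max(best, run)
        prev = cur
    return best


def optimal_alignments(a, b):
    """Every distinct set of (index_in_a, index_in_b) pairs that realises an LCS.

    Exhaustive, so only suitable for tiny inputs like the ones below.
    """

    @lru_cache(maxsize=None)
    def length(i, j):
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + length(i + 1, j + 1)
        return max(length(i + 1, j), length(i, j + 1))

    @lru_cache(maxsize=None)
    def walk(i, j):
        if length(i, j) == 0:
            return frozenset({()})
        found = set()
        if a[i] == b[j]:  # pair these two items up
            found.update(((i, j),) + rest for rest in walk(i + 1, j + 1))
        if length(i + 1, j) == length(i, j):  # skipping a[i] loses nothing
            found.update(walk(i + 1, j))
        if length(i, j + 1) == length(i, j):  # skipping b[j] loses nothing
            found.update(walk(i, j + 1))
        return frozenset(found)

    return sorted(walk(0, 0))


# Subsequence vs. substring on a textbook pair of strings.
print(lcs_length("ABCBDAB", "BDCABA"))
print(longest_common_substring_length("ABCBDAB", "BDCABA"))

# ASSUMPTION: a hypothetical token-level flattening of Sample 2.
DOC1 = ["<a>", "</a>", "<a>", "value3>", "</a>"]
DOC2 = ["<a>", "value0", "</a>", "<a>", "</a>", "<a>", "value3>", "</a>"]

for alignment in optimal_alignments(DOC1, DOC2):
    print(alignment)
```

Under this assumed flattening, more than one alignment reaches the same maximal length. Some pair the first `<a>` of document 1 with the `<a>` that holds `value0` (which would read as "modified"), others pair it with the empty `<a>` (which would read as "added" followed by "unchanged"). A matcher that scores only by length has no reason to prefer one over the other.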
