I presented a paper at the Balisage conference recently on the significance (or not!) of element order in XML, “Element order is always important in XML, except when it isn’t”. Part of this discussion included JSON.
One question that arose was how to compare one with the other, i.e. there is data that is in XML and also in JSON and it is necessary to check that the information is the same or to find out where it differs. Related to this is the question we sometimes get in our support channel: how do I compare the same data in XML and in CSV (Comma Separated Value)?
Unfortunately comparing the different formats directly is not possible so some pre-processing is needed. If some of your data is in CSV then it is probably best to convert this to JSON or XML and then compare. The reason is simply that unless your data is small and you can look at it all easily, you will likely need to do some post-processing to get the changes into a form that you can easily review, and this is easier if the changes are represented in JSON or XML. One benefit of DeltaXML tools is that the changes are represented directly in these formats so they are easy to process and you do not need to load large files into an editor to review changes.
This depends on the tools you are familiar with and also how you would prefer to see the changes, whether in JSON or XML. I would suggest converting from XML into JSON because data is easier to compare in JSON because it is a simpler representation. If you have access to XSLT then this is a good way to convert from XML to JSON. The Saxon XSLT processor will write JSON out from an XML representation though you may need to do some transformation to get it into the correct XML first.
Another reason for using JSON is that it understands numbers whereas XML content is handled as text. If you have a schema you can use this to normalise the data, but that is an additional step in the process.
Very often data that is in both formats will have some differences, and one that is pertinent to the paper mentioned above is order of the data. Multiple similar data items would probably be represented in XML as child elements within a single parent and this might be converted to a JSON array. If the JSON and XML have come from different sources (which were meant to be the same) or from the same source but by different processes, then the order of the data items may differ. In this case, although a change to the order of items in a JSON array is deemed by the JSON standard to be change, in this case it is not – so the array needs to be compared without worrying about any change to the order.
Orderless comparison sounds quite easy but in practice it is simpler to compare items if they are ordered. This is why most comparison tools, including the ubiquitous diff utility, compare ordered data and cannot handle orderless or unordered comparison. You can of course sort the data first then compare it but this is not always as easy as it sounds if you have complex data items because it is difficult to define a good sorting order. By ‘good’ I mean one that will put similar items adjacent to each other and then you can more easily see if something has changed. If the sort order is not ‘good’ then a small change may result in items that should correspond with one another being a long way apart and so you cannot determine whether an item is missing or has simply changed slightly. So a different sort will produce different results if there is data that is not equal, and this is probably not what you want.
Is it possible to have a comparison that ignores the order, so there is no need to sort first? Let us say we have two JSON arrays, one array or items from a data source A and one array of items from data source B. If an item in A is exactly the same as one in B then these two can be aligned and are not a candidate for a difference. So exactly equal items can be aligned with each other fairly easily. And now the fun starts: we are left with those that are not exact matches.
We have recently done a lot of work in this area to get the best possible results. We use statistical methods to align as many items as possible, starting of course with those that are equal and then moving to those that are nearly equal and so on. There is always the hope that the two data sets are almost the same, i.e. most of the items in A have an exactly equal item in B. Often though this is not the case, for example all the A items have an additional field in them or a systematic difference, e.g. ‘10’ in one file is represented as ’10.0’ in the other one and if the comparison is text-based (which most are) it will not match these.
Sometimes each item will have some key value and this can help the alignment of the items. If the key is one data value then it is fairly easy to use this, for example it could be used as the member name in a JSON object and then the alignment will be based on this key value. In this case, the data items are not put into an array but into an object and then any JSON aware comparison will use these names to align the items.
If the key value is made up from two or more values in the item then these need to be combined to form a key, so things are a bit more complicated but it can still be done.
But there is one snag to using keys. Sometimes in two data sets the problem is that a key is wrong in one of the data sets – then they do not match but it is hard to see this because there will be an item in A that is not in B and vice versa but no clue that only the key has changed. This problem often arises when trying to synchronise data in two different systems, because the two systems probably use different values as their key and the one common value that is supposed to be the key has been entered manually and a mistake has been made.
When we were synchronising our data between our licensing and CRM system we found a number of small errors in key values that would have thrown a simple or sorted comparison. For example a key value “9746” in one appeared as “7946” in the other. So although using key values as keys seems a good plan, but if the keys are not 100% reliable then it may be better not to identify each data item with the key.
The first step is to convert CSV data into JSON, or perhaps XML as appropriate. If one data set is in JSON and the other in XML then a decision needs to be made about which to use as the comparison format – if you have more expertise in one then use that, otherwise it is probably best to go for JSON.
Next, put all data items into a JSON array and then use comparison software that can handle changes to the order, i.e. can do an orderless comparison between two JSON arrays. Choose a comparison tool that will represent the changes in JSON directly so you can process it to produce whatever report you need.
This is quite a long blog for what seems a simple question, but there are a more aspects that need to be considered when we are handling more complex data.