I presented a paper at the Balisage conference this year on the significance (or not!) of element order in XML, “Element order is always important in XML, except when it isn’t”.

Part of this discussion included JSON. In fact the way that order is handled in the two formats is remarkably similar: XML elements are by definition ordered, as are JSON array members. Similarly, an XML attribute is a name/value pair, as is a JSON object member, and both of these can appear in any order in a file. So any orderless information must have a name or key in both XML and JSON. In real-world data, though, this is not always the case.
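The parallel between the two formats can be seen in a few lines of Python using only the standard library. This is a minimal sketch, not a statement about how any particular tool treats order: the variable names are made up for illustration, and Python's dictionary equality stands in for "order does not matter".

```python
import json
import xml.etree.ElementTree as ET

# JSON object members are name/value pairs: reordering them changes nothing.
obj_a = json.loads('{"name": "Robin", "affiliation": "DeltaXML"}')
obj_b = json.loads('{"affiliation": "DeltaXML", "name": "Robin"}')
objects_equal = obj_a == obj_b

# JSON array members are ordered: reordering them gives a different array.
arr_a = json.loads('["Robin", "DeltaXML"]')
arr_b = json.loads('["DeltaXML", "Robin"]')
arrays_equal = arr_a == arr_b

# XML attributes behave like object members (ElementTree stores them in a dict).
att_a = ET.fromstring('<author name="Robin" affiliation="DeltaXML"/>')
att_b = ET.fromstring('<author affiliation="DeltaXML" name="Robin"/>')
attributes_equal = att_a.attrib == att_b.attrib

# XML child elements behave like array members: their sequence is preserved.
doc_a = ET.fromstring("<author><name/><affiliation/></author>")
doc_b = ET.fromstring("<author><affiliation/><name/></author>")
children_equal = [c.tag for c in doc_a] == [c.tag for c in doc_b]
```

Only the named pairs (object members and attributes) compare equal after reordering; the array and the child elements do not.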

Should I specify the order in XML and JSON data?

One question that generated quite a lot of discussion was how best to treat information that has no order, i.e. the order of the elements makes no difference to the meaning. Let’s look at a small example to illustrate this.

<author>
  <name>Robin</name>
  <affiliation>DeltaXML</affiliation>
  <address>Malvern Hills Science Park</address>
</author>

In this very small example it makes no difference to the meaning if the address appears before or after the affiliation. How do we specify this in our schema? Unfortunately, XML Schema does not allow us to do this directly. It allows complex types to contain ‘sequence’, ‘choice’ or ‘all’. A sequence or choice may occur more than once. A choice is just one of a list of elements, a sequence is a defined order of elements, and ‘all’ is unordered. It is tempting to think that ‘all’ is what we need here. But what ‘all’ says is that the elements can appear in the file in any order; it does not say that the order has no meaning. This might appear a subtle distinction, but it is not!
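To make the difference between ‘sequence’ and ‘all’ concrete, here is a small Python sketch that mimics the two content models for the author example. The function names are invented for illustration and the checks only look at child element names; this is an assumption-laden stand-in, not how a schema validator works.

```python
import xml.etree.ElementTree as ET

def matches_sequence(parent, names):
    """Mimic 'sequence': the children must appear exactly in the given order."""
    return [child.tag for child in parent] == list(names)

def matches_all(parent, names):
    """Mimic 'all': each named child appears exactly once, in any order."""
    return sorted(child.tag for child in parent) == sorted(names)

# The same author as above, but with address placed before affiliation.
author = ET.fromstring(
    "<author>"
    "<name>Robin</name>"
    "<address>Malvern Hills Science Park</address>"
    "<affiliation>DeltaXML</affiliation>"
    "</author>"
)
model = ["name", "affiliation", "address"]

sequence_ok = matches_sequence(author, model)
all_ok = matches_all(author, model)
```

The ‘sequence’ check rejects this document and the ‘all’ check accepts it, but neither result tells a consumer whether moving the address was meant to convey anything.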

If the order of affiliation and address is not significant then there is a good technical argument to just pick an order and say in the schema that they must appear in that order. A defined order is clearer and easier for a reader of the data, and it removes ambiguity. For example, if the order is not specified then it is tempting to think that putting the address after the name is appropriate when the address refers to the person, and after the affiliation when it refers to the location of the affiliation.

So when the order is not important, just pick one, even though it is counter-intuitive to specify an order when the order really does not matter. However, the discussion at Balisage then went further because perhaps there are 25 or 50 elements where the order is not important and then if we pick an order it becomes much more difficult to edit the file to add one extra element in – you have to find the right place to add it even though it does not matter to the meaning! On the other hand perhaps it is easier to determine if a particular element is present if it has to be in a specified order.

There is an exception to this principle though: if we allowed multiple address elements then we could not use this technique to indicate that the order in which they appear is not significant. So the principle only works in some situations, i.e. when only one instance of a particular element is allowed.
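When an element may repeat, a consumer that wants to ignore order has to do so itself, for example by comparing the children as a multiset. The sketch below is one way to do that in Python; the helper name and the placeholder address values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def child_multiset(parent):
    """Count (tag, text) pairs: repeats are kept, position is ignored."""
    return Counter((child.tag, (child.text or "").strip()) for child in parent)

# Two authors whose address elements (placeholder values) are swapped.
first = ET.fromstring(
    "<author><name>Robin</name>"
    "<address>First address</address><address>Second address</address>"
    "</author>"
)
second = ET.fromstring(
    "<author><name>Robin</name>"
    "<address>Second address</address><address>First address</address>"
    "</author>"
)

same_in_order = (
    [(c.tag, c.text) for c in first] == [(c.tag, c.text) for c in second]
)
same_ignoring_order = child_multiset(first) == child_multiset(second)
```

The ordered comparison reports a difference while the multiset comparison does not. Nothing in the data itself says which of the two answers is the intended one.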

So the problem remains, and it is a shame that it is not possible to change the default that order is significant in XML and JSON arrays.

A solution to normalise orderless XML and JSON data

How could this be fixed in some future update of the XML Schema language? Attempts to add to the syntax rules themselves are likely to result in quite a lot of extra complexity. But there is an easier way to do this, because XML itself already handles a similar situation for white space. By definition, white space is normalised by any reader of XML so multiple spaces have no more significance than one space, which makes it easy to pretty-print a file without changing its meaning. Very convenient. But it is possible on any element to add an attribute xml:space="preserve" to indicate that the white space within this element is significant and must be preserved, not normalised.
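As a rough illustration of the white space mechanism, the Python sketch below collapses runs of white space unless the element carries xml:space="preserve". The function name is made up, the normalisation is done by this code rather than by the parser, and for brevity the sketch looks only at the element itself and does not apply the attribute to descendants.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def text_for_comparison(element):
    """Collapse runs of white space unless the element asks to preserve them."""
    text = element.text or ""
    if element.get(XML_SPACE) == "preserve":
        return text
    return " ".join(text.split())

loose = ET.fromstring("<address>Malvern   Hills\n  Science Park</address>")
strict = ET.fromstring(
    '<address xml:space="preserve">Malvern   Hills</address>'
)

loose_text = text_for_comparison(loose)
strict_text = text_for_comparison(strict)
```

The first element comes out with single spaces, while the second keeps its three spaces because the attribute switches the default off.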

So this suggests a simple solution for order also. We could have an attribute to indicate that the normal significance of order is overruled for a particular element, e.g. dx:ignore-order="true". That would seem a simple solution to this problem. It might be good if this was in the xml namespace but that is restricted so this would need to be an agreed standard first, hence the use of dx as the namespace prefix here.
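A consumer could honour such an attribute by normalising the child order wherever it is set before comparing two documents. The Python sketch below shows one possible reading of the idea. The namespace URI, the helper names and the canonical form are all assumptions made for illustration; the proposal above only names the dx prefix and the attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the proposal only suggests the dx prefix.
DX_NS = "http://example.com/ns/dx"
IGNORE_ORDER = "{%s}ignore-order" % DX_NS

def canonical(element):
    """Build a nested tuple; children are sorted where the flag is 'true'."""
    children = [canonical(child) for child in element]
    if element.get(IGNORE_ORDER) == "true":
        children.sort()  # order carries no meaning here, so normalise it
    attributes = tuple(sorted(element.attrib.items()))
    text = (element.text or "").strip()
    return (element.tag, attributes, text, tuple(children))

def author(children, ignore_order):
    """Parse an author element, optionally carrying the hypothetical flag."""
    flag = 'dx:ignore-order="true"' if ignore_order else ""
    return ET.fromstring(
        '<author xmlns:dx="%s" %s>%s</author>' % (DX_NS, flag, children)
    )

AFFILIATION_FIRST = (
    "<name>Robin</name><affiliation>DeltaXML</affiliation>"
    "<address>Malvern Hills Science Park</address>"
)
ADDRESS_FIRST = (
    "<name>Robin</name><address>Malvern Hills Science Park</address>"
    "<affiliation>DeltaXML</affiliation>"
)

flagged_equal = (
    canonical(author(AFFILIATION_FIRST, True))
    == canonical(author(ADDRESS_FIRST, True))
)
unflagged_equal = (
    canonical(author(AFFILIATION_FIRST, False))
    == canonical(author(ADDRESS_FIRST, False))
)
```

With the flag set, the two orderings compare equal; without it, the default applies and they differ. Because the flag sits on an individual element, one document can mix ordered and orderless content.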

It would be remiss of me not to mention that in our XML and JSON comparison products, we do allow for orderless data! It is more challenging to find corresponding elements if they can appear in any order but it is very often really useful to be able to do this.

As with so many things, the devil is in the detail here, but the detail is important. Time to think about other types of order now, and order some refreshment!

1 reply
    Syd Bauman says:

    Hi Robin! And thank you for both the talk and the post. But I have to admit, I am having a bit of difficulty wrapping my head around the problem, here. Maybe it’s because I only had 4 hours of sleep last night, but I am puzzled a bit by which consumer of the data needs to know if the order is significant, but doesn’t intrinsically.

    That is, my first hazy pre-dawn thought here is that order is significant to some consumers of the data and not others. For example a tool that is formatting my TEI (or DocBook or whatever) document for print publication had darn well better put the heading of my chapter before the first paragraph; and for that matter, the first paragraph before the 2nd. A tool that is reporting the number of paragraphs per chapter might well want to put the number before the heading (or after), and does not care about the order of the heading with respect to the paragraphs, or of the paragraphs among themselves.

    (Strikes me that the same is true of pretty much all data, not just XML elements or JSON arrays. For example, consider the data content of my XML elements. If the tool reading my document is a spell-checker, the order of characters in a token (or letters in a word) matters very much: “ordre” need be flagged, “order” must not be. But if the tool is a character-counter (like the one I presented at the Balisage 2020 pre-event; see https://github.com/NEU-DSG/wwp-public-code-share/tree/main/character_counts) then “order”, “ordre”, and “deorr” are the same.)

    My 1st guess, in answer to my own question, is that a comparison utility might care whether order is significant. It might like to report that <a/><b/> is the same as <b/><a/> iff order is unimportant. But wouldn’t the best place for figuring out whether order is important to the comparison utility be a parameter passed to it? No, he said, answering his own question again. The problem is that some parts of a document carry semantics in the order (for example, the children of a ) while others do not (for example, the children of a ).

    So maybe @dx:ignore-order is a reasonable approach. Even if it is for a niche case. Like @xml:space it would generally speaking not be very useful, except when you need it. 🙂

    However, if we were to use a PI (“”), we would not have to change our schemas or specs. And, to my mind, it is about processing.
