I presented a paper at the Balisage conference this year on the significance (or not!) of element order in XML, “Element order is always important in XML, except when it isn’t”.

Part of this discussion included JSON. In fact the way that order is handled in the two formats is remarkably similar: XML elements are by definition ordered as are JSON array members. Similarly, an XML attribute is a name/value pair as is a JSON object member and both of these can appear in any order in a file. So any orderless information must have a name or key in both XML and JSON. This is not always the case though in real world data.
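
The parallel between the two formats can be checked with a few lines of Python, using only the standard library. This is a minimal sketch: the sample values are borrowed from the author example used later in this post, and the variable names are illustrative only.

```python
import json
import xml.etree.ElementTree as ET

# JSON object members are name/value pairs: member order does not affect equality.
obj_a = json.loads('{"name": "Robin", "affiliation": "DeltaXML"}')
obj_b = json.loads('{"affiliation": "DeltaXML", "name": "Robin"}')
objects_equal = obj_a == obj_b

# JSON array members are positional: reordering them gives a different value.
arr_a = json.loads('["name", "affiliation"]')
arr_b = json.loads('["affiliation", "name"]')
arrays_equal = arr_a == arr_b

# XML attributes behave like object members ...
att_a = ET.fromstring('<author name="Robin" affiliation="DeltaXML"/>')
att_b = ET.fromstring('<author affiliation="DeltaXML" name="Robin"/>')
attributes_equal = att_a.attrib == att_b.attrib

# ... while child elements behave like array members.
el_a = ET.fromstring('<author><name/><affiliation/></author>')
el_b = ET.fromstring('<author><affiliation/><name/></author>')
children_equal = [c.tag for c in el_a] == [c.tag for c in el_b]

print(objects_equal, arrays_equal, attributes_equal, children_equal)
# prints: True False True False
```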

Should I specify the order in XML and JSON data?

One question that generated quite a lot of discussion was how best to treat information that has no order, i.e. the order of the elements makes no difference to the meaning. Let’s look at a small example to illustrate this.

<author>
  <name>Robin</name>
  <affiliation>DeltaXML</affiliation>
  <address>Malvern Hills Science Park</address>
</author>

In this very small example it makes no difference to the meaning if the address appears before or after the affiliation. How do we specify this in our schema? Unfortunately, XML Schema does not allow us to do this directly. It allows complex types to contain ‘sequence’, ‘choice’ or ‘all’. A sequence or choice may occur more than once. A choice is just one of a list of elements, a sequence is a defined order of elements, and ‘all’ is unordered. It is tempting to think that ‘all’ is what we need here. But what ‘all’ says is that the elements can appear in the file in any order; it does not say that the order has no meaning. This might appear a subtle distinction, but it is not!

If the order of affiliation and address is not significant then there is a good technical argument to just pick an order and say in the schema that they must appear in that order. This makes things clear, and a defined order is easier for a reader of the data. There can then be no ambiguity. For example, if the order is not specified then it is tempting to think that putting the address after the name is appropriate when the address refers to the person, and after the affiliation when it refers to the location of the affiliation.

So when the order is not important, just pick one, even though it is counter-intuitive to specify an order when the order really does not matter. However, the discussion at Balisage then went further because perhaps there are 25 or 50 elements where the order is not important and then if we pick an order it becomes much more difficult to edit the file to add one extra element in – you have to find the right place to add it even though it does not matter to the meaning! On the other hand perhaps it is easier to determine if a particular element is present if it has to be in a specified order.

There is an exception to this principle though: if we allowed multiple address elements then we could not use this technique to indicate that the order in which they appear is not significant. So the principle only works in some situations, i.e. when only one instance of a particular element is allowed.

So the problem remains, and it is a shame that it is not possible to change the default that order is significant in XML and JSON arrays.

A solution to normalise orderless XML and JSON data

How could this be fixed in some future update of the XML Schema language? Attempts to add to the syntax rules themselves are likely to result in quite a lot of extra complexity. But there is an easier way to do this, because XML already handles a similar situation for white space. By default, a reader of XML is free to normalise white space, so multiple spaces have no more significance than one space. This makes it easy to pretty-print a file without changing its meaning, which is very convenient. But it is possible on any element to add an attribute xml:space="preserve" to indicate that the white space within this element is significant and must be preserved, not normalised.

So this suggests a simple solution for order also. We could have an attribute to indicate that the normal significance of order is overruled for a particular element, e.g. dx:ignore-order="true". That would seem a simple solution to this problem. It might be good if this was in the xml namespace, but that namespace is reserved, so this would need to be an agreed standard first; hence the use of dx as the namespace prefix here.
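
To make the idea concrete, here is a rough sketch of how a consumer might honour such an attribute when deciding whether two elements mean the same thing. Everything specific in it is an assumption for illustration: the namespace URI is made up, the attribute is only the proposal above and not an agreed standard, and the `canonical` helper is hypothetical. It is not how any particular comparison product works.

```python
import xml.etree.ElementTree as ET

DX_NS = "http://example.com/dx"             # hypothetical namespace URI
IGNORE_ORDER = f"{{{DX_NS}}}ignore-order"   # hypothetical attribute (the proposal above)


def canonical(elem):
    """Return a nested tuple that is equal for two elements with the same meaning.

    Simplification: text after a child element (tail text) is ignored, so this
    sketch is only suitable for data-style XML, not mixed content.
    """
    children = [canonical(child) for child in elem]
    if elem.get(IGNORE_ORDER) == "true":
        # Order carries no meaning here, so sort the children into a fixed form.
        children.sort()
    attrs = tuple(sorted(elem.attrib.items()))
    text = (elem.text or "").strip()
    return (elem.tag, attrs, text, tuple(children))


def author(children, ignore_order):
    """Build the sample <author> element, optionally marked as orderless."""
    flag = f' xmlns:dx="{DX_NS}" dx:ignore-order="true"' if ignore_order else ""
    return ET.fromstring(f"<author{flag}>{children}</author>")


AFFILIATION = "<affiliation>DeltaXML</affiliation>"
ADDRESS = "<address>Malvern Hills Science Park</address>"

marked_equal = (canonical(author(AFFILIATION + ADDRESS, True))
                == canonical(author(ADDRESS + AFFILIATION, True)))
unmarked_equal = (canonical(author(AFFILIATION + ADDRESS, False))
                  == canonical(author(ADDRESS + AFFILIATION, False)))

print(marked_equal, unmarked_equal)
# prints: True False
```

Because the children are sorted as whole subtrees rather than by name alone, the same sketch also copes with several address elements under one orderless parent, which is the case that a fixed schema order cannot express.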

It would be remiss of me not to mention that in our XML and JSON comparison products, we do allow for orderless data! It is more challenging to find corresponding elements if they can appear in any order but it is very often really useful to be able to do this.

As with so many things, the devil is in the detail here, but the detail is important. Time to think about other types of order now, and order some refreshment!

10 replies
  1. Syd Bauman
    Syd Bauman says:

    Hi Robin! And thank you for both the talk and the post. But I have to admit, I am having a bit of difficulty wrapping my head around the problem, here. Maybe it’s because I only had 4 hours of sleep last night, but I am puzzled a bit by which consumer of the data needs to know if the order is significant, but doesn’t intrinsically.

    That is, my first hazy pre-dawn thought here is that order is significant to some consumers of the data and not others. For example, a tool that is formatting my TEI (or DocBook or whatever) document for print publication had darn well better put the heading of my chapter before the first paragraph; and for that matter, the first paragraph before the 2nd. A tool that is reporting the number of paragraphs per chapter might well want to put the number before the heading (or after), and does not care about the order of the heading with respect to the paragraphs, or of the paragraphs among themselves.

    (Strikes me that the same is true of pretty much all data, not just XML elements or JSON arrays. For example, consider the data content of my XML elements. If the tool reading my document is a spell-checker, the order of characters in a token (or letters in a word) matters very much: “ordre” need be flagged, “order” must not be. But if the tool is a character-counter (like the one I presented at the Balisage 2020 pre-event; see https://github.com/NEU-DSG/wwp-public-code-share/tree/main/character_counts) then “order”, “ordre”, and “deorr” are the same.)
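
The spell-checker versus character-counter contrast can be sketched in a few lines of Python. The two key functions are hypothetical stand-ins for the two kinds of tool, not code from the repository linked above.

```python
from collections import Counter


def spelling_key(word):
    """What a spell-checker sees: the characters in sequence."""
    return tuple(word)


def count_key(word):
    """What a character counter sees: how often each character occurs."""
    return Counter(word)


words = ["order", "ordre", "deorr"]

same_for_counter = all(count_key(w) == count_key(words[0]) for w in words)
same_for_speller = all(spelling_key(w) == spelling_key(words[0]) for w in words)

print(same_for_counter, same_for_speller)
# prints: True False
```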

    My 1st guess, in answer to my own question, is that a comparison utility might care whether order is significant. It might like to report that <a/><b/> is the same as <b/><a/> iff order is unimportant. But wouldn’t the best place for figuring out whether order is important to the comparison utility be a parameter passed to it? No, he said, answering his own question again. The problem is that some parts of a document carry semantics in the order (for example, the children of a ) while others do not (for example, the children of a ).

    So maybe @dx:ignore-order is a reasonable approach. Even if it is for a niche case. Like @xml:space it would generally speaking not be very useful, except when you need it. 🙂

    However, if we were to use a PI (“”), we would not have to change our schemas or specs. And, to my mind, it is about processing.

    • Robin La Fontaine
      Robin La Fontaine says:

      Thanks for your comments, Syd, you always have an interesting angle to explore!

      “I am puzzled a bit by which consumer of the data needs to know if the order is significant, but doesn’t intrinsically.” You have an optimistic view here I think, Syd, because where a standard is not clear there will always be two people who interpret it differently! Indeed, even when the standard is clear that can happen. I think what you say is probably more true in the ‘document world’ than the ‘data world’ of XML, i.e. there is more scope for different interpretations of data than documents.

      “Like @xml:space it would generally speaking not be very useful, except when you need it.” Yes! Like 4-wheel drive, not often useful but when it is you really do need it.

      “And, to my mind, it is about processing.” This is an interesting comment, but I am not sure I agree: it is not just about processing, it is about the semantics of the information. The order of paragraphs really is important, even if for some processing it has no effect. Similarly, the order of items in a set (in the mathematical sense) of data is not important (it has no effect on the members of the set), and a process should never depend on the order. Perhaps we would both agree that “for some types of processing, it is not important.”

  2. Gerrit Imsieke
    Gerrit Imsieke says:

    Maybe you don’t need to allow dx:ignore-order as an attribute in the document. You might let people supply this information as a schema annotation, or in a configuration file.

  3. Max Zhaloba
    Max Zhaloba says:

    Hi Robin,

    Thank you for sharing your thoughts on this issue. I personally deal with content ordering when creating XSL transforms which update DOCX (OOXML) content. The OOXML XML Schema requires order in many cases where it could logically be ignored, e.g. when I apply numbering to an existing paragraph I have to add the numbering element at the defined position among the other elements within the paragraph properties element, which might or might not exist on its own. I solve this issue in XSLT the following way:

    1. Read all existing paragraph properties from the paragraph properties element (if any)
    2. Add the numbering element to the end of the properties list
    3. Sort the paragraph properties according to the order defined in “CT_PPrBase” complex type in wml.xsd. I have a generic template for this purpose.
    4. Generate the paragraph properties element with the ordered properties as child elements
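
The four steps above can be sketched roughly as follows. This is Python rather than XSLT and is not the generic template mentioned in step 3. The element names (`pPr` for paragraph properties, `numPr` for numbering) are assumptions about what the steps refer to, `add_and_sort` is a hypothetical helper, and `PPR_ORDER` is a heavily abbreviated, illustrative subset; the authoritative order is the one in wml.xsd.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Abbreviated, illustrative subset of the child order (assumption: see wml.xsd
# for the full, authoritative sequence in CT_PPrBase).
PPR_ORDER = ["pStyle", "keepNext", "numPr", "spacing", "ind", "jc"]


def add_and_sort(ppr, new_child, order):
    """Append new_child to the properties and return a schema-ordered copy."""
    rank = {f"{{{W_NS}}}{name}": i for i, name in enumerate(order)}
    # Steps 1 and 2: existing properties, with the new one added at the end.
    children = list(ppr) + [new_child]
    # Step 3: stable sort by schema position; unknown elements go last,
    # keeping their original relative order.
    children.sort(key=lambda child: rank.get(child.tag, len(rank)))
    # Step 4: generate a new properties element holding the ordered children.
    sorted_ppr = ET.Element(ppr.tag, ppr.attrib)
    sorted_ppr.extend(children)
    return sorted_ppr


ppr = ET.fromstring(
    f'<w:pPr xmlns:w="{W_NS}">'
    '<w:pStyle w:val="ListParagraph"/><w:jc w:val="left"/>'
    '</w:pPr>'
)
num_pr = ET.Element(f"{{{W_NS}}}numPr")

result = add_and_sort(ppr, num_pr, PPR_ORDER)
names = [child.tag.split("}")[1] for child in result]
print(names)
# prints: ['pStyle', 'numPr', 'jc']
```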

    Instead of performing the sorting on each editing operation in XSLT, I thought I could just add elements as last children in the resulting XML and then apply an XML Schema validating engine capable of rearranging the XML elements on the fly and generating a formally valid XML file, provided that the input XML file satisfied all validation constraints other than element order.

    If the corresponding XML Schema allowed orderless content, such a validating engine could still rearrange the content using the order available in the XML Schema (e.g. “EG_RPrBase”) or alphabetically, which would facilitate content comparison at a later stage.

    Talking about the “ignore-order” attribute, I’d rather implement it as another switch for the XML parser. It would rearrange the content to match the order defined in the DTD or XML Schema. We already have such features which cause the DOM to differ from the original XML: (1) whitespace representation (controlled by xml:space attributes in the source XML) and (2) automatic attribute expansion using a DTD or XML Schema (e.g. adding the “class” attribute to the DOM when parsing DITA content).

    • Robin La Fontaine
      Robin La Fontaine says:

      Thanks for your description of how you handle this in OOXML, and I understand why it would be good to have this handled automatically based on the Schema. This relates to Syd’s comments about handling this as a processing issue but as I comment above I have reservations about this. It also works for single elements that need to be in a specific order but could not work for multiple occurrences of one element type if order is significant (see original blog, para beginning “There is an exception to this principle though:”). So at best this would be a partial solution but I do like the idea otherwise.

  4. Elisa Beshero-Bondar
    Elisa Beshero-Bondar says:

    As I was listening to your Balisage talk, Robin, and reading your post now, I am wondering about the ways we work with schemas to designate that any element may appear in any order. In Relax NG (compact syntax), we can write something like this for your <author> element:

    author = element author { (name | affiliation | address)* }

    And this of course provides a very flexible content model that permits any of these elements to appear zero or more times without predetermining the order. Because I can do that by grouping clusters and providing choices, I can even go on and define a sequence precisely where I want it, if I decide I want to allow more than one address and that should always appear last:

    author = element author { (name | affiliation)+, address+}

    That is a combination of no sequence and sequence. And I can even cluster the address(es) together but allow them first, second, or third position:

    author = element author { (name | affiliation | address+)* }
    though I don’t really need to use the internal repetition indicator since the grouping of options permits them to appear in any order.
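
One way to see which child orders the first two content models accept is to treat each as a regular expression over the sequence of child names. The sketch below is only a stand-in for illustration, not a Relax NG validator; the regular expressions are hand translations of the patterns above and `accepts` is a hypothetical helper.

```python
import re


def accepts(content_model, child_names):
    """Check a sequence of child element names against a hand-written regex."""
    # Encode the sequence as space-terminated names, e.g. "name address ".
    encoded = "".join(name + " " for name in child_names)
    return re.fullmatch(content_model, encoded) is not None


# (name | affiliation | address)*
ANY_ORDER = r"(?:name |affiliation |address )*"
# (name | affiliation)+, address+
ADDRESS_LAST = r"(?:name |affiliation )+(?:address )+"

address_first = ["address", "name", "affiliation"]
addresses_last = ["name", "affiliation", "address", "address"]

any_order_ok = accepts(ANY_ORDER, address_first)
address_first_ok = accepts(ADDRESS_LAST, address_first)
addresses_last_ok = accepts(ADDRESS_LAST, addresses_last)

print(any_order_ok, address_first_ok, addresses_last_ok)
# prints: True False True
```

Both patterns describe which orders are syntactically permitted; neither says whether the order that actually appears carries meaning.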

    I realize you are writing about XML Schema and not Relax NG, and that this is to support tooling written in XML Schema. My question now is, if XML Schema is necessary and Relax can’t be used, what happens when we try to convert a Relax NG schema that permits a mix of elements in any order into XML Schema? I have never tried this, and Relax is my native XML schema idiom, so I am curious about the relationships (or non-relationship) between these schema languages.

  5. Elisa Beshero-Bondar
    Elisa Beshero-Bondar says:

    Re-reading your post, I think I can see how XML Schema would handle my Relax NG groupings of options, and I suspect your response may be that Relax NG, too, has no way to specify that the order does not matter! Hmm.

    In my world of projects I do tend to prefer more order to less, so I suppose even in the clustering of labels in a schema, even when I wish to allow a content model in which elements appear in any order or not at all, I write the options in an order to do with human readability and logic to make the schema easier to edit!

    Lots to think about here. In my work (even with lots of exciting mixed content with text flowing round the elements like soup) I am not sure I need to designate absolute orderlessness, but I would like to better understand the circumstances when I might want that!

    • Robin La Fontaine
      Robin La Fontaine says:

      Thanks for your thoughtful comments, Elisa. As you say, Relax and Schema specify the syntactic order but do not indicate if it has semantic meaning or not.

      You raise an interesting point about mixed content: that is a special case that I did not deal with. In our comparison products, we always treat order as significant for mixed content. This is because the order of characters in PCDATA is significant, and typically in mixed content the elements (siblings to PCDATA within the mixed content) may contain more PCDATA, which needs to be kept in sequence with the text in the ancestor element. I cannot really think of a situation where you could re-arrange the order of text and elements in mixed content without changing the meaning. So I should have said dx:ignore-order="true" would not be allowed on mixed content elements.

      Thank you for bringing up that important case.

  6. Ari Nordström
    Ari Nordström says:

    Hi Robin,

    Thanks for the talk and the post, both.

    I think the key here is that the importance of order varies depending on the user. It might be important to some while not to others, so I’d probably not want to define it directly in my schema, be it a DTD, XSD or RNG, even if I could. I can’t help but think back to SGML and the AND groups that the XML WG did away with. They were certainly preferable to repeatable OR groups in so many cases.

    But I sort of understand why they were removed, especially since there is an excellent tool to add that kind of check, both for the AND groups and for those situations where some require order while others do not.

    Schematrons.

    The cool thing is that you can define one for each group of users, really as many as you like. I’m not sure they would necessarily help you to do the diffing, though – unless you were able to use the Schematron rules as input to that process. Think an initial setup using XPaths, before performing the actual diff.

    I need to finish that thought.

    In the meanwhile, thanks again.

    Best,

    /Ari

    • Robin La Fontaine
      Robin La Fontaine says:

      Thanks, Ari, for your suggestion about Schematron. However, my thinking is that order is not to do with the user but rather it is inherent in the data/document. This is similar though not quite the same as the discussion about whether order is to do with processing. But this is all useful to discuss and work through as you say! There is more to consider here than I first thought.

