We have recently open-sourced our CALS validity code. This blog post covers the background to that decision and some wider issues around CALS validity that I think deserve more awareness.
About the code
The code is designed to check all of the semantic constraints associated with CALS tables. We wanted it to support the CALS tables in both DocBook and DITA and so we have tried to cover both the ‘full’ CALS spec and also the XML exchange (subset) specification.
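To give a flavour of what a semantic constraint looks like, here is a minimal Python sketch of just one of them: a row should not need more columns than the tgroup's cols attribute declares. This is not the schematron code in the repository, only an illustration. The function name row_width_problems and the SAMPLE table are made up for this post, and the check is deliberately simplified: it ignores cells carried down by morerows, colname-based positioning and entrytbl.

```python
import xml.etree.ElementTree as ET


def row_width_problems(tgroup):
    """Return messages for rows that need more columns than @cols declares.

    Simplified illustration: morerows, colname positioning and entrytbl
    are ignored, so this is far from a complete CALS check.
    """
    cols = int(tgroup.get("cols"))

    # Map each colspec name to its 1-based column position. colnum is
    # optional, so fall back to document order when it is absent.
    position = {}
    next_col = 1
    for spec in tgroup.findall("colspec"):
        num = int(spec.get("colnum", next_col))
        if spec.get("colname"):
            position[spec.get("colname")] = num
        next_col = num + 1

    problems = []
    for section in ("thead", "tbody", "tfoot"):
        for index, row in enumerate(tgroup.findall(f"{section}/row"), start=1):
            width = 0
            for entry in row.findall("entry"):
                start, end = entry.get("namest"), entry.get("nameend")
                if start in position and end in position:
                    # A horizontal span covers every column from namest to nameend.
                    width += position[end] - position[start] + 1
                else:
                    width += 1
            if width > cols:
                problems.append(
                    f"{section} row {index}: {width} columns used, cols={cols}"
                )
    return problems


# Hypothetical sample: the second row spans both columns and then adds a cell.
SAMPLE = """
<tgroup cols="2">
  <colspec colname="c1"/>
  <colspec colname="c2"/>
  <tbody>
    <row><entry>a</entry><entry>b</entry></row>
    <row><entry namest="c1" nameend="c2">wide</entry><entry>extra</entry></row>
  </tbody>
</tgroup>
"""

print(row_width_problems(ET.fromstring(SAMPLE)))
```

A DTD or schema cannot express this kind of rule, because it depends on comparing attribute values across different elements. That is why a rule-based language such as schematron is a natural fit for the real checks.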
I don’t want to say too much about how it works or how to use it. I’ll just point you to:
https://github.com/nigelwhitaker/cals-table-schematron
There is one outstanding issue, still in progress: the code pushes the limits of progressive validation beyond what some of the current schematron implementations support. That’s something I will try to talk about separately.
Why did we develop it?
When we first did table comparison work we tended to look at our results in browsers (after conversion to HTML) or XML editors. A jagged right-hand side of a table, due to uneven column lengths, was something that most browsers could handle without complaint (why would they complain? It’s not an HTML error, or at least it wasn’t when I last looked).
Then we had a bug report from a customer: Apache FOP didn’t like some of our tables! These were DocBook files with CALS tables that we had compared to produce a comparison result. The result was then successfully converted to XSL-FO by docbook-xsl, and that output was being processed by FOP. Both of the input tables in this case were valid; however, our comparison result was invalid and FOP was throwing an exception. We were trying to align the table entries or cells according to their content (using LCS optimisation) and for some content it wasn’t working in a column-by-column manner. It took a while to fix the issues with the table comparator and in the process we:
- thoroughly read the CALS specs
- tried to understand what FOP and the other XML publishing systems/tools/pipelines required.
We were more successful at the former, although it did take a few read-throughs to pick out the important bits of text. Finding the constraints required or implemented elsewhere has proved much harder: there are many systems which we don’t have access to in order to test them, and/or which don’t document their requirements/checks.
Our CALS table validity policy
Given our failure to even list all of the software that could take DocBook and DITA and turn it into HTML, PDF, e-pubs etc., never mind the CALS requirements/checks of that software, we decided to have a cautious policy, but one that would be customer friendly:
if both of the inputs to a comparison are valid CALS tables then the output table will be valid too
In order to do this we first needed to implement full validity checking, which we then started using in our testing process. There is another aspect to the statement above: if the inputs are invalid, what do we do?
We pondered several approaches and eventually came up with some options that were partly guided by the wording in the exchange specification:
“It is recommended that an authoring or editing implementation or any implementation that verifies the compliance of the table markup to this Memorandum offer the option of producing a warning message when it encounters such markup. It is an error for an authoring or editing implementation to produce a table with such markup.”
So we decided to provide options for failing with an exception, adding warnings using XML comments, PIs or markup (e.g. a warning para before a table describing the problem), or logging/messaging the result. The schematron code we’re open sourcing is the basis for our input checking.
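As a rough illustration of those options, here is a small Python sketch of a reporting function. It is not how our product implements them. The names report_invalid, InvalidTableError, the mode values and the cals-warning PI target are all invented for this example, and the code assumes the caller already knows which element directly contains the table.

```python
import logging
import xml.etree.ElementTree as ET


class InvalidTableError(ValueError):
    """Raised in 'fail' mode for an invalid input table (hypothetical name)."""


def report_invalid(parent, table, message, mode="comment"):
    """Report an invalid table using one of the options described above.

    parent must be the element that directly contains table.
    """
    if mode == "fail":
        raise InvalidTableError(message)
    if mode == "log":
        logging.getLogger("cals").warning(message)
        return

    if mode == "comment":
        # Note: the message must not contain "--" to stay well-formed.
        marker = ET.Comment(f" CALS warning: {message} ")
    elif mode == "pi":
        marker = ET.ProcessingInstruction("cals-warning", message)
    elif mode == "para":
        marker = ET.Element("para")
        marker.text = f"Warning: {message}"
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Insert the warning immediately before the offending table.
    index = list(parent).index(table)
    parent.insert(index, marker)


# Hypothetical usage: add a warning para in front of an (empty) table.
section = ET.fromstring("<section><table/></section>")
report_invalid(section, section[0], "row 2 has too many entries", mode="para")
print(ET.tostring(section, encoding="unicode"))
```

Comments and PIs have the advantage of leaving the document valid against its DTD or schema, whereas a warning para is visible in the published output but only fits where the content model allows a para before a table.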
Is CALS validity important?
We’ve seen different opinions, including “what spec?” and someone who has read this wording in the DITA Spec: “The DITA table is based on the OASIS Exchange Table Model…” and concluded that the CALS spec was not obligatory for a file to be valid DITA.
The statement in the DocBook spec about validity is stronger: “This element is expected to obey the semantics of the CALS Table Model Document Type Definition [calsdtd]”. However, there is a chance that this could be misinterpreted: the title of the spec ends in “DTD”, and a casual user of DocBook may just think a table needs to be DTD valid and not look at the referenced specification.
We think validity is important because we don’t know the requirements of unknown tools two or three steps further down our customers’ publication pipelines that may be in use now or in the future.
What next?
We’ve implemented the semantic rules in CALS. We were sort of surprised that it hadn’t been done before. A few of the examples in some schematron tutorials talk about aspects of CALS tables, but nothing was approaching completeness.
Everyone’s life would be easier if XML editors and authoring/content creation tools fixed the problem at source, and to do that a couple of things would be useful:
- Firstly, if the DITA spec had a stronger statement about CALS validity it may make implementors take more notice of CALS.
- Users should be aware of the problems of bad tables in publication pipelines/systems; they may then ask for tools creating tables to enforce validity or at least provide facilities to validate/check validity.
I hope this post helps with the second bullet above, but first of all I hope the developers of editor/authoring/creation tools start to use the code (and please contribute, fix, enhance as needed).