Introduction
Organisations that publish reference material relied upon by professionals must ensure that their content is accurate and up to date. This is particularly vital in fields such as law, engineering, and medicine. It is therefore desirable to store such material in a single repository for ease of updating, version control, and dissemination.
Background
The Karnov Group Denmark is a Danish legal publishing house which provides a legislative overview to the legal profession, courts of law, accountants and local government employees in Denmark. It has been established for over 150 years and currently has 150 employees at its Copenhagen headquarters and 550 associated professionals. It is part of the Karnov Group AB, based in Sweden, which had a turnover in the last financial year of SEK 878M. Every year it publishes Karnov’s Law Books. In addition, Ugeskrift for Retsvæsen (Weekly Journal for the Judiciary), other professional journals and casebooks are published weekly or annually.
The Challenge
Manually merging XML sources
In 2017, Karnov Group Denmark bought Norstedts Juridik, a Swedish legal publisher, and set about merging the two companies’ document sources. Both companies published the Swedish legislation, known as the Swedish Code of Statutes or SFS, online and in print, and both used in-house XML sources based on PDF sources provided by Regeringskansliet, i.e. the Swedish government. However, the companies used separate tag sets and their own interpretations of the semantics in the PDFs. They also enriched the basic law text with extensive annotations, links to (and from) case law, and so on. In the case of Norstedts, the XML source was also continually updated to support their flagship product, the so-called Blue Book, a printed book of the entire in-force SFS, updated yearly as the law itself changes.
It made little sense to continue maintaining the two SFS sources separately. What was needed was, essentially, the sum of the two, with annotations, missing versions (significant gaps existed in the version histories), and so on, added. There were 8,000 to 9,000 separate statutes, occupying a total of about 16,000 files, each varying in size from a few KB to 10 or 12 MB, with an average of around 200 KB. Merging two datasets of this size was an enormous task that needed to be automated.
At first glance, in order to maintain a single source of legislative information, it might seem enough to simply pick one XML source and use that. However, both companies had their existing online systems and customer bases with differing product offerings. As suggested above, the sources described the same thing, the Swedish Code of Statutes, but the sets were not an exact match. Sometimes one company had SFS documents that the other did not, and often there were significant gaps in the older versions of individual chapters and paragraphs. It made little sense to throw any of that away just to make the merge simpler.
There was also the matter of the Norstedts flagship product, the printed law book, that had to be included in any future offerings, so anything written specifically for that book had to be included.
Similarly, Karnov included extended notes in their SFS content and made these available as a separate product, which meant that the notes would have to be preserved, too. In one way or another, therefore, the respective SFS sources would have to be merged.
The Solution
Consult and automate XML merging
Carrying out this merge manually would have been a huge undertaking, so in early 2018 Karnov Group recruited Ari Nordstrom as XML guru to address the problem. Ari is a content architect and XML expert with over twenty years of experience in single-source document management and publishing, encompassing most XML standards in use today, from schema languages to XSLT, XQuery, and XProc. His past clients include organisations such as Volvo Cars, The Swedish Federation of Farmers, LexisNexis, and many others. Ari was tasked with managing the project to merge the two disparate SFS sources, and then with devising a repeatable method for updating the documents as laws are made, revised or repealed.
Ari’s proposal was as follows:
In order to merge the SFS content into a single XML source, first convert both sources into a single exchange format. Next, compare these versions using DeltaXML’s XML Compare, and do the actual merge based on the difference file produced. Finally, convert the merged content into a future editing format. This process was to be done in six stages:
- Create a DTD describing a common exchange XML format (EXC DTD), essentially the sum of the semantics found in the respective sources.
- Update an existing, or create a new, authoring format (KG++ DTD), in which the merged SFS corpus can be maintained and updated in the future.
- Convert both sources to the common exchange format, at times up-converting and fixing various semantic constructs so they’d match each other.
- Compare the converted sources with each other using an XML differencing tool that merges the sources and adds markup to indicate any differences.
- Address those differences, one by one, to produce properly unified SFS documents with a single main law text and clearly identified enhancements, still in the exchange format.
- Convert the unified SFS documents to the authoring XML (KG++ DTD) format.
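The conversion stage in the list above can be illustrated with a minimal sketch. The project itself used XSLT pipelines managed by XProc; the Python below only mirrors the idea of running small, reorderable steps that bring two differently tagged sources into one exchange vocabulary. All tag names and mappings are hypothetical, since the actual tag sets of the two publishers and the EXC DTD are not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag mappings; the real source tag sets and the EXC DTD
# vocabulary are not described in this case study.
SOURCE_A_TO_EXC = {"law": "statute", "sec": "section", "para": "p"}
SOURCE_B_TO_EXC = {"act": "statute", "paragraph": "section", "text": "p"}


def rename_tags(root, mapping):
    """Return a copy of the tree with tags renamed according to mapping."""
    copy = ET.fromstring(ET.tostring(root))
    for elem in copy.iter():
        elem.tag = mapping.get(elem.tag, elem.tag)
    return copy


def normalise_whitespace(root):
    """Collapse runs of whitespace in text nodes so trivial differences vanish."""
    for elem in root.iter():
        if elem.text:
            elem.text = " ".join(elem.text.split())
        if elem.tail:
            elem.tail = " ".join(elem.tail.split())
    return root


def run_pipeline(root, steps):
    """Apply each step in order; reordering the list reorders the pipeline."""
    for step in steps:
        root = step(root)
    return root


def to_exchange(xml_text, mapping):
    """Convert one source document to the (hypothetical) exchange format."""
    steps = [lambda r: rename_tags(r, mapping), normalise_whitespace]
    return run_pipeline(ET.fromstring(xml_text), steps)


source_a = '<law><sec id="1"><para>Text  one</para></sec></law>'
source_b = '<act><paragraph id="1"><text>Text one</text></paragraph></act>'

exc_a = ET.tostring(to_exchange(source_a, SOURCE_A_TO_EXC), encoding="unicode")
exc_b = ET.tostring(to_exchange(source_b, SOURCE_B_TO_EXC), encoding="unicode")
print(exc_a == exc_b)
```

Keeping each step as a small function of its own is what makes it cheap to insert, remove, or reorder steps as new mismatches between the sources are discovered.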
The various transformations were performed using XSLT pipelines managed by XProc. Once both sources were in exchange format, they needed to be compared with each other. For this, Ari chose what many see as the industry standard for comparing XML, the XML Compare tool from DeltaXML Ltd. Ari was familiar with XML Compare when he joined Karnov, and it was one of those tools that he really wanted to use in a big project, having listened to DeltaXML and others discuss the product and differencing at XML Prague and Balisage.
XML Compare compares two XML files, “A” and “B”, with each other according to predefined rules and inserts differencing markup to show where the differences lie. It can optionally output an HTML representation of the differenced A and B files, which proved helpful in this project when discussing the merge with various project stakeholders. It was in step 4, the comparison, that the functionality of XML Compare proved invaluable. For example, when comparing sources that supposedly share the same base but have some definable differences, it is useful to tell the compare process that certain nodes are intended to be the same. In XML Compare, this is done by adding matching deltaxml:key attribute values to the corresponding nodes in both sources.
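As a rough illustration of keying, the sketch below copies an existing identifier into a deltaxml:key attribute before the comparison is run. It is not taken from the project: the element name `section`, the `id` attribute, and the sample document are hypothetical, and the namespace URI bound to the `deltaxml` prefix is an assumption that should be checked against DeltaXML’s documentation.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the deltaxml prefix; verify before relying on it.
DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
ET.register_namespace("deltaxml", DELTAXML_NS)


def add_keys(root, tag="section", id_attr="id"):
    """Copy each matching element's identifier into a deltaxml:key attribute.

    Elements without the identifier attribute are left unkeyed.
    The tag and attribute names are hypothetical defaults.
    """
    for elem in root.iter(tag):
        ident = elem.get(id_attr)
        if ident is not None:
            elem.set(f"{{{DELTAXML_NS}}}key", ident)
    return root


doc = ET.fromstring(
    '<statute><section id="ch1-p2"><p>Text</p></section>'
    "<section><p>No identifier</p></section></statute>"
)
keyed = add_keys(doc)
print(ET.tostring(keyed, encoding="unicode"))
```

Running the same keying step over both exchange-format sources gives corresponding sections the same key value, which is the hint the comparison needs in order to align them.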
“I have close to 30 years in the field and in my experience XML Compare doesn’t have an equal at what it does. Again, I’d tested it but hadn’t had the chance to use it in a real project. Basically, I told Karnov what I needed, and they were good enough to trust me. I’d used oXygen’s built-in diffing, of course, but it’s nowhere near as powerful. The various software diffing tools – there are a few decent ones – aren’t XML-aware, so they were never an option.”
Ari Nordstrom, Content Architect and XML Consultant, Karnov Group
The Results
Changes found, systems merged, processes undisturbed
Ari was fully occupied on this project for six months, at the end of which the two disparate systems had been successfully merged, while both companies’ regular publishing activities had continued undisturbed.
DeltaXML’s XML Compare formed a vital component of this project. The fact that the process had to be repeatable (new laws were being written all the time) made most other approaches unworkable. Although the legislation dates back to at least 1750, most of the code was written very recently, within the last decade or two. A manual approach would not have worked.
The advantages of using XML Compare for this project can be summarised as follows:
- The ease of use of the injected difference markup was hugely important. It was a perfect match for the pipelined, step-by-step approach to converting and merging used in this project, where one problem or aspect could be tackled at a time. Most XSLT conversions tend to be rather monolithic, but this project needed to run as sequences of steps so that the order of operations could be revised. The XML Compare markup is logical and straightforward to process, which made that kind of reordering simple.
- The product was extremely reliable. No bugs were hit during the entire project, and the performance was a lot better than expected. The XProc pipelines turned out to be the limiting factor.
- The format lends itself well to XSLT unit tests (XSpec tests). The unit tests were an absolute necessity.
- For project management, it was useful to have the HTML representation of the difference files. It was easy to present management with visible progress.
- The pipeline produced valid output and included automated quality checks. Having a hundred students type in the sources and add the commentary, citations, etc. would have taken longer and still required proofing afterwards.
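The points above about processing and unit-testing the difference markup can be sketched as follows. The project used XSLT with XSpec; this Python version only shows the general shape of such a check. It assumes that the difference file marks elements with a deltaxml:deltaV2 attribute whose values include A, B, A=B and A!=B, and that the namespace URI is as shown; both are assumptions to verify against DeltaXML’s documentation. The sample document is hand-written for illustration and is not real XML Compare output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed namespace URI and attribute name for the difference markup.
DELTAXML_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
DELTA_ATTR = f"{{{DELTAXML_NS}}}deltaV2"


def summarise_delta(root):
    """Count elements per delta value, ignoring elements without the attribute."""
    return Counter(
        elem.get(DELTA_ATTR)
        for elem in root.iter()
        if elem.get(DELTA_ATTR) is not None
    )


# Hand-written, illustrative difference file (hypothetical element names).
SAMPLE = f"""<statute xmlns:deltaxml="{DELTAXML_NS}" deltaxml:deltaV2="A!=B">
  <section deltaxml:deltaV2="A=B"><p>Unchanged text</p></section>
  <section deltaxml:deltaV2="A"><p>Only in source A</p></section>
  <section deltaxml:deltaV2="B"><p>Only in source B</p></section>
</statute>"""

summary = summarise_delta(ET.fromstring(SAMPLE))
print(dict(summary))
```

A test in this style pins down how many unresolved differences a merge step is expected to leave behind, so a regression in an earlier pipeline step shows up as a changed count.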