Managing the enterprise information network

Feature

posted 3 May 2006 in Volume 2 Issue 10

The pros and cons of XML

When and how should organisations adopt XML as a document storage and publishing format?

By Neil Bradley

Extensible mark-up language, or XML, plays a huge role in the way many companies now plan their content management and distributed document processing projects, and its development has generated much interest and widespread adoption around the world.

There is a simple reason for that: the XML data format encodes documents in a way that makes them amenable to efficient and flexible exploitation through automated processes – such as extraction and re-use of titles to build a contents list, for example.
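As a minimal sketch of that kind of automated re-use, the following Python fragment pulls section titles out of a small document to build a contents list. The document and its tag names (report, section, title, para) are invented for illustration; a real document model would define its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical report; the tag names are assumptions made for this example.
REPORT = """<report>
  <title>Annual Review</title>
  <section><title>Introduction</title><para>Opening text.</para></section>
  <section><title>Findings</title><para>Main text.</para></section>
  <section><title>Recommendations</title><para>Closing text.</para></section>
</report>"""


def build_contents(xml_text):
    """Return the title of each top-level section, in document order."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.findall("./section/title")]


contents = build_contents(REPORT)
for number, title in enumerate(contents, start=1):
    print(f"{number}. {title}")
```

Because the titles are identified by tags rather than by font size or position, the same few lines would work on any document that follows the same model.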

But not all documents are suitable for representation in XML. Some of the potential benefits of using XML introduce new complexities and costs that need to be understood if a company is to assess the pros and cons of possible XML adoption. In particular, some content management systems offer advanced capabilities for XML documents, but what are these features, and does your business case justify their cost?

Why XML for documents?

The original purpose of XML, and SGML [standard generalised mark-up language] before it, was to divide the content of documents into meaningful sub-components, arranged in sequential and hierarchical structures that could be easily explored, extracted, re-ordered and formatted for delivery to a variety of publishing media and audiences.

Unlike formatting-oriented languages, such as HTML [hypertext mark-up language], there are no pre-defined tags for XML. Instead, a set of appropriate tags is devised for each document type, and these tags focus on meaning instead of formatting, thus facilitating intelligent querying of the content. Figure one shows how a simple memo might be encoded using XML tags, where <from> and <to> tags unambiguously identify the names of the sender and the recipient.
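To make the idea concrete, here is a hypothetical memo (not a reproduction of the article's figure) and a short Python sketch that reads the sender and recipient from it. Only the from, to and emph tags are named in the text; the memo, subject, body and para tags and all of the content are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical memo; only <from>, <to> and <emph> come from the article.
MEMO = """<memo>
  <from>J. Smith</from>
  <to>A. Jones</to>
  <subject>Budget meeting</subject>
  <body><para>Attendance is <emph>not</emph> optional.</para></body>
</memo>"""

memo = ET.fromstring(MEMO)

# The tags identify each value by meaning, so no guesswork is needed.
sender = memo.findtext("from")
recipient = memo.findtext("to")
print(f"Memo from {sender} to {recipient}")
```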

While not all documents are suitable for XML encoding, those potentially able to conform to tightly controlled narrative structure templates are certainly good candidates. For example, XML is eminently suitable for such document types as reports, journal articles and reference books, user guides and training manuals and, less obviously, for poetry, concert programmes, long forms, simple catalogues and many other types of ‘document’.

Unfortunately, the enthusiasm with which XML has been adopted as a powerful data format for carrying highly structured data between databases and software applications has tended to overshadow its document-focused origins. Only with the development of word processing packages that include ‘save as XML’ options has its original purpose begun to re-assert itself. Yet even that step has promoted a misleading account of the purpose of XML. Microsoft Word, for example, uses XML as a replacement for its RTF format, and it therefore includes XML tags for each of the styles that Word can produce.

In contrast, figure one showed the use of <emph> tags, rather than explicit formatting tags such as <bold> and <underline>, because the decision on how to render the content can (and often should) be made later. More significant is the need to distinguish between, say, foreign words, proper names and emphasised text. XML can do this easily, as shown in figure two, regardless of whether or not the content of the different XML tags will ultimately be displayed in the same italic style. Briefly, some of the other benefits of XML include:

  • An open, international standard. This is often a requirement, especially in government organisations, and the choice of tools from different vendors avoids dangerous vendor ‘lock-in’;
  • Maturity. XML was derived in 1998 from its parent, SGML (which dates back to 1986), which in turn emerged from theories going back to the 1960s. It can therefore be described as evolutionary, having been shaped by the strengths and shortcomings of its predecessors;
  • Popularity. XML has replaced SGML. There are no other competing data formats with the same goals, so there are lots of books, tools and trained software developers to choose from;
  • Supporting standards. XML is supported by other standards, such as extensible stylesheet language transformations (XSLT), for converting XML to other formats for manipulation and presentation.

When to ignore XML

Even the most dedicated XML evangelist will admit that XML is not always the most appropriate solution to a particular information management problem. Indeed, there are times when it should not even be considered.

As indicated above, XML is most suited to structured narrative content, consisting of unpredictable sequences of such text structures as paragraphs, lists and tables. More highly structured data has a natural home in database systems.

At the other extreme, highly designed material, such as a children’s picture book or a brochure (typically created using a design-oriented desktop publishing package such as Adobe InDesign or QuarkXPress), may have no obvious structure at all. But note that, in both scenarios, XML may still play secondary roles in data capture and publishing processes.

Even when the documents are structurally suitable, if the value of the content is very low, and quality is not a goal, then the cost of XML implementation is unlikely to be worthwhile. In particular, short-lived content that will never need to be re-worked for secondary media or alternative audiences rarely warrants the extra costs associated with XML.

XML document models

Automated processing of XML documents can only work reliably if all of the documents are tagged in a consistent and predictable way. This requires a document model to be created, using a document type definition (DTD) or a schema (the World Wide Web Consortium – or W3C – standard or one of its competitors), and for each document to then be tested against that model.
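The principle of testing a document against a model can be sketched in a few lines of Python. This is a toy stand-in, not a DTD or schema validator: the model below is an invented dictionary that only records which child elements each element may contain, and it ignores ordering, required elements and attributes, all of which a real DTD or schema would also constrain.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: element name -> set of permitted child element names.
MODEL = {
    "memo": {"from", "to", "body"},
    "from": set(),
    "to": set(),
    "body": {"para"},
    "para": {"emph"},
    "emph": set(),
}


def find_violations(xml_text, model=MODEL):
    """Return one message for each element that breaks the model."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag not in model:
        problems.append(f"unknown root element <{root.tag}>")
    for parent in root.iter():
        allowed = model.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                problems.append(f"<{child.tag}> not allowed inside <{parent.tag}>")
    return problems


GOOD = "<memo><from>A</from><to>B</to><body><para>Hi <emph>all</emph></para></body></memo>"
BAD = "<memo><from>A</from><bold>B</bold></memo>"

print(find_violations(GOOD))
print(find_violations(BAD))
```

In practice a validating parser would do this work against the real DTD or schema; the sketch only shows why consistent tagging makes mechanical checking possible.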

The big decision is whether to simply adopt a suitable industry standard model (if available) and perhaps tailor it to your needs, or to start from scratch with a new model crafted specifically for your organisation’s document characteristics and functional requirements.

It is also very important to get the model right from the very beginning. Changes made to the model after authors have started to create conformant XML documents can be expensive, because the changes often affect the document tagging too.

A happy and important result of having a carefully planned document model is that, when using an XML-sensitive word-processor, the model acts as an advanced template with which authors are forced to comply (though they may need to be placated by emphasising that the model merely ‘guides’ them). This enforced consistency automatically raises the perceived quality of the product.

Converting content to XML format

The decision to adopt XML often results in an immediate headache: how to convert masses of existing material into XML format. However, different strategies can be adopted, depending upon the nature of the source content, the scale of the task and any security and schedule requirements.

The problem with most other data formats is that they are not as tightly structured as XML and therefore cannot be reliably converted to XML using fully automated processes. Costly manual checking, correction and enhancement steps are almost inevitable.

It can be cheaper to hand over all the existing data to an offshore data conversion bureau (assuming there are no data security issues). It is then important to ensure that quality standards are maintained through a water-tight service level agreement (SLA) that takes into account tagging and content quality rates, along with an effective sample checking process to verify conformance with those standards.

New documents may be created in one of the popular word processing packages, then converted to XML, either using a ‘save as XML’ option, if available, or by use of a specialised batch-processing conversion tool, which may be better at image handling (that is to say, extracting images and creating references to them in the XML). An XSLT engine is then typically used to convert from the output document model to the required document model. Depending upon the complexity of the document model, it may then be necessary to use an XML-sensitive word processor (see below) to correct the occasional mistake, or to add structures that could not be represented within the original word-processor.
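The step of converting from the word processor's output model to the required model would normally be written as an XSLT stylesheet. Since the Python standard library has no XSLT engine, the sketch below swaps in a simple tag-renaming pass to show the idea. The style names on the left of the mapping and the target tags on the right are both invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: word-processor style tags -> target document model tags.
TAG_MAP = {
    "document": "report",
    "Heading1": "title",
    "Normal": "para",
    "Emphasis": "emph",
}


def remap(xml_text, tag_map=TAG_MAP):
    """Rename every mapped element; unmapped elements are left unchanged."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return root


EXPORTED = (
    "<document><Heading1>Costs</Heading1>"
    "<Normal>Costs <Emphasis>rose</Emphasis>.</Normal></document>"
)

converted = ET.tostring(remap(EXPORTED), encoding="unicode")
print(converted)
```

A real conversion usually has to do more than rename tags (grouping paragraphs into sections, for example), which is where the manual correction step described above comes in.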

Creating content in XML format

It is usually much easier to originate content in XML format, rather than create it in another format and convert it to XML later. There are many XML-sensitive authoring tools to choose from, though only the professional XML word processors should be considered for non-technical authors, as cheaper tools that are aimed at XML developers are not sufficiently intuitive and would certainly meet justifiable author resistance.

Some of these word-processors hide the XML from authors in an attempt to simplify the authoring experience, but (in my opinion) cause more user interface problems than they solve. Others take the sensible approach of bringing XML to the fore, as in the example screenshot in figure three. Of course, author training is required, but this generally takes no more than one day of instruction and practice.

Most of the high-end document management systems have integrated one or more of the best of these word processors (such as Arbortext’s Epic editor or SoftQuad’s XMetaL), though these word processors also work as stand-alone tools, so are also ideal for remote, offline authoring.

However, XML word-processors are also relatively expensive. If the cost of purchasing them (along with authors’ training costs) would be too high, then the alternative strategy would be to split authoring into two steps. Authoring can be done in any popular word processor, then the content converted to XML (as discussed above) and completed by a small team of specially trained editors using an XML word processor.

Content management issues

Not surprisingly, a content management system (CMS) does not have to be aware that it is handling an XML document for it to be able to offer the basic features of secure storage, workflow control, search and other retrieval options. But some CMSs are able to detect XML documents and offer advanced XML-specific capabilities.

So, what are the features to consider when shopping for a new CMS? It depends on the detailed business requirements, of course, but here are two sets of factors to consider when creating and archiving XML documents within a CMS.

Editorial System Factors

An XML document that claims to conform to a specific document model might have this claim tested as it is added to the content management system, and thereafter each time it is checked-in after amendment. If the document fails the validity check, it might be rejected or highlighted for further attention. To do this, the CMS must be able to read the XML file to find the information it needs to identify the appropriate document model, then find and use that model in order to validate the document.
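One place where a document can declare its model is the DOCTYPE declaration at the top of the file. The following sketch, using Python's standard expat bindings, shows how a system might read that declaration without validating anything. The sample document, its public identifier and the file name memo.dtd are assumptions; documents governed by a schema typically identify it by other means (a namespace or schema location attribute, for instance), which this sketch does not cover.

```python
import xml.parsers.expat


def find_document_model(xml_bytes):
    """Return (root name, public id, system id) from the DOCTYPE, or None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, public_id, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_bytes, True)
    return found[0] if found else None


# Hypothetical document; the identifiers are invented for this example.
WITH_DOCTYPE = b"""<?xml version="1.0"?>
<!DOCTYPE memo PUBLIC "-//Example//DTD Memo 1.0//EN" "memo.dtd">
<memo><from>A</from><to>B</to></memo>"""

WITHOUT_DOCTYPE = b"<memo><from>A</from><to>B</to></memo>"

print(find_document_model(WITH_DOCTYPE))
print(find_document_model(WITHOUT_DOCTYPE))
```

With the identifiers in hand, the CMS can look up its own managed copy of the model rather than trusting whatever path the file happens to contain.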

A similar issue arises if the CMS is configured to launch an XML word processor whenever an XML document is checked out. The CMS will pass the document to the word processor, which will want to validate the document before displaying it. If the required DTD or schema is also managed by the CMS, then the CMS will need to know and be able to copy out the latest version of the model to the location that the word processor expects to find it.

There are also features of some XML documents that challenge content management systems, in particular the fact that XML-based documents are not always single data files. For example, image data is usually held in external data files that conform to a non-XML-based image data format, such as EPS, TIFF or GIF.

Similarly, an XML document may contain references to other XML documents that are to be merged into the main text. A complete document might therefore be composed of a combination of files and, for the sake of operational simplicity, this bundle may need to be managed as a single object. This might include automatic check-out or copy-out of images or sub-document files referenced from an XML document that is being checked out.
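To check out such a bundle, the system first has to discover which files a document refers to. A minimal Python sketch follows, assuming an invented convention in which images are referenced by an image element with a src attribute and sub-documents by an include element with an href attribute. Real document models use their own mechanisms (entity references, for example), so the mapping would need to be adapted.

```python
import xml.etree.ElementTree as ET

# Hypothetical convention: element name -> attribute holding the file reference.
REFERENCE_ATTRS = {"image": "src", "include": "href"}


def referenced_files(xml_text):
    """List the external files referenced by the document, in document order."""
    root = ET.fromstring(xml_text)
    files = []
    for element in root.iter():
        attr = REFERENCE_ATTRS.get(element.tag)
        if attr and element.get(attr):
            files.append(element.get(attr))
    return files


MANUAL = """<manual>
  <title>Installation</title>
  <include href="safety-notice.xml"/>
  <section>
    <para>Connect the cable as shown.</para>
    <image src="cable.tif"/>
  </section>
</manual>"""

bundle = referenced_files(MANUAL)
print(bundle)
```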

Some content management systems allow an XML document to be ‘shredded’. In its simplest form, this is a single-level process. For example, an XML document representing a book might be split into chunks at the chapter level. This is very useful if parts of a large document are regularly updated and authors want to work simultaneously on different sized chunks of the document.
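Single-level shredding can be sketched as follows. The book and chapter tags and the id attribute are assumptions, and a real CMS would store each chunk as a separately versioned object rather than in a dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical book; tag names and ids are invented for this example.
BOOK = """<book>
  <title>User Guide</title>
  <chapter id="c1"><title>Getting started</title></chapter>
  <chapter id="c2"><title>Troubleshooting</title></chapter>
</book>"""


def shred(xml_text, chunk_tag="chapter"):
    """Split a document into chunks at one level, keyed by each chunk's id."""
    root = ET.fromstring(xml_text)
    chunks = {}
    for chunk in root.findall(chunk_tag):
        chunk.tail = None  # drop whitespace that followed the chunk in the parent
        chunks[chunk.get("id")] = ET.tostring(chunk, encoding="unicode")
    return chunks


chunks = shred(BOOK)
for chunk_id, text in chunks.items():
    print(chunk_id, text)
```

Each chunk is itself a well-formed XML fragment, so it can be checked out and edited on its own.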

There are also a few document management systems on the market that focus on XML (and SGML) content. These provide all or most of the features mentioned here. In addition, they can provide a more sophisticated version of the shredding feature, which is extended to all levels of the document structure hierarchy. Components at all levels can then be versioned, locked, checked out and even shared with other documents, so that an edit made to a shared component automatically updates all of the documents that reference it.

All CMSs maintain metadata about their content items. This typically includes the document title, author name, date and time of creation, and any classification keywords. A CMS may allow mapping of metadata to XML structures as an XML document is checked-out and checked-in. The benefits include avoidance of data duplication and the simplicity of single-interface authoring of both data and metadata.
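The mapping between metadata fields and XML structures might look something like the sketch below. The field names, the element paths and the two function names are all hypothetical; the point is only that a single mapping table can drive both directions, copying values out of the document on check-in and back into it on check-out.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: CMS metadata field -> path of the matching XML element.
METADATA_MAP = {"title": "front/title", "author": "front/author"}


def extract_metadata(root):
    """On check-in: copy values from the document into CMS metadata."""
    return {field: root.findtext(path) for field, path in METADATA_MAP.items()}


def apply_metadata(root, metadata):
    """On check-out: copy CMS metadata values into the document."""
    for field, value in metadata.items():
        element = root.find(METADATA_MAP[field])
        if element is not None:
            element.text = value


DOC = "<report><front><title>Draft</title><author>A. Author</author></front><body/></report>"

root = ET.fromstring(DOC)
checked_in = extract_metadata(root)
apply_metadata(root, {"title": "Final report"})
print(checked_in, root.findtext("front/title"))
```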

Although most CMSs allow specialised content editing tools to be either integrated or launched externally, they also typically have a built-in content editor. This would be a cheaper option. Some CMSs offer a web-based form for creating new content, and may even stress the fact that the content will be stored in an XML format. However, the XML generated will usually conform to a standard, very restricted model. It may not, for example, allow for a narrative structure that includes mixed sequences of paragraphs and lists, or inline tagging beyond simple bold/italic/underline styling. It is very rare for a built-in authoring system to rival the flexibility of true XML word processors. Any built-in editor should therefore be tested for suitability against the document model.

Archive Retrieval Issues

XML data can be searched, like text-based documents, or queried, like database records, using languages similar to SQL. This dichotomy of access techniques reflects XML’s nature as an intermediary between uncontrolled text and highly structured data. Search tools are invaluable for finding specific documents within a large archive. Of course, being text-based, an XML document can easily be indexed by any search engine. However, the structural nature of XML documents enables more refined searches to be performed.

Some search technologies allow for ‘zoning’ of XML document content, where each zone represents the content of a specific pair of tags. The context within which a word or phrase is found can then be taken into consideration. For example, it becomes possible to find documents that contain an important word or phrase, but only when it appears in a summary or within titles. It can be useful to build new documents automatically from components of other documents.
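The zoned search described above can be illustrated with a brute-force Python sketch. A real search engine would build an index rather than re-parse every document, and the zone names (title, summary), file names and sample documents here are assumptions.

```python
import xml.etree.ElementTree as ET


def find_in_zones(documents, phrase, zones=("title", "summary")):
    """Return names of documents containing the phrase inside any given zone."""
    needle = phrase.lower()
    hits = []
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for zone in zones:
            # Join all text inside each zone element, including nested tags.
            texts = ("".join(el.itertext()).lower() for el in root.iter(zone))
            if any(needle in text for text in texts):
                hits.append(name)
                break
    return hits


# Hypothetical archive of two small documents.
ARCHIVE = {
    "a.xml": "<report><title>Budget forecast</title><para>Figures follow.</para></report>",
    "b.xml": "<report><title>Staffing</title><para>The budget is unchanged.</para></report>",
}

print(find_in_zones(ARCHIVE, "budget"))
print(find_in_zones(ARCHIVE, "budget", zones=("para",)))
```

Both documents contain the word, but only the first has it in a title, which is the distinction a plain text index cannot make.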

For example, a catalogue containing titles and summaries of the document archive might be needed. There are simple ways to achieve this when working with small collections of documents, using nothing more than a batch process with an XSLT stylesheet at its heart. But this approach would be inadequate for large collections of documents. This is where an XML database comes into its own, with its advanced retrieval features, typically based on the XPath standard or the more advanced XQuery language.
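For a small collection, the batch approach might be sketched as follows. An XSLT stylesheet would be the usual tool; plain Python is used here only because its standard library has no XSLT engine. The catalogue and entry tags, the source attribute and the sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET


def build_catalogue(documents):
    """Build a new XML document from the title and summary of each source."""
    catalogue = ET.Element("catalogue")
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        entry = ET.SubElement(catalogue, "entry", {"source": name})
        ET.SubElement(entry, "title").text = root.findtext("title", default="")
        ET.SubElement(entry, "summary").text = root.findtext("summary", default="")
    return catalogue


# Hypothetical archive; tag names are assumptions.
SOURCES = {
    "a.xml": "<report><title>Budget</title><summary>Costs rose.</summary><para>Detail.</para></report>",
    "b.xml": "<report><title>Staffing</title><summary>Numbers fell.</summary><para>Detail.</para></report>",
}

catalogue = build_catalogue(SOURCES)
print(ET.tostring(catalogue, encoding="unicode"))
```

Every source document is parsed in full on every run, which is why this approach does not scale to large collections.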

Finally, regardless of how it is found, the content of an XML document is not suitable for direct display to anyone but an XML geek. XML tags therefore need to be replaced by suitable formatting of their contents. A CMS may provide the means to generate a preview of the document, often in HTML or PDF format (the latter typically using a basic XSL-FO [XSL Formatting Objects] engine, for which a suitable stylesheet would need to be developed).
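Replacing tags with formatting can be sketched in Python as a simple recursive mapping from semantic tags to HTML tags. This is a stand-in for what a stylesheet would do, not a description of any particular CMS. The emph tag is mentioned earlier in the article; the foreign, name and para tags and the choice to render all three inline tags in italics are assumptions made to echo the earlier point that different meanings may share one visual style.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical style decisions: semantic tag -> HTML tag used for display.
STYLE_MAP = {"para": "p", "emph": "i", "foreign": "i", "name": "i"}


def to_html(element):
    """Render an element and its descendants as an HTML string."""
    tag = STYLE_MAP.get(element.tag, "span")  # unmapped tags become neutral spans
    parts = [f"<{tag}>", escape(element.text or "")]
    for child in element:
        parts.append(to_html(child))
        parts.append(escape(child.tail or ""))  # text following the child
    parts.append(f"</{tag}>")
    return "".join(parts)


PARA = "<para>The <foreign>Zeitgeist</foreign> favours <emph>open</emph> standards.</para>"

html_preview = to_html(ET.fromstring(PARA))
print(html_preview)
```

Changing one entry in the mapping restyles every document at once, without touching the stored XML.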

Conclusion

In the course of this article, it has hopefully been shown that if the disadvantages of XML include complexity and cost of setup, then the advantages are quality and the low cost and high speed of information re-use. But while adoption of XML can bring economic and quality benefits, these can only be achieved if care is taken to implement XML with due consideration to the potential pitfalls.

In summary, every organisation considering using XML for content management and document delivery needs to ask the following questions: Does the business case support the adoption of XML, is the potential document model appropriate for its intended use and have the most appropriate supporting tools been chosen?

Neil Bradley has worked in publishing technologies for 20 years and specialises in XML-centric solutions. He is the author of The XML Companion (published by Addison Wesley) and several other books on related standards. He is also one of the team at Ixxus Limited, a consultancy specialising in building web-based information management systems for data-rich and knowledge-intensive enterprises and can be contacted at neil.bradley@ixxus.com.

Copyright ©1994-2005 Ark Group Ltd All rights reserved. No part of this site or the publications described herein
may be reproduced in any form without the permission of Ark Conferences Ltd, Registered in England, No. 2931372.