XML is Not the Answer

XML is not the answer. Structured writing may be the answer. XML is one way to implement structured writing.

More and more these days, I am hearing technical publication managers (and not a few consultants) talk about the need to “move to XML”. This may be a shorthand way for them to describe a very sensible plan to implement structured writing, but if so, I wish they would say “move to structured writing”. Structured writing is a potential answer to many content challenges. XML, by itself, is not.

There is a very real danger in all this XML talk that people can get the impression that if they move their content to XML — any XML — then they will instantly have the ability to reuse content across the enterprise and to seamlessly and effortlessly deliver content to any and every channel in a form that perfectly matches the optimal information design for that channel. It just isn’t so.

Let’s examine this by looking at what XML does. XML, in fact, does one very simple and elementary thing. It lets you apply labels to content. More specifically, it allows you to add labels to content inside of a document. It even allows you to apply a label to a single word or phrase, as in this example that I use a lot:

<p><director name="Howard Hawkes">Hawkes'</director> final film is a lighthearted Western in the <movie>Rio Bravo</movie> mold, with <actor name="John Wayne">the Duke</actor> as an ex-Union colonel out to settle some old scores.</p>

Here the <actor> tag is used to apply a label to the words “the Duke”. The label consists of the name of the tag itself (“actor”) and the value of the name attribute (“John Wayne”). These two parts of the label identify what type of thing the words “the Duke” refer to (an actor) and the specific instance of that type (John Wayne).

The ability to label content at a granular level like this is very useful, because it enables us to write algorithms to process the text at granular levels, which lets us do all kinds of content engineering such as publishing in different formats, reusing in different publications, or generating links automatically.

But all these content engineering possibilities depend not on the format of the labels, but on what is labeled, how it is labeled, and how reliable the labels are. If the right things are reliably labeled in the right way to enable a particular content engineering operation, then you can perform that operation with an algorithm. If not, you will have to do it by hand.
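To make this concrete, here is a minimal sketch of an algorithm reading the labels from the example above. It assumes Python and its standard xml.etree.ElementTree module; the names SOURCE and extract_labels are illustrative only and do not come from any particular content system.

```python
import xml.etree.ElementTree as ET

# The labeled sentence from the example above.
SOURCE = (
    '<p><director name="Howard Hawkes">Hawkes\'</director> final film is a '
    'lighthearted Western in the <movie>Rio Bravo</movie> mold, with '
    '<actor name="John Wayne">the Duke</actor> as an ex-Union colonel out '
    'to settle some old scores.</p>'
)


def extract_labels(xml_text):
    """Return (type, canonical name, labeled text) for each labeled phrase."""
    root = ET.fromstring(xml_text)
    labels = []
    for element in root.iter():
        if element is root:
            continue  # skip the enclosing <p> itself
        text = "".join(element.itertext())
        # Assumption: when there is no name attribute, the labeled text
        # itself is the canonical name (as with <movie>Rio Bravo</movie>).
        labels.append((element.tag, element.get("name", text), text))
    return labels


labels = extract_labels(SOURCE)
for label in labels:
    print(label)
```

Once the labels are available as data like this, operations such as building an index or generating links become a matter of looping over them. If a phrase was never labeled, or was labeled unreliably, no amount of code will recover it.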

As long as the labels are correct, it does not much matter what format the labels are in. Below is another way of applying the same labels to this content.

[Hawkes'](director "Howard Hawkes") final film is a lighthearted Western in the [Rio Bravo](movie) mold, with [the Duke](actor "John Wayne") as an ex-Union colonel out to settle some old scores.

The point I am making here is not that you should choose a different format than XML. The point is that XML is doing a trivial task here. If you had a whole bunch of content whose labels were applied in the format shown above rather than XML, a few lines of code would be sufficient to turn it into XML on the fly so that an XML-based content processing system could read the labels and use them in exactly the same way.
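As a rough illustration of how few lines that takes, the following sketch converts the bracket format shown above into the XML form. It assumes Python, that labels are never nested, and that every label type is also a legal XML element name; the names LABEL, to_xml, and SOURCE are invented for this example.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# [labeled text](type) or [labeled text](type "canonical name")
LABEL = re.compile(r'\[([^\]]+)\]\((\w+)(?:\s+"([^"]*)")?\)')

SOURCE = (
    '[Hawkes\'](director "Howard Hawkes") final film is a lighthearted '
    'Western in the [Rio Bravo](movie) mold, with [the Duke](actor "John Wayne") '
    'as an ex-Union colonel out to settle some old scores.'
)


def to_xml(text):
    """Rewrite bracket-style labels as XML elements inside a <p>."""
    parts = []
    pos = 0
    for match in LABEL.finditer(text):
        # Copy the unlabeled text before this label, escaping &, <, and >.
        parts.append(escape(text[pos:match.start()]))
        phrase, tag, name = match.groups()
        attr = f" name={quoteattr(name)}" if name is not None else ""
        parts.append(f"<{tag}{attr}>{escape(phrase)}</{tag}>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "<p>" + "".join(parts) + "</p>"


xml_text = to_xml(SOURCE)
print(xml_text)
```

The conversion adds no information. Everything the XML version says was already present in the bracket version, which is why the choice between them is an implementation detail.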

What matters is that the labels are correct for the operations you want to perform. Obviously enough, different labels enable different operations. For instance, Howard Hawkes and John Wayne are both people, Rio Bravo and War and Peace are both titles. If you decided to use the label “person” for Howard Hawkes and John Wayne and the label “title” for Rio Bravo and War and Peace, you could do a content engineering operation such as making an index of people mentioned in the content or preparing a bibliography of all the titles mentioned in the content.

But you could not make separate lists of actors and directors or a separate list of books and movies, because the labels you have chosen do not contain that information. On the other hand, if you use separate labels for actors and directors and for books and movies, you can still create an index of people or a bibliography of titles simply by telling your content engineering algorithm that actors and directors are people and that movies and novels are titles. The more specific your labels, the more useful they tend to be.
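The asymmetry can be sketched in a few lines of Python. The label data and the BROADER mapping below are assumptions made up for illustration; the point is only that specific labels can be rolled up into general ones with a simple lookup, while nothing in a general label lets you go the other way.

```python
from collections import defaultdict

# Specific labels as (type, name) pairs, as they might be read from content.
LABELS = [
    ("director", "Howard Hawkes"),
    ("actor", "John Wayne"),
    ("movie", "Rio Bravo"),
    ("novel", "War and Peace"),
]

# What we tell the algorithm: each specific type belongs to a broader type.
BROADER = {
    "actor": "person",
    "director": "person",
    "movie": "title",
    "novel": "title",
}


def index_by(labels, generalize=False):
    """Group names by label type, optionally rolled up to the broader type."""
    index = defaultdict(list)
    for label_type, name in labels:
        key = BROADER.get(label_type, label_type) if generalize else label_type
        index[key].append(name)
    return dict(index)


specific = index_by(LABELS)                  # separate lists per specific type
general = index_by(LABELS, generalize=True)  # index of people, list of titles
print(specific)
print(general)
```

Had the content been labeled only with "person" and "title", there would be no table that could split "person" back into actors and directors. That information would have to be added by hand.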

Everything depends, therefore, on what you decide to label and how you decide to label it. Whether you decide to label it in XML or in some other format is a trivial implementation detail. In most cases, of course, you will choose XML for this implementation detail, but to say “I need to move my content to XML” is to focus on a trivial implementation detail, not on the thing that makes a real difference: labeling your content to enable content engineering processes. In other words, adopting structured writing.

When you do set out to implement structured writing, the key question you have to ask is, what content engineering processes do I want to apply to my content, and what labels will I have to apply to my content to enable those processes to be automated? All the choices you make in specifying the system you need — XML or not, DITA or DocBook or something else — should be based on the answers to this question.

There is another reason to recognize that correct labeling of content, not XML, is what matters: you may find that you have lots of usefully labeled content all around you that you can bring into your content engineering process. That content may not be labeled in XML. It may be in a relational database, or in a stream of JSON data, or in product source code (programming languages are just carefully labeled instructions and data), or in source code commenting systems such as JavaDoc. But if it contains labels that you can reliably access with algorithms, you can integrate it with the rest of your content in all sorts of interesting, useful, and time-saving ways.
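As a small sketch of what that can look like, suppose the same labels arrived as JSON rather than XML. The feed below is hypothetical sample data, and FEED and labels_from_json are illustrative names; only Python's standard json module is assumed.

```python
import json

# Hypothetical JSON feed carrying the same labels as the XML example.
FEED = """
[
  {"type": "director", "name": "Howard Hawkes"},
  {"type": "movie", "name": "Rio Bravo"},
  {"type": "actor", "name": "John Wayne"}
]
"""


def labels_from_json(feed_text):
    """Read (type, name) label pairs from a JSON array of records."""
    return [(record["type"], record["name"]) for record in json.loads(feed_text)]


json_labels = labels_from_json(FEED)
print(json_labels)
```

The resulting pairs have the same shape a program would get from reading XML markup, so any downstream indexing or linking code can treat both sources alike.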

XML is not the answer. Structured writing may be the answer. XML is one way to implement structured writing.

Author: Mark Baker

Mark Baker is the author of Every Page is Page One: Topic-based Writing for Technical Communication and the Web and Structured Writing: Rhetoric and Process as well as other books on content and content technologies and dozens of articles on technical communication, content strategy, and structured writing. He has worked as a technical writer, tech comm manager, director of communications, trainer, programmer, copywriter, and consultant and has spoken frequently at industry conferences. He has designed, built, and used multiple structured writing tools and systems, including the one used to write this book. He blogs at everypageispageone.com and tweets as @mbakeranalecta. For more information, see analecta.com.

15 thoughts on “XML is Not the Answer”

  1. Blasphemy!

    Seriously though, do we have any examples of a doc repository in JSON or a relational database? I’m sure there are. I just haven’t seen any.

    1. Thanks for the comment, Scot.

      This blog is an example of a doc repository stored in a relational database: WordPress stores all its data in a MySQL database and uses a system it calls shortcodes for labeling content below the document body level. Labels applying to the whole post are stored in relational fields. In general, most Web CMSs are based on relational databases and feature varying degrees of structure based on breaking down elements of content into relational fields. Such systems certainly outnumber the XML-based repositories used in technical communications.

      I worked on another project that stored content as part of a semantic network in a MongoDB database which uses a modified form of JSON for storage. We used some XML on the front end, though that was not essential, but MongoDB seemed to be the right choice to store the semantic network relationships that were the real core of the project.

      But my point about information in relational databases and JSON feeds is not so much about basing your whole documentation system on them as taking advantage of the labeled information already available in your organization.

      For example, in a project for one client, I was able to generate a component database from information stored in a format intended to be read by a C program and integrate it with content derived from an in-house source code commenting system. We did use XML as an integration format for that system, and later integrated authored content from an XML source, but the bulk of the information that went into the reference came from information sources that were structured and labeled in different formats. It was the labeling, not the format, that enabled that content engineering solution.

      Again, my point is not to discourage people from using XML, but to encourage them to think about the substance of what they are doing when they apply labels to content, rather than to the trivial fact that they are using XML to apply the labels.

      1. Indeed. I expect we will see increasing use of JSON to deliver certain types of content. It is not a format that is optimized for content by any means, but it will often be the format of opportunity for Web APIs and similar things. As long as it lets you label content in the ways that support the automation you want to do, there is no reason not to use JSON — economic considerations aside.

  2. You knew you’d hear from me.

    So I agree with you completely on the naming role of XML vs shortcodes, at least within the example you give.

    Since shortcodes are simply text, they can be inserted as valid markers by popular in-browser editing tools, but those tools normally lose touch with that markup after they insert it–they provide no validation and no subsequent markup-aware editing on inserted codes. WordPress programmers have persevered to put a rudimentary shortcode tracking capability in place, but users can easily hand code wrong values since shortcodes are plain text. And there is no universal definition of shortcodes… they are conventions, not standards.

    By contrast, what XML offers is a system for managing content in an object-oriented manner, with at least a prayer of assurance that your markup is correct and organized in a way that allows querying the inner structure, should you need to. Not everyone needs those assurances, nor do I defend the draconian nature of XML validation or the cost of fully structure-aware XML tools. I’m just pointing out that XML provides a set of system-level services that are already fairly ubiquitous, contrasting that situation with the limited scope of support for shortcodes outside the WordPress universe.

    And even for non-XML structured content tools, there’s need for some level of schema-informed methodology to guide authors and to build tools with common behaviors and that are more widely supported and deployed. Are you willing to ditch XML and build all that infrastructure yourself? Or would you not build that system on top of XML, hiding the complexity but exposing those benefits?

    My main concern about all the adaptive content authoring tools out there is that they also follow the presumed mantra that XML is not the answer, and therefore each one ends up being its own expensive, siloed solution with no prayer of widespread open source or vendor support or of having a self-sustaining user community outside of that singular solution. SPFE is a partial solution, but it cannot meet the content management and publishing needs of the Web as a whole. So while “XML is not the answer” predictably is a zinger title for this post, the simple premise overlooks an entire value ecosystem that is available to utilize. I want a solution that is widely accepted, makes use of existing services, brings delight to all potential authors, publishers, and content consumers, and cures the common cold. What say?

    1. Thanks for the comment, Don.

      I certainly hoped I’d hear from you! You always raise important and interesting points, as you do in this comment.

      I agree that the main economic motive for choosing XML as the format for expressing structure in content is that it comes with a full suite of validation and processing tools, not to mention editors and a variety of repository options that are optimized for storing and retrieving it.

      That XML is designed to be validated, and that there are editors that can validate content on the fly and suggest valid options to authors at any point in the document, are major advantages in ensuring that content is labeled reliably. Those advantages are largely wasted if people use generic formatting-oriented schemas that don’t require any interesting labels to be applied.

      As you know, my critique of many DITA implementations is that in sticking to the generic task/concept/reference topic types, they forgo the opportunity to enforce a rhetorical structure appropriate to their business and to label things in their content in ways that make sense for the business, and that would enable greater degrees of automation for their content. DITA does, of course, provide ways of doing this, but many implementations fail to use them, and many advocates discourage their use.

      But that, you see, is the point I am making. XML is a useful implementation tool, but it is not the answer. The answer is the reliable and correct labeling of content. If you implement XML without thinking through how your content should be labeled to meet your business needs, you are missing the point of the exercise.

      In pointing out that there are other useful ways to label content, I am not telling people to run away from XML; I am telling people to think about how they need to structure and label their content before rushing into a conversion project. I am also telling them to recognize that any source of content or data that is reliably labeled is a potential input source for content engineering, even if it is not labeled using XML.

      Once someone has a clear idea of how they want to structure and label their content to support their business processes, XML may be the best tool for implementing that structure. But simply converting to XML, any XML, is not the answer and may not yield the benefits expected, or any benefits at all.

      XML, therefore, is not the answer, or, at least, not the answer to the most important question, though it may be the answer to a significant secondary question about implementation.

      As for SPFE, as an architecture it is just as general as DITA in its potential applications, though obviously not anywhere like as mature or well supported as DITA, and optimized for a different set of operations. But I agree that neither one is able to meet the content management and publishing needs of the Web as a whole. There is no one ring to rule them all and in the darkness bind them.

  3. Hi Mark,

    Thanks for this post, as always you get to the heart of the problem :).
    I hear a lot from people that want to obtain structured content but they do not want their authors to know anything about the labels they want to set (XML element names in most of the cases), they want the authors to work as they used to work in Word – and that shows exactly that they are missing the point for moving to XML/structured content.
    What I have seen also is that in many cases people add markup/labels to content but they do not use those labels in any way – that makes it difficult for people that create that content to understand why they should take the effort to add those specific labels/markup to the content.
    Anyway, the title is a little misleading; I think something like “XML is just one answer” or “XML is part of the answer” would better express the ideas from your post.

    Best Regards,
    George

    1. Thanks for the comment, George,

      It is exactly that kind of consumer confusion that I am talking about. If people want an editor that works exactly like Word but produces XML, then the right product for them is Word, since Word’s file format now is XML. The only reason not to choose Word for this purpose is that the semantics of Word’s XML are purely those of Word’s internal document model.

      Word (like just about every other tool on the planet) gives you XML. But it does not give you configurable structure. You need a tool like oXygen when you want to implement configurable structure, and that will always mean that the writers need to be aware of the structure they are being asked to create. oXygen gives you multiple elegant ways to build an interface to capture that structure, but the writer has to actually apply the structure.

      The title has certainly misled some people. It was chosen to be provocative. I hoped the subtitle would clarify the argument, but for some readers it has clearly failed to do so. A more correct title would have been “XML is the Answer to the Second Question, but You Need to Answer the First Question First,” but that is a bit long to fit in a Tweet.

      Perhaps the reason for the confusion is exactly the thing I am trying to address: that people have conflated XML with structure, and now say XML when they mean structure (which is somewhat like saying wood when you mean house, or hammer when you mean blueprint).

      The problem this creates in the marketplace is that it gives consumers the idea that XML is a kind of magic pixie dust that you have only to sprinkle on your content to give it magical powers. Having your customer believe in magic can sometimes help you make a sale in the short term, but it can lead to disappointment and frustration in the long term.

      This potential for long-term disappointment is clearly shown by your second point about people creating labels they never use. This industry has started to use the word “standard” as a marketing tool, creating the impression in some people that they have to create these labels they never use because it is required by some standard. But this is just spending money for nothing, and sooner or later, someone is going to question why that money is being spent.

      “Because it’s a standard” is never a reason to do anything. Standards are operational, not normative. You should use them where they give you an economic advantage, not simply because they exist.

  4. Hi Mark,

    thank you for highlighting an important aspect in our industry.

    In my experience, people often confuse XML markup with content processing. Just by stating <para language="EN"> or <safety level="critical"> in the content, nothing is achieved yet. You are definitely right that algorithmic content processing is not confined to XML markup or to a specific form of XML markup. People moving from one XML implementation to another often have a hard time understanding that it’s more efficient to adapt the way markup is applied than the way the XML engine is processing the markup. It’s the old syntax vs. semantics confusion.

    Best regards,
    Sebastian

    1. Thanks for the comment, Sebastian.

      Indeed, the syntax vs semantics confusion just will not go away. What particularly irks me about it is the people who should know better who talk about XML being essential for information exchange.

      XML is syntax. Given semantics in an unambiguous form, you can easily convert from one syntax to another for exchange. On the other hand, incompatible semantics make exchange impossible regardless of syntax.

  5. Well said, Mark! Your post explains well the XML/content management confusion, but it also speaks to a broader problem we have in tech comm these days–that of mistaking tools for talent or skill or even effort in general. Great post!

    1. Thanks for the comment, Karen.

      “Mistaking tools for talent” is an interesting way of putting it. What I am seeing today is a great deal of mistaking standards for business processes, which may be very much the same sort of thing.
