DocBook resurgent: what it tells us about structured writing and component content management

By Mark Baker | 2016/08/03

A new XML-based content management system that is not based on DITA. Bet you didn’t see that coming. But I think it tells us something interesting about the two sides of structured writing.

Tom Johnson’s recent sponsored post explains the origins of Paligo, a relatively new CCMS out of Sweden. Paligo was developed by a company that had formerly been a DITA consulting shop in an attempt to come up with something that was easier to use (and less expensive) than the DITA solutions they were implementing.

What is interesting to me about Paligo is that they chose DocBook rather than DITA as the underlying XML vocabulary. Why? Quoting Tom:

And it turns out that building on a foundation of Docbook XML is considerably easier than building with DITA. DITA tends to impose more restrictions about what you can and can’t do, Svensson says. Even so, Paligo is only “based on Docbook.” Paligo extends from this foundation, adding what they need and not letting the content model restrict the system, while maintaining full capability to export to the open standard.

This is interesting because DocBook is a large and complex specification. (I want to say larger and more complex than DITA, but I’m not sure if that is true anymore.) Why use it as the basis for what is supposed to be a system that is easier to use than DITA?

The answer seems to be lack of restriction. DocBook may be as large and as complex as DITA, if not more so, but it is much less restrictive. DITA has lots of restrictions on what you can and cannot do in each topic type. DocBook has very few. You can combine DocBook elements in just about any way that might occur to you. And apparently Paligo loosens DocBook even further.

Why is this significant? In a CCMS (Component Content Management System) the whole point of the system is to let you assemble documents out of pieces. The main benefit of this is content reuse. The main problem standing in the way is composability: the ability to put pieces together and have them work.

Composability is an interesting problem. Lego and Meccano both have composability within their own systems, but there is little composability between Lego and Meccano. You cannot freely exchange the bits. Composability is similarly a problem between the many different file types used across a typical enterprise. If you want to practice component content management on an enterprise scale, you need a composable format across the enterprise.

There are two ways to get this. One is to allow each group in the enterprise to have their own format, but insist that it must be transparently convertible into a common composable format. The other is to get everyone to use the common composable format directly. The latter sounds easier, so that is what most people choose. (It is not always easier, but people only find that out later.)

To get everyone in the enterprise using a single composable format, you need it to be easy to use, as well as being flexible enough to serve everyone’s needs. What to choose?

DITA has been the default choice for a while now, but the problem is that DITA is not really easy to use. Several companies have tried to make DITA easier to use. (Tom mentions EasyDITA in his article.) But DITA comes with restrictions, and restrictions are hard to learn and annoying to comply with unless you really understand the point of them.

Lightweight markup languages such as Markdown and reStructuredText have become popular as well, with various CMS and publishing systems being built around them. But while they are simple and easy to use, their simplicity can be limiting. There are things that occur in complex publications that they cannot easily represent.

DocBook offers a far richer set of markup structures that can represent all of these things, but without the restrictiveness of DITA. It makes sense, therefore, for a company like Paligo to choose it for their underlying document structure.

There is a rub here, though, and it has to do with the two sides of structured writing that I mentioned at the beginning. Those two sides are composability and constraint. I am writing a book about structured writing (currently being serialized on TechWhirl). That book focuses on the constraint side of structured writing.

The constraint side of structured writing is about expressing and enforcing constraints on content. It is about limiting and shaping what is written to meet a particular need. For example, it may constrain a recipe to follow a particular format and include particular pieces of information.

Here is a constrained version of a recipe:

 recipe: Hard Boiled Egg
     introduction:
         A hard boiled egg is simple and nutritious.
     ingredients:: ingredient, quantity
         eggs, 12
         water, 2qt
     preparation:
         1. Place eggs in pan and cover with water.
         2. Bring water to a boil.
         3. Remove from heat and cover for 12 minutes.
         4. Place eggs in cold water to stop cooking.
         5. Peel and serve.
     prep-time: 15 minutes
     serves: 6
     wine-match: champagne and orange juice
     beverage-match: orange juice
     nutrition:
         serving: 1 large (50 g)
         calories: 78
         total-fat: 5 g
         saturated-fat: 0.7 g
         polyunsaturated-fat: 0.7 g 
         monounsaturated-fat: 2 g 
         cholesterol: 186.5 mg 
         sodium: 62 mg 
         potassium: 63 mg 
         total-carbohydrate: 0.6 g 
         dietary-fiber: 0 g 
         sugar: 0.6 g 
         protein: 6 g

Why would you want to impose these constraints as opposed to just letting writers write and format as they go? Here are some of the more important reasons:

  • It keeps things consistent.
  • It guides the author and ensures they don’t forget things.
  • It takes away the formatting task so the author can focus on content.
  • It makes the content accessible to algorithms. The content not only follows constraints, it records the constraints that it follows. If you have a collection of recipes marked up like this and you want to make a low calorie cookbook, you can easily query the collection to pull out all recipes with a calorie count under 100.
  • It allows you to implement sophisticated validation and auditing systems to verify the correctness and completeness of your content.
  • It allows you to factor out other constraints. Even when writers are working in a freeform environment, they are expected to follow all kinds of constraints, often laid out in style guides or requirements documents. Structured writing allows you to factor out many of those constraints and to encode others in the structures to help guide authors.
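The query and validation points above can be sketched in a few lines of Python. This is a minimal sketch, assuming the recipes have already been parsed into dictionaries whose keys mirror the field names in the example. The `recipes` list, the `REQUIRED_FIELDS` tuple, and both function names are illustrative assumptions, and the second and third records are invented sample data, not real recipes.

```python
# Hypothetical in-memory form of recipes marked up like the example above.
# Only the first record reflects the article; the others are invented samples.
recipes = [
    {
        "title": "Hard Boiled Egg",
        "ingredients": [("eggs", "12"), ("water", "2qt")],
        "preparation": ["Place eggs in pan and cover with water."],  # steps truncated
        "prep-time": "15 minutes",
        "serves": 6,
        "nutrition": {"calories": 78},
    },
    {
        "title": "Sample High Calorie Dish",  # invented sample record
        "ingredients": [("butter", "1 cup")],
        "preparation": ["Melt butter."],
        "prep-time": "5 minutes",
        "serves": 2,
        "nutrition": {"calories": 400},
    },
    {
        "title": "Sample Incomplete Dish",  # invented record with missing fields
        "ingredients": [("salt", "1 tsp")],
    },
]

# Assumed set of fields that the recipe structure requires.
REQUIRED_FIELDS = ("title", "ingredients", "preparation", "prep-time", "serves", "nutrition")


def low_calorie(recipes, limit=100):
    """Return titles of recipes whose recorded calorie count is under `limit`."""
    return [
        r["title"]
        for r in recipes
        # A recipe with no calorie count is excluded rather than guessed at.
        if r.get("nutrition", {}).get("calories", limit) < limit
    ]


def missing_fields(recipe, required=REQUIRED_FIELDS):
    """Return the required fields that a recipe does not supply."""
    return [name for name in required if name not in recipe]


print(low_calorie(recipes))
print(missing_fields(recipes[2]))
```

Neither function needs to know anything about formatting. They work only because the content records the constraints it follows.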

The composability side of structured writing is, as we noted, simply about making sure that all the bits you create can be put together and published. You can use minimally constrained structured writing systems like Markdown or DocBook to achieve that composability without introducing constraints into the writing process. It makes sense, therefore, that as DITA has popularized the idea of component content management, less constrained rivals have come along to challenge its position.

But here is that rub I have been promising: Constraints are a powerful aid to composability. Here’s why:

The first requirement of composability is simply to get the bits of text to format and print correctly after you put them together. As long as they all use the same markup language, and as long as the bits go together in a way that is valid in that markup language, then this requirement is satisfied. The looser the constraints of the markup language, the easier this requirement is to meet since bits can go together in more ways.
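To make this first requirement concrete, here is a small sketch of two content models over the same vocabulary, one strict and one loose. Both models are hypothetical illustrations. Neither is the real DITA or DocBook schema, and the element names and the `can_nest` helper are assumptions made for the example.

```python
# Each model maps a parent element to the child elements it allows.
# Both are invented for illustration, not taken from any real schema.
STRICT_MODEL = {
    "topic": {"title", "steps"},
    "steps": {"step"},
    "step": {"para"},
}

LOOSE_MODEL = {
    "topic": {"title", "para", "steps", "table", "topic"},
    "steps": {"step", "para"},
    "step": {"para", "table", "steps"},
}


def can_nest(model, parent, child):
    """Return True if `child` is allowed directly inside `parent` in this model."""
    return child in model.get(parent, set())


def allowed_combinations(model):
    """Count the parent/child pairings the model permits."""
    return sum(len(children) for children in model.values())


# A reusable table fragment fits into a topic only under the loose model.
print(can_nest(STRICT_MODEL, "topic", "table"))
print(can_nest(LOOSE_MODEL, "topic", "table"))
print(allowed_combinations(STRICT_MODEL), allowed_combinations(LOOSE_MODEL))
```

The looser model permits more parent and child pairings, so more reused pieces will be valid wherever they are dropped in. That says nothing yet about whether the result reads well.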

But the second requirement of composability is to get the bits of text to work together as a coherent piece of writing after you combine them. This is a much more difficult requirement to meet. In the early days of a system, it may seem easy enough to meet it with human effort, tweaking as required each time the pieces are composed. But the larger the collection grows, and as more bits are being put together in new ways, the harder it becomes. What fits in one place does not sound right in another, or contains information that duplicates what has already been said, or leaves a gap that needs to be filled ad-hoc, or uses different terminology from the surrounding text.

Maintaining a collection of content chunks that can be smoothly combined in different ways to create different publications actually requires fairly strict constraints on what each type of content chunk contains and how it is expressed. Without such constraints, there is no guarantee that the document put together from those chunks is going to read well, that it will be complete and free from repetition or even conflicting information.
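One way to picture such constraints is a rule that every chunk must record which terms it explains and which terms it assumes. The sketch below is an assumption-laden illustration, not a description of any existing system: the `Chunk` class, its `defines` and `requires` fields, and `check_assembly` are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A content chunk that records what it explains and what it assumes."""
    chunk_id: str
    defines: set = field(default_factory=set)   # terms this chunk explains
    requires: set = field(default_factory=set)  # terms it assumes are already explained


def check_assembly(chunks):
    """Read chunks in order and report gaps and repeated explanations."""
    problems = []
    known = set()
    for chunk in chunks:
        # A required term that no earlier chunk explained is a gap.
        for term in sorted(chunk.requires - known):
            problems.append(f"{chunk.chunk_id}: gap, '{term}' not yet explained")
        # A term that an earlier chunk already explained is a repetition.
        for term in sorted(chunk.defines & known):
            problems.append(f"{chunk.chunk_id}: repeats explanation of '{term}'")
        known |= chunk.defines
    return problems


document = [
    Chunk("intro", defines={"widget"}),
    Chunk("install", defines={"widget"}, requires={"widget", "gadget"}),
]
for problem in check_assembly(document):
    print(problem)
```

A check like this is only possible when every chunk is required to supply that information, which is exactly the kind of content constraint a loosely constrained format does not impose.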

Composability on any scale, in other words, requires content constraints as much as it requires a universal format.

And that is why DITA starts with the idea of topics and its basic topic types: task, concept, and reference. They are an attempt to define the basic content constraints that composability requires. By dividing content into these three types, it is hoped, you make sure each topic does its own job and does not conflict with other topics when combined with them.

The problem with this is that generic constraints like these don’t work well for many people’s content. Writers end up chafing at the constraints without benefiting from them. This is the nature of content constraints. Like a pair of shoes, they have to fit well or they are agony. And when they don’t fit well, they don’t achieve their end. Many people complain that the content coming out of their DITA systems is choppy and does not read well. The content constraints are not doing their job.

Of course, DITA’s basic topic types were not necessarily intended to be used out of the box. They were intended as a basis for specialization, DITA’s process for defining more precise topic types as “specialized” versions of the base types. In theory, at least, specialization should enable you to create content constraints that fit better and therefore do a better job.

I have made no bones about the fact that I am not a fan of the base topic types. They are based on a theory of information design that I find to be naive. I am not a fan of the specialization mechanism either. But they exist to play a vital role. DITA’s main application is as an enabler of component content management. Component content management requires composability. And manageable composability requires content constraints. It requires a constraint mechanism, and topics and specialization are the constraint mechanism that DITA provides.

Many DITA practitioners avoid specialization like the plague. I don’t know if this is because they find the constraints of the base topic types sufficient, if they don’t understand the role of content constraints in composability, or if they just don’t believe in customizing systems in the way that specialization requires. In any case, I suspect that many unspecialized DITA applications are not really taking great advantage of these basic content constraints, and that their users may therefore be open to the blandishments of less constrained systems.

For my part, I seek a third way. First and foremost, I look to structured writing as a means of improving content quality and making authoring easier and more effective through appropriate constraints that fit the author’s task like an old and well-loved pair of hiking boots.

In the realm of content management, I see that many of the tasks we now do by hand, juggling generic bits of content, could be automated if we had appropriately constrained content that algorithms could understand well enough to manage. In particular, I think we can factor out many of the structures that we currently have to create and manage in content or in a CCMS, making it possible to do sophisticated content management without authors needing to interact with a content management system in any substantial way.
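As a small illustration of factoring out a management structure, consider the `wine-match` and `beverage-match` fields in the recipe above. Instead of asking the author to tag text as conditional for one publication or another, an algorithm can pick the right field for each output. This is only a sketch: the publication names and the `PAIRING_FIELD` table are hypothetical, and the recipe is assumed to be already parsed into a dictionary.

```python
# The recipe fields come from the example above; the dictionary form is assumed.
recipe = {
    "title": "Hard Boiled Egg",
    "wine-match": "champagne and orange juice",
    "beverage-match": "orange juice",
}

# Hypothetical publication rules: which subject field supplies the drink pairing.
PAIRING_FIELD = {
    "gourmet-magazine": "wine-match",
    "family-cookbook": "beverage-match",
}


def pairing_line(recipe, publication):
    """Build the drink suggestion for a publication from the recipe's own data."""
    field_name = PAIRING_FIELD[publication]
    return f"Serve with {recipe[field_name]}."


print(pairing_line(recipe, "gourmet-magazine"))
print(pairing_line(recipe, "family-cookbook"))
```

The author records facts about the recipe and never touches a conditional-text mechanism. The decision about which fact appears where lives in the publishing rules.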

And as I have discussed before, I believe that Every Page is Page One information design and hypertext information delivery greatly reduce the need for large-scale content management.

For all of these reasons, I continue to work on the SPFE and SAM projects.

I applaud DITA for understanding that content constraints are essential to composability and therefore to component content management. But DITA exists at what may be the point of maximum complexity on the curve. It is vulnerable on the one side to systems like Paligo that are willing to minimize content constraints for greater ease of use. And it does not seem well poised to take people in the other direction, towards greater use of content constraints to factor out the management tasks and restore simplicity (albeit a different kind of simplicity) to the authoring process.

If DITA cannot find a way to make content constraints a more attractive proposition, and easier to understand and to implement, it seems vulnerable to having its share of the market chipped away by systems like Paligo that offer composability and component management without significant content constraints.

13 thoughts on “DocBook resurgent: what it tells us about structured writing and component content management”

  1. Barry Schaeffer

    Great post Mark; and highly timely. I have always thought that DITA, for all its hype, was a surrender to the challenges of structured content: make everything essentially horizontal, then build a map to create the structures you need. While that can work, it involves a level of complexity that grows almost exponentially as the granularity and nuance of the content grows.

    Having grown up in the structured content world of the 60s and 70s, I have always believed that if content is structured, then build the structure into the content model (DTD and Schema) and manage it at a level commensurate with its intended purpose.

    Let’s hope that this philosophy will at least gain parity with the DITA world, providing content managers a choice of approaches.

    (-:

    1. Mark Baker Post author

      Thanks for the comment, Barry.

      I think that is a great analysis of the DITA model. Essentially, it avoids the real content modeling exercise, which is the model of the whole. It models a set of more or less generic pieces and lets you put them together however you like. But as you say, the process is complex, and because there is no model for the whole, there is no guidance or validation for it.

      Certainly it can be argued that if you make the parts smaller they are more likely to become generic, or at least to have a clear relationship to a generic metamodel. But that largely obviates the point of structured writing, which is to impose particular constraints for a particular purpose.

      And if the whole is not constrained, many people are not going to see any virtue in constraining the pieces. A partially constrained model is much less obviously useful, and much less easy to understand, than either a fully constrained or fully unconstrained model.

      I understand the appeal of a metamodel for many people. Metamodels can be wonderful when they really reflect the nature of their subject area. But it is all too easy to come up with a naive metamodel that is not a real reflection of its field. Sometimes individual models that serve individual purposes are all you can devise, and making them fit a naive metamodel just breaks everything.

      Anyway, the book I am writing will beat the drum for that constraint-based model. I hope it makes a difference in getting it closer to parity.

  2. Diego Schiavon

    Go DocBook!

    I used to write in DocBook on my first job, and I ended up loving it. I still miss it at times, and I am glad to know it is making a comeback with Paligo (also, I did not know Paligo was based on DocBook).

    Many people love/hate DITA and the topic types. I think of DITA as a sort of avant-garde, a battering ram to break writers’ bad habits and force them to think in topics. But unless you do lots of specialization, and you do it well, it is a blunt tool.

    DITA topic definitions had to be simpler than Information Mapping, because it had to be accessible by writers with different backgrounds. So it ended up being IMAP’s ugly stepsister.

    It had to have a simpler syntax than DocBook, because DocBook’s is indeed too large. So it feels dumbed down compared to DocBook (although I hear DITA syntax has now grown considerably).

    It had to look a bit like HTML to come across as familiar after all, but it feels too nerdy compared to HTML.

    But all in all, DITA has done the job really well. It has helped important ideas to break through: separation of content and presentation, structured authoring, topic-based writing.

    Even though we do not use DITA in my company, I have a copy of DITA Best Practices on my desk for inspiration.

    And now it is time for other, more refined tools to take over from where DITA took us.

    1. Mark Baker Post author

      Thanks for the comment, Diego.

      That seems like a pretty fair analysis. And I agree that DITA has done yeoman service in driving structured writing and content management techniques into the mainstream.

      People like Barry and me who were there before DITA may well lament the loss of the full constraint-based models that we were used to, but we would also have to admit that DITA broke through in a way the stuff we were doing never did. DITA deserves every credit for finding a model capable of gaining wider acceptance. What we can hope for is that DITA paves the way for a simpler and more constrained approach to come back.

      DITA has in fact achieved two breakthroughs. It has pushed both constraint and composability into the mainstream. It has, I feel, been much more successful in pushing composability than constraints, however, which naturally opens the door for less constrained systems.

      So where do we go from here? On which side will we find the more refined tools to take over from where DITA took us? (And we should note that it is almost always the fate of breakthrough technologies to be supplanted by more refined successors, designed by people who are in a position to cherry pick the best of the breakthrough ideas.) Paligo seems to be an attempt to develop a more refined approach to composability.

      Where is the more refined approach to constraint going to come from? Or the more refined approach to the integration of the two?

  3. Richard Hamilton

    Interesting article. Regarding size and complexity, DocBook is larger than base DITA (if you’re counting number of elements: 382 vs 182). If you include the DITA Technical Content and Learning & Training, you get 251 and 180 more, respectively, for a total of 613. Depending on your perspective, you could say that DITA still has 182 elements, since the TC and L&T elements are all specializations of base elements, but you could also say that at least some of those TC elements have equivalents in DocBook and should be counted separately. At that point, though, we’re probably arguing angels on the head of a pin, and I’ll agree that DocBook is larger than DITA:-).

    DITA is more complex. It has capabilities that DocBook does not have (keys, conrefs, and specialization, to name a few), and many of them are quite complex. And the specification is much larger, despite there being fewer elements.

    I think that most DITA experts would argue that the complexity is primarily an issue for implementers, and that, properly implemented, DITA should be as easy for writers to use as any XML Schema, if not easier. I have sympathy for that point of view, but I suspect it is rarely achieved.

    One place where I think it has been achieved, at least on a small scale, is Don Day’s expeDITA framework, which gives writers a forms-based interface to a DITA-based wiki. We used expeDITA for a recent project (The Language of Technical Communication (http://tlotc.com)) where we received contributions from 52 different people, most of whom not only hadn’t used DITA before, but also didn’t even know that DITA was behind the scenes of the wiki. I think Don has done a great job of using the strengths of DITA to provide a useful back end, while keeping the user experience reasonably clean and simple. I’m just not sure how many other implementations work as well.

  4. Mark Baker Post author

    Thanks for the comment, Richard, and thanks for the numbers!

    One of the features of DITA is that it contains so many what I would call “management domain” structures. They do not exist to represent content but to implement component content management features. That definitely increases the complexity because it combines two different sets of concerns in one language, and because component content management is inherently complex to begin with.

    The issue with implementation is an interesting one. It is possible to see both DITA and DocBook as languages for writers to write in directly. It is also possible to see them as underlying technologies that implementers can use to build custom authoring systems. This, clearly, is how Paligo is using DocBook. And as a contributor to The Language of Technical Communication, I can affirm that authors would not in any way know that they were writing in DITA. They were writing in a custom structure developed for a particular purpose. Clearly that structure could also have used DocBook under the surface and it would have made no difference to the writers. It may well have made a difference to the implementors, but that is an entirely separate question.

    Or rather, whether it is an entirely separate question depends on how deep the implementor buries DITA (or DocBook). If the implementor’s role is to modify some XSLT stylesheets, then the writer is still writing in DITA or DocBook and the specific structures of DITA or DocBook are still relevant to the writer and the writer’s experience. If the implementor creates a model that is completely customized to a particular writing requirement, then the author is using that model, not DITA or DocBook, and they don’t care what is under the covers.

    The question then becomes, who is best placed to design the model? If designing the model is a complex business that requires a lot of time and programming talent, then that job is outside of the writer’s role. Yet it is hard to see how you get a model to fit like an old pair of shoes if the writer is not involved in its development.

    So I think the search for the more refined tool that Diego talks about may be about finding ways to let the writer be involved in the development of the model, or, ideally, be able to develop it themselves whenever they need a new one.

  5. Anders Svensson

    Hi Mark,

    Great article and analysis. And you hit the nail on the head on many of the things we’re trying to do with Paligo.

    I agree DITA has done a lot to pave the way, but after years of consulting around it we found that, as you point out, it was just too restrictive (especially topic types) and also too complex (keyrefs, conkeyrefs, etc., which customers never understood how to use). Some of this, I would say, ironically comes from the fact that DITA was made to make advanced features available to people who are not programmers, by building those features into the content model itself.

    But that is also where it becomes too complex because the burden instead then falls on the user in their day to day work. We felt that it was the responsibility of an actual system to take care of those complex features, hiding its complexity from the user. Building on DocBook made it easier to let the system take care of such things, and let the content model be just that. It’s an ongoing process to try to make the benefits of structured authoring available, and at the same time make it feel easy to do and not too restrictive, but we’re working on it 🙂

    Regards,

    Anders

    1. Mark Baker Post author

      Thanks for the comment, Anders.

      I agree entirely about the keyrefs, conkeyrefs, etc. (what I call management domain structures). They make the barrier to participation very high and they are a burden on the attention even of the experienced writer who has been trained to use them.

      Still, we have a problem if we are going to do component content management. We need some way to manage content at a very granular level. There are several approaches:

      • The DITA approach with embedded management domain structures.
      • The general buried markup approach, which would seem to cover Paligo (I have not looked at it in detail), where these functions are transferred from the markup to the system. This does not mean that the writer does not have to perform these functions, only that they perform them using a different mechanism, one that is presumably easier to understand and use.
      • Ad hoc burying of the markup for specific purposes, as in the system that Don Day built and that Richard describes in his reply.
      • Reduce the need for component content management by adopting a loosely coupled information architecture. This is the primary approach I recommend: the Every Page is Page One bottom-up architecture.
      • Factor out the management domain by using what I call subject domain markup (there is an example of this here: http://techwhirl.com/single-sourcing-algorithm/). This means that the writer is no longer performing these functions at all. This has the additional virtue that the writer already understands the subject matter, so they don’t have to learn a system as well as the subject matter. It is limited in that it cannot do ad hoc forms of management that can’t be derived algorithmically from the subject domain, but I think that is often a more than acceptable trade-off.

      However one chooses to bury the management domain markup, though, whether it is behind a system like Paligo, or subject domain markup, or an ad hoc system like the one Richard describes, the effect is the same: the writer sees a new model. And what really matters in the end is the model the writer sees. What is underneath ceases to matter once it is buried in the system.

      1. Don Day

        So true, Mark. In a fully general system, only templates can convey the sense of preferred organization for “things of a kind.” This works, but users can revise the intent. For many situations, that freedom of revision is a good thing (to wit, general word processors). The constrained interface that Richard and I used for his book series fully enforces a simple but immutable structure for a specific purpose, and I am glad that you were unaware of the DITA underpinnings. This particular data model, though, could just as well have been stored as HTML in an SQL record (and was, in Richard’s previous, Confluence-based system), and it is exported as DocBook to interface with Richard’s existing book production process.

        Corporate information and marketing content are other good fits for the type of semi-structured writing that web-based writing and publishing systems could enable (in contrast to writing that necessarily encodes deep semantic data structures). Unfortunately most web-based CMS systems are as moribund by the rigor of SQL data tables as XML systems are by their schemas. The challenge is for these web-based “intelligent content” systems to appear agnostic about the details of implementation that supports this content. Richard and I will have much more to say about why DITA was natural for his particular use when we present our experiences at Lavacon 2016 in Las Vegas in October. And I’ll also say more about the future of expeDITA, which now has been outed in a real application (“used in anger” as the British say about new weapons systems).

  6. Mark Giffin

    Thanks for the interesting discussion, Mark. It’s useful. At the beginning you mention Tom Johnson’s “sponsored post” about Paligo. What does this “sponsored” mean? Was Tom Johnson paid to write or post the article about Paligo? If so, I don’t see this mentioned on the post.

    1. Mark Baker Post author

      Thanks for the comment, Mark. At the end of the main post, just above the Interface Tour, Tom writes “Note: This post was sponsored by Paligo, which is one of the advertisers on this site.”

      1. Anders Svensson

        Hi Mark(s) 🙂

        I can’t speak for Tom, of course, but I suppose the post is sponsored in the sense that we are advertisers on his site and pay for that advertisement. He got a trial to test run Paligo and offered to interview us. But no, Paligo didn’t pay Tom specifically to write this article, although I’m happy he did and I was glad to get the chance to tell our story to him.

        Regards,
        Anders

  7. Denis Bradford

    I’ve been out of the loop for a while, but while I was in it I took a slightly more cynical view of this topic: DocBook has fallen victim to its own success, and DITA is a solution looking for a problem.

    What I mean is that DocBook works too well. It’s too transparent and easy to implement to suit a lot of people. It didn’t require an army of specialists (well, maybe one in-house XSL hacker who knew where to find Bob Stayton), so how were vendors to make money? And what were doc managers supposed to do if they couldn’t have grandiose, endless retooling projects like the ones I suffered under at IBM (hi Barry, hi Don)?

    Bottom line: In my career in software documentation I never saw a DITA specialization implemented that amounted to anything, and no doc set that couldn’t have been supported more easily in DocBook, for a hell of a lot less money and angst.