Why we need constrainable lightweight markup languages

By Mark Baker | 2016/06/05

This post is in response to a Twitter conversation that started with:


This led to a discussion about extensibility and constraints in markup languages. Markdown is a pretty simple lightweight markup language (and popular for that reason) but it is what it is. You can’t add anything to it except by creating your own variant by modifying its code (which lots of people have done).

ASCIIDoc and reStructuredText are two slightly more complex and slightly more capable lightweight markup languages. Both of them are extensible, to a degree. If you want something more than they offer, there is a standard way to add it. Or at least to add certain kinds of things. A lot of people use them to write technical documentation, and regard them as more suitable than Markdown for this purpose due to their additional features and extensibility (which is the point of Eric Holscher’s post that Stefan Gentz tweeted about). On the other hand, there are a lot of people, like Tom Johnson, who think MarkDown is as good as anything for tech docs.

And then there are people like me who argue that ASCIIDoc and reStructuredText are not good enough either because while they are extensible, they are not constrainable. Which led to this:

So here is that blog post.

Why constraints?

So, what are constraints and why do we need them? A constraint is simply a limitation, a rule you have to obey. Writers are familiar with constraints. Your company style guide is a collection of constraints. Everything you write is supposed to conform to all the constraints in the style guide.

The problem with style guides is that they tend to be the dumping ground of every complaint and dispute resolution in the life of a department and they quickly grow to be longer than some of the documents they govern. There is no way that most writers can remember all of the constraints they dictate and apply them all correctly to every sentence they write. Only hyper-diligent editing will catch all the style guide errors writers inevitably make, and no one has the time or money for that anymore.

In structured writing (and every use of a markup language, including Markdown, is a case of structured writing) we can use structure to express and enforce constraints. In fact, every markup language is essentially a set of constraints. You can only do what the markup language lets you do, only express the structures that it lets you express.

Right away, this helps with your style guide issues. If you use a markup language like Markdown or ASCIIDoc to create your titles, lists, and code blocks, they will be formatted automatically by a stylesheet, and you can rip the section on formatting them out of your style guide. Writers no longer have to remember those constraints, and you no longer get these structures formatted incorrectly in your docs.

The problem with a small language like Markdown, though, is that it is too constrained for some projects. It does not give you a way to do all the things you want to do. So some people turn to richer languages like reStructuredText that are less constrained and allow you to do more. And if they still find those languages too constrained, they can use the extensibility features to add still more.

The problem is, as you add more stuff, either by choosing a richer language or by extending one, you give writers more options. And then of course you have to tell the writers which choices to make when, and your style guide starts to grow again, and as it gets larger it gets harder to remember it all, and writers start to make mistakes again, and your output becomes inconsistent again.

If people start getting enthusiastic about structured writing they pretty soon want to spread it across the entire organization. And then they decide the best way to do that is to have one markup language for everyone in the company. That means a language that can do everything that everyone wants to do, and that means most of your constraints go out the window. Your style guide starts blowing up again, inconsistencies abound, and before too long everyone is complaining and saying they should be going back to WYSIWYG tools.

What gets missed in the enthusiasm is that what makes structured writing and markup languages great is their constraints. It is constraints that make life simpler for writers, constraints that make information more complete and accurate, constraints that make presentation and formatting more consistent. Yet it is the constraints that are jettisoned as more and more people come on board.

There is not one generic set of constraints that works for everyone, that covers all subjects, all organizations, and all readers. If there were, we would have discovered and implemented it long ago. The constraints that make content better, authoring easier, and output prettier are different from case to case.

What we need, therefore, is not only the ability to create your own extensions, but the ability to create your own constraints. Not only the ability to introduce new things that writers can do, but the ability to introduce new rules about what they cannot do.

This is one big reason why you want languages that are constrainable as well as extensible. But it’s not the only reason.

The other virtue of constraints is that they enable algorithms (computer programs) to process your content. Every markup language has at least one program to process it and turn it into output (at minimum, HTML). Those programs work because they know the constraints of the language. They know all the structures that are allowed to exist in the content, and all the combinations they are allowed to exist in, and they know how to format each of them.

As languages grow larger and become less constrained, there are more structures and more permutations of structures that can occur, and the publishing programs have to grow larger to handle all those permutations. In some cases, the growth of the language and its permutations outstrips the growth of the publishing program, and there are legal structures in the language that the publishing program cannot handle correctly. This requires more entries in the style guide to tell writers not to use these structures. (The entries are constraints, of course, but they are not being enforced by the markup language.)

If you want efficient, reliable processing programs that are easy to create and maintain, you need to have well-constrained markup languages.

And with the right constraints in place, there are far more things that you can do with algorithms than simply apply formatting to text. A common example is API documentation languages like JavaDoc and Doxygen. These are lightweight markup languages that are constrained very specifically for describing the properties of API functions and they are usually written as part of the API code itself. The algorithms that process them not only read the JavaDoc or Doxygen source, they also read the program code itself to pull out information on function calls, parameters, and return values. Not only do these tools combine these two sources of information into one reference entry, they also validate the written content to make sure it conforms with the actual function definitions in the code.
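The kind of cross-check described above can be illustrated with a minimal sketch in Python. It is not how JavaDoc or Doxygen are implemented; it only shows the principle of validating written documentation against the code it describes. The `:param name:` docstring convention, the `check_documented_params` helper, and the `resize` function are all assumptions made up for this illustration.

```python
import inspect
import re

# Assumed doc convention for this sketch: parameters are documented
# in the docstring as ":param name: description".
PARAM_TAG = re.compile(r":param\s+(\w+):")


def check_documented_params(func):
    """Compare the parameters named in a docstring with the real signature."""
    documented = set(PARAM_TAG.findall(inspect.getdoc(func) or ""))
    actual = set(inspect.signature(func).parameters)
    return {
        # Parameters the code has but the docs never mention.
        "undocumented": sorted(actual - documented),
        # Parameters the docs mention but the code does not have.
        "unknown": sorted(documented - actual),
    }


# Hypothetical function with a misspelled and a missing parameter entry.
def resize(width, height, keep_ratio=True):
    """Resize an image.

    :param width: target width in pixels
    :param hieght: target height in pixels
    """


print(check_documented_params(resize))
```

Here the check reports `height` and `keep_ratio` as undocumented and `hieght` as unknown. It can do so only because the documentation format is constrained enough for a program to find the parameter names in it.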

By combining well-constrained markup languages containing the right structures for the job with the right set of algorithms, you can generate all kinds of content and do all kinds of useful validation. In one project, for instance, I created a constrained language for writing programmers’ guides that automatically linked every mention of an API routine to the API reference and verified every API mentioned against the authoritative list of APIs in the reference. This process caught some significant errors in the previous programmers’ guides, including the use of deprecated APIs, misspelled API calls, and legitimate API calls that were missing from the API reference. These were serious and longstanding errors that had been missed in multiple rounds of review over multiple releases.
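A validation pass of that kind might look roughly like the following Python sketch. It is not the code from the project mentioned above. The `{name}(api)` markup convention, the contents of `API_REFERENCE`, and the function name are hypothetical, chosen only to show how explicitly marked-up API mentions can be checked against an authoritative list.

```python
import re

# Hypothetical authoritative list of API routines, as might be
# extracted from an API reference.
API_REFERENCE = {
    "open_device": {"deprecated": False},
    "read_block": {"deprecated": False},
    "reset_bus": {"deprecated": True},
}

# Assumed markup convention for this sketch: an API mention is
# written as {name}(api) in the source text.
API_MENTION = re.compile(r"\{(\w+)\}\(api\)")


def check_api_mentions(text):
    """Return a list of (name, problem) pairs for API mentions in text."""
    problems = []
    for name in API_MENTION.findall(text):
        entry = API_REFERENCE.get(name)
        if entry is None:
            # Either a misspelling or an API missing from the reference.
            problems.append((name, "not in API reference"))
        elif entry["deprecated"]:
            problems.append((name, "deprecated"))
    return problems


guide = (
    "Call {open_device}(api) first, then {read_blok}(api). "
    "Use {reset_bus}(api) if the read fails."
)
for name, problem in check_api_mentions(guide):
    print(f"{name}: {problem}")
```

The sketch flags `read_blok` as absent from the reference and `reset_bus` as deprecated. The check is possible only because the writer is required to mark every API mention explicitly, which is itself a constraint.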

Constraints, then, can address far bigger issues than correct list formatting. They can govern issues like what pieces of information are required for a particular type of topic or how precisely it is supposed to be expressed. With the best will in the world, writers tend to be forgetful and inconsistent about these things. (Writers are always astonished at how inconsistent they have been in the past when they apply an appropriately constrained markup language to their old content.)

How do I get constraints?

So where do you find a markup language that lets you add constraints? The main answer to that is XML. But XML is not a lightweight markup language. In fact, XML is the heavyweight markup language. There really aren’t any others of significance today.

One of the key defining characteristics of a lightweight markup language is that you can comfortably write a document in it using a plain text editor. This is how people write in Markdown, ASCIIDoc, reStructuredText, JavaDoc, and Doxygen. But writing that way in XML is next to impossible. You really need a sophisticated XML editor to write comfortably in XML, and even then, there are problems that make the experience frustrating.

What makes XML constrainable is its combination of generic abstract syntax and support for schema languages. Schema languages are languages for expressing constraints. They state what is allowed and what is not allowed. In XML terms, a document is valid if it meets all the constraints specified in the appropriate schema. A program called a validator is used to validate a document against its schema. Validators are commonly built into editors to guide writers as they are writing.
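To illustrate the idea of validating a document against a set of declared constraints, here is a toy sketch in Python using only the standard library. It is not a real schema language such as DTD, XML Schema, or RELAX NG, which are far richer. The `SCHEMA` dictionary format, the `validate` function, and the recipe vocabulary are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy "schema": for each element, the child elements it
# must contain and the ones it may contain. Elements not listed here
# are left unconstrained in this sketch.
SCHEMA = {
    "recipe": {"required": ["title", "ingredients", "steps"], "optional": ["intro"]},
    "ingredients": {"required": ["item"], "optional": []},
    "steps": {"required": ["step"], "optional": []},
}


def validate(element, errors=None):
    """Collect constraint violations for element and its descendants."""
    if errors is None:
        errors = []
    rules = SCHEMA.get(element.tag)
    if rules is not None:
        children = [child.tag for child in element]
        allowed = set(rules["required"]) | set(rules["optional"])
        for tag in rules["required"]:
            if tag not in children:
                errors.append(f"<{element.tag}> is missing required <{tag}>")
        for tag in children:
            if tag not in allowed:
                errors.append(f"<{tag}> is not allowed in <{element.tag}>")
    for child in element:
        validate(child, errors)
    return errors


doc = ET.fromstring(
    "<recipe><title>Toast</title>"
    "<ingredients><item>bread</item></ingredients>"
    "<sidebar>History of toast</sidebar></recipe>"
)
for error in validate(doc):
    print(error)
```

For this document the sketch reports that `<recipe>` is missing its required `<steps>` and that `<sidebar>` is not allowed in `<recipe>`. The second error is the important one for this discussion: the schema says what writers may not do as well as what they must do.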

In theory, at least, if an XML document is valid, any program that can process a document that is valid by that schema should be able to produce the intended result without error, enabling you to have a fully automated and fully reliable publishing process.

It does not always work out that way in practice, because people tend to take the same approach to XML as they do to other markup languages: they try to create one language for everybody and everything in the organization, which means a language with few constraints and too many permutations of structure for any program to handle with complete reliability. The result is style guide bloat, no support for content quality, and inconsistent output.

The need for a lightweight markup language with constraints

The main reason that XML is such a heavyweight markup language is that it attempts to be perfectly general. You can create an XML markup language for any data structure you can imagine. This leaves it with a verbose syntax that is hard to read and write.

Lightweight markup languages give up generality in favor of simple clear syntax for particular types of data structures — mostly documents. This is what makes it possible to read and write them comfortably in plain text.

But even within this much more limited scope, there is still a substantial case to be made for the ability to add constraints — something that is only reinforced by the existence of highly constrained lightweight languages like JavaDoc. However, JavaDoc’s constraints are baked into its code. It does not provide a mechanism for writers or information architects to create new constraints suitable to their individual projects.

But as far as I have been able to discover, there is currently no extensible and constrainable lightweight markup language. That is why I have started building one myself. It is called SAM, and I am currently using it to write my series on structured writing on TechWhirl. (This series will be turned into a book from XML Press.) Many of the examples in the series are also written in SAM.

SAM is a work in progress, and the schema support is not implemented yet, but if you are interested, you can check out the project on GitHub.



About Mark Baker

I am an aspiring novelist and former technical writer and content strategist. On the technical side, I am the author of Every Page is Page One: Topic-based Writing for Technical Communication and the Web and Structured Writing: Rhetoric and Process. I blog at everypageispageone.com and tweet as @mbakeranalecta.

7 thoughts on “Why we need constrainable lightweight markup languages”

  1. Tom Johnson

    Mark, it’s impressive that you’re creating your own lightweight markup language, one that fits well into your SPFE architecture. But will you get enough traction with the language to make it appealing for widespread adoption? Will it only work with SPFE? If so, aren’t you asking authors to make a pretty big step of confidence into your tooling world?

    Maybe SAM will take off and become popular, and maybe tool vendors will start to support it. But I think it’s hard to create that kind of momentum. This is why I tend to piggyback onto existing platforms and work within their “constraints.”

    Re Markdown, I wouldn’t consider my position to be “MarkDown is as good as anything for tech docs.” I would say that although many tech writers think you need XML to do more robust, sophisticated authoring, you can actually do robust, sophisticated authoring when you combine Markdown with Liquid, HTML, and Jekyll. But I do recognize and admit that Markdown with its various flavors can be frustrating.

    I would agree that if you just stuck with authoring in Markdown alone, it’s too constraining/limiting. But there’s an easy way to extend Markdown through include templates in Jekyll. I currently do this for images, notes, and other features. But links are still the weak point. (Maybe I’ll elaborate briefly on this in a post.)

    At any rate, I agree about finding the right balance between constraints and options, and what you say about XML being too general when the audience is too large seems spot on. I glanced through some of the SAM files in your github project to get a sense of the syntax, constraints, and other features you’re writing about.

    My initial, quick gut reaction is that I’m in awe of your ambitiousness and the thoroughness with which you’ve approached the problem, but I’m wondering if the syntax is going to be too complicated for mainstream adoption (though maybe that’s not a concern). Markdown’s popularity is mainly due to its simplicity and the fact that so many platforms parse it and transform it in cool ways.

  2. Mark Baker Post author

    Thanks for the comment Tom.

    Will SAM get traction? I don’t know. All the ones that did get traction started with the same question. It is the same problem every new proposal faces, and, of course, most proposals don’t get much traction.

    I think new systems arise in one of two ways. One is that an industry consortium is formed to address a shared problem. The other is that one person just gets sick of the available tools and decides they would rather invent something they like better. Markdown certainly is in the latter category, just as XML is in the former.

    The industry-sponsored initiatives have the advantage that they have a substantial built-in audience and the disadvantage that they are the product of a committee and thus usually overly complex and difficult. The products of individuals have the advantage of lightness and fitness for purpose, and the disadvantage of no built-in audience.

    So I recognize both SPFE and SAM as labors of love — or perhaps labours of frustration. If they catch on, wonderful. If they don’t, I can live with that too. I enjoyed creating them, and learned a lot from the process. My book and series on structured writing would not be possible without what I learned from developing both of them. And when they are done, I will be happier working with them than with other tools.

    If others find the same pleasure and utility in them, that’s icing on the cake.

    The syntax is one of the most interesting design issues of SAM. It is more complex than Markdown, but considerably less complex than ASCIIDoc or reStructuredText (at least to my eye). More specifically, it relies less on the use of non-standard punctuation, which I find distracting in those languages. The added complexity is because it needs named blocks and annotations, which Markdown does not have. I tried to design the syntax, and its shortcuts, to feel natural and intuitive, but that is difficult to judge. The formal specification, though, requires a lot of detail to spell out what all the shortcuts do. It is really hard to judge ease of use for something you have invented yourself. What works for me may work for millions or for no one else. In particular, the use of indentation to define structure (a la Python) is something I find really intuitive but others may balk at. Time alone will tell.

    Thanks for nuancing my statement of your views on Markdown. I knew I was probably painting that with too broad a brush, but trying to nuance it myself seemed like a sidetrack. The beauty of blogging is that we get to nuance these things in the comments.

    It is the combination of pieces that is really interesting about your solution. Lots of systems (like DITA) end up embedding what are essentially programming languages in their markup (not Turing complete, by any means, but imperative nonetheless, and capable of interesting and problematic side effects). Your approach is much more up front about doing that, and I suspect it is the more robust for it.

    My preference, though, is for purely declarative content, or as close as I can get to it. This means a more complex tool chain (or at least one with more steps) and more case-specific markup languages, but the gain in ease of authoring, audit capability, and future proofing is worth it in my view.

    1. Mark Baker Post author

      Thanks for writing this up, Tom. For comparison, here is what your include example would look like in SAM.

                  $stuff=ice cream 
                  Here is some >($stuff).
                 $stuff = special text

      ~~~(#some) is a fragment definition. The fragment defines a variable $stuff and uses it in the text of the fragment.

      (~some) inserts the fragment. By itself, this would insert “Here is some ice cream”, but with the redefinition of $stuff, it outputs “Here is some special text”.

      This mechanism is not tied to a file insert instruction. Insertion of files is treated separately. But the fragment could be defined in a separate file and brought in by inclusion.

      More generally, what you are doing, essentially, is reproducing the source/template structure that we find in all structured writing. It is formalized in the XML world in the XML/XSLT pairing.

      The crucial question in the design is what goes in the source and what goes in the template. The more complexity you can factor out of the source into the template, the easier authoring becomes.

      The key to making it easier to use is to factor it so that the source is declarative and the template is imperative. That is, all the commands are in the templates; all the declarations about the text are in the source. Thus you have a purely declarative source that is easy for people to write.

      This business of factoring things out of the source and into the template is the essence of what I am describing in my series on structured writing on TechWhirl. http://techwhirl.com/series/structured-writing/

