This post is in response to a Twitter conversation that started with:
— Stefan Gentz (@stefangentz) June 3, 2016
This led to a discussion about extensibility and constraints in markup languages. Markdown is a pretty simple lightweight markup language (and popular for that reason) but it is what it is. You can’t add anything to it except by creating your own variant by modifying its code (which lots of people have done).
ASCIIDoc and reStructuredText are two slightly more complex and slightly more able lightweight markup languages. Both of them are extensible, to a degree. If you want something more than they offer, there is a standard way to add it. Or at least to add certain kinds of things. A lot of people use them to write technical documentation, and regard them as more suitable than Markdown for this purpose due to their additional features and extensibility (which is the point of Eric Holscher’s post that Stefan Gentz tweeted about). On the other hand, there are a lot of people, like Tom Johnson, who think Markdown is as good as anything for tech docs.
And then there are people like me who argue that ASCIIDoc and reStructuredText are not good enough either because while they are extensible, they are not constrainable. Which led to this:
— Eric Holscher (@ericholscher) June 3, 2016
So here is that blog post.
So, what are constraints and why do we need them? A constraint is simply a limitation, a rule you have to obey. Writers are familiar with constraints. Your company style guide is a collection of constraints. Everything you write is supposed to conform to all the constraints in the style guide.
The problem with style guides is that they tend to be the dumping ground of every complaint and dispute resolution in the life of a department and they quickly grow to be longer than some of the documents they govern. There is no way that most writers can remember all of the constraints they dictate and apply them all correctly to every sentence they write. Only hyper-diligent editing will catch all the style guide errors writers inevitably make, and no one has the time or money for that anymore.
In structured writing (and every use of a markup language, including Markdown, is a case of structured writing) we can use structure to express and enforce constraints. In fact, every markup language is essentially a set of constraints. You can only do what the markup language lets you do, only express the structures that it lets you express.
Right away, this helps with your style guide issues. If you use a markup language like Markdown or ASCIIDoc to create your titles, lists, and codeblocks, they will be formatted automatically by a stylesheet, and you can rip the section on formatting them out of your style guide. Writers no longer have to remember those constraints, and you no longer get these structures formatted incorrectly in your docs.
The problem with a small language like Markdown, though, is that it is too constrained for some projects. It does not give you a way to do all the things you want to do. So some people turn to richer languages like reStructuredText that are less constrained and allow you to do more. And if they still find those languages too constrained, they can use the extensibility features to add still more.
The problem is, as you add more stuff, either by choosing a richer language or by extending one, you give writers more options. And then of course you have to tell the writers which choices to make when, and your style guide starts to grow again, and as it gets larger it gets harder to remember it all, and writers start to make mistakes again, and your output becomes inconsistent again.
If people start getting enthusiastic about structured writing they pretty soon want to spread it across the entire organization. And then they decide the best way to do that is to have one markup language for everyone in the company. That means a language that can do everything that everyone wants to do, and that means most of your constraints go out the window. Your style guide starts blowing up again, inconsistencies abound, and before too long everyone is complaining and saying they should be going back to WYSIWYG tools.
What gets missed in the enthusiasm is that what makes structured writing and markup languages great is their constraints. It is constraints that make life simpler for writers, constraints that make information more complete and accurate, constraints that make presentation and formatting more consistent. Yet it is the constraints that are jettisoned as more and more people come on board.
There is not one generic set of constraints that works for everyone, that covers all subjects, all organizations, and all readers. If there were, we would have discovered and implemented it long ago. The constraints that make content better, authoring easier, and output prettier are different from case to case.
What we need, therefore, is not only the ability to create your own extensions, but the ability to create your own constraints. Not only the ability to introduce new things that writers can do, but the ability to introduce new rules about what they cannot do.
This is one big reason why you want languages that are constrainable as well as extensible. But it’s not the only reason.
The other virtue of constraints is that they enable algorithms (computer programs) to process your content. Every markup language has at least one program to process it and turn it into output (at minimum, HTML). Those programs work because they know the constraints of the language. They know all the structures that are allowed to exist in the content, and all the combinations they are allowed to exist in, and they know how to format each of them.
As languages grow larger and become less constrained, there are more structures and more permutations of structures that can occur, and the publishing programs have to grow larger to handle all those permutations. In some cases, the growth of the language and its permutations outstrips the growth of the publishing program, and there are legal structures in the language that the publishing program cannot handle correctly. This requires more entries in the style guide to tell writers not to use these structures. (The entries are constraints, of course, but they are not being enforced by the markup language.)
If you want efficient reliable processing programs that are easy to create and maintain, you need to have well constrained markup languages.
And with the right constraints in place, there are far more things that you can do with algorithms than simply apply formatting to text. A common example is API documentation languages like JavaDoc and Doxygen. These are lightweight markup languages that are constrained very specifically for describing the properties of API functions and they are usually written as part of the API code itself. The algorithms that process them not only read the JavaDoc or Doxygen source, they also read the program code itself to pull out information on function calls, parameters, and return values. Not only do these tools combine these two sources of information into one reference entry, they also validate the written content to make sure it conforms with the actual function definitions in the code.
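To make the idea of validating written content against the code concrete, here is a minimal sketch in Python rather than Java. It is not how JavaDoc or Doxygen work internally. It assumes docstrings list parameters with a `:param name:` convention, and the `resize` function is a made-up example with a deliberate typo in its documentation.

```python
import inspect
import re

def documented_params(func):
    """Return the parameter names listed as ':param name:' in the docstring."""
    doc = inspect.getdoc(func) or ""
    return set(re.findall(r":param\s+(\w+)\s*:", doc))

def check_doc(func):
    """Compare the documented parameters with the real signature.

    Returns (missing, unknown): parameters present in the code but not
    documented, and parameters documented but not present in the code.
    """
    actual = set(inspect.signature(func).parameters)
    documented = documented_params(func)
    return sorted(actual - documented), sorted(documented - actual)

# Hypothetical API function whose docs have drifted from the code.
def resize(width, height, keep_ratio=True):
    """Resize an image.

    :param width: target width in pixels
    :param hieght: target height in pixels
    """

missing, unknown = check_doc(resize)
print("undocumented:", missing)
print("not in code:", unknown)
```

The check is only possible because the documentation format is constrained: parameters have to be declared in one recognizable way, so a program can find them and compare them with the code.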
By combining well constrained markup languages containing the right structures for the job with the right set of algorithms, you can generate all kinds of content and do all kinds of useful validation. In one project, for instance, I created a constrained language for writing programmer's guides that automatically linked every mention of an API routine to the API reference and verified every API mentioned against the authoritative list of APIs in the reference. This process caught some significant errors in the previous programmer's guides, including the use of deprecated APIs, misspelled API calls, and legitimate API calls that were missing from the API reference. These were serious and longstanding errors that had been missed in multiple rounds of review over multiple releases.
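A rough sketch of that kind of check, in Python, might look like the following. Everything in it is an assumption made for illustration: the `{name}(api)` inline markup, the contents of `API_REFERENCE`, and the `api/<name>.html` link target are hypothetical and are not taken from the project described above.

```python
import re

# Hypothetical authoritative API list: routine name -> status.
API_REFERENCE = {
    "open_session": "current",
    "close_session": "current",
    "send_msg": "deprecated",
}

# Hypothetical inline markup: an API mention is written as {name}(api).
MENTION = re.compile(r"\{(\w+)\}\(api\)")

def check_mentions(text):
    """Return (name, problem) pairs for mentions that fail verification."""
    problems = []
    for name in MENTION.findall(text):
        status = API_REFERENCE.get(name)
        if status is None:
            problems.append((name, "not in API reference"))
        elif status == "deprecated":
            problems.append((name, "deprecated"))
    return problems

def link_mentions(text):
    """Turn each verified mention into an HTML link to the reference."""
    def repl(match):
        name = match.group(1)
        if name in API_REFERENCE:
            return f'<a href="api/{name}.html">{name}</a>'
        return name  # leave unverified names unlinked
    return MENTION.sub(repl, text)

guide = "Call {open_session}(api), then {send_msg}(api) or {snd_msg}(api)."
for name, problem in check_mentions(guide):
    print(f"{name}: {problem}")
print(link_mentions(guide))
```

None of this works unless writers are required to mark API mentions explicitly. That requirement is itself a constraint, and it is what gives the algorithm something to verify.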
Constraints, then, can address far bigger issues than correct list formatting. They can govern issues like what pieces of information are required for a particular type of topic or how precisely it is supposed to be expressed. With the best will in the world, writers tend to be forgetful and inconsistent about these things. (Writers are always astonished at how inconsistent they have been in the past when they apply an appropriately constrained markup language to their old content.)
How do I get constraints?
So where do you find a markup language that lets you add constraints? The main answer to that is XML. But XML is not a lightweight markup language. In fact, XML is the heavyweight markup language. There really aren’t any others of significance today.
One of the key defining characteristics of a lightweight markup language is that you can comfortably write a document in it using a plain text editor. This is how people write in Markdown, ASCIIDoc, reStructuredText, JavaDoc, and Doxygen. But writing that way in XML is next to impossible. You really need a sophisticated XML editor to write comfortably in XML, and even then, there are problems that make the experience frustrating.
What makes XML constrainable is its combination of generic abstract syntax and support for schema languages. Schema languages are languages for expressing constraints. They state what is allowed and what is not allowed. In XML terms, a document is valid if it meets all the constraints specified in the appropriate schema. A program called a validator is used to validate a document against its schema. Validators are commonly built into editors to guide writers as they are writing.
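As a toy illustration of what a validator does, here is a sketch in Python using only the standard library. It is not a real schema language: the `ALLOWED` and `REQUIRED` tables are a hand-rolled stand-in for what a DTD, XML Schema, or RELAX NG schema would express far more completely, and the `recipe` document type is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical constraints for an invented 'recipe' document type.
# ALLOWED: which child elements each element may contain.
# REQUIRED: which child elements each element must contain.
ALLOWED = {
    "recipe": {"title", "ingredients", "steps"},
    "ingredients": {"item"},
    "steps": {"step"},
}
REQUIRED = {
    "recipe": {"title", "ingredients", "steps"},
    "ingredients": {"item"},
    "steps": {"step"},
}

def validate(xml_text):
    """Return a list of constraint violations; an empty list means valid."""
    errors = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        children = [child.tag for child in elem]
        allowed = ALLOWED.get(elem.tag, set())
        for tag in children:
            if tag not in allowed:
                errors.append(f"<{tag}> is not allowed inside <{elem.tag}>")
        for tag in sorted(REQUIRED.get(elem.tag, set()) - set(children)):
            errors.append(f"<{elem.tag}> is missing required <{tag}>")
    return errors

doc = (
    "<recipe><title>Tea</title>"
    "<ingredients><item>water</item></ingredients>"
    "<note>serve hot</note></recipe>"
)
for error in validate(doc):
    print(error)
```

The point of the sketch is the separation of concerns: the rules live in data, apart from the program that checks them, so an information architect can change the constraints without rewriting the validator.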
In theory, at least, if an XML document is valid, any program that can process a document that is valid by that schema should be able to produce the intended result without error, enabling you to have a fully automated and fully reliable publishing process.
It does not always work out that way in practice, because people tend to take the same approach to XML as they do to other markup languages: they try to create one language for everybody and everything in the organization, which means a language with few constraints and too many permutations of structure for any program to handle with complete reliability. The result is style guide bloat, no support for content quality, and inconsistent output.
The need for a lightweight markup language with constraints
The main reason that XML is such a heavyweight markup language is that it attempts to be perfectly general. You can create an XML markup language for any data structure you can imagine. This leaves it with a verbose syntax that is hard to read and write.
Lightweight markup languages give up generality in favor of simple clear syntax for particular types of data structures — mostly documents. This is what makes it possible to read and write them comfortably in plain text.
But even within this much more limited scope, there is still a substantial case to be made for the ability to add constraints — something that is only reinforced by the existence of highly constrained lightweight languages like JavaDoc. However, JavaDoc’s constraints are baked into its code. It does not provide a mechanism for writers or information architects to create new constraints suitable to their individual projects.
But as far as I have been able to discover, there is currently no extensible and constrainable lightweight markup language. That is why I have started building one myself. It is called SAM, and I am currently using it to write my series on structured writing on TechWhirl. (This series will be turned into a book from XML Press.) Many of the examples in the series are also written in SAM.
SAM is a work in progress, and the schema support is not implemented yet, but if you are interested, you can check out the project on GitHub.