What constitutes a “real” XML editor? The question is perennial, but is made topical by Tom Aldous’ surprisingly shrill defense of FrameMaker as an XML editor. It is unusual for a market-leading company to indulge in myth-busting aimed at tiny competitors. It is an approach more common to the small and desperate. But if we look past the oddness of Adobe employing this tactic, we see that the question of whether FrameMaker is a real XML editor, as with almost all debates about what makes a “real” anything, is not a debate about the product’s features, but a debate about what “real” means in the context.
Since an XML editor is just a tool for editing XML, and XML is just a tool for structuring content, the question at the heart of the matter is: what is structured writing?
The Adobe position is what I would describe as “desktop publishing plus angle brackets”. If you take this position, the thing a real XML editor must do is to present the writer with a desktop publishing interface, and produce XML under that skin. It is not at all surprising that Adobe takes the “desktop publishing plus angle brackets” view of structured writing. Desktop publishing is their baby, after all, one of the key technologies the company was built on.
But those of us on the “FrameMaker is not a real XML editor” side of the question tend more to the view that desktop publishing is the disease and structured writing is the cure. One of the most common phrases used to describe both XML and structured writing is “separating structure from formatting”. But what has become of that separation if the interface in which the author works is a WYSIWYG representation of the published form of the document? Structure and formatting may have been separated under the skin of the applications, but they have been promptly merged again in the authoring interface, never more to part, in most cases.
The truth is that in many structured writing systems, the separation of structure and formatting is not very great. Many schemas/DTDs are little more than an abstraction of the presentation artifacts of a printed page, such as substituting <emphasis> for <italic>. The design of such systems is done from the published artifacts backwards. In many cases, the abstraction goes no further back than the point where the same abstraction can be used to drive both print and on-line publication.
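How thin such an abstraction can be is easy to sketch. In the following illustration (the element names and mapping table are invented for the example), the "semantic" `<emphasis>` element resolves straight back to italics, because it never was anything more than `<italic>` renamed:

```python
# A minimal sketch of a presentational abstraction: <emphasis> is just
# <italic> renamed, and the transform maps it straight back to italics.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<para>This point is <emphasis>important</emphasis>.</para>"
)

def to_html(elem):
    # Hypothetical mapping table: each "semantic" element has exactly one
    # presentational equivalent, in every output medium.
    tag_map = {"para": "p", "emphasis": "i"}
    out = "<{0}>".format(tag_map[elem.tag]) + (elem.text or "")
    for child in elem:
        out += to_html(child) + (child.tail or "")
    return out + "</{0}>".format(tag_map[elem.tag])

print(to_html(doc))  # <p>This point is <i>important</i>.</p>
```

A schema designed this way is, in effect, a one-to-one rename of the formatting vocabulary.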
Additional semantic tagging may certainly be added. This is sometimes what you do when you create a DITA specialization (sometimes you just create another publishing abstraction). For instance: you add some tags to your topic type that are specific to the subject matter of that topic type. But when you do this, you are adding a semantic gloss to an element which is actually an abstraction of a publishing artifact. The whole point of the specialization mechanism is that even with the semantic gloss in place, the element can still be published as the underlying artifact.
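The fallback behavior can be sketched as follows. The element name `<partno>` and its class value are invented for illustration, but the mechanism shown (a `class` attribute listing the element's ancestry, so that a generic processor can fall back to the base element) is the actual DITA specialization pattern:

```python
# A sketch of DITA-style specialization fallback: the processor walks the
# class ancestry from most to least specialized and renders the first
# element it recognizes, so the semantic gloss still publishes as its
# underlying artifact.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<p class="- topic/p ">Order part '
    '<partno class="- topic/ph mydomain/partno ">A-1234</partno> today.</p>'
)

# A generic processor that knows only the base vocabulary.
KNOWN_RENDERINGS = {"topic/p": "p", "topic/ph": "span"}

def render(elem):
    ancestry = elem.get("class", "").split()[1:]  # drop the leading "-"
    tag = next(KNOWN_RENDERINGS[a]
               for a in reversed(ancestry) if a in KNOWN_RENDERINGS)
    out = "<{0}>".format(tag) + (elem.text or "")
    for child in elem:
        out += render(child) + (child.tail or "")
    return out + "</{0}>".format(tag)

print(render(doc))  # <p>Order part <span>A-1234</span> today.</p>
```

The processor has never heard of `partno`, yet the document publishes anyway, as the phrase-level artifact it specializes.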
Adding a semantic gloss to a schema that is fundamentally an abstraction of a publishing artifact can, of course, yield a number of benefits, but the method is definitely limited, and can sometimes present problems — for example, how exactly is the semantic part of the semantic gloss supposed to be represented in the publishing-oriented view that the author is working in?
The alternate approach to structured writing — the view in which FrameMaker is not a real XML editor — does not work from the published artifact backwards but from the content store outwards in two directions, one toward the author and the other toward the published document. The aim of this approach, plain and simple, is to get data that is as reliable as possible. This is how database systems are designed. You begin with a model of the data that you want to capture, taking into account all of the operations you want to perform on that data, and structuring your data model in such a way that it allows you to write queries that will perform all of those operations reliably.
What I mean by “reliably”, here, is that the database structure is designed in such a way that, provided the data is correctly entered, you can run the query and be confident in the result without the need for a human being to check it over. Having human beings check over data is expensive and time consuming, so we naturally want to minimize it as much as possible. Thus when you log into Amazon and check your list of recommendations — a page that is entirely generated by machine based on all the books you have read, bought, rated, or placed on your wish list — there is no one on the Amazon staff who has to look over that page and approve it before it is sent to your browser. Such an inspection would be too costly for Amazon, and too time consuming for you. If such an inspection were needed, the feature simply would not exist. But the inspection isn’t needed, because the data in Amazon’s databases is reliable enough that they can generate this page for you automatically.
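A toy example (invented schema and data) shows what "reliable" means in this sense. Because the structure constrains what can be stored in the first place, the query result can be trusted without anyone reviewing it:

```python
# An illustrative sketch: constrained storage plus a query whose result
# needs no human sign-off, which is the sense of "reliable" used here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (reader TEXT NOT NULL, book TEXT NOT NULL, "
    "stars INTEGER NOT NULL CHECK (stars BETWEEN 1 AND 5))"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("ann", "XML in a Nutshell", 5),
     ("ann", "DocBook: The Definitive Guide", 4),
     ("bob", "XML in a Nutshell", 5)],
)

# The "recommendations page": generated entirely from the data; no editor
# approves the result before it is shown.
top = conn.execute(
    "SELECT book FROM ratings GROUP BY book "
    "ORDER BY AVG(stars) DESC, book LIMIT 1"
).fetchone()[0]
print(top)  # XML in a Nutshell
```

The `CHECK` constraint is the structural guarantee: a rating of 17 stars can never get in, so the average never needs auditing.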
The more functions a database system can perform reliably, in this sense, the more efficient and productive it is. A good database system, therefore, is designed to be as reliable as possible for as many functions as possible. But no amount of reliable structure does you any good if you don’t have reliable data entry. This is the issue that the term “garbage in, garbage out” was coined to describe.
To ensure that you get good data in, you must design your data gathering system to be as clear and unambiguous as possible, to guide authors as fully as possible, to prevent errors as far as possible (for example, by only allowing people to choose values from a list of valid values), and to detect and report any errors that do occur as soon as possible. You must also audit your content regularly to make sure errors are not creeping in, and change how data is collected to avoid such errors in the future.
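In code, prevention and early detection look something like this sketch (the `status` field and its value list are invented for the example):

```python
# A sketch of error prevention and early detection at data entry:
# restrict the field to a fixed list of valid values, and reject
# anything else at the moment of capture rather than at publish time.
VALID_STATUS = {"draft", "review", "approved"}

def capture_status(value):
    if value not in VALID_STATUS:
        raise ValueError(
            "invalid status {0!r}; choose one of {1}".format(
                value, sorted(VALID_STATUS))
        )
    return value

capture_status("approved")     # accepted
# capture_status("finished")   # would raise ValueError at entry time
```

The same principle scales up to enumerated types in a schema or pick lists in an authoring form: the invalid value is never written, so it never has to be hunted down later.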
An XML document is a database. A collection of XML documents is a database. To make your structured writing system as efficient and productive as possible, those databases should be designed to be as reliable as possible for as many operations as possible. Creating a schema as an abstraction of a printed page is not generally the best design to achieve this. Similarly, in order to get reliable data from authors, you need to provide them with an authoring interface that ensures that the content they create is as complete and error-free as possible. Having people author in the visual representation of a published page is generally not the best strategy for achieving this.
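Treating a collection of XML documents as a database can be sketched in a few lines. The topic structure below is invented; the point is that a schema designed around the data lets you query the collection the way a report queries tables:

```python
# A sketch of querying a collection of XML documents as a database:
# the equivalent of "SELECT title FROM topics WHERE audience = 'admin'".
import xml.etree.ElementTree as ET

collection = [
    '<topic id="t1"><title>Installing</title><audience>admin</audience></topic>',
    '<topic id="t2"><title>Tuning</title><audience>admin</audience></topic>',
    '<topic id="t3"><title>Browsing</title><audience>user</audience></topic>',
]

admin_titles = [
    doc.findtext("title")
    for doc in map(ET.fromstring, collection)
    if doc.findtext("audience") == "admin"
]
print(admin_titles)  # ['Installing', 'Tuning']
```

A query like this is only as reliable as the schema behind it: if `audience` were free-form prose rather than a constrained field, the result would need a human to check it.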
In this view, structured writing is not desktop publishing with angle brackets; it is database design with content. In a database system, the data structures are designed for reliability, the authoring interfaces are designed for accurate data capture, and the publishing is then based on a programmatic transformation of the data into publishable form by a reporting system. Publishing, in any medium, is not a paramount design criterion for the data store or the authoring interface. Their design proceeds in the well-founded confidence that a well-structured data store containing reliable data can be published successfully to any medium. Structure now is well and truly separated from formatting — separated so thoroughly that formatting is not the driver in the design and implementation of either the structure or the authoring system.
A structured writing system based on the idea that structured writing is about making content reliable will be designed much more like a database system. Not every type of topic can be captured by a form, of course, but we can design our authoring schemas and authoring interfaces to maximize the reliability of the content data we collect, rather than to mimic the formatting of the document presentation we will eventually create from that content data.
A real XML editor, for those of us who look at structured writing as database design with content, is not one that presents a DTP interface over an XML schema that is an abstraction of the printed document, but one that allows us to create a highly reliable data entry system for the capture of highly structured, highly reliable content data, which we can then publish any way we like.