Are structured writing and crowd-sourced content on divergent paths, or can you have both? It’s a pretty hot topic right now. Sarah O’Keefe recently tweeted:
Must push XML and structure out to masses and need better tools for that
Linda Urban recently tweeted this:
@finiteattention: The problems of crowdsourced user assistance: http://xkcd.com/979 #ua #techcomm
And most recently, Tom Johnson has blogged about Wiki Culture, Reader/Writer Distinctions, and Divergence from Structured Authoring. No surprise that this is a concern, since structured writing and wikis are two of the hottest trends in technical communication at the moment.
So here is the key question: can you get structured data from the crowd? The answer is an unequivocal yes. In fact, it is not only possible, it is common. So common, in fact, that you have been involved in it many times, and probably in the last week.
Before we go there, though, it is important to establish what we mean by structured writing. The term structured writing is used in several different ways. Each is legitimate in its own way, but it is important to distinguish them if we want a useful answer to the question of whether structured writing and crowd-sourcing can mix. Structured writing can simply mean writing to a consistent template. As I have argued recently, real topics (as opposed to shredded books) tend to naturally conform to a type. In this sense, every cookbook is an example of structured writing.
Structured writing also means creating content in a format that can be read and processed by computers. The most common means of structuring content in this sense is XML. In this sense, all XML is structured content. But that would mean that XHTML is structured content, since it is XML, and most people in the structured content business will tell you that HTML, whether prefaced by an X or not, is not structured.
So what does structured mean in the sense that excludes XHTML from the set of structured markup languages? It means a markup that is more specific to the content, markup that imposes limits and restrictions on the author, markup that tells you something about what the content means, not simply about how it should be displayed.
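To make that distinction concrete, here is a minimal sketch in Python. The recipe and its element names (`recipe`, `servings`, `ingredient`) are invented for illustration and are not taken from any real schema. The first fragment is XHTML-style markup that only says how things are laid out; the second says what each piece of content is.

```python
import xml.etree.ElementTree as ET

# Presentation-oriented markup: it says "heading" and "list item",
# but nothing about what the items actually are.
GENERIC = """
<div>
  <h1>Pancakes</h1>
  <ul>
    <li>2 cups flour</li>
    <li>Serves 4</li>
  </ul>
</div>
""".strip()

# Content-specific markup (hypothetical element and attribute names):
# each element states what its content means.
SPECIFIC = """
<recipe>
  <title>Pancakes</title>
  <servings>4</servings>
  <ingredient quantity="2" unit="cup">flour</ingredient>
  <ingredient quantity="1.5" unit="cup">milk</ingredient>
</recipe>
""".strip()


def list_items(xml_text):
    """All a program can reliably get from the generic markup: list items."""
    root = ET.fromstring(xml_text)
    return [li.text for li in root.iter("li")]


def scaled_ingredients(xml_text, new_servings):
    """With specific markup, scaling a recipe is a simple, reliable process."""
    root = ET.fromstring(xml_text)
    factor = new_servings / int(root.findtext("servings"))
    return [
        (ing.text, float(ing.get("quantity")) * factor, ing.get("unit"))
        for ing in root.iter("ingredient")
    ]


print(list_items(GENERIC))
print(scaled_ingredients(SPECIFIC, 8))
```

From the generic version, a program cannot tell that "Serves 4" is not an ingredient without guessing. From the specific version, it can double the recipe without any guessing at all.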
It is probably easiest to make the point by way of analogy. Consider Microsoft Excel: it allows you to create spreadsheets. If a ledger book is structured in the first sense (the content is organized in a consistent way), Excel allows you to make it structured in the second sense, by marking up the rows of numbers in a form that can be read and processed by the computer.
But Excel is generic. It imposes almost no limits or restrictions on the data you can enter. You can do pretty much anything you want with Excel: do your taxes, catalog your record collection, keep score for your softball league. But you do have to do it yourself. If you wanted to use Excel to do your income taxes, for instance, you could set up a spreadsheet that looked like the tax form and did the calculations, but it would be a lot of work. For doing your taxes, you would be much better off buying a tax preparation program like TurboTax.
TurboTax is structured in the third sense: the structure tells you what the content means and imposes limits and restrictions on the data the user can enter. TurboTax is about taxes from the ground up. Every field in the program is pre-coded to do exactly one thing. It comes with a huge amount of validation capability as well. It can tell you if you have claimed a deduction you are not entitled to. It can also optimize deductions between spouses. It can do all of this because it is built to do one thing and one thing only: your taxes.
TurboTax is much more powerful than Excel for doing taxes. But it is also much more limited than Excel. You can’t use it to catalog your record collection or keep the scores for your softball league. This is the essential point about structure: structure means limits. The more structured something is, the more limits it has. Limits are good. The more limits you put on data, the more reliable that data becomes, and the more reliable data becomes, the more processes you can apply to it reliably.
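The chain from limits to reliability to processability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DeductionEntry`, `ALLOWED_CATEGORIES`, and the category names are hypothetical and have nothing to do with how TurboTax or any real tax package works.

```python
from dataclasses import dataclass

# A generic "spreadsheet": any cell accepts anything, so nothing
# downstream can rely on what it finds there.
generic_sheet = {"A1": "Income", "B1": "52,000 (approx)", "B2": "n/a"}

# Hypothetical set of categories the purpose-built form will accept.
ALLOWED_CATEGORIES = {"charity", "medical", "tuition"}


@dataclass(frozen=True)
class DeductionEntry:
    """A purpose-built field: it accepts one kind of data and rejects the rest."""
    category: str
    amount_cents: int

    def __post_init__(self):
        # The limits: only known categories, only non-negative whole cents.
        if self.category not in ALLOWED_CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if type(self.amount_cents) is not int or self.amount_cents < 0:
            raise ValueError("amount_cents must be a non-negative integer")


def total_cents(entries):
    """Safe to run without checks: every entry was validated on creation."""
    return sum(entry.amount_cents for entry in entries)


entries = [DeductionEntry("charity", 12000), DeductionEntry("medical", 3050)]
print(total_cents(entries))
```

The constrained type cannot hold a softball score or an approximate number typed as text, and that is precisely why `total_cents` can be applied to it without further inspection.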
Structure equals limits; limits equal reliability; reliability equals processability – that is all ye know on Earth, and all ye need to know.
In Canada, the Canada Revenue Agency (our version of the U.S. IRS and the UK Inland Revenue) certifies certain tax preparation packages for submitting tax forms using the E-File service. (I’m sure most other countries have something similar.) Even if you created an Excel spreadsheet to do your taxes, Revenue Canada would not let you submit it through E-File. The strict limits that TurboTax and other packages place on the data that is entered into them make them more reliable, to the point where Revenue Canada is willing to allow that data to be fed directly into their tax processing system.
Which is the point we have been working towards: how crowd-sourcing and structured data can mix. Because that is exactly what Revenue Canada, and probably every other first-world tax authority, is doing: they are crowd-sourcing tax data. Millions of ordinary taxpayers around the world are entering tax data directly into government computer systems, speeding up tax processing and saving millions by avoiding the need to capture data from printed forms.
Governments are not alone in this. Banks crowd-source financial data from ATMs, point of sale terminals, and online banking systems. Amazon crowd-sources order entry. The airlines crowd-source flight booking and check-in. Filled out a form online lately? Welcome to the wonderful world of crowd-sourced structured data.
So, can crowd-sourcing and structured data mix? Absolutely. In fact, structured data is an absolute requirement for crowd-sourcing. Revenue Canada, your bank, Amazon, your airline, and all the other businesses that now have you do their data entry for them, are not going to accept an Excel spreadsheet of your taxes, your transactions, your order, or your check-in. Nor is Revenue Canada going to accept your taxes through Amazon.com, or your bank allow you to withdraw money through an airport check-in terminal. Each system is specific to its purpose. Each institution is only going to accept highly structured data: reliable data that is specifically structured and verified according to their exact specifications.
Excel may be ubiquitous, and generic enough to use for almost any purpose, with an appropriate amount of ingenuity and effort, but being generic and ubiquitous are not the keys to successful crowd-sourcing. Quite the opposite: successful crowd-sourcing of data requires highly precise and specific structure that ensures the reliability of the data.
How does this translate to the crowd-sourcing of tech pubs content? That is a subject for another post.