It is easy to think of metadata as a few information fields that you tack onto an article when you submit it to a CMS. In this vision, metadata is something fairly small, a triviality in comparison to the content itself.
The reality is just the opposite: for any piece of content or data, the metadata is bigger than the data. Metadata is data that explains data. And since data needs to be explained to be useful, metadata also needs metadata to explain it. This means that behind every piece of data there is a metadata cascade that eventually leads back to unstructured stories. The full metadata for any given piece of data or content, therefore, is all of the data it takes to explain that content completely. That’s a lot.
Consider a fairly simple example. When you create a table in an HTML document, you are adding metadata to the content in the form of table markup. The table markup allows a browser to display the content as a table. But how does the browser know what the HTML tag <table> means or how to display it?
The table tag is metadata to your content, but it is also data, and as such it needs metadata to explain what it means. That metadata is the HTML specification. Take a look. It’s pretty big. Bigger than your table, probably. The metadata is bigger than the data.
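To make the point concrete, here is a minimal table fragment (the content values are invented for illustration). The few words of actual content are wrapped in markup whose meaning is defined by that much larger specification:

```html
<!-- A tiny table: the data is a couple of words and a number;
     everything else is metadata that the HTML specification explains. -->
<table>
  <tr>
    <th>Part</th>
    <th>Quantity</th>
  </tr>
  <tr>
    <td>Filter</td>
    <td>2</td>
  </tr>
</table>
```

Strip out the tags and only a handful of words remain; everything else points back to the specification for its meaning.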
Of course, the HTML specification is part of the metadata of millions of documents. The total size of all the HTML documents in the world is vastly greater than the size of the HTML specification. This is as it should be. Content or data that has common metadata can be processed by common algorithms. If we did not have a huge amount of content that shared the metadata of the HTML specification, we would not have the Web.
Useful content always has a combination of common metadata and unique metadata. Without common metadata, it could not share any common processing, and would be very hard to manage. Without unique metadata, it would be redundant. Some of its common metadata it will share with a very large body of content, while other common metadata may only be shared with a limited range of content.
As an author, your responsibility for metadata does not end with the few pieces of data you type into the metadata record when you submit your content to the CMS. Your responsibility goes much deeper than that. You are responsible for making sure that your content conforms to its metadata — all of it.
Whatever format you are writing in, be it HTML, DITA, DocBook, Markdown, or anything else, you have to tag your content with the appropriate tags. That is applying metadata. The process is fairly simple: you pick from a list of available tags. But the implications are significant.
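For example, marking up a procedure as a DITA task means choosing from a fixed set of elements. This fragment is invented for illustration, not drawn from any real documentation set:

```xml
<!-- A hypothetical DITA task fragment. Choosing <cmd> rather than a
     generic paragraph asserts that each sentence is a command the
     reader is meant to carry out. -->
<task id="replace-filter">
  <title>Replacing the filter</title>
  <taskbody>
    <steps>
      <step><cmd>Unplug the unit.</cmd></step>
      <step><cmd>Remove the old filter and insert the new one.</cmd></step>
    </steps>
  </taskbody>
</task>
```

Picking `<cmd>` here is easy; what it commits you to is everything the DITA specification says a `<cmd>` is.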
An entire processing infrastructure relies on the metadata attached to your content, and on the entire metadata cascade that explains that metadata. You may never have read the DITA specification, but if you are writing in DITA, your content is being processed by algorithms written by someone who did (or should have). And they wrote those algorithms on the good faith assumption that when you wrapped a piece of content in a particular DITA tag, that tag was the correct one to use for that piece of content: that your content conformed to the metadata you applied to it.
Which means, in effect, that the algorithm is written on the tacit assumption that you have read the DITA spec, fully understood it, completely remembered it, and consistently applied it to every piece of content that you wrote.
Which, of course, you haven’t. Just as you haven’t read the data dictionary of the bank’s database tables when you type information into an ATM or an online banking app.
To a very large extent, we could characterize usability in digital systems as avoiding the need for users to understand the full metadata cascade behind the data they create. (The rest of usability is about avoiding mechanical awkwardness — like typing angle brackets for XML tags.)
How is this done?
- By providing natural language prompts that guide the writer (in effect, a plain-language interpretation of the underlying metadata).
- By designing systems such that you only ask writers for data in a domain they understand well (meaning that they actually do understand most of the relevant metadata).
- By designing systems that do lots of validation of data input and give good feedback when the data provided is inconsistent with the metadata.
Unfortunately, all of these things are difficult to do when it comes to content.
- Not only is it hard to fit natural language prompting onto the screen without obscuring the document itself, but the author is also creating so much metadata, so quickly, as they type that a high level of guidance would kill productivity. And in many cases, the markup schema offers so many choices at every juncture that full guidance on selecting between them would be overwhelming. This forces the author to learn the schema and hold it in their head while writing, a cognitive load made worse by the size and complexity of many schemas.
- Much of the metadata that authors are asked for relates to the publishing domain or the content management domain, which the average author is not familiar with. Even the abstract document structures that typically make up a good portion of many schemas are not part of the native way that authors think about documents. Authors are therefore forced to learn a good deal about these foreign domains in order to create content that is valid for the metadata they are applying to it.
- It is easy enough to check whether a person is attempting to withdraw more money than they have in their bank account. It is much more difficult to validate that a string of sentences actually fulfils the requirements of the markup attached to them. And this is made significantly worse by the very loose structures found in many schemas. By creating schemas that are largely free of constraints, we put all the burden on authors to follow appropriate constraints themselves, which again means they have to know and understand the metadata cascade behind those constraints (see the sketch after this list).
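To illustrate the limits of validation, here is a simplified, hypothetical schema fragment (not an excerpt from any real DTD). A validator can enforce structural rules like these, but not the semantic commitments behind them:

```dtd
<!-- The validator can check structure: a steps element must contain
     at least one step, and every step must begin with a cmd. -->
<!ELEMENT steps (step+)>
<!ELEMENT step  (cmd, info?)>
<!ELEMENT cmd   (#PCDATA)>
<!ELEMENT info  (#PCDATA)>
<!-- No rule here can check that the text inside a cmd actually
     describes a single action the reader can perform. -->
```

The structural check is cheap; the semantic check is the part that still lives in the author's head.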
That the metadata is bigger than the data is true for all systems. The successful systems are the ones that have managed this fact by encapsulating the metadata so that users do not need extensive metadata knowledge of their own. The content management industry has not done a good job of this.
Many content management projects succeed, but many fail, far more than we should expect at this stage in the development of the industry. Some projects that were judged an initial success break down over time. Such failures can plausibly be traced back to a loosening of metadata discipline, to a loss of understanding of the complex metadata system among users, or to the accumulation of content that does not fully conform to the metadata requirements but whose non-conformance was not detected at the time it was created. (Such non-conformance can become self-perpetuating, as non-compliant workarounds are used to compensate for existing non-compliant content.)
A large part of this failure lies in the industry's failure to design systems with this problem in mind. Complex metadata requirements accumulate during system design without much thought as to how they might be successfully hidden from users. Attempts to solve the problem often take the form of a gloss over the system, such as putting WYSIWYG interfaces over XML tags. But these approaches really only address the mechanical awkwardness aspects of usability. They do not encapsulate metadata requirements successfully.
As noted above, the encapsulation of metadata is an inherently difficult proposition for content, more so perhaps than for any other type of data. So we should not expect that content management will ever be as easy as using an ATM. But we could benefit greatly from a ground-up rethink of the content management model, one that puts this problem at its heart.