Every time I talk with someone about metadata, I have the feeling we are talking past each other, at least a little. We are both using the same word, “metadata”, but it often seems like we mean different things by it. So, what does “metadata” mean?
Metadata is simply data about data, or, to put it another way, data that describes other data. Metadata is ubiquitous. Indeed most data is useless without metadata to tell us what it means. And because metadata is also data, we need metadata to tell us what the metadata means.
Take, for example, an XML document. An XML document contains markup, which is a form of metadata. Consider this fragment:
<p>The <library-name>foobar</library-name> library contains the routines <routine-name>foo()</routine-name> and <routine-name>bar()</routine-name>.</p>
All the tags, the bits in angle brackets, are metadata. The <p> tag is metadata that tells us that the string it contains is a paragraph. The <library-name> and <routine-name> tags are metadata that tell us that the strings they contain are library names and routine names respectively.
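To make the point concrete, here is a minimal sketch in Python that reads the fragment above and lists each tagged string alongside the tag that describes it. It uses the standard library's xml.etree.ElementTree; the function name tagged_strings is just an illustrative choice, not part of any tool discussed here.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    "<p>The <library-name>foobar</library-name> library contains the routines "
    "<routine-name>foo()</routine-name> and <routine-name>bar()</routine-name>.</p>"
)

def tagged_strings(xml_text):
    """Return (tag, text) pairs for every element nested inside the root."""
    root = ET.fromstring(xml_text)
    # iter() walks the root and all of its descendants; skip the root itself.
    return [(el.tag, el.text) for el in root.iter() if el is not root]

pairs = tagged_strings(FRAGMENT)
for tag, text in pairs:
    # The tag is the metadata; the text is the data it describes.
    print(f"{tag}: {text}")
```

Without the tags, "foo()" is just a string in a sentence. With them, a program can tell that it is the name of a routine.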
The set of tags allowed in an XML document is defined in a schema. A schema that describes the tagging used in the above fragment might contain lines like these:
<xs:element name="library-name" type="xs:string"/>
<xs:element name="routine-name" type="xs:string"/>
These lines are metadata that describe the <library-name> and <routine-name> elements. They say, for instance, that a <routine-name> element can’t contain any other elements, only string data.
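As a rough illustration of a schema acting as metadata about the tags, the sketch below reads those two declarations and checks that the elements they describe contain only string data. This is a hand-rolled check of a single rule, not real XML Schema validation; the wrapping xs:schema element, the function names, and the deliberately broken BAD_DOC sample are all assumptions added for the example.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# The two declarations from the text, wrapped in a schema element (assumed)
# so that the xs: prefix is bound to a namespace and the text parses.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="library-name" type="xs:string"/>
  <xs:element name="routine-name" type="xs:string"/>
</xs:schema>
"""

def string_only_elements(schema_text):
    """Names of the elements that the schema declares as plain strings."""
    root = ET.fromstring(schema_text)
    return {
        el.get("name")
        for el in root.iter(XS + "element")
        if el.get("type") == "xs:string"
    }

def violations(doc_text, string_only):
    """Tags of string-only elements that wrongly contain child elements."""
    root = ET.fromstring(doc_text)
    return [el.tag for el in root.iter() if el.tag in string_only and len(el) > 0]

GOOD_DOC = "<p>Call <routine-name>foo()</routine-name> first.</p>"
BAD_DOC = "<p>Call <routine-name>foo<b>()</b></routine-name> first.</p>"  # hypothetical

rules = string_only_elements(SCHEMA)
print(violations(GOOD_DOC, rules))
print(violations(BAD_DOC, rules))
```

A real system would hand this job to a validating parser. The point is only that the schema is itself data that a program can read in order to judge other data.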
So, a schema is metadata that describes a tagging language, and a tagging language is metadata that describes content.
No, we’re not done. The schema defines the rules for what a <routine-name> element looks like, but it doesn’t tell us what it means. Why should a writer tag text with the <routine-name> element? The meaning and use of the tags described in a schema must be documented in a guide or reference that tells the writer how to use these tags. That document is metadata for the tagging language as well.
If you write docs about data and its meaning and applications, those documents are metadata for that data.
If the documents that you create have tables of contents and indexes, they are metadata too. So are headings, subheadings, and captions. So are running headers and footers on pages. These things are all data that describe the content of the book, which is data, and they are therefore metadata.
The same sort of thing applies in other kinds of data. You will find layers of nested metadata wherever you look. In the database world, we have column names, which are metadata, and data dictionaries, which are metadata that describes the columns and the relationships between them.
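The database case can be sketched with Python's built-in sqlite3 module. The table and its contents are made up for the example; cursor.description and PRAGMA table_info are the standard ways SQLite exposes column names and declared types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this illustration.
conn.execute("CREATE TABLE routines (name TEXT, library TEXT)")
conn.execute("INSERT INTO routines VALUES ('foo()', 'foobar')")

cursor = conn.execute("SELECT * FROM routines")
# cursor.description holds one tuple per column; item 0 is the column name.
column_names = [d[0] for d in cursor.description]
row = cursor.fetchone()

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
# This plays the role of a very small data dictionary.
column_types = {r[1]: r[2] for r in conn.execute("PRAGMA table_info(routines)")}

print(dict(zip(column_names, row)))
print(column_types)
```

The row on its own is just a pair of strings. The column names say what each string is, and the table information says what kind of value each column is meant to hold.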
Not all these forms of metadata are commonly referred to as “metadata”. Many forms of metadata have their own long-standing names: index, schema, data dictionary, table of contents, catalog, tag, label. But in the internet/CMS space it seems that the development of new forms of metadata, or at least of new ways and means of encoding metadata, outstripped our ability to give them names of their own. Or perhaps it was that many of them already had names, but people felt that the connotations of the old names (such as index, for example) might obscure what was needed in the new environment.
Whatever the reason, people started to use the generic term “metadata” for a number of new kinds/forms/presentations of metadata. So today we have many forms of metadata that have their own names, and many other forms, often closely analogous to the existing forms, which all go under the generic moniker “metadata”. Thus the confusion: “metadata” can refer to a bunch of things that are all called “metadata” and to a bunch more things that are not commonly called “metadata”, but which still are metadata.
Structured writing is all about adding metadata to content so that the content can be processed in various ways. This metadata can enable all sorts of things. It can guide authoring, as in the case of a schema. It can be used to audit written content to improve quality and consistency. It can be used to generate links. It can be used to select particular units of content for use in a particular situation. It can be used to create tables of contents and indexes. It can be used to optimize the content for access by search engines. It can be used to enable faceted search of a documentation set.
Each of these functions may require metadata in a different form, but often they actually run off the same metadata. At each stage of the process, it is important to make sure that the appropriate metadata is made accessible to the next function in the chain. In many cases, this does not involve creating new metadata, simply making existing metadata available in the form that is required. The reason that markup languages (whether based on XML, SGML, or something else) are so important to this process is that they provide a way to capture metadata that is both human writable and machine readable. The metadata encoding required for one process can easily be transformed into the encoding required for the next.
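Here is one way such a transformation might look, as a sketch only. It takes the semantic tags from the earlier fragment and re-encodes them as HTML span elements with class attributes, so that a stylesheet or a search indexer further down the chain can still see them. The choice of span and class as the target encoding is an assumption made for the example, as is the function name.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    "<p>The <library-name>foobar</library-name> library contains the routines "
    "<routine-name>foo()</routine-name> and <routine-name>bar()</routine-name>.</p>"
)

def to_html_spans(xml_text):
    """Re-encode every semantic tag inside the root as <span class="...">."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el is root:
            continue  # leave the <p> wrapper as it is
        el.set("class", el.tag)  # keep the original tag name as the class
        el.tag = "span"
    return ET.tostring(root, encoding="unicode")

html = to_html_spans(FRAGMENT)
print(html)
```

No new metadata is created here. The same facts about the text are carried forward in the form the next stage expects.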
Metadata is so ubiquitous in a documentation system that it is very easy to be talking about different types of metadata, serving different functions, and thus talking at cross purposes. In many cases, I think people fail to grasp both how ubiquitous metadata is, and how fluid it is. They recreate metadata in different forms because they don’t grasp that the metadata they want already exists, or because they are dealing with a system which was not designed to capture the metadata at source and let it flow through the whole documentation process. This is a pity, because it often creates large amounts of avoidable overhead in documentation systems, and these overheads can adversely affect both quality and productivity.
Metadata is everywhere. The trick is to recognize it, capture it early, and make it flow through the system so that downstream functions have the metadata they need to operate efficiently without additional overhead. Does the reader want to see all the topics that mention routine foo()? The metadata to enable that search was created by the author when they marked up <routine-name>foo()</routine-name> in the original topic markup. Did that metadata flow through the system so as to become available to a faceted search that the reader can use?
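The routine foo() question can be sketched in a few lines of Python. The topic identifiers and the second topic are hypothetical, and a real faceted search would sit on top of a proper search engine, but the index is built entirely from markup the author already wrote.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical topic set; only the first topic comes from the text above.
TOPICS = {
    "overview": (
        "<p>The <library-name>foobar</library-name> library contains the routines "
        "<routine-name>foo()</routine-name> and <routine-name>bar()</routine-name>.</p>"
    ),
    "cleanup": "<p>Call <routine-name>bar()</routine-name> when you are done.</p>",
}

def build_facet_index(topics, facet_tag):
    """Map each value tagged with facet_tag to the set of topics that mention it."""
    index = defaultdict(set)
    for topic_id, xml_text in topics.items():
        for el in ET.fromstring(xml_text).iter(facet_tag):
            index[el.text].add(topic_id)
    return index

routine_index = build_facet_index(TOPICS, "routine-name")
print(sorted(routine_index["foo()"]))  # topics that mention foo()
print(sorted(routine_index["bar()"]))  # topics that mention bar()
```

Nobody had to tag the topics a second time for search. The index falls out of the <routine-name> markup, provided that markup survives as far as the search function.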
Does metadata flow like that in your system?