The Meaning of Metadata

Every time I talk with someone about metadata, I have the feeling we are talking past each other, at least a little. We are both using the same word, “metadata”, but it often seems like we mean different things by it. So, what does “metadata” mean?

Metadata is simply data about data, or, to put it another way, data that describes other data. Metadata is ubiquitous. Indeed most data is useless without metadata to tell us what it means. And because metadata is also data, we need metadata to tell us what the metadata means.

Take, for example, an XML document. An XML document contains markup, which is a form of metadata. Consider this fragment:

<p>The <library-name>foobar</library-name> library contains the routines <routine-name>foo()</routine-name> and <routine-name>bar()</routine-name>.</p>

All the tags are metadata. The <p> tag is metadata that tells us that the string it contains is a paragraph. The <library-name> and <routine-name> tags are metadata that tell us that the strings they contain are library names and routine names respectively.
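To make the point concrete, here is a minimal sketch in Python that reads the fragment above and lists each tagged string together with the tag that describes it. It uses the standard library's xml.etree.ElementTree; the names FRAGMENT and tagged_strings are my own, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the post, unchanged.
FRAGMENT = (
    "<p>The <library-name>foobar</library-name> library contains the routines "
    "<routine-name>foo()</routine-name> and <routine-name>bar()</routine-name>.</p>"
)


def tagged_strings(xml_text):
    """Return (tag, text) pairs for every element nested inside the root."""
    root = ET.fromstring(xml_text)
    # root.iter() yields the root itself first; skip it to keep only the inline tags.
    return [(el.tag, el.text) for el in root.iter() if el is not root]


pairs = tagged_strings(FRAGMENT)
for tag, text in pairs:
    print(f"{text!r} is a {tag}")
```

The tag is what lets a program tell that foobar is a library and foo() is a routine. Without the markup, both are just strings in a sentence.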

The set of tags allowed in an XML document is defined in a schema. A schema that describes the tagging used in the above fragment might contain lines like these:

<xs:element name="library-name" type="xs:string"/>

<xs:element name="routine-name" type="xs:string"/>

These lines are metadata that describe the <library-name> and <routine-name> elements. They say, for instance, that a <routine-name> element can’t contain any other elements, only string data.
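As a rough illustration of how a schema constrains markup, the sketch below applies just the one rule described above: an element declared as xs:string may not contain other elements. This is not a real XML Schema validator. It is a hand-rolled check using only the Python standard library, and it assumes the two declarations are wrapped in an xs:schema element and that the type is written with the literal prefix xs:. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# The two declarations from the post, wrapped in a schema element (an assumption).
SCHEMA = f"""<xs:schema xmlns:xs="{XS}">
  <xs:element name="library-name" type="xs:string"/>
  <xs:element name="routine-name" type="xs:string"/>
</xs:schema>"""


def string_only_elements(schema_text):
    """Names of elements the schema declares as plain xs:string."""
    schema = ET.fromstring(schema_text)
    return {
        decl.get("name")
        for decl in schema.iter(f"{{{XS}}}element")
        # Simplification: compares the literal prefixed name, not the resolved type.
        if decl.get("type") == "xs:string"
    }


def violations(content_text, schema_text):
    """Tags declared xs:string that nonetheless contain child elements."""
    string_only = string_only_elements(schema_text)
    root = ET.fromstring(content_text)
    return [el.tag for el in root.iter() if el.tag in string_only and len(el) > 0]


good = "<p>Call <routine-name>foo()</routine-name>.</p>"
bad = "<p>Call <routine-name><b>foo()</b></routine-name>.</p>"

print(violations(good, SCHEMA))
print(violations(bad, SCHEMA))
```

In practice you would hand both files to a proper schema validator. The point here is only that the schema is data a program can read in order to judge other data.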

So, a schema is metadata that describes a tagging language, and a tagging language is metadata that describes content.

No, we’re not done. The schema defines the rules for what a <routine-name> element looks like, but it doesn’t tell us what it means. Why should a writer tag text with the <routine-name> element? The meaning and use of the tags described in a schema must be documented in a guide or reference that tells the writer how to use these tags. That document is metadata for the tagging language as well.

If you write docs about data and its meaning and applications, those documents are metadata for that data.

If the documents that you create have tables of contents and indexes, they are metadata too. So are headings, subheadings, and captions. So are running headers and footers on pages. These things are all data that describe the content of the book, which is data, and they are therefore metadata.

The same sort of thing applies in other kinds of data. You will find layers of nested metadata wherever you look. In the database world, we have column names, which are metadata, and data dictionaries, which are metadata that describes the columns and the relationships between them.
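The database case can be sketched the same way. The example below uses Python's built-in sqlite3 module with an in-memory database. The table and its columns are invented for illustration, and SQLite's PRAGMA table_info stands in for a very small data dictionary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this example.
conn.execute("CREATE TABLE routine (name TEXT NOT NULL, library TEXT NOT NULL)")
conn.execute("INSERT INTO routine VALUES ('foo()', 'foobar')")

cursor = conn.execute("SELECT * FROM routine")
# cursor.description carries the column names: metadata about the rows.
column_names = [d[0] for d in cursor.description]
rows = cursor.fetchall()

# PRAGMA table_info returns rows of (cid, name, type, notnull, dflt_value, pk):
# metadata about the columns themselves.
dictionary = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(routine)")}

print(column_names)
print(rows)
print(dictionary)
```

The row ('foo()', 'foobar') means nothing until the column names say which value is the routine and which is the library, and the column names in turn are described by the table definition.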

Not all these forms of metadata are commonly referred to as “metadata”. Many forms of metadata have their own long-standing names: index, schema, data dictionary, table of contents, catalog, tag, label. But in the internet/CMS space it seems that the development of new forms of metadata, or at least of new ways and means of encoding metadata, outstripped our ability to give them names of their own. Or perhaps it was that many of them already had names, but people felt that the connotations of the old names (such as index, for example) might obscure what was needed in the new environment.

Whatever the reason, people started to use the generic term “metadata” for a number of new kinds/forms/presentations of metadata. So today we have many forms of metadata that have their own names, and many other forms, often closely analogous to the existing forms, which all go under the generic moniker “metadata”. Thus the confusion: “metadata” can refer to a bunch of things that are all called “metadata” and to a bunch more things that are not commonly called “metadata”, but which still are metadata.

Structured writing is all about adding metadata to content so that the content can be processed in various ways. This metadata can enable all sorts of things. It can guide authoring, as in the case of a schema. It can be used to audit written content to improve quality and consistency. It can be used to generate links. It can be used to select particular units of content for use in a particular situation. It can be used to create tables of contents and indexes. It can be used to optimize the content for access by search engines. It can be used to enable faceted search of a documentation set.

Each of these functions may require metadata in a different form, but often they actually run off the same metadata. At each stage of the process, it is important to make sure that the appropriate metadata is made accessible to the next function in the chain. In many cases, this does not involve creating new metadata, simply making existing metadata available in the form that is required. The reason that markup languages (whether based on XML, SGML, or something else) are so important to this process is that they provide a way to capture metadata that is both human writable and machine readable. The metadata encoding required for one process can easily be transformed into the encoding required for the next.
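Here is one way such a transformation might look, as a sketch only. It takes the inline markup from the earlier fragment and re-encodes the same metadata as a JSON record of the kind a search indexer might accept. The record layout (a text field plus one list per tag name) is an assumption of mine, not a format any particular tool requires.

```python
import json
import xml.etree.ElementTree as ET

TOPIC = (
    "<p>The <library-name>foobar</library-name> library contains the routines "
    "<routine-name>foo()</routine-name> and <routine-name>bar()</routine-name>.</p>"
)


def to_search_record(xml_text):
    """Re-encode inline markup as a flat JSON record (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    # Plain text with the tags stripped, for full-text search.
    record = {"text": "".join(root.itertext())}
    # One list per tag name, holding the strings that carried that tag.
    for el in root.iter():
        if el is not root:
            record.setdefault(el.tag, []).append(el.text)
    return json.dumps(record, sort_keys=True)


print(to_search_record(TOPIC))
```

No new metadata is created here. The author's original tagging is simply made available in the form the next process wants.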

Metadata is so ubiquitous in a documentation system that it is very easy to be talking about different types of metadata, serving different functions, and thus talking at cross purposes. In many cases, I think people fail to grasp both how ubiquitous metadata is, and how fluid it is. They recreate metadata in different forms because they don’t grasp that the metadata they want already exists, or because they are dealing with a system which was not designed to capture the metadata at source and let it flow through the whole documentation process. This is a pity, because it often creates large amounts of avoidable overhead in documentation systems, and these overheads can adversely affect both quality and productivity.

Metadata is everywhere. The trick is to recognize it, capture it early, and make it flow through the system so that downstream functions have the metadata they need to operate efficiently without additional overhead. Does the reader want to see all the topics that mention routine foo()? The metadata to enable that search was created by the author when they marked up <routine-name>foo()</routine-name> in the original topic markup. Did that metadata flow through the system so as to become available to a faceted search that the reader can use?
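A small sketch of that flow, under stated assumptions: three invented topics, each a single paragraph of markup, and a function that builds a lookup from routine name to the topics that mention it. The topic ids, the topic text other than the first fragment, and the function name are all hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Invented topics; only the first reuses the fragment from the post.
TOPICS = {
    "intro": (
        "<p>The <library-name>foobar</library-name> library contains the routines "
        "<routine-name>foo()</routine-name> and <routine-name>bar()</routine-name>.</p>"
    ),
    "tuning": "<p>Call <routine-name>foo()</routine-name> before tuning.</p>",
    "cleanup": "<p>Finish with <routine-name>bar()</routine-name>.</p>",
}


def build_facet_index(topics, facet_tag):
    """Map each value tagged with facet_tag to the ids of topics that contain it."""
    index = defaultdict(set)
    for topic_id, xml_text in topics.items():
        for el in ET.fromstring(xml_text).iter(facet_tag):
            index[el.text].add(topic_id)
    return index


routine_index = build_facet_index(TOPICS, "routine-name")
print(sorted(routine_index["foo()"]))
```

The index is built entirely from tags the author already applied. Nobody had to go back and label topics for the search.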

Does metadata flow like that in your system?

8 thoughts on “The Meaning of Metadata”

Tom Johnson 2011/05/20

Mark, thanks for writing this post. You take metadata to an entirely new level here and expanded my understanding. There’s so much to learn regarding this topic. I’m wondering if there’s a standard metadata approach for tech comm (is it simply DITA?). Also, do you have any recommended reading on metadata?


Mark Baker (post author) 2011/05/20

Hi Tom

There is a fundamental problem with the notion of standard metadata. Metadata is used for many different purposes, and you need different types of metadata, and different granularities of metadata, for different purposes. Metadata is expensive to collect, so people are not apt to collect more metadata than they need for their own immediate purposes.

Consider that there is not even an agreed metadata standard for collecting such a simple piece of data as an address. Some forms ask you to break your address down into multiple fields (but different fields in many cases). Others just give you a single box that says “address”. Why the difference? Some collectors of address information want to pre-sort bulk mail so they can get a better rate for their mailing. Some want to gather data on the geographic distribution of their customers. Others just want to print a few mailing labels.

Collecting and validating line-by-line address information is more expensive than just getting a single “address” field. So people collect addresses not in a standard way, but in a way that is optimal for the use they want to make of them.

The same is true of something as simple as names. Some organizations need to break names down into multiple fields: first, last, middle, salutation. For international applications, several other pieces of name metadata may need to be collected also. But some organizations just want a name to print on a badge.

Standards work well when everybody has the same needs. But if the standard is inadequate for one purpose and burdensome for another, it won’t get used, or it won’t get used properly. It certainly seemed to be the case with DocBook, for instance, that no one used the whole standard, which was far too large for writers to learn, and that many people added things to it to meet needs it did not cover. The result was that there was no guarantee that I could take your DocBook file and process it through my DocBook tool chain and get a usable result.

Suppose you were using metadata to define the facets of a faceted search. Naturally, you want metadata that expresses the facets of your content that people might want to search on. The metadata that enables faceted search of a used car site probably isn’t going to work too well for a shoe store or an electronics retailer.

It might seem like there is a reason for all the used car dealers and all the shoe stores and electronics retailers to get together in their respective industry associations and come up with standard used car, shoe, or electronics metadata.

But if content strategy is king, as we are now being told, using standard metadata has two big disadvantages. First, industry associations take so long to set standards that by the time they agree on used car metadata we may all be traveling by helium-powered bicycles. Second, your competitors have exactly the same metadata you do.

If my content strategy is my competitive advantage, I can’t afford to wait for an industry association to provide the metadata definition that is going to drive my content strategy, and the last thing I want to do is to share that strategy with my competitors. I want to be there first, and I want to be always one step ahead of them.

Getting your metadata strategy right is the key to productivity and quality in creating, managing, and delivering content. Adopting the same standards as your competitors is probably not the best way to achieve and maintain your competitive edge.

As for DITA, I don’t know that I would call it a metadata standard exactly. I think of it more as a neutering of XML. XML itself provides you with limitless capacity to define metadata schemas and to capture and encode metadata. It also demands that you then write code to process that metadata for all the purposes you want it for.

DITA, through its specialization mechanism, says, if you are willing to give up much of the flexibility of XML and just use specialization of our base topic types, we will reduce the amount of code you have to write (at least out of the box). It is, if you like, XML with training wheels. It limits your speed and maneuverability, but it keeps you from skinning your knees while you are a beginner.

The notion of content as data is new to most technical writers, and so they tend to think of metadata as something entirely separate from content. I have a blob of content and I attach a blob of metadata to it. But the real genius of XML is that it allows you to integrate data and metadata in a single object, and to apply metadata not just to the whole object, but to every level of it.


Marcia Johnston 2011/05/21

Great reminder, Mark, of the larger sense of the word “metadata” (headers and footers, for example). Thanks for the clear analysis.


Pingback: metadata | Early Novels Database

Paul K. Sholar 2011/06/16

XML enables serialized expression of inherently hierarchical datasets. There are lots of datasets (that is, those that are not inherently hierarchical) that aren’t good candidates for being expressed using XML. (An aside: Back in the days when SGML was the state of the art in markup languages, I remember reading about the controversies among SGML users in the humanities about the inadequacies of using SGML to mark up the content of a physical manuscript that has hand-written marginalia.)

Regarding the meaning of “metadata,” I fear you aren’t quite making the right point, especially when you immediately veer into discussing XML-style markup.

Yes, metadata is “data about data,” but when is it useful to examine *any* data without also having additional information that describes what that data is intended to describe or represent? For example, in any conventional two-dimensional table, the column headings (and/or row headings) are metadata because they describe how a human being should interpret a given list of data items.

My POV is that metadata is what provides the *context* for a set of data. That context can be simple, like a tag, word, or phrase placed next to other text, or it can be complex, such as a statement (formal or not) of all the relevant factors that were the case when a set of data (such as physical measurements) was produced.


Pingback: What is XML Really About? : The Dynamic Publisher

Pingback: Qu’est-ce réellement que le XML ? : The Dynamic Publisher

Pingback: Worum geht es bei XML wirklich? : The Dynamic Publisher
