Standardizing Tags in the Metadata Minefield

StartWithXML: Why and How

One issue we haven't discussed much is that of metadata. XML documents are by definition rife with metadata. At what point does metadata cross the line from useful to pollution?

When it's not standardized.

The kind of XML tagging we're primarily talking about can be sectioned into three buckets: rights data ("this picture is good for print products but not electronic ones," "we can use this graphic anywhere," "these animations are exclusively for the workbook"), formatting data ("this is a chapter," "this is a footnote"), and context data ("Paris," "1955," "General Robert E. Lee," "noodles").

This is a perfect recipe for complete chaos. Obviously standards are crucial to the success of using XML in publishing. Even standards within a department -- using tags the same way from one project to the next, from one PERSON to the next -- are crucial.

There's been some talk about the role of the Book Industry Study Group in developing tagging standards, in the same way they've developed BISAC code standards. And this makes a great deal of sense. The rights and formatting tag standards should be relatively easy to establish -- publishing houses, no matter whether big or small, tend to use this data fairly consistently. It's the context tags that pose the more serious challenges.

The Library of Congress has done this sort of thing with its subject headings. But, like the BISAC codes, these refer to the subject of an entire book. Many books, however, cover more than one topic, and so do many chapters. That level of granularity has never been taxonomized before.

Still, it's important to do so in a standardized way, to avoid a cacophony that drowns out meaning. (Is it "pasta" or "noodles"? When you say "diamond," are you talking about baseball or gemstones or Neil? Why is a chapter published by Mosby about dentistry coming up in search results with the chapters on collecting Limoges china published by Antique Trader? Hint: "porcelain.")
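To make the "pasta" versus "noodles" problem concrete, here is a minimal sketch of a controlled vocabulary in Python. The mapping, the term names, and the function names are all hypothetical and are not drawn from BISAC, the Library of Congress, or any other real standard. The idea is only that every free-form tag either resolves to one agreed-upon term or gets flagged for review.

```python
# Hypothetical controlled vocabulary: each accepted variant maps to one
# preferred term. These entries are illustrative, not a real standard.
PREFERRED_TERMS = {
    "noodles": "pasta",
    "pasta": "pasta",
    "porcelain (dental)": "dental porcelain",
    "porcelain (china)": "porcelain ceramics",
}


def normalize_tag(raw_tag):
    """Return the preferred term for a tag, or None if it is not in the vocabulary."""
    # Lowercase and collapse whitespace so "  Noodles " and "noodles" match.
    key = " ".join(raw_tag.lower().split())
    return PREFERRED_TERMS.get(key)


def normalize_tags(raw_tags):
    """Return (sorted preferred terms, tags that need human review)."""
    accepted = set()
    rejected = []
    for tag in raw_tags:
        term = normalize_tag(tag)
        if term is None:
            rejected.append(tag)
        else:
            accepted.add(term)
    return sorted(accepted), rejected


terms, needs_review = normalize_tags(["Noodles", "pasta", "diamond"])
```

In this sketch "Noodles" and "pasta" collapse into a single term, while the ambiguous "diamond" is rejected rather than guessed at, which is the point where an editor has to decide between baseball, gemstones, and Neil.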

If you've ever seen a tag cloud on a website, you'll know what I mean. You never know what you're going to get when you click on it. Standardizing context tags is probably the most thankless, boring job publishers will ever engage in. But it's also the one that's going to ensure that books are actually discoverable the way they're meant to be discovered.

4 Comments


>formatting data ("this is a chapter," "this is a footnote")

While this information is often used to drive formatting, it's not formatting metadata. It's structural metadata. Formatting metadata says things like "output this in italics" or "indent this half an inch."

This is a very important distinction, because structural metadata is what XML is all about. Structural metadata is what makes addressing of document components possible, and therefore what makes component re-use possible.
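As a rough illustration of what "addressing of document components" can look like, here is a small Python sketch using the standard library's xml.etree.ElementTree. The element names, the id values, and the sample text are invented for the example and do not follow any particular publishing schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names and id values are made up for illustration.
BOOK_XML = """
<book>
  <chapter id="ch1">
    <title>Pasta Basics</title>
    <para>Boil water first.<footnote id="fn1">Salt it generously.</footnote></para>
  </chapter>
  <chapter id="ch2">
    <title>Sauces</title>
    <para>Start with good tomatoes.</para>
  </chapter>
</book>
"""


def chapter_titles(root):
    """List the title of every chapter that is a direct child of the root."""
    return [chapter.findtext("title") for chapter in root.findall("chapter")]


def find_component(root, component_id):
    """Return the first descendant element with the given id, or None."""
    return root.find(f".//*[@id='{component_id}']")


root = ET.fromstring(BOOK_XML)
titles = chapter_titles(root)
footnote = find_component(root, "fn1")
```

Nothing in the markup says how a chapter or footnote should look. Because the structure is explicit, though, a single chapter or footnote can be pulled out by name or id and reused elsewhere.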

For an excellent classification of metadata, see slide 3 of Eric Childress's OCLC presentation "Metadata Standards" at http://www.oclc.org/research/presentations/childress/fedlink_20031118.ppt.

Also, when people talk about "tagging" in the Web 2.0 sense, they're talking about assigning key terms as metadata. When people talk about tags in XML, they're talking about adding delimiters to elements. If you're talking about both, you should try to be more careful in how you use the term to make it clearer which you mean.

>When you say "diamond," are you talking about baseball or gemstones or Neil?

And when I say "title," am I talking about the deed to a piece of property, a job title, or the title of a work? When I say http://purl.org/dc/elements/1.1/title, I'm clearly talking about the title of a work. This is why RDF uses URLs.
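XML namespaces apply the same idea as RDF's URLs, and a short Python sketch can show the effect. The Dublin Core namespace below is the one quoted above; the "deed" namespace URI, the record layout, and the sample values are hypothetical and exist only to show two different kinds of "title" side by side.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set referred to above.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record: the "deed" namespace URI and both values are invented.
RECORD_XML = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:deed="http://example.org/property-deed/">
  <dc:title>An Example Book</dc:title>
  <deed:title>Lot 7, Block 3</deed:title>
</record>
"""

root = ET.fromstring(RECORD_XML)

# ElementTree spells a qualified name as "{namespace-uri}local-name".
work_title = root.findtext(f"{{{DC_NS}}}title")
qualified_names = [child.tag for child in root]
```

Both elements are called "title" locally, but their fully qualified names differ, so a program asking for the Dublin Core title never picks up the property deed by mistake.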

"this is a footnote"

should be considered (nearly?) formatting metadata, given the usual choice between "foot" and "end" notes.

It can be tricky to set a boundary between structure and format when technological artifacts are widely used for various purposes and there is not — or not yet — a consensus about assigning them semantic and pragmatic features:

For instance, pop-ups are obviously used to display notes, glosses, translations, and so on, and I guess most of us would agree that tagging an element as a pop-up should be considered a formatting indication.

Would we have the same consensus if the same element were tagged as a different page or window, with more options (moving, pinning, resizing, ...)?

PS:
I couldn't post my original comment, rejected as wrong text because instead of 'nearly' I had written 'on the v.e.r.g.e of' — without periods.

Silly how 'strong language' filters spoil expression...

@Alain: Sorry about the "wrong text" glitch. The ongoing battle against spam really puts a crimp on conversation.
