Taxonomies and Starting With XML

This is an excerpt from a blog post I wrote last week on taxonomies and chunking.

Last October, the StartWithXML team wrote a post called “To Chunk or Not To Chunk,” where we discussed tagging and infrastructure issues, and a discussion ensued about what happens when you don’t know what you’ll be using chunks for. How do you tag those?

Later, in our StartWithXML One-Day Forum, we included a presentation on tagging and chunking best practices, where it was pointed out that no taxonomy for chunk-level content currently exists.

We have taxonomies for book-level content. These include formalized code sets such as the Library of Congress subject codes, the BISAC codes, and the Dewey Decimal System, among others. There are also informal code sets, like the tag sets on Shelfari or LibraryThing. There are proprietary taxonomies at Amazon and B&N.com that enable effective browsing.

But nothing like this exists for sub-book-level content. It’s never been traded before. We’ve never really needed a taxonomy for it before.

Other industries that traditionally distribute “chunks” have their own taxonomies that might prove useful in building a book-chunk schema. These include the IPTC news codes, which identify the content of a particular news story; that’s the closest analogy I can find for small gobbets of content that require organization.

Industries have proprietary taxonomies to identify certain concepts: culinary arts, music, agriculture, engineering, the sciences, literature and criticism, education, and so on. But these do not necessarily identify concepts within a book.

Some might argue that we don’t necessarily need taxonomies: why can’t we use natural-language search and the semantic Web to “bubble up” the “right” concepts? I’d argue that words don’t always mean what we think they mean. A classic example from my library days is the term “mercury.” That could mean the planet, the car or the element. Proponents of semantic search would say that the context in which “mercury” is mentioned should take care of defining that term. I’d say that’s true in about 50 percent of all cases, but context alone is not dependable enough to get that figure up to 75 to 100 percent.

My original post gets into more detail about why taxonomies are important search tools, and how the digitization of books requires a good taxonomy … and who should do it.

Comments: 11

  1. yagni. (look it up.)

    or, more accurately, you won’t know the chunks
    you need until you actually have a need for them.

    so you’ll need to prepare for all possible chunks,
    which is _impossibly_expensive._ and even then,
    you won’t have gotten the chunking _quite_ right,
    so you will need to go over the whole thing again,
    at even more expense, when the opportunity arises.

    _if_ the “opportunity” actually does arrive.

    and it probably won’t.

    it’s hard enough to sell your content the first time,
    and it’s only getting harder and harder and harder.

    yet you fall for someone who pitches you a line that
    you’re going to sell the same content over and over?

    yeah, right.

    get real.

    seriously, get real. you’re deluding yourselves, and
    it’s not helping you get on with your need to adapt.

    you need to deliver value. the best way to do that is
    to _lower_your_overhead_, and paying good money
    to have someone fool around with “chunk-tagging”
    — on the basis of unfounded _hope/hype_ that you
    will be able to chop up books and sell the pieces —
    is one of the _last_ things you should be doing now.

    but hey, i’m your “enemy”, the one who _wants_ you
    to fail, so don’t listen to me! listen to the consultants
    who are trying to “save” you (and soak you for fees)…
    do what _they_ say, yeah, that’s the ticket…

    (no offense is intended toward laura dawson, however;
    i listened to the audio of her recent o’reilly presentation,
    and found it to be a refreshingly honest take on things.)


    > mercury

    what an appropriate (if ironic) example…

    because, according to a simple “define: mercury”
    google search, we learn (from the first entry) that
    mercury is the roman god of commerce as well…
    (_plus_ the messenger of jupiter, a role in which
    you might remember him by his winged sandals.)
    and it can also be a reference to the temperature,
    as measured by a _mercury_ thermometer, as in
    “the mercury was falling rapidly”…

    plus, of course, merely calling mercury “an element”
    is far too simplistic, since it manifests itself as many
    different compounds. consult a chemist for details.

    as well, “mercury” is a newspaper (in many places),
    a place in nye county, nevada, northwest of las vegas,
    a genus of plants in family euphorbiaceae (spurge),
    a song by “bloc party” released on august 11, 2008,
    an album which was the debut of “madder mortem”,
    another album, this one from 2003 by “long-view”,
    a record label now part of the universal music group,
    a science-fiction novel by ben bova published in 2005,
    a “playstation portable” videogame, also from 2005,
    an australian functional logic programming language,
    a standards-compliant donationware e-mail system,
    a c.m.u. experimental digital library from 1987 to 1993,
    a 1950s-era cipher machine of the british air ministry,
    a ferranti commercial computer from the early 1950s,
    a century-class cruise ship owned by celebrity cruises,
    a series of 3 u.s. 1990s-era reconnaissance satellites,
    the name of the pioneering program to orbit astronauts
    (the original ones were even called “the mercury seven”),
    a november evening star every 6 years since 1943-44,
    a bi-monthly science magazine (dating back to 1972)
    issued by the astronomical society of the pacific, and
    (last but not least) a superhero from amalgam comics
    _and_ a mutant in the marvel universe, how about that?

    that’s how “mercury” manifests itself in the real world…

    so how should you “chunk” it, at what level of granularity?
    i sure couldn’t tell you. except to give you the advice that
    _no_one_ will be able to answer that question in advance…

    i leave you now with just two other “observations”, both
    also gleaned from the simple google search i did above:
    > Mercury is associated with communication skills,
    > intelligence and cunning. Mercury rules the signs
    > Virgo and Gemini whose native may be
    > clever, high strung and argumentative.

    make sure you aren’t being hoodwinked by consultants
    who are clever and cunning, with communication chops.

    that’s very important, because mercury is:
    > A heavy metal that can accumulate in the environment
    > and is highly toxic if breathed or swallowed.

    don’t swallow the hype. it’s not good for you…


  2. Hi, Bowerbird! As always, you make some great points (is Mercury in retrograde now? Because these last few days, both my communication chops and my devices have been faltering).

    There’s more at stake to tagging than chunking, as I’m sure you know. And chunking isn’t equally valuable in all markets – I’d argue that in educational publishing, it’s far more important than in trade publishing. That all said, effective sub-book-level tagging is crucial – for search (and if you see the rest of my post on my own blog, which I’ve linked to above, you can read about that) as well as for internal purposes (storage in a DAM, etc.).

    In essence, I’m just arguing that one’s house should be kept clean. Because, as you say, you don’t know what is going to happen. And better to be organized than not.

  3. don’t know if mercury is in retrograde, but
    i think your communication skills are fine…
    i just disagree with your focus in this case. :+)

    > There’s more at stake to tagging than chunking

    yes, and those issues get even thicker and stickier…

    > effective sub-book-level tagging is crucial –
    > for search (and if you see the rest of my post
    > on my own blog, which I’ve linked to above,
    > you can read about that)

    i did read your original post, in its entirety, and
    a few other posts on your blog while i was there,
    before i wrote my comment. nothing i read there
    made a difference in terms of what i was gonna say.
    what you’ve said has already been said many times,
    just like what i’ve said has been said many times…

    perhaps y’all have some secrets that _acknowledge_
    the thick and sticky aspects of the tagging arena,
    and move on from there. if so, my advice to you
    would be to springboard dialog from that point…

    you might think you need to start with “the basics”,
    because publishers don’t know anything about this.

    but i don’t think you’re gonna _convince_ anyone,
    especially not in these challenging economic times,
    to take on any _new_ schemes, especially ones that
    might be very costly, with little return on investment.

    that said, there might be publishers who’re _already_
    convinced of the benefits of tagging to their content,
    who are struggling with all the implementation issues,
    and i’d think they’d be good targets for your message.

    but new business, from publishers with no experience
    with tagging? that’s gonna be a very hard sell for you.

    as i said, i admire your honesty. you’re not one of the
    snake-oil salesmen so prevalent in the tagging world,
    who promise a lot of stuff that _cannot_ be delivered…

    when the limitations are brought to _your_ attention,
    you acknowledge ’em, without trying to minimize ’em.
    i _respect_ that — i really do — and if i was gonna hire
    someone to help me learn tagging, i would hire _you_.
    thing is, i wouldn’t hire anyone to help me do tagging.

    > as well as for internal purposes (storage in a DAM, etc.).

    a d.a.m. is more overkill, but we can discuss that later… :+)


  4. Shouldn’t we be talking with indexers there?

    As a matter of fact, serious, dedicated publishers used to hire skilled help to finalise indexes for works that deserved it.

    I do think, like Laura, that building taxonomies would be useful and a great help, without any illusion about universal extensive use: tag or index with a purpose; allow full text, semantic search for the rest. Still, I guess indexers would welcome a joint effort to build relevant taxonomies, with up to date digital tools, don’t you think?

    Some hints, ideas, and topics to begin with: from Dublin Core to ventures to describe points of interest (POIs).

  5. Alain, d’accord! I think indexers and cataloguers will serve a great purpose in the next decade or so. If publishers are thinking about taxonomies and tagging at all, it’s generally from an SEO perspective – and the marketing folks are just not in touch with the indexers (which is more of an editorial/production function). One thing I love about an XML workflow is that it brings together departments who traditionally haven’t worked together very much.

    Bowerbird, publishers do need to wade into this pool very gradually, you’re absolutely right about that. Nobody’s suggesting that people drop everything and tag (although that would be fun to watch). As I said in my presentation, folks have to do this carefully and consensually. I’d also love to see search get better and for in-depth tagging not to be necessary. That could save everyone a whole pile of trouble. But we’re not there (on both counts) yet.

    And as Mike Shatzkin says, a DAM could conceivably be a well-ordered set of folders on a server. So long as you can find your stuff….

  6. Laura, very good point about XML workflow, fully confirmed by quite a few different experiences I was involved in (or witness to) since … SGML times:

    forget the technicalities, if XML is (just a bit more than 😉 ) an occasion to bring back together departments and make a better job, it’s already a tremendous benefit, if not the core one.

    I recently really appreciated a comment from a graphic designer about an exploratory XML workflow (input from a word processor, into TEI, into InDesign). The comment went approximately like this: “Why the heck use XML? This only amounts to 1°) coordinating at the beginning of the project between author, editor, graphic designer [I’d personally add “local IT guru”] 2°) having a relevant and confident use of styling”.

    To be honest, she added “Your XML isn’t necessary: I tried and input the edited styled word processing file; wow!, I had always said authors/editors should be consistent”…

    Moral of the anecdote:
    1°) if XML triggers good practice, jump to it!
    2°) local gurus, keep focus on getting marginal benefits from technology, consistent with actual goals plus prepare future benefits one step ahead…

  7. alain said:
    > Shouldn’t we be talking with indexers there?

    it depends. how much do they charge per hour to “talk”? ;+)


  8. Hi, bowerbird

    I guess we aren’t talking/agreeing about the same “indexers”:

    I do not mean “web” indexers, but editors/”correctors”/indexers (sorry about — please forgive — my bad english — I’m not sure about the relevant denomination), the guys who help authors/editors to make a good job of building a relevant index, i.e. linking “character chains” from a publication into an index, i.e. a list of sorted entries, presumably relevant for the reader/searcher to find/sort out the places, in a text, where the author(s) phrased something relevant.

  9. alain-

    no, we’re talking about the same people.

    and they don’t just “help” authors/editors to
    build an index, they usually build it themselves.

    i was making a humorous reference to the fact
    that indexers are highly-skilled professionals,
    and that means that they are paid very well…
    (the best can command extremely high rates.)

    now, if you hired indexers to do your tagging,
    you would undoubtedly get excellent output…
    you would also pay a very high fee to get it…
    and you’d pay the fee for each and every book.

    if i understand your point correctly, however,
    you’d instead hire these indexers to help you
    “form a taxonomy” for your tagging workflow,
    and not have those indexers do the tagging…

    in other words, as the indexers would put it,
    you want to hire them to have them show you
    how to benefit from their expertise, so you can
    _replace_them_ with some lower-paid workers…

    i’m not sure you’re gonna get much uptake there.

    even if you found some indexers to agree to that,
    one of the “truths” with which indexers are molded
    is that every book is as individual as a fingerprint,
    and thus deserves its own unique index. so you
    — in trying to develop your “general taxonomy” —
    will find yourself in firm opposition to that “truth”.

    any self-respecting indexer will say “can’t be done”.
    they _have_to_ tell you that. it’s part of the credo…
    (and they’d probably focus their efforts on examples
    that showed you exactly _why_ you couldn’t do it,
    as those are the very things they’re trained to spot.)
    for the same reason, any self-respecting indexer
    will also tell you that “computers can’t do indexing”.

    now, me? i think that’s mostly a bunch of hooey…
    i mean, it’s _kinda_ true. but largely exaggerated.

    i think if you want a _perfect_ index, then sure,
    you need to hire a human indexer. no doubt…
    and be prepared to _pay_; perfection ain’t cheap.

    but if you just want a “good enough” index, then
    you can develop a good-enough taxonomy, and
    a computer program will do a good-enough job.

    if you spread the cost of taxonomy development
    and computer programming over lots of books,
    then i think you can get a “good enough” index
    for each of those books for… let’s say… $5/book.

    if you hire a human indexer, to do a job that is
    somewhere between “pretty good” and “perfect”,
    you’re looking at a cost of maybe $500/book…

    my estimate could well be _half_ of today’s cost,
    since it’s been a while since i priced an indexer.

    but you get the general idea. “good enough” is
    a _lot_ cheaper than “created by a pro indexer”.

    now, for some books, an index might well be
    _worth_ the $500. it might be worth $5,000!
    (an excellent index can create a happy reader,
    and happy readers make good word-of-mouth,
    and word-of-mouth often sells a ton of books.
    so an excellent index can pay for itself quickly.)

    for other books, though, a “pro index” won’t be
    worth quite so much, and maybe very little at all.

    the whole point is, indexers cost a lot of money.
    even just _talking_ to an indexer can cost a lot…

    and if you’re having them slice their own throat,
    professionally, by helping you make a “taxonomy”,
    you can expect ’em to charge you an arm and a leg.
    (and they’ll still walk away telling you it can’t be done.)

    that might be fine if publishers were in good shape,
    financially, and the economy was looking all rosy…

    but at this time, when many publishers are collapsing,
    and the rest are trying to shore up their foundations,
    and the economy isn’t looking all that peachy, well…
    i just don’t think you’re gonna get much uptake…

    but, you know, go ahead and prove me wrong! :+)
    believe in your dreams! don’t let me ruin them!


  10. I only see two good current candidates for sub-book level content indexing:
    * book structure as created by the author: chapters, in some cases more fine-grained units (articles for dictionaries; chapter and verse, or stanza for some kinds of literature)
    * words (finessing for now the difference between white-space delimited tokens and linguistic units)

    Words have the ambiguity problem, of course, as noted in other comments about ‘mercury’. To this end, you should check out WordNet, which provides a hierarchically-arranged dictionary of senses (in the case of mercury, there are four). Though it’s not obvious from the web interface, each sense has a unique ID, so you really can tag words with their senses. Of course, this is a tremendous amount of effort, and automated sense tagging is still a research project in the Natural Language Processing community (though some vendors out there may say otherwise).
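
To make the idea of tagging a word with a sense ID concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sense IDs and glosses are invented (they are not WordNet's real identifiers or definitions), and `tag_sense` is a hypothetical helper. It picks the sense whose gloss shares the most words with the surrounding text, a simplified overlap heuristic in the spirit of the Lesk approach, which is one possible technique rather than a solved method.

```python
import re

# Hypothetical sense inventory: IDs and glosses are made up for this sketch,
# not taken from WordNet.
SENSES = {
    "mercury#planet": "smallest planet orbit sun solar system",
    "mercury#element": "heavy silvery toxic metal liquid element thermometer",
    "mercury#god": "roman god commerce messenger jupiter winged sandals",
    "mercury#temperature": "temperature measured thermometer rising falling degrees",
}


def tokenize(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def tag_sense(context, senses=SENSES):
    """Return (sense_id, overlap) for the gloss sharing the most words with context.

    Returns (None, 0) when no gloss shares any word, so the caller can
    route the passage to a human instead of guessing.
    """
    words = tokenize(context)
    best_id, best_overlap = None, 0
    for sense_id, gloss in senses.items():
        overlap = len(words & tokenize(gloss))
        if overlap > best_overlap:
            best_id, best_overlap = sense_id, overlap
    return best_id, best_overlap


print(tag_sense("Mercury is the planet closest to the sun."))
print(tag_sense("Mercury is a toxic metal that is liquid at room temperature."))
print(tag_sense("She drove her Mercury to work."))
```

The last call returns no sense at all, because the car is missing from this toy inventory. That is the practical limit of the approach: a tagger can only choose among the senses someone has already catalogued.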

  11. sean said:
    > I only see two good current candidates

    i don’t think the situation is nearly that bleak.

    there is a _lot_ that can be done to massage the
    content of the book into index-like structures…
    (and i should have mentioned this route earlier.)

    look at some of the measures amazon provides…

    for instance, consider the book “cradle to cradle”,
    by william mcdonough, which advocates a “lifecycle”
    approach of sustainability concerning what we create.

    > http://www.amazon.com/gp/product/0865475873/ref=cm_rdp_product

    on top of what we’d expect — concordance, readability
    — amazon generates some interesting index measures:
    statistically improbable phrases and capitalized phrases.

    for mcdonough’s book, here they are:

    > Statistically Improbable Phrases (SIPs):
    > technical metabolism, technical nutrients, natural 
    > energy flows, biological nutrients, monstrous hybrid

    > Capitalized Phrases (CAPs):
    > Industrial Revolution, United States,
    > Henry Ford, River Rouge, Herman Miller


    read the descriptions:
    > Capitalized Phrases, or “CAPs”, are people, places, events,
    > or important topics mentioned frequently in a book.
    > Along with our Statistically Improbable Phrases,
    > Capitalized Phrases give you a quick glimpse into
    > a book’s contents. Click on a Capitalized Phrase
    > to view a list of books in which the phrase occurs.
    > You can also view a list of references to the
    > Capitalized Phrase in each book. For example,
    > if you’re looking at a Sherlock Holmes mystery,
    > you can click on “Professor Moriarty” to see a list
    > of books that feature or mention Holmes’s nemesis.
    > You can then browse a few pages from the books or
    > click on the A9.com search link to read more about him.
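
As a rough illustration of the capitalized-phrase idea described above, here is a short Python sketch. It is not Amazon's method, which is not public in the text quoted here. The regular expression, the `capitalized_phrases` helper, the `min_count` threshold, and the sample text are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Two or more consecutive capitalized words, e.g. "Henry Ford".
# Known weakness: a capitalized sentence opener followed by a name
# (e.g. "The River Rouge") is swept in as part of the phrase.
CAP_PHRASE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def capitalized_phrases(text, min_count=2):
    """Return [(phrase, count)] for capitalized phrases seen at least min_count times."""
    counts = Counter(" ".join(m.split()) for m in CAP_PHRASE.findall(text))
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_count]


sample = (
    "Henry Ford built the River Rouge plant. Workers at River Rouge built cars. "
    "Later, Henry Ford expanded it. Herman Miller came later."
)
print(capitalized_phrases(sample))
```

In this sample only "Henry Ford" and "River Rouge" clear the threshold; "Herman Miller" appears once and is dropped. A real pass over a whole book would want a smarter treatment of sentence openers and single-word names.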

    > Amazon.com Statistically Improbable Phrases
    > Amazon.com’s Statistically Improbable Phrases, or “SIPs”,
    > are the most distinctive phrases in the text of books
    > in the Search Inside!™ program. To identify SIPs,
    > our computers scan the text of all books in the Search Inside!
    > program. If they find a phrase that occurs a large number
    > of times in a particular book relative to all Search Inside!
    > books, that phrase is a SIP in that book. SIPs are not
    > necessarily improbable within a particular book, but
    > they are improbable relative to all books in Search Inside!.
    > For example, most SIPs for a book on taxes are tax related.
    > But because we display SIPs in order of their improbability,
    > the first SIPs will be on tax topics that this book mentions
    > more often than other tax books. For works of fiction,
    > SIPs tend to be distinctive word combinations that often
    > hint at important plot elements. Click on a SIP to view
    > a list of books in which the phrase occurs. You can also view
    > a list of references to the phrase in each book. Learn more
    > about the phrase by clicking on the A9.com search link.
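
The description above does not give a formula, so the following Python sketch is only one plausible reading of it. It scores each two-word phrase by its share of the book's phrases divided by its share of a comparison corpus, then sorts by that ratio. The `improbable_phrases` helper, the choice of two-word phrases, the add-one smoothing, and the tiny sample texts are all assumptions for illustration.

```python
import re
from collections import Counter


def bigrams(text):
    """Return the list of adjacent lowercase word pairs in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(book, corpus, min_count=2, top=5):
    """Rank a book's bigrams by how over-represented they are versus a corpus.

    score = (share of the book's bigrams) / (smoothed share of the corpus's bigrams)
    Add-one smoothing keeps phrases absent from the corpus from dividing by zero.
    """
    book_counts = Counter(bigrams(book))
    corpus_counts = Counter(bg for text in corpus for bg in bigrams(text))
    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())
    scored = []
    for bg, n in book_counts.items():
        if n < min_count:
            continue
        book_share = n / book_total
        corpus_share = (corpus_counts[bg] + 1) / (corpus_total + 1)
        scored.append((" ".join(bg), book_share / corpus_share))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


book = (
    "technical nutrients feed the technical metabolism in the united states. "
    "across the united states the technical metabolism reuses technical nutrients."
)
corpus = [
    "in the united states the economy grew.",
    "the president of the united states spoke in the capital.",
]
for phrase, score in improbable_phrases(book, corpus):
    print(f"{phrase}: {score:.2f}")
```

Here "united states" occurs as often in the book as "technical metabolism", but it scores lower because it is also common in the comparison texts. That is the sense in which a phrase can be frequent in one book yet improbable relative to all the others.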


    amazon only lists the _outliers_ obtained on these measures,
    but if you collect the entire list of them for any specific book,
    you will find that you’ve generated the first pass at an index…
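
A minimal Python sketch of that first pass, assuming the phrases have already been collected and the book is available as a list of page texts. The `first_pass_index` helper and the sample pages are hypothetical, made up for this example.

```python
from collections import defaultdict


def first_pass_index(pages, phrases):
    """Map each phrase to the sorted 1-based page numbers where it appears.

    Matching is case-insensitive substring search, so it will miss
    inflected forms and synonyms that a human indexer would catch.
    """
    index = defaultdict(list)
    for number, page in enumerate(pages, start=1):
        lowered = page.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                index[phrase].append(number)
    # Alphabetize entries the way a back-of-book index would.
    return dict(sorted(index.items(), key=lambda kv: kv[0].lower()))


pages = [
    "Henry Ford opened the River Rouge plant.",
    "Technical nutrients circulate in a technical metabolism.",
    "River Rouge was later redesigned.",
]
phrases = ["River Rouge", "technical metabolism", "Henry Ford"]
for entry, page_numbers in first_pass_index(pages, phrases).items():
    print(f"{entry}: {', '.join(map(str, page_numbers))}")
```

For the three sample pages this lists Henry Ford on page 1, River Rouge on pages 1 and 3, and technical metabolism on page 2. It has no cross-references and no judgment about which mentions matter, which is the gap between this and a human-built index.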

    and if you then go on and have a human create a “real” index,
    you’ll find that that first pass was actually great, _considering_
    it was created quickly, easily, and essentially _free_of_charge_.
    (contrast it with the human index, which was very expensive.)

    moreover, there are other _free_, easy-to-implement analyses
    that can help you improve that “first draft” index considerably,
    all of which emerge gracefully and unavoidably out of the text.

    any publisher who is contemplating an all-out tagging effort
    but has _not_ done research on these “automatic” processes
    _first_ to gauge cost/benefit is not spending resources wisely.