This is an excerpt from a blog post I wrote last week on taxonomies and chunking.
Last October, the StartWithXML team wrote a post called “To Chunk or Not To Chunk,” which covered tagging and infrastructure issues. A discussion ensued about what happens when you don’t know what you’ll be using chunks for. How do you tag those?
We have taxonomies for book-level content. These include formalized code sets such as the Library of Congress subject codes, the BISAC codes, and the Dewey Decimal System, among others. There are also informal code sets, like the tag sets on Shelfari or LibraryThing. There are proprietary taxonomies at Amazon and B&N.com that enable effective browsing.
But nothing like this exists for sub-book-level content. It’s never been traded before. We’ve never really needed a taxonomy for it before.
Other industries that traditionally distribute “chunks” have their own taxonomies that might prove useful in building a book-chunk schema. These include the IPTC news codes, which identify the content of a particular news story; that’s the closest analogy I can find for small gobbets of content that require organization.
Industries have proprietary taxonomies to identify certain concepts — culinary arts, music, agriculture, engineering, the sciences, literature and criticism, education, and on and on and on.
But these do not necessarily identify concepts within a book.
Some might argue that we don’t necessarily need taxonomies: why can’t we use natural-language search and the semantic Web to “bubble up” the “right” concepts? I’d argue that words don’t always mean what we think they mean. A classic example from my library days is the term “mercury.” That could mean the planet, the car or the element. Proponents of semantic search would say that the context in which “mercury” is mentioned should take care of defining that term. I’d say that works in about 50 percent of all cases, but it is not reliable enough to work in 75 to 100 percent of them.
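
To make the “mercury” problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Chunk` structure, the three sample chunks, and the subject codes such as `chemistry/elements` are invented for illustration and do not come from any real taxonomy. It only shows how a plain keyword match returns every sense of the word, while a subject code attached to each chunk narrows the result to one.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A sub-book-level piece of content with optional taxonomy tags."""
    chunk_id: str
    text: str
    # Hypothetical subject codes; not drawn from BISAC, IPTC, or any real scheme.
    subject_codes: set = field(default_factory=set)


# Three invented chunks, one for each sense of "mercury".
CHUNKS = [
    Chunk("c1", "Mercury completes an orbit of the Sun every 88 days.",
          {"astronomy/planets"}),
    Chunk("c2", "The Mercury sat in the driveway all winter.",
          {"transport/automobiles"}),
    Chunk("c3", "Mercury is a liquid at room temperature.",
          {"chemistry/elements"}),
]


def keyword_search(chunks, term):
    """Return the ids of chunks whose text contains the term (case-insensitive)."""
    term = term.lower()
    return [c.chunk_id for c in chunks if term in c.text.lower()]


def subject_search(chunks, term, code):
    """Like keyword_search, but also require a matching subject code."""
    term = term.lower()
    return [
        c.chunk_id
        for c in chunks
        if term in c.text.lower() and code in c.subject_codes
    ]


if __name__ == "__main__":
    print(keyword_search(CHUNKS, "mercury"))
    print(subject_search(CHUNKS, "mercury", "chemistry/elements"))
```

The keyword search returns all three chunks, because nothing in the query says which mercury is meant. The subject-coded search returns only the chemistry chunk. The hard part, of course, is not the lookup but agreeing on the codes in the first place.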