Metadata, Not E-Books, Can Save Publishing…

Metadata is king. I will repeat this as it is important. Metadata is king.

I need not go through the barrage of articles and statistics showing that publishing is in a minor state of panic. Revenues are down, and until recently (the past 1-2 years) many publishers were unsure how they should approach e-books (many are still not completely settled in their e-strategy). The reason that e-books will not save publishing is that they are just another format. E-books will not revolutionize reading, nor will they change the content. I’ve seen some social reading projects (Copia), but they are in beta and I cannot predict whether readers are willing to accept a completely new reading experience.

Some statistics:
* There are roughly 230MM adults in the US.
* US Literacy rate is 99% – leaving 227.7MM adults in the US who can read.
* 28% of US adults are avid (5+ hours/week) readers [Verso] – roughly 64MM avid readers.
* 20% of book purchasing happened online [PW 2007]. This has grown, but is still under 30% (I have heard this via word-of-mouth from some book stats people but cannot quote them).

Why won’t e-books save publishing?
E-books represent a format, just like hardcovers and paperbacks. Because they are a different format, they require different pricing. Things that are consumed and priced differently do open themselves up to a new market, but unless the new consumption method is revolutionary, the growth (new readers) in the market cannot be large. E-readers will never be purchased by non-readers in the hopes of becoming readers (until they reach an extremely cheap price point). The iPad is one device that can create new readers. It’s conceivable that someone who buys an iPad and is not a book buyer will buy a book simply because they can do so while sitting in their La-Z-Boy. If they like that book, they may even buy another. Ok. Now re-read that last statement. “If they like that book, they may even buy another.” If they don’t like the book, their sentiment of “this is why I don’t buy books” will be solidified. Another non-book-buyer remains a non-book-buyer.

According to the Wikipedia bestselling books chart, Dan Brown’s The Da Vinci Code sold over 80 million copies. Harry Potter and the Deathly Hallows sold at least 44 million copies. Does that mean that nearly every avid reader bought the last Harry Potter book? Does that mean that every avid reader bought 1.25 copies of The Da Vinci Code? No. It means that people who normally don’t read books opened their wallets to buy and read a book. That means that the roughly 163 million Americans who can read but are not avid readers are potential readers.
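
To make the arithmetic behind these numbers explicit, here is a minimal sketch in Python. It uses only the figures quoted in this post and takes them at face value; the variable names are illustrative and not from any source.

```python
# Figures as quoted in the post; names are illustrative.
US_ADULTS_MM = 230.0         # US adults, in millions
LITERACY_RATE = 0.99         # US literacy rate
AVID_SHARE = 0.28            # share of adults reading 5+ hours/week [Verso]
DA_VINCI_COPIES_MM = 80.0    # copies sold, per the Wikipedia chart

literate_adults = US_ADULTS_MM * LITERACY_RATE             # about 227.7
avid_readers = literate_adults * AVID_SHARE                # about 63.8
non_avid_readers = literate_adults - avid_readers          # about 163.9
copies_per_avid_reader = DA_VINCI_COPIES_MM / avid_readers

print(f"Adults who can read: {literate_adults:.1f}MM")
print(f"Avid readers:        {avid_readers:.1f}MM")
print(f"Non-avid readers:    {non_avid_readers:.1f}MM")
print(f"Da Vinci Code copies per avid reader: {copies_per_avid_reader:.2f}")
```

The post rounds the avid-reader count to 64MM before subtracting, which is where the 163 million and the 1.25 copies per avid reader come from.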

How to capture some of those 163 million and get avid readers to buy more.
Simple: Give them what they want and more of it. How do we do this? Metadata. It’s that simple. Tech people love metadata. We eat it up and beg for more and build amazing utilities around it. In fact, Pandora is an amazing example of what metadata can do for music. But, a limiting factor of Pandora is their selection and their metadata gathering techniques (they have to do it manually). How does metadata sell? Let me start with an anecdote.

The book Paradox of Choice talks about how people tend to shut down when shown too many options. If you’re a seasoned book buyer, when you walk into a bookstore (or are browsing Amazon or another online retailer) you know exactly where to go for bargain-bin books, where your favorite genre is located, and where the new releases section is. If you’re new to reading, a bookstore is extremely intimidating. Don’t believe me? Go wander into an electronics expo, the car audio section of a Best Buy, or some sporting goods store (assuming you’re not a tech-geek, car tinkerer, or sportsperson). You’ll soon see that there are 14 different types of cables or gloves, all at different prices. How do you make your decision? Thankfully, in those stores there are salespeople who are trained to spot overwhelmed shoppers and offer their help. Brick-and-mortar bookstores offer an information booth at best. Online you’re left to your own devices…

Imagine you just finished reading a book. We’ll take The Da Vinci Code. After putting it down, you filled out a short survey which asked you what qualities you liked about the book (let’s call them tags). For me, I liked that it was a suspense novel, that it was a religious mystery, and that it took place in the present day. Now, assume that this tag data was available for all books and that I could walk into a bookstore, hand them my little survey, and they could show me 6 books. That would be much easier to choose from. In fact, if you showed me only 3 books, I may even buy all three. In the current environment, the best I could do is buy more of Dan Brown’s books (author is the #1 reason why people buy books) and hope that he’s written more than 1 book, or attempt to use the recommendation engines provided by online retailers. Recommendation engines are OK, but they are based on purchasing habits or, in rare cases, “those who liked also liked,” which is fairly arbitrary and not nearly as good a predictor as metadata.

That is giving a user more of what they want. They read a book, extract from it what they liked and you give them books with similar qualities. Next is giving them what they want in the first place.
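
As a rough illustration of the survey idea, here is a minimal Python sketch that ranks books by how many of a reader's liked tags they share. The catalog, the titles, and the `recommend` function are all hypothetical, invented for this example; a real store would need tag data for every book, which does not exist today.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical catalog: titles and tags are made up for illustration.
CATALOG: Dict[str, Set[str]] = {
    "Book A": {"suspense", "religious mystery", "present day"},
    "Book B": {"suspense", "present day", "legal thriller"},
    "Book C": {"romance", "paranormal", "present day"},
    "Book D": {"history", "1870s"},
}


def recommend(liked_tags: Set[str],
              catalog: Dict[str, Set[str]],
              limit: int = 3) -> List[Tuple[str, int]]:
    """Return up to `limit` (title, shared-tag count) pairs, best match first."""
    scored = [(title, len(liked_tags & tags)) for title, tags in catalog.items()]
    # Drop books that share no tags, then sort by score (ties broken by title).
    matches = [(title, score) for title, score in scored if score > 0]
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return matches[:limit]


# The tags from the survey described above.
survey = {"suspense", "religious mystery", "present day"}
picks = recommend(survey, CATALOG)
print(picks)
```

Counting shared tags is the simplest possible scoring rule. It treats every tag as equally important, which is an assumption of this sketch and not a claim about how a production system should weigh them.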

Giving a user what they want.
The best metadata we have en masse is category data. This data isn’t exactly easy to wade through, but if you like romance, you can click on the “romance” category and see a list of books that are considered romance. For new readers, the number of books within the romance category is daunting; plus, what is “paranormal” and how do I know if I’ll like it? Categories are also boxes that have connotations. Yes, books can live in multiple genres, but can a book have a vampire in it and not be a book about vampires? Can a book be a love story with an intimate scene without being romance? Tags help narrow down specific traits of a book. Some tags are already gathered, but below is a list of tags I feel are important to gather:

  • Page numbers (or word count). It’s important for readers to know if this is a short read or a long one.
  • Time Period. (1990s, 1870s, future). Some people love historical fiction. Some people hate a specific time period.
  • Categories. A category is a specific tag, but categories don’t live in hierarchies. A book can have the category tags “romance” and “vampires” independently. For non-fiction, themes make great categories, for example: “war”, “history”, “1900s”, “war of 1912”.
  • Writing Style. Is this a 3-act play? Is this done in 3rd person or 1st person? Is this dialog heavy?
  • Series. Is this part of a series? What number in the series? Is this an ordered series or just a collection of fiction built around a specific world?

I could go on and on with more data, but these are what I believe to be the core. If every book had this data, you could essentially have an eHarmony for books. You fill out a small profile of your likes and dislikes and are then shown a much smaller set of books to choose from. The best part of these selections is that there is a very good chance you’ll like them. If you like a book, you’re more likely to buy more books, turning new readers into avid readers and avid readers into, well, hyper-avid readers.
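
One way to picture the "eHarmony for books" idea is a book record carrying the tags listed above, plus a reader profile that filters a catalog down to a shortlist. The following Python sketch is only an illustration under stated assumptions: the `Book` and `ReaderProfile` classes, their field names, and the sample titles are all hypothetical and do not correspond to any existing standard or product.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Book:
    """Hypothetical record holding the core tags proposed in this post."""
    title: str
    page_count: int
    time_period: str                 # e.g. "1990s", "1870s", "future"
    categories: Set[str]             # flat tags, no hierarchy
    narration: str                   # writing style, e.g. "1st person"
    series: Optional[str] = None
    series_number: Optional[int] = None


@dataclass
class ReaderProfile:
    """Hypothetical likes/dislikes profile filled out by a reader."""
    liked_categories: Set[str]
    disliked_categories: Set[str] = field(default_factory=set)
    disliked_periods: Set[str] = field(default_factory=set)
    max_pages: Optional[int] = None


def shortlist(profile: ReaderProfile, books: List[Book]) -> List[Book]:
    """Keep books that match at least one like and violate no dislike."""
    result = []
    for book in books:
        if profile.max_pages is not None and book.page_count > profile.max_pages:
            continue  # too long for this reader
        if book.time_period in profile.disliked_periods:
            continue  # set in a period the reader dislikes
        if book.categories & profile.disliked_categories:
            continue  # carries a category the reader dislikes
        if not (book.categories & profile.liked_categories):
            continue  # shares nothing with the reader's likes
        result.append(book)
    return result


books = [
    Book("Hypothetical Thriller", 320, "present day",
         {"suspense", "mystery"}, "3rd person"),
    Book("Hypothetical Vampire Romance", 410, "present day",
         {"romance", "vampires"}, "1st person",
         series="Hypothetical Series", series_number=1),
    Book("Hypothetical Epic", 900, "1870s",
         {"suspense", "history"}, "3rd person"),
]
profile = ReaderProfile(liked_categories={"suspense", "romance"},
                        disliked_categories={"vampires"},
                        max_pages=500)
picks = shortlist(profile, books)
print([book.title for book in picks])
```

Because categories are stored as a flat set, a book can be tagged “romance” and “vampires” independently, and a reader who likes romance but dislikes vampires is still served correctly.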

To bring it all together, if you want to grow the market, you must do things better and in a new way. E-books aren’t a new way to sell, just a new format to sell the same old books you’ve been selling for years. Make readers happy by providing them with the books they want.

About Nick Ruffilo: In 1998, Nick Ruffilo helped to found CheatWorld.com, an online video game cheats website that utilized early forms of metadata for recommendations. Afterward, he helped to define metadata standards for the financial industry from 2001-2008.

In 2008, he joined BookSwim and has worked to aggregate multiple sources of data as well as gather internal data to redefine the internal recommendation engine of BookSwim.

Comments: 16

  1. The problem with relying on metadata for capturing “qualities” data is that the quality of the metadata itself is often suspect. Big issue: Within whose conceptual or aesthetic framework are the tags created and applied? That is, the usefulness of the tag framework will become as much the issue as the content itself. How to avoid arbitrariness in a tag framework? Which tag/semantic hierarchy wins? Who/what can help us out here?

  2. Paul,

    That is an extremely valid point. Once you can get publishers/authors on board with the concept of metadata, then you propose standards. In many cases, you can have very concrete tag sets. If I were on the panel creating these standards, I would propose 3 levels of tagging. The first level would be required and mainly made up of data that already exists (Author Name, Title, Subtitle, High Level Categories); a 2nd tier, which I would consider required as well, would encompass some of the newer tags I’ve described; and a 3rd tier would be optional for publishers but would create some framework for developers to have consistency.

  3. Great article. I personally don’t see what the issue is with not having better metadata for ‘e’ publications or for that matter any other publications (PDF etc.) At the end of the day, it is about a publisher being able to sell their books more effectively, it’s a bit like profiling ‘your’ target audience against their requirements, as you state in your example.

    It’s also about the education of what is metadata and how to create it & use it properly. For example, how many people actually bother to fill in the properties of a Microsoft Word Document that they are creating and associate good metadata and Key Words to it? It can be a real ‘boon’ when transferring that content into other formats or using the key words in a search.

    For me, metadata is as important as the content, particularly in the XML editorial & workflow world that I work in. How could editorial staff find a lot of what they need, if it were not for good metadata – it still surprises me that many editorial environments have not bothered to set this aspect of their systems up correctly. With the right tools to create metadata (sometimes automatically) and appropriate set-up, metadata can very much be ‘king’! Well, nearly as much as content is already!

    In a recent project for an educational publisher, I would say that there was as much discussion about metadata as there was about content, much of which (the metadata, that is) would be created at the same time as the authoring of the actual content. The metadata would then be ‘carried’ through production and attached to final e-publications (year, ruler, events, country etc. – history publication examples). As per your example, if a buyer/user of educational content wanted to find out what e-publications covered certain years, cross referenced by ruler & events, having good metadata is the way to get speedier and more targeted results and hopefully it is ‘your’ content that gets purchased first.

    Metadata ‘will’ become much more important as people begin to fully realise the real value it has, for those that have already understood what it can do, they have the real advantage at the moment.

  4. nick said:
    > Simple: Give them what they want and more of it.


    > How do we do this? Metadata.


    “metadata” — at least as it is commonly understood
    within the publishing industry — will do very little
    to direct people to the content that they will enjoy…

    simplistic stuff like author and time period predict
    almost nothing, and more complex concepts such as
    categories and tags will just lead you into a morass
    of the dimensional complexities of the human mind.

    the correct answer is _collaborative_filtering_…

    you use _pandora_ as an example, but they do not
    use metadata (in the sense that you use the word),
    but rather a form of collaborative filtering…

    likewise, amazon uses a form of collaborative filtering.

    neither pandora nor amazon does things correctly,
    but at least they’re working in the correct direction.

    netflix, especially with their big competition, is the
    big dog out there who is doing _most_ things right,
    but even they don’t have it completely right quite yet.

    but once we _do_ get the system working correctly,
    collaborative filtering will prove to be the killer app
    that sweeps away all the chaff and gives us _wheat_.

    and my dad was a wheat-farmer, so i know what i’m
    talking about when i talk about wheat, believe me…


    p.s. i’m afraid of facebook. their “like”‘ buttons look
    — to me — to be a stab at collaborative filtering, and
    they can collect the massive kind of database that will
    serve to overcome any flaws in their methodology with
    the sheer power of big numbers. they scare me because
    i don’t trust them to use this data _in_ our interest, but
    rather _against_ it, to our own detriment, in favor of the
    big corporations that will be funneling money to them…

  5. 1) You seem to be confusing publishing and bookselling.
    2) The game for everybody is expanding the universe of readers.
    3) Ebooks appear to be doing that. http://www.thomashager.net

  6. I would like to address Thomas’s and Bowerbird’s comments.

    Thomas – I can easily see how you could think I was confusing publishing and bookselling, but that is more because I feel that the line that separates the two needs to be blurred. On top of that, a publisher needs to worry about how their books will be sold and cater to the needs of the booksellers. The booksellers need to, in return, provide feedback to the publishers as to what consumers want.

    Bowerbird – There is much that you have said and some of it is correct although I disagree with other points. To address some of your big points:

    1) What you call collaborative filtering is what I’m calling metadata. With Pandora, they crowdsource musicians to explain things about the music such as the tempo, genre, and instruments used. As long as the data is objective and well spec’ed out, then it will not fall trap to the “morass of the dimensional complexities of the human mind.”

    There are steps that need to be taken to get us from where we are today and where I think we can be in the future, but I only had a blog post to explain them, not a book or long presentation.

    I hope this helps to explain.


  7. Da Vinci Code and Harry Potter “broke out” from devoted readers because everybody was talking about them. Metadata wouldn’t have helped at all.

    Consider the metadata for Harry Potter — children’s book; wizards; English public school. These are not the sort of qualities that provoke people to read.

    To publish a break out book, you need some angle that will get people talking about it at dinner parties, at work, in the pub. The Da Vinci Code did an amazing job of this. Not a great book, but it gave people the ability to have controversial conversations — and then recommend the book as the source.

  8. David,

    I don’t believe data can cause a book to break out. What data can do is help those who only read break-out books to discover similar books and go from 1-book-a-year readers to 5-book-a-year readers.

    Data helps a book be discovered – helps it bubble to the top (for the most relevant readers) amongst hundreds of thousands of other books.


  9. nick said:
    > What you call collaborative filtering
    > is what I’m calling metadata.

    then i suggest you take a good look at a dictionary.

    only by some huge stretch can these two things
    be considered “the same thing”, and there isn’t
    a bridge that can be built big enough to do that.

    metadata, as you used it everywhere you mentioned it,
    is a description of the content on objective dimensions.

    collaborative filtering, as i define it, is dependent on
    a _subjective_ rating of how much each person liked
    a particular bit of content, be it book, film, song, etc.

    the precise reason i said that pandora and amazon get
    it wrong is because they stray from my pure definition…
    pandora tries to categorize musical style, and amazon
    using a common _buying_ pattern to project “similarity”.
    (i guess they assume you like every book that you buy.)

    > As long as the data is objective and well spec’ed out,
    > then it will not fall trap to the “morass of the
    > dimensional complexities of the human mind.”

    you are wrong. you are totally and completely wrong.

    it’s folly to believe that you can specify the _objective_
    dimensions that will predict what people will _enjoy_,
    because there is a multitude of dimensions on which
    people make the _subjective_ decision of _enjoyment_.

    and that’s why “metadata” is doomed, from the outset.

    it is only when we give up the notion of an “objective”
    proxy variable to predict _enjoyment_, and focus purely
    on the subjective variable alone as our _sole_predictor_
    (via the power of numbers that social networks provide)
    that we will get anywhere with collaborative filtering…


  10. Bowerbird,

    Let me address some of your comments. I’m not saying that the actual definition of collaborative filtering and metadata are the same, what I was saying is that your use of collaborative filtering is actually just crowd-sourced metadata gathering (specifically in the case of Pandora).

    Pandora makes recommendations to users based off ratings, which is collaborative filtering, but when I was talking about pandora I was explaining how the data collection for the metadata works (how they have tons of artists who sit there and select the instruments and qualities of the song, not the recommendation engine).

    Judging by your comments it seems like you believe that “collaborative filtering” is the way to give people exactly what they want and that metadata is wrong. I believe that both can exist and ultimately should exist. Good metadata can help to feed a collaborative filtering application (Pandora is a perfect example of both metadata and filtering working together). As well, I don’t believe that there is any perfect prediction engine that will know what a reader will want, but, if you can boil things down to objective data elements, then based off of user’s buying habits you can tweak the algorithm that weighs different tag sets.

  11. In the midst of the maelstrom that is the publishing world at this moment in time – it is good to see that someone is thinking.
    I agree fully that this tool, metadata, is a very important tool. It has been used effectively in many implementations of music selling and I see no reason why it is not a key to promoting wider reading among book readers. Clearly it is a tool for empowering readers with the ability to find additional reading to suit their taste.
    However I don’t completely buy in to the totalitarian interpretation of its use. It is one tool and has the potential to be used far more than it is being used at present.

    I don’t agree at all with your statement “E-books will not revolutionise reading, nor will they change the content. I’ve seen some social reading projects (copia) but they are in beta and I cannot make a prediction if readers are willing to accept a completely new reading experience.”

    eBooks are definitely revolutionising reading and reading habits. People all over the world are, right now, reading more eBooks on their devices than they ever did in pBooks. It is exposing and promoting reading in a completely new way – making it convenient and cool at the same time.

    I also believe this project Copia is a portent of the future. The bookstore is on the way out (slowly … but surely) and the MOST important issue facing readers of the future will be finding good reading.

    Your metadata will play a large role in this, I am convinced. But readers will also be desperate to find good reading and discuss it and share advice and tips and recommendations. I believe that projects such as this, maybe not exactly this, will play a HUGE role in parallel with metadata.

    One thing that I believe is being overlooked in this changing time is ‘added value’. Publishing needs to really take a close look at this as a selling tool because imho readers will be like the customers of almost any other product. They will always be susceptible to the magnetic effect of ‘added value’. There are many very low cost possibilities that can be easily incorporated with eBook sales and I hope and expect forward looking publishers to utilise them if they expect to survive and distinguish themselves from others.

  12. nick said:
    > Judging by your comments it seems like
    > you believe that “collaborative filtering” is
    > the way to give people exactly what they want
    > and that metadata is wrong.

    i’ve just come back from a week at the national poetry slam,
    so pardon me if i’m too much attuned to what words mean…

    but i’m very leery of continuing a discussion when i observe
    that the other side has summarized my position very badly.

    and that’s what you have done here, nick.

    yes, i believe collaborative filtering is a way to “give people
    what they want”, but i never said “_exactly_ what they want”.
    because that would be a pretty ridiculous thing to say, not?
    yet you have put those words in my mouth. not cool, nick…

    even worse, though, was that i never ever said “metadata is
    _wrong_”. indeed, i’m not even sure what that would mean.

    i said _you_ were wrong, when you said “metadata is king”,
    (which you repeated). but i never said “metadata is wrong”.

    what i _will_ say is that “metadata will be largely ineffective
    in the task of pointing people to content which we’ll enjoy.”

    put 10-20 metadata factors in multidimensional space and
    they will predict _some_ of the variance, but after the first
    half-dozen (picked at random), the rest won’t be significant.

    and nothing you can add after that will be significant either;
    it’ll only make your modeling more complex and expensive.

    collaborative filtering, on the other hand, will get _better_ as
    it scales up, with increasing data giving increased accuracy,
    and more content giving more recommendation possibilities.

    but hey, if you can get some of the dinosaur publishers to
    waste some more of their money doing metadata, the faster
    they will go extinct, so best of luck to you in that endeavor.


  13. I think meta data might be an excellent project to open to the public the way they have opened classification of galaxies… other stuff like that ..

  14. Re another dimension of ebooks: As a librarian who believes in “pBooks,” I have been encouraged by my principal to weed 50% of our collection to make room for a library more in keeping with modern technology and the direction libraries are beginning to take: collaboration. She is convinced that ebooks will replace a good number of pBooks simply because of space considerations. This also includes home libraries. She personally “gets rid” of her print books as soon as she reads them, keeping only relevant professional books. She plans to go the way of ebooks as much as possible.

    I know your blog addresses the issue of non-readers or minimal readers and metadata–or how to open the world of books to this targeted group. It’s a grand idea, which could also be applied to books, e or p, found in libraries, themselves part of this new publishing problem. What to put on shelves or in elibraries?

    When I worked part-time in a local branch library, I discovered that people go to the library for many other reasons besides borrowing books: computer use, use of meeting rooms, genealogical research, access to a wide variety of periodical and newspaper publications, job information, research, video borrowing, a quiet place to read and/or study are just a few.

    My point is that libraries must also deal with the issue of nonreaders. If they do come in for one of the other reasons I cited, what can a library do to entice readership? A display of books using metadata! For example, “If you liked The Da Vinci Code, you might like ‘blah-blah’,” and place these books side by side. Or have flyers displayed with various choices. Or, in my case of a small school library which is still full of daunting choices for a child, display bookmarks with this type of metadata on them.

    We’re all in this together. Thank you for your information. It stirred some meta ideas!

  15. I apologise in advance if my post seems only loosely connected to your topic.

    Firstly, I agree with all of your arguments regarding the value of metadata as a tool for overcoming the paradox of choice felt by inexperienced readers when confronted with the digital marketplace.

    Secondly, I’d like to offer a sample implementation. I am aware of a similar project in a related field that has a very solid implementation of the principles you seem to be espousing.

    http://www.gamerdna.com is a social network/crowdsourced metadata harvester that is built around the idea of allowing a user to create for themselves a list and tag cloud for every computer game they’ve ever played. It has successfully created a crowdsourced metadata schema for almost every computer game produced in the last 30 years.

    A similar implementation that allowed the user to painlessly compile a list of books that they had previously read, and then compare their booklist and tag cloud with other readers would do a good job of producing a robust ontology for a large sub-set of the corpus of modern literature.

    The main advantage of gamerdna’s approach is through the use of feedback during the tagging process. This is designed to encourage users to select from tags and game names that have been previously entered into the system by other users, but not outright prevent them from adding a new game or more tags if that is what they need to do.

    The interface and implementation is very slick, I’d encourage you to sign up and take a look at it as an example of your principles in action.

    Especially since the project is designed in a way that would be ideal for borrowing into the world of readers and publishers.

  16. Thank you for your article. I do agree that metadata has its role in providing the means through which people can find books that they like to read. But I’m not convinced that metadata itself is really a stand-alone solution. Marketing books well is the key to selling books. It’s interesting that the two books you brought up in your article also had movies associated with them (the Harry Potter series and The Da Vinci Code). I believe that the movie industry provided strong marketing for these books.
    Also, I find Amazon to be a great way to find books (see others’ recommendations, find similar books, etc.). But Amazon isn’t a publisher, it’s a marketplace. I think that in the end online stores will find ways to market ebooks in a successful way, and you may be surprised at what works in the end. I don’t think organized metadata in and of itself will convince people to buy books. Marketing is more complicated than that.