
"Bite-Size Edits" from BookOven

Hugh McGuire’s startup BookOven has opened up an alpha version of a project they’re calling the Gutenberg Rally, an attempt to harness collective intelligence, Mechanical Turk-style, to proofread Project Gutenberg texts for typos and OCR (Optical Character Recognition) errors. In “divide and conquer” style, the system presents just one small snippet of text at a time (with some surrounding context), effectively breaking down a mountain of a task into easily managed molehills:

[Screenshot: BookOven Gutenberg Rally]
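To make the mechanics concrete, here is a rough sketch of how that kind of snippet extraction might work. It is purely illustrative (this is not BookOven’s actual code), and the sentence splitter is deliberately naive:

```python
# Hypothetical sketch of "bite-size" snippet extraction; not BookOven's
# actual code. It splits a text into sentence-sized pieces, each paired
# with a little surrounding context for the proofreader.

import re

def make_snippets(text, context=1):
    # Naive sentence split; a real system would use something sturdier.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    snippets = []
    for i, sentence in enumerate(sentences):
        snippets.append({
            'before': ' '.join(sentences[max(0, i - context):i]),
            'target': sentence,
            'after': ' '.join(sentences[i + 1:i + 1 + context]),
        })
    return snippets

for s in make_snippets("It was a dark night. The rain fell. Nobody stirred."):
    print(s['before'], '[', s['target'], ']', s['after'])
```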

I had a nice chat with Hugh on Wednesday morning, and what he told me about what’s to come from BookOven was quite exciting (though apparently still very much in development).

This isn’t the first attempt to harness eyeballs for finding and fixing OCR errors (see ReCaptcha), but reviewing the text in context is a much more satisfying experience, and left me wanting to read more of several of the books I was seeing only in snippet form.

  • bowerbird

    i’ve spent years now doing research on
    improving project gutenberg e-texts…

    and, technically, this ain’t how to do it.

    plus, it’s not merely a technical problem…
    there are logistical and political issues too.

    i suppose if this is some “startup” trying to
    bring “web 2.0” to the p.g. library in hopes
    of attracting money out of the sky, there’s
    little hope of deflecting it, and we will just
    have to let it twitter itself into the oblivion
    that besets efforts that can’t maintain the
    interest of volunteers over the long-term…

    but it sure would be nice if people spent a
    little bit of time doing some research first,
    at least enough to see what has been done.

    -bowerbird

  • Charlie Meigh

    @bowerbird Any chance of some links to your research? I’m interested to read more.

  • bowerbird

    what’s especially puzzling here is that hugh mcguire
    should know — from his experience with librivox —
    exactly why this methodology will not work correctly.

    (for those of you who don’t have hugh’s experience
    – not that it seems to have done him any good –
    the “typos” in p.g. e-texts are fairly easy to locate,
    since an ordinary spell-check will pin-point them.
    the difficulty comes when questions involve a word
    that needs to be evaluated against the book itself,
    or a scan of the page, because the context is either
    insufficient or unclear… more extreme difficulties
    arise with judgments about whether the p-book contained
    a publisher/author error, typographic or otherwise.)
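    a bare-bones sketch of that kind of spell-check pass
    (the file names are just placeholders; a real pass would
    also handle hyphenation, proper nouns, archaic spellings):

```python
# toy spell-check flagger; file names are placeholders. a real pass
# would also handle hyphenation, proper nouns, and archaic spellings.

import re

def load_wordlist(path):
    with open(path) as f:
        return {line.strip().lower() for line in f}

def flag_unknown_words(text, wordlist):
    # yield (offset, word) for every word not found in the list
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().strip("'").lower()
        if word and word not in wordlist:
            yield match.start(), match.group()

wordlist = load_wordlist("words.txt")
for offset, word in flag_unknown_words(open("etext.txt").read(), wordlist):
    print(offset, word)
```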

    and again, that is just on the _technical_ matters…

    -bowerbird

  • bowerbird

    charlie said:
    > some links to your research?

    sure thing, charlie. just google this:
    > bookpeople bowerbird “comparison methodology”
    and you will get some things, which point to other things.

    in essence, the best success boils down to this:
    1. find the differences in two separate digitizations, and
    2. resolve the discrepancies by referring to the scan-set.
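    as a rough illustration, the comparison step might look
    like this in python (file names invented; real digitizations
    would need normalization and alignment first):

```python
# toy "comparison methodology": diff two independent digitizations
# word-by-word and report every place they disagree. file names are
# invented; real texts need normalization and alignment first.

import difflib

words_a = open("digitization_a.txt").read().split()
words_b = open("digitization_b.txt").read().split()

matcher = difflib.SequenceMatcher(None, words_a, words_b)
for op, a1, a2, b1, b2 in matcher.get_opcodes():
    if op != "equal":
        # each discrepancy gets resolved by checking the page scan
        print(op, " ".join(words_a[a1:a2]), "<->", " ".join(words_b[b1:b2]))
```

    every non-equal opcode is a spot where a human consults the scan.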

    o.c.r. ain’t perfect; however, aggressive post-o.c.r. cleaning
    can take it very close to a level that’s surprisingly accurate.
    but sadly, the big players aren’t doing that clean-up. yet…
    (google does it in their labs, but won’t give us clean text.)
    i’ve “encouraged” o.c.a. to do it for years, and will persist.

    especially since so many books enjoy multiple digitizations,
    comparison methodology is eminently workable nowadays,
    so word-by-word proofing is grossly inefficient today, but
    distributed proofreaders is too set in their ways to change.

    -bowerbird

  • http://www.bookglutton.com Travis Alber

    Crowdsourcing these small chunks of text makes the task of reviewing PG’s library much more approachable.

    Moreover, wrapping the text inside small snippets of context, as well as allowing the user to skip an entry if she’s uncertain, are both nice interface decisions. I think this is a great way to approach the problem!

  • bowerbird

    travis said:
    > I think this is a great way to approach the problem!

    um, no.

    and i’d expect that you too, travis, should be familiar enough
    with project gutenberg e-texts to know why this won’t work…

    but again, for those of you who do not have such familiarity,
    project gutenberg e-texts are relatively accurate, in general.

    framed yet another way, almost all of the words are correct.
    so the vast majority of these “bite-size” edits will be correct.
    and time spent verifying correct text is time that is “wasted”.
    what we really want to do is _hone_in_ on suspected errors…

    so i’m not saying the “rally” won’t find any errors. but they
    will expend far too much human effort to find any they do;
    that cost/benefit ratio won’t maintain long-term volunteers.
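    to put rough, made-up numbers on that cost/benefit point:

```python
# back-of-envelope math with made-up but plausible numbers, showing
# why verifying mostly-correct text burns volunteer time.

errors_per_page = 0.1     # assume 1 error every 10 pages
words_per_page = 300      # assumed
words_per_snippet = 40    # assumed

snippets_per_page = words_per_page / words_per_snippet    # 7.5
snippets_per_error = snippets_per_page / errors_per_page  # 75.0
print(f"~{snippets_per_error:.0f} snippets reviewed per error found")
```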

    moreover, many of the errors that _do_ occur in p.g. e-texts
    simply cannot be ascertained with sufficient certainty without
    reference to the page (scan) itself, so this effort is _doomed_.
    (i could give you tons of examples, but anyone who did any
    research on these issues will find ‘em without even looking.)

    what is actually needed is a project that links a p.g. e-text
    to an extant scan-set of the same edition of that book and
    locates suspicious text for people to check against the scan.
    there are two things needed here. one is the routines that
    highlight possible errors. i’ve done a lot of work on these.
    the second is the gruntwork of matching p.g. e-texts with
    extant scan-sets. we’re just now coming to a point where
    we can _expect_ that an “average” p.g. e-text will be found
    in the scan-sets that have been done by google or the o.c.a.,
    which is why i haven’t built such a system before this time…
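    to give a flavor of the first piece, here is a toy version
    (the patterns below are generic illustrations, _not_ my
    actual routines, and they deliberately over-flag):

```python
# toy error-highlighter: flag words that match common o.c.r. confusion
# patterns so a human can check them against the page scan. these
# patterns are generic illustrations (and deliberately over-flag).

import re

SUSPECT_PATTERNS = [
    (r"\btbe\b|\bwbich\b|\bbave\b", "classic h/b scanno"),
    (r"\b\w*[a-z]1[a-z]\w*\b",      "digit 1 inside a word (often l or i)"),
    (r"\b[a-z]+0[a-z]+\b",          "digit 0 inside a word (often o)"),
]

def highlight_suspects(text):
    for pattern, reason in SUSPECT_PATTERNS:
        for m in re.finditer(pattern, text):
            yield m.start(), m.group(), reason

for offset, word, reason in highlight_suspects(open("etext.txt").read()):
    print(offset, word, "--", reason)
```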

    after that, we’ll need a companion project that encourages
    people to actually report the errors they find in p.g. e-texts.

    some errors will _only_ be found by interested humans who
    have the full context obtained by reading through the book.

    the problem is _not_ that there haven’t been enough eyeballs
    on these e-texts. the problem is so few people report errors.

    this gets into the fact that p.g. is lousy about fixing errors –
    they readily admit they are _years_ behind on some reports,
    and they are downright hostile to suggestions for speed-up
    – which takes us to the “logistical and political” arenas that
    i mentioned above. a big part of the reason why people are
    largely unmotivated to submit error-reports is because they
    seem to garner no attention, so people wonder “why bother?”

    so we need to create a friendly reading environment which
    also encourages the person to report an error that they find,
    after verifying from the scan that it is indeed an actual error.

    and yes, i’ve been raising these issues since 2003 over at p.g.
    (charlie, i should’ve pointed you to the p.g. listserve archives.)

    aside from founder michael hart, no one wants to hear them…

    -bowerbird

  • http://www.i2s-bookscanner.com Alain Pierrot

    Interesting conference from the European project IMPACT in The Hague, April 6th-7th:

    OCR in Mass Digitisation
    Challenges between Full Text, Imaging and Language

    http://www.impact-project.eu/news/ic2009/

    “Tuesday 7 April 2009: New advances in OCR technology, such as collaborative correction and adaptive OCR techniques, a possible way forward for future large-scale digitisation programmes.”

  • bowerbird

    alain said:
    > IMPACT
    > New advances in OCR technology, such as
    > collaborative correction and adaptive OCR techniques

    let’s hope they’re smart enough to avoid adopting
    all of the bad habits from distributed proofreaders,
    which has developed a tremendously awful workflow.

    -bowerbird

  • http://blog.bookoven.com Hugh McGuire

    Andrew, thanks for the kind article… Now, let me see if I can address some of bowerbird’s concerns, which I totally understand, and for the most part agree with.

    Our longer-term objective for Book Oven is to build tools for makers of books — writers, editors, designers — and a space where they can come together to make more books. So our prime concern is building tools for new works, rather than being an OCR-improvement service. Still, if we can serve more than one useful purpose, so much the better.

    As we were sketching out Book Oven we were thinking about some of the pain points in getting a manuscript ready for publication. One of them is proofreading, so we started thinking about how we might make that process less painful, while harnessing the kind of latent cognitive energy that flows in great waves around the Internet at any moment, getting sucked up by Youtube and the like. What if we could find a way to make proofreading fun, without requiring the kind of engagement that usually comes with proofing?

    And so was born the idea for Bite-Size Edits – a way to turn proofing into a kind of game, something that would be enjoyable, fun, easy. That was our development goal, and by the feedback we’ve gotten so far, I think we’ve succeeded.

    Now, as you mention, I have an intimate knowledge of Gutenberg texts from LibriVox. At LibriVox we make audio versions of those texts, so we pay great attention to every word; indeed some (not all) of our audio gets “prooflistened” against the text, and so we are perhaps more likely than most to see the errors. But it’s never been a clear or easy process to report those errors to Gutenberg; and once they’re reported, Gutenberg has trouble processing them. I have great sympathy with that particular challenge. At LibriVox we face the same problem: making an audio version of a text is difficult and time-consuming, but overall the people involved in LibriVox love doing it. Getting error reports, figuring out if they are legitimate errors, and then downloading the audio, fixing it, and re-uploading it is a difficult, thankless, and frankly uninspiring process. And we, like Gutenberg (and like Distributed Proofreaders), are run by volunteers. So all that gets done if and when volunteers decide to do it, which is much less exciting than finishing another book. Gutenberg will have, I am sure, similar problems. I’m deeply sympathetic.

    Still, Michael Hart’s December 2008 newsletter referenced a LibriVox reader who had found 23 errors in a text, and sent them in to Gutenberg. So whether or not there are better ways to fix problems in Gutenberg texts, one way is to have people find the errors and submit them. (Getting the etexts corrected is another problem altogether, and I wish Gutenberg luck & courage in figuring out how to do that in a sensible and efficient way.)

    This brings us back to Bite-Size Edits. Michael’s letter triggered the idea for Gutenberg Rally. We had long planned to test Bite-Size Edits with public domain texts, but we thought we might be able to serve a more specific purpose: trying to perfect Gutenberg texts. So we decided on the Gutenberg Rally.

    I know of Distributed Proofreaders, but I have never participated in the project. But I have spoken with numerous DPers, and had a long chat with Juliet about what we had in mind. And of course the challenges you highlight are indeed something we can’t (right now anyway) address: that is, the DP process needs to compare the OCR text with the original scans to be sure to get the right outcome. So we can’t help with most of the DP process. Where we can help is Smooth Reading, the last (optional) stage of DP’s process. This is the stage where volunteers are asked to read through the text without reference to the scans, and point out any problems they find. The texts we’re editing in the Rally are mostly from that pool. Our hope is that we can continue to help DP in whatever way we can, by putting Smooth Reading texts through Bite-Size Edits, and allowing the DP project managers to figure out if the errors we report need fixing or not.

    We are not trying to change the DP process, or replace the usual Smooth Reading methods, but to provide a new and different way to check those texts, which can complement everything the good people at DP are already doing.

    So in short: Bite-Size Edits may not be the answer to all the problems of proofing scanned texts, but we do hope it can help improve the quality of free public domain etexts. We think we’ve built a new tool to address certain problems of proofreading, by decontextualizing text so that you’re more likely to spot problems, and, most importantly, by making it fun and engaging.

    But given all that, if you have any ideas of how Bite-Size Edits could be more useful for Gutenberg, I’d love to discuss them: hugh@bookoven.com

    [Andrew, sorry for taking up so much space!]

  • bowerbird

    first, hugh, thanks ever so much for joining the conversation,
    and not just jumping to a too-easy response of “i’m insulted”.
    i greatly respect pioneers brave enough to engage in dialog…

    ***

    > our prime concern is building tools for new works,
    > rather than being an OCR-improvement service.

    yes, the name of your site — “bookoven” — makes that clear,
    so i knew this was just an “extension” of your main purpose…

    (indeed, i thought you had said it specifically at some place,
    or at least implied it, but perhaps i just inferred it, i dunno.)

    > pain points
    > proofreading
    > harnessing the kind of latent cognitive energy
    > that flows in great waves around the Internet

    also very clear.

    and in keeping with your tremendous success with librivox.

    > What if we could find a way to make proofreading fun,
    > without requiring the kind of engagement that usually
    > comes with proofing?

    i have made this suggestion myself, on the p.g. listserve.

    > And so was born the idea for Bite-Size Edits –
    > a way to turn proofing into a kind of game,
    > something that would be enjoyable, fun, easy.
    > That was our development goal, and by the
    > feedback we’ve gotten so far, I think we’ve succeeded.

    um… ok… but i’d think it’s still much too soon to tell,
    and the people who don’t agree probably won’t say so.

    from my experience, any joy in doing proofreading is
    finding and fixing errors… and the fact remains that
    most of your volunteers will do that only very rarely…

    but if you can maintain volunteers long-term, i will be
    more than happy to tell you i am wrong wrong wrong.

    (but we’re all just eating the dust of genius luis von ahn.
    and even _he_ couldn’t make proofreading all that fun;
    recaptcha is about as much “fun” as pulling out a tooth.
    the best even luis can say in its defense is that “at least
    the time that’s being wasted is going to a good cause.”
    which — let’s be honest here — is a far cry from “fun”.)

    > But it’s never been a clear or easy process to
    > report those errors to Gutenberg; and once they’re
    > reported, Gutenberg has trouble processing them.

    lord knows i know _exactly_ what you’re talking about…

    i’ve been trying to get them to wake up for over 5 years.

    > And we, like Gutenberg (and like Distributed Proofreaders),
    > are run by volunteers. So all that gets done if and when
    > volunteers decide to do it, which is much less exciting
    > than finishing another book.

    well, actually, the p.g. difficulties with error-corrections
    are _not_ that they’re “run by volunteers”. the problem
    is that there is an inside circle — called “whitewashers” —
    who have decided that only they can act on error-reports.
    and they’ve then given that task an extremely low priority.

    if p.g. really opened up the error-correction process to its
    volunteers, it’d get done _a_lot_ more quickly than now…

    (and that’s what michael was trying to do with his “call”…
    but even michael can’t wrestle it from the whitewashers.)

    > I’m deeply sympathetic.

    i was too, for a year or two. after experiencing hostility,
    i came to realize that they really don’t deserve sympathy.
    it’s a huge problem they have brought upon themselves…

    > whether or not there are better ways to fix problems
    > in Gutenberg texts, one way is to have people
    > find the errors and submit them.

    error-reports are _great_. so to the extent that you will
    generate them, i can be very supportive of your efforts…

    > We had long planned to test Bite-Size Edits
    > with public domain texts, but we thought we
    > might be able to serve a more specific purpose:
    > trying to perfect Gutenberg texts.

    i understand the appeal, i really do! it just won’t work…
    you won’t do much _harm_, and will likely do some good,
    don’t get me wrong. but the amount of human time and
    energy that end up being spent will not be cost-effective.
    moreover, you will only catch a minority of the errors…
    (which means yet _another_ effort to get all of the rest.)

    > I know of Distributed Proofreaders, but
    > I have never participated in the project.

    let me be perfectly clear that i am not recommending d.p.

    they have an absolutely awful workflow, precisely because
    they too provide a horrid return on the investment of time
    and energy that is _donated_ to them by their volunteers,
    and i have documented the problems with their workflow…

    i could rip apart d.p., but this has already grown too long.

    but i will address this one thing:
    > Where we can help is Smooth Reading, the last
    > (optional) stage of DP’s process. This is the stage
    > where volunteers are asked to read through the text
    > without reference to the scans, and point out
    > any problems they find.

    indeed, smooth-reading is somewhat similar to your rally.

    the main difference is one i mentioned above, where some
    errors can best be found only by a human reader who knows
    the whole book from having read it through. smooth-readers
    have that, in general, whereas your rally volunteers do not.

    on the other hand, neither smooth-reading nor your rally
    has access to the scans. and that’s a problem with _both_.
    it’s a problem that smooth-readers shouldn’t have to have,
    since d.p. has the scans right there, so it’s just another flaw
    in their procedures. but with your rally, the flaw is built-in.

    one of my points is that scans should _always_ be available.
    simply put, you can’t do error-checking without the scans…
    all you do is raise flags about things that might be wrong,
    and then somebody has to consult a scan to see if it _is_…

    one of the things that the p.g. whitewashers use as their
    “defense” in assigning error-reports a low priority is that
    half the “errors” that are reported are _not_ in fact errors.

    if you don’t have the scan in front of you, telling you that
    “no, indeed, that _is_ what the p-book actually printed, so
    it is _not_ a transcription glitch or an o.c.r. bug”, then you
    are going to make a lot of “false” error-reports, which are
    a waste of everyone’s time. (and even if i wouldn’t call this
    “doing any harm”, it’s obviously not a _good_ thing either.
    especially when the whitewashers whose time gets wasted
    are the very people who were “too busy” in the first place.)

    on the other hand, if you _do_ have the scan in front of you
    that informs you that “yes indeed, the p-book has it printed
    differently than the transcription”, you _know_ it is an error.

    > by putting Smooth Reading texts through Bite-Size Edits,
    > and allowing the DP project managers to figure out
    > if the errors we report need fixing or not.

    well, as i always say, the proof is in the pudding.

    so if this process ends up working, more power to you, hugh.

    but my guess is that your unsupported-by-viewing-the-scan
    “error reports” will be wrong nearly as often as they’re right,
    and the d.p. post-processors might find that the time that is
    wasted in checking the false-reports is a considerable drag…

    of course, i think your rally volunteers will peter out _first_…

    > if you have any ideas of how Bite-Size Edits
    > could be more useful for Gutenberg

    i laid it all out above, hugh.

    the best way to correct errors in a p.g. e-text
    — or any digitized book, for that matter — is to:
    1. find an independently-done digitization, and
    2. compare the two to find where they differ, and
    3. resolve differences by referring to the page (scan).
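    steps 1 and 2 are the kind of diff pass sketched earlier.
    step 3 might queue each discrepancy alongside its page scan
    for a human to settle (paths and page mapping invented):

```python
# hypothetical step-3 sketch: pair each diff discrepancy with the page
# scan a human needs to settle it. the scan paths and the crude
# position-to-page mapping are invented for illustration.

def scan_path_for_page(page):
    return f"scans/page-{page:04d}.png"

def build_review_queue(discrepancies, words_per_page=300):
    # discrepancies: (word_index, reading_a, reading_b) from the diff pass
    queue = []
    for word_index, reading_a, reading_b in discrepancies:
        page = word_index // words_per_page + 1
        queue.append({
            "page": page,
            "scan": scan_path_for_page(page),
            "reading_a": reading_a,
            "reading_b": reading_b,
        })
    return queue

for item in build_review_queue([(450, "rises", "rimes")]):
    print(item)
```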

    this saves scads of time, because you simply assume that
    the places where the digitizations _agree_ are _correct_…
    and yes, theoretically, it could be the case that they just
    have the same error in common, but in practice, according
    to extensive research that i’ve done, that’s extremely rare.

    so this “comparison methodology” is actually quite good.
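    with illustrative (not measured) numbers, here is why
    agreement between two digitizations is so trustworthy:

```python
# illustrative arithmetic (assumed rates, not measurements): if two
# independent digitizations each misread a given word with probability
# 0.001, and a misread has ~10 equally likely wrong outputs, the odds
# that both agree on the same wrong word are tiny.

p_err = 0.001          # assumed per-word error rate, each digitization
p_same_wrong = 0.1     # assumed chance two misreads coincide

p_shared_error = p_err * p_err * p_same_wrong
print(p_shared_error)  # 1e-07, i.e. one shared error per ~10 million words
```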

    the p.g. listserve archives (gutvol-d) have many examples
    of cases where i document research where i have used it…

    the most recent book i’ve done was “the jungle” by sinclair,
    where i found and fixed over 1,000 errors in the p.g. e-text.

    i will post links to some of the books that i’ve mounted,
    which show my “continuous proofing” interface (my term for
    the proofing done by the public after an e-text has been
    released, once its projected accuracy reaches fewer than
    one error every 10 pages).

    i’ll post those links in a separate message, since links will
    often cause a post to be delayed for human-spam-checks.

    but yeah, that’s what i suggest. find a scanset over at o.c.a.,
    where you can grab the images and a copy of the o.c.r., and
    then do _lots_ of aggressive clean-up on the o.c.r., and then
    compare the p.g. e-text with the cleaned-up o.c.r. from o.c.a.,
    resolving the differences by making reference to page-scans.
    at the end of that process, you’ll have a _very_ clean product…
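    for concreteness, that “aggressive clean-up” might start
    with something like this (a sketch with a few sample
    substitutions, nowhere near a complete rule set):

```python
# sketch of aggressive post-o.c.r. clean-up: auto-apply only the
# high-confidence substitutions, and leave anything doubtful for the
# scan-comparison step. the rules below are samples, not a full set.

import re

HIGH_CONFIDENCE_FIXES = [
    (r"\btbe\b", "the"),          # classic b/h scanno
    (r"\bwbich\b", "which"),
    (r"\bmodem\b(?= times\b)", "modern"),  # rn misread as m, fixed per context
    (r"(\w)-\n(\w)", r"\1\2"),    # rejoin end-of-line hyphenation (crude)
]

def clean_ocr(text):
    for pattern, replacement in HIGH_CONFIDENCE_FIXES:
        text = re.sub(pattern, replacement, text)
    return text

raw = open("oca_ocr.txt").read()  # placeholder file names
open("oca_ocr_cleaned.txt", "w").write(clean_ocr(raw))
```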

    -bowerbird

  • EditorJack

    bowerbird wrote:

    “project gutenberg e-texts are relatively accurate, in general.”

    “Relatively accurate” means they’re *inaccurate.*

    Years ago I gave Michael Hart some grief for not caring about the “fiddly bits,” as he called such items as misplaced punctuation, typographical errors, and so on. Well, it’s those fiddly bits that make a text reliable and useable. For example, one of the PG texts for Hamlet used to have this line in the graveyard scene where Hamlet is holding up Yorick’s skull: “My gorge rimes at it.” That, of course, is meaningless; it should read “My gorge *rises* at it.” I have since seen the line “corrected” in another online version to “My gorge *rims* at it,” which isn’t meaningless but has now been rendered into nonsense *because* of the existence of the typo in the PG text.

    That’s the problem with inaccuracies in online texts–errors are perpetuated ad infinitum as they’re copied from one site to another, used in CD collections, typeset for printed books, and so on. So it’s *important* to get the texts *right* to begin with. “Relatively accurate” doesn’t cut it.

    I agree with bowerbird that the solution to fixing existing texts is to electronically compare several different versions and then resolve differences by referring to a professionally published edition (or a scan of one). But really, this should be done *before* the text is ever posted online. Posting (for general consumption) a crappy version of a book with the hope of cleaning it up “someday” is bad, bad practice.

    Best wishes,
    Jack Lyon

  • bowerbird

    jack-

    i feel your pain. deeply.

    i walk a very fine line between defending project gutenberg
    from those people who would like to see it _destroyed_ and
    use the errors in p.g. e-texts as a “good excuse” to kill them,
    and criticizing p.g. in-house because it’s doing such a bad job.

    i don’t blame michael hart, though. he started at the right spot,
    believing the future would deliver people to perfect his efforts…

    the problem is, the people who came along didn’t do that job,
    nor did they put into place the mechanism that _would_ do it…

    over 5 years ago, in december of 2003, when project gutenberg
    reached its original goal of 10,000 e-texts, i pleaded with the
    people in charge of the project to put new digitization on hold
    while they went back and brought the library up to a standard,
    but they didn’t listen. when they hit 15,000 e-texts, i pleaded
    once again, and again when they hit 20,000, and then 25,000.

    they’re approaching 30,000, and i’m going to plead once again.
    but i don’t expect to have any more success than i ever had…

    oh, and just to remind people of the original point, there is
    absolutely no sense in proofing _all_ of the sentences when
    only a small percentage of those sentences contain any errors.
    that’s why i had said that p.g. e-texts are _relatively_ accurate.
    it wasn’t to exonerate p.g. e-texts; it was a comment on the
    poor cost-effectiveness of an approach that proofs every sentence.

    -bowerbird