"Bite-Size Edits" from BookOven
Hugh McGuire's startup BookOven has opened up an alpha version of a project they're calling the Gutenberg Rally, an attempt to harness collective intelligence Mechanical-Turk style to proofread Project Gutenberg texts for typos and OCR (Optical Character Recognition) errors. In "divide and conquer" style, the system presents just one small snippet of text at a time (with some surrounding context), effectively breaking down a mountain of a task into easily managed molehills.

I had a nice chat with Hugh on Wednesday morning, and what he told me about what's to come from BookOven was quite exciting (though apparently still very much in development).
This isn't the first attempt to harness eyeballs for finding and fixing OCR errors (see ReCaptcha), but reviewing the text in context is a much more satisfying experience, and left me wanting to read more of several of the books I was seeing only in snippet form.
April 3, 2009 3:43 AM
i've spent years now doing research on
improving project gutenberg e-texts...
and, technically, this ain't how to do it.
plus, it's not merely a technical problem...
there are logistical and political issues too.
i suppose if this is some "startup" trying to
bring "web 2.0" to the p.g. library in hopes
of attracting money out of the sky, there's
little hope of deflecting it, and we will just
have to let it twitter itself into the oblivion
that besets efforts that can't maintain the
interest of volunteers over the long-term...
but it sure would be nice if people spent a
little bit of time doing some research first,
at least enough to see what has been done.
-bowerbird
April 3, 2009 4:00 AM
@bowerbird Any chance of some links to your research? I'm interested to read more.
April 3, 2009 4:10 AM
what's especially puzzling here is that hugh mcguire
should know -- from his experience with librivox --
exactly why this methodology will not work correctly.
(for those of you who don't have hugh's experience
-- not that it seems to have done him any good --
the "typos" in p.g. e-texts are fairly easy to locate,
since an ordinary spell-check will pin-point them.
the difficulty comes when questions involve a word
that needs to be evaluated against the book itself,
or a scan of the page, because the context is either
insufficient or unclear... more extreme difficulties
arise about judgments on if the p-book contained
a publisher/author error, typographic or otherwise.)
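A minimal sketch of the "ordinary spell-check" step described in the comment above, assuming a toy word list. The function name flag_suspects, the word list, and the sample text are all hypothetical and come from nobody's actual tooling. It only flags words missing from the list; it cannot catch an error that happens to be a valid word, which is the harder case the comment goes on to describe.

```python
import re

def flag_suspects(text, known_words):
    """Return (line_number, word) pairs for words not found in known_words."""
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            cleaned = word.strip("'").lower()
            # Skip stray apostrophes; flag anything the word list does not know.
            if cleaned and cleaned not in known_words:
                suspects.append((line_no, word))
    return suspects

# Hypothetical word list and sample; "Tlie" stands in for a typical OCR misread.
known = {"the", "quick", "brown", "fox"}
print(flag_suspects("Tlie quick brown fox", known))  # [(1, 'Tlie')]
```

A real run would load a full dictionary plus a list of proper nouns from the book, and each flagged word would still need a human decision.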
and again, that is just on the _technical_ matters...
-bowerbird
April 3, 2009 5:07 AM
charlie said:
> some links to your research?
sure thing, charlie. just google this:
> bookpeople bowerbird "comparison methodology"
and you will get some things, which point to other things.
in essence, the best success boils down to this:
1. find the differences in two separate digitizations, and
2. resolve the discrepancies by referring to the scan-set.
o.c.r. ain't perfect; however, aggressive post-o.c.r. cleaning
can take it very close to a level that's surprisingly accurate.
but sadly, the big players aren't doing that clean-up. yet...
(google does it in their labs, but won't give us clean text.)
i've "encouraged" o.c.a. to do it for years, and will persist.
especially since so many books enjoy multiple digitizations,
comparison methodology is eminently workable nowadays,
so word-by-word proofing is grossly inefficient today, but
distributed proofreaders is too set in their ways to change.
-bowerbird
April 3, 2009 12:56 PM
Crowdsourcing these small chunks of text makes the task of reviewing PG's library much more approachable.
Moreover, wrapping the text inside small snippets of context, as well as allowing the user to skip an entry if she's uncertain, are both nice interface decisions. I think this is a great way to approach the problem!
April 3, 2009 3:46 PM
travis said:
> I think this is great way to approach the problem!
um, no.
and i'd expect that you too, travis, should be familiar enough
with project gutenberg e-texts to know why this won't work...
but again, for those of you who do not have such familiarity,
project gutenberg e-texts are relatively accurate, in general.
framed yet another way, almost all of the words are correct.
so the vast majority of these "bite-size" edits will be correct.
and time spent verifying correct text is time that is "wasted".
what we really want to do is _hone_in_ on suspected errors...
so i'm not saying the "rally" won't find any errors. but they
will expend far too much human effort to find any they do;
that cost/benefit ratio won't maintain long-term volunteers.
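The cost/benefit argument above can be put as rough arithmetic. This is only a sketch: the function name and both rates are hypothetical, and the thread gives no measured error rate for Project Gutenberg texts. If a fraction error_rate of snippets contains a real error, and a volunteer catches a fraction detection_rate of those, the expected number of snippets reviewed per error caught is 1 / (error_rate * detection_rate).

```python
def reviews_per_error_found(error_rate, detection_rate=1.0):
    """Expected number of snippets reviewed for each real error caught."""
    if not 0 < error_rate <= 1 or not 0 < detection_rate <= 1:
        raise ValueError("rates must be in the interval (0, 1]")
    return 1 / (error_rate * detection_rate)

# Hypothetical figures: 1 snippet in 100 has an error, and every one is caught.
print(reviews_per_error_found(0.01))
```

Under those assumed figures, roughly a hundred snippets are read for every error found, and the ratio worsens as the texts get cleaner.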
moreover, many of the errors that _do_ occur in p.g. e-texts
simply cannot be ascertained with sufficient certainty without
reference to the page (scan) itself, so this effort is _doomed_.
(i could give you tons of examples, but anyone who did any
research on these issues will find 'em without even looking.)
what is actually needed is a project that links a p.g. e-text
to an extant scan-set of the same edition of that book and
locates suspicious text for people to check against the scan.
there are two things needed here. one is the routines that
highlight possible errors. i've done a lot of work on these.
the second is the gruntwork of matching p.g. e-texts with
extant scan-sets. we're just now coming to a point where
we can _expect_ that an "average" p.g. e-text will be found
in the scan-sets that have been done by google or the o.c.a.,
which is why i haven't built such a system before this time...
after that, we'll need a companion project that encourages
people to actually report the errors they find in p.g. e-texts.
some errors will _only_ be found by interested humans who
have the full context obtained by reading through the book.
the problem is _not_ that there haven't been enough eyeballs
on these e-texts. the problem is so few people report errors.
this gets into the fact that p.g. is lousy about fixing errors --
they readily admit they are _years_ behind on some reports,
and they are downright hostile to suggestions for speed-up
-- which takes us to the "logistical and political" arenas that
i mentioned above. a big part of the reason why people are
largely unmotivated to submit error-reports is because they
seem to garner no attention, so people wonder "why bother?"
so we need to create a friendly reading environment which
also encourages the person to report an error that they find,
after verifying from the scan that it is indeed an actual error.
and yes, i've been raising these issues since 2003 over at p.g.
(charlie, i should've pointed you to the p.g. listserve archives.)
aside from founder michael hart, no one wants to hear them...
-bowerbird
April 4, 2009 6:26 AM
Interesting conference from the European project IMPACT in The Hague, April 6th-7th:
OCR in Mass Digitisation
Challenges between Full Text, Imaging and Language
http://www.impact-project.eu/news/ic2009/
"Tuesday 7 April 2009: New advances in OCR technology, such as collaborative correction and adaptive OCR techniques, a possible way forward for future large-scale digitisation programmes."
April 4, 2009 3:17 PM
alain said:
> IMPACT
> New advances in OCR technology, such as
> collaborative correction and adaptive OCR techniques
let's hope they're smart enough to avoid adopting
all of the bad habits from distributed proofreaders,
which has developed a tremendously awful workflow.
-bowerbird
April 4, 2009 5:37 PM
Andrew, thanks for the kind article... Now, let me see if I can address some of bowerbird's concerns, which I totally understand, and for the most part agree with.
Our longer-term objective for Book Oven is to build tools for makers of books -- writers, editors, designers -- and a space where they can come together to make more books. So our prime concern is building tools for new works, rather than being an OCR-improvement service. Still, if we can serve more than one useful purpose, so much the better.
As we were sketching out Book Oven we were thinking about some of the pain points in getting a manuscript ready for publication. One of them is proofreading, so we started thinking about how we might make that process less painful, while harnessing the kind of latent cognitive energy that flows in great waves around the Internet at any moment, getting sucked up by YouTube and the like. What if we could find a way to make proofreading fun, without requiring the kind of engagement that usually comes with proofing?
And so was born the idea for Bite-Size Edits - a way to turn proofing into a kind of game, something that would be enjoyable, fun, easy. That was our development goal, and by the feedback we've gotten so far, I think we've succeeded.
Now, as you mention, I have an intimate knowledge of Gutenberg texts from LibriVox. At LibriVox we make audio versions of those texts, so we pay great attention to every word; indeed some (not all) of our audio gets "prooflistened" against the text, and so we are perhaps more likely than most to see the errors. But it's never been a clear or easy process to report those errors to Gutenberg; and once they're reported, Gutenberg has trouble processing them. I have great sympathy with that particular challenge. At LibriVox we face the same problem: making an audio version of a text is difficult and time-consuming, but overall the people involved in LibriVox love doing it. Getting error reports, figuring out if they are legitimate errors, and then downloading the audio, fixing it, reuploading, is a difficult, thankless, and frankly uninspiring process. And we, like Gutenberg (and like Distributed Proofreaders) are run by volunteers. So all that gets done if and when volunteers decide to do it, which is much less exciting than finishing another book. Gutenberg will have, I am sure, similar problems. I'm deeply sympathetic.
Still, Michael Hart's December 2008 newsletter referenced a LibriVox reader who had found 23 errors in a text and sent them in to Gutenberg. So whether or not there are better ways to fix problems in Gutenberg texts, one way is to have people find the errors and submit them. (Getting the etexts corrected is another problem altogether, and I wish Gutenberg luck & courage in figuring out how to do that in a sensible and efficient way).
This brings us back to Bite-Size Edits. Michael's letter triggered the idea for Gutenberg Rally. We had long planned to test Bite-Size Edits with public domain texts, but we thought we might be able to serve a more specific purpose: trying to perfect Gutenberg texts. So we decided on the Gutenberg Rally.
I know of Distributed Proofreaders, but I have never participated in the project. But I have spoken with numerous DPers, and had a long chat with Juliet about what we had in mind. And of course the challenges you highlight are indeed something we can't (right now anyway) address: that is, the DP process needs to compare the OCR text with the original scans to be sure to get the right outcome. So we can't help with most of the DP process. Where we can help is Smooth Reading, the last (optional) stage of DP's process. This is the stage where volunteers are asked to read through the text without reference to the scans, and point out any problems they find. The texts we're editing in the Rally are mostly from that pool. Our hope is that we can continue to help DP in whatever way we can, by putting Smooth Reading texts through Bite-Size Edits, and allowing the DP project managers to figure out if the errors we report need fixing or not.
We are not trying to change the DP process, or replace the usual Smooth Reading methods, but provide a new and different way to check those texts, which can be additional to everything the good people at DP are already doing.
So in short: Bite-Size Edits may not be the answer to all the problems of proofing scanned texts, but we do hope that it can help improve the quality of free public domain etexts. We think we've built a new tool to address certain problems of proofreading, by decontextualizing text so that you're more likely to spot problems, and finally, and most importantly, by making it fun and engaging.
But given all that, if you have any ideas of how Bite-Size Edits could be more useful for Gutenberg, I'd love to discuss them: hugh@bookoven.com
[Andrew, sorry for taking up so much space!]
April 4, 2009 11:42 PM
first, hugh, thanks ever so much for joining the conversation,
and not just jumping to a too-easy response of "i'm insulted".
i greatly respect pioneers brave enough to engage in dialog...
***
> our prime concern is building tools for new works,
> rather than being an OCR-improvement service.
yes, the name of your site -- "bookoven" -- makes that clear,
so i knew this was just an "extension" of your main purpose...
(indeed, i thought you had said it specifically at some place,
or at least implied it, but perhaps i just inferred it, i dunno.)
> pain points
> proofreading
> harnessing the kind of latent cognitive energy
> that flows in great waves around the Internet
also very clear.
and in keeping with your tremendous success with librivox.
> What if we could find a way to make proofreading fun,
> without requiring the kind of engagement that usually
> comes with proofing?
i have made this suggestion myself, on the p.g. listserve.
> And so was born the idea for Bite-Size Edits -
> a way to turn proofing into a kind of game,
> something that would be enjoyable, fun, easy.
> That was our development goal, and by the
> feedback we've gotten so far, I think we've succeed.
um... ok... but i'd think it's still much too soon to tell,
and the people who don't agree probably won't say so.
from my experience, any joy in doing proofreading is
finding and fixing errors... and the fact remains that
most of your volunteers will do that only very rarely...
but if you can maintain volunteers long-term, i will be
more than happy to tell you i am wrong wrong wrong.
(but we're all just eating the dust of genius luis von ahn.
and even _he_ couldn't make proofreading all that fun;
recaptcha is about as much "fun" as pulling out a tooth.
the best even luis can say in its defense is that "at least
the time that's being wasted is going to a good cause."
which -- let's be honest here -- is a far cry from "fun".)
> But it's never been a clear or easy process to
> report those errors to Gutenberg; and once they're
> reported, Gutenberg has trouble processing them.
lord knows i know _exactly_ what you're talking about...
i've been trying to get them to wake up for over 5 years.
> And we, like Gutenberg (and like Distributed Proofreaders)
> are run by volunteers. So all that gets done if and when
> volunteers decide to do it, which is much less exciting
> than finishing another book.
well, actually, the p.g. difficulties with error-corrections
are _not_ that they're "run by volunteers". the problem
is that there is an inside circle -- called "whitewashers" --
who have decided that only they can act on error-reports.
and they've then given that task an extremely low priority.
if p.g. really opened up the error-correction process to its
volunteers, it'd get done _a_lot_ more quickly than now...
(and that's what michael was trying to do with his "call"...
but even michael can't wrestle it from the whitewashers.)
> I'm deeply sympathetic.
i was too, for a year or two. after experiencing hostility,
i came to realize that they really don't deserve sympathy.
it's a huge problem they have brought upon themselves...
> whether or not there are better ways to fix problems
> in Gutenberg texts, one way is to have people
> find the errors and submit them.
error-reports are _great_. so to the extent that you will
generate them, i can be very supportive of your efforts...
> We had long planned to test Bite-Size Edits
> with public domain texts, but we thought we
> might be able to serve a more specific purpose:
> trying to perfect Gutenberg texts.
i understand the appeal, i really do! it just won't work...
you won't do much _harm_, and will likely do some good,
don't get me wrong. but the amount of human time and
energy that end up being spent will not be cost-effective.
moreover, you will only catch a minority of the errors...
(which means yet _another_ effort to get all of the rest.)
> I know of Distributed Proofreaders, but
> I have never participated in the project.
let me be perfectly clear that i am not recommending d.p.
they have an absolutely awful workflow, precisely because
they too provide a horrid return on the investment of time
and energy that is _donated_ to them by their volunteers,
and i have documented the problems with their workflow...
i could rip apart d.p., but this has already grown too long.
but i will address this one thing:
> Where we can help is Smooth Reading, the last
> (optional) stage of DP's process. This is the stage
> where volunteers are asked to read through the text
> without reference to the scans, and point out
> any problems they find.
indeed, smooth-reading is somewhat similar to your rally.
the main difference is one i mentioned above, where some
errors can best be found only by a human reader who has
interested knowledge of the whole book. smooth-readers
have that, in general, whereas your rally volunteers do not.
on the other hand, neither smooth-reading nor your rally
has access to the scans. and that's a problem with _both_.
it's a problem that smooth-readers shouldn't have to have,
since d.p. has the scans right there, so it's just another flaw
in their procedures. but with your rally, the flaw is built-in.
one of my points is that scans should _always_ be available.
simply put, you can't do error-checking without the scans...
all you do is raise flags about things that might be wrong,
and then somebody has to consult a scan to see if it _is_...
one of the things that the p.g. whitewashers use as their
"defense" in assigning error-reports a low priority is that
half the "errors" that are reported are _not_ in fact errors.
if you don't have the scan in front of you, telling you that
"no, indeed, that _is_ what the p-book actually printed, so
it is _not_ a transcription glitch or an o.c.r. bug", then you
are going to make a lot of "false" error-reports, which are
a waste of everyone's time. (and even if i wouldn't call this
"doing any harm", it's obviously not a _good_ thing either.
especially when the whitewashers whose time gets wasted
are the very people who were "too busy" in the first place.)
on the other hand, if you _do_ have the scan in front of you
that informs you that "yes indeed, the p-book has it printed
differently than the transcription", you _know_ it is an error.
> by putting Smooth Reading texts through Bite-Size Edits,
> and allowing the DP project managers to figure out
> if the errors we report need fixing or not.
well, as i always say, the proof is in the pudding.
so if this process ends up working, more power to you, hugh.
but my guess is that your unsupported-by-viewing-the-scan
"error reports" will be wrong nearly as often as they're right,
and the d.p. post-processors might find that the time that is
wasted in checking the false-reports is a considerable drag...
of course, i think your rally volunteers will peter out _first_...
> if you have any ideas of how Bite-Size Edits
> could be more useful for Gutenberg
i laid it all out above, hugh.
the best way to correct errors in a p.g. e-text
-- or any digitized book, for that matter -- is to:
1. find an independently-done digitization, and
2. compare the two to find where they differ, and
3. resolve differences by referring to the page (scan).
this saves scads of time, because you simply assume that
the places where the digitizations _agree_ are _correct_...
and yes, theoretically, it could be the case that they just
have the same error in common, but in practice, according
to extensive research that i've done, that's extremely rare.
so this "comparison methodology" is actually quite good.
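A minimal sketch of steps 1 and 2 of the comparison methodology described above, using Python's standard difflib module. This is not bowerbird's own tooling: the function name find_discrepancies and the sample strings are hypothetical, and the sample borrows a Hamlet line discussed elsewhere in this thread. Step 3, checking each difference against the page scan, stays a human task.

```python
from difflib import SequenceMatcher

def find_discrepancies(text_a, text_b):
    """Return (word_index_in_a, words_from_a, words_from_b) where the texts differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False so that frequent words are not discarded on long texts.
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    discrepancies = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            discrepancies.append(
                (a1, " ".join(words_a[a1:a2]), " ".join(words_b[b1:b2]))
            )
    return discrepancies

# Hypothetical inputs: one digitization with a typo, one independent digitization.
first_digitization = "My gorge rimes at it."
second_digitization = "My gorge rises at it."
print(find_discrepancies(first_digitization, second_digitization))
# [(2, 'rimes', 'rises')]
```

Only the places where the two digitizations disagree are surfaced, so stretches of matching text are never put in front of a proofreader. The sketch cannot say which side is right; that needs the scan.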
the p.g. listserve archives (gutvol-d) have many examples
of cases where i document research where i have used it...
the most recent book i've done was "the jungle" by sinclair,
where i found and fixed over 1,000 errors in the p.g. e-text.
i will post links to some of the books that i've mounted,
which show my "continuous proofing" interface, where
i've applied that term to the proofing done by the public
after an e-text has been released upon attaining accuracy
projected at a rate of less-than-one-error-every-10-pages.
i'll post those links in a separate message, since links will
often cause a post to be delayed for human-spam-checks.
but yeah, that's what i suggest. find a scanset over at o.c.a.,
where you can grab the images and a copy of the o.c.r., and
then do _lots_ of aggressive clean-up on the o.c.r., and then
compare the p.g. e-text with the cleaned-up o.c.r. from o.c.a.,
resolving the differences by making reference to page-scans.
at the end of that process, you'll have a _very_ clean product...
-bowerbird
April 9, 2009 2:47 PM
bowerbird wrote:
"project gutenberg e-texts are relatively accurate, in general."
"Relatively accurate" means they're *inaccurate.*
Years ago I gave Michael Hart some grief for not caring about the "fiddly bits," as he called such items as misplaced punctuation, typographical errors, and so on. Well, it's those fiddly bits that make a text reliable and useable. For example, one of the PG texts for Hamlet used to have this line in the graveyard scene where Hamlet is holding up Yorick's skull: "My gorge rimes at it." That, of course, is meaningless; it should read "My gorge *rises* at it." I have since seen the line "corrected" in another online version to "My gorge *rims* at it," which isn't meaningless but has now been rendered into nonsense *because* of the existence of the typo in the PG text.
That's the problem with inaccuracies in online texts--errors are perpetuated ad infinitum as they're copied from one site to another, used in CD collections, typeset for printed books, and so on. So it's *important* to get the texts *right* to begin with. "Relatively accurate" doesn't cut it.
I agree with bowerbird that the solution to fixing existing texts is to electronically compare several different versions and then resolve differences by referring to a professionally published edition (or a scan of one). But really, this should be done *before* the text is ever posted online. Posting (for general consumption) a crappy version of a book with the hope of cleaning it up "someday" is bad, bad practice.
Best wishes,
Jack Lyon
April 9, 2009 7:08 PM
jack-
i feel your pain. deeply.
i walk a very fine line between defending project gutenberg
from those people who would like to see it _destroyed_ and
use the errors in p.g. e-texts as a "good excuse" to kill them,
and criticizing p.g. in-house because it's doing such a bad job.
i don't blame michael hart, though. he started at the right spot,
believing the future would deliver people to perfect his efforts...
the problem is, the people who came along didn't do that job,
nor did they put into place the mechanism that _would_ do it...
over 5 years ago, in december of 2003, when project gutenberg
reached its original goal of 10,000 e-texts, i pleaded with the
people in charge of the project to put new digitization on hold
while they went back and brought the library up to a standard,
but they didn't listen. when they hit 15,000 e-texts, i pleaded
once again, and again when they hit 20,000, and then 25,000.
they're approaching 30,000, and i'm going to plead once again.
but i don't expect to have any more success than i ever had...
oh, and just to remind people of the original point, there is
absolutely no sense in proofing _all_ of the sentences when
only a small percentage of those sentences contain any errors.
that's why i had said that p.g. e-texts are _relatively_ accurate.
it wasn't to exonerate p.g. e-texts; it was a comment on the
bad cost-effectiveness of an approach proofing all sentences.
-bowerbird