A Glimpse into Google's Book Scanning

Google doesn't divulge specifics about its proprietary book scanning setup, but the Associated Press offers a brief look into the manual scanning process used for old or fragile titles:

... the temperature is always in the 60s ... Each technician has a slightly angled table with a flexible middle that cradles books and holds them still while two overhead cameras photograph the pages. ... Once the images reach the computer, the women [featured in the AP story] use the book scanning software Omniscan from Germany's Zeutschel GmbH to clean them up. A final click of the mouse sends each digitized book to Google for optical character recognition processing, which makes the text searchable. Google then returns a copy of the images and data to the library and posts another to the Web.

(Via Publishers Weekly)

Publishing News

2 Comments


bowerbird said:
April 26, 2008 8:08 PM

> A final click of the mouse
> sends each digitized book to Google
> for optical character recognition

that's telling -- that _final_ click of the mouse
occurs _before_ the o.c.r. is done. if, instead,
google has the o.c.r. done right then and there,
the operator can re-do the scan when the o.c.r.
results turn out to be less-than-satisfactory...

yes, it would take a little bit more time, but
far less than what it takes to re-fetch the book
to re-do the bad pages later...

that's just bad workflow. and you'd hope
that -- after scanning millions of books --
google would have caught on by now...

and that's just the half of it...

google _could_ -- and _should_ -- also use
a _pre-scan_ to align the cameras such that
every image is shot straight and cropped...
yes, you can straighten later, but then you've
got two sets of images floating around, and
the straight, cropped images have artifacts
of the straightening-via-software process,
while your images _without_ artifacts are
the crooked, uncropped ones... it's so sad.

-bowerbird

Zolak said:
June 4, 2008 8:45 PM

Do you have any clue how long it takes to OCR a 300-page book? What is the book scanner supposed to do while he or she sits there and waits for the OCR and cleanup process to finish? He or she would be sitting there twiddling their thumbs. It would be far, far more efficient to run that process on the back end as a batch.

[Editor's Note: This comment has been edited for content. Disagreements are fine, but personal shots don't benefit anyone.]
