A Glimpse into Google's Book Scanning

Google doesn’t divulge specifics about its proprietary book scanning set-up, but the Associated Press offers a brief look into the manual scanning process used for old/fragile titles:

… the temperature is always in the 60s … Each technician has a slightly angled table with a flexible middle that cradles books and holds them still while two overhead cameras photograph the pages. … Once the images reach the computer, the women [featured in the AP story] use the book scanning software Omniscan from Germany’s Zeutschel GmbH to clean them up. A final click of the mouse sends each digitized book to Google for optical character recognition processing, which makes the text searchable. Google then returns a copy of the images and data to the library and posts another to the Web.

(Via Publishers Weekly)

Comments: 2

  1. > A final click of the mouse
    > sends each digitized book to Google
    > for optical character recognition

    that’s telling — that _final_ click of the mouse
    occurs _before_ the o.c.r. is done. if, instead,
    google has the o.c.r. done right then and there,
    the operator can re-do the scan when the o.c.r.
    results turn out to be less-than-satisfactory…

    yes, it would take a little bit more time, but
    far less than what it takes to re-fetch the book
    to re-do the bad pages later…

    that’s just bad workflow. and you’d hope
    that — after scanning millions of books —
    google would have caught on by now…

    and that’s just the half of it…

    google _could_ — and _should_ — also use
    a _pre-scan_ to align the cameras such that
    every image is shot straight and cropped…
    yes, you can straighten later, but then you’ve
    got two sets of images floating around, and
    the straight, cropped images have artifacts
    of the straightening-via-software process,
    while your images _without_ artifacts are
    the crooked, uncropped ones… it’s so sad.


  2. Do you have any clue how long it takes to OCR a 300-page book? What is the book scanner supposed to do while he/she sits there and waits for the OCR and cleanup process to finish? He/she would be sitting there twiddling their thumbs. It would be far, far more efficient to run that process on the back end as a batch.

    [Editor’s Note: This comment has been edited for content. Disagreements are fine, but personal shots don’t benefit anyone.]