Lisa Laughy – Archives Assistant
The September 12th issue of Science, recently out on the shelf in Ohrstrom Library’s periodical room, features a cover article about the combination of new tech and old books. Five researchers have tested the effectiveness of the CAPTCHA web security measure to pick up the slack in OCR book digitization. If you regularly browse the web, you have encountered a CAPTCHA – a prompt asking you to decipher a difficult-to-read section of text and type the letters into a box. Now researchers are finding a way to repurpose your small efforts into something rather useful. Science describes the project:
“Millions of books written before the computer era are being digitized for preservation. Because the ink has faded, optical character recognition software cannot decipher many words. Through a repurposing of an existing online security technology called CAPTCHA, these words are being manually transcribed by millions of Web users.”
Here is the abstract from the published paper:
“CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are widespread security measures on the World Wide Web that prevent automated programs from abusing online services. They do so by asking humans to perform a task that computers cannot yet perform, such as deciphering distorted characters. Our research explored whether such human effort can be channeled into a useful purpose: helping to digitize old printed material by asking users to decipher scanned words from books that computerized optical character recognition failed to recognize. We showed that this method can transcribe text with a word accuracy exceeding 99%, matching the guarantee of professional human transcribers. Our apparatus is deployed in more than 40,000 Web sites and has transcribed over 440 million words.”
The article estimates that over 100 million CAPTCHAs are typed a day, amounting to hundreds of thousands of human hours. Tapping into that resource to accomplish such a useful task as the digital preservation of old books is a fascinating prospect. Come into Ohrstrom Library’s periodical room and read the full text of the article in the September 12th issue of Science, starting on page 1465.
Terry Wardrop
This is driving me crazy – I have read about this somewhere else and I can’t remember where (NB: it was NOT in Science Magazine…)
Lisa Laughy
Hi Terry –
I read about the lead researcher from this project, Luis von Ahn, in a Wired magazine article in June of last year ( http://www.wired.com/techbiz/it/magazine/15-07/ff_humancomp ). He is the one who developed the CAPTCHA – and he is doing the same sort of thing as this book digitization project with images, sounds and language. Plus he is a MacArthur Genius – so there are probably a lot of articles out there about him. Thanks!