SearchEngineWatch announces “Google Opens Tesseract OCR Software”, which is exciting news for those of us who scan or want to convert a lot of documents to text:
The Google Code Blog announced that Google has “re-released” the Tesseract OCR software to the open source community. OCR, optical character recognition, is the technology for converting text on a physical paper into computer based text. So if you have a ton of papers you typed up in your college days and you want them stored in digital format, you can use OCR to translate those documents for you.
OCR (Optical Character Recognition) converts image scans of documents into text. Bitmap, TIFF, and other scanned image files can be imported into the program, and the software works through each image looking for shapes it can recognize as letters of the alphabet.
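To give a rough feel for what “detecting recognizable letters” can mean, here is a toy sketch in Python. It is not how Tesseract or any commercial OCR program works internally; it just compares a tiny scanned shape against a few letter patterns and picks the closest match. The 3x5 letter patterns and the names `TEMPLATES`, `score`, and `recognize` are all made up for this illustration.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Hypothetical 3x5 letter patterns; '#' is ink, '.' is blank paper.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def score(glyph, template):
    """Count the pixels where the scanned shape and the pattern agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the letter whose pattern agrees with the shape on the most pixels."""
    return max(TEMPLATES, key=lambda letter: score(glyph, TEMPLATES[letter]))


# A scanned "T" with one stray ink mark on the fourth row.
noisy_t = ["###", ".#.", ".#.", ".##", ".#."]
print(recognize(noisy_t))
```

Even with the stray mark, the shape still agrees with the “T” pattern on more pixels than with “I” or “L”, so the closest match wins. That is the general idea behind a program guessing a letter from an imperfect scan.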
There are limitations to OCR programs, but their ability to produce nearly error-free results is amazing. With Nuance’s OmniPage Pro (formerly by ScanSoft), I was able to scan hundreds and hundreds of pages of life stories that some of my relatives had typed on an old manual typewriter. Though the program tried to make letters out of ink marks and the occasional coffee stain, the results were quite accurate. It even left the misspelled words as they were typed, allowing me to choose whether to keep my relatives’ phonetic attempts at spelling or correct them. This saved months of agonizing retyping of stories and documents I want to post on my genealogy blog.
SourceForge.net has the download site for Tesseract OCR, and I’ll be installing it soon and putting it through its paces. I’ll report on how it does, though if you have used it or are familiar with OCR programs, I’d love to hear your input and experiences.