
Google Releases Tesseract OCR Open Source Software

SearchEngineWatch announces “Google Opens Tesseract OCR Software”, which is exciting news for those of us who scan or want to convert a lot of documents to text:

The Google Code Blog announced that Google has “re-released” the Tesseract OCR software to the open source community. OCR, optical character recognition, is the technology for converting text on a physical paper into computer based text. So if you have a ton of papers you typed up in your college days and you want them stored in digital format, you can use OCR to translate those documents for you.

OCR (Optical Character Recognition) converts image scans of documents into text. Bitmaps, TIFF, and other image scans can be imported into the program and the software crawls through the images to detect recognizable letters of the alphabet.
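
For readers who want to try this from a script, here is a minimal sketch of driving the tesseract command-line tool from Python. It assumes a present-day Tesseract install whose executable is on the PATH and which accepts the form "tesseract IMAGE OUTPUTBASE -l LANG"; the 2006 release discussed in this post may behave differently. The helper names (build_tesseract_command, ocr_image) and the file names are made up for illustration, and the "deu" language code only works if that language's data file is installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for the tesseract command-line tool.

    Hypothetical helper: assumes the CLI form
    `tesseract IMAGE OUTPUTBASE [-l LANG]`.
    """
    cmd = ["tesseract", str(image_path), str(output_base)]
    if lang:
        # Language code such as "eng" or "deu"; the matching
        # language data must already be installed.
        cmd += ["-l", lang]
    return cmd


def ocr_image(image_path, output_base, lang=None):
    """Run tesseract on one scanned image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        check=True,
    )
    # Tesseract appends ".txt" to the output base name.
    # UTF-8 output is an assumption that holds for current releases.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling ocr_image("page1.tif", "page1") would, under those assumptions, write page1.txt next to the script and return its contents. Building the command separately from running it makes it easy to check the arguments without having Tesseract installed.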

There are limitations to OCR programs, but their ability to generate nearly accurate results is amazing. With Nuance’s OmniPage Pro (formerly by ScanSoft), I was able to scan hundreds and hundreds of pages typed on an old manual typewriter by some of my relatives about their life stories, and though it tried to make letters out of ink marks and the occasional coffee stain, the results were quite accurate. It even preserved the misspelled words as typed, allowing me to choose whether to keep their phonetic attempts at spelling or correct them. This saved months of agonized retyping of stories and documents I want to post on my genealogy blog.

SourceForge.net has the download site for Tesseract OCR, and I’ll be installing it soon and putting it through its paces. I’ll report on how it does, though if you have used it or are familiar with OCR programs, I’d love your input and experiences.


Copyright Lorelle VanFossen, member of the 9Rules Network


2 Comments

  1. Bikash
    Posted November 9, 2006 at 5:38 am | Permalink

    I was just working with the open source Tesseract OCR. After recognition I found that
    it does not recognize umlauts, beta (ß), or any other German characters.
    Is it a problem with the classifier file, i.e., tessdata\tessconfigs\batch?

    Is there any other classifier file that supports the German language?

  2. Posted November 9, 2006 at 10:25 am | Permalink

    You’ll have to ask elsewhere. I’ve only done the smallest test in English with it. I know nothing about the recognition factors.

