TesseractIndic @ foss.in 2008

October 15, 2008 at 3:08 pm (Uncategorized)

All my work is documented in detail on http://debayanin.googlepages.com/hackingtesseract . The latest entry is specifically for people who want to join the effort. Please go through it and comment:

Note: TesseractIndic is Tesseract-OCR with Indic script support. This will remain a separate project until Tesseract-OCR actually decides to accept patches and merge Indic script support. TesseractIndic can be found here.

So let's see where we stand. We have Tesseract-OCR, which works great for English. I managed to apply “maatraa clipping” (which is a new term/approach in the world of OCR, I think!) successfully, as a proof of concept, to the image being fed to the Tesseract OCR engine. Accuracy obtained by this method, along with some really crappy training, stands at about 85%.

A standard OCR process contains the following steps:

(1) Pre-processing, involving skew removal, etc. Pretty much
language-independent, though features like the shirorekha
might help here.
(2) Character extraction: Again, largely language-independent,
though language dependency might come in because of
features like shirorekha.
(3) Character identification: Language independent, maybe with
specialised plugins to take advantage of language features,
or items like known fonts.
(4) Post-processing, which involves things like spell-checking to
improve accuracy.

The currently available version of Tesseract OCR does steps 3 and 4 above for any language. But it can only do that if it can do step 2 properly, which it can't for connected scripts like Hindi, Bengali, etc. So the approach is to take the scanned image, apply some pre-processing to it, and then do the “maatraa clipping” operation on it. The resulting image is then fed to the Tesseract-OCR engine.

In detail, the things to do are:

(1) Pre-processing: Skew removal, Noise removal. Skew removal in particular is key for the “maatraa clipping” code to work.

(2) “maatraa clipping”: This enables the Tesseract-OCR engine to treat connected Devanagari script like any other script.

(3) Training: Very important for getting good results, but well documented. Good tools exist for training Tesseract-OCR.

(4) Web Interface: We need to create a web interface so people can freely OCR their documents online. No big deal.

Now my intention is to implement skew removal using Hough transforms. Hough transforms are really good at finding straight lines (among other shapes) in images. So all we need to do is find the “maatraas” and calculate their slope. That gives us the skew angle, and we just rotate the page to correct the skew.
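To make the idea concrete, here is a minimal sketch of Hough-style skew estimation, using only NumPy. It is not the TesseractIndic code. The function name `estimate_skew_degrees` and the parameters `max_skew` and `step` are made up for illustration, and it assumes a binarised page where ink pixels are nonzero and the skew is small. Each ink pixel votes for a line (angle, offset), and the angle whose best line collects the most votes is taken as the skew.

```python
import numpy as np

def estimate_skew_degrees(binary, max_skew=15.0, step=0.25):
    """Estimate the skew of near-horizontal lines with a Hough-style vote.

    binary   : 2-D array, nonzero where there is ink (assumption).
    max_skew : largest skew, in degrees, that is searched (hypothetical default).
    step     : angular resolution of the search, in degrees.
    Returns the angle in degrees; positive means the lines run downwards
    to the right in image coordinates (y grows downwards).
    """
    ys, xs = np.nonzero(np.asarray(binary))
    if ys.size == 0:
        return 0.0  # blank page, nothing to measure

    best_angle, best_votes = 0.0, -1
    for angle in np.arange(-max_skew, max_skew + step, step):
        t = np.deg2rad(angle)
        # Along a line tilted by `angle`, this offset stays constant,
        # so pixels on the same line fall into the same accumulator bin.
        rho = ys * np.cos(t) - xs * np.sin(t)
        bins = np.round(rho).astype(int)
        bins -= bins.min()
        votes = np.bincount(bins).max()  # strongest line at this angle
        if votes > best_votes:
            best_angle, best_votes = float(angle), votes
    return best_angle


if __name__ == "__main__":
    # Synthetic "headline": one straight line tilted by 5 degrees.
    page = np.zeros((200, 400), dtype=np.uint8)
    for x in range(400):
        page[int(round(50 + x * np.tan(np.deg2rad(5.0)))), x] = 1
    print(estimate_skew_degrees(page))
```

Once the angle is known, the page can be rotated back, for example with `Image.rotate` from the Python Imaging Library. The sign of the rotation depends on the convention used, so it is worth checking on a test image first.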

I had implemented “maatraa clipping” using projection-based methods. It seems there is a digital image processing technique called “Morphological Operations” that may be a better way of doing it. Well, actually I am not that sure about it yet. Still researching and trying out stuff.
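As an illustration of what a projection-based approach could look like, here is a rough sketch in Python with NumPy. It is an assumption about the general technique, not the actual TesseractIndic implementation. It takes “maatraa clipping” to mean cutting the headline (shirorekha) wherever no stroke hangs below it, so that the characters of a word come apart. The function name `clip_headline` and the `band_fraction` threshold are hypothetical, and the input is assumed to be a single deskewed text line with nonzero ink pixels.

```python
import numpy as np

def clip_headline(line_img, band_fraction=0.8):
    """Cut the headline of one text line between characters.

    line_img      : 2-D array for a single deskewed text line, nonzero = ink.
    band_fraction : rows with at least this fraction of the peak row count
                    are treated as the headline band (hypothetical threshold).
    Returns a boolean array of the same shape with the headline removed
    in every column that has no ink below the band.
    """
    ink = np.asarray(line_img) != 0
    row_profile = ink.sum(axis=1)  # horizontal projection: ink per row
    if row_profile.max() == 0:
        return ink.copy()  # empty line, nothing to clip

    # The headline shows up as the rows with the most ink.
    band = np.nonzero(row_profile >= band_fraction * row_profile.max())[0]
    top, bottom = band.min(), band.max()

    # Vertical projection of everything below the headline band.
    below = ink[bottom + 1:, :].sum(axis=0)

    clipped = ink.copy()
    # Where nothing hangs below the headline, remove the headline pixels.
    clipped[top:bottom + 1, below == 0] = False
    return clipped


if __name__ == "__main__":
    # Two block "characters" joined by a full-width headline on row 2.
    word = np.zeros((10, 12), dtype=np.uint8)
    word[2, :] = 1
    word[3:9, 1:4] = 1
    word[3:9, 7:10] = 1
    print(clip_headline(word).astype(int))
```

This ignores anything sitting above the headline and assumes the skew has already been removed, which is why skew removal matters so much for this step.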

Now, I had done all this work in C++, as the Tesseract-OCR code is also in C++. But, of late, I have been mesmerised by the simplicity and power of Python, and the Python Imaging Library. All the work I am doing now, including Hough transforms, is in Python. So now we have 2 options:

(1) Do the pre-processing and “maatraa clipping” in Python and feed the page to Tesseract-OCR (will be easier and quicker to implement)

(2) Do the entire thing in C++ (will execute much faster)
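For option (1), the hand-off from Python to the OCR engine could be as simple as calling the `tesseract` command-line tool on the pre-processed image. The sketch below assumes the usual `tesseract <image> <outputbase> -l <lang>` invocation, which writes its result to `<outputbase>.txt`. The helper names and the file names are made up for illustration, and the language code to pass depends on how the training data was named.

```python
import subprocess

def build_tesseract_command(image_path, output_base, lang=None):
    """Build the command line for the tesseract CLI (hypothetical helper)."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        # The language code must match the installed training data (assumption).
        cmd += ["-l", lang]
    return cmd

def run_ocr(image_path, output_base, lang=None):
    """Run tesseract on an already pre-processed image and return the text."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    # tesseract writes the recognised text to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # "clipped_page.tif" and "hin" are placeholder values.
    print(build_tesseract_command("clipped_page.tif", "result", "hin"))
```

Keeping the command construction separate from the call makes it easy to check what will be run before the engine is even installed.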

Again, we will probably end up doing both. At foss.in, I will probably bring along Python code that already works, and ask people to port it to C++ and merge it upstream into TesseractIndic. Or we could ask people to implement algorithms of their choice, in the language of their choice, on a common set of test images, and then convert that stuff to C++ and add it.
