skew~de-skew… ufffff

June 13, 2008 at 4:19 pm (Uncategorized)

I have been working on adding Indic script support to the Tesseract OCR engine for the last two weeks or so. I have maintained a more or less detailed log of the progress made so far at http://debayanin.googlepages.com/hackingtesseract .

On the advice of Mr. Sankarshan (who, btw, is mentoring me), I opened a developer account at code.google.com. I will soon start using the hosting space at http://code.google.com/p/tesseractindic/ properly, until my patches get accepted by the Tesseract maintainers.

The key to the project was the maatraa clipping code. But for maatraa clipping to work, the page image must be absolutely straight, i.e., there must be no skew. Hence I set out to write the code for finding the skew angle of a page and then de-skewing it. I ultimately did manage something on my own, but the results are less than satisfying.

On Googling, I found that there are many research papers on how to find the skew angle of a scanned image, but no code. Almost all of them use Hough transforms. I did not have the time or the patience to understand the technique in detail, though I will do that later. Hence I thought up an algorithm of my own, which is not that robust but works for most images.
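To give a feel for what skew-angle finding without a Hough transform can look like, here is a minimal sketch of the classic projection-profile approach. This is not the algorithm from this project; it is a stand-in technique, and the function names (`make_skewed_page`, `profile_score`, `estimate_skew`), the angle range and the step size are all assumptions made for illustration. The idea: try a set of candidate angles, undo each one by shifting every ink pixel vertically, and keep the angle at which the ink piles up into the fewest, sharpest rows.

```python
import math


def make_skewed_page(width, height, angle_deg, line_gap=8):
    """Build a synthetic binary page (hypothetical test data):
    one-pixel-thick 'text lines' tilted by angle_deg."""
    slope = math.tan(math.radians(angle_deg))
    page = [[0] * width for _ in range(height)]
    for base in range(line_gap, height, line_gap):
        for x in range(width):
            y = int(round(base + x * slope))
            if 0 <= y < height:
                page[y][x] = 1
    return page


def profile_score(page, angle_deg):
    """Sum of squared row counts after undoing a candidate skew.
    The score is highest when ink concentrates in few rows."""
    slope = math.tan(math.radians(angle_deg))
    counts = {}
    for y, row in enumerate(page):
        for x, pixel in enumerate(row):
            if pixel:
                # Shift the pixel back to the row it would occupy
                # if the page were skewed by exactly angle_deg.
                r = int(round(y - x * slope))
                counts[r] = counts.get(r, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(page, max_angle=5.0, step=0.5):
    """Brute-force search over candidate angles (assumed range and step)."""
    steps = int(round(max_angle / step))
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda a: profile_score(page, a))


if __name__ == "__main__":
    page = make_skewed_page(200, 100, 2.0)
    print("estimated skew (degrees):", estimate_skew(page))
```

A brute-force search like this is slow on full-size scans and only as precise as the step size, so it shares the "works for most images, not that robust" character described above.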

The de-skewing part was relatively easy. I had originally imagined just the opposite, though.
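Once a skew angle is known, de-skewing is essentially a rotation of the image about its centre. The sketch below is not the code from this project; it is a minimal pure-Python illustration that assumes a binary page stored as a list of rows, a hypothetical function name `deskew`, and nearest-neighbour sampling. It uses inverse mapping: for every output pixel it computes where that pixel came from in the skewed image, which avoids holes in the result.

```python
import math


def deskew(page, angle_deg):
    """Rotate a binary page (list of rows of 0/1) about its centre so that
    lines tilted by angle_deg (y growing downwards) become horizontal."""
    height = len(page)
    width = len(page[0])
    cx = (width - 1) / 2.0
    cy = (height - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))

    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx = x - cx
            dy = y - cy
            # Inverse mapping: find the source pixel in the skewed image.
            sx = int(round(cx + dx * cos_a - dy * sin_a))
            sy = int(round(cy + dx * sin_a + dy * cos_a))
            # Pixels that map outside the source stay background (0).
            if 0 <= sx < width and 0 <= sy < height:
                out[y][x] = page[sy][sx]
    return out


if __name__ == "__main__":
    # Hypothetical example: a single line tilted by 2 degrees.
    w, h = 101, 41
    slope = math.tan(math.radians(2.0))
    skewed = [[0] * w for _ in range(h)]
    for x in range(w):
        skewed[int(round(20 + (x - 50) * slope))][x] = 1
    straight = deskew(skewed, 2.0)
    print("ink pixels on the middle row:", sum(straight[20]))
```

Nearest-neighbour sampling keeps the sketch short but can leave jagged edges on real glyphs; an image library with interpolated rotation would normally be the better choice.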

Well, it was tough, but I had fun. I will have to study, understand and implement Hough transforms for this, as well as for de-italicising the image, later. For now, I will live with this, assume that the images are not skewed, and proceed to train the engine with Bengali fonts. This step will finally make it usable.

I will now go ahead and update the details.


2 Comments

  1. extelopedia said,

  2. jinsbond007 said,

    hi debayan,

    I simply went through the hacker documentation of Tesseract. It appears that there is de-skewing code already. I haven’t tried it, but do check how it works as well.

    cheers
