June 13, 2008

I have been working on adding Indic script support to the Tesseract OCR engine for the last two weeks or so. I have kept a fairly detailed log of the progress made so far at .

On the advice of Mr. Sankarshan (who, btw, is mentoring me), I started a developer account at . Hence I will soon start using the hosting space @ properly, till my patches get accepted by the Tesseract maintainers.

The key to the project was the maatraa clipping code. But for maatraa clipping to work, the page image must be absolutely straight, i.e., there must be no skew. Hence I set out to write the code for finding the skew angle of a page and then de-skewing it. I ultimately did manage something on my own, but the results are less than satisfactory.

On Googling, I found that there are many research papers on how to find the skew angle of a scanned image, but no code. Almost all of them use Hough transforms. I did not have the time or the patience to understand them in detail. I will do that later though. Hence I thought up an algorithm of my own, which is not that robust, but works for most images.
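For readers looking for code, here is a minimal sketch of one common non-Hough approach, a projection-profile search. It is not the algorithm mentioned in this post (which is not shown here) and not anything from Tesseract; the function names, the `max_angle` and `step` parameters, and the input format (a list of `(x, y)` coordinates of dark pixels) are all assumptions made for illustration. The idea: rotate the pixels by each candidate angle, count how many land in each row, and keep the angle whose row counts are the most sharply peaked.

```python
import math

def profile_score(points, angle_deg):
    """Sum of squared row counts after rotating points by angle_deg.

    Text lines that are horizontal pile their pixels into few rows,
    which makes this score large.
    """
    t = math.radians(angle_deg)
    s, c = math.sin(t), math.cos(t)
    rows = {}
    for x, y in points:
        r = round(x * s + y * c)  # row index of the rotated pixel
        rows[r] = rows.get(r, 0) + 1
    return sum(n * n for n in rows.values())

def estimate_skew(points, max_angle=10.0, step=0.5):
    """Return the rotation (in degrees) that gives the sharpest row profile.

    Brute-force search over [-max_angle, +max_angle]; both limits are
    illustrative defaults, not values taken from this project.
    """
    steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(steps + 1)]
    return max(candidates, key=lambda a: profile_score(points, a))

def synthetic_lines(skew_deg, n_lines=5, width=200, spacing=20):
    """Hypothetical test data: straight 'text lines' rotated by skew_deg."""
    t = math.radians(skew_deg)
    s, c = math.sin(t), math.cos(t)
    pts = []
    for k in range(n_lines):
        y0 = k * spacing
        for x in range(width):
            pts.append((x * c - y0 * s, x * s + y0 * c))
    return pts
```

The value returned by `estimate_skew` is the rotation that straightens the rows, so for a page skewed by some angle it comes out as roughly the negative of that angle. The search is brute force and sensitive to the step size, so treat it as a starting point only.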

The de-skewing part was easier by comparison, though I had originally imagined just the opposite.

Well, it was tough, but I had fun. I will have to study, understand and implement Hough transforms for this, as well as for de-italicising the image, later. For now, I will live with this, assume that the images are not skewed, and proceed to train the engine with Bengali fonts. That step will finally make it usable.

Will now go ahead and update the details.



    I just went through Tesseract's hacker documentation. It appears that there is deskewing code already. I haven’t tried it, but do check how it works too.


