Jyoti Choudhury's diary: Started with Tesseract

Right now, all the Assamese newspapers that are available online are published as images (e.g. www.asomiyapratidin.in), which means they are not searchable. So I had the idea of converting these image-based newspapers into actual text. I started looking around for an optical character recognition (OCR) tool, which converts the letters in an image into text. From what I found, the best OCR tool for both training and everyday use is Tesseract, which is maintained by Google. But when I checked the languages Tesseract ships with, I found that Assamese is not supported.

So now my little idea has grown into a bigger job. I will first have to train Tesseract on a new language, and only then use it to convert the images to text. Fortunately, Bengali is already available as trained language data, so I will not have to replace the entire top layer of the model; I will only have to retrain it on a few characters.
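As a rough baseline before any retraining, the existing Bengali data can be tried on an Assamese page. Below is a minimal sketch, assuming the pytesseract wrapper and Pillow are installed alongside Tesseract itself. The helper names `pick_language` and `ocr_image` are made up for illustration, and `asm` and `ben` are assumed to be the language codes for Assamese and Bengali.

```python
from typing import Iterable


def pick_language(installed: Iterable[str],
                  preferred: str = "asm",
                  fallback: str = "ben") -> str:
    """Return the preferred language code if installed, else the fallback."""
    available = set(installed)
    if preferred in available:
        return preferred
    if fallback in available:
        return fallback
    raise LookupError(f"neither {preferred!r} nor {fallback!r} is installed")


def ocr_image(path: str) -> str:
    """Run OCR on one image with the best language data that is installed."""
    # Imported lazily so pick_language() works without these packages.
    import pytesseract
    from PIL import Image

    # Ask the local Tesseract install which language data it has.
    lang = pick_language(pytesseract.get_languages(config=""))
    return pytesseract.image_to_string(Image.open(path), lang=lang)
```

Calling `ocr_image("page.png")` (a hypothetical file name) would use the Assamese data if it is ever installed and fall back to Bengali otherwise, so the same script keeps working once the retrained data exists.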

So much for today. Let's see how it goes. Got some office work too.

[Edit] The OCR tool is now available at https://ocr.jyotichoudhury.com

Saturday, May 20, 2017
