
OCR Image Preprocessing

I have recently been working on a cool enhancement to our document imaging system: adding OCR and bar code functionality to the package for indexing purposes. Historically the system has relied on live operators to index images, with the occasional direct interface where we push images straight into the system through back office integrations. Mostly, though, live operators review the images and attach index information.

To expand the implementation to handle higher page counts (more stuff), we really needed to figure out how to reduce the page touches. A review of the images we were indexing made it clear that a significant number were machine printed and of decent quality - good candidates to try OCR on.

This post really isn't about the OCR implementation, but I should give credit where it is due. I use Tesseract because it is free and there is a Windows port called tessnet2. I love Tesseract for two reasons: one, it works, and two, it is free. It works really well if you take the time to read through the various configuration parameters and train the engine on your images. There is one "problem" with the engine, though, and that is how it locates "blobs" on the canvas. Through a process involving magic and pixie dust, the engine selects items to group together for text recognition. Whatever magic it uses works quite well when you have black text on a white page with no noise. However, when the page has noise or handwritten markings on it, the results can be less than desirable. I have learned through the Google group that this is because the engine tries to normalize the height of the objects, and when objects in the same horizontal plane have varying heights, bad things happen.

Now this leads me to my problem. By the time my imaging system sees the images, the last person to handle them was a truck driver. Now I love truck drivers just as much as the next guy - my life revolves around stuff they bring and I am paid to help them make it happen - but the quality of their documents is not typically high on their list of priorities. The images come through with markings or handwritten numbers on them, and often they have been faxed at some point. The end result is that a lot of my images have noise on them, and this noise jacks with the OCR engine's ability to perform.

I suppose I could have started by looking for a "better" engine, but in this case I really felt the problem was the quality of the original crop being passed to the engine, so I elected to solve that instead.

I opted to author some routines that evaluate the objects on the canvas before sending them to the OCR engine. In that evaluation the routine needs to discard whatever it determines is noise, keeping only the good data. I did this with the help of an amazing set of tools from Inlite Research - if you are remotely interested, get the demo. Inlite has some great cleanup tools that work quite well, but on their own they could not remove the noise I needed removed, because what I was after was specific to my implementation.

In short, my process looks through the objects on the canvas and discards anything shorter than a profile-based value. It then builds a collection of the items left on the canvas and discards any item that does not have a neighbor in the same horizontal plane within a distance set by a profile option. That leaves a much smaller set of objects to consider, and from those we discard anything whose height falls outside the standard deviation of the remaining heights (safe because my values will mostly be numeric, so the characters are all about the same height). What is left is a much cleaner version of the image that can be passed to the OCR engine to do its work.
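To make those three passes concrete, here is a rough sketch in Python. It is not the production code, and it does not use the Inlite tools or Tesseract at all: it assumes the objects have already been located and reduced to bounding boxes. The `Blob` class, `filter_blobs`, and the `min_height` and `max_gap` parameters are names I made up to stand in for the profile values. Reading "same horizontal plane" as "the boxes overlap vertically" and "outside the standard deviation" as "more than one standard deviation from the mean height" are also assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Blob:
    """Hypothetical bounding box of one object found on the canvas (pixels)."""
    x: int
    y: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.x + self.width

    @property
    def bottom(self) -> int:
        return self.y + self.height


def same_row(a: Blob, b: Blob) -> bool:
    """Assumed meaning of 'same horizontal plane': the boxes overlap vertically."""
    return a.y < b.bottom and b.y < a.bottom


def horizontal_gap(a: Blob, b: Blob) -> int:
    """Empty pixels between two boxes along x (0 if they touch or overlap)."""
    return max(0, max(a.x, b.x) - min(a.right, b.right))


def filter_blobs(blobs, min_height, max_gap):
    # Pass 1: drop anything shorter than the profile's minimum height.
    tall = [b for b in blobs if b.height >= min_height]

    # Pass 2: drop anything with no neighbor in the same row within max_gap.
    neighbored = [
        b for b in tall
        if any(o is not b and same_row(b, o) and horizontal_gap(b, o) <= max_gap
               for o in tall)
    ]

    # Pass 3: drop anything more than one standard deviation from the mean height.
    if len(neighbored) < 2:
        return neighbored
    heights = [b.height for b in neighbored]
    mu, sigma = mean(heights), pstdev(heights)
    return [b for b in neighbored if abs(b.height - mu) <= sigma]
```

Calling `filter_blobs(blobs, min_height=10, max_gap=5)` on a row of printed digits with a speck, a lone stamp, and a tall handwritten scrawl next to the digits should keep only the digits: the speck fails the first pass, the stamp fails the second, and the scrawl fails the third. The last pass only behaves well when the good characters are all about the same height, which is why it suits mostly numeric values.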

Here is a sample of the before and after...

The end result of these enhancements is that our imaging department can process significantly higher volumes of documents without increasing staff. The enhancement was a fun process to work on and is very much still a work in progress. If you have any questions or thoughts about how this works feel free to reach out to me.

Posted in: Neat Projects

Copyright © 2017 Copyright 2010 by Austin Henderson