When you need to extract text from images, you expect Google to have the best solution. They provide OCR (optical character recognition), but even such a great company didn’t have the best approach to processing the resulting data, so we built a solution ourselves.
We needed an OCR feature for one of our NodeJS projects, so we chose Google Vision OCR. As described on their API page, it implements an intelligent segmentation method that sorts the extracted text into blocks, paragraphs, and so on. I’m not going to lie, it was good, but we needed something a little different. This is an example of how Google’s algorithm works:
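Google Vision exposes that segmentation as a `fullTextAnnotation` hierarchy (pages → blocks → paragraphs → words → symbols, each with a bounding polygon). A minimal sketch of walking it, using a hand-built stand-in for a real API response:

```javascript
// Walk Google Vision's fullTextAnnotation hierarchy and flatten it
// into words with their bounding polygons. The `annotation` object
// below is a mocked stand-in for a real API response.
function extractWords(annotation) {
  const words = [];
  for (const page of annotation.pages) {
    for (const block of page.blocks) {
      for (const paragraph of block.paragraphs) {
        for (const word of paragraph.words) {
          words.push({
            // Each symbol holds one recognized character.
            text: word.symbols.map((s) => s.text).join(''),
            box: word.boundingBox.vertices, // [{x, y}, …], clockwise from top-left
          });
        }
      }
    }
  }
  return words;
}

// Mocked response: one page, one block, one paragraph, two words.
const annotation = {
  pages: [{
    blocks: [{
      paragraphs: [{
        words: [
          {
            symbols: [{ text: 'H' }, { text: 'i' }],
            boundingBox: { vertices: [{ x: 0, y: 0 }, { x: 20, y: 0 }, { x: 20, y: 10 }, { x: 0, y: 10 }] },
          },
          {
            symbols: [{ text: 'O' }, { text: 'C' }, { text: 'R' }],
            boundingBox: { vertices: [{ x: 25, y: 0 }, { x: 70, y: 0 }, { x: 70, y: 10 }, { x: 25, y: 10 }] },
          },
        ],
      }],
    }],
  }],
};

console.log(extractWords(annotation).map((w) => w.text)); // → [ 'Hi', 'OCR' ]
```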
We had a distinctive type of document with lots of whitespace to extract information from, and we needed to pair sentences on the same line to auto-complete data for a better user experience. At first it looked easy, but given the lack of consistency across documents, the crappy scans, and the noise in images from clients trying to use the feature, we decided to focus on processing the original content rather than on post-processing. We found a NodeJS package that claimed to solve the problem of aligning text from an image and understanding which words should be returned paired or as a sentence. We tried it; it was a start, but far from what we expected.
Finally, we came up with a solution in NodeJS: we built a workaround on top of Google’s algorithm and approached the data from their OCR differently. This is how it works:
- Merge words/characters that are very close: the first stage concatenates nearby characters into words and sentences when their bounding polygons almost merge. This phase helps reduce the computation needed in the next steps.
- Create a bounding polygon: the second stage creates an imaginary coordinate system with each word/sentence in a polygon (as in the image below).
- Combine bounding polygons: the third stage parses the data and inlines the elements. The algorithm tries to fit words onto single lines, creating a bigger polygon for each line (image below).
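The stages above can be sketched roughly like this: group words into lines by the vertical overlap of their boxes, grow a combined bounding box per line, then read each line left to right. The function names and the overlap threshold are illustrative assumptions, not the published package’s API:

```javascript
// Sketch of stages 2–3: group words into lines and combine their
// bounding boxes. Boxes use {x, y, w, h}; thresholds are assumptions.
function sameLine(a, b) {
  // Two boxes share a line if their vertical spans overlap by more
  // than half of the smaller box's height.
  const overlap = Math.min(a.y + a.h, b.y + b.h) - Math.max(a.y, b.y);
  return overlap > 0.5 * Math.min(a.h, b.h);
}

function groupIntoLines(words) {
  const lines = [];
  for (const word of words) {
    const line = lines.find((l) => sameLine(l.box, word.box));
    if (line) {
      line.words.push(word);
      // Grow the line's combined bounding box to include the new word.
      const b = line.box, w = word.box;
      const x1 = Math.min(b.x, w.x), y1 = Math.min(b.y, w.y);
      const x2 = Math.max(b.x + b.w, w.x + w.w);
      const y2 = Math.max(b.y + b.h, w.y + w.h);
      line.box = { x: x1, y: y1, w: x2 - x1, h: y2 - y1 };
    } else {
      lines.push({ words: [word], box: { ...word.box } });
    }
  }
  // Left-to-right order inside each line, top-to-bottom across lines.
  for (const line of lines) line.words.sort((a, b) => a.box.x - b.box.x);
  lines.sort((a, b) => a.box.y - b.box.y);
  return lines.map((l) => l.words.map((w) => w.text).join(' '));
}

// A form-like layout: labels on the left, values far to the right,
// separated by the kind of whitespace Google splits into blocks.
const words = [
  { text: 'Name:', box: { x: 0, y: 0, w: 50, h: 12 } },
  { text: 'Doe', box: { x: 300, y: 1, w: 40, h: 12 } },
  { text: 'Date:', box: { x: 0, y: 30, w: 45, h: 12 } },
  { text: '2020', box: { x: 300, y: 31, w: 40, h: 12 } },
];

console.log(groupIntoLines(words)); // → [ 'Name: Doe', 'Date: 2020' ]
```

This is exactly the pairing we wanted: a label and its value end up in the same line even when a large whitespace gap makes Google return them as separate blocks.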
We had lots of documents with lots of whitespace, so we found a way to pair our content. We did this for our own purposes at first, but then decided to make it public. The algorithm can also remove from the original content any words/sentences you want to get rid of before analyzing the data. OCR is not perfect, so we also implemented the Levenshtein distance algorithm to help match the desired text.
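Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, which makes it a good fit for matching noisy OCR tokens against expected keywords. A self-contained sketch (the `isMatch` helper and its threshold of 2 edits are illustrative assumptions):

```javascript
// Standard Levenshtein edit distance via dynamic programming.
// dp[i][j] = edits needed to turn a[0..i) into b[0..j).
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Treat an OCR token as a match if it is within 2 edits of the target.
const isMatch = (ocrWord, target) => levenshtein(ocrWord, target) <= 2;

console.log(levenshtein('lnvoice', 'invoice')); // → 1 ("i" misread as "l")
console.log(isMatch('Tota1', 'Total'));         // → true
```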
At the moment it is not the most sophisticated algorithm for processing data returned from an OCR, but we plan to make it so. We are going to implement extended calculations to match any text orientation in the image so pairing stays accurate, and extended analysis of boundary angles to handle any input image, even a simple photo of a document on your table. As a bigger goal, we want the algorithm to perform perfect grouping of text: to tell you exactly where it is and whether it’s a page, title, paragraph, block, or anything else.
Below you can find the repository: