OCR

Perhaps there is a very easy solution and I am simply missing something (at least I hope so).

Invariably, we have customers who come in with forms/menus/documents that they want edited and reproduced, but they no longer have them on file.

Although I am a relatively quick typist, OCR appeals to me in more than one way (it is still faster, and arguably more accurate, e.g., fewer word omissions and such).

Anyways, my issue is importing OCR scans into (or exporting to) Corel Draw. We have been using ABBYY Fine Reader v9.0 (tried 10 but it gave me a few problems).

Saving scans to a PDF never seems to work well, because Corel seems to be incapable of importing PDFs without screwing things up. Regardless of whether I import text as "text" or "curves," the formatting comes through so badly that the file(s) are worthless given the amount of reformatting necessary.

Creating an MS Word document poses serious import issues with Corel Draw as well (although copying and pasting small sections into Corel can work on smaller jobs).

And of course, less sophisticated formats (e.g., rich text format) do not help as far as formatting goes.

My Question:

Is there a program/macro/setting that will create a Corel Draw ready OCR scan where I do not have to reformat just about everything?

Thank you.

Nick


arronlee over 10 years ago

Hi,

I am new to OCR tools, but I have some information about them to share with you, and I hope it helps:

Actually, there are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters.
Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching" or "pattern recognition". This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique the early physical photocell-based OCR implemented, rather directly.
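To make the matrix matching idea concrete, here is a minimal sketch in Python. The 3x5 templates and the `pixel_agreement` and `rank_candidates` names are made up for illustration and are not taken from any real OCR engine. It assumes the glyph has already been isolated and scaled to the template size, which is exactly the precondition described above.

```python
# Hypothetical 3x5 stored glyph templates ('#' = ink, '.' = background).
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def pixel_agreement(glyph, template):
    """Fraction of pixels that are identical in two same-sized bitmaps."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for a, b in zip(glyph_row, template_row):
            total += 1
            same += (a == b)
    return same / total


def rank_candidates(glyph, templates=TEMPLATES):
    """Return (character, score) pairs, best match first."""
    scores = [(ch, pixel_agreement(glyph, t)) for ch, t in templates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# A scanned "T" with one stray ink pixel in the middle row.
noisy_t = ("###", ".#.", ".##", ".#.", ".#.")
ranked = rank_candidates(noisy_t)
print(ranked[0][0])  # best candidate
```

The sorted list is the "ranked list of candidate characters" mentioned earlier. Because the comparison is purely pixel by pixel, a different font or scale would lower every score, which is why this approach struggles with unfamiliar fonts.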
Feature extraction decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. These are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and indeed most modern OCR software. Nearest neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match.
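As a rough sketch of the nearest-neighbour step, the Python below classifies a glyph from a small feature vector. The feature choice (closed loops, stroke endpoints, junctions), the prototype values, and the `classify` function are all assumptions made for illustration. A real engine extracts many more features from the image itself, and that extraction step is not shown here.

```python
import math
from collections import Counter

# Hypothetical stored prototypes: (closed loops, stroke endpoints, junctions).
PROTOTYPES = [
    ("O", (1, 0, 0)),
    ("C", (0, 2, 0)),
    ("6", (1, 1, 1)),
    ("6", (1, 2, 1)),
    ("8", (2, 0, 1)),
    ("8", (2, 0, 2)),
]


def classify(features, k=3, prototypes=PROTOTYPES):
    """Majority vote among the k prototypes nearest to `features`."""
    by_distance = sorted(prototypes, key=lambda p: math.dist(features, p[1]))
    votes = Counter(label for label, _ in by_distance[:k])
    return votes.most_common(1)[0][0]


# A smudged "8": both loops survive, but noise adds an endpoint and a junction.
print(classify((2, 1, 2)))
```

Because the comparison is made on abstract features rather than raw pixels, the same prototypes can tolerate changes in font and scale that would defeat pixel-by-pixel matching.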
Software such as Cuneiform and Tesseract uses a two-pass approach to character recognition. The second pass is known as "adaptive recognition" and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).

You can refer to some professional OCR SDKs for help. It is best to try a free trial package first, following its tutorial page about OCR using C#.NET, and then choose one whose processing is simple and fast. It can save you a lot of time. I hope you succeed. Good luck.

Best regards,

Arron

harryLondon over 10 years ago in reply to arronlee

One thing to be very careful about when using OCR -- often, the character recognition is relatively poor, but is enhanced by comparing the results with common dictionary words to decide what a fuzzy character within the word is most likely to be.

This works well for paragraph text but is likely to give errors in numeric data that appears within the text, because a 6 may differ from an 8 by only a few pixels and there is no equivalent way to decide whether one is more likely than the other.

People are often fooled by the apparent accuracy of the recognised text into assuming that the result is a faithful reproduction of the document, but in practice text from OCR needs to be proofread even more carefully than typed-in text.
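A small Python sketch of why this happens, under the assumption that the recogniser hands back a ranked list of candidates for the fuzzy character. The word list and the `resolve` function are hypothetical and do not reflect how any particular OCR product works.

```python
# Hypothetical word list; a real engine would use a full language dictionary.
DICTIONARY = {"price", "total", "menu", "table"}


def resolve(token, position, candidates, dictionary=DICTIONARY):
    """Try each candidate at `position`; return (text, needs_review).

    `candidates` is ordered best guess first, as ranked by the recogniser.
    """
    options = [token[:position] + c + token[position + 1:] for c in candidates]
    matches = [o for o in options if o.lower() in dictionary]
    if len(matches) == 1:
        return matches[0], False  # the dictionary settles the ambiguity
    return options[0], True       # keep the top guess, flag for proofreading


# The last letter of a word is fuzzy: only "price" is a real word.
print(resolve("prico", 4, ["o", "e", "c"]))

# The second digit of a number is fuzzy: "1650" and "1850" are equally plausible.
print(resolve("1650", 1, ["6", "8"]))
```

The word is repaired silently, while the number can only be flagged. An engine that does not flag it simply returns its best guess, which is why figures deserve the closest proofreading.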