OCR'ing a dash

Postby hjanum » Sun Feb 22, 2009 7:25 pm

I have a document with strings with dashes, for example "1-1".
I get good character separation when stepping through the definition in the recognizer, but the dash comes out as a '1' or 'F'. The probability bar for the dash is zero.

I do have the dash/minus in my accepted chars.

MASKTAGS 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-

I tried to upload screen shots to this forum, but I got the error "Sorry, the board attachment quota has been reached."
hjanum
Participant
 
Posts: 8
Joined: Wed Sep 20, 2006 9:29 am

Re: OCR'ing a dash

Postby stephen.bottomley@kofax.com » Mon Feb 23, 2009 7:03 am

Make sure your minimum character height is higher than the dash. A dash is considered punctuation and must not fall within the minimum size permitted for a normal character.
Stephen Bottomley
Senior Product Specialist
Tel: +44 (0)1223 226012
stephen.bottomley@kofax.com
Participant
 
Posts: 675
Joined: Mon Jul 11, 2005 8:31 am
Location: Cambridge

Re: OCR'ing a dash

Postby hjanum » Mon Feb 23, 2009 5:13 pm

Thanks for your suggestion. The character is being detected, but not recognized.

Image
Image

Here are the probabilities for each character.

Image
Image

Re: OCR'ing a dash

Postby stephen.bottomley@kofax.com » Tue Feb 24, 2009 3:46 am

Hi hjanum,

I'd urge you to check your minimum character height. The screenshot you posted demonstrated that this is probably set incorrectly.

If the character is being recognised, then your minimum character height cannot possibly be higher than the dash. If your minimum character height were higher than the dash, the dash would fall below the minimum height required for a character; it could not be read as an "F" and would instead be treated as punctuation.

Re: OCR'ing a dash

Postby hjanum » Tue Feb 24, 2009 6:22 pm

In the screen shots above I had not defined a minimum character size, and the character separation seemed to work. Today I tried a minimum character size of 2x2 with the same result.
Is there anything else that I can try?

Here is a sample of the (redacted) page that I am trying to OCR.
http://www.mediafire.com/?sharekey=f4937435789d54770f83d91f6dff7c38e04e75f6e8ebb871

Thanks for your help.

Re: OCR'ing a dash

Postby stephen.bottomley@kofax.com » Wed Feb 25, 2009 4:22 am

I've been trying to explain that 2x2 is an incorrect minimum size for this field!

Please read and note carefully: Your minimum height must be greater than the height of your hyphen. The minimum character height describes the minimum height of your characters. A hyphen is not considered a "character". It is considered a "punctuation mark". Your minimum character height must be greater than the height of a punctuation mark.

1. In this example, I measured the hyphen and it is about 5-6 pixels high.
2. Your minimum character size is 2x2. Because 2 is not greater than 5, this is incorrect.
3. Your minimum character size should be something like 3x15 (3 wide, 15 high).
4. Because 15 is greater than 5, this is now correct.

Here are the sizes I used:

  MINSIZE 3 15 53 13
  EXPSIZE 23 34 11 5
  MAXSIZE 40 50 60 0

I was able to read this character as a hyphen with no problems.
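
As a rough illustration only, the rule in the numbered steps above can be written as a one-line check. This is a Python sketch, not INDICIUS code: the function name is hypothetical, and the hyphen height of 6 pixels is taken from the upper end of the 5-6 pixel measurement.

```python
def min_height_excludes_punctuation(min_char_height: int, punctuation_height: int) -> bool:
    """Return True when the minimum character height is strictly greater
    than the height of the punctuation mark (hypothetical helper)."""
    return min_char_height > punctuation_height


# Assumption: the hyphen measures about 5-6 pixels high; use the larger value.
hyphen_height = 6

# Original setting (2x2): 2 is not greater than the hyphen height.
print(min_height_excludes_punctuation(2, hyphen_height))

# Suggested setting (3x15): 15 is greater than the hyphen height.
print(min_height_excludes_punctuation(15, hyphen_height))
```

The first call reports that the 2x2 setting fails the rule and the second that a height of 15 passes it.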

Re: OCR'ing a dash

Postby hjanum » Thu Feb 26, 2009 4:52 am

Your character sizes fixed the issue. Thanks.
You are right, I did use the wrong char size. I think what confused me was the character outlines in the definer. My earlier screen shot (with no char sizes set) shows the characters outlined nicely, so I assumed character separation was not the problem. Below are the outlines with your char size settings.

Image
Image

They look pretty much identical. Perhaps I am wrong about the meaning of the character outlines when debugging.

Re: OCR'ing a dash

Postby stephen.bottomley@kofax.com » Thu Feb 26, 2009 4:55 am

You never had any issues with character separation, and still don't. Your characters were all separated correctly. Separation (segmentation) is about the horizontal: drawing lines between each character. The misconfiguration here was on the vertical.

The only difference is that you had told INDICIUS that even very very stumpy blocks of pixels could still be characters, so it was trying (and failing) to figure out what this weird "-" thing looked like. Its best guess was an "F" but that was still pretty far out - not a surprise, since "-" doesn't look anything like any real characters.
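
To make the distinction concrete, here is a toy Python model of the behaviour described above. It is a sketch under assumptions, not how INDICIUS is implemented: the `Block` type, the `classify_blocks` function and the pixel sizes for "1-1" are all made up for illustration. Segmentation yields the same three blocks either way; only the height threshold decides which of them are treated as characters.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Bounding box of one segmented block of pixels (hypothetical type)."""
    width: int
    height: int


def classify_blocks(blocks: List[Block], min_char_height: int) -> List[str]:
    """Label each block by height alone: blocks shorter than the minimum
    character height are treated as punctuation, the rest as characters."""
    return [
        "character" if block.height >= min_char_height else "punctuation"
        for block in blocks
    ]


# Assumed sizes for the string "1-1": two tall digits and a 6-pixel-high hyphen.
glyphs = [Block(10, 30), Block(12, 6), Block(10, 30)]

# Minimum height 2: the hyphen qualifies as a character and gets matched
# against letters and digits.
print(classify_blocks(glyphs, min_char_height=2))

# Minimum height 15: the hyphen is too short to be a character and is
# handled as punctuation instead.
print(classify_blocks(glyphs, min_char_height=15))
```

In both runs the list has three entries, which matches the observation that the outlines look identical with either setting.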

