Xtrata Recognition - Document Separation Problems

Postby » Thu Nov 08, 2007 3:04 pm

Hey,
I'm new to the forums and new to Kofax and Xtrata. The manual and Labs gave me the raw knowledge, and I've been spending a few weeks on here reading your posts and struggling through some concepts.

I am doing a project with two basic types of forms being scanned in, and each of these forms signifies a break in the document. It wasn't working at all with Kofax Zone ID recognition, but with Xtrata it works GREAT. Most of the time.

Hopefully y'all can help me out with this. The Xtrata server recognizes and separates the two forms, and it's OK if it misses a few. It picks up the first form type almost without fail. It also picks up the second form type fairly well, but it misses around 30% of the time on sheets that seem pretty much identical to my sample pages. I have no idea why it's doing so well on one and not on the other. I've looked on here, and I've looked in the other resources that are available (there don't seem to be many). I even switched the secondary sample pages and lowered the confidence level from 70% to 65% and got even better separation, but it STILL missed the exact same sheets of the second form type. Is the primary sample page more important than the 5 secondary samples? Do more samples provide better results, or does it narrow the field too much?

Any ideas?
Thanks,
Martin Bidegaray
Production Manager
doc2e-file
Pasadena TX
Participant
 
Posts: 51
Joined: Thu Nov 08, 2007 2:53 pm
Location: Clear Lake, Texas

Postby » Thu Nov 08, 2007 8:54 pm

The following is largely my opinion from working with Xtrata. It appears that it has a rather simple model to use in trying to match a scanned image with the samples. It can handle shift and scale, but it can't handle "warp".

Where are the forms coming from? If it's from different printers, that's fine, but if they are in fact different forms that are designed to look alike, Xtrata will trip up on it as one form will have slightly thicker lines or slightly bigger spaces. While Xtrata can deal with most issues that come up between printing and scanning, the original form must be identical. It can't come from different software packages.

Another issue is how much of the black pixels are form and how much is data. Some forms are difficult because too much of the form is data, which changes from page to page. A business letter would be very hard since the only thing consistent is the letterhead.
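To put a rough number on the "form versus data" question, here is a minimal sketch in Python. It is not how Xtrata works internally (I don't know that); it just compares a few sample pages and reports what share of the black pixels stays the same on every page. The function name `static_ink_fraction` and the tiny hand-made pages are made up for illustration, and the sketch assumes the pages are already the same size and lined up, which real scans will not be without extra work.

```python
def static_ink_fraction(pages):
    """Return the share of inked pixels that are inked on every page.

    pages: list of equally sized binary images (rows of 0/1, 1 = black).
    Assumes the pages are already aligned with each other.
    """
    if not pages:
        raise ValueError("need at least one page")
    height, width = len(pages[0]), len(pages[0][0])
    static = 0  # black on every page: likely part of the printed form
    inked = 0   # black on at least one page: form or filled-in data
    for y in range(height):
        for x in range(width):
            values = [page[y][x] for page in pages]
            if any(values):
                inked += 1
                if all(values):
                    static += 1
    return static / inked if inked else 0.0


# Two hypothetical 3x4 "scans": the top row is the form, the rest is data.
page_a = [[1, 1, 1, 1],
          [0, 1, 0, 0],
          [0, 0, 0, 0]]
page_b = [[1, 1, 1, 1],
          [0, 0, 1, 0],
          [0, 0, 0, 1]]

print(static_ink_fraction([page_a, page_b]))
```

A low value suggests that most of the ink on the page is variable data, which is the situation described above as hard to classify.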

I've been told that Xtrata does work better if one of the sample forms is blank. Since I work in a service bureau, I tend to use a scanned form and blank out the data in an image editor.

Another tool you can use is to import and use the "XtrataDefinition" batch class. It will take in everything you scan and separate it into what Xtrata sees as separate forms (I think it puts it in AscentSV\XtrataDB\DefClasses). Scan in a large sample of the form that is giving you trouble and then use one sample from each classification as a sample in your real job.
Participant
 
Posts: 3374
Joined: Wed May 17, 2006 12:53 pm
Location: USA

Thanks

Postby » Wed Nov 14, 2007 5:41 am

Yeah, I think the problem is that a very low percentage of the form is actually constant. Much of it will change very often. I am going to try to put a blank form in the sample images anyway.

A few more questions, if you don't mind:

1. Do more sample pages = better classification?

2. Let's say I decide there are too many variances and I want Kofax to divide the documents by using separator sheets. Is there some way to combine separator sheet separation and have Xtrata still recognize the index zones directly on the page right after that separation sheet? (If I do this, I will have prep separate batches based on form type as well as documents, so there will be two batch classes per box, one for each form type.)

3. I have pored through the Xtrata Manual, and even tried the Xtrata slide show I found on these forums, are there any other resources anyone knows about?

Edit: I actually started playing with the XtrataDefinition Batch class, here's a few more questions.

If I have multiple form types on one batch class, will that classify the pages it recognizes as different forms into different documents even though they are all under one "document class"?

If I "Assign" a sample, how can I "unassign" it and get it back on the "unassigned" list? It seems like every time I mess up I have to start over all the way from the beginning.
Thanks Again,
Martin Bidegaray
Production Manager
doc2e-file
Pasadena TX
Participant
 
Posts: 51
Joined: Thu Nov 08, 2007 2:53 pm
Location: Clear Lake, Texas

Re: Thanks

Postby » Fri Nov 16, 2007 2:41 pm

MartinB wrote: 1. Do more sample pages = better classification?


Maybe. If you give it all the variations, it would help it classify the form type.


is there some way to combine separator sheet separation and have Xtrata still recognize the index zones directly on the page right after that separation sheet?


From your other posts, I gather you have fixed that.


3. I have pored through the Xtrata Manual, and even tried the Xtrata slide show I found on these forums, are there any other resources anyone knows about?


None that I know of. In my case, our sales engineer gave me my starting run-through.


Edit: I actually started playing with the XtrataDefinition Batch class, here's a few more questions.


The batch class is just a tool to help you understand what Xtrata "sees". Many times it will create separate classes for the same document type. But by taking a sample from each of those and adding it to my batch class, I can get Xtrata to see it all as a single form type.

If I "Assign" a sample, how can I "unassign" it and get it back on the "unassigned" list?


There might be a better way. I'm sure you can delete the sample from inside Xtrata, but that doesn't make it unassigned for use in a different form type.
Participant
 
Posts: 3374
Joined: Wed May 17, 2006 12:53 pm
Location: USA

Postby » Fri Nov 16, 2007 4:07 pm

Yeah, I've pretty much given up on the XtrataDefinition batch class being that great of a tool. In my case it provided too many sample groups; it seemed the more samples I gave, the less luck I had with recognition, even if the samples varied. Also, it was when I ran the Definition batch that my classification percentages dropped to 0%. I was only able to get them back when I upgraded from 1.7 to 1.7 SP1, and what frustrates me is I still don't know why.

In my case I did better with the typical 5 sample pages, setting up the zones and using separator sheets to break up the documents. I lowered the classification confidence to about 50%, and it manages to identify 80 - 95% of all the document cover sheets (the rest are "unclassified"). It also recognizes a slightly higher percentage of the recognition zones (after I classify the unclassified docs in QC). For this particular project, it seems to work very well. I have one more day of tweaking left before I toss it out into production. I just hope Xtrata doesn't manage to freak out on me before then.
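For anyone wondering why a small change to the confidence setting may not rescue the same stubborn sheets while a bigger drop does, here is a rough Python sketch. The scores are invented for illustration and `split_by_confidence` is a hypothetical helper, not anything from Kofax or Xtrata. It only shows the general idea of a cutoff: pages scoring at or above the threshold get classified, and everything else lands in "unclassified".

```python
def split_by_confidence(scores, threshold):
    """Partition page ids into (classified, unclassified) by score."""
    classified = [page for page, score in scores.items() if score >= threshold]
    unclassified = [page for page, score in scores.items() if score < threshold]
    return classified, unclassified


# Made-up confidence values (percent) for six cover sheets.
scores = {"p1": 92, "p2": 81, "p3": 68, "p4": 55, "p5": 52, "p6": 31}

for threshold in (70, 65, 50):
    classified, unclassified = split_by_confidence(scores, threshold)
    print(threshold, len(classified), unclassified)
```

With these invented numbers, going from 70 to 65 only picks up one extra page, while 50 picks up most of the rest. A page that scores far below the others would still need manual classification in QC.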

Thanks again,
Martin Bidegaray
Production Manager
doc2e-file
Pasadena TX
Participant
 
Posts: 51
Joined: Thu Nov 08, 2007 2:53 pm
Location: Clear Lake, Texas

