Recognition - PDFText vs FR or RS

Post by Hayden » Sun Nov 20, 2016 1:52 pm

Hi All.
We've finally upgraded KTM to a post-5.5 version (6.1), which allows us to use the native PDFText that comes with a PDF document instead of running FineReader or RecoStar OCR. I have a new customer coming on board who might make use of this, but I have a few questions that I hope someone can help me with.

1. We will have a mix of PDFs with text and without text - how do we recognise this and only OCR the PDFs without text?
2. We will also have PDFs that have been scanned and OCR'd outside of Kofax, with poor OCR results as the text in the PDF - again, we need to be able to say "that's not good enough, OCR in KTM please" - how?
3. There are also scenarios where we receive spreadsheets that are converted to PDF in KIC. Some of the columns aren't wide enough to display all the characters, but the conversion seems to capture the full cell text - even though it's not visible - and the word area (shown in the xdc browser) overlaps the next column's text. Are there potential problems here?
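
To make the overlap problem in question 3 concrete, here is a rough sketch in Python (not KTM script) of how overlapping word areas could be flagged once the word boxes are exported. The `Word` class and its pixel fields are hypothetical stand-ins for whatever the word list actually exposes, and the sample coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word box: text plus position and size in pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self):
        return self.left + self.width

    @property
    def bottom(self):
        return self.top + self.height


def overlapping_pairs(words, min_overlap=1):
    """Return (text, text) pairs whose boxes intersect on both axes."""
    pairs = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            # Size of the shared region on each axis; negative means a gap.
            dx = min(a.right, b.right) - max(a.left, b.left)
            dy = min(a.bottom, b.bottom) - max(a.top, b.top)
            if dx >= min_overlap and dy >= min_overlap:
                pairs.append((a.text, b.text))
    return pairs


# Made-up row: the first cell's text spills into the second column.
row = [
    Word("LongCustomerName", 100, 50, 200, 20),
    Word("1234.50", 250, 50, 60, 20),
    Word("Total", 400, 50, 50, 20),
]
flagged = overlapping_pairs(row)
```

A document with any flagged pairs could then be treated as a candidate for OCR on the rendered image, since position-based locators may pick up the wrong text.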

I believe these are all close to the same question, but with a few different takes on it. At the end of the day, I'd like to work out - on the server, not with user intervention - whether I can use the PDFText or need to OCR the image. Has anyone else had to deal with this?
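
The kind of server-side decision I have in mind could look something like the sketch below. It is plain Python, not KTM script, and everything in it is an assumption for illustration: the function names, the token pattern, and both thresholds would need tuning against real documents.

```python
import re

# A token counts as "plausible" if it is a word of two or more letters
# (optionally followed by punctuation) or a number-like string such as
# 12345, 1,250.00 or 20/11/2016. Single letters are deliberately not
# counted, because bad OCR tends to produce lots of them.
PLAUSIBLE = re.compile(r"^[A-Za-z]{2,}[.,;:]?$|^[$]?\d[\d.,/-]*%?$")


def text_quality(text):
    """Fraction of whitespace-separated tokens that look plausible."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = sum(1 for t in tokens if PLAUSIBLE.match(t))
    return good / len(tokens)


def choose_text_source(pdf_text, min_chars=20, min_quality=0.7):
    """Return "pdf_text" if the embedded text looks usable, else "ocr"."""
    stripped = pdf_text.strip()
    if len(stripped) < min_chars:
        # No text layer, or too little text to trust.
        return "ocr"
    if text_quality(stripped) < min_quality:
        # Text is present but looks like poor OCR from an earlier scan.
        return "ocr"
    return "pdf_text"


clean = "Invoice number 12345 dated 20/11/2016 total 1,250.00"
garbled = "lnv0!ce n#mb3r l2E4S d@t3d 2O/ll/2Ol6 t0t@l"
decisions = [choose_text_source(t) for t in ("", clean, garbled)]
```

This covers questions 1 and 2 with one check: an empty text layer and a poor-quality text layer both fall through to OCR.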
Cheers
- Hayden
Hayden
Participant
 
Posts: 45
Joined: Tue Dec 14, 2010 8:42 pm
Location: Sydney, Australia

Re: Recognition - PDFText vs FR or RS

Post by Hayden » Mon Nov 21, 2016 3:05 pm

So to start to answer my own questions - but I'm still keen to hear from anyone who has the same issues - I'll update this...
Some tests have shown that when a PDF doc with text is "loaded" into KTM (I know that's not the correct term), the text is created effectively immediately. I stopped my processing in Document_BeforeProcessXDoc, inspected the xdoc, and those with native PDF text had it as a Representation on the CDOC - same place as the OCR results would go.
For those docs that didn't have text, or had text from a scan (yes, there seems to be a difference), no text was loaded.

At this point I can only assume that where no text is loaded, the KTM Server will do the OCR, and where text is loaded, it'll be skipped.

This would be helpful except that one of my examples is a landscape document saved as a portrait PDF, and it's been created with the text. So the text runs from the bottom to the top of the page - the text itself is all correct. But KTM can't rotate a PDF, and if I use Acrobat to rotate it and then put it into KTM, the text is all incorrect.
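
One way such a page might be spotted on the server is by looking at the shape of the word boxes: normal horizontal words are wider than they are tall, while rotated ones are the other way around. The sketch below is plain Python and rests on an untested assumption, namely that the word boxes taken from the PDF text reflect the rotated layout. The function name and thresholds are hypothetical, and it cannot tell 90 degrees from 270.

```python
def looks_rotated(boxes, min_chars=3, threshold=0.6):
    """Guess whether text runs vertically.

    boxes: iterable of (text, width, height) tuples.
    Short words are skipped because "I" or "a" can be taller than wide
    even on a correctly oriented page.
    """
    candidates = [(w, h) for text, w, h in boxes if len(text) >= min_chars]
    if not candidates:
        return False
    tall = sum(1 for w, h in candidates if h > w)
    return tall / len(candidates) >= threshold


# Made-up boxes for a normal page and a sideways page.
normal_page = [("Invoice", 80, 12), ("Total", 50, 12), ("Date", 40, 12)]
sideways_page = [("Invoice", 12, 80), ("Total", 12, 50), ("to", 12, 10)]
results = (looks_rotated(normal_page), looks_rotated(sideways_page))
```

Documents flagged this way could be routed to image conversion and OCR rather than relying on the PDFText.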

And even tools like KIC with VRS won't rotate the PDF if the output is a PDF.

If the first time the document is viewed is in Validation, then the only option would be to put it into another batch of a different class and reprocess it.

So, why use the native PDFText? And why process a PDF in KTM and not have it convert to a TIF? The only way I can see this working well is when the PDFs are from a known, controlled source.
Cheers
- Hayden

