Table Extraction - General Hints

Table Extraction - General Hints

Postby » Wed Jun 04, 2008 6:40 am

Ever had an (invoice) project where table extraction was required?
You might feel in hot water, so I decided to start this thread to help you overcome those difficulties. Kofax Transformation Modules (AXP) can do much more than it seems to be at the first glance.

I'm currently working on this thread, so bear in mind that there might be more to come.
If you got another great idea, please share your thoughts!

How this post is organized:
    General Hints
      • Automatic Mode
      • Use columns form the pool
      • What's the language and global settings thing in the column pool?
      • The Language Pack
      • Use a Language Pack
      • Train your Language Pack
      • Setting up your table locator,
      • Test!
      • A short look at Templates
    Advanced Tips
      • How does it work?
      • Optional Lines?
      • Unit Price and Quantity mixed up
      • Using embedded columns
      • Interleaved Information (like order numbers, delivery note number)
      • How to "Train tables" in validation
      • Access Header Lines from script

How does table extraction work?
KTM uses a table locator to determine a valid table on a document. The locator can work two-level:
    • Automatic Mode, which tries to be a global table finder,
    • Manual Mode, which is highly configureable and preferably combined with a template.

General Hints
Below you may find some very general tips on how to extract table data. If you're already familiar with those, you may scroll down to the advanced section.

Automatic Mode
Use as many columns as possible!
Even if you don't need them, you should add as many columns as possible. They are linked to KTM's internal logic, and if a table contains columns which are not present in your table model, extraction may fail.
If you don't need them - hide them and set them to "Always valid".

Use columns form the pool
Don't create your own columns as they won't work with the internal logic. Use predefined ones.

What's the language and global settings thing in the column pool?
That's a feature which will be implemented in the future. Right now, don't bother about that.

The Language Pack
Language Packs contain essential information for the automatic table mode, as they will:
    • define the table start (usually the header)
    • define keywords for all columns
    • they are trainable - so train them!

Use a Language Pack
There's a pre-defined language pack shipping with KTM, feel free to use it at the beginning. You can find this language pack here (default location - just search for *.llp-files):
C:\Program Files\Kofax\Transformation\LanguagePackages

Train your Language Pack
If you open up your language pack, table header recognition will be performed on your test documents. There's a documentation so I won't go into details below you'll find the most important steps:
    • Use CTRL-Left Click to train a header
    • Use CTRL-Right Click to train a non-header (only necessary if the system has found a false-positive!)
    • In the tab "Headers" add all keywords to the correct columns
    • You can use OCR lasso here or just type them in, separated by semicolons. OCR errors here are not critical, as fuzzy search will be used.

Setting up your table locator (Automatic Mode)
Use a end of table dictionary - this will reduce the rate of false positives. Use very unique keywords in this dictionary as they will be matched fuzzy.

Setup a format locator for amounts - use a format locator here, I saw many people who tried to link the Amount Group Locator here. You need a specific format locator which can find all amounts, quantities, line numbers, etc.

Test!
Test your setup frequently and remember to train as many documents as possible.

A short look at Templates
Templates seem to have sort of bad reputation, so I want to go into this a little bit.
Some people may think that templates require a lot of manual setup time. That's not true. A template can often be set up in less than a minute - that's because KTM inherits everything from the base class.
You don't have to set up locators or fields again for the template, very often you just change one or two locators.

So, when is a template required?
When automatic table extraction is not sufficient because:
    • the table is to complex
    • none of the automatic algorithms is successful
    • it's not a typical invoice table

A template can be set up in 2 kind of ways:
    • during the regular project setup
    This requires you as the parter to have sufficient samples to generate a manual table locator and you as the end-user to gather enough of these samples, or
    • during running operations
    This is done by the end user - and it's very often not more than just a few mouse clicks.

I will describe Step 2 more detailed in the advanced section.
Last edited by wolfgang.radl@kofax.com on Wed Jun 18, 2008 7:33 am, edited 7 times in total.
Kind Regards,
Wolfgang Radl
KOFAX Sales Engineer
Participant
 
Posts: 453
Joined: Sun Oct 09, 2005 10:52 pm
Location: Vienna, Austria

Postby » Wed Jun 04, 2008 6:41 am

Advanced Section
Still not happy with your results? I bet so, as many tables are difficult to handle. Remember - scripting is your friend!

How does it work?
Okay - guess it helps explaining how automatic table detection works as this may, on the other hand, help you understanding results.
There are 5 algorithms for table detection:
    • Amount Based, which tries to calculate: 3 * 20 = 60, for instance (simplified). This algorithm goes for total price, unit price and quantity, sometimes it mixes quantity and unit price (see below how to deal with such problems)
    • Position Based, which goes for line item numbers to the left of a table (like 1, 2, 3, or 100, 200, 300, etc.)
    • Line Based, which looks for geometrical lines inside a table
    • Layout Based (not sure at this one), which works quite good if rows always look the same
    • Header Based, which goes for the header line (start of table) and uses the keywords you defined. This algorithm tries to match a unit price to the corresponding column, however it seems that this algorithm is not always used or combined with others.
As far as I know, KTM just picks the winner (I'm not sure here, please correct me, if this is wrong!!)

Optional Lines?
They can be a pain, especially if the amounts are always in the last line, like:
Image

Right now, these tables can be handled with the automatic mode only, as manual mode won't be able to handle the amounts.
There's a way to deal with such issues: scripting (and it is really easy, I'll provide a sample):
    Step 1 - fetch as much as possible from the table
    Image
    Step 2 - locate all amounts
    Image
    Step 3 - feel the magic!
    Image

How it works (the script):
MergeTableAmounts does all the magic. Provide the following:
• TableName = the table locator's name
• Amounts = format locator's name
• UnitPriceCellName = the unit price cell name, usually "Unit Price"
• TotalPriceCellName = same as above for the total price, usually "Total Price"
• pxdoc

Code: Select all
Private Sub MergeTableAmounts (TableName As String, Amounts As String, UnitPriceCellName As String, TotalPriceCellName As String, pXDoc As CASCADELib.CscXDocument)
   ' *** first: check how many rows are present and if (2*rows) amounts are present
   Dim myTableLoc As CscXDocLib.CscXDocFieldAlternative
   Dim myFormatLoc As CscXDocLib.CscXDocField

   Dim i As Integer

   Set myTableLoc = pXDoc.Locators.ItemByName(TableName).Alternatives.ItemByIndex(0)
   Set myFormatLoc = pXDoc.Locators.ItemByName(Amounts)

   If 2*(myTableLoc.Table.Rows.Count) = myFormatLoc.Alternatives.Count Then
      ' *** twice as much alternatives are present - seems fine!
      For i = 0 To myTableLoc.Table.Rows.Count - 1
         InitTableCellFromAlternative(myTableLoc.Table.Rows.ItemByIndex(i).Cells.ItemByName(UnitPriceCellName),myFormatLoc.Alternatives.ItemByIndex(i+i))
         InitTableCellFromAlternative(myTableLoc.Table.Rows.ItemByIndex(i).Cells.ItemByName(TotalPriceCellName),myFormatLoc.Alternatives.ItemByIndex(i+i+1))
      Next i

   Else
      ' *** more or less alternatives are present - that's not good.

   End If
End Sub


The next sub just maps the alternatives from the format locator to the table cells, including geographical info:

Code: Select all
Private Sub InitTableCellFromAlternative (myTableCell As CscXDocLib.CscXDocTableCell, myAlternative As CscXDocLib.CscXDocFieldAlternative)
   With myTableCell
      .Text = myAlternative.Text
      .ExtractionConfident = True
      .Width = myAlternative.Width
      .Height = myAlternative.Height
      .Top = myAlternative.Top
      .Left = myAlternative.Left
      .PageIndex = myAlternative.PageIndex
   End With
End Sub


Now just run the script:
Code: Select all
Call MergeTableAmounts ("TBLTABLE", "FTLSPECIFICTABLEAMOUNTS", "Unit Price", "Total Price", pXDoc)


For those of you who want to get the project, just send me an email.

Unit Price and Quantity mixed up
Sometimes, especially when the amount based algorithm wins, unit price and quantity may appear mixed up. A simple script can help here as well by just swapping them.
Here's the issue:
Image
Here's the result:
Image

And finally, that's the solution:
SwapQuantityAndUnitPrice swaps both cells, if your logic applies, but see below:
• myTable = your field.table (must be in after extract!)
• UnitPriceCellName = the unit price cell name, usually "Unit Price"
• QuantityCellName = same as above for the quantity, usually "Quantity"

Code: Select all
Private Sub SwapQuantityAndUnitPrice (myTable As CscXDocLib.CscXDocTable, UnitPriceCellName As String, QuantityCellName As String )

   Dim i As Integer

   ' *** we need at least 1 row
   If myTable.Rows.Count > 0 Then
      For i = 0 To myTable.Rows.Count - 1
         ' *** check for a comma or a semicolon in unit price and qty
         ' *** case 1: QTY contains separator, unit price does not
         If (InStr(myTable.Rows.ItemByIndex(i).Cells.ItemByName(UnitPriceCellName).Text, ",") = 0 Or _
             InStr(myTable.Rows.ItemByIndex(i).Cells.ItemByName(UnitPriceCellName).Text, ".") = 0) And _
            (InStr(myTable.Rows.ItemByIndex(i).Cells.ItemByName(QuantityCellName).Text, ",") > 0 Or _
             InStr(myTable.Rows.ItemByIndex(i).Cells.ItemByName(QuantityCellName).Text, ".") > 0) Then
            ' *** swap
            Call SwapTableCells(myTable.Rows.ItemByIndex(i).Cells.ItemByName(UnitPriceCellName), myTable.Rows.ItemByIndex(i).Cells.ItemByName(QuantityCellName))
         Else
            ' *** your own logic can go here, for example: quantity > unit price or whatever..
         End If
      Next i
   End If

End Sub


The next sub swaps two table cells.

Code: Select all
Private Sub SwapTableCells (myTableCellA As CscXDocLib.CscXDocTableCell, myTableCellB As CscXDocLib.CscXDocTableCell)

   Dim myTempCell As CscXDocLib.CscXDocTableCell
   Set myTempCell = New CscXDocLib.CscXDocTableCell

   With myTempCell
      .Text = myTableCellA.Text
      .ExtractionConfident = True
      .Width = myTableCellA.Width
      .Height = myTableCellA.Height
      .Top = myTableCellA.Top
      .Left = myTableCellA.Left
      .PageIndex = myTableCellA.PageIndex

   End With

   With myTableCellA
      .Text = myTableCellB.Text
      .ExtractionConfident = True
      .Width = myTableCellB.Width
      .Height = myTableCellB.Height
      .Top = myTableCellB.Top
      .Left = myTableCellB.Left
      .PageIndex = myTableCellB.PageIndex
   End With

   With myTableCellB
      .Text = myTempCell.Text
      .ExtractionConfident = True
      .Width = myTempCell.Width
      .Height = myTempCell.Height
      .Top = myTempCell.Top
      .Left = myTempCell.Left
      .PageIndex = myTempCell.PageIndex
   End With
End Sub


Using embedded columns
The Embedded Column feature simply rocks if you want to extract things that "hide" themselves, like here:
Image
3 easy steps:
    1 - set up your table locator and make sure the whole description goes to the column
    2 - set up a format locator which finds all article codes (45\d{4})
    3 - set up a new embedded column like here:
    Image
    The result:
    Image

Interleaved Information (like order numbers, delivery note number)
If you receive an invoice for e.g. 2 different orders, than it could be possible that the first 2 items in your table are related to the first order while the next 8 belong to the second one.
Have a look at this sample document:
Image
Now KTM comes with a nice feature called Interleaved Order Numbers or Interleaved Delivery Note Numbers, but if you read carefully then you'll see that there is an OR statement. That's correct, by default you can only read delivery note OR order numbers.
What if you want to read both?
What if you want to extract something else in that kind of way that is neither an order number nor a delivery note number?
Here's the script again, just copy and paste it and you'll be fine (works even on multipage! ;-))
This script allows you to extract anything that can be found with an format locator and to copy that to any table cell of your choice:
• pxdoc
• myTable = your Table Locator
• myInterleavedValues = your Format Locator
• myCellName = the Cell Name where the Value will be copied to

Code: Select all
Private Sub CopyInterleavedValuesToTable(pXDoc As CscXDocument, myTable As CscXDocTable, myInterleavedValues As CscXDocField, myCellName As String)

   Dim i As Integer
   Dim iActiveAlternative

   iActiveAlternative = 0

   ' *** at least 1 row should be present and at least 1 alternative needs to be present
   If myTable.Rows.Count > 0 And myInterleavedValues.Alternatives.Count > 0 Then
      For i = 0 To myTable.Rows.Count - 1

         If myInterleavedValues.Alternatives.ItemByIndex(iActiveAlternative).Top < myTable.Rows.ItemByIndex(i).Cells.ItemByIndex(0).Top Or _
            ((myInterleavedValues.Alternatives.ItemByIndex(iActiveAlternative).Top > myTable.Rows.ItemByIndex(i).Cells.ItemByIndex(0).Top) And _
              myInterleavedValues.Alternatives.ItemByIndex(iActiveAlternative).PageIndex < myTable.Rows.ItemByIndex(i).Cells.ItemByIndex(0).PageIndex) Then
            ' *** the current row is below the e.g. order number
            ' *** OR the current row is not below, but it is on the following page!
            ' *** now check if another alternative is present
            If myInterleavedValues.Alternatives.Count > iActiveAlternative + 1 Then
               ' *** another alternative present!
               If (myInterleavedValues.Alternatives.ItemByIndex(iActiveAlternative + 1).Top < myTable.Rows.ItemByIndex(i).Cells.ItemByIndex(0).Top) And myInterleavedValues.Alternatives.ItemByIndex(iActiveAlternative + 1).PageIndex = myTable.Rows.ItemByIndex(i).Cells.ItemByIndex(0).PageIndex Then
               ' *** we are already below the next alternative (but still on the same page)!
                  iActiveAlternative = iActiveAlternative + 1
               End If
            End If
         End If
         InitTableCellFromAlternative(myTable.Rows.ItemByIndex(i).Cells.ItemByName(myCellName), myInterleavedValues.Alternatives.ItemByIndex(iActiveAlternative))
      Next i
   End If

End Sub


Of course, you need this little helper again:

Code: Select all
Private Sub InitTableCellFromAlternative (myTableCell As CscXDocLib.CscXDocTableCell, myAlternative As CscXDocLib.CscXDocFieldAlternative)
   With myTableCell
      .Text = myAlternative.Text
      .ExtractionConfident = True
      .Width = myAlternative.Width
      .Height = myAlternative.Height
      .Top = myAlternative.Top
      .Left = myAlternative.Left
      .PageIndex = myAlternative.PageIndex
   End With
End Sub


And finally, you need to call the script. This sample assigns delivery note and order numbers at the same time.

Code: Select all
Private Sub TABLE_BeforeExtract(ByRef pXDoc As CASCADELib.CscXDocument, ByRef pField As CASCADELib.CscXDocField, ByRef pSkip As Boolean)
      Call CopyInterleavedValuesToTable(pXDoc, pXDoc.Locators.ItemByName("TBLTABLE").Alternatives.ItemByIndex(0).Table, pXDoc.Locators.ItemByName("FTLORDERNO"), "Order Number")
      Call CopyInterleavedValuesToTable(pXDoc, pXDoc.Locators.ItemByName("TBLTABLE").Alternatives.ItemByIndex(0).Table, pXDoc.Locators.ItemByName("FTLDELIVERYNOTENO"), "Delivery Note Number")
End Sub


The result:
Image

How to "Train tables" in validation
Now image a scenario where a customer wants to extract tables from hundreds or thousand suppliers. As I described in the section "A short look at Templates" sometimes this can be impossible - so there's an option to "train" invoices.

Here's the basic overview:
- the user scans the documents
- inside validation, the user is not happy with the results:
Image
- the user deletes the table and uses auto table completion:
Image
Image
- the user decides to train the document
Image

The document will now be processed and released. In Version earlier than 3.5 you need to setup the release script for the new samples, in version 3.5 or later you need to have the KB learning in queue.
Now the document is already prepared for manual table recognition. There are just some more steps required until the template is ready:
Image
Just select the document and click:
Image

Here's an overview what KTM has done for you now:
• created a template
• used the image classifier and added the document as a sample
• changed the table locator method to a local, manual table mode
• set up the manual table mode accordingly to one of the validation user

Things that you need to do:
• check if you're happy with the results
• TRAIN the project (KTM does not do that automatically)
• re-synch and re-publish

Image

Access Header Lines from script
Sometimes it might be interesting to know where a table header has been found. This could be the first step to your own table detection algorithm!
Image
Image

Place this script whereever you need it.
Code: Select all
   Dim myTableHeaderLine As CscXDocLib.CscXDocTableHeaderLine
   Dim lLeft, lTop, lGlobalColumnID, lStartLineIndex As Long

   ' *** just a dialog for allowing the user to enter a column ID

   Begin Dialog UserDialog 310,56 ' %GRID:10,7,1,1
      TextBox 170,7,130,21,.txbxColumnID
      OKButton 170,28,130,21
      Text 10,7,150,14,"Enter Column ID here",.Text1
   End Dialog
   Dim dlg As UserDialog
   Dialog dlg

   On Error Resume Next

   ' *** set the header line object

   Set myTableHeaderLine = pXDoc.Locators.ItemByName("TBLTABLE").Alternatives.ItemByIndex(0).Table.HeaderLines.ItemByIndex(0)

   lGlobalColumnID = CDbl(dlg.txbxColumnID)

   ' *** now get the line number, the pixels from left and above

   lLeft = myTableHeaderLine.Cells.ItemByIndex(lGlobalColumnID).Left
   lTop = myTableHeaderLine.Cells.ItemByIndex(lGlobalColumnID).Top
   lStartLineIndex = myTableHeaderLine.StartLineIndex

   MsgBox ("Table Header starts at line: " & lStartLineIndex & vbCrLf _
          & "Column ID no. " & CStr(lGlobalColumnID) & " found at coords: " & vbCrLf _
          & "LEFT (px): " & lLeft & vbCrLf _
          & "TOP (px):  " & lTop)


   Set myTableHeaderLine = Nothing




.... to be continued ...
Last edited by wolfgang.radl@kofax.com on Wed Jun 18, 2008 7:32 am, edited 4 times in total.
Kind Regards,
Wolfgang Radl
KOFAX Sales Engineer
Participant
 
Posts: 453
Joined: Sun Oct 09, 2005 10:52 pm
Location: Vienna, Austria

Postby » Thu Jun 05, 2008 7:10 am

Updated and extended. Please let me know if there's something missing - or if you think you got a table that you can't handle ;-)
Kind Regards,
Wolfgang Radl
KOFAX Sales Engineer
Participant
 
Posts: 453
Joined: Sun Oct 09, 2005 10:52 pm
Location: Vienna, Austria

Postby » Mon Aug 11, 2008 11:19 am

Wolfgang:

Thanks for the nice write-up on how to use table locators! I do have one question. I originally made the mistake you mentioned of trying to use an amount group locator for the locator in the table setup. I am now using a format locator designed to locate amounts, but in your description, you mention that the locator should locate item #'s, quantity, etc. - not just amounts. To design a locator like that would be to pretty much find anything on the invoice, no? Could you elaborate a bit on this?

Thanks,

Dan
Participant
 
Posts: 51
Joined: Wed Sep 19, 2007 9:40 am

Postby » Thu Aug 14, 2008 4:23 am

Hi Dan,
according to my information this locator was used to detect line numbers, quantities and everything else as well (however this might be outdated information - maybe the LCI guys could shed some more light).

I always used the locator in that kind of way and never had any problems (like misinterpreted amounts). The only issues I had were related to quantity/unit price mixups, however there's a workaround available for that.
And don't bother about finding to much on a document, I've seen even some invoices where amounts where missing any comma.. and you need to cover them as well.
Participant
 
Posts: 44
Joined: Tue Aug 05, 2008 11:38 pm
Location: Vienna, Austria

Postby » Tue Sep 30, 2008 4:16 am

It would be good if they made <a href="http://forums.kofax.com/viewtopic.php?t=13601">this</a> a built in function for the next version don't you think as it seems the most sensible use of the table results to compare against another value. Perhaps version 4!

Thanks again for the article and say Hi to Vienna! Got engaged there 2 years ago and fell in love with the city.


David

<b>Moderator Note: Tidied up this thread and clarified this post with a hyperlink</b>
Participant
 
Posts: 35
Joined: Tue Sep 19, 2006 5:54 am
Location: Warwick, UK

Postby » Tue Sep 30, 2008 4:46 am

DaveG wrote:It would be good if they made <a href="http://forums.kofax.com/viewtopic.php?t=13601">this</a> a built in function for the next version don't you think as it seems the most sensible use of the table results to compare against another value. Perhaps version 4!

It's already there, but the documentation is really hazy.
Best Regards,

Daniel Kekesi
DocSoft Hungary
Image
Participant
 
Posts: 2569
Joined: Thu Dec 08, 2005 12:56 am
Location: Budapest, Hungary


Return to Kofax Transformation Modules General Discussion

Who is online

Users browsing this forum: No registered users and 1 guest