BEST PRACTICE: Regular Expressions in Validation Scripts

Post by stephan_mayer@kofax.com » Mon May 07, 2007 10:31 am

Using Regular Expressions in Recognition or Validation scripts

Have you ever had the need to check if an index field value matches a certain pattern? While there are many
ways to do this using traditional code, regular expressions have been specifically designed for pattern matching -
they are much more powerful and flexible. In the following I will show you how you can enhance your validation
scripts using regular expressions.

Sample Use Cases
There are three typical use cases for regular expressions:
  • Check if a string matches one or more patterns
  • Search for a substring inside a string
  • Replace all substrings matching a pattern with another string (or nothing)
A pattern in the context of a regular expression defines sets of characters and their relative positions. For example,
if a field should start with two capital letters followed by a 5- or 6-digit number, the corresponding pattern would be
"^[A-Z]{2}[0-9]{5,6}$". This translates to the following matching instructions: at the beginning of the input (^) check for
uppercase characters ([A-Z]), of which exactly two are needed ({2}), then look for digits ([0-9]), at least
5 and at most 6 of them ({5,6}). Then the input needs to end ($).
If you omit the ^ and $ from the pattern, the string "ABC1234567" would also match, because the substring
"BC123456" fits the pattern. So when you create a pattern for matching a complete field, be careful not to forget them.
If you want to use the regular expression to find occurrences of a pattern within a string, you have to omit
the ^ and $, regardless of whether you want to get the match back or replace it with another string.
For common sub-patterns you can also use abbreviations, e.g. \d (digit) is the same as [0-9], and \s (space) matches
any white space character including blank, tab, form feed etc. Please refer to the Microsoft documentation for details.

The Regexp Object
Now how can you use a regular expression with our recognition or validation scripts? Although SBL does not
include this functionality, luckily Microsoft includes it with every operating system they sell today. It is contained in
the Microsoft VBScript library and exposed as a COM object (for the following samples we assume version 5.0 or
higher is installed, which is the case if you have at least Internet Explorer 5.0 installed - XP for example includes
version 5.6).
All you need to do is to create a new instance of this object and use it.
In SBL, the code for this is very simple:
Code: Select all
Dim oRegexp As Object
Set oRegexp = CreateObject("VBScript.RegExp")

Now you have an instance of the regular expression evaluator that you can use right away. Before looking into
the details, a word on cleaning up the COM object properly: as soon as you no longer
need a reference to the object, do not forget to release it using the following statement:
Code: Select all
Set oRegexp = Nothing

Properties and Methods
The Regexp object only has a few properties and methods. Our next step is to initialise some properties to make it
behave the way you need:
Code: Select all
'Search case sensitive
oRegExp.IgnoreCase = False

Code: Select all
'Search whole string
'-> do not stop on first match
oRegExp.Global = True

Then you need to set the search pattern. You do this by assigning a string to the Pattern property:
Code: Select all
'Search for number with 10-12 digits
oRegExp.Pattern = "^[0-9]{10,12}$"

Now you can check if a string matches this pattern. This is done by a call to the Test method:
Code: Select all
bMatchFound = oRegExp.Test("1234567890")

The return value is True for a match, otherwise False.
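To see the effect of the ^ and $ anchors described earlier, the following minimal sketch runs Test twice on the same value, once with the anchored pattern and once without the anchors. The variable names strValue and strMsg are only illustrative, and the sample value is the one used in the anchor discussion above.

```vb
Sub Main()
   Dim oRegExp As Object
   Dim strValue As String
   Dim strMsg As String

   strValue = "ABC1234567"
   Set oRegExp = CreateObject("VBScript.RegExp")

   '*** Anchored: the complete value has to fit the pattern
   oRegExp.Pattern = "^[A-Z]{2}[0-9]{5,6}$"
   If oRegExp.Test(strValue) Then
      strMsg = "anchored: match"
   Else
      strMsg = "anchored: no match"
   End If

   '*** Unanchored: a match anywhere inside the value is enough
   oRegExp.Pattern = "[A-Z]{2}[0-9]{5,6}"
   If oRegExp.Test(strValue) Then
      strMsg = strMsg & ", unanchored: match"
   Else
      strMsg = strMsg & ", unanchored: no match"
   End If

   '*** Release the COM object
   Set oRegExp = Nothing

   MsgBox strMsg
End Sub
```

The message should read "anchored: no match, unanchored: match", because the unanchored pattern finds "BC123456" inside the value.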
When you want to look for a pattern within a string, you have to use the Execute method. It returns a collection of
match objects, one for each match that has been found.
Let's assume you are looking for a numeric date. You want to find 6.12.2001 as well as 01/05/99 or 03-10-2005. The
corresponding regular expression looks as follows:
Code: Select all
oRegexp.Pattern = "\d{1,2}\s?([\.\- /])\s?\d{1,2}\s?\1\s?(\d{4}|\d{2})"

Now you can search a string for date occurrences:
Code: Select all
Dim oMatches As Object
Set oMatches = oRegExp.Execute("6.12.2001 01/05/99 03-10-2005")

The matches collection now contains 3 match objects and you can use the Value property to retrieve the string that
matched your expression. As SBL does not support For...Each loops, a normal For...Next will do. Note that the
collection is zero-based, so the last valid index is Count - 1:
Code: Select all
Dim i As Integer
For i = 0 To oMatches.Count - 1
  MsgBox oMatches(i).Value
Next i
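Besides Value, each match object of the VBScript RegExp library also has the properties FirstIndex (the zero-based position of the match in the searched string) and Length. The following is only a sketch, assuming the same date pattern and sample string as above, that shows where each date was found:

```vb
Sub Main()
   Dim oRegExp As Object
   Dim oMatches As Object
   Dim i As Integer

   Set oRegExp = CreateObject("VBScript.RegExp")
   '*** Find all matches, not only the first one
   oRegExp.Global = True
   oRegExp.Pattern = "\d{1,2}\s?([\.\- /])\s?\d{1,2}\s?\1\s?(\d{4}|\d{2})"

   Set oMatches = oRegExp.Execute("6.12.2001 01/05/99 03-10-2005")

   '*** The collection is zero-based, so loop up to Count - 1
   For i = 0 To oMatches.Count - 1
      MsgBox "Match " & CStr(i + 1) & ": " & oMatches(i).Value & _
             " at offset " & CStr(oMatches(i).FirstIndex) & _
             ", length " & CStr(oMatches(i).Length)
   Next i

   '*** Release the COM objects
   Set oMatches = Nothing
   Set oRegExp = Nothing
End Sub
```

With the sample string the three dates should be reported at offsets 0, 10 and 19. The offset can be handy when you want to cut the match out of the original string yourself.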

Putting everything together
Now you know all the basics and can start to implement your own routines. Let's start with a function that filters
unwanted characters from a string:
Code: Select all
Option Explicit

Function FilterCharacters(strSearch As String, strAllowedChars As String) As String

   Dim oRegEx As Object

   On Error Resume Next

   '*** Create new instance of RegExp object
   Set oRegEx = CreateObject("VBScript.RegExp")
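   '*** Search case sensitive
   oRegEx.IgnoreCase = False
   '*** Replace all matches, not only the first one
   oRegEx.Global = True

   '*** Build the search pattern: any character NOT in the allowed set
   oRegEx.Pattern = "[^" & strAllowedChars & "]"

   '*** Remove every character that is not allowed
   FilterCharacters = oRegEx.Replace(strSearch, "")

   '*** Release the COM object
   Set oRegEx = Nothing
End Function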


To test the routine you can add a Sub Main(), call the function and test it in SBL by pressing [F5]:
Code: Select all
Sub Main()
   MsgBox FilterCharacters("Kofax Image Products - VRS 4.1", "A-Z0-9 ")
End Sub


The result is as expected ("K I P  VRS 41", with two blanks where the hyphen was removed): only upper case
characters, digits and blanks are returned, everything else is filtered.
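Replace can also substitute another string instead of nothing. Filtering characters can leave two or more blanks next to each other, so a natural follow-up is to collapse each run of white space into a single blank. The following is a minimal sketch; the function name CollapseBlanks is made up for this illustration.

```vb
Function CollapseBlanks(strIn As String) As String
   Dim oRegEx As Object

   Set oRegEx = CreateObject("VBScript.RegExp")
   '*** Replace every run of white space, not only the first one
   oRegEx.Global = True
   '*** One or more white space characters in a row
   oRegEx.Pattern = "\s+"

   '*** Each run is replaced by a single blank
   CollapseBlanks = oRegEx.Replace(strIn, " ")

   '*** Release the COM object
   Set oRegEx = Nothing
End Function

Sub Main()
   '*** The brackets only make the blanks visible in the message box
   MsgBox "[" & CollapseBlanks("K I P  VRS 41") & "]"
End Sub
```

The message box should show "[K I P VRS 41]", with a single blank between P and VRS.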
In the next example we want to check if a string is valid:
Code: Select all
Option Explicit

Function RegexValidate(strSearch As String, strPattern As String) As Integer
   Dim oRegEx As Object

   On Error Resume Next

   '*** Create new instance of RegExp object
   Set oRegEx = CreateObject("VBScript.RegExp")
   '*** Search case sensitive
   oRegEx.IgnoreCase = False
   '*** Search the whole string
   oRegEx.Global = True

   '*** Set the search pattern and test the string against it
   oRegEx.Pattern = strPattern
   RegexValidate = oRegEx.Test(strSearch)

   '*** Release the COM object
   Set oRegEx = Nothing
End Function


To test this function, you can use the following Sub Main():
Code: Select all
Sub Main()
   Dim strIn As String
   Do
      strIn = InputBox("Please enter a 5-digit " & _
                "number or uppercase " & _
                "characters (nothing to exit)")
      If strIn = "" Then Exit Do
      '*** Group the alternatives so that ^ and $ apply to both of them
      If RegexValidate(strIn, "^([0-9]{5}|[A-Z]+)$") Then
         MsgBox strIn & " is valid", 64
      Else
         MsgBox strIn & " is not valid", 48
      End If
   Loop
End Sub
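A point worth checking in patterns that combine anchors with alternatives: the | operator has the lowest precedence, so without parentheses the ^ belongs only to the first alternative and the $ only to the last one. The following sketch compares an ungrouped and a grouped pattern; the helper name TestPattern is made up for this illustration.

```vb
Function TestPattern(strValue As String, strPattern As String) As String
   Dim oRegExp As Object

   Set oRegExp = CreateObject("VBScript.RegExp")
   oRegExp.Pattern = strPattern
   If oRegExp.Test(strValue) Then
      TestPattern = strValue & " is valid"
   Else
      TestPattern = strValue & " is not valid"
   End If

   '*** Release the COM object
   Set oRegExp = Nothing
End Function

Sub Main()
   '*** Ungrouped: "starts with 5 digits" OR "ends with an uppercase letter"
   MsgBox TestPattern("12345abc", "^[0-9]{5}|[A-Z]$")

   '*** Grouped: the whole value must be 5 digits or uppercase letters only
   MsgBox TestPattern("12345abc", "^([0-9]{5}|[A-Z]+)$")
   MsgBox TestPattern("12345", "^([0-9]{5}|[A-Z]+)$")
   MsgBox TestPattern("KOFAX", "^([0-9]{5}|[A-Z]+)$")
End Sub
```

The ungrouped pattern accepts "12345abc" because the value starts with five digits. The grouped pattern rejects it and accepts only "12345" and "KOFAX".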


And the last example shows how to find a pattern within a string and return the match:

Code: Select all
Option Explicit

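Function RegexSearch(strSearch As String, strPattern As String) As String
   Dim oRegEx As Object
   Dim oMatches As Object

   '*** Return an empty string if nothing is found
   RegexSearch = ""

   Set oRegEx = CreateObject("VBScript.RegExp")
   oRegEx.IgnoreCase = False
   oRegEx.Global = True
   oRegEx.Pattern = strPattern
   Set oMatches = oRegEx.Execute(strSearch)

   '*** Return the first match only
   If oMatches.Count > 0 Then
      RegexSearch = oMatches(0).Value
   End If

   '*** Release the COM objects
   Set oMatches = Nothing
   Set oRegEx = Nothing
End Function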


Code: Select all
Sub Main()
   MsgBox RegexSearch("Test String 01.01.2006 contains date", _
          "\d{1,2}\s?([\.\- /])\s?\d{1,2}\s?\1\s?(\d{4}|\d{2})")
End Sub

This can be particularly useful in Recognition Scripts to find a value in the recognition result (e.g. for a larger OCR zone).

Now it is up to you to create your own regular expression searches. Just a few lines of code allow you to implement
powerful check, search and replace routines.
Best Regards,
Stephan Mayer
Presales Manager EMEA
Kofax Image Products
stephan_mayer@kofax.com

Post by dkekesi » Tue May 08, 2007 12:45 am

And, of course, if anyone is coming from the Perl world, there is always the PCRE (Perl Compatible Regular Expressions) library, which is also freely available as a DLL and can be used in a similar way to the example above. Naturally, the regular expression syntax will differ. PCRE could be useful if you already have complex regular expressions in use on UNIX systems and do not wish to rework them to be VBScript compatible. Some also consider the PCRE library to be faster in many scenarios.
Best Regards,

Daniel Kekesi
DocSoft Hungary