Regular Regex Extraction Plugin

Overview

This plugin extracts index field values based on the pattern defined for each field. A pattern consists of one or more words followed by a regular expression, all separated by semicolons. The system searches each page for the regular expression. When it finds a match, it looks to the left of the match for the preceding words in the pattern. If all of the words are found, in order, the value is extracted. If only some of the words are found, or none of them, the value is not extracted.

Examples

Consider the following text defined for the pattern field of the InvoiceDate index field:  Invoice;Date;\d{1,2}[/]\d{1,2}[/]\d{2,4}
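To see what the regular expression part of this pattern accepts on its own, the following minimal Python sketch may help. It uses Python's re module for illustration only; the plugin's own regex engine is not named here and may differ in details. The name DATE_REGEX and the sample strings are made up for this example.

```python
import re

# The regular expression portion of the InvoiceDate pattern.
DATE_REGEX = re.compile(r"\d{1,2}[/]\d{1,2}[/]\d{2,4}")

# Hypothetical sample strings, chosen only to illustrate the pattern.
for candidate in ["21/03/2012", "1/3/12", "2012-03-21", "21.03.2012"]:
    print(candidate, bool(DATE_REGEX.fullmatch(candidate)))
```

Only the first two strings match, because the expression requires one or two digits, a slash, one or two digits, another slash, and two to four digits. It checks the shape of the text, not whether the value is a real calendar date, so a string such as 99/99/9999 would also match.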

Example 1

Text string in document:  Invoice Date 21/03/2012

Result: “21/03/2012” will be extracted for the InvoiceDate index field. This happens because “21/03/2012” matches the regular expression, “Date” is found to its left, and “Invoice” is found to the left of “Date.”

Example 2

Text string in document:  Date 21/03/2012

Result:  Nothing will be extracted for this index field.  Even though “21/03/2012” matches the regular expression, and “Date” is found to its left, the word “Invoice” is not found to the left of “Date.”
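The two examples can be reproduced with a short Python sketch of the matching logic described in the Overview. This is not the plugin's implementation. The function names parse_pattern and extract are hypothetical, and the sketch makes three assumptions: the regular expression contains no semicolons, words are matched as plain substrings, and each word may appear anywhere to the left of the next one rather than immediately beside it.

```python
import re

def parse_pattern(pattern):
    """Split 'word;word;regex' into (list of words, compiled regex)."""
    # Assumption: the final semicolon-separated part is the regex.
    *keywords, regex = pattern.split(";")
    return keywords, re.compile(regex)

def extract(pattern, page_text):
    """Return the first regex match whose preceding words are all found, else None."""
    keywords, regex = parse_pattern(pattern)
    for match in regex.finditer(page_text):
        position = match.start()
        # Walk the words right to left, each time searching only the
        # text to the left of the previously found position.
        for word in reversed(keywords):
            position = page_text.rfind(word, 0, position)
            if position == -1:
                break  # a required word is missing; try the next match
        else:
            return match.group()
    return None

PATTERN = r"Invoice;Date;\d{1,2}[/]\d{1,2}[/]\d{2,4}"
print(extract(PATTERN, "Invoice Date 21/03/2012"))  # 21/03/2012
print(extract(PATTERN, "Date 21/03/2012"))          # None
```

In the second call the search for “Invoice” to the left of “Date” fails, so nothing is returned, which matches Example 2.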

Plugin Configuration

The REGULAR_REGEX_EXTRACTION plugin can be configured in the following UI:

[Figure: Plugin Configuration]

Properties Description

Configurable property | Type of value | Value options | Description
Regular Regex Extraction Switch | List of Values | ON, OFF | This property determines if the plugin will run or not. Default value is ON.
Regular Regex Confidence Score | Integer | 0 – 100 | Acts as a multiplier for the confidence score calculated by matching regex.

The semicolon-separated set of words and regular expression can be entered in the Pattern column for each index field:

Troubleshooting

The following table lists error messages that may appear and explains what each one means.

Error message | Possible root cause
Invalid input pattern sequence. | The pattern entered is not a valid regular expression, or doesn’t match the proper format.
No FieldType data found from data base for document type | The FieldType column doesn’t contain a valid value.
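A pattern can be checked before it is entered, which may help avoid the first error above. The following Python sketch is only an approximation of such a check. The function name check_pattern is hypothetical, the rule that at least one word must precede the regular expression is an assumption based on the Overview, and Python's re module may accept or reject some expressions differently from the plugin's regex engine.

```python
import re

def check_pattern(pattern):
    """Return None if the pattern looks well formed, else a short reason."""
    parts = pattern.split(";")
    # Assumption: at least one word plus a regex, and no empty parts.
    if len(parts) < 2 or any(not part.strip() for part in parts):
        return "expected word;...;regex with no empty parts"
    try:
        re.compile(parts[-1])  # the last part is taken to be the regex
    except re.error as exc:
        return f"invalid regular expression: {exc}"
    return None

print(check_pattern(r"Invoice;Date;\d{1,2}[/]\d{1,2}[/]\d{2,4}"))
print(check_pattern("Invoice;Date;[0-9"))
```

The first call returns None because the pattern is well formed. The second returns a reason string because the character class in the regular expression is never closed.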